BUG: na_values dict form not working on index column  #57547

Closed
Description

@anna-intellegens

Pandas version checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of pandas.
  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

from io import StringIO

from pandas._libs.parsers import STR_NA_VALUES
import pandas as pd

file_contents = """,x,y
MA,1,2
NA,2,1
OA,,3
"""

default_nan_values = STR_NA_VALUES | {"squid"}
names = [None, "x", "y"]
nan_mapping = {name: default_nan_values for name in names}
dtype = {0: "object", "x": "float32", "y": "float32"}

pd.read_csv(
    StringIO(file_contents),
    index_col=0,
    header=0,
    engine="c",
    dtype=dtype,
    names=names,
    na_values=nan_mapping,
    keep_default_na=False,
)

Issue Description

I'm trying to find a way to read in an index column as exact strings, but read in the rest of the columns as NaN-able numbers or strings. The dict form of na_values seems to be the only way implied in the documentation to allow this. However, when I try it, read_csv raises the following error:

Traceback (most recent call last):
  File ".../test.py", line 17, in <module>
    pd.read_csv(
  File ".../venv/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1024, in read_csv
    return _read(filepath_or_buffer, kwds)
  File ".../venv/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 624, in _read
    return parser.read(nrows)
  File ".../venv/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1921, in read
    ) = self._engine.read(  # type: ignore[attr-defined]
  File ".../venv/lib/python3.10/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 333, in read
    index, column_names = self._make_index(date_data, alldata, names)
  File ".../venv/lib/python3.10/site-packages/pandas/io/parsers/base_parser.py", line 372, in _make_index
    index = self._agg_index(simple_index)
  File ".../venv/lib/python3.10/site-packages/pandas/io/parsers/base_parser.py", line 504, in _agg_index
    arr, _ = self._infer_types(
  File ".../venv/lib/python3.10/site-packages/pandas/io/parsers/base_parser.py", line 744, in _infer_types
    na_count = parsers.sanitize_objects(values, na_values)
TypeError: Argument 'na_values' has incorrect type (expected set, got dict)

This is unhelpful, as the docs imply this should work, and I can't find any other way to turn off NaN detection in the index column without disabling it in the rest of the table (keeping it there is a hard requirement).

Expected Behavior

The file should be read without error, giving a DataFrame a bit like the following:

       x    y
MA   1.0  2.0
NA   2.0  1.0
OA   NaN  3.0

Installed Versions

This has been tested on three versions of pandas (v1.5.2, v2.0.2, and v2.2.0), all with similar results.

INSTALLED VERSIONS
------------------
commit : fd3f571
python : 3.10.11.final.0
python-bits : 64
OS : Linux
OS-release : 6.5.0-18-generic
Version : #18~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Wed Feb 7 11:40:03 UTC 2
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_GB.UTF-8
LOCALE : en_GB.UTF-8

pandas : 2.2.0
numpy : 1.26.3
pytz : 2023.3.post1
dateutil : 2.8.2
setuptools : 69.0.3
pip : 23.2.1
Cython : None
pytest : 7.4.4
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : 1.1
pymysql : None
psycopg2 : 2.9.9
jinja2 : 3.1.3
IPython : None
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : None
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : 0.58.1
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : 1.11.4
sqlalchemy : None
tables : None
tabulate : 0.9.0
xarray : None
xlrd : None
zstandard : None
tzdata : 2024.1
qtpy : None
pyqt5 : None

Activity

techSavvy1001 commented on Feb 22, 2024

import io
import pandas as pd

file_contents = """
,x,y
MA,1,2
NA,2,1
OA,,3
"""

default_nan_values = set(["NA", "squid"])
names = [None, "x", "y"]
nan_mapping = {name: default_nan_values for name in names}
dtype = {0: "object", "x": "float32", "y": "float32"}

try:
    df = pd.read_csv(
        io.StringIO(file_contents),
        index_col=0,
        header=0,
        engine="c",
        dtype=dtype,
        names=names,
        na_values=nan_mapping,
        keep_default_na=True,
    )
    print(df)
except Exception as e:
    print(f"Error occurred: {e}")

rhshadrach commented on Feb 27, 2024

Member

Thanks for the report, confirmed on main. Further investigations and PRs to fix are welcome!

Labels: added IO CSV (read_csv, to_csv) and removed Needs Triage (issue that has not been reviewed by a pandas team member) on Feb 27, 2024

tomhoq commented on Mar 5, 2024

Contributor

take

asishm commented on Mar 5, 2024

Contributor

Replacing the None in names with any string works fine.
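A possible workaround along these lines, sketched under assumptions rather than confirmed in this thread: give the index column a string name (the hypothetical "idx" below) and leave it out of the na_values dict. With keep_default_na=False, columns absent from the dict should get no NaN detection, so the index should keep "NA" as a literal string. The small NaN set is an illustrative stand-in for STR_NA_VALUES | {"squid"} from the original report.

```python
from io import StringIO

import pandas as pd

file_contents = """,x,y
MA,1,2
NA,2,1
OA,,3
"""

# Illustrative stand-in for STR_NA_VALUES | {"squid"} (assumption: a small set suffices here).
nan_values = {"", "NA", "squid"}

# "idx" is a hypothetical placeholder name used instead of None for the index column.
names = ["idx", "x", "y"]

# Only the data columns get NaN markers; "idx" is deliberately left out of the dict.
nan_mapping = {name: nan_values for name in ("x", "y")}

df = pd.read_csv(
    StringIO(file_contents),
    index_col=0,
    header=0,
    dtype={"x": "float32", "y": "float32"},
    names=names,
    na_values=nan_mapping,
    keep_default_na=False,
)

# Drop the placeholder name so the index is unnamed, as in the expected output.
df.index.name = None
print(df)
```

This avoids the None key that triggers the TypeError, at the cost of renaming the index after the read.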

tomhoq commented on Mar 17, 2024

Contributor

@thomas-intellegens Sorry to bother, but in the issue post you mention that

> The dict form of na_values seems to be the only way implied in the documentation to allow having no na values on a specific column

In case you might remember, was the documentation this one?

Otherwise, I cannot find where in the docs such a property is mentioned.

Thank you

anna-intellegens commented on Mar 18, 2024

Author

> In case you might remember, was the documentation this one?

Yeah, this was the section I was reading. Many thanks for taking a look at this.
