Description
Pandas version checks
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
from io import StringIO
from pandas._libs.parsers import STR_NA_VALUES
import pandas as pd
file_contents = """,x,y
MA,1,2
NA,2,1
OA,,3
"""
default_nan_values = STR_NA_VALUES | {"squid"}
names = [None, "x", "y"]
nan_mapping = {name: default_nan_values for name in names}
dtype = {0: "object", "x": "float32", "y": "float32"}
pd.read_csv(
    StringIO(file_contents),
    index_col=0,
    header=0,
    engine="c",
    dtype=dtype,
    names=names,
    na_values=nan_mapping,
    keep_default_na=False,
)
Issue Description
I'm trying to read an index column as exact strings while reading the rest of the columns as NaN-able numbers or strings. The dict form of na_values seems to be the only way the documentation implies for doing this. However, when I try it, read_csv fails with the following error:
Traceback (most recent call last):
  File ".../test.py", line 17, in <module>
    pd.read_csv(
  File ".../venv/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1024, in read_csv
    return _read(filepath_or_buffer, kwds)
  File ".../venv/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 624, in _read
    return parser.read(nrows)
  File ".../venv/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1921, in read
    ) = self._engine.read( # type: ignore[attr-defined]
  File ".../venv/lib/python3.10/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 333, in read
    index, column_names = self._make_index(date_data, alldata, names)
  File ".../venv/lib/python3.10/site-packages/pandas/io/parsers/base_parser.py", line 372, in _make_index
    index = self._agg_index(simple_index)
  File ".../venv/lib/python3.10/site-packages/pandas/io/parsers/base_parser.py", line 504, in _agg_index
    arr, _ = self._infer_types(
  File ".../venv/lib/python3.10/site-packages/pandas/io/parsers/base_parser.py", line 744, in _infer_types
    na_count = parsers.sanitize_objects(values, na_values)
TypeError: Argument 'na_values' has incorrect type (expected set, got dict)
This is unhelpful, as the docs imply this should work, and I can't find any other way to turn off NaN detection in the index column without also disabling it in the rest of the table (keeping it there is a hard requirement).
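One possible workaround, sketched here under assumptions rather than taken from the pandas docs: read the first column as an ordinary column, leave it out of the na_values dict, and promote it to the index afterwards. The column name "label" is a hypothetical placeholder, and the explicit na_strings set is a small stand-in for the pandas defaults plus "squid".

```python
from io import StringIO

import pandas as pd

file_contents = """,x,y
MA,1,2
NA,2,1
OA,,3
"""

# Assumption: a small explicit set stands in for the default NA strings plus "squid".
na_strings = {"", "NA", "NaN", "squid"}

df = pd.read_csv(
    StringIO(file_contents),
    header=0,
    engine="c",
    names=["label", "x", "y"],  # "label" is a hypothetical placeholder name
    dtype={"label": "object", "x": "float32", "y": "float32"},
    # No entry for "label": with keep_default_na=False it gets no NA detection.
    na_values={"x": na_strings, "y": na_strings},
    keep_default_na=False,
).set_index("label")

# Drop the placeholder name so the index is unnamed, as in the original file.
df.index.name = None
print(df)
```

Because the first column never goes through the index-specific parsing path, this sidesteps the failing call in the traceback, at the cost of an extra set_index step.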
Expected Behavior
The CSV should be read without error, producing a DataFrame like the following:
      x    y
MA  1.0  2.0
NA  2.0  1.0
OA  NaN  3.0
Installed Versions
This has been tested on three versions of pandas (v1.5.2, v2.0.2, and v2.2.0), all with similar results.
pandas : 2.2.0
numpy : 1.26.3
pytz : 2023.3.post1
dateutil : 2.8.2
setuptools : 69.0.3
pip : 23.2.1
Cython : None
pytest : 7.4.4
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : 1.1
pymysql : None
psycopg2 : 2.9.9
jinja2 : 3.1.3
IPython : None
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : None
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : 0.58.1
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : 1.11.4
sqlalchemy : None
tables : None
tabulate : 0.9.0
xarray : None
xlrd : None
zstandard : None
tzdata : 2024.1
qtpy : None
pyqt5 : None
Activity
techSavvy1001 commented on Feb 22, 2024
import io
import pandas as pd
file_contents = """
,x,y
MA,1,2
NA,2,1
OA,,3
"""
default_nan_values = set(["NA", "squid"])
names = [None, "x", "y"]
nan_mapping = {name: default_nan_values for name in names}
dtype = {0: "object", "x": "float32", "y": "float32"}
try:
    df = pd.read_csv(
        io.StringIO(file_contents),
        index_col=0,
        header=0,
        engine="c",
        dtype=dtype,
        names=names,
        na_values=nan_mapping,
        keep_default_na=True,
    )
    print(df)
except Exception as e:
    print(f"Error occurred: {e}")
rhshadrach commented on Feb 27, 2024
Thanks for the report, confirmed on main. Further investigations and PRs to fix are welcome!
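For anyone investigating: the traceback shows a dict reaching a function that expects a set, which suggests the per-column dict is not being resolved for the index column. The sketch below only illustrates that idea. resolve_na_values is a hypothetical helper written for this example, not a pandas function, and it makes no claim about how pandas implements the lookup or the eventual fix.

```python
def resolve_na_values(col_name, na_values, default_na, keep_default_na):
    """Hypothetical helper: return the set of NA strings for a single column."""
    if isinstance(na_values, dict):
        if col_name in na_values:
            # Per-column entry: use exactly what the caller supplied.
            return set(na_values[col_name])
        # Column not listed: fall back to the defaults, or to nothing.
        return set(default_na) if keep_default_na else set()
    # Non-dict input applies to every column.
    return set(na_values)


defaults = {"", "NA", "NaN"}  # assumption: a tiny stand-in for the real default set
mapping = {None: {"squid"}, "x": {"NA", "squid"}}

index_na = resolve_na_values(None, mapping, defaults, keep_default_na=False)
unlisted_na = resolve_na_values("y", mapping, defaults, keep_default_na=False)
print(index_na, unlisted_na)
```

The point is only that whatever is handed to the type-inference step for the index column should already be a set, even when the column name is None.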
tomhoq commented on Mar 5, 2024
take
asishm commented on Mar 5, 2024
Replacing the None in names with anything else (a string) works fine.
tomhoq commented on Mar 17, 2024
@thomas-intellegens Sorry to bother, but in the issue post you mention that
In case you might remember, was the documentation this one?
Otherwise, I cannot find where in the docs such a property is mentioned.
Thank you
anna-intellegens commented on Mar 18, 2024
Yeah, this was the section I was reading. Many thanks for taking a look at this.
BUG: Fix na_values dict not working on index column (#57547) (#57965)