Closed
Description
Code Sample
This reproduces the error, but it is slow because it streams the file from GitHub; I recommend downloading the file directly instead.
import pandas

filename = 'https://github.com/pandas-dev/pandas/files/2548189/debug.txt'
# names=range(2504) pads the jagged rows out to a fixed set of columns;
# the chunked read with the C engine raises the ParserError shown below.
for chunk in pandas.read_csv(filename, chunksize=1000, names=range(2504)):
    pass
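For reference, a minimal sketch of the faster route recommended above, downloading the file once and then parsing it locally (the local filename here is arbitrary):

import urllib.request
import pandas

url = 'https://github.com/pandas-dev/pandas/files/2548189/debug.txt'
local_path = 'debug.txt'  # arbitrary local filename

# Download once, then parse locally; the chunked read still raises the
# same ParserError with the C engine.
urllib.request.urlretrieve(url, local_path)

for chunk in pandas.read_csv(local_path, chunksize=1000, names=range(2504)):
    pass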
Problem description
I get the following exception only while using the C engine. This is similar to #11166.
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "D:\programs\anaconda3\lib\site-packages\pandas\io\parsers.py", line 1007, in __next__
return self.get_chunk()
File "D:\programs\anaconda3\lib\site-packages\pandas\io\parsers.py", line 1070, in get_chunk
return self.read(nrows=size)
File "D:\programs\anaconda3\lib\site-packages\pandas\io\parsers.py", line 1036, in read
ret = self._engine.read(nrows)
File "D:\programs\anaconda3\lib\site-packages\pandas\io\parsers.py", line 1848, in read
data = self._reader.read(nrows)
File "pandas\_libs\parsers.pyx", line 876, in pandas._libs.parsers.TextReader.read
File "pandas\_libs\parsers.pyx", line 903, in pandas._libs.parsers.TextReader._read_low_memory
File "pandas\_libs\parsers.pyx", line 945, in pandas._libs.parsers.TextReader._read_rows
File "pandas\_libs\parsers.pyx", line 932, in pandas._libs.parsers.TextReader._tokenize_rows
File "pandas\_libs\parsers.pyx", line 2112, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Buffer overflow caught - possible malformed input file.
Expected Output
None. It should just loop through the file.
Output of pd.show_versions()
Both machines exhibit the exception.
RedHat
INSTALLED VERSIONS
------------------
commit: None
python: 3.6.6.final.0
python-bits: 64
OS: Linux
OS-release: 3.10.0-862.14.4.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.23.4
pytest: None
pip: 18.1
setuptools: 39.1.0
Cython: 0.29
numpy: 1.15.3
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 3.0.0
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
Windows 7
INSTALLED VERSIONS
------------------
commit: None
python: 3.6.5.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 58 Stepping 9, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None
pandas: 0.23.0
pytest: 3.5.1
pip: 18.1
setuptools: 39.1.0
Cython: 0.28.2
numpy: 1.14.3
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 6.4.0
sphinx: 1.7.4
patsy: 0.5.0
dateutil: 2.7.3
pytz: 2018.4
blosc: None
bottleneck: 1.2.1
tables: 3.4.3
numexpr: 2.6.5
feather: None
matplotlib: 2.2.2
openpyxl: 2.5.3
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.4
lxml: 4.2.1
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.2.7
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
Activity
TomAugspurger commented on Nov 5, 2018
Have you been able to narrow down what exactly in the linked file is causing the exception?
dgrahn commented on Nov 5, 2018
@TomAugspurger I have not. I'm unsure how to debug the C engine.
gfyoung commented on Nov 5, 2018
@dgrahn : I have strong reason to believe that this file is actually malformed. Counting the number of fields on each row shows multiple distinct row widths; if the file were correctly formatted, there would be only one row width.
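A sketch of that kind of check (hypothetical code, not the exact snippet from the comment), tallying how many fields each line contains:

import collections
import urllib.request

url = 'https://github.com/pandas-dev/pandas/files/2548189/debug.txt'
path, _ = urllib.request.urlretrieve(url)

# Naive field count per line (ignores quoted fields, fine for this file).
widths = collections.Counter()
with open(path) as f:
    for line in f:
        widths[line.count(',') + 1] += 1

# A well-formed CSV would produce exactly one key; several keys mean the
# rows are jagged.
print(widths)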
dgrahn commented on Nov 5, 2018
@gfyoung It's not formatted incorrectly. It's a jagged CSV because I didn't want to bloat the file with lots of empty columns. That's why I use the names parameter.
gfyoung commented on Nov 5, 2018
@dgrahn : Yes, it is, according to our definition. We need properly formatted CSVs, and that means having the same number of commas across the board for all rows. Jagged CSVs unfortunately do not meet that criterion.
dgrahn commented on Nov 5, 2018
@gfyoung It works when reading the entire CSV. How can I debug this for chunks? Neither saving the extra columns nor reading the entire file is a feasible option. This is already a subset of a 7 GB file.
gfyoung commented on Nov 5, 2018
@dgrahn : Given that you mention that it's a subset, what do you mean by "entire CSV"? Are you referring to the entire 7 GB file or all of debug.txt? On my end, I cannot read all of debug.txt.
dgrahn commented on Nov 5, 2018
@gfyoung When I use the following, I'm able to read the entire CSV.
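Presumably a non-chunked call along these lines (a sketch, assuming the same names parameter as the original report):

import pandas

filename = 'https://github.com/pandas-dev/pandas/files/2548189/debug.txt'

# Reading the whole file in one call succeeds; only the chunked read fails.
df = pandas.read_csv(filename, names=range(2504))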
The debug file contains the first 7,000 lines of a file with more than 2.6 million.
gfyoung commented on Nov 5, 2018
@dgrahn : I'm not sure you actually answered my question. Let me rephrase: are you able to read the file that you posted to GitHub in its entirety (via pd.read_csv)?
dgrahn commented on Nov 5, 2018
@gfyoung I'm able to read the debug file using the non-chunked call shown above. But it fails when introducing the chunks. Does that answer the question?
gfyoung commented on Nov 5, 2018
Okay, got it. So I'm definitely not able to read all of debug.txt in its entirety (Ubuntu 64-bit, 0.23.4). What version of pandas are you using (and on which OS)?
dgrahn commented on Nov 5, 2018
@gfyoung Details are included in the original post. Both Windows 7 and RedHat. 0.23.4 on RedHat, 0.23.0 on Windows 7.
dgrahn commented on Nov 5, 2018
Interestingly, when chunksize=10 it fails around line 6,810. When chunksize=100, it fails around line 3,100.
More details.
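A sketch of one way to narrow down where the chunked read fails (a hypothetical helper, assuming a local copy of the debug file named debug.txt):

import pandas

filename = 'debug.txt'  # assumes the debug file was downloaded locally
chunksize = 10

reader = pandas.read_csv(filename, chunksize=chunksize, names=range(2504))
i = 0
try:
    for i, chunk in enumerate(reader, start=1):
        pass
except pandas.errors.ParserError as exc:
    # The failing region starts near row i * chunksize.
    print(f'failed after chunk {i}, near line {i * chunksize}: {exc}')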
gfyoung commented on Nov 5, 2018
I saw, but I wasn't sure whether you meant that it worked on both environments.
[13 remaining items]
gfyoung commented on Nov 6, 2018
@dgrahn : I was able to patch it and can now read your debug.txt dataset successfully! PR soon.
dgrahn commented on Nov 6, 2018
Thank you! Can you point me to directions on integrating that change? Should I use a nightly build?
gfyoung commented on Nov 6, 2018
@dgrahn : My changes are still being reviewed for merging into master, but you can install the branch immediately to test on your current files.
BUG: Don't over-optimize memory with jagged CSV (#23527)