
C error: Buffer overflow caught on CSV with chunksize #23509

Closed

Description

@dgrahn

Code Sample

This will reproduce the error, but it is slow because it streams the file from GitHub. I recommend downloading the file and reading it locally.

import pandas
filename = 'https://github.com/pandas-dev/pandas/files/2548189/debug.txt'
for chunk in pandas.read_csv(filename, chunksize=1000, names=range(2504)):
    pass

Problem description

I get the following exception only while using the C engine. This is similar to #11166.

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "D:\programs\anaconda3\lib\site-packages\pandas\io\parsers.py", line 1007, in __next__
    return self.get_chunk()
  File "D:\programs\anaconda3\lib\site-packages\pandas\io\parsers.py", line 1070, in get_chunk
    return self.read(nrows=size)
  File "D:\programs\anaconda3\lib\site-packages\pandas\io\parsers.py", line 1036, in read
    ret = self._engine.read(nrows)
  File "D:\programs\anaconda3\lib\site-packages\pandas\io\parsers.py", line 1848, in read
    data = self._reader.read(nrows)
  File "pandas\_libs\parsers.pyx", line 876, in pandas._libs.parsers.TextReader.read
  File "pandas\_libs\parsers.pyx", line 903, in pandas._libs.parsers.TextReader._read_low_memory
  File "pandas\_libs\parsers.pyx", line 945, in pandas._libs.parsers.TextReader._read_rows
  File "pandas\_libs\parsers.pyx", line 932, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas\_libs\parsers.pyx", line 2112, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Buffer overflow caught - possible malformed input file.

Expected Output

None. It should just loop through the file.
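A workaround worth noting: the traceback comes from the C tokenizer, and the report says the error appears only with the C engine, so passing engine="python" to read_csv should side-step it (at a substantial speed cost). A minimal sketch with a small synthetic jagged CSV standing in for debug.txt:

```python
import io
import pandas as pd

# Tiny jagged CSV: rows have 2, 4, and 1 fields respectively.
# `names` fixes the frame to 4 columns; short rows are padded with NaN.
jagged = "1,2\n3,4,5,6\n7\n" * 50  # 150 rows

total = 0
for chunk in pd.read_csv(io.StringIO(jagged), chunksize=10,
                         names=range(4), engine="python"):
    total += len(chunk)

print(total)  # all 150 rows are read, in chunks of 10
```

The Python engine is much slower than the C engine, so for a 7 GB file this is a stopgap rather than a fix.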

Output of pd.show_versions()

Both machines exhibit the exception.

RedHat
INSTALLED VERSIONS
------------------
commit: None
python: 3.6.6.final.0
python-bits: 64
OS: Linux
OS-release: 3.10.0-862.14.4.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.23.4
pytest: None
pip: 18.1
setuptools: 39.1.0
Cython: 0.29
numpy: 1.15.3
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 3.0.0
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

Windows 7
INSTALLED VERSIONS
------------------
commit: None
python: 3.6.5.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 58 Stepping 9, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.23.0
pytest: 3.5.1
pip: 18.1
setuptools: 39.1.0
Cython: 0.28.2
numpy: 1.14.3
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 6.4.0
sphinx: 1.7.4
patsy: 0.5.0
dateutil: 2.7.3
pytz: 2018.4
blosc: None
bottleneck: 1.2.1
tables: 3.4.3
numexpr: 2.6.5
feather: None
matplotlib: 2.2.2
openpyxl: 2.5.3
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.4
lxml: 4.2.1
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.2.7
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

Activity

TomAugspurger

TomAugspurger commented on Nov 5, 2018

@TomAugspurger
Contributor

Have you been able to narrow down what exactly in the linked file is causing the exception?

dgrahn

dgrahn commented on Nov 5, 2018

@dgrahn
Author

@TomAugspurger I have not. I'm unsure how to debug the C engine.

gfyoung

gfyoung commented on Nov 5, 2018

@gfyoung
Member

@dgrahn : I have strong reason to believe that this file is actually malformed. Run this code:

with open("debug.txt", "r") as f:
    data = f.readlines()

lengths = set()

# Collect the distinct row widths.
#
# Delimiter is definitely ","
for line in data:
    lengths.add(len(line.strip().split(",")))

print(lengths)

This will output:

{2304, 1154, 2054, 904, 1804, 654, 1554, 404, 2454, 1304, 154, 2204, 1054, 1954, 804, 1704, 554, 1454, 304, 2354, 1204, 54, 2104, 954, 1854, 704, 1604, 454, 2504, 1354, 204, 2254, 1104, 2004, 854, 1754, 604, 1504, 354, 2404, 1254, 104, 2154, 1004, 1904, 754, 1654, 504, 1404, 254}

If the file were correctly formatted, there would be only one row width.

dgrahn

dgrahn commented on Nov 5, 2018

@dgrahn
Author

@gfyoung It's not formatted incorrectly. It's a jagged CSV: I didn't want to bloat the file with lots of empty columns. That's why I use the names parameter.
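For context, the names parameter does make this pattern work in a plain (non-chunked) read: pandas pads short rows out to the full column list with NaN. A minimal sketch with hypothetical data:

```python
import io
import pandas as pd

# Miniature of the jagged layout: row widths are 2, 4, and 1,
# but `names=range(4)` fixes the frame at 4 columns.
data = "a,b\nc,d,e,f\ng\n"
df = pd.read_csv(io.StringIO(data), names=range(4))

print(df.shape)  # (3, 4); missing trailing fields become NaN
```

So the dispute here is not whether pandas can read jagged rows at all, but why the chunked C-engine path fails where the single-shot read succeeds.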

gfyoung

gfyoung commented on Nov 5, 2018

@gfyoung
Member

@dgrahn : Yes, it is, according to our definition. We need properly formatted CSVs, and that means having the same number of commas across the board for all rows. Jagged CSVs unfortunately do not meet that criterion.

dgrahn

dgrahn commented on Nov 5, 2018

@dgrahn
Author

@gfyoung It works when reading the entire CSV. How can I debug this for chunks? Neither saving the extra columns nor reading the entire file is a feasible option. This is already a subset of a 7 GB file.

gfyoung

gfyoung commented on Nov 5, 2018

@gfyoung
Member

It works when reading the entire CSV.

@dgrahn : Given that you mention that it's a subset, what do you mean by "entire CSV"? Are you referring to the entire 7 GB file or all of debug.txt? On my end, I cannot read all of debug.txt.

dgrahn

dgrahn commented on Nov 5, 2018

@dgrahn
Author

@gfyoung When I use the following, I'm able to read the entire CSV.

pd.read_csv('debug.csv', names=range(2504))

The debug file contains the first 7k lines of a file with more than 2.6M lines.

gfyoung

gfyoung commented on Nov 5, 2018

@gfyoung
Member

@dgrahn : I'm not sure you actually answered my question. Let me rephrase:

Are you able to read the file that you posted to GitHub in its entirety (via pd.read_csv)?

dgrahn

dgrahn commented on Nov 5, 2018

@dgrahn
Author

@gfyoung I'm able to read the debug file using the below code. But it fails when introducing the chunks. Does that answer the question?

pd.read_csv('debug.csv', names=range(2504))

gfyoung

gfyoung commented on Nov 5, 2018

@gfyoung
Member

Okay, got it. So I'm definitely not able to read all of debug.txt in its entirety (Ubuntu 64-bit, 0.23.4). What version of pandas are you using (and on which OS)?

dgrahn

dgrahn commented on Nov 5, 2018

@dgrahn
Author

@gfyoung Details are included in the original post. Both Windows 7 and RedHat. 0.23.4 on RedHat, 0.23.0 on Windows 7.

dgrahn

dgrahn commented on Nov 5, 2018

@dgrahn
Author

Interestingly, when chunksize=10 it fails around line 6,810. When chunksize=100, it fails around line 3,100.

More details.

chunksize=1, no failure
chunksize=3, no failure
chunksize=4, failure at lines 92-96
chunksize=5, failure at lines 5515-5520
chunksize=10, failure at lines 6810-6820
chunksize=100, failure at lines 3100-3200
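The failure point moving with chunksize is consistent with a bug at chunk boundaries rather than in any single row. On a pandas release that includes the fix (the issue was milestoned for 0.24.0), a chunked read of rows whose widths grow between chunks iterates cleanly; a sketch mimicking the shape of debug.txt (the reported widths are spaced 50 apart, up to 2504):

```python
import io
import pandas as pd

# 100 jagged rows cycling through widths 54, 104, 154, 204,
# so consecutive chunks see very different row widths.
rows = [",".join(str(i) for i in range(w)) for w in (54, 104, 154, 204)] * 25
csv = "\n".join(rows) + "\n"

n = 0
for chunk in pd.read_csv(io.StringIO(csv), chunksize=4, names=range(204)):
    n += len(chunk)

print(n)  # all 100 rows read without a ParserError
```

On the affected versions (0.23.x) the equivalent read of debug.txt raises the buffer-overflow ParserError shown above.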
gfyoung

gfyoung commented on Nov 5, 2018

@gfyoung
Member

Details are included in the original post. Both Windows 7 and RedHat. 0.23.4 on RedHat, 0.23.0 on Windows 7.

I saw, but I wasn't sure whether you meant that it worked on both environments.


gfyoung

gfyoung commented on Nov 6, 2018

@gfyoung
Member

@dgrahn : I was able to patch it and can now read your debug.txt dataset successfully! PR soon.

dgrahn

dgrahn commented on Nov 6, 2018

@dgrahn
Author

Thank you! Can you point me to directions on integrating that change? Should I use a nightly build?

gfyoung

gfyoung commented on Nov 6, 2018

@gfyoung
Member

@dgrahn : My changes are still being reviewed for merging into master, but you can install my branch immediately to test it on your current files.

added a commit that references this issue on Nov 11, 2018 (015a193)
added this to the 0.24.0 milestone on Nov 11, 2018
added a commit that references this issue on Nov 12, 2018 (17f7822)
added a commit that references this issue on Nov 12, 2018 (011b79f)
added a commit that references this issue on Nov 14, 2018 (bb9f4eb)
added a commit that references this issue on Nov 19, 2018 (bf6986c)
added 2 commits that reference this issue on Feb 28, 2019 (8583846, 83c3ce2)

Participants: @dgrahn, @jreback, @TomAugspurger, @gfyoung