Description
Code Sample, a copy-pastable example if possible
import sys

m = int(sys.argv[1])  # number of rows
n = int(sys.argv[2])  # number of columns

# Write an m x n CSV of ones with header c0,...,c<n-1>
with open('df.csv', 'wt') as f:
    for i in range(n - 1):
        f.write('c' + str(i) + ',')
    f.write('c' + str(n - 1) + '\n')
    for j in range(m):
        for i in range(n - 1):
            f.write('1,')
        f.write('1\n')

import psutil
print(psutil.Process().memory_info().rss / 1024**2)  # RSS (MiB) before pandas is imported

import pandas as pd
df = pd.read_csv('df.csv')
print(df.shape)
print(psutil.Process().memory_info().rss / 1024**2)  # RSS (MiB) while the frame is alive

import gc
del df
gc.collect()
print(psutil.Process().memory_info().rss / 1024**2)  # RSS (MiB) after del + gc.collect()
Problem description
$ ~/miniconda3/bin/python3 g.py 1 1
11.60546875
(1, 1)
64.02734375
64.02734375
$ ~/miniconda3/bin/python3 g.py 5000000 15
11.58203125
(5000000, 15)
640.45703125
68.25
$ ~/miniconda3/bin/python3 g.py 5000000 20
11.84375
(5000000, 20)
1586.65625
823.71875 - !!!
$ ~/miniconda3/bin/python3 g.py 10000000 10
11.83984375
(10000000, 10)
830.92578125
67.984375
$ ~/miniconda3/bin/python3 g.py 10000000 15
11.89453125
(10000000, 15)
2344.3046875
1199.89453125 - !!!
Two issues:
- There is a "standard" leak of ~53 MB after reading any CSV, or even after just creating a frame with pd.DataFrame() (see the sketch after this list).
- We see a much larger leak in some other cases (the runs marked "!!!" above, where hundreds of MB remain after the frame is deleted).
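A minimal sketch of the first point, building the frame directly instead of reading a CSV (the shape mirrors one of the runs above; exact RSS numbers will vary by platform and allocator):

import gc
import numpy as np
import pandas as pd
import psutil

def rss_mb():
    return psutil.Process().memory_info().rss / 1024**2

print(rss_mb())                            # baseline
df = pd.DataFrame(np.ones((5000000, 15)))  # same shape as one run above
print(rss_mb())                            # while the frame is alive
del df
gc.collect()
print(rss_mb())                            # often stays well above baseline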
cc @gfyoung
Output of pd.show_versions()
(same for 0.21, 0.22, 0.23)
pandas: 0.23.0
pytest: None
pip: 9.0.3
setuptools: 39.0.1
Cython: None
numpy: 1.14.3
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 6.4.0
sphinx: None
patsy: 0.5.0
dateutil: 2.7.3
pytz: 2018.4
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.2.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.9999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gfyoung commented on Jun 7, 2018
@kuraga : Thanks for the updated issue!
cc @jreback @jorisvandenbossche
kuraga commented on Jun 13, 2018
Seems like it's not a pd.read_csv issue only...

nynorbert commented on Jun 13, 2018
I have a similar issue. I have tried to debug it with memory_profiler, but I don't see the source of the leak.
The output of the profiler:
This snippet of the code is inside a loop, and every iteration increments the memory usage. I also tried deleting the history and history_group objects and calling gc.collect() manually, but nothing seems to work.
Is it possible that this is some cyclic dependency between history and history_group? And if it is, why does deleting both history_group and history not solve the problem?
P.S.: My pandas version is 0.23.1.
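For anyone reproducing this kind of diagnosis: memory_profiler prints a per-line "Increment" column for a decorated function. A sketch of such a loop body follows; the file name, grouping key, and intermediate steps are made-up placeholders, not the actual code from the comment above:

import gc
import pandas as pd
from memory_profiler import profile

@profile
def step(path):
    history = pd.read_csv(path)                 # placeholder input
    history_group = history.groupby('user_id')  # placeholder key
    counts = history_group.size()
    del history, history_group
    gc.collect()  # collects Python objects, but RSS may stay high anyway
    return counts

for _ in range(3):       # the snippet reportedly runs inside a loop
    step('history.csv')  # placeholder file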
nynorbert commented on Jun 13, 2018
Sorry, I was wrong. It is not read_csv that consumes the memory, but rather a drop:
And I think I found out that malloc_trim solves the problem, similar to this: #2659
@kuraga Maybe you should try it.
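For reference, malloc_trim is a glibc function, so this workaround is Linux-only; it can be called from Python via ctypes (a minimal sketch):

import ctypes

# glibc only: ask the allocator to return free arena memory to the OS,
# which can shrink an RSS that looks like a leak.
libc = ctypes.CDLL('libc.so.6')
libc.malloc_trim(0)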
zhezherun commented on Oct 8, 2018
I also noticed a memory leak in read_csv and ran it through valgrind, which said that the result of the kset_from_list function was never freed. I was able to fix this leak locally by patching parsers.pyx and rebuilding pandas. @gfyoung, could you please review the patch below? It might also help with the leak discussed here, although I am not sure if it is the same leak or not. The patch
- moves the creation of na_hashset further down, closer to where it is used (otherwise it will not be freed if continue is executed),
- makes sure that na_hashset is deleted if there is an exception,
- cleans up kset_from_list before raising an exception.
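The shape of that fix is the usual allocate-late/always-free pattern. A schematic in plain Python (the real change is in the Cython file parsers.pyx; kset_from_list, _convert_with_na, and kh_destroy_str below are stand-in names, not exact pandas signatures):

def _convert_column(values, na_list):
    na_hashset = None
    try:
        # Create the set only right before it is needed, so earlier
        # continue/return paths never leave it allocated.
        na_hashset = kset_from_list(na_list)         # stand-in helper
        return _convert_with_na(values, na_hashset)  # stand-in conversion step
    finally:
        # Free it on every exit path, including exceptions.
        if na_hashset is not None:
            kh_destroy_str(na_hashset)               # stand-in destructor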
gfyoung commented on Oct 8, 2018
@zhezherun : That's a good catch! Create a PR, and we can review.
kuraga commented on Oct 23, 2018
Trying to patch is cool, but I fear that #2659 (comment)...
BUG: Fixing memory leaks in read_csv