
Memory leak in pd.read_csv or DataFrame #21353

Closed
@kuraga

Description


Code Sample, a copy-pastable example if possible

import sys

# m rows and n columns, taken from the command line
m = int(sys.argv[1])
n = int(sys.argv[2])

# write an m x n CSV of ones, with header c0, c1, ..., c(n-1)
with open('df.csv', 'wt') as f:
    for i in range(n-1):
        f.write('c' + str(i) + ',')
    f.write('c' + str(n-1) + '\n')
    for j in range(m):
        for i in range(n-1):
            f.write('1,')
        f.write('1\n')


import psutil

# RSS before pandas is imported
print(psutil.Process().memory_info().rss / 1024**2)

import pandas as pd
df = pd.read_csv('df.csv')

# RSS after reading the CSV
print(df.shape)
print(psutil.Process().memory_info().rss / 1024**2)

import gc
del df
gc.collect()

# RSS after deleting the DataFrame and forcing a collection
print(psutil.Process().memory_info().rss / 1024**2)

Problem description

$ ~/miniconda3/bin/python3 g.py 1 1
11.60546875
(1, 1)
64.02734375
64.02734375

$ ~/miniconda3/bin/python3 g.py 5000000 15
11.58203125
(5000000, 15)
640.45703125
68.25

$ ~/miniconda3/bin/python3 g.py 5000000 20
11.84375
(5000000, 20)
1586.65625
823.71875 - !!!

$ ~/miniconda3/bin/python3 g.py 10000000 10
11.83984375
(10000000, 10)
830.92578125
67.984375

$ ~/miniconda3/bin/python3 g.py 10000000 15
11.89453125
(10000000, 15)
2344.3046875
1199.89453125 - !!!

Two issues:

  1. There is a "standard" leak of roughly 53 MB after reading any CSV, or even after just creating a frame with pd.DataFrame() (compare the 1 1 run above, where RSS stays at ~64 MB after del df; a quick check separating the import cost from the DataFrame cost is sketched below).
  2. In some other cases there is a much larger leak (the runs marked !!! above).
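
(Part of the ~53 MB plateau in the first point may simply be the pandas/NumPy import itself staying resident, since the script above takes its first RSS reading before importing pandas. A quick check that separates the import cost from the DataFrame cost; this is a sketch for illustration, not part of the original report.)

import gc

import psutil


def rss_mb():
    # resident set size of the current process, in MiB
    return psutil.Process().memory_info().rss / 1024**2


print('before importing pandas:', rss_mb())

import pandas as pd

print('after importing pandas: ', rss_mb())

df = pd.DataFrame()   # empty frame, mirroring the m=1, n=1 run above
del df
gc.collect()

print('after del df:           ', rss_mb())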

cc @gfyoung

Output of pd.show_versions()

(same for 0.21, 0.22, 0.23)

pandas: 0.23.0
pytest: None
pip: 9.0.3
setuptools: 39.0.1
Cython: None
numpy: 1.14.3
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 6.4.0
sphinx: None
patsy: 0.5.0
dateutil: 2.7.3
pytz: 2018.4
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.2.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.9999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

Activity

gfyoung (Member) commented on Jun 7, 2018

@kuraga : Thanks for the updated issue!

cc @jreback @jorisvandenbossche

kuraga (Author) commented on Jun 13, 2018

Seems like it's not only a pd.read_csv issue...

[screenshot: memory_leak_2 (attachment not reproduced)]
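
(A rough text equivalent of that kind of check, building the frame directly from a NumPy array instead of going through read_csv, would look like the following; this is a sketch for illustration, not the original code behind the screenshot.)

import gc

import numpy as np
import pandas as pd
import psutil


def rss_mb():
    # resident set size of the current process, in MiB
    return psutil.Process().memory_info().rss / 1024**2


print('before creating the frame:', rss_mb())

# build a large DataFrame directly, no CSV involved
df = pd.DataFrame(np.ones((5000000, 20)))

print('after creating the frame: ', rss_mb())

del df
gc.collect()

print('after del df:             ', rss_mb())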

nynorbert commented on Jun 13, 2018

I have a similar issue. I have tried to debug it with memory_profiler, but I don't see the source of the leak.
The output of the profiler:

 Line #    Mem usage    Increment   Line Contents
 ================================================
    187    261.3 MiB      0.0 MiB        if "history" in self.watch_list:
    188    491.9 MiB    230.6 MiB            self.history = pd.read_csv(self.path + '/' + self.history_files[self.current][1], delimiter=';', header=None)
    189    491.9 MiB      0.0 MiB            self.history_group = self.history.groupby([0])

This snippet of code is inside a loop, and memory usage goes up on every iteration. I also tried deleting the history and history_group objects and calling gc.collect() manually, but nothing seems to work.
Is it possible that this is some cyclic dependency between history and history_group? And if it is, why doesn't deleting both history_group and history solve the problem?

p.s: My pandas version is 0.23.1
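
(For reference, line-by-line output like the table above is what memory_profiler produces when the function under test is decorated with @profile. A minimal sketch of that setup; the file name, path and function name here are made up, not the commenter's actual code.)

# run with:  python -m memory_profiler this_script.py
import pandas as pd
from memory_profiler import profile


@profile
def load_history(path):
    # mirrors the two lines shown in the profiler output above
    history = pd.read_csv(path, delimiter=';', header=None)
    history_group = history.groupby([0])
    return history, history_group


if __name__ == '__main__':
    load_history('history.csv')  # illustrative path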

nynorbert commented on Jun 13, 2018

Sorry, I was wrong. It's not read_csv that consumes the memory, but rather a drop:

Line #    Mem usage    Increment   Line Contents
 ================================================
   265   1425.1 MiB      9.6 MiB                        self.history.drop(self.history_group.get_group(self.current_timestamp).index)

And I think I found out that malloc_trim solves the problem, similar to this: #2659

@kuraga Maybe you should try it.
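
(For anyone who wants to try that workaround: glibc's malloc_trim can be called from Python through ctypes. A minimal sketch, Linux/glibc only; on other platforms the library and the call are simply unavailable.)

import ctypes

# ask glibc to return freed heap memory to the OS
libc = ctypes.CDLL('libc.so.6')
libc.malloc_trim(0)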

zhezherun (Contributor) commented on Oct 8, 2018

I also noticed a memory leak in read_csv and ran it through valgrind, which said that the result of the kset_from_list function was never freed. I was able to fix this leak locally by patching parsers.pyx and rebuilding pandas.

@gfyoung, could you please review the patch below? It might also help with the leak discussed here, although I am not sure if it is the same leak or not. The patch:

  • Moves the allocation of na_hashset further down, closer to where it is used. Otherwise it will not be freed if continue is executed.
  • Makes sure that na_hashset is deleted if there is an exception.
  • Also cleans up the allocation inside kset_from_list before raising an exception.
--- parsers.pyx	2018-08-01 19:57:16.000000000 +0100
+++ parsers.pyx	2018-10-08 15:25:32.124526087 +0100
@@ -1054,18 +1054,6 @@
 
             conv = self._get_converter(i, name)
 
-            # XXX
-            na_flist = set()
-            if self.na_filter:
-                na_list, na_flist = self._get_na_list(i, name)
-                if na_list is None:
-                    na_filter = 0
-                else:
-                    na_filter = 1
-                    na_hashset = kset_from_list(na_list)
-            else:
-                na_filter = 0
-
             col_dtype = None
             if self.dtype is not None:
                 if isinstance(self.dtype, dict):
@@ -1090,13 +1078,26 @@
                                               self.c_encoding)
                 continue
 
-            # Should return as the desired dtype (inferred or specified)
-            col_res, na_count = self._convert_tokens(
-                i, start, end, name, na_filter, na_hashset,
-                na_flist, col_dtype)
+            # XXX
+            na_flist = set()
+            if self.na_filter:
+                na_list, na_flist = self._get_na_list(i, name)
+                if na_list is None:
+                    na_filter = 0
+                else:
+                    na_filter = 1
+                    na_hashset = kset_from_list(na_list)
+            else:
+                na_filter = 0
 
-            if na_filter:
-                self._free_na_set(na_hashset)
+            try:
+                # Should return as the desired dtype (inferred or specified)
+                col_res, na_count = self._convert_tokens(
+                    i, start, end, name, na_filter, na_hashset,
+                    na_flist, col_dtype)
+            finally:
+                if na_filter:
+                    self._free_na_set(na_hashset)
 
             if upcast_na and na_count > 0:
                 col_res = _maybe_upcast(col_res)
@@ -2043,6 +2044,7 @@
 
         # None creeps in sometimes, which isn't possible here
         if not PyBytes_Check(val):
+            kh_destroy_str(table)
             raise ValueError('Must be all encoded bytes')
 
         k = kh_put_str(table, PyBytes_AsString(val), &ret)

gfyoung (Member) commented on Oct 8, 2018

@zhezherun : That's a good catch! Create a PR, and we can review.

added this to the 0.24.0 milestone on Oct 10, 2018

kuraga (Author) commented on Oct 23, 2018

Trying to patch is cool, but I fear that #2659 (comment)...

added 2 commits that reference this issue on Nov 19, 2018: e3ff389, 36c1104

(12 remaining activity items not shown)


Metadata

Labels: IO CSV (read_csv, to_csv)

Participants: @jreback, @kuraga, @TomAugspurger, @bashtage, @gfyoung