Pandas version checks
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import numpy as np
import pandas as pd
df = pd.DataFrame({'A': [2, 3, np.nan], 'B': [5, np.nan, np.nan]}).astype(pd.SparseDtype('float64'))
print(df.dtypes)
df.sum(axis=1, skipna=True, min_count=1)
Issue Description
The pd.DataFrame.sum() function ignores the min_count parameter for sparse dtypes. A row in which every value is missing sums to zero rather than to missing.
df.sum(axis=1, min_count=1, skipna=True)
0 7.0
1 3.0
2 0.0
dtype: Sparse[float64, nan]
Contrast this with the same operation after a to_dense() call, where the last element is correctly set to missing:
df.sparse.to_dense().sum(axis=1, min_count=1, skipna=True)
0 7.0
1 3.0
2 NaN
dtype: float64
Expected Behavior
The last element should be np.nan because every value in that row is np.nan:
0 7.0
1 3.0
2 NaN
dtype: Sparse[float64, nan]
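Until min_count is honoured for sparse columns, one possible stopgap is to sum the dense copy and cast the result back to a sparse dtype. This is only a sketch built from the calls shown above; it gives up the memory benefit of the sparse representation during the reduction, so it may not suit very large frames.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [2, 3, np.nan], "B": [5, np.nan, np.nan]}).astype(
    pd.SparseDtype("float64")
)

# Sum on the dense copy, where min_count is honoured, then restore the sparse dtype.
result = (
    df.sparse.to_dense()
    .sum(axis=1, min_count=1, skipna=True)
    .astype(pd.SparseDtype("float64"))
)
print(result)
```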
Installed Versions
INSTALLED VERSIONS
commit : ba1cccd
python : 3.11.2.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.22000
machine : AMD64
processor : AMD64 Family 25 Model 33 Stepping 0, AuthenticAMD
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : English_Belgium.1252
pandas : 2.1.0
numpy : 1.25.2
pytz : 2022.7.1
dateutil : 2.8.2
setuptools : 65.5.0
pip : 22.3.1
Cython : 0.29.34
pytest : 7.4.2
hypothesis : None
sphinx : None
blosc : 1.11.1
feather : None
xlsxwriter : None
lxml.etree : 4.9.3
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : None
pandas_datareader : None
bs4 : 4.12.2
bottleneck : None
dataframe-api-compat: None
fastparquet : None
fsspec : 2023.3.0
gcsfs : None
matplotlib : 3.7.2
numba : None
numexpr : 2.8.5
odfpy : None
openpyxl : 3.1.2
pandas_gbq : None
pyarrow : 13.0.0
pyreadstat : 1.2.3
pyxlsb : None
Danferno commented on Sep 14, 2023
There also seems to be a massive performance hit. In my own use case, the same sum command takes 5s on (extremely) sparse columns versus 0.04s on dense columns.
An example is below, showing an even steeper performance penalty
Timings:
Danferno commented on Sep 15, 2023
The performance issue applies to the max method too. In fact, roundtripping through scipy is orders of magnitude faster.
Danferno commented on Sep 15, 2023
I don't know if it's helpful, but I wrote my own function to mimic the behaviour of pd.DataFrame.sum() for two cases.
With the allowMissing option it mimics skipna=True, min_count=1; otherwise it mimics skipna=False. It should be fairly trivial to adjust it to any other min_count (replace max by sum > X) and for summing over rows (use csc matrix, change axis to 0). In terms of performance, this is on the same order of magnitude as the dense version, in my case taking twice as long rather than >3 orders of magnitude more.
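The function itself is not reproduced in this thread. As a rough illustration of the same idea (sum only the stored values, then blank out rows with too few of them), here is a minimal sketch. It is not the scipy-based version described above: it swaps in np.bincount so that only NumPy and pandas are needed. The name sparse_row_sum is hypothetical, and the code assumes every column is a SparseArray with a NaN fill value.

```python
import numpy as np
import pandas as pd


def sparse_row_sum(df: pd.DataFrame, min_count: int = 1) -> pd.Series:
    """Row-wise sum that skips NaN and honours min_count (hypothetical helper)."""
    n_rows = len(df)
    totals = np.zeros(n_rows)
    counts = np.zeros(n_rows, dtype=np.int64)
    for _, col in df.items():
        arr = col.array  # assumed: SparseArray whose fill value is NaN
        rows = arr.sp_index.to_int_index().indices  # row positions of stored values
        vals = np.asarray(arr.sp_values, dtype="float64")
        keep = ~np.isnan(vals)  # ignore any NaN that was stored explicitly
        totals += np.bincount(rows[keep], weights=vals[keep], minlength=n_rows)
        counts += np.bincount(rows[keep], minlength=n_rows)
    # Rows with fewer than min_count valid values become missing.
    totals[counts < min_count] = np.nan
    return pd.Series(totals, index=df.index)


df = pd.DataFrame({"A": [2, 3, np.nan], "B": [5, np.nan, np.nan]}).astype(
    pd.SparseDtype("float64")
)
res_min1 = sparse_row_sum(df, min_count=1)
res_min2 = sparse_row_sum(df, min_count=2)
print(res_min1)
```

Passing min_count=df.shape[1] makes any row with a missing value come out as NaN, which matches skipna=False for this kind of frame. The result is returned as a dense float64 Series; cast it with astype(pd.SparseDtype("float64")) if a sparse result is needed.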