BUG: pd.DataFrame.sum() min_count parameter does not work with Sparse[float64, nan] dtype #55123

Open

Reported by @Danferno (Contributor)

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [2, 3, np.nan], 'B': [5, np.nan, np.nan]}).astype(pd.SparseDtype('float64'))
print(df.dtypes)
df.sum(axis=1, skipna=True, min_count=1)

Issue Description

The pd.DataFrame.sum() function ignores the min_count parameter for sparse dtypes. A row consisting entirely of missing values sums to zero rather than to missing.

df.sum(axis=1, min_count=1, skipna=True)
0    7.0
1    3.0
2    0.0
dtype: Sparse[float64, nan]

Contrast this to the same operation after a to_dense() call, where the last element is correctly set to missing.

df.sparse.to_dense().sum(axis=1, min_count=1, skipna=True)
0    7.0
1    3.0
2    NaN
dtype: float64

Expected Behavior

The last element should be np.nan because all elements in that row are np.nan.

0    7.0
1    3.0
2    np.nan
dtype: Sparse[float64, nan]
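
For reference, a minimal sketch (not from the original report) of the documented min_count semantics on a plain dense Series, which is the behaviour the sparse path is expected to match:

import numpy as np
import pandas as pd

s = pd.Series([np.nan, np.nan])
print(s.sum(min_count=1))   # nan: fewer than 1 non-NA value went into the sum
print(s.sum(min_count=0))   # 0.0: the default, an all-NA sum collapses to zero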

Installed Versions

INSTALLED VERSIONS

commit : ba1cccd
python : 3.11.2.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.22000
machine : AMD64
processor : AMD64 Family 25 Model 33 Stepping 0, AuthenticAMD
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : English_Belgium.1252

pandas : 2.1.0
numpy : 1.25.2
pytz : 2022.7.1
dateutil : 2.8.2
setuptools : 65.5.0
pip : 22.3.1
Cython : 0.29.34
pytest : 7.4.2
hypothesis : None
sphinx : None
blosc : 1.11.1
feather : None
xlsxwriter : None
lxml.etree : 4.9.3
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : None
pandas_datareader : None
bs4 : 4.12.2
bottleneck : None
dataframe-api-compat: None
fastparquet : None
fsspec : 2023.3.0
gcsfs : None
matplotlib : 3.7.2
numba : None
numexpr : 2.8.5
odfpy : None
openpyxl : 3.1.2
pandas_gbq : None
pyarrow : 13.0.0
pyreadstat : 1.2.3
pyxlsb : None

Activity

Label "Needs Triage" added on Sep 13, 2023

Danferno (Contributor, Author) commented on Sep 14, 2023

There also seems to be a massive performance hit. The same sum command takes 5s on (extremely) sparse columns versus 0.04s on dense columns in my own use case.

An example is below, showing an even steeper performance penalty:

import pandas as pd
from time import perf_counter

df = pd.DataFrame({'A': range(int(1e4)), 'B': range(int(1e4))}).astype(pd.SparseDtype('float64'))
print(df.dtypes)

start_sparse = perf_counter()
df.sum(axis=1, skipna=True, min_count=1)
print(f'Sparse | {perf_counter() - start_sparse:<7.3f}s')

df = df.astype('float64')
start_dense = perf_counter()
df.sum(axis=1, skipna=True, min_count=1)
print(f'Dense | {perf_counter() - start_dense:<7.3f}s')

for col in df.columns:
    df.loc[df.sample(frac=0.9).index, col] = pd.NA
df = df.astype(pd.SparseDtype('float64'))
start_sparse2 = perf_counter()
df.sum(axis=1, skipna=True, min_count=1)
print(f'Sparse (0.1 density) | {perf_counter() - start_sparse2:<7.3f}s')

Timings:

Sparse | 1.058  s
Dense | 0.001  s
Sparse (0.1 density) | 0.669  s

Danferno (Contributor, Author) commented on Sep 15, 2023

The performance issue applies to the max method too. In fact, roundtripping through scipy is orders of magnitude faster.

import pandas as pd
from time import perf_counter
from scipy.sparse import csr_array

# Create df
df = pd.DataFrame({'A': range(int(1e4)), 'B': range(int(1e4))}, dtype='float64')
for col in df.columns:
    df.loc[df.sample(frac=0.9).index, col] = pd.NA
df = df.astype(pd.SparseDtype('float64'))

# Pandas
start_pandas = perf_counter()
out_pandas = df.notna().max(axis=1)
print(f'Pandas | {perf_counter() - start_pandas:<7.3f}s')       # 0.413s

# Scipy roundtrip
start_scipy = perf_counter()
out_scipy = csr_array(df.notna()).max(axis=1).toarray().flatten()
out_scipy = pd.Series(out_scipy)
print(f'Scipy | {perf_counter() - start_scipy:<7.3f}s')         # 0.002s

assert out_pandas.eq(out_scipy).all()                           # True

Danferno (Contributor, Author) commented on Sep 15, 2023

I don't know if it's helpful, but I wrote my own function to mimic the behaviour of pd.DataFrame.sum() for two cases. With allowMissing=True it mimics skipna=True, min_count=1; otherwise it mimics skipna=False. It should be fairly trivial to adjust it to any other min_count (count the non-missing values per row instead of checking for all-missing rows; see the sketch after the function below) and to sum over rows instead (use a csc_matrix and change the axis to 0).

In terms of performance, this is on the same order of magnitude as the dense version, in my case taking about twice as long rather than more than three orders of magnitude longer.

import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

def pandasSumSparseColumns(df: pd.DataFrame, allowMissing: bool = True):
    ''' Sum across the columns of a sparse dataframe (one value per row).
        If allowMissing is True, missings count as zeroes unless every element in the row is missing (mimics skipna=True, min_count=1).
        If allowMissing is False, any row that contains a missing value yields a missing result (mimics skipna=False).'''

    # Check for empty
    if df.empty:
        if allowMissing:
            # return a series with the same index as df but all missing
            return pd.Series(index=df.index, dtype='float64')
        else:
            # return a series with the same index as df but all zeroes
            return pd.Series(index=df.index, data=0.0)

    # Flag rows in which every value is missing
    if allowMissing:
        row_all_missing = (csr_matrix(df.isna()).min(axis=1) == 1).toarray().flatten()

    # Sum across columns
    if allowMissing:
        csr = csr_matrix(df.fillna(0))   # replace missings with zeroes so they are skipped in the sum
    else:
        csr = csr_matrix(df)             # keep the NaNs so any row containing a missing sums to NaN
    csr_sum = csr.sum(axis=1).A1

    # Set the all-missing rows back to missing
    if allowMissing:
        csr_sum[row_all_missing] = np.nan

    # Return a pandas series aligned with the original index
    return pd.Series(csr_sum, index=df.index)
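
As noted above, the same round-trip idea extends to an arbitrary min_count. The sketch below is a hypothetical generalisation (the helper name and threshold logic are mine, not part of the original comment): count the non-missing values per row and blank out any row that falls below the threshold.

import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

def sparse_row_sum_min_count(df: pd.DataFrame, min_count: int = 1) -> pd.Series:
    ''' Sum across columns of a sparse dataframe, returning NaN for rows with
        fewer than min_count non-missing values (mimics skipna=True with min_count). '''
    notna_per_row = csr_matrix(df.notna(), dtype='int8').sum(axis=1).A1   # non-missing values per row
    row_sum = csr_matrix(df.fillna(0)).sum(axis=1).A1                     # treat missings as zero for the sum
    row_sum[notna_per_row < min_count] = np.nan                           # enforce the min_count threshold
    return pd.Series(row_sum, index=df.index)

# With the df from the original report, sparse_row_sum_min_count(df, min_count=1)
# gives 7.0, 3.0, NaN, matching the dense result.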

Label "Needs Triage" removed on Nov 1, 2023
