Skip to content

DOC: Categorical gotchas says memory usage is O(nm), when it is actually O(n+m) #17705

Closed
@RauliRuohonen

Description

@RauliRuohonen

Code Sample, a copy-pastable example if possible

import pandas as pd

def make_series(categories, repeats):
    categories = [str(i) for i in range(categories)]
    return pd.Series(categories*repeats).astype('category')

def series_info(series):
    true_size = series.nbytes
    n = len(series.cat.categories)
    m = len(series.cat.codes)
    expected_size = n*8+m  # Note: not n*m*constant
    print('Number of categories (n): %d, length of data (m): %d, bytes: %d, '
          '8n+m: %d, nm: %d' % (n, m, true_size, expected_size, n*m))

series_info(make_series(123, 456))

# Output:
#  Number of categories (n): 123, length of data (m): 56088, bytes: 57072, 8n+m: 57072, nm: 6898824
#
# Note that 8n+m matches true size as it should. nm is much larger.

Problem description

The documentation says:

The memory usage of a Categorical is proportional to the number of categories times the length of the data.

If there are n categories and the length of data is m, this means that memory usage is O(nm). As the script above shows however, the true usage is O(n+m). The claimed usage O(nm) suggests that the implementation would store the categories using one-hot encoding, which would be strange to say the least.

Activity

jreback

jreback commented on Sep 28, 2017

@jreback
Contributor

well if you want to issue a PR to correct the documentation that would be fine.

added this to the Next Major Release milestone on Sep 28, 2017
berkay-dincer

berkay-dincer commented on Oct 2, 2017

@berkay-dincer
Contributor

@jreback @RauliRuohonen created PR #17736 to solve this issue.

modified the milestones: Next Major Release, 0.21.0 on Oct 2, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Relationships

    None yet

      Development

      Participants

      @jreback@RauliRuohonen@berkay-dincer

      Issue actions

        DOC: Categorical gotchas says memory usage is O(nm), when it is actually O(n+m) · Issue #17705 · pandas-dev/pandas