Refactor - ArrayManager overview issue #39146

@jorisvandenbossche

Member

Related to the discussion in #10556, and following up on the mailing list discussion "A case for a simplified (non-consolidating) BlockManager with 1D blocks" (archive).

An initial proof of concept for a non-consolidating "ArrayManager" (storing the columns of a DataFrame as a list of 1D arrays instead of blocks) was merged in #36010.
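To make the storage difference concrete, here is a minimal toy sketch in plain Python. It is not the pandas implementation: `ToyBlock`, `ToyBlockManager` and `ToyArrayManager` are hypothetical names, and plain lists stand in for NumPy arrays. It only illustrates the idea of dtype-grouped blocks versus one independent 1D array per column.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ToyBlock:
    """Hypothetical consolidated chunk: all columns sharing one dtype."""
    dtype: str
    column_positions: List[int]  # which DataFrame columns live in this block
    values: List[list]           # one inner list per stored column


@dataclass
class ToyBlockManager:
    """Columns grouped by dtype into blocks (consolidated layout)."""
    blocks: List[ToyBlock]

    def get_column(self, i: int) -> list:
        # Column lookup has to find the owning block first.
        for block in self.blocks:
            if i in block.column_positions:
                return block.values[block.column_positions.index(i)]
        raise IndexError(i)


@dataclass
class ToyArrayManager:
    """One independent 1D array per column, never consolidated."""
    arrays: List[list]

    def get_column(self, i: int) -> list:
        # Column lookup is a direct list access.
        return self.arrays[i]


# Same three columns (int, float, int) in both layouts.
block_mgr = ToyBlockManager(
    [
        ToyBlock("int64", [0, 2], [[1, 2], [3, 4]]),
        ToyBlock("float64", [1], [[0.5, 1.5]]),
    ]
)
array_mgr = ToyArrayManager([[1, 2], [0.5, 1.5], [3, 4]])
```

In the block layout the two integer columns share one block, so column access goes through a position lookup; in the array layout each column is simply an element of the list.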

This issue is meant to track the required follow-up work items to get this to a more feature-complete implementation.

  • In addition, the ArrayManager currently uses an "apply_with_block" fallback for operations that are right now implemented directly on the Block classes. Long term, all those cases should be refactored so that the core functionality of each function can be shared between ArrayManager and BlockManager, without relying directly on the Block classes.
    • Some examples of this: replace, where, interpolate, shift, diff, downcast, putmask, ... (those could all be refactored one at a time).
  • There will also be tests that are either 1) BlockManager-specific (using block internals to test) or 2) testing behaviour specific to DataFrames with BlockManager (we will probably want to change some aspects of e.g. copy/view, setitem-like ops, etc. Those changes all need to be discussed separately of course, but might also require some skipped tests initially).
    Such tests can be skipped with e.g. @td.skip_array_manager_invalid_test.
  • Design questions:

    • What to do with Series, which is currently backed by a SingleBlockManager inheriting from BlockManager (should we also have a "SingleArrayManager"?) -> [ArrayManager] Add SingleArrayManager to back a Series #40152
      ... (probably more items will come up) ...
  • Performance

    • I haven't looked at performance yet (I only ran a few of the ASV benchmarks, see the top post of POC: ArrayManager -- array-based data manager for columnar store #36010). I also think that we should first focus on getting a larger part of the functionality working (which will also make it easier to run benchmarks), but afterwards we will need to identify the areas where performance improvements are needed.
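As an illustration of the refactoring goal in the "apply_with_block" item above, here is a hedged sketch of how the core of an operation such as shift could live in an array-level function that both managers call. All names (`shift_array`, `ToyArrayManager`, `ToyBlockManager`) are hypothetical, and plain lists stand in for arrays; the real pandas code paths are more involved.

```python
from typing import Any, Callable, List


def shift_array(values: list, periods: int, fill_value: Any = None) -> list:
    """Array-level shift: the core logic, with no knowledge of managers or blocks."""
    n = len(values)
    if periods == 0 or n == 0:
        return list(values)
    if abs(periods) >= n:
        return [fill_value] * n
    if periods > 0:
        # Move values towards the end, pad the start.
        return [fill_value] * periods + list(values[:-periods])
    # Move values towards the start, pad the end.
    return list(values[-periods:]) + [fill_value] * (-periods)


class ToyArrayManager:
    """Hypothetical manager holding one 1D array per column."""

    def __init__(self, arrays: List[list]) -> None:
        self.arrays = arrays

    def apply(self, func: Callable[..., list], **kwargs) -> "ToyArrayManager":
        return ToyArrayManager([func(arr, **kwargs) for arr in self.arrays])


class ToyBlockManager:
    """Hypothetical manager holding blocks, each a list of same-dtype columns."""

    def __init__(self, blocks: List[List[list]]) -> None:
        self.blocks = blocks

    def apply(self, func: Callable[..., list], **kwargs) -> "ToyBlockManager":
        return ToyBlockManager(
            [[func(col, **kwargs) for col in block] for block in self.blocks]
        )


array_result = ToyArrayManager([[1, 2, 3], [4, 5, 6]]).apply(shift_array, periods=1)
block_result = ToyBlockManager([[[1, 2, 3], [4, 5, 6]]]).apply(shift_array, periods=1)
```

Neither manager contains shift logic of its own here; each only decides how to iterate over its storage, which is the kind of sharing the item above asks for.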
jorisvandenbossche added the Refactor (Internal refactoring of code) and Internals (Related to non-user accessible pandas implementation) labels on Jan 13, 2021

    jorisvandenbossche (Member, Author) commented on Feb 26, 2021

    One design question that was still left open is what to do with Series, for which we currently have a SingleBlockManager. The two obvious options I can think of:

    • Do something similar and also have a "SingleArrayManager" class
    • Directly store the array on the Series without involving a manager object

    A while ago I thought the second option could be an attractive simplification (in the end, a Series "just" consists of an array and an index, so why would it need a manager?). But that was probably a bit naive ;) The Manager still does quite a few things. Moreover, adding a SingleArrayManager keeps the changes more limited: we can still see later whether getting rid of Single(Block/Array)Manager is an option we want to explore, independent of the BlockManager vs ArrayManager debate. And for implementing certain features consistently between Series and DataFrame, having both backed by a manager is probably useful.

    Now, for the actual SingleArrayManager, some thoughts:

    • SingleBlockManager is actually a subclass of BlockManager. But many methods of BlockManager are not written to work with SingleBlockManager, which means SingleBlockManager has quite a few methods that are never used or would error if used. I don't find that a very nice design pattern. An alternative for SingleArrayManager could be to not subclass ArrayManager itself (but only a base Manager class). We could of course still have a mixin to share those parts that can be shared.
    • I was wondering if we could actually "just" reuse the ArrayManager for Series as well, in which case it would store a single array (a list of arrays with length 1). The underlying storage would then be the same as for a DataFrame with 1 column. But I suppose that storing the name of the Series as a length-1 Index would add quite some overhead (and in addition we already have many places that explicitly do "name resolution" to determine the name of the resulting Series, so that might require bigger changes).
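A rough sketch of the first alternative (a SingleArrayManager that shares a base class and a mixin with ArrayManager instead of subclassing it) could look like the following. This is only an illustration under assumptions: `BaseManager`, `ApplyMixin`, `ToyArrayManager`, `ToySingleArrayManager` and `_rebuild` are hypothetical names, and plain lists stand in for arrays.

```python
from typing import Callable, List


class BaseManager:
    """Hypothetical minimal base class: only what every manager needs."""

    ndim: int

    def __init__(self, arrays: List[list], index: list) -> None:
        self.arrays = arrays
        self.index = index


class ApplyMixin:
    """Shared behaviour: run an array-level function on each stored array."""

    def apply(self, func: Callable[..., list], **kwargs):
        new_arrays = [func(arr, **kwargs) for arr in self.arrays]
        return self._rebuild(new_arrays)


class ToyArrayManager(ApplyMixin, BaseManager):
    """Backs a DataFrame: a list of 1D arrays plus column labels."""

    ndim = 2

    def __init__(self, arrays: List[list], index: list, columns: list) -> None:
        super().__init__(arrays, index)
        self.columns = columns

    def _rebuild(self, arrays: List[list]) -> "ToyArrayManager":
        return ToyArrayManager(arrays, self.index, self.columns)

    def iget(self, i: int) -> "ToySingleArrayManager":
        # Column selection only exists on the 2D manager.
        return ToySingleArrayManager(self.arrays[i], self.index)


class ToySingleArrayManager(ApplyMixin, BaseManager):
    """Backs a Series: exactly one array and no column-wise methods."""

    ndim = 1

    def __init__(self, array: list, index: list) -> None:
        super().__init__([array], index)

    def _rebuild(self, arrays: List[list]) -> "ToySingleArrayManager":
        return ToySingleArrayManager(arrays[0], self.index)

    @property
    def array(self) -> list:
        return self.arrays[0]


df_mgr = ToyArrayManager([[1, 2], [3, 4]], index=[0, 1], columns=["a", "b"])
ser_mgr = df_mgr.iget(1)
doubled = ser_mgr.apply(lambda arr: [2 * v for v in arr])
```

Because the single-array class does not inherit from the 2D one, it carries no DataFrame-only methods such as `iget` that would be unused or would error, while `apply` is still written once in the mixin.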

    I am currently testing out the approach of a separate SingleArrayManager class to see what is needed to implement it fully.

    cc @jreback @jbrockmendel @TomAugspurger

    Participants: @jorisvandenbossche, @corleyma, @jbrockmendel, @mroeschke, @rhshadrach

            Refactor - ArrayManager overview issue · Issue #39146 · pandas-dev/pandas