Refactor - ArrayManager overview issue

Related to the discussion in https://github.com/pandas-dev/pandas/issues/10556, and following up on the mailing list discussion *"A case for a simplified (non-consolidating) BlockManager with 1D blocks"* ([archive](https://mail.python.org/pipermail/pandas-dev/2020-May/001219.html)).

An initial proof of concept for a non-consolidating "ArrayManager" (storing the columns of a DataFrame as a list of 1D arrays instead of blocks) was merged in https://github.com/pandas-dev/pandas/pull/36010.

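To make the difference in storage layout concrete, here is a minimal pure-Python sketch. It is not pandas code: `consolidated_layout`, `array_layout`, and the toy `columns` data are hypothetical names used only for illustration, and the standard-library `array` module stands in for NumPy arrays.

```python
from array import array

# Hypothetical toy data: column name -> (typecode, values). Not pandas API.
columns = {
    "a": ("q", [1, 2, 3]),
    "b": ("d", [0.5, 1.5, 2.5]),
    "c": ("q", [10, 20, 30]),
}


def consolidated_layout(cols):
    """Group same-typed columns together into one block per type."""
    blocks = {}
    for name, (typecode, values) in cols.items():
        block = blocks.setdefault(typecode, {"names": [], "rows": []})
        block["names"].append(name)
        block["rows"].append(array(typecode, values))
    return blocks


def array_layout(cols):
    """Store every column as its own 1D array, in column order."""
    return [array(typecode, values) for typecode, values in cols.values()]


blocks = consolidated_layout(columns)
arrays = array_layout(columns)

# Block layout: finding column "c" means locating its block first,
# then its position inside that block.
block_c = blocks["q"]["rows"][blocks["q"]["names"].index("c")]

# Array layout: column i is simply arrays[i].
array_c = arrays[list(columns).index("c")]

assert block_c == array_c
print(len(blocks), len(arrays))  # two blocks versus three 1D arrays
```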
This issue is meant to track the required follow-up work items to get this to a more feature-complete implementation.

- **Functionality**: get all tests passing
  - There are big chunks of tests failing because some larger sets of functionality are not yet implemented or rely on BlockManager internals (this last aspect is also covered in https://github.com/pandas-dev/pandas/issues/34669). Those bigger topics are:
    - [x] `quantile` / `describe` related (ArrayManager.quantile is not yet implemented) -> https://github.com/pandas-dev/pandas/pull/40189
    - [x] `equals` related (ArrayManager.equals is not yet implemented) -> #39721
    - [x] `groupby` related tests (there are still a few parts of groupby that directly use the blocks) -> #39885, #40050
    - [x] `concat` related (`internals/concat.py` currently only handles the simple case for ArrayManager, where no reindexing is needed; the full functionality, similar to what `concatenate_block_managers` / the `JoinUnits` now cover, still needs to be implemented) -> https://github.com/pandas-dev/pandas/pull/39612
    - [ ] indexing related (some of the ArrayManager methods like `setitem`, `iset`, `insert` are not yet fully implemented for all corner cases + get indexing tests passing)
    - [ ] IO related:
      - [x] JSON (https://github.com/pandas-dev/pandas/issues/27164, fixed in #41809)
      - [ ] pytables code still relies on block internals
  - In addition, the ArrayManager currently also uses an "apply_with_block" fallback for things that are right now implemented directly on the Block classes. Long term, all those cases should also be refactored so that the core functionality of the specific function can be shared between ArrayManager and BlockManager, without directly relying on the Block classes.
    - [ ] Some examples of this: `replace`, `where`, `interpolate`, `shift`, `diff`, `downcast`, `putmask`, ... (those could all be refactored one at a time). 
  - There will also be tests that are either 1) BlockManager specific (using block internals to test) or 2) testing behaviour specific to DataFrames with BlockManager (we will probably want to change some aspects about eg copy/view, setitem-like ops, etc. Those changes all need to be discussed separately of course, but might also require some skipped tests initially).
    Such tests can be skipped with eg ``@td.skip_array_manager_invalid_test``.
 
- **Design** questions:
  - [ ] What to do with Series, which now is a SingleBlockManager inheriting from BlockManager (should we also have a "SingleArrayManager"?) -> #40152
  - [ ] ... (probably more items will come up) ...
 
- **Performance**
  - I haven't yet looked at performance (I only ran a few of the ASV benchmarks, see the top post of https://github.com/pandas-dev/pandas/pull/36010). I also think that we should first focus on getting a larger part of the functionality working (which will also make it easier to run benchmarks), but afterwards we will need to identify the different areas where performance improvements are needed.

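The long-term refactor described under Functionality (moving logic off the Block classes so both managers can share it) could look roughly like the sketch below. This is only an illustration under stated assumptions: `shift_1d`, `ToyBlockManager`, and `ToyArrayManager` are hypothetical names, plain lists stand in for arrays, and none of this mirrors the actual pandas internals.

```python
def shift_1d(values, periods, fill_value=None):
    """Shift a 1D sequence by `periods` positions, padding with fill_value.

    Hypothetical shared kernel: it knows nothing about blocks or managers.
    """
    n = len(values)
    if periods == 0 or n == 0:
        return list(values)
    if abs(periods) >= n:
        return [fill_value] * n
    if periods > 0:
        return [fill_value] * periods + list(values[:-periods])
    return list(values[-periods:]) + [fill_value] * (-periods)


class ToyBlockManager:
    """Hypothetical manager: columns grouped into blocks (lists of columns)."""

    def __init__(self, blocks):
        self.blocks = blocks

    def shift(self, periods):
        # Each block loops over its columns and calls the shared kernel.
        return ToyBlockManager(
            [[shift_1d(col, periods) for col in block] for block in self.blocks]
        )


class ToyArrayManager:
    """Hypothetical manager: a flat list of 1D columns."""

    def __init__(self, arrays):
        self.arrays = arrays

    def shift(self, periods):
        # No detour through a Block object: call the shared kernel directly.
        return ToyArrayManager([shift_1d(arr, periods) for arr in self.arrays])


bm = ToyBlockManager([[[1, 2, 3], [4, 5, 6]], [[0.5, 1.5, 2.5]]])
am = ToyArrayManager([[1, 2, 3], [4, 5, 6], [0.5, 1.5, 2.5]])

# Flatten the blocks to compare both managers column by column.
shifted_block_cols = [col for block in bm.shift(1).blocks for col in block]
assert shifted_block_cols == am.shift(1).arrays
```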





Refactor - ArrayManager overview issue #39146

jorisvandenbossche commented on Feb 26, 2021
