Closed
Labels
Internals (Related to non-user accessible pandas implementation), Refactor (Internal refactoring of code)
Description
Related to the discussion in #10556, and following up on the mailing list discussion "A case for a simplified (non-consolidating) BlockManager with 1D blocks" (archive).
Initial proof of concept for a non-consolidating "ArrayManager" (storing the columns of a DataFrame as a list of 1D arrays instead of blocks) is merged in #36010.
This issue is meant to track the required follow-up work items to get this to a more feature-complete implementation.
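To make the storage difference concrete, here is a toy sketch (not the pandas implementation) contrasting consolidated 2D blocks with a plain list of 1D arrays. The class names `ToyBlockStore` and `ToyArrayStore` are made up for illustration, and NumPy is assumed to be available.

```python
import numpy as np


class ToyBlockStore:
    """Toy stand-in: columns of the same dtype are consolidated into one 2D block."""

    def __init__(self, columns):
        # columns: dict mapping column name -> 1D array
        self.names = list(columns)
        by_dtype = {}
        for pos, name in enumerate(self.names):
            arr = np.asarray(columns[name])
            by_dtype.setdefault(arr.dtype, []).append((pos, arr))
        # Each block is (column positions, 2D array of shape (n_cols_in_block, n_rows)).
        self.blocks = [
            ([pos for pos, _ in items], np.vstack([arr for _, arr in items]))
            for items in by_dtype.values()
        ]


class ToyArrayStore:
    """Toy stand-in: every column stays a separate 1D array, nothing is consolidated."""

    def __init__(self, columns):
        self.names = list(columns)
        self.arrays = [np.asarray(columns[name]) for name in self.names]


data = {
    "a": np.array([1, 2, 3]),
    "b": np.array([4, 5, 6]),
    "c": np.array([0.1, 0.2, 0.3]),
}
blocks = ToyBlockStore(data)
arrays = ToyArrayStore(data)
```

In this sketch the two integer columns end up in a single 2D block while the float column gets its own block, whereas the array-based store simply holds three independent 1D arrays.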
Functionality: get all tests passing
- There are big chunks of tests failing because some larger sets of functionality are not yet implemented or rely on BlockManager internals (this last aspect is also covered in Practical steps towards a simplified BlockManager #34669). Those bigger topics are:
  - quantile / describe related (ArrayManager.quantile is not yet implemented) -> ENH: ArrayManager.quantile #40189
  - equals related (ArrayManager.equals is not yet implemented) -> [ArrayManager] Implement .equals method #39721
  - groupby related tests (there are still a few parts of groupby that directly use the blocks) -> [ArrayManager] GroupBy cython aggregations (no fallback) #39885, [ArrayManager] Remaining GroupBy tests (fix count, pass on libreduction for now) #40050
  - concat related (internals/concat.py only deals with the simple case when no reindexing is needed for ArrayManager at the moment; the full functionality, similar to what concatenate_block_managers / the JoinUnits now cover, still needs to be implemented) -> [ArrayManager] REF: Implement concat with reindexing #39612
  - indexing related (some of the ArrayManager methods like setitem, iset, insert are not yet fully implemented for all corner cases + get indexing tests passing)
  - IO related:
    - pytables code still relies on block internals
- In addition, the ArrayManager currently also uses an "apply_with_block" fallback for things that are right now implemented directly on the Block classes. Long term, all those cases should also be refactored so that the core functionality of the specific function can be shared between ArrayManager and BlockManager, without directly relying on the Block classes.
- Some examples of this: replace, where, interpolate, shift, diff, downcast, putmask, ... (those could all be refactored one at a time).
- There will also be tests that are either 1) BlockManager specific (using block internals to test) or 2) testing behaviour specific to DataFrames with BlockManager (we will probably want to change some aspects of e.g. copy/view, setitem-like ops, etc. Those changes all need to be discussed separately of course, but might also require some skipped tests initially). Such tests can be skipped with e.g. @td.skip_array_manager_invalid_test.
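The idea of sharing core functionality between both managers, instead of falling back to Block methods, can be sketched roughly as follows. This is only an illustration under stated assumptions: `shift_1d`, `ToyArrayManager` and `ToyBlockManager` are hypothetical names, the kernel only handles float data, and none of this mirrors the actual pandas code.

```python
import numpy as np


def shift_1d(values, periods, fill_value=np.nan):
    """Array-level kernel (hypothetical): shift a 1D float array by `periods`."""
    values = np.asarray(values, dtype="float64")
    result = np.full(values.shape, fill_value, dtype="float64")
    n = len(values)
    if periods == 0:
        result[:] = values
    elif 0 < periods < n:
        result[periods:] = values[:-periods]
    elif -n < periods < 0:
        result[:periods] = values[-periods:]
    # If abs(periods) >= n, everything is shifted out and only fill values remain.
    return result


class ToyArrayManager:
    """Toy stand-in: a list of 1D arrays, one per column."""

    def __init__(self, arrays):
        self.arrays = list(arrays)

    def apply(self, func, **kwargs):
        return ToyArrayManager([func(arr, **kwargs) for arr in self.arrays])


class ToyBlockManager:
    """Toy stand-in: a single 2D block of shape (n_columns, n_rows)."""

    def __init__(self, block):
        self.block = block

    def apply(self, func, **kwargs):
        # The same kernel is reused column by column; no Block method is involved.
        return ToyBlockManager(np.vstack([func(row, **kwargs) for row in self.block]))


am = ToyArrayManager([np.array([1.0, 2.0, 3.0]), np.array([4.0, 5.0, 6.0])])
bm = ToyBlockManager(np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]))
shifted_am = am.apply(shift_1d, periods=1)
shifted_bm = bm.apply(shift_1d, periods=1)
```

Both toy managers call the same array-level function, which is the kind of sharing the refactor aims for; a real implementation would also need to handle dtypes, extension arrays and 2D fast paths.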
Design questions:
- What to do with Series, which is now backed by a SingleBlockManager inheriting from BlockManager (should we also have a "SingleArrayManager"?) -> [ArrayManager] Add SingleArrayManager to back a Series #40152
- ... (probably more items will come up) ...
Performance
- I have not yet looked at performance in detail (I only ran a few of the ASV benchmarks, see top post of POC: ArrayManager -- array-based data manager for columnar store #36010). I also think that we should first focus on getting a larger part of the functionality working (which will also make it easier to run benchmarks), but afterwards we will need to identify the different areas where performance improvements are needed.
jorisvandenbossche commented on Feb 26, 2021
One design question that was still left open is what to do with Series, for which we currently have a SingleBlockManager. The two obvious options I can think of:
A while ago I thought the second option could be an attractive simplification (because in the end, a Series "just" consists of an array and an index, so why would it need a manager?). But that was probably a bit naive ;) The Manager still does quite a few things, and moreover, adding a SingleArrayManager keeps the changes more limited (we can still see later if getting rid of Single(Block/Array)Manager is an option we want to explore, independent of the BlockManager vs ArrayManager debate). Also, for implementing certain features consistently between Series and DataFrame, having both backed by an underlying manager is probably useful.
Now, for the actual SingleArrayManager, some thoughts:
- The current SingleBlockManager is actually a subclass of BlockManager. But many methods of the BlockManager are not written to work with SingleBlockManager, which means there are quite a few methods on the SingleBlockManager that are never used or would error if used. I don't find that a very nice design pattern. An alternative for SingleArrayManager could be to not subclass ArrayManager itself (but only a base Manager class). We could of course still have a mixin to share those parts that can be shared.
I am currently testing out the approach of a separate SingleArrayManager class to see what is needed to implement it fully.
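A rough sketch of that alternative layout, with a base class plus a mixin rather than one manager subclassing the other, could look like the following. All names here (`ToyBaseManager`, `ToyApplyMixin`, `ToyArrayManager`, `ToySingleArrayManager`) are hypothetical, and plain Python lists stand in for arrays to keep the example self-contained; it is not the design that pandas adopted.

```python
from abc import ABC, abstractmethod


class ToyBaseManager(ABC):
    """Hypothetical base class holding only what every manager must provide."""

    @property
    @abstractmethod
    def shape(self):
        ...


class ToyApplyMixin:
    """Shared logic that only needs `self.arrays` and `self.axes`."""

    def apply(self, func, **kwargs):
        new_arrays = [func(arr, **kwargs) for arr in self.arrays]
        return type(self)(new_arrays, self.axes)


class ToyArrayManager(ToyApplyMixin, ToyBaseManager):
    """2D manager: one array per column, axes = [row labels, column labels]."""

    def __init__(self, arrays, axes):
        self.arrays = list(arrays)
        self.axes = list(axes)

    @property
    def shape(self):
        return (len(self.axes[0]), len(self.axes[1]))

    def iget(self, i):
        # 2D-only method: extract column i as a single-array manager.
        return ToySingleArrayManager([self.arrays[i]], [self.axes[0]])


class ToySingleArrayManager(ToyApplyMixin, ToyBaseManager):
    """1D manager: exactly one array; it does not inherit the 2D-only methods."""

    def __init__(self, arrays, axes):
        if len(arrays) != 1:
            raise ValueError("a single-array manager holds exactly one array")
        self.arrays = list(arrays)
        self.axes = list(axes)

    @property
    def shape(self):
        return (len(self.axes[0]),)

    @property
    def array(self):
        return self.arrays[0]


mgr = ToyArrayManager([[1, 2, 3], [4, 5, 6]], [[0, 1, 2], ["a", "b"]])
col = mgr.iget(1)
doubled = col.apply(lambda arr: [2 * x for x in arr])
```

Because the single-array class does not derive from the 2D class, methods such as `iget` simply do not exist on it, which avoids carrying unused or erroring methods; the shared `apply` still lives in one place via the mixin.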
cc @jreback @jbrockmendel @TomAugspurger