Description
I am opening this issue because I want to try adding a custom GeometryBlock in GeoPandas
(working together with Matthew Rocklin in geopandas/geopandas#467. Ultra-short motivation: we want to store integers (pointers to C objects) in a column, but box them into shapely Python objects when the user interacts with the column (repr, accessing an element, ...)).
I am of course free to try this :-), but I wanted to raise it because it has some consequences. By "allow external libraries" in the issue title, I mean the following:
- agree that this is 'OK', which means that we try not to break the Block API (to a certain extent of course, or try to change it in a backwards-compatible way)
- accept some changes to pandas to make this possible where needed (as long as they are only some internal clean-ups)
I don't think we plan many internal refactorings for pandas 0.x / 1.x, so in that regard the Block API should/could remain rather stable (of course for 2.0 this is a whole other issue).
So this issue can serve as general discussion for this (or if people have input or feedback) and as a reference for when changes in pandas are made for this.
cc @pandas-dev/pandas-core
Activity
jorisvandenbossche commented on Aug 1, 2017
#17143 is an example issue of a small change.
Based on my first experiments, it seems that implementing the GeometryBlock is more or less feasible. The repr (with the above PR), (re)indexing, slicing, accessing elements, some operations, ... are already working, although it is of course possible that these were just the easy parts and that the can of worms only opens now, when trying to fix the remaining problems :-)
TomAugspurger commented on Aug 3, 2017
jorisvandenbossche commented on Aug 3, 2017
I think it is perfectly reasonable to assume that the current GeometryBlock that would be included in geopandas will only work for pandas 1.x, and that we will have to rework this for pandas 2.x. Maybe the main constraint for pandas 2.x in this regard would be not to support such blocks as such, but to at least have a similar (hopefully cleaner) mechanism to let external libraries extend pandas.
TomAugspurger commented on Aug 3, 2017
Yes, I was going to suggest that, but I don't want to put more work on Wes and others' plate :) I wouldn't really consider this a hard requirement for the initial pandas 2, but at some point it would be good to have.
wesm commented on Aug 3, 2017
I think we'll be able to make user-defined types much simpler. For example, a Latitude-Longitude type could be embedded in struct<latitude: double, longitude: double>. Ultimately the block manager is going away, but I don't think this should prevent useful work from happening in current pandas.
As an aside, it seems more and more likely that the optimal route for pandas2 will be a separate codebase, while factoring out reusable components of pandas 0.x that do not need to have knowledge of the low-level internals.
mrocklin commented on Aug 3, 2017
The GeoPandas case is a bit more complex than storing structs. We need to store (and track) pointers to an external library, GEOS. This is the library that backs essentially every geospatial system, including Postgres' PostGIS.
Currently our array-like-geometry object tracks references so that we can free the GEOS pointers at the appropriate time. Is handling pointers to external libraries within scope for Pandas 2? This is a bit atypical.
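To make the pointer-tracking idea above concrete, here is a minimal sketch in plain Python. It is not the GeoPandas implementation: FakeGeos, Point and HandleArray are hypothetical names invented for illustration, and the "pointers" are plain integers handed out by a toy stand-in for an external C library. It only shows the two behaviours described in this thread: elements are boxed into Python objects on access, and the handles are released when the owning array goes away.

```python
import weakref


class FakeGeos:
    """Hypothetical stand-in for an external C library that hands out handles."""

    def __init__(self):
        self._live = {}   # handle -> coordinates owned by the "library"
        self._next = 1

    def create_point(self, x, y):
        handle = self._next
        self._next += 1
        self._live[handle] = (x, y)
        return handle

    def coords(self, handle):
        return self._live[handle]

    def destroy(self, handle):
        del self._live[handle]


class Point:
    """Boxed Python object, created only when the user touches an element."""

    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __repr__(self):
        return f"POINT ({self.x} {self.y})"


def _free_all(lib, handles):
    # Runs at most once, when the owning array is closed or garbage collected.
    for handle in handles:
        lib.destroy(handle)


class HandleArray:
    """Array-like that stores integer handles and owns their lifetime."""

    def __init__(self, lib, handles):
        self._lib = lib
        self._handles = list(handles)
        # The callback must not reference self, so pass lib and the list directly.
        self._finalizer = weakref.finalize(self, _free_all, lib, self._handles)

    def __len__(self):
        return len(self._handles)

    def __getitem__(self, i):
        # Box the handle into a Python object on access.
        x, y = self._lib.coords(self._handles[i])
        return Point(x, y)

    def close(self):
        # Explicit release; safe to call more than once.
        self._finalizer()
```

Slicing is deliberately left out: once two arrays can share handles, a single owner is no longer enough and some form of reference tracking is needed, which is the atypical part raised in the comment above.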
jbrockmendel commented on Aug 4, 2017
It looks like the set of recognized Block subclasses is hard-coded in internals.form_blocks and internals.make_block. (It also looks like some of the logic in these two functions could be shared.) It wouldn't be too hard to have these functions refer to a registry that brave souls could experiment with.
jschendel commented on Jul 25, 2018
Can this be closed now that we have the extension array interface, and through that an ExtensionBlock? Or is this something we want in addition to that?
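For reference, the registry idea suggested earlier in the thread could look roughly like the sketch below. This is a toy model, not pandas code: Block, ObjectBlock, register_block and make_block_sketch are hypothetical names, and the real pandas internals are not used or imitated in detail. The point is only that block selection can consult a list of (predicate, class) pairs instead of a hard-coded chain of type checks.

```python
class Block:
    """Hypothetical stand-in for an internal block; not the pandas class."""

    def __init__(self, values):
        self.values = values


class ObjectBlock(Block):
    """Fallback used when no registered predicate matches."""


# (predicate, block class) pairs, checked in order; newest registrations first.
_BLOCK_REGISTRY = []


def register_block(predicate, block_cls):
    """Let an external library claim values that satisfy predicate."""
    _BLOCK_REGISTRY.insert(0, (predicate, block_cls))


def make_block_sketch(values):
    """Pick a block class from the registry instead of hard-coded checks."""
    for predicate, block_cls in _BLOCK_REGISTRY:
        if predicate(values):
            return block_cls(values)
    return ObjectBlock(values)


# What an external library might do (all names here are illustrative).
class GeometryValues(list):
    """Marker container for geometry handles."""


class GeometryBlock(Block):
    """Block subclass supplied by the external library."""


register_block(lambda v: isinstance(v, GeometryValues), GeometryBlock)
```

With this in place, make_block_sketch(GeometryValues([1, 2])) yields a GeometryBlock, while an ordinary list falls through to ObjectBlock.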