DISCUSSION: allow external libraries to define a custom Block #17144

Closed
@jorisvandenbossche

Description

@jorisvandenbossche
Member

I am opening this issue because I want to try adding a custom GeometryBlock in GeoPandas (joint work with Matthew Rocklin in geopandas/geopandas#467). Ultra-short motivation: we want to store integers (pointers to C objects) in a column, but box them into shapely Python objects when the user interacts with the column (repr, accessing an element, ...).
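To make the motivation concrete, here is a minimal plain-Python sketch of "store integers, box on access". It does not use the pandas Block API or shapely: `BoxedColumn` and `Point` are hypothetical names invented for illustration, and the integers are ordinary numbers standing in for C pointers.

```python
from array import array


class Point:
    """Stand-in for a shapely geometry; purely illustrative."""

    def __init__(self, handle):
        self.handle = handle

    def __repr__(self):
        return f"Point(handle={self.handle})"


class BoxedColumn:
    """Hypothetical column: stores integers, boxes them on access."""

    def __init__(self, handles):
        # Compact storage: signed 64-bit integers standing in for C pointers.
        self._data = array("q", handles)

    def __len__(self):
        return len(self._data)

    def __getitem__(self, key):
        if isinstance(key, slice):
            # Slicing stays in the integer representation.
            return BoxedColumn(self._data[key])
        # Scalar access boxes the integer into a Python object.
        return Point(self._data[key])

    def __repr__(self):
        # The repr also boxes, so the user only ever sees Python objects.
        return "[" + ", ".join(repr(self[i]) for i in range(len(self))) + "]"


col = BoxedColumn([101, 102, 103])
```

Slicing keeps the compact integer storage, while element access and the repr go through the boxing step. This is the split the GeometryBlock is meant to provide inside a DataFrame.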

I am of course free to try this :-), but I wanted to raise it because it has some consequences. By "allow external libraries" in the issue title, I mean the following:

  • agree that this is 'OK', which means that we try not to break the Block API (to a certain extent of course, or we try to change it in a backwards-compatible way)
  • accept some changes to pandas to make this possible where needed (as long as they are only internal clean-ups)

I don't think we plan many internal refactorings for pandas 0.x / 1.x, so in that regard the Block API should/could remain rather stable (of course for 2.0 this is a whole other issue).

So this issue can serve as a place for general discussion (and for any input or feedback) and as a reference for when changes are made in pandas for this.

cc @pandas-dev/pandas-core

Activity

jorisvandenbossche commented on Aug 1, 2017

@jorisvandenbossche
Member, Author

#17143 is an example issue of a small change.

Based on my first experiments, it seems that implementing the GeometryBlock is more or less feasible. The repr (with the above PR), (re)indexing, slicing, accessing elements, some operations, ... are already working, although it is of course possible that these were just the easy parts and that the can of worms only opens now, when trying to fix the remaining problems :-)

TomAugspurger commented on Aug 3, 2017

@TomAugspurger
Contributor

jorisvandenbossche commented on Aug 3, 2017

@jorisvandenbossche
Member, Author

(of course for 2.0 this is a whole other issue).

I think this deserves more than a parenthetical :) We want to avoid introducing new APIs that will break with pandas 2, as your GeometryBlock would (I think). That said, I think it's worthwhile, even if the internals of geopandas will need to be updated for pandas 2.

I think it is perfectly reasonable to assume that the current GeometryBlock that would be included in geopandas will only work for pandas 1.x, and that we will have to rework this for pandas 2.x. Maybe the main constraint for pandas 2.x in this regard would be not to support such blocks, but at least to have a similar (hopefully cleaner) mechanism to let external libraries extend pandas.

TomAugspurger commented on Aug 3, 2017

@TomAugspurger
Contributor

Maybe the main constraint for pandas 2.x in this regard would be not to support such blocks, but at least to have a similar (hopefully cleaner) mechanism to let external libraries extend pandas.

Yes, I was going to suggest that, but I don't want to put more work on Wes and others' plate :) I wouldn't really consider this a hard requirement for the initial pandas 2, but at some point it would be good to have.

wesm commented on Aug 3, 2017

@wesm
Member

I think we'll be able to make user defined types much simpler. For example, a Latitude-Longitude type could be embedded in struct<latitude: double, longitude: double>. Ultimately the block manager is going away, but I don't think this should prevent useful work from happening in current pandas.
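As a rough illustration of the struct idea, the sketch below stores a latitude-longitude type as two parallel columns of doubles. It is not pandas 2 code, and how such a type would actually be laid out is not specified in this thread: the struct-of-arrays layout and the names `LatLon` and `LatLonColumn` are assumptions made for the example.

```python
from array import array
from typing import NamedTuple


class LatLon(NamedTuple):
    """Logical value type, mirroring struct<latitude: double, longitude: double>."""

    latitude: float
    longitude: float


class LatLonColumn:
    """Hypothetical struct-of-arrays column: one array of doubles per field."""

    def __init__(self, values):
        values = list(values)
        self.latitude = array("d", (v.latitude for v in values))
        self.longitude = array("d", (v.longitude for v in values))

    def __len__(self):
        return len(self.latitude)

    def __getitem__(self, i):
        # Reassemble the logical struct from its physical fields.
        return LatLon(self.latitude[i], self.longitude[i])


col = LatLonColumn([LatLon(52.37, 4.90), LatLon(40.71, -74.01)])
```

A type like this needs no custom storage class because each field is already a plain primitive column.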

As an aside, it seems more and more likely that the optimal route for pandas2 will be a separate codebase, while factoring out reusable components of pandas 0.x that do not need to have knowledge of the low level internals.

mrocklin commented on Aug 3, 2017

@mrocklin
Contributor

The GeoPandas case is a bit more complex than storing structs. We need to store (and track) pointers to an external library, GEOS. This is the library that backs essentially every geospatial system, including Postgres' PostGIS.

Currently our array-like-geometry object tracks references so that we can free the GEOS pointers at the appropriate time. Is handling pointers to external libraries within scope for Pandas 2? This is a bit atypical.
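The lifetime problem can be sketched with the standard library's `weakref.finalize`. This is not the GeoPandas implementation: `GeometryArray`, `destroy_geometry`, and the integer handles are hypothetical stand-ins, and the sketch covers only a single owner, not pointers shared between several arrays.

```python
import weakref

freed = []  # records released handles so the effect is observable


def destroy_geometry(handle):
    """Hypothetical stand-in for the C call that frees one geometry."""
    freed.append(handle)


def _release(handles):
    # Module-level function: it must not hold a reference to the owning
    # array, or the array could never be garbage collected.
    for handle in handles:
        destroy_geometry(handle)


class GeometryArray:
    """Hypothetical owner of a batch of external pointers."""

    def __init__(self, handles):
        self._handles = list(handles)
        # Runs _release at most once: on close(), on garbage collection,
        # or at interpreter exit, whichever comes first.
        self._finalizer = weakref.finalize(self, _release, self._handles)

    def close(self):
        """Free the pointers deterministically; safe to call repeatedly."""
        self._finalizer()


arr = GeometryArray([11, 22, 33])
arr.close()
arr.close()  # second call is a no-op
```

Unlike a struct of doubles, the stored integers here are only meaningful while the external library keeps the objects alive, which is why the container has to own the cleanup.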

jbrockmendel commented on Aug 4, 2017

@jbrockmendel
Member

It looks like the set of recognized Block subclasses is hard-coded in internals.form_blocks and internals.make_block. (It also looks like some of the logic in these two functions could be shared.) It wouldn't be too hard to have these functions refer to a registry that brave souls could experiment with.
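A registry of this kind could look roughly like the sketch below. It is a standalone illustration, not pandas code: the `make_block` here does not match the signature of the pandas internal function of the same name, and `register_block`, the string `kind` keys, and the toy `Block` classes are all assumptions.

```python
_BLOCK_REGISTRY = {}  # maps a kind name to a Block subclass


def register_block(kind):
    """Class decorator that records a Block subclass under `kind`."""

    def decorator(cls):
        if kind in _BLOCK_REGISTRY:
            raise ValueError(f"a Block is already registered for {kind!r}")
        _BLOCK_REGISTRY[kind] = cls
        return cls

    return decorator


class Block:
    """Toy base class; the real pandas Block carries far more state."""

    def __init__(self, values):
        self.values = values


@register_block("object")
class ObjectBlock(Block):
    pass


# An external library would register its own subclass the same way.
@register_block("geometry")
class GeometryBlock(Block):
    pass


def make_block(values, kind="object"):
    """Look the class up in the registry instead of a hard-coded if/elif chain."""
    try:
        cls = _BLOCK_REGISTRY[kind]
    except KeyError:
        raise TypeError(f"no Block registered for kind {kind!r}") from None
    return cls(values)


block = make_block([1, 2, 3], kind="geometry")
```

With a lookup table, supporting a new block type means one registration call rather than an edit to the dispatch logic in two functions.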

jschendel commented on Jul 25, 2018

@jschendel
Member

Can this be closed now that we have the extension array interface, and through that an ExtensionBlock? Or is this something we want in addition to that?

Labels: API Design, Internals (related to non-user accessible pandas implementation)
