Description
TL;DR: Don't make PyArrow required - instead, set minimum NumPy version to 2.0 and use NumPy's StringDType.
Background
In PDEP-10, it was proposed that PyArrow become a required dependency. Several reasons were given, but the most significant reason was to adopt a proper string data type, as opposed to `object`.
This was voted on and agreed upon, but there have been some important developments since then, so I think it's warranted to reconsider.
StringDType in NumPy
There's a proposal in NumPy to add a StringDType to NumPy itself. This was brought up in the PDEP-10 discussion, but at the time was not considered significant enough to delay the PyArrow requirement because:
- NumPy itself might not accept its StringDType proposal.
- NumPy's StringDType might not come with the algorithms pandas needs.
- pyarrow's strings might still be significantly faster.
- because pandas typically supports older NumPy versions (in addition to the latest release), it would be 2+ years until pandas could use NumPy's strings.
Let's tackle these in turn:
- I caught up with Nathan Goldbaum (author of the StringDType proposal) today, and he's said that NEP 55 will be accepted (although technically still in draft status, it has several supporters and no objectors, so realistically its status is going to change to "accepted" very soon).
- The second concern was the algorithms. Here's an excerpt of the NEP I'd like to draw attention to:

  > In addition, we will add implementations for the comparison operators as well as an add loop that accepts two string arrays, multiply loops that accept string and integer arrays, an isnan loop, and implementations for the str_len, isalpha, isdecimal, isdigit, isnumeric, isspace, find, rfind, count, strip, lstrip, rstrip, and replace string ufuncs [universal functions] that will be newly available in NumPy 2.0.

  So, NEP 55 not only provides a NumPy StringDType, but also efficient string algorithms.
There's a pandas fork implementing this in pandas, which Nathan has been keeping up to date. Once the NumPy StringDType is merged into NumPy main (likely next week) it'll be much easier for pandas devs to test it out. Note: some parts of the fork don't yet use the ufuncs, but they will soon; it's just a matter of updating things.
For any ufunc that's missing, Nathan's said that now that the string-ufuncs framework exists in NumPy, it's relatively straightforward to add new ones (e.g. for `.str.partition`). There is real funding behind this work, so it's likely to keep moving quite fast.
- On the speed comparison with pyarrow: Nathan's said he doesn't have timings to hand, and is about to go on holiday 🌴. He'll be able to provide timings in 1-2 weeks' time though.
- As for supporting older NumPy versions: personally, I'd be fine with requiring NumPy 2.0 as the minimum NumPy version for pandas if it means efficient string handling by default without the need for PyArrow. Also, Nathan Goldbaum's fork already implements this for pandas, so there's no need to wait 2 years; it should just be a matter of months.
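To make the ufunc point above concrete: the operations listed in the NEP excerpt map directly onto what pandas exposes via `.str`. A rough sketch of the vectorised style of call involved, using the fixed-width `np.char` routines available in every NumPy release (NEP 55 adds native variable-width equivalents for `StringDType` in NumPy 2.0; this is an illustration, not the NEP's implementation):

```python
import numpy as np

# Fixed-width string array; under NEP 55 this could be a variable-width
# StringDType array instead, with the same ufunc-style operations.
words = np.array(["pandas", "numpy", "pyarrow"])

lengths = np.char.str_len(words)      # vectorised str_len
alpha = np.char.isalpha(words)        # vectorised isalpha
positions = np.char.find(words, "a")  # vectorised find (-1 if absent)
stripped = np.char.strip(np.array(["  hi  "]))

print(lengths.tolist())    # [6, 5, 7]
print(positions.tolist())  # [1, -1, 2]
```

The point is that once these exist as ufuncs on a native string dtype, pandas's `.str` accessor can dispatch to them without materialising Python objects.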
Feedback
The feedback issue makes for an interesting read: #54466.
Complaints seem to come mostly (as far as I can tell) from other package maintainers who are considering moving away from pandas (e.g. fairlearn).
This one surprised me; I don't think anyone had considered it before. One could argue that it's VirusTotal's issue, but still, I just wanted to bring visibility to it.
Tradeoffs
In the PDEP-10 PR it was mentioned that PyArrow could help reduce some maintenance work (which, despite some funding, still seems to be mostly volunteer-driven). Has this been investigated further? Is it still likely to be the case?
Furthermore, not requiring PyArrow would mean not being able to infer `list` and `struct` dtypes by default (at least, not without significant further work).
"No is temporary, yes is forever"
I'm not saying "never require PyArrow". I'm just saying, at this point in time, I don't think the requirement is justified. Of the proposed benefits, the most salient one is strings, and now there's a realistic alternative which doesn't require taking on an extra massive dependency.
I acknowledge that lately I've been more focused on other projects, and so don't want to come across as "I'm telling pandas what to do because I know best!" (I certainly don't).
Circumstances have changed since the PDEP-10 PR and vote, and personally I regret voting the way I did. Does anyone else feel the same?
Activity
mroeschke commented on Jan 25, 2024
TL;DR: I am +1 on not making pyarrow a required dependency in pandas 3.0. I am -1 on making NumPy 2.0 the minimum version and NumPy's StringDType the default in pandas 3.0. Keep the status quo in 3.0.
A few thoughts:
The numpy StringDType will still be net new in 2.0. While I expect the new type to be robust and more performant than `object`, I think that, as with any new feature, it should be opt-in first before being made the default, as the scope of edge-case incompatibility is unknown. pyarrow strings have been around since 1.3, and it was not until recently that they were decided to become the default (I understand it's a different type system too).

I have a biased belief that the pyarrow type system, with its nullability and support for more types, would be a net benefit for users, but I understand that the current numpy type system is "sufficient". It would be cool to allow users to use pyarrow types everywhere in pandas by default, but making that opt-in is, I think, a likely end state for pyarrow + pandas.
WillAyd commented on Jan 25, 2024
I think we should still stick with PDEP-10 as is; even if user benefit 1 wasn't as drastic as envisioned, I still think benefits 2 and 3 help immensely.
Generally the story around the pandas type system is very confusing; I am hopeful that moving towards the Arrow type system solves that over time.
jorisvandenbossche commented on Jan 25, 2024
Personally I am in favor of keeping pyarrow optional (although I voted for the PDEP, because I find it more important to have a proper string dtype). But I also agree with Matt that it seems too fast to require numpy >= 2 for pandas (not only because the string dtype is very new, but also just because this will be annoying for the ecosystem, requiring such a new version of numpy that many other packages will not yet be compatible with).
If we want a simple alternative to keep pyarrow optional, I don't think we need to use numpy's new string dtype, though. We already have an object-dtype-based StringDtype that can be the fallback when pyarrow is not installed. Users still get the benefit of a new default, proper `string` dtype in 3.0 in all cases, but if they also want the performance improvements of the new string dtype, they need to have pyarrow installed. Then it's up to users to make that trade-off (and we can find ways to strongly encourage users to use pyarrow).

I would also like to suggest another potential idea to consider: we could adopt Arrow's type (memory model) for strings, but without requiring the pyarrow package. Building on @WillAyd's work in #54506 using nanoarrow to use bitmaps in pandas, we could implement a basic StringArray in a similar way, implement the basic required features in pandas itself (things like getitem, take, isna, unique/factorize), and then for the string-specific methods either use pyarrow if installed, or fall back to Python's string methods otherwise (or, if we could vendor some code for this, progressively implement some string methods ourselves).
This of course requires a decent chunk of work in pandas itself, but with the advantages that it keeps compatibility with the Arrow type system (and zero-copy conversion to/from Arrow), and already gives some advantages for the case where pyarrow is not installed (improved memory usage, performance improvements for a subset of methods).
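For intuition, the Arrow memory model described above stores all string data in one contiguous byte buffer, with an offsets array marking element boundaries and a validity bitmap marking nulls. A minimal pure-Python sketch of getitem and isna on that layout (a toy illustration, not real nanoarrow or pandas code):

```python
class ArrowStyleStringArray:
    """Toy Arrow-layout string array: data buffer + offsets + validity bitmap."""

    def __init__(self, strings):
        self.data = bytearray()
        self.offsets = [0]
        self.validity = 0  # bit i set => element i is valid (non-null)
        for i, s in enumerate(strings):
            if s is not None:
                self.data += s.encode("utf-8")
                self.validity |= 1 << i
            self.offsets.append(len(self.data))
        self.length = len(strings)

    def __getitem__(self, i):
        if not (self.validity >> i) & 1:
            return None
        start, stop = self.offsets[i], self.offsets[i + 1]
        return bytes(self.data[start:stop]).decode("utf-8")

    def isna(self):
        return [not (self.validity >> i) & 1 for i in range(self.length)]


arr = ArrowStyleStringArray(["foo", None, "bar"])
print(arr[0], arr[1], arr.isna())  # foo None [False, True, False]
```

Operations like take, unique, and factorize reduce to manipulating offsets and bits, which is what makes this feasible to implement in pandas itself while deferring the string-specific methods to pyarrow or Python fallbacks.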
lithomas1 commented on Jan 30, 2024
+1 on this as well. IMO, it's too early to require numpy 2.0 (since it's pretty hard to adapt to the changes).
cc @pandas-dev/pandas-core
datapythonista commented on Jan 30, 2024
+1 on not requiring numpy 2 for pandas 3.
I'm fine with continuing as planned with the PDEP. If we consider the option of another Arrow implementation replacing PyArrow, it feels like Arrow-rs is a better option than nanoarrow to me (or at least an option also worth considering). Last time this was discussed it wasn't clear what would happen with the two Rust implementations, but now everybody (except Polars, for now) has settled on Arrow-rs, and Arrow2 is discontinued. So, things are stable.
If there is interest, I can research further and work on a prototype.
Dr-Irv commented on Jan 30, 2024
I think we should wait for more feedback in #54466. pandas 2.2 was released only 11 days ago. I say we give it a month, or maybe until the end of February, and then make a decision. The whole point of gathering feedback was to give us a chance to revisit the decision to make `pyarrow` a required dependency. It seems like our options at this point with pandas 3.0 are:

1. Require `pyarrow` as planned per PDEP-10.
2. Require `numpy` 2.0 and use the `numpy` implementation for strings.
3. Don't require `pyarrow` - make it optional, but allow people to get better string performance by opting in.

Given the feedback so far, the arguments that @MarcoGorelli gives above, and the other comments, I'm leaning towards (3), but I'd like to see more feedback from the community at large.
lithomas1 commented on Jan 30, 2024
IMO, we should make a decision by the next dev call (Feb. 7th, I think?).
I'm probably going to release 2.2.1 at most 2 weeks after numpy releases the 2.0rc (so probably around Feb. 14, assuming numpy 2.0 releases on schedule on Feb. 1), and I think we should decide whether to roll back the warning for 2.2.1, to avoid confusion.
datapythonista commented on Jan 31, 2024
I did a quick test of how big a binary using Arrow-rs (Rust) would be. In Rust, generally only static linking is used, so just one `.so` would be needed, with no dependencies. A sample library using Arrow-rs with the default components (arrow-json, arrow-ipc...) compiles to a file of around 500kb. In that sense, the Arrow-rs approach would solve the installation and size issues. Of course this is not an option for pandas 3.0, and it requires a non-trivial amount of work.

Something that could make this happen quicker and with less effort is implementing the same PyArrow API with Arrow-rs for the parts we need. In theory, that would allow us to simply replace PyArrow with the new package and update the imports.
If there is interest in giving this a try, I'd personally change my vote here from requiring PyArrow in pandas 3 to keeping the status quo for now.
simonjayhawkins commented on Feb 3, 2024
I assume that the decision would be whether we plan to revise the PDEP and then go through the PDEP process again for the revised PDEP?
The PDEP process was created not only so that decisions have sufficient discussion and visibility, but also so that, once agreed, people could work towards the agreed changes/improvements without being vetoed by individual maintainers.
In this case, however, it may be that several maintainers would vote differently now.
Does our process allow us to re-vote on an existing PDEP? (Given that the PDEP did include a provision to collect feedback from the community.)
Does the outcome of any discussions/decisions on this affect whether the next pandas version is 3.0 or 2.3?
attack68 commented on Feb 3, 2024
Agree with Simon. This concern was discussed as part of the original PDEP (#52711 (comment)), with some timelines discussed, and the vote was still approved. I somewhat expected some of the pushback from developers of web apps, so I am supportive of this new proposal and of my original vote, but it needs to fit in with the governance established, and we should possibly also be cautious of any development that has taken place in H2 '23 in anticipation of the implementation of the PDEP. I would expect the approved PDEP to continue to steer development until formally agreed otherwise. I don't see a reason why a new PDEP could not be proposed to alter/amend the previous one, particularly if there already seems to be enough support to warrant one.
WillAyd commented on Feb 3, 2024
With @jorisvandenbossche's idea in mind, I wanted to try to implement an ExtensionArray-compatible StringArray using nanoarrow. Some Python idioms like negative indexing aren't yet implemented, and there was a limitation around classmethods I haven't worked around, but otherwise I did get this implemented here:
https://github.com/WillAyd/nanopandas/tree/7e333e25b1b4027e49b9d6ad2465591abf0c9b27
I also implemented some of the optional interface items like `unique`, `fillna` and `dropna`, alongside a few str accessor methods.

Of course, benchmarking this would take some effort, but I think most of the algorithms we would need are pretty simple.
simonjayhawkins commented on Feb 3, 2024
I too was keen to keep pyarrow optional, but voted for the PDEP because of the benefits to other dtypes.
From the PDEP... "Starting in pandas 3.0, the default type inferred for string data will be ArrowDtype with pyarrow.string instead of object. Additionally, we will infer all dtypes that are listed below as well instead of storing as object."
IIRC I also made this point in the original discussion, but there was pushback against having the object-backed StringDtype as the default if pyarrow is not installed, which included not only concerns about performance but also concerns about different behavior depending on whether a dependency was installed. (The timelines for NumPy's StringDType precluded that as an option to address the performance concerns.)
However, I did not push this point once the proposal was expanded to dtypes other than strings.
Didn't we also discuss using, say, nanoarrow? (Or am I mixing this up with the discussion on requiring pyarrow for the I/O interface?)
If this wasn't discussed then a new/further discussion around this option would add value ( #57073 (comment)) especially since @WillAyd is actively working on this.
WillAyd commented on Feb 5, 2024
Another advantage of building on top of nanoarrow is that we would have the ability to implement our own algorithms to fit the needs of pandas. Here is a quick benchmark of the nanopandas `isna()` implementation versus pandas:

[benchmark results shown in the original comment]

That's about a 200x speedup. Of course it's not a fair comparison, because the pandas arrow extension implementation calls `to_numpy()`, but in theory we would have more flexibility to avoid that copy to numpy if we take on more management of the lower level.