Description
This came up indirectly in #52509 and I think merits some brainstorming. In no particular order:
-
Circa 2018 there was discussion of stripping some (debug?) symbols from our C files. No idea if that went anywhere. cc @WillAyd
-
In the last couple years we have improved perf in some groupby reductions by using fused types in libgroupby to support more dtypes directly without casts. I think this significantly increased the size of libgroupby. We did something similar in libalgos and libhashtable. I think avoiding the casting is worth it, but we should acknowledge the tradeoffs.
-
Some stuff in _libs could plausibly live outside of cython without a ton of downside. ops_dispatch and reduction come to mind, though these are both quite small. More could move if we learn to live with circular dependencies.
-
This would be a PITA, but we could distribute some dtype-specific stuff separately e.g.
pip install pandas[sparse] pandas[interval] pandas[period]
and potentially see some big savings that way. This would really be a PITA, but would make a big dent.
-
IIUC moving cython code back to plain C might get some mileage cc @WillAyd again? This wo
-
Avoid the numpy dependency. (grep finds 1105 "import numpy"s in pandas/, some of them in eg doctests. 33 "cimport numpy"s)
-
Avoid pytz dependency (xref DEPR: deprecate pytz support #46463 coming up shortly once we drop py38)
-
Avoid dateutil dependency
-
There was a discussion [citation needed] of distributing pandas without the tests. I guess that was a "no".
-
related DEV: reduce the size of the dev environment.yml #49998
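The import counts in the numpy bullet above can be approximated with a short script. This is a rough sketch and not the exact grep behind the quoted figures: the helper name `count_matching_lines`, the set of file suffixes, and the word-boundary patterns are all assumptions, so the totals may differ from a plain grep.

```python
import re
from pathlib import Path


def count_matching_lines(root, pattern, suffixes=(".py", ".pyx", ".pxd", ".pxi")):
    """Count lines under `root` that match `pattern` (hypothetical helper)."""
    regex = re.compile(pattern)
    total = 0
    for path in Path(root).rglob("*"):
        # Only look at source files; skip directories and other file types.
        if not path.is_file() or path.suffix not in suffixes:
            continue
        text = path.read_text(encoding="utf-8", errors="ignore")
        total += sum(1 for line in text.splitlines() if regex.search(line))
    return total


# Example usage, run from the root of a pandas checkout (path is an assumption):
# count_matching_lines("pandas", r"\bimport numpy")   # excludes "cimport numpy"
# count_matching_lines("pandas", r"\bcimport numpy")
```

The `\b` keeps `import numpy` from also matching `cimport numpy`, which a naive substring search would count twice.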
Activity
MarcoGorelli commented on Apr 13, 2023
Yeah, the only part of dateutil which I think we should be using is in guess_datetime_format. It's been on the back of my mind that, aside from inferring formats, we shouldn't be using it to parse; it's just too unreliable.
lithomas1 commented on Apr 13, 2023
1 is done already (for cibuildwheel), and has been for a couple years now (with the old build system).
I think one other thing to consider is the size of the tests, IIUC some people were complaining about that a little while back.
EDIT: Nvm, didn't see the bullet about tests.
mroeschke commented on Apr 13, 2023
pd.test and the EA tests as publicly advertised
jbrockmendel commented on Apr 13, 2023
There are three things from dateutil we use: parser, tz, relativedelta. parser I pretty much agree with Marco we can/should move away from. tz IIUC is going to be subsumed by zoneinfo anyway (i.e. the private attributes we rely on will go away) eventually, so getting rid of our special support for that makes sense. relativedelta we use in a few places (some of which are unnecessary or broken, xref #52569); I haven't given much thought to how we could avoid that.
WillAyd commented on Apr 13, 2023
Always a good conversation but I don't think we should change the way we think about C / Cython and fused types to account for this
jorisvandenbossche commented on Apr 14, 2023
It seems that the tests make up about 1/3rd of the distributed package size, so that might be worth reconsidering
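A quick way to check a figure like this locally is to compare the on-disk size of the tests directory with the whole package. The sketch below is illustrative only: `dir_size` and `subdir_share` are hypothetical helper names, and it measures an installed (uncompressed) tree, including any `__pycache__` files, so the ratio will not exactly match the wheel or sdist.

```python
from pathlib import Path


def dir_size(root):
    """Total size in bytes of all regular files under `root`."""
    return sum(p.stat().st_size for p in Path(root).rglob("*") if p.is_file())


def subdir_share(package_root, subdir="tests"):
    """Fraction of `package_root`'s bytes that live under `package_root/subdir`."""
    total = dir_size(package_root)
    if total == 0:
        return 0.0
    return dir_size(Path(package_root) / subdir) / total


# Example usage (assumes pandas is installed and ships a top-level tests/ directory):
# import pandas
# print(subdir_share(Path(pandas.__file__).parent))
```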
lithomas1 commented on Apr 15, 2023
I can have a look at removing tests (moving them to a separate pandas-tests package). This is probably going to cause some friction for developers, though (I will comment more on the other issue soon).
Re points 7 & 8, it is worth noting that pytz and dateutil hardly take up any space, as they are pure Python (both are < 1MB). I would not worry about those too much.
One thing that might be interesting is to try PGO/LTO on our C extensions.
This article (https://documentation.suse.com/sbp/all/html/SBP-GCC-10/index.html#sec-gcc10-spec, see section 7.1 Figure 6) seems to suggest that it could result in a pretty nice decrease in size, but even if it doesn't result in a size reduction it might be worth enabling for the other perf benefits.
While I don't think any other projects are doing this, I think Python itself is built with PGO/LTO, and there is an issue in one of the Python repos suggesting using BOLT on module .so libraries (faster-cpython/ideas#449).
(One thing to note, though, is that it would dramatically increase the compile time, and maybe OOM the GHA runners used to build our wheels. There's also the question of what to use as profiling data for PGO)
(@WillAyd Do you think this is worth pursuing?)
WillAyd commented on Apr 15, 2023
Looks interesting. A little out of my wheelhouse but if you have time/interest I say go for it. PGO looks particularly interesting, though I guess we'd have to decide how we want to best train the program for optimization
jbrockmendel commented on Oct 25, 2023
On the pytz one, we would also need to roll a replacement for pytz.AmbiguousTimeError
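For illustration, a replacement could look roughly like the following. This is a minimal sketch, not a proposal for the actual pandas implementation: the `ValueError` base class and the `localize_strict` helper are assumptions, and the check relies on a PEP 495-aware tzinfo such as `zoneinfo.ZoneInfo` (pytz timezones do not honor `fold`).

```python
from datetime import datetime, tzinfo


class AmbiguousTimeError(ValueError):
    """Hypothetical stand-in for pytz.AmbiguousTimeError."""


def localize_strict(naive: datetime, tz: tzinfo) -> datetime:
    """Attach `tz` to a naive datetime, raising if the wall time occurs twice."""
    if naive.tzinfo is not None:
        raise ValueError("expected a naive datetime")
    first = naive.replace(tzinfo=tz, fold=0)
    second = naive.replace(tzinfo=tz, fold=1)
    # When clocks are set back, the first occurrence of the wall time has the
    # larger UTC offset; in a gap (clocks set forward) the order is reversed.
    if first.utcoffset() > second.utcoffset():
        raise AmbiguousTimeError(f"{naive} is ambiguous in {tz}")
    return first


# Example usage (requires time zone data to be available on the system):
# from zoneinfo import ZoneInfo
# localize_strict(datetime(2023, 11, 5, 1, 30), ZoneInfo("America/New_York"))
# raises AmbiguousTimeError, since 01:30 occurs twice on that date
```

Comparing the two offsets with `>` rather than `!=` keeps nonexistent times (spring-forward gaps) from being reported as ambiguous; those would need a separate check.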