Closed
Labels: Duplicate Report, Internals, Timezones
Description
At the moment, I don't see any way to have performant, timezone aware columns. This seems rather surprising, as generally when you have a column with timestamps, they will all likely be of the same timezone. Right now, as far as I can see you have two options:
- Columns that are basically just numpy datetime64[ns] types. These seem to be timezone unaware. If you make such a column timezone aware (e.g. by dataframe.time_column.dt.tz_localize('UTC')), it becomes a column of dtype object.
- A DatetimeIndex, which keeps track of timezone information at seemingly the column level (which is laudable). However, this only seems to really work used as an index. If I assign it to a column, it again gets converted to a dtype object, and things get slow.
Am I missing something? Is there a really good reason why a DatetimeIndex can't just be used, as is, in a column, without the dtype=object conversion?
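The two options above can be checked with a short script. This is only a sketch: the column names and sample timestamps are made up for illustration. The object-dtype result described here reflects pandas as of this issue (January 2015); releases that ship a timezone-aware column dtype would be expected to report a datetime64 dtype carrying the timezone instead.

```python
import pandas as pd

# Hypothetical sample data: two naive timestamps in a regular column.
df = pd.DataFrame(
    {"time_column": pd.to_datetime(["2015-01-13 09:00", "2015-01-13 10:00"])}
)

# Option 1: localize the column itself.
df["time_utc"] = df["time_column"].dt.tz_localize("UTC")

# Option 2: build a timezone-aware DatetimeIndex and assign it to a column.
idx = pd.DatetimeIndex(df["time_column"]).tz_localize("UTC")
df["from_index"] = idx

# Inspect what dtype each column ended up with on the installed pandas.
print(pd.__version__)
print(df.dtypes)
```

If either of the last two columns prints as object, the slow path described above is being taken.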
jorisvandenbossche commented on Jan 13, 2015
Your analysis is generally correct. The problem is, as you pointed out, that numpy does not have support for time zones, and the data in columns are stored as numpy arrays.
The DatetimeIndex provides some workarounds to handle time zones at the level of the full index, and these workarounds are not (yet) available for columns ('blocks'). As far as I understand, it would be possible to do something similar for the DatetimeBlock, but @jreback can shed more light on this.
I think it is a rather big enhancement, but if you are interested in working on this, you are certainly welcome! Or, for improving timezone support in numpy itself, they are certainly also looking for help.
quicknir commented on Jan 13, 2015
Thank you @jorisvandenbossche, very helpful. Handling it at the DatetimeBlock level seems a bit messier, e.g. different columns could have different timezone or precision information. I was thinking of a solution more like Categorical. As far as I can see, Categorical seems to have all the right basic infrastructure in place, which could be duplicated to create a DatetimeColumn, but naturally this is a very superficial viewpoint.
Sure, I am interested in at least looking at what would be required. The timezone support on the numpy side I actually view as adequate, in the sense that I think the datetime64[ns] is perfectly adequate as a low-level data type, and has the enormous advantage of being just exactly a 64 bit integer and nothing else. This, I think, should be handled on the pandas side.
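To make the idea concrete, here is a minimal sketch of a Categorical-style container: the data stays a plain datetime64[ns] array of UTC instants, and a single timezone is stored once for the whole column. The class name TzAwareColumn and its methods are hypothetical and are not part of pandas; a fixed-offset timezone is used only to keep the example free of extra dependencies.

```python
from datetime import datetime, timedelta, timezone, tzinfo

import numpy as np


class TzAwareColumn:
    """Hypothetical column: UTC instants as datetime64[ns] plus one tz for all rows."""

    def __init__(self, utc_values, tz: tzinfo):
        # The storage is nothing but 64-bit values; no per-element tz objects.
        self.values = np.asarray(utc_values, dtype="datetime64[ns]")
        self.tz = tz

    def __len__(self):
        return len(self.values)

    def __getitem__(self, i) -> datetime:
        # Convert one stored UTC instant to a tz-aware datetime on access.
        ns = int(self.values[i].astype("int64"))
        seconds, rem_ns = divmod(ns, 1_000_000_000)
        utc = datetime.fromtimestamp(seconds, tz=timezone.utc)
        # datetime only has microsecond resolution, so sub-microsecond digits are dropped.
        return utc.replace(microsecond=rem_ns // 1000).astimezone(self.tz)


# Assumed sample data: two UTC instants viewed in a UTC+01:00 column.
col = TzAwareColumn(
    ["2015-01-13T09:00:00", "2015-01-13T10:30:00"],
    timezone(timedelta(hours=1)),
)
print(col.values.dtype, col.values.itemsize)
print(col[0].isoformat())
```

Vectorized operations would work on col.values directly, and only display or element access would need the timezone.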
jreback commented on Jan 13, 2015
dupe of #8260
@quicknir you are welcome to give this a go, it's actually not that tricky: just inherit from DatetimeBlock and the NonConsolidatingBlock mixin (so these blocks are not combined with one another). This is by far the cleanest solution.
Support on numpy is non-existent, though there are some proposals. Just look through the pandas codebase and you will appreciate the enormity of what @wesm did with timezones.