Description
Master currently has an (undocumented) (maybe-) API-breaking change from 0.23.4 when DatetimeIndex is passed integer values together with a tz.
0.23.4
In [2]: i8data = np.arange(5) * 3600 * 10**9
In [3]: pd.DatetimeIndex(i8data, tz="US/Central")
Out[3]:
DatetimeIndex(['1970-01-01 00:00:00-06:00', '1970-01-01 01:00:00-06:00',
'1970-01-01 02:00:00-06:00', '1970-01-01 03:00:00-06:00',
'1970-01-01 04:00:00-06:00'],
dtype='datetime64[ns, US/Central]', freq=None)
Master
In [3]: pd.DatetimeIndex(i8data, tz="US/Central")
Out[3]:
DatetimeIndex(['1969-12-31 18:00:00-06:00', '1969-12-31 19:00:00-06:00',
'1969-12-31 20:00:00-06:00', '1969-12-31 21:00:00-06:00',
'1969-12-31 22:00:00-06:00'],
dtype='datetime64[ns, US/Central]', freq=None)
Attempt to explain the behavior: In 0.23.4, passing an ndarray[i8] was equivalent to passing data.view("M8[ns]")
# 0.23.4
In [4]: pd.DatetimeIndex(i8data.view("M8[ns]"), tz="US/Central")
Out[4]:
DatetimeIndex(['1970-01-01 00:00:00-06:00', '1970-01-01 01:00:00-06:00',
'1970-01-01 02:00:00-06:00', '1970-01-01 03:00:00-06:00',
'1970-01-01 04:00:00-06:00'],
dtype='datetime64[ns, US/Central]', freq=None)
On master, integer values are treated as unix timestamps, while M8[ns] values are treated as wall-times in the given timezone.
# master
In [4]: pd.DatetimeIndex(i8data.view("M8[ns]"), tz="US/Central")
Out[4]:
DatetimeIndex(['1970-01-01 00:00:00-06:00', '1970-01-01 01:00:00-06:00',
'1970-01-01 02:00:00-06:00', '1970-01-01 03:00:00-06:00',
'1970-01-01 04:00:00-06:00'],
dtype='datetime64[ns, US/Central]', freq=None)
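For reference, both interpretations can be requested explicitly, without relying on how the DatetimeIndex constructor treats integers. This is a minimal sketch, assuming a reasonably recent pandas; the names wall and epoch are only illustrative and do not come from the issue.

```python
import numpy as np
import pandas as pd

# Hourly steps from the epoch, in nanoseconds (same data as above).
i8data = np.arange(5, dtype="i8") * 3600 * 10**9

# Wall-time interpretation: view as naive datetime64[ns], then localize.
wall = pd.DatetimeIndex(i8data.view("M8[ns]")).tz_localize("US/Central")

# Unix-epoch interpretation: treat the integers as UTC nanoseconds, then convert.
epoch = pd.to_datetime(i8data, utc=True).tz_convert("US/Central")

print(wall[0])   # 1970-01-01 00:00:00-06:00
print(epoch[0])  # 1969-12-31 18:00:00-06:00
```

Spelling out the localize or convert step keeps the result independent of which pandas version is installed.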
Reason for the change
There are four cases of interest:
In [4]: arr = np.arange(5) * 24 * 3600 * 10**9
In [5]: tz = 'US/Pacific'
In [6]: a = pd.DatetimeIndex(arr, tz=tz)
In [7]: b = pd.DatetimeIndex(arr.view('M8[ns]'), tz=tz)
In [8]: c = pd.DatetimeIndex._simple_new(arr, tz=tz)
In [9]: d = pd.DatetimeIndex._simple_new(arr.view('M8[ns]'), tz=tz)
In [10]: a
Out[10]:
DatetimeIndex(['1970-01-01 00:00:00-08:00', '1970-01-02 00:00:00-08:00',
'1970-01-03 00:00:00-08:00', '1970-01-04 00:00:00-08:00',
'1970-01-05 00:00:00-08:00'],
dtype='datetime64[ns, US/Pacific]', freq=None)
In [11]: b
Out[11]:
DatetimeIndex(['1970-01-01 00:00:00-08:00', '1970-01-02 00:00:00-08:00',
'1970-01-03 00:00:00-08:00', '1970-01-04 00:00:00-08:00',
'1970-01-05 00:00:00-08:00'],
dtype='datetime64[ns, US/Pacific]', freq=None)
In [12]: c
Out[12]:
DatetimeIndex(['1969-12-31', '1970-01-01', '1970-01-02', '1970-01-03',
'1970-01-04'],
dtype='datetime64[ns, US/Pacific]', freq=None)
In [13]: d
Out[13]:
DatetimeIndex(['1969-12-31', '1970-01-01', '1970-01-02', '1970-01-03',
'1970-01-04'],
dtype='datetime64[ns, US/Pacific]', freq=None)
In 0.23.4 we have a.equals(b) and c.equals(d), but no way to pass data in a way that was constructor-neutral. In master we now have a match c and d. At some point in the refactoring process we changed that, but off the top of my head I don't remember when, or whether this was the precise motivation or just a side-benefit.
BTW _simple_new was also way too much:
    if getattr(values, 'dtype', None) is None:
        # empty, but with dtype compat
        if values is None:
            values = np.empty(0, dtype=_NS_DTYPE)
            return cls(values, name=name, freq=freq, tz=tz,
                       dtype=dtype, **kwargs)
        values = np.array(values, copy=False)
    if is_object_dtype(values):
        return cls(values, name=name, freq=freq, tz=tz,
                   dtype=dtype, **kwargs).values
    elif not is_datetime64_dtype(values):
        values = _ensure_int64(values).view(_NS_DTYPE)
Was this documented?
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DatetimeIndex.html mentions that it's "represented internally as int64".
The (imprecise) type on data is "Optional datetime-like data".
I don't see anything in http://pandas.pydata.org/pandas-docs/stable/timeseries.html suggesting that integers can be passed to DatetimeIndex.
Activity
TomAugspurger commented on Jan 2, 2019
@jbrockmendel could you fill in the "Reason for the change" section? IIUC, it was to simplify DatetimeIndex._simple_new?
mroeschke commented on Jan 2, 2019
Possibly related, I worked on this topic in #21216. Rationale was that DatetimeIndex(ints, tz=tz) should behave similarly to Timestamp(int, tz=tz), and ints cannot necessarily represent wall times.
I have an open issue #20257 to document passing integers into Timestamp which, similarly to DatetimeIndex, will treat integers as epoch (unix) timestamps.
jbrockmendel commented on Jan 2, 2019
@mroeschke thanks. IIRC consistency with to_datetime was also a consideration
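To make the consistency argument concrete, here is a minimal sketch of how Timestamp and to_datetime treat a bare integer, assuming a recent pandas release (0.23.4 was not checked here). The names ns, ts and via_to_datetime are illustrative only.

```python
import pandas as pd

ns = 3600 * 10**9  # one hour after the epoch, in nanoseconds

# Timestamp interprets the integer as a UTC epoch offset, then converts to tz.
ts = pd.Timestamp(ns, tz="US/Central")

# to_datetime has no tz keyword; utc=True plus tz_convert gives the same instant.
via_to_datetime = pd.to_datetime(ns, utc=True).tz_convert("US/Central")

print(ts)  # 1969-12-31 19:00:00-06:00
```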
jorisvandenbossche commented on Jan 2, 2019
to_datetime has no tz keyword, only utc=True, but for UTC wall time or unix epochs are the same?
jorisvandenbossche commented on Jan 2, 2019
The Timestamp/DatetimeIndex consistency is a good reason to change one of both (since they are inconsistent in 0.23.4).
And I think the unix epoch way makes sense for integer input. Although I think it can also be confusing that DatetimeIndex(int_array, ..) and DatetimeIndex(int_array.view('M8[ns]'), ..) give different results.
Regarding "was this documented": it might not have been documented clearly, but it is long-standing behaviour. Either we think people don't use it / shouldn't use it (but then why bother changing it? Shouldn't we then rather deprecate the whole ability of passing integer data, instead of changing the behaviour?), or people do use it, and this change will break that usage.
Of course, a break for a limited number of people might be worth the trade-off for a big win within our code base. But is this consistency between __new__ and _simple_new that important code-technically?
jorisvandenbossche commented on Jan 4, 2019
Any other thoughts / replies here?
TomAugspurger commented on Jan 4, 2019
I think this surprised me early on. That may have been because I was used to the old way; I'm not really sure.
My intent there was "If this was documented before, then we definitely can't change it." I wasn't advocating "It wasn't documented, so we can change it.". Just "It wasn't documented, so we can maybe change it" :)
I don't really have an opinion on the technical merits of wall time vs. epochs. I don't think I know enough to vote one way or the other.
jbrockmendel commented on Jan 4, 2019
I think the overriding internal-consistency concern is the one @mroeschke reminded us of: this should behave like Timestamp constructor.
@TomAugspurger would "fixing" this behavior be difficult? Last time I tried something similar I got test_packers failures that I couldn't figure out.
TomAugspurger commented on Jan 4, 2019
Demo of the inconsistency on 0.23.4
On master, Out[13] matches Out[11].
That consistency is certainly worth striving for.
If we wanted a graceful deprecation warning, then the DTI constructor would .view("datetime64[ns]") and pass that. If they want unix epochs (the future behavior) then they should ...?
Putting that warning in place would give us an idea of how difficult this would be to change. I'll see what turns up. Since _simple_new already does the "right" (future) thing, it may not be too bad...
jorisvandenbossche commented on Jan 4, 2019
to_datetime? (if it would finally have a tz option)
The other option could also be to deprecate passing integers, if we find it too confusing what it should result in.
56 remaining items
jbrockmendel commented on Feb 1, 2019
How close are we to consensus on this? It looks like @mroeschke, @jreback, and I have expressed preference for options 4 or 5, with a slight preference towards 5. @jorisvandenbossche has expressed a preference for option 6.
Anyone else want to weigh in?
jbrockmendel commented on Jun 9, 2019
@TomAugspurger @jorisvandenbossche @mroeschke @jreback I'd like to settle on a desired long-term behavior and start the appropriate deprecation warnings before 0.25.0. How far are we from consensus?
jreback commented on Jun 9, 2019
@jbrockmendel can you do a short summary of this and just confirm that 0.24.2 and master have the same behavior?
TomAugspurger commented on Jul 2, 2019
We discussed this on the call. IIRC someone volunteered / was volunteered to summarize, and maybe close this issue? (was I the one to volunteer?)
TomAugspurger commented on Jan 6, 2020
IIUC, this can be closed since the changes have been made and the deprecation enforced.
reference to pandas-dev#24559