Changes to i8data for DatetimeIndex #24559

@TomAugspurger

Description

@TomAugspurger
Contributor

Master currently has an undocumented, possibly API-breaking change from 0.23.4 in how DatetimeIndex treats integer values passed together with a tz.

0.23.4

In [2]: i8data = np.arange(5) * 3600 * 10**9

In [3]: pd.DatetimeIndex(i8data, tz="US/Central")
Out[3]:
DatetimeIndex(['1970-01-01 00:00:00-06:00', '1970-01-01 01:00:00-06:00',
               '1970-01-01 02:00:00-06:00', '1970-01-01 03:00:00-06:00',
               '1970-01-01 04:00:00-06:00'],
              dtype='datetime64[ns, US/Central]', freq=None)

Master

In [3]: pd.DatetimeIndex(i8data, tz="US/Central")
Out[3]:
DatetimeIndex(['1969-12-31 18:00:00-06:00', '1969-12-31 19:00:00-06:00',
               '1969-12-31 20:00:00-06:00', '1969-12-31 21:00:00-06:00',
               '1969-12-31 22:00:00-06:00'],
              dtype='datetime64[ns, US/Central]', freq=None)

Attempt to explain the behavior: In 0.23.4, passing an ndarray[i8] was equivalent to passing data.view("M8[ns]")

# 0.23.4
In [4]: pd.DatetimeIndex(i8data.view("M8[ns]"), tz="US/Central")
Out[4]:
DatetimeIndex(['1970-01-01 00:00:00-06:00', '1970-01-01 01:00:00-06:00',
               '1970-01-01 02:00:00-06:00', '1970-01-01 03:00:00-06:00',
               '1970-01-01 04:00:00-06:00'],
              dtype='datetime64[ns, US/Central]', freq=None)

On master, integer values are treated as unix timestamps, while M8[ns] values are treated as wall-times in the given timezone.

# master
In [4]: pd.DatetimeIndex(i8data.view("M8[ns]"), tz="US/Central")
Out[4]:
DatetimeIndex(['1970-01-01 00:00:00-06:00', '1970-01-01 01:00:00-06:00',
               '1970-01-01 02:00:00-06:00', '1970-01-01 03:00:00-06:00',
               '1970-01-01 04:00:00-06:00'],
              dtype='datetime64[ns, US/Central]', freq=None)
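Either interpretation can be requested explicitly, without relying on how the constructor treats integers. The following is a minimal sketch using only public methods (`tz_localize`, `tz_convert`); the variable names `as_epoch` and `as_wall` are chosen here for illustration.

```python
import numpy as np
import pandas as pd

# Hourly steps from the epoch, as int64 nanoseconds.
i8data = np.arange(5, dtype="i8") * 3600 * 10**9

# Start from a tz-naive index so no interpretation has been chosen yet.
naive = pd.DatetimeIndex(i8data.view("M8[ns]"))

# Unix-timestamp reading: the integers are UTC instants, shown in US/Central.
as_epoch = naive.tz_localize("UTC").tz_convert("US/Central")

# Wall-time reading: the integers are local clock readings in US/Central.
as_wall = naive.tz_localize("US/Central")

print(as_epoch[0])  # 1969-12-31 18:00:00-06:00
print(as_wall[0])   # 1970-01-01 00:00:00-06:00
```

The two results differ by the UTC offset (six hours here), which is the gap between the 0.23.4 and master outputs above.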

Reason for the change

There are four cases of interest:

In [4]: arr = np.arange(5) * 24 * 3600 * 10**9
In [5]: tz = 'US/Pacific'

In [6]: a = pd.DatetimeIndex(arr, tz=tz)
In [7]: b = pd.DatetimeIndex(arr.view('M8[ns]'), tz=tz)
In [8]: c = pd.DatetimeIndex._simple_new(arr, tz=tz)
In [9]: d = pd.DatetimeIndex._simple_new(arr.view('M8[ns]'), tz=tz)

In [10]: a
Out[10]: 
DatetimeIndex(['1970-01-01 00:00:00-08:00', '1970-01-02 00:00:00-08:00',
               '1970-01-03 00:00:00-08:00', '1970-01-04 00:00:00-08:00',
               '1970-01-05 00:00:00-08:00'],
              dtype='datetime64[ns, US/Pacific]', freq=None)

In [11]: b
Out[11]: 
DatetimeIndex(['1970-01-01 00:00:00-08:00', '1970-01-02 00:00:00-08:00',
               '1970-01-03 00:00:00-08:00', '1970-01-04 00:00:00-08:00',
               '1970-01-05 00:00:00-08:00'],
              dtype='datetime64[ns, US/Pacific]', freq=None)

In [12]: c
Out[12]: 
DatetimeIndex(['1969-12-31', '1970-01-01', '1970-01-02', '1970-01-03',
               '1970-01-04'],
              dtype='datetime64[ns, US/Pacific]', freq=None)

In [13]: d
Out[13]: 
DatetimeIndex(['1969-12-31', '1970-01-01', '1970-01-02', '1970-01-03',
               '1970-01-04'],
              dtype='datetime64[ns, US/Pacific]', freq=None)

In 0.23.4 we have a.equals(b) and c.equals(d), but there was no way to pass data that meant the same thing to both constructors. In master, a now matches c and d. At some point in the refactoring process we changed that, but off the top of my head I don't remember when, or whether this was the precise motivation or just a side benefit.

BTW _simple_new was also doing way too much:

        if getattr(values, 'dtype', None) is None:
            # empty, but with dtype compat
            if values is None:
                values = np.empty(0, dtype=_NS_DTYPE)
                return cls(values, name=name, freq=freq, tz=tz,
                           dtype=dtype, **kwargs)
            values = np.array(values, copy=False)

        if is_object_dtype(values):
            return cls(values, name=name, freq=freq, tz=tz,
                       dtype=dtype, **kwargs).values
        elif not is_datetime64_dtype(values):
            values = _ensure_int64(values).view(_NS_DTYPE)

Was this documented?

http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DatetimeIndex.html mentions that it's "represented internally as int64".

The (imprecise) type on data is "Optional datetime-like data"

I don't see anything in http://pandas.pydata.org/pandas-docs/stable/timeseries.html suggesting that integers can be passed to DatetimeIndex.

Activity

added this to the 0.24.0 milestone on Jan 2, 2019
TomAugspurger

TomAugspurger commented on Jan 2, 2019

@TomAugspurger
Contributor, Author

@jbrockmendel could you fill in the "Reason for the change" section? IIUC, it was to simplify DatetimeIndex._simple_new?

mroeschke

mroeschke commented on Jan 2, 2019

@mroeschke
Member

Possibly related: I worked on this topic in #21216. The rationale was that DatetimeIndex(ints, tz=tz) should behave like Timestamp(int, tz=tz), and ints cannot necessarily represent wall times.

I have an open issue #20257 to document passing integers into Timestamp which, similarly to DatetimeIndex, will treat integers as epoch (unix) timestamps
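As a small illustration of the Timestamp behavior described here, the sketch below builds one Timestamp from an integer with a tz and, for contrast, one by localizing a naive Timestamp. The variable names are illustrative only.

```python
import pandas as pd

# Integer plus tz: the integer is an epoch (UTC) instant, displayed in the tz.
epoch_ts = pd.Timestamp(0, tz="US/Pacific")
print(epoch_ts)        # 1969-12-31 16:00:00-08:00
print(epoch_ts.value)  # 0

# Localizing a naive Timestamp instead treats the integer as a wall time.
wall_ts = pd.Timestamp(0).tz_localize("US/Pacific")
print(wall_ts)         # 1970-01-01 00:00:00-08:00
```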

jbrockmendel

jbrockmendel commented on Jan 2, 2019

@jbrockmendel
Member

@mroeschke thanks. IIRC consistency with to_datetime was also a consideration

jorisvandenbossche

jorisvandenbossche commented on Jan 2, 2019

@jorisvandenbossche
Member

IIRC consistency with to_datetime was also a consideration

to_datetime has no tz keyword, only utc=True, but for UTC, wall times and unix epochs are the same?
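A quick way to check this point: with UTC the two readings coincide, so `to_datetime(..., utc=True)` and localizing the naive values to UTC should produce the same instants. This is a minimal sketch; chaining `tz_convert` afterwards is one possible spelling of the epoch interpretation for another zone, not something the thread settled on.

```python
import numpy as np
import pandas as pd

i8 = np.array([0, 3600 * 10**9], dtype="i8")

# Integers parsed as nanoseconds since the epoch, tagged as UTC.
via_to_datetime = pd.to_datetime(i8, utc=True)

# The same integers viewed as naive datetimes, then read as UTC wall times.
via_localize = pd.DatetimeIndex(i8.view("M8[ns]")).tz_localize("UTC")

# For UTC there is no offset, so both routes give identical instants.
same_instants = bool((via_to_datetime.asi8 == via_localize.asi8).all())

# Epoch interpretation in another zone, without a tz keyword on to_datetime.
central = via_to_datetime.tz_convert("US/Central")
print(central[0])  # 1969-12-31 18:00:00-06:00
```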

jorisvandenbossche

jorisvandenbossche commented on Jan 2, 2019

@jorisvandenbossche
Member

The Timestamp / DatetimeIndex consistency is a good reason to change one of the two (since they are inconsistent in 0.23.4).
And I think the unix epoch way makes sense for integer input. Although I think it can also be confusing that DatetimeIndex(int_array, ..) and DatetimeIndex(int_array.view('M8[ns]'), ..) give different results.

Regarding "was this documented": it might not have been documented clearly, but it is long-standing behaviour. Either we think people don't use it / shouldn't use it (but then why bother changing it? Shouldn't we rather deprecate the whole ability to pass integer data, instead of changing the behaviour?), or people do use it, and this change will break that usage.
Of course, a break for a limited number of people might be worth the trade-off for a big win within our code base. But is this consistency between __new__ and _simple_new really that important from a code perspective?

jorisvandenbossche

jorisvandenbossche commented on Jan 4, 2019

@jorisvandenbossche
Member

Any other thoughts / replies here?

TomAugspurger

TomAugspurger commented on Jan 4, 2019

@TomAugspurger
Contributor, Author

Although I think it can also be confusing that DatetimeIndex(int_array, ..) and DatetimeIndex(int_array.view('M8[ns]'), ..) give different results.

I think this surprised me early on. That may have been because I was used to the old way; I'm not really sure.

Regarding "was this documented":

My intent there was "If this was documented before, then we definitely can't change it." I wasn't advocating "It wasn't documented, so we can change it.". Just "It wasn't documented, so we can maybe change it" :)

I don't really have an opinion on the technical merits of wall time vs. epochs. I don't think I know enough to vote one way or the other.

jbrockmendel

jbrockmendel commented on Jan 4, 2019

@jbrockmendel
Member

I think the overriding internal-consistency concern is the one @mroeschke reminded us of: this should behave like Timestamp constructor.

@TomAugspurger would "fixing" this behavior be difficult? Last time I tried something similar I got test_packers failures that I couldn't figure out.

TomAugspurger

TomAugspurger commented on Jan 4, 2019

@TomAugspurger
Contributor, Author

Demo of the inconsistency on 0.23.4

In [9]: i8 = pd.Timestamp('2000', tz='CET').value

In [10]: pd.Timestamp(i8)
Out[10]: Timestamp('1999-12-31 23:00:00')

In [11]: pd.Timestamp(i8, tz="CET")
Out[11]: Timestamp('2000-01-01 00:00:00+0100', tz='CET')

In [12]: pd.DatetimeIndex(np.array([i8]))
Out[12]: DatetimeIndex(['1999-12-31 23:00:00'], dtype='datetime64[ns]', freq=None)

In [13]: pd.DatetimeIndex(np.array([i8]), tz="CET")
Out[13]: DatetimeIndex(['1999-12-31 23:00:00+01:00'], dtype='datetime64[ns, CET]', freq=None)

On master, Out[13] matches Out[11].

That consistency is certainly worth striving for.


If we wanted a graceful deprecation warning, then the DTI constructor would

  1. check for i8data & tz
  2. tell users to... what? What recourse would they have? If they want wall times (the previous behavior) then they can .view("datetime64[ns]") and pass that. If they want unix epochs (the future behavior) then they should ...?
  3. Update all of pandas to use the recommended behavior from 2 (which is TBD).

Putting that warning in place would give us an idea of how difficult this would be to change. I'll see what turns up. Since _simple_new already does the "right" (future) thing, it may not be too bad...
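The check in step 1 could look roughly like the sketch below. `warn_if_i8_with_tz` is a hypothetical helper, not pandas API, and the message text is only illustrative. The wall-time advice comes from step 2 above; the epoch spelling in the message is an assumption, since the thread leaves that recommendation open.

```python
import warnings

import numpy as np


def warn_if_i8_with_tz(data, tz):
    """Hypothetical helper (not pandas API): flag integer data passed with a tz.

    Returns True when a FutureWarning was issued.
    """
    values = np.asarray(data)
    # Only integer input combined with a timezone is ambiguous;
    # datetime64 input and tz-naive construction are left alone.
    is_ambiguous = tz is not None and values.dtype.kind in "iu"
    if is_ambiguous:
        warnings.warn(
            "Integer data passed with a tz is ambiguous. For wall times, "
            "pass data.view('datetime64[ns]'). For unix epochs (assumed "
            "spelling), localize to UTC and then convert to the tz.",
            FutureWarning,
            stacklevel=2,
        )
    return is_ambiguous
```

Calling the helper from the constructor and running the test suite with warnings turned into errors would show how many internal call sites pass integers together with a tz.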

jorisvandenbossche

jorisvandenbossche commented on Jan 4, 2019

@jorisvandenbossche
Member

If they want unix epochs (the future behavior) then they should ...?

to_datetime ? (if it would finally have a tz option)

The other option could also be to deprecate passing integers, if we find it too confusing what it should result in.

56 remaining items

modified the milestones: 0.24.0, 0.25.0 on Jan 21, 2019
jbrockmendel

jbrockmendel commented on Feb 1, 2019

@jbrockmendel
Member

How close are we to consensus on this? It looks like @mroeschke, @jreback, and I have expressed preference for options 4 or 5, with a slight preference towards 5. @jorisvandenbossche has expressed a preference for option 6.

Anyone else want to weigh in?

jbrockmendel

jbrockmendel commented on Jun 9, 2019

@jbrockmendel
Member

@TomAugspurger @jorisvandenbossche @mroeschke @jreback I'd like to settle on a desired long-term behavior and start the appropriate deprecation warnings before 0.25.0. How far are we from consensus?

jreback

jreback commented on Jun 9, 2019

@jreback
Contributor

@jbrockmendel can you do a short summary of this and just confirm that 0.24.2 and master have the same behavior?

TomAugspurger

TomAugspurger commented on Jul 2, 2019

@TomAugspurger
Contributor, Author

We discussed this on the call. IIRC someone volunteered / was volunteered to summarize, and maybe close this issue? (was I the one to volunteer?)

modified the milestones: 0.25.0, 1.0 on Jul 17, 2019
TomAugspurger

TomAugspurger commented on Jan 6, 2020

@TomAugspurger
Contributor, Author

IIUC, this can be closed since the changes have been made and the deprecation enforced.

added a commit that references this issue on Jul 2, 2021
Metadata

Assignees: No one assigned
Labels: API Design, Blocker, Datetime
Participants: @jreback, @jorisvandenbossche, @TomAugspurger, @jbrockmendel, @mroeschke