Skip to content

BUG: first() changes datetime64 data #9311

Closed
@chrisbyboston

Description

@chrisbyboston

I have a dataframe that contains a datetime64 column that seems to be losing some digits after a groupby.first(). I know that under the hood these are stored at nanosecond precision, but I need them to remain equal, since I'm merging back onto the original frame later on. Is this expected behavior?

import pandas as pd
from pandas import Timestamp

data = [
 {'a': 1,
  'dateCreated': Timestamp('2011-01-20 12:50:28.593448')},
 {'a': 1,
  'dateCreated': Timestamp('2011-01-15 12:50:28.502376')},
 {'a': 1,
  'dateCreated': Timestamp('2011-01-15 12:50:28.472790')},
 {'a': 1,
  'dateCreated': Timestamp('2011-01-15 12:50:28.445286')}]

df = pd.DataFrame(data)

Output is:

In [6]: df
Out[6]:
   a                dateCreated
0  1 2011-01-20 12:50:28.593448
1  1 2011-01-15 12:50:28.502376
2  1 2011-01-15 12:50:28.472790
3  1 2011-01-15 12:50:28.445286

In [7]: df.groupby('a').first()
Out[7]:
                    dateCreated
a
1 2011-01-20 12:50:28.593447936

When I compare the datetime64 in the first row to the datetime64 after the groupby.first(), the two are not equal.

Activity

shoyer

shoyer commented on Jan 21, 2015

@shoyer
Member

This looks like a bug to me.

I can reproduce this on master.

chrisbyboston

chrisbyboston commented on Jan 21, 2015

@chrisbyboston
Author

I have isolated it to this line.

https://github.com/pydata/pandas/blob/master/pandas/core/internals.py#L1818

I'm going to see if I can determine if this is a numpy bug or if we just need to pass some additional parameters to numpy.ndarray.astype.

chrisbyboston

chrisbyboston commented on Jan 21, 2015

@chrisbyboston
Author

Interesting...

numpy casting defaults to "unsafe" for backward compatibility reasons. When I pass in "safe":

result = result.astype('M8[ns]', casting='safe')

TypeError: Cannot cast array from dtype('float64') to dtype('<M8[ns]') according to the rule 'safe'

So numpy is even aware that this can't be done safely.

chrisbyboston

chrisbyboston commented on Jan 21, 2015

@chrisbyboston
Author

If I leave casting="safe", then all is well, as it raises an error and fails over safely:

def _groupby_function(name, alias, npfunc, numeric_only=True,
                      _convert=False):
    def f(self):
        self._set_selection_from_grouper()
        try:
            return self._cython_agg_general(alias, numeric_only=numeric_only)
        except AssertionError as e:
            raise SpecificationError(str(e))
        except Exception:
            result = self.aggregate(lambda x: npfunc(x, axis=self.axis))
            if _convert:
                result = result.convert_objects()
            return result

    f.__doc__ = "Compute %s of group values" % name
    f.__name__ = name

    return f

Result:

   a                dateCreated
0  1 2011-01-20 12:50:28.593448
1  1 2011-01-15 12:50:28.502376
2  1 2011-01-15 12:50:28.472790
3  1 2011-01-15 12:50:28.445286


                 dateCreated
a
1 2011-01-20 12:50:28.593448

I'll work up a PR and a test... and see if I break all the things.

shoyer

shoyer commented on Jan 21, 2015

@shoyer
Member

The real question is why this is being cast to float64 in the first place. We might need to duplicate an internal cython routine so that there is an int64 version that we can safely use without needing any casting.

chrisbyboston

chrisbyboston commented on Jan 21, 2015

@chrisbyboston
Author

@shoyer I'll take a peek at when that gets converted to float64. I didn't catch that at all, but you're right. It should be able to safely convert an int64.

chrisbyboston

chrisbyboston commented on Jan 21, 2015

@chrisbyboston
Author
chrisbyboston

chrisbyboston commented on Jan 21, 2015

@chrisbyboston
Author

Alright, looks like the cython in generated.pyx needs to have functions for group_nth_ for int8, int16, int32, and int64.

Sound right to you @shoyer?

shoyer

shoyer commented on Jan 21, 2015

@shoyer
Member

@iwschris That file is generated from pandas/src/generate_code.py, and it looks like integers were explicitly excluded, probably because something did not think they were necessary for groupby functions: https://github.com/pydata/pandas/blob/master/pandas/src/generate_code.py#L2407

chrisbyboston

chrisbyboston commented on Jan 21, 2015

@chrisbyboston
Author

Cool. I have a branch started that should fix this, and I'll enable integers for now, and see how things go from there.

jreback

jreback commented on Jan 22, 2015

@jreback
Contributor

@iwschris this would be a nice fix

added this to the 0.16.0 milestone on Jan 22, 2015
chrisbyboston

chrisbyboston commented on Jan 22, 2015

@chrisbyboston
Author

First Attempt: Don't convert Integers To Floats in Cython

So, this is tightly wound around the idea that there is no integer nan. Once I enabled integers in generate_code.py, the resulting generated.pyx simply converted integers to float64, which is precisely the thing we're trying to solve here.

I modified the cython templates so that integers could remain integers, but obviously the problem of how to represent integer nan still remained. Just to see if it solved the problem at hand, on the integer groupby templates, I used 0 for nan, and the immediate problem was solved. Using 0 for nan is not an acceptable solution though, and I broke about 20 tests, where the integer to float64 conversion has already been assumed.

Suggested Solution: Coerce safely with a fall-back

This prioritizes data safety over performance, but only in a very narrow band where we know that there is an issue. I'm proposing that we change this line in core/internals.py to this:

result = result.astype('M8[ns]', casting='safe')

All the fallback code already exists, and the change breaks no tests.

Thoughts?

shoyer

shoyer commented on Jan 22, 2015

@shoyer
Member

@iwschris What does the fallback code look like? I'm guessing that skips Cython entirely?

I would rather be correct than fast, so that would probably be OK, but we should know what the performance impact is first. Are there any major regressions in the vbench performance suite? (see here for instructions)

Our usual approach to handling missing values for datetime64 is to check for values equal to iNaT. For example, take a look at the group_count functions in generate_code.py. This would be the preferred solution.

chrisbyboston

chrisbyboston commented on Jan 22, 2015

@chrisbyboston
Author

@shoyer that's right, it skips Cython.

I'll do the vbench stuff and post the results here.

As far as iNaT goes, the issue here is that by the time we get into those Cython functions the datetime64 has already been converted to an Integer.

shoyer

shoyer commented on Jan 22, 2015

@shoyer
Member

@iwschris it's actually probably fine to do iNaT comparisons even for non-datetime types. You'll notice some of the existing code does exactly that. iNaT is the smallest int64 number:

In [2]: pd.tslib.iNaT
Out[2]: -9223372036854775808
chrisbyboston

chrisbyboston commented on Jan 22, 2015

@chrisbyboston
Author

Excellent. I'll check it out then. Thanks!

chrisbyboston

chrisbyboston commented on Jan 22, 2015

@chrisbyboston
Author

Hah... the relevant part is here:

timeseries_custom_bday_cal_decr              |   0.0296 |   0.0197 |   1.5040 |
stat_ops_series_std                          |   0.6110 |   0.3927 |   1.5560 |
strings_title                                |  10.8200 |   6.6644 |   1.6236 |
groupby_last_float64                         |   6.6760 |   2.6947 |   2.4775 |
groupby_multi_count                          | 978.3040 |   5.6086 | 174.4281 |
groupby_last_datetimes                       | 2054.3030 |   9.3013 | 220.8609 |
groupby_first_datetimes                      | 2042.0907 |   8.5020 | 240.1893 |
-------------------------------------------------------------------------------
Test name                                    | head[ms] | base[ms] |  ratio   |
-------------------------------------------------------------------------------

...point taken. I'll keep working on the Cython :)

chrisbyboston

chrisbyboston commented on Jan 23, 2015

@chrisbyboston
Author

Cython is fixed, all tests passing, vbench is looking good as well. I'll have the PR in today.

chrisbyboston

chrisbyboston commented on Jan 23, 2015

@chrisbyboston
Author

Here it is: #9345

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Relationships

    None yet

      Development

      Participants

      @chrisbyboston@jreback@shoyer

      Issue actions

        BUG: first() changes datetime64 data · Issue #9311 · pandas-dev/pandas