Closed
Description
I have a dataframe that contains a datetime64 column that seems to be losing some digits after a groupby.first(). I know that under the hood these are stored at nanosecond precision, but I need them to remain equal, since I'm merging back onto the original frame later on. Is this expected behavior?
import pandas as pd
from pandas import Timestamp
data = [
    {'a': 1,
     'dateCreated': Timestamp('2011-01-20 12:50:28.593448')},
    {'a': 1,
     'dateCreated': Timestamp('2011-01-15 12:50:28.502376')},
    {'a': 1,
     'dateCreated': Timestamp('2011-01-15 12:50:28.472790')},
    {'a': 1,
     'dateCreated': Timestamp('2011-01-15 12:50:28.445286')}]
df = pd.DataFrame(data)
Output is:
In [6]: df
Out[6]:
   a                dateCreated
0  1 2011-01-20 12:50:28.593448
1  1 2011-01-15 12:50:28.502376
2  1 2011-01-15 12:50:28.472790
3  1 2011-01-15 12:50:28.445286
In [7]: df.groupby('a').first()
Out[7]:
                    dateCreated
a
1 2011-01-20 12:50:28.593447936
When I compare the datetime64 in the first row to the datetime64 after the groupby.first(), the two are not equal.
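A drift of this size is what a round trip through float64 would produce. The following is a minimal sketch using only the standard library, under the assumption that the nanosecond count passes through a float64 somewhere inside the groupby; the variable names are illustrative and nothing here is taken from the pandas internals.

```python
from datetime import datetime

# Nanoseconds since the Unix epoch for the first timestamp in the report.
delta = datetime(2011, 1, 20, 12, 50, 28, 593448) - datetime(1970, 1, 1)
ns = (delta.days * 86400 + delta.seconds) * 10**9 + delta.microseconds * 1000

# float64 has a 53-bit significand; near 1.3e18 adjacent floats are 256 apart,
# so most nanosecond counts of this magnitude cannot be stored exactly.
roundtrip = int(float(ns))

print(ns % 10**9)         # 593448000 (sub-second part before the round trip)
print(roundtrip % 10**9)  # 593447936 (sub-second part after the round trip)
print(ns - roundtrip)     # 64 (drift in nanoseconds)
```

The sub-second part after the round trip matches the `.593447936` shown in the groupby output above.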
Activity
shoyer commented on Jan 21, 2015
This looks like a bug to me.
I can reproduce this on master.
chrisbyboston commented on Jan 21, 2015
I have isolated it to this line.
https://github.com/pydata/pandas/blob/master/pandas/core/internals.py#L1818
I'm going to see if I can determine if this is a numpy bug or if we just need to pass some additional parameters to numpy.ndarray.astype.
chrisbyboston commented on Jan 21, 2015
Interesting...
numpy casting defaults to "unsafe" for backward compatibility reasons. When I pass in "safe":
So numpy is even aware that this can't be done safely.
chrisbyboston commented on Jan 21, 2015
If I leave casting="safe", then all is well, as it raises an error and fails over safely:
Result:
I'll work up a PR and a test... and see if I break all the things.
shoyer commented on Jan 21, 2015
The real question is why this is being cast to float64 in the first place. We might need to duplicate an internal cython routine so that there is an int64 version that we can safely use without needing any casting.
chrisbyboston commented on Jan 21, 2015
@shoyer I'll take a peek at when that gets converted to float64. I didn't catch that at all, but you're right. It should be able to safely convert an int64.
chrisbyboston commented on Jan 21, 2015
It runs an ensure_float on this line: https://github.com/pydata/pandas/blob/master/pandas/core/groupby.py#L1508
chrisbyboston commented on Jan 21, 2015
Alright, looks like the cython in generated.pyx needs to have functions for group_nth_ for int8, int16, int32, and int64.
Sound right to you @shoyer?
shoyer commented on Jan 21, 2015
@iwschris That file is generated from pandas/src/generate_code.py, and it looks like integers were explicitly excluded, probably because someone did not think they were necessary for groupby functions: https://github.com/pydata/pandas/blob/master/pandas/src/generate_code.py#L2407
chrisbyboston commented on Jan 21, 2015
Cool. I have a branch started that should fix this, and I'll enable integers for now, and see how things go from there.
jreback commented on Jan 22, 2015
@iwschris this would be a nice fix
chrisbyboston commented on Jan 22, 2015
First Attempt: Don't convert Integers To Floats in Cython
So, this is tightly wound around the idea that there is no integer nan. Once I enabled integers in generate_code.py, the resulting generated.pyx simply converted integers to float64, which is precisely the problem we're trying to avoid here.
I modified the cython templates so that integers could remain integers, but obviously the problem of how to represent integer nan still remained. Just to see if it solved the problem at hand, on the integer groupby templates, I used 0 for nan, and the immediate problem was solved. Using 0 for nan is not an acceptable solution though, and I broke about 20 tests, where the integer to float64 conversion has already been assumed.
Suggested Solution: Coerce safely with a fall-back
This prioritizes data safety over performance, but only in a very narrow band where we know that there is an issue. I'm proposing that we change this line in core/internals.py to this:
All the fallback code already exists, and the change breaks no tests.
Thoughts?
shoyer commented on Jan 22, 2015
@iwschris What does the fallback code look like? I'm guessing that skips Cython entirely?
I would rather be correct than fast, so that would probably be OK, but we should know what the performance impact is first. Are there any major regressions in the vbench performance suite? (see here for instructions)
Our usual approach to handling missing values for datetime64 is to check for values equal to iNaT. For example, take a look at the group_count functions in generate_code.py. This would be the preferred solution.
chrisbyboston commented on Jan 22, 2015
@shoyer that's right, it skips Cython.
I'll do the vbench stuff and post the results here.
As far as iNaT goes, the issue here is that by the time we get into those Cython functions the datetime64 has already been converted to an Integer.
shoyer commented on Jan 22, 2015
@iwschris it's actually probably fine to do iNaT comparisons even for non-datetime types. You'll notice some of the existing code does exactly that. iNaT is the smallest int64 number:
chrisbyboston commented on Jan 22, 2015
Excellent. I'll check it out then. Thanks!
chrisbyboston commented on Jan 22, 2015
Hah... the relevant part is here:
...point taken. I'll keep working on the Cython :)
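The sentinel idea discussed above can be sketched in pure Python. This is not the Cython that ended up in pandas; `INAT` and `group_first_int64` are hypothetical names, and the only assumption carried over from the thread is that the missing-value marker is the smallest int64.

```python
INAT = -2**63  # smallest int64; assumed to match the iNaT sentinel described above

def group_first_int64(values, labels, ngroups):
    """Return the first non-missing value per group, staying in integers.

    values  -- ints in int64 range, with INAT marking missing entries
    labels  -- group index for each value (0 .. ngroups - 1)
    """
    out = [INAT] * ngroups
    for val, lab in zip(values, labels):
        # Fill a slot only once, and never with the missing marker.
        if val != INAT and out[lab] == INAT:
            out[lab] = val
    return out

# Illustrative nanosecond count; it comes back unchanged because no float is involved.
result = group_first_int64([INAT, 1295527828593448000, 5, INAT], [0, 0, 1, 2], 3)
print(result[0] == 1295527828593448000)  # True
print(result[2] == INAT)                 # True (group 2 has no valid value)
```

Because the missing marker is itself an int64, the result never needs a float NaN, which is what forced the conversion in the first place.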
chrisbyboston commented on Jan 23, 2015
Cython is fixed, all tests passing, vbench is looking good as well. I'll have the PR in today.
chrisbyboston commented on Jan 23, 2015
Here it is: #9345