Description
To summarize: pandas should raise when there is a loss of precision when casting from ints to floats for computations (everywhere!)
The wrong result is returned in the following case because of unnecessary casting of ints to floats.
```python
from datetime import datetime
import numpy as np
from pandas import DataFrame

max_int = np.iinfo(np.int64).max
min_int = np.iinfo(np.int64).min
assert max_int + min_int == -1
assert DataFrame([max_int, min_int],
                 index=[datetime(2013, 1, 1), datetime(2013, 1, 1)]
                 ).resample('M', how=np.sum)[0][0] == -1
```
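For context (an added illustration, not part of the original report), the float path zeroes out the sum because both int64 endpoints round to ±2**63 as float64:

```python
import numpy as np

max_int = np.iinfo(np.int64).max   # 2**63 - 1
min_int = np.iinfo(np.int64).min   # -2**63
# Both endpoints round to +/-2**63 as float64, so the float sum is 0.0 ...
print(np.float64(max_int) + np.float64(min_int))   # 0.0
# ... while the exact integer sum is -1.
print(max_int + min_int)                           # -1
```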
This is the offending code in `groupby.py` that does the casting:
```python
class NDFrameGroupBy(GroupBy):
    def _cython_agg_blocks(self, how, numeric_only=True):
        ...
        if is_numeric:
            values = com.ensure_float(values)
```
and, even worse, it sneakily casts the floats back to integers, hiding the problem!
```python
# see if we can cast the block back to the original dtype
result = block._try_cast_result(result)
```
It should be possible to perform this calculation without any casting. For example, the `sum()` of a DataFrame returns the correct result:
```python
assert DataFrame([max_int, min_int]).sum()[0] == -1
```
I am working with financial data that needs to stay as int to protect its precision. In my opinion, any operation that casts values to floats to do a calculation, but then returns the results as ints, is very wrong. At the very least, the results should be returned as floats to show that a cast has been done. Even better would be to perform the calculations with ints whenever possible.
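An added illustration of the underlying limit (not part of the original post): float64 has a 53-bit significand, so integers beyond 2**53 do not all survive a round trip through float:

```python
big = 2**53 + 1
assert float(big) == 2**53       # rounds to the nearest representable float64
assert int(float(big)) != big    # the round trip loses the low bit
```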
jreback commented on May 29, 2013
Since numpy cannot support a native `NaN` for integers, this must be cast to floats (it IS possible to do this in some cases as integers, e.g. when you KNOW that nans will not result, but this makes the operation take extra time). As for casting back, that is consistent with all other pandas operations. You are free to start with a dtype of `float64`. Working with extreme-range integers is, IMO, not useful, and what you are seeing is machine roundoff error.

cpcloud commented on May 29, 2013
@jreback how annoying would it be to roll up a `NaI` like `NaT`? there have probably been discussions about this, but then we'd have a missing value for the types that pandas supports (ints, floats, objects (can still use regular nan), and dates/times)

jreback commented on May 29, 2013
@cpcloud you could use the `NaT` value as a not-available for ints, BUT this runs into issues on 32-bit (where you effectively have to use the maxint on 32-bit). That's why we ONLY use `datetime64[ns]` (as it's always 64-bit). Another option is to do masked-type stuff. None of these are good solutions, unfortunately.
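As an added aside (not from the thread): `NaT` works as a sentinel because it occupies the minimum int64 value, which is why the trick only extends cleanly to 64-bit types:

```python
import numpy as np

# NaT is stored as the minimum int64, i.e. the sentinel "steals" one value
# from the type's range -- feasible for 64-bit types, awkward on 32-bit.
print(np.datetime64('NaT').astype('int64'))   # -9223372036854775808
print(np.iinfo(np.int64).min)                 # -9223372036854775808
```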
cpcloud commented on May 29, 2013
yeah i hate masked arrays... :)
gerdemb commented on May 29, 2013
If I understand you correctly, rather than trying to determine whether an operation would create `NaN` values, pandas simply casts ints to floats in any case where this might happen, because this performs better. Is that correct? In my trivial examples it's obvious that there is no missing data, but I could see how that may not always be obvious.

I still think that in the case where an int is cast to a float to do an operation, pandas should return floats. It's very non-obvious that an operation that accepts ints and returns ints is actually doing floating-point calculations in the background!

My example uses big integers to demonstrate the problem, but it could occur with smaller numbers too. `int(float(i)) == i` is not true for all integer values of `i`. In monetary applications this is important.

Finally, changing the topic slightly, the behavior I'm actually looking for is to replace `NaN` with `0` when working with ints; for example, something like the sketch below. I can't see how this is easily possible, but I'm not an expert here.
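The example code from this comment was lost in extraction; a hedged sketch of the behavior being described, using the 0.11-era `how=` resample API:

```python
from datetime import datetime
from pandas import DataFrame

# January and March have data, February does not: resample leaves NaN for
# the empty period, which also forces the column to float dtype.
df = DataFrame([1, 2], index=[datetime(2013, 1, 1), datetime(2013, 3, 1)])
monthly = df.resample('M', how='sum')
monthly = monthly.fillna(0)   # 0 instead of NaN -- but the dtype stays float
```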
jreback commented on May 29, 2013
@gerdemb
checking for NaNs costs time, and in any event these types of routines are implemented only as floats (they could in theory be done as ints, but this ONLY works if there are NO NaNs at all, which when doing resampling is very rare if you think about it; most of the time you will end up generating NaN for missing periods)

pandas tries to return the input type, it's just more intuitive that way (before 0.11 pandas used to return floats in almost all cases, THAT was a bit of an issue). Now that pandas supports many dtypes it tries to return the correct one; casting is done ONLY when there is no loss of result. (do your above operation with floats, it returns the same result, 0)

As for your 2nd question: that will work; I don't think pandas supports a `fill=` in resample; in theory it could, I suppose

jreback commented on May 29, 2013
Here's another way of doing this w/o the casting
http://stackoverflow.com/questions/16807836/how-to-resample-a-timeseries-in-pandas-with-a-fill-value
gerdemb commented on May 29, 2013
@jreback
I'm not sure what's wrong with returning floats when floats have been used to do the calculations, and I personally think it is much clearer than silently casting them back into ints with the subsequent loss of precision, but it sounds like there may be other issues that I am not aware of.

Anyway, it seems that when using ints there are two possible scenarios: (1) the calculation is done entirely with ints without casting, or (2) the ints are cast to floats for the calculation and then cast back to ints when the results are returned. When calling a method, how can I tell which one will happen?

It seems like this should at least be documented, or, even better, a signal could be emitted when precision is lost by casting (perhaps by using a mechanism like decimal.Context, which allows the user to decide which conditions they wish to trap).
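For reference (an added illustration, not part of the original comment), this is the kind of opt-in trapping the decimal module offers:

```python
import decimal

# The user chooses which conditions raise: with Inexact trapped, any result
# that loses precision raises instead of passing silently.
ctx = decimal.Context(traps=[decimal.Inexact])
ctx.divide(decimal.Decimal(1), decimal.Decimal(4))   # exact: fine (0.25)
ctx.divide(decimal.Decimal(1), decimal.Decimal(3))   # raises decimal.Inexact
```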
Thank you for your answers to my second question. The stackoverflow question you linked was actually my post! Using a custom aggregation function as suggested in the answer does work correctly, but unfortunately the performance is about 30 times worse than `how='sum'`. I suppose this is because of the overhead of calling a Python function instead of a C function?

Finally, I just want to say thanks for developing and supporting pandas. Despite my complaints, I've found it to be an incredibly useful tool in my work.
jreback commented on May 29, 2013
@gerdemb
yes, unfortunately using a lambda is much slower; that said, it might be worth it to build this type of filling in (e.g. provide a `fill_value=` to resample), but it's a bit non-trivial

using `ints` in general is problematic because of the lack of a native `nan`, as I have mentioned

however, pandas does try to maintain dtype where possible; that is why the cast back happens. And as I said there is NO loss of precision (IF numpy allows it, which it does in this case), e.g.
gerdemb commented on May 29, 2013
@jreback
It's not true that there is no loss of precision when casting from int to float. The code in my first post demonstrates the problem and how it can affect the results of a resample operation. Another trivial example:
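(The snippet here was lost in extraction; a plausible reconstruction, consistent with the description of `a` below:)

```python
# An int64 value whose bits exceed float64's 53-bit significand:
a = 2**62 + 1
assert int(float(a)) != a   # float(a) rounds to 2**62, losing the low bit
```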
The value of `a` has more precision than is possible to store in a float64. Your example uses a value of 1, which is small enough to store exactly as a float64. Additionally, I believe it is casting the int64 to a float64 and then comparing the two floats, which doesn't really prove anything.

cpcloud commented on May 29, 2013
In general, sure (but by definition this is true).
@jreback pandas does lose precision with these values. For example,
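(Again the code block was lost; a hedged reconstruction of the point being made:)

```python
import numpy as np

# Well inside int64's range (no overflow), yet the low bit is lost when
# the value is pushed through float64.
x = np.int64(2**62 - 1)
assert np.int64(np.float64(x)) != x
```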
It (`numpy`) should be able to handle up to `2 ** 62` without overflow...

jreback commented on May 29, 2013
i think u are misunderstanding what pandas is actually doing. the conversion from float back to int occurs ONLY if numpy compares the values as equal (and there are no NaNs and the original dtype was int to begin with)

this is considered safe casting

your above example fails as the numbers don't compare equal in the first place

however, if there is a loss of precision, e.g. your example, then I think the conversion is 'wrong' in numpy and so pandas is wrong as well

but that is a very narrow case, and I don't think it's guaranteed to work near the end points of values in any event

that said, if we could figure out a reasonable way to determine if there possibly would be a loss of precision, then we could raise an error (and let the user decide what to do), as u suggested above

here's a numpy-related discussion:
numpy/numpy#3118

I think we could do the test before casting, essentially this:
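(The sketch that followed was lost; a hedged reconstruction of the pre-cast test being proposed:)

```python
import numpy as np

def cast_is_lossless(values):
    # Only treat int -> float casting as safe if every value survives
    # the round trip back to the original integer dtype.
    as_float = values.astype('float64')
    return bool((as_float.astype(values.dtype) == values).all())

print(cast_is_lossless(np.array([1, 2, 3], dtype='int64')))        # True
print(cast_is_lossless(np.array([2**62 - 1, 1], dtype='int64')))   # False
```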
I don't believe this is expensive to do, either; the only issue is we'd probably have to do this in lots of places :)

The cast back already does this test, so that's not a problem (or it will leave it as float)
gerdemb commented on May 30, 2013
@jreback
There are TWO casts being done in the resample example that I originally posted. The first is a cast from int to float; the second is a cast back to int from float. It doesn't matter that the second cast from float to int occurs only if numpy compares the values as equal, because we've already lost the precision in the FIRST cast from int to float. The code @cpcloud posted shows another example of the same problem. You said "your above example fails as the numbers don't compare equal in the first place", which is exactly the point! We CANNOT guarantee that a cast from int to float will not lose precision! (More precisely, if we are working with 64-bit types, I believe that if the number of bits in the integer exceeds 53, then it cannot be represented exactly as a float, as that would exceed the float's significand precision.)

Anyway, I propose the following possible solutions:

Personally, I am strongly in favor of 3, but I realize there may be other opinions. My reasons are:

`NaN`s can be inserted.

In short, I believe this solution has the least confusing behavior, performs the fastest, and requires the least amount of code.

Thoughts?
jreback commented on May 30, 2013
nice summary. The issue is that the FIRST cast is not 'safe' (as I defined above), while the SECOND is. As far as your solutions:

I know you like option 3 (cast most things to float and leave), but that would leave almost everything as `float` dtype. In fact, see this: http://pandas.pydata.org/pandas-docs/dev/whatsnew.html#dtype-gotchas; there ARE times when we can preserve the dtype even though the implementation actually 'changes' it. The `astype` check is pretty cheap (however, a-priori checking for `nan` to determine if we need to cast is not a good option).

So we are left with a gotchas doc, or providing a warning (or exception). I think providing a `LossOfPrecisionWarning` (a sub-class of `UserWarning`) might be a good solution, as well as documenting; may not get to this for 0.12.

Would you do a PR for a short section in http://pandas.pydata.org/pandas-docs/dev/gotchas.html for 0.11.1?
gerdemb commented on May 30, 2013
@jreback
I've been using pandas for quite a while, but this is my first time venturing into the "development" side of things, and I'm not sure how changes are introduced to the code. I'd be curious to hear the opinions of other developers or users on this issue.
Finally, I realize my use-case is probably unusual and that is why no one has reported this problem before. I'm going to redesign my application to use Decimal objects instead of integers which will be a big performance hit, but I don't want to risk finding any more "gotchas" like this.
jreback commented on Jan 9, 2017
closing in favor of #11199
gfyoung commented on Jun 12, 2017
Resurrected in #16674.