Closed
Description
Should pandas preserve the input dtype on when doing
boolean indexing, if possible?
Its a pretty limited case, in that you have to have all values in a column non-null
and they all have to be integers (just casted as floats)
this is actually a little tricky to implement as this has to be done column by column (and its possibly that blocks have to be split)
the question here is will this be not what the user expects? (currently all dtypes
on boolean output operations are cast to float64/object)
In [24]: df = pd.DataFrame(dict(
a = pd.Series([1]*3,dtype='int32'),
b = pd.Series([1]*3,dtype='float32')),
index=range(3))
In [25]: df.ix[2,1] = 0
In [26]: df
Out[26]:
a b
0 1 1
1 1 1
2 1 0
In [27]: df.dtypes
Out[27]:
a int32
b float32
Dtype: object
In [28]: df[df>0]
Out[28]:
a b
0 1 1
1 1 1
2 1 NaN
##### if it is possible to preserve column a as the original int input dtype
which in this case it is #####
In [29]: df[df>0].dtypes
Out[29]:
a float64
b float32
Dtype: object
### possible output ###
Out[29]:
a int32
b float32
Dtype: object
Activity
cpcloud commentedon Feb 4, 2013
I get dtype conversion on DataFrame construction with dicts; what version of pandas are you using?
jreback commentedon Feb 4, 2013
This is a development version (either 0.10.2 or 0.11-dev). The above code won't work on 0.10.1 (well it 'works', but it converts dtypes). Dtypes are preserved in some limited cases in 0.10.1. This is what #2708 is all about.
I am asking this:
if you input an integer dtype and you perform an operation that that results in an integerlike number, but
we since we have to round trip it thru floats (mainly because it can have
nan
in the array), BUT that the result on a particular column in a DataFrame CAN be converted to integer - SHOULD we convert it back to the original dtype?This is not currently the case in 0.10.1 or lower
stephenwlin commentedon Feb 4, 2013
@jreback -- i'm a bit confused, because i've checked out your dtypes branch and see that you have implemented a parameter try_cast to DataFrame.where that seems to do what you're talking about. Are you just asking whether it should be turned on by default from getitem/setitem?
jreback commentedon Feb 4, 2013
thats exactly what I am asking. Its actually not fully implemented, because it can happen that an IntBlock needs to split to multiple dtypes (not hard, just didn't do it yet).
I turned it off because I had a few failing tests - basically the 'user' is expected always to convert to float64. It is important to try to make the dtype back to int, where possible?
stephenwlin commentedon Feb 4, 2013
i'm too new to this to be able to be an authority of how things should be, so don't take my opinion too seriously...but i'm personally inclined to think that try_cast should be off by default, because having it on means that the dtypes of the result depends on what values happen to match a boolean condition, which is a bit odd: it makes more sense to me that the type of the result of an operation should only depend on the types of its inputs, not the types and their particular values within those types.
i know this rule doesn't hold true for a lot of pandas behavior right now though, so maybe my concern isn't really apropos. (it probably also betrays my biases coming from statically typed languages)
cpcloud commentedon Feb 4, 2013
Personally, I've found that this doesn't matter for me, but it seems like it makes sense to keep the dtype from boolean indexing if possible.
stephenwlin commentedon Feb 4, 2013
@jreback: By the way, you mentioned the case of an
IntBlock
needing to be split into separate blocks based on casting post-pass. How are you thinking about doing this? If you implement it such that each block separately decides how to split itself into multiple blocks in the casting post-pass and does so independently, the I could imagine you might end up doing three copying passes on the data: once for thewhere
op, another to split the resulting multi-column blocks into separate smaller blocks, and a final one to consolidate (in case you end up splitting an int into an int and a float, where there was already another float column)Instead of that, you could implement
where
at theBlockManager
level and having it do column-by-column upcasting (as necessary), masking, and consolidation together with the minimal amount of copying. To precompute the new blocks you need, you can implement a function on each block which returns the dtype necessary (or maybe the necessaryBlock
subclass, not entirely sure the best semantics) to hold its existing type and the other value. Then, you can allocate one uninitialized array for each new type necessary (consolidating ahead of time) andputmask
into each one from self/other appropriately (optimizing for the case where it's an inplace operation and the result block is the same size and location as the original block, in which case you can reuse the existing array).That might seem like overkill, but reducing the amount of copying would make a difference in the case of a large amount of data, so I'm willing to work on that if you think it's a good idea, unless you were planning on doing something similar yourself already.
jreback commentedon Feb 4, 2013
I don't believe there are any copies in putmask, except with a int to float conversion which does an astype (which copies). you can determine before u do this whether it will create multiple int and /or float blocks (this is in the block level putmask btw), so I think will still just be 1 copy. where has at least 1, and an int to float conversion could add a copy. consolidation could add a copy as well (it's a vstack which I think copies)
in the latest commit what I did was let these routines possibly return more than 1 block, so code is easy.
since the amount of copying is somewhat dependent on the other blocks and/or conversions I am not sure how much extra effort we should do here.
but u r right about doing things at the block manager level; u do have more information and so can create blocks that are already consilidated - I set it up with all of the key methods doing an 'apply' on their blocks and producing new ones. definitely could be optimized
certainly open to having u take a crack at it
this pr is pretty much done
stephenwlin commentedon Feb 4, 2013
yup, no copies in
putmask
, right. but definitely one inwhere
, definitely one inastype
when casting between different types, and most likely invstack
(it might be optimized to avoid it in the case that two arrays happen to already be aligned in memory appropriately, but i doubt it would be the case here)anyway, ok, i'll work on it and based off your branch, unless you're planning on doing a big commit still.
stephenwlin commentedon Feb 4, 2013
(actually, just tested it, apparently
vstack
copies even if given inputs that were split from the same original array usingvsplit
, so definitely a copy here)jreback commentedon Feb 4, 2013
great!
btw in theory u can pass copy = False to astype and its creates a view with the new dtype (of course if u then putmask it will copy the underlying data)
and the approach of trying to create an already consolidated block should prob work well
jreback commentedon Feb 4, 2013
u could start by creating a vbench (there might be some already for blocking, not sure)
and then see if u can improve it
stephenwlin commentedon Feb 4, 2013
yep, will do
jreback commentedon Feb 4, 2013
@stephenwlin
while you are at it, I am pretty sure this is related (and might now be fixed because of the putmask changes....)
#2746
stephenwlin commentedon Feb 4, 2013
will do
11 remaining items