ENH: should boolean indexing preserve input dtypes where possible? #2794

Closed

Labels

Enhancement

jreback

Contributor

Should pandas preserve the input dtype when doing
boolean indexing, if possible?

It's a pretty limited case: every value in a column has to be non-null,
and they all have to be integers (just cast as floats).

This is actually a little tricky to implement, as it has to be done column by column (and it's possible that blocks have to be split).

The question here is: will this not be what the user expects? (Currently all dtypes
on boolean output operations are cast to float64/object.)

In [24]: df = pd.DataFrame(dict(
  a = pd.Series([1]*3,dtype='int32'), 
  b = pd.Series([1]*3,dtype='float32')),
index=range(3))

In [25]: df.ix[2,1] = 0

In [26]: df
Out[26]: 
   a  b
0  1  1
1  1  1
2  1  0

In [27]: df.dtypes
Out[27]: 
a      int32
b    float32
Dtype: object

In [28]: df[df>0]
Out[28]: 
   a   b
0  1   1
1  1   1
2  1 NaN

##### column a could keep its original int input dtype whenever that is possible,
which in this case it is #####
In [29]: df[df>0].dtypes
Out[29]: 
a    float64
b    float32
Dtype: object

### possible output ###
Out[29]: 
a    int32
b    float32
Dtype: object
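The "cast back if possible" idea can be sketched for a single column with plain NumPy. This is only an illustration under stated assumptions: `try_cast_back` is a hypothetical helper name, not a pandas function, and it only handles a float column that started out with an integer dtype.

```python
import numpy as np

def try_cast_back(values, original_dtype):
    """Return `values` cast to `original_dtype` if that is lossless, else unchanged.

    Hypothetical helper for illustration; assumes `values` is a float array.
    """
    original_dtype = np.dtype(original_dtype)
    if values.dtype == original_dtype:
        return values
    if not np.issubdtype(original_dtype, np.integer):
        return values
    # NaN cannot be stored in an integer array, so the column must stay float.
    if np.isnan(values).any():
        return values
    # Values outside the integer range (including inf) cannot be cast safely.
    info = np.iinfo(original_dtype)
    if (values < info.min).any() or (values > info.max).any():
        return values
    candidate = values.astype(original_dtype)
    # Only accept the cast if every value survives the round trip.
    if np.array_equal(candidate, values):
        return candidate
    return values

# Column a: all values kept, so the int32 dtype can be restored.
col_a = try_cast_back(np.array([1.0, 1.0, 1.0]), np.int32)
# Column b-like case: one value was masked to NaN, so it stays float.
col_nan = try_cast_back(np.array([1.0, 1.0, np.nan]), np.int32)
print(col_a.dtype, col_nan.dtype)  # int32 float64
```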

cpcloud

Member

I get dtype conversion on DataFrame construction with dicts; what version of pandas are you using?

jreback

ContributorAuthor

This is a development version (either 0.10.2 or 0.11-dev). The above code won't work on 0.10.1 (well it 'works', but it converts dtypes). Dtypes are preserved in some limited cases in 0.10.1. This is what #2708 is all about.

I am asking this:

If you input an integer dtype and perform an operation that results in integer-like numbers, but we have to round-trip them through floats (mainly because the array can contain NaN), and the result for a particular column in a DataFrame CAN be converted to integer: SHOULD we convert it back to the original dtype?

This is not currently the case in 0.10.1 or lower

stephenwlin

Contributor

@jreback -- i'm a bit confused, because i've checked out your dtypes branch and see that you have implemented a parameter try_cast to DataFrame.where that seems to do what you're talking about. Are you just asking whether it should be turned on by default from getitem/setitem?

jreback

ContributorAuthor

That's exactly what I am asking. It's actually not fully implemented, because it can happen that an IntBlock needs to split into multiple dtypes (not hard, just didn't do it yet).

I turned it off because I had a few failing tests: basically the 'user' always expects a conversion to float64. Is it important to try to convert the dtype back to int, where possible?

stephenwlin

Contributor

i'm too new to this to be able to be an authority of how things should be, so don't take my opinion too seriously...but i'm personally inclined to think that try_cast should be off by default, because having it on means that the dtypes of the result depends on what values happen to match a boolean condition, which is a bit odd: it makes more sense to me that the type of the result of an operation should only depend on the types of its inputs, not the types and their particular values within those types.

i know this rule doesn't hold true for a lot of pandas behavior right now though, so maybe my concern isn't really apropos. (it probably also betrays my biases coming from statically typed languages)
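The value-dependence concern can be made concrete with a small NumPy sketch. `where_with_try_cast` is a hypothetical stand-in for a `where` with casting turned on, not the actual pandas code path; it shows the same int32 input producing two different result dtypes depending only on which values match the condition.

```python
import numpy as np

def where_with_try_cast(values, mask):
    """Mask out entries as NaN, then cast back to the input dtype if nothing was masked."""
    out = values.astype(np.float64)  # upcast so NaN can be stored
    out[~mask] = np.nan
    if not np.isnan(out).any():
        return out.astype(values.dtype)
    return out

ints = np.array([1, 2, 3], dtype=np.int32)
all_kept = where_with_try_cast(ints, ints > 0)      # every value matches
some_dropped = where_with_try_cast(ints, ints > 1)  # one value is masked out
print(all_kept.dtype, some_dropped.dtype)  # int32 float64
```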

cpcloud

Member

Personally, I've found that this doesn't matter for me, but it seems like it makes sense to keep the dtype from boolean indexing if possible.

stephenwlin

Contributor

@jreback: By the way, you mentioned the case of an IntBlock needing to be split into separate blocks based on casting post-pass. How are you thinking about doing this? If you implement it such that each block separately decides how to split itself into multiple blocks in the casting post-pass and does so independently, then I could imagine you might end up doing three copying passes on the data: once for the where op, another to split the resulting multi-column blocks into separate smaller blocks, and a final one to consolidate (in case you end up splitting an int into an int and a float, where there was already another float column).

Instead of that, you could implement where at the BlockManager level and have it do column-by-column upcasting (as necessary), masking, and consolidation together with the minimal amount of copying. To precompute the new blocks you need, you can implement a function on each block which returns the dtype necessary (or maybe the necessary Block subclass, not entirely sure of the best semantics) to hold its existing type and the other value. Then, you can allocate one uninitialized array for each new type necessary (consolidating ahead of time) and putmask into each one from self/other appropriately (optimizing for the case where it's an inplace operation and the result block is the same size and location as the original block, in which case you can reuse the existing array).

That might seem like overkill, but reducing the amount of copying would make a difference in the case of a large amount of data, so I'm willing to work on that if you think it's a good idea, unless you were planning on doing something similar yourself already.
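A rough sketch of that precompute-then-fill idea, using plain NumPy arrays in place of pandas blocks. Everything here is an assumption for illustration: `needed_dtype` and `where_consolidated` are hypothetical names, columns are held in a dict, and only numeric columns with a float `other` (such as NaN) are considered.

```python
from collections import defaultdict

import numpy as np

def needed_dtype(dtype, other):
    """Dtype able to hold both the column's existing values and `other`."""
    dtype = np.dtype(dtype)
    # An integer column cannot hold a float fill value such as NaN.
    if np.issubdtype(dtype, np.integer) and isinstance(other, float):
        return np.dtype(np.float64)
    return dtype

def where_consolidated(columns, masks, other=np.nan):
    """Return {dtype: (column names, 2-D array)} with one array per result dtype."""
    # Pass 1: decide each column's result dtype and group columns by it.
    groups = defaultdict(list)
    for name, values in columns.items():
        groups[needed_dtype(values.dtype, other)].append(name)
    # Pass 2: allocate one uninitialized array per dtype, then fill it row by row.
    blocks = {}
    for dtype, names in groups.items():
        length = len(columns[names[0]])
        block = np.empty((len(names), length), dtype=dtype)
        for i, name in enumerate(names):
            block[i] = columns[name]  # the only copy; upcasts if needed
            np.putmask(block[i], ~masks[name], other)  # in place on the row view
        blocks[dtype] = (names, block)
    return blocks

columns = {
    "a": np.array([1, 1, 1], dtype=np.int32),
    "b": np.array([1.0, 1.0, 0.0], dtype=np.float64),
    "c": np.array([1.0, 0.0, 1.0], dtype=np.float32),
}
masks = {name: values > 0 for name, values in columns.items()}
blocks = where_consolidated(columns, masks)
for dtype, (names, block) in blocks.items():
    print(dtype, names, block.shape)
```

In this sketch columns `a` and `b` land in the same float64 array directly, so no separate split or consolidation pass is needed afterwards.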

jreback

ContributorAuthor

I don't believe there are any copies in putmask, except with an int to float conversion, which does an astype (which copies). You can determine before u do this whether it will create multiple int and/or float blocks (this is in the block-level putmask btw), so I think it will still be just 1 copy. where has at least 1, and an int to float conversion could add a copy. Consolidation could add a copy as well (it's a vstack, which I think copies).

in the latest commit what I did was let these routines possibly return more than 1 block, so code is easy.
since the amount of copying is somewhat dependent on the other blocks and/or conversions I am not sure how much extra effort we should do here.

but u r right about doing things at the block manager level; u do have more information and so can create blocks that are already consolidated - I set it up with all of the key methods doing an 'apply' on their blocks and producing new ones. definitely could be optimized

certainly open to having u take a crack at it

this pr is pretty much done

stephenwlin

Contributor

yup, no copies in putmask, right. but definitely one in where, definitely one in astype when casting between different types, and most likely in vstack (it might be optimized to avoid it in the case that two arrays happen to already be aligned in memory appropriately, but i doubt it would be the case here)

anyway, ok, i'll work on it, based off your branch, unless you're planning on doing a big commit still.
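The copy behavior discussed above can be checked directly with `np.shares_memory`. This is a minimal NumPy-level check, not a measurement of the pandas block code itself.

```python
import numpy as np

ints = np.array([1, 1, 0], dtype=np.int32)

# astype to a different dtype allocates a new array.
floats = ints.astype(np.float64)
astype_copies = not np.shares_memory(floats, ints)

# np.where builds a new result array rather than modifying its input.
selected = np.where(floats > 0, floats, np.nan)
where_copies = not np.shares_memory(selected, floats)

# putmask writes into the existing array; no new array is returned.
result = np.putmask(floats, floats <= 0, np.nan)
putmask_in_place = result is None and bool(np.isnan(floats[2]))

print(astype_copies, where_copies, putmask_in_place)  # True True True
```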

stephenwlin

Contributor

(actually, just tested it, apparently vstack copies even if given inputs that were split from the same original array using vsplit, so definitely a copy here)
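A small check along the lines of the test described above, assuming plain NumPy arrays: the pieces returned by `vsplit` are views of the original, but stacking them back with `vstack` produces a new array.

```python
import numpy as np

original = np.arange(12, dtype=np.float64).reshape(4, 3)

# vsplit returns views into the original buffer.
top, bottom = np.vsplit(original, 2)
pieces_are_views = np.shares_memory(top, original) and np.shares_memory(bottom, original)

# vstack allocates a fresh array even though the pieces are adjacent in memory.
restacked = np.vstack([top, bottom])
vstack_copies = not np.shares_memory(restacked, original)

print(pieces_are_views, vstack_copies)  # True True
```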

jreback

ContributorAuthor

great!

btw in theory u can pass copy=False to astype and it creates a view with the new dtype (of course if u then putmask it will copy the underlying data)

and the approach of trying to create an already consolidated block should prob work well
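On the `astype` remark, it may be worth checking what `copy=False` actually does. As far as NumPy's documented behavior goes, `astype(..., copy=False)` only skips the copy when the requested dtype already matches; a conversion to a different dtype still allocates. Reinterpreting the same bytes under another dtype is what `ndarray.view` does, which is a different operation. A quick check, assuming a NumPy version whose `astype` accepts `copy`:

```python
import numpy as np

ints = np.array([1, 2, 3], dtype=np.int32)

# Same dtype requested: no copy is needed, the memory is shared.
same_dtype = ints.astype(np.int32, copy=False)
print(np.shares_memory(same_dtype, ints))  # True

# Different dtype requested: the values are converted into a new array.
other_dtype = ints.astype(np.float64, copy=False)
print(np.shares_memory(other_dtype, ints))  # False

# view() shares memory but reinterprets the raw bits, so the values change meaning.
reinterpreted = ints.view(np.float32)
print(np.shares_memory(reinterpreted, ints))  # True
```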

jreback

ContributorAuthor

u could start by creating a vbench (there might be some already for blocking, not sure)
and then see if u can improve it

stephenwlin

Contributor

yep, will do

jreback

ContributorAuthor

@stephenwlin

while you are at it, I am pretty sure this is related (and might now be fixed because of the putmask changes....)

#2746

stephenwlin

Contributor

will do

Activity

cpcloud commented on Feb 4, 2013

jreback commented on Feb 4, 2013

stephenwlin commented on Feb 4, 2013

jreback commented on Feb 4, 2013

stephenwlin commented on Feb 4, 2013

cpcloud commented on Feb 4, 2013

stephenwlin commented on Feb 4, 2013

jreback commented on Feb 4, 2013

stephenwlin commented on Feb 4, 2013

stephenwlin commented on Feb 4, 2013

jreback commented on Feb 4, 2013

jreback commented on Feb 4, 2013

stephenwlin commented on Feb 4, 2013

jreback commented on Feb 4, 2013

stephenwlin commented on Feb 4, 2013
