Skip to content

Quick dataframe shift #5609

@halleygithub

Description

@halleygithub

Quick implementation of dataframe shift

def quick_shift(df, N=1):
    '''Quick implementation of dataframe shift'''

    shift_value = np.roll(df.values,N,axis=0)
    shift_value[0:N] = np.NaN
    return DataFrame(shift_value, index=df.index, columns=df.columns)
DataFrame.quick_shift = quick_shift

Perf benchmark:

df = DataFrame(np.random.rand(10000,500))

%timeit df1 = df.quick_shift(1)
10 loops, best of 3: 55.3 ms per loop

%timeit df2 = df.shift(1)
1 loops, best of 3: 238 ms per loop

Related to #4095

Activity

jreback

jreback commented on Nov 28, 2013

@jreback
Contributor

this would need to be done in core/internals.py

needs to handle non trivial shifts (eg something that is not so straightforward like this) and multi dtypes

jreback

jreback commented on Jan 25, 2014

@jreback
Contributor

@halleygithub putting on the plate for 0.14. Care to do a PR for this?

halleygithub

halleygithub commented on Jan 28, 2014

@halleygithub
Author

@jreback , as I am not familiar with github, so I am not sure what can I do . But you are free to handle the code as you want. :)

jreback

jreback commented on Jan 28, 2014

@jreback
Contributor

it's a nice excuse to get familiar

see here: http://pandas.pydata.org/developers.html

gouthambs

gouthambs commented on Feb 15, 2014

@gouthambs
Contributor

@jreback I was wondering if I can take a stab at this?

jreback

jreback commented on Feb 15, 2014

@jreback
Contributor

of course

lmk if u need help

gouthambs

gouthambs commented on Feb 15, 2014

@gouthambs
Contributor

@jreback I see shift function in core/internals.py in the Block class and in core/generic.py in the NDFrame class. Can you point me to a place where I can understand some basic code flow? Or can you give me some quick pointers?

jreback

jreback commented on Feb 15, 2014

@jreback
Contributor

core/generic /shift sets up what to do eg how many periods to shift and such and translates it to move the data up 3 or whatever; this then calls the internals shift

core/internals/ Block/shift is where to make this change

simply swap out the implantation for the new
make sure tests pass
write a vbench for this (pandas/vb_suite) - find a file with similar methods and stick it their
u can even use the above benchmarks

gouthambs

gouthambs commented on Feb 15, 2014

@gouthambs
Contributor

@jreback Thanks. I will do that.

jreback

jreback commented on Feb 16, 2014

@jreback
Contributor

FYI, I just fixed a bug in shift, see here: #6373

The soln proposed above is still valid in any case....make a change in core/internals/Block/shift

gouthambs

gouthambs commented on Feb 17, 2014

@gouthambs
Contributor

Sure. I will take your changes and merge with mine.

gouthambs

gouthambs commented on Feb 18, 2014

@gouthambs
Contributor

@jreback I swapped the current shift with a roll, in order to see if I get a speed up. I tried swapping in core/internals.py. But this did not get any performance improvements. On the other hand when I use a roll in core/generic.py (just as a test), I see good speedup. The two implementations are shown below:

diff --git a/pandas/core/generic.py b/pandas/core/generic.py
index d607be6..753b146 100644
--- a/pandas/core/generic.py
+++ b/pandas/core/generic.py
@@ -3169,9 +3169,10 @@ class NDFrame(PandasObject):
             return self

         if freq is None and not len(kwds):
-            block_axis = self._get_block_manager_axis(axis)
-            indexer = com._shift_indexer(len(self._get_axis(axis)), periods)
-            new_data = self._data.shift(indexer=indexer, periods=periods, axis=block_axis)
+            #block_axis = self._get_block_manager_axis(axis)
+            #indexer = com._shift_indexer(len(self._get_axis(axis)), periods)
+            #new_data = self._data.shift(indexer=indexer, periods=periods, axis=block_axis)
+            new_data    = np.roll(self.values,periods,axis=axis)
         else:
             return self.tshift(periods, freq, **kwds)

diff --git a/pandas/core/internals.py b/pandas/core/internals.py
index c89aac0..e56697b 100644
--- a/pandas/core/internals.py
+++ b/pandas/core/internals.py
@@ -931,7 +931,8 @@ class Block(PandasObject):
     def shift(self, indexer, periods, axis=0):
         """ shift the block by periods, possibly upcast """

-        new_values = self.values.take(indexer, axis=axis)
+        #new_values = self.values.take(indexer, axis=axis)
+        new_values = np.roll(self.values,periods, axis=axis)
         # convert integer to float if necessary. need to do a lot more than
         # that, handle boolean etc also
         new_values, fill_value = com._maybe_upcast(new_values)

Do you have any ideas why np.roll speeds up the NDFrame but not the Block class?

The timings for the above mentioned example are as follows:

  • Original Implementation: 242ms
  • Quick Shift : 27 ms
  • Roll in core/internals : 239 ms
  • Roll in core/generic:27 ms
jreback

jreback commented on Feb 18, 2014

@jreback
Contributor

you can't do it in generic.py because that only handles a single dtype. The orientation needs to be changed as the blocks store values 'flipped' from what you think they are. You need to adjust the roll (which IIRC is just an axis swap) to account for this.

In internals you are effectively using axis 1 for the shift (on a row shift), that's what the block_axis is.
transpose the values before roll, then transpose again after the setting

jreback

jreback commented on Feb 18, 2014

@jreback
Contributor

[goat-jreback-~/pandas] git diff

diff --git a/pandas/core/internals.py b/pandas/core/internals.py
index e68fc8d..d33386e 100644
--- a/pandas/core/internals.py
+++ b/pandas/core/internals.py
@@ -946,11 +946,10 @@ class Block(PandasObject):
     def shift(self, indexer, periods, axis=0):
         """ shift the block by periods, possibly upcast """

-        new_values = self.values.take(indexer, axis=axis)
+        new_values, fill_value = com._maybe_upcast(self.values)
         # convert integer to float if necessary. need to do a lot more than
         # that, handle boolean etc also
-        new_values, fill_value = com._maybe_upcast(new_values)
-
+        new_values = np.roll(self.values.T,periods, axis=axis)
         axis_indexer = [ slice(None) ] * self.ndim
         if periods > 0:
             axis_indexer[axis] = slice(None,periods)
@@ -959,7 +958,7 @@ class Block(PandasObject):
             axis_indexer[axis] = slice(periods,None)
         new_values[tuple(axis_indexer)] = fill_value

-        return [make_block(new_values, self.items, self.ref_items,
+        return [make_block(new_values.T, self.items, self.ref_items,
                            ndim=self.ndim, fastpath=True)]

     def eval(self, func, other, raise_on_error=True, try_cast=False):
jreback

jreback commented on Feb 18, 2014

@jreback
Contributor

you can also remove _shift_indexer from core/common.py as I think that's the only thing that uses it

6 remaining items

Loading
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Metadata

Metadata

Assignees

No one assigned

    Labels

    PerformanceMemory or execution speed performanceReshapingConcat, Merge/Join, Stack/Unstack, Explode

    Type

    No type

    Projects

    No projects

    Relationships

    None yet

      Development

      Participants

      @jreback@halleygithub@gouthambs

      Issue actions

        Quick dataframe shift · Issue #5609 · pandas-dev/pandas