Skip to content

Implement high performance rolling_rank #9481

Closed
@PH82

Description

@PH82

xref SO issue here

Im looking to set the rolling rank on a dataframe. Having posted, discussed and analysed the code it looks like the suggested way would be to use the pandas Series.rank function as an argument in rolling_apply. However on large datasets the performance is particularly poor. I have tried different implementations and using bottlenecks rank method orders of magnitude faster, but that only offers the average option for ties. It is also still some way off the performance of rolling_mean. I have previously implemented a rolling rank function which monitors changes on a moving window (in a similar way to algos.roll_mean I believe) rather that recalculating the rank from scratch on each window. Below is an example to highlight the performance, it should be possible to implement a rolling rank with comparable performance to rolling_mean.

python: 2.7.3
pandas: 0.15.2
scipy: 0.10.1
bottleneck: 0.7.0

rollWindow = 240
df = pd.DataFrame(np.random.randn(100000,4), columns=list('ABCD'), index=pd.date_range('1/1/2000', periods=100000, freq='1H'))
df.iloc[-3:-1]['A'] = 7.5
df.iloc[-1]['A'] = 5.5

df["SER_RK"] = pd.rolling_apply(df["A"], rollWindow, rollingRankOnSeries)
 #28.9secs (allows competition/min ranking for ties)

df["SCIPY_RK"] = pd.rolling_apply(df["A"], rollWindow, rollingRankSciPy)
 #70.89secs (allows competition/min ranking for ties)

df["BNECK_RK"] = pd.rolling_apply(df["A"], rollWindow, rollingRankBottleneck)
 #3.64secs (only provides average ranking for ties)

df["ASRT_RK"] = pd.rolling_apply(df["A"], rollWindow, rollingRankArgSort)
 #3.56secs (only provides competition/min ranking for ties, not necessarily correct result)

df["MEAN"] = pd.rolling_mean(df['A'], window=rollWindow)
 #0.008secs

def rollingRankOnSeries (array):
    s = pd.Series(array)
    return s.rank(method='min', ascending=False)[len(s)-1]

def rollingRankSciPy (array):
     return array.size + 1 - sc.rankdata(array)[-1]

def rollingRankBottleneck (array):
    return array.size + 1 - bd.rankdata(array)[-1]

def rollingRankArgSort (array):
    return array.size - array.argsort().argsort()[-1]
```python

I think this is likely to be a common request for users looking to use pandas for analysis on large datasets and thought it would be a useful addition to the pandas moving statistics/moments suite?

Activity

added
Numeric OperationsArithmetic, Comparison, and Logical operations
PerformanceMemory or execution speed performance
on Feb 13, 2015
jreback

jreback commented on Feb 13, 2015

@jreback
Contributor

I have added it to the master issue. Its actuallly pretty straightforward to do this. Note that all that is necessary is a new method, rolling_idxmax, which is essentially rolling_max but it tracks the index as well; so all that is needed is a modification to algos.roll_max2 to also record the index.

want to do a pull-request here?

added this to the 0.17.0 milestone on Feb 13, 2015
PH82

PH82 commented on Feb 16, 2015

@PH82
Author

Yes that sounds like what I was thinking, pandas.stats.moments.rolling_rank with the same arguments as rank, allowing the user to select a competition method (min, avg....) and whether ranks should be ascending. Not sure I'm familiar enough with the pandas code yet to make any modifications my self to implement the best solution in the environment for this though. Happy to test though.

cing

cing commented on Apr 14, 2015

@cing

I got rolling_indmax and rolling_indxmin working just as @jreback said, recording the index. No extra memory allocation or computation on top of the normal algorithm, just an additional return array from roll_max2/roll_min2 holding the indices. At the PyCon sprint @jreback suggested I return an int64 from the Cython code to store the indices, but I'm pretty sure it has to be a float64 due to possible NaN's, right?

Also, @PH82 requested rolling_idxaverage but is this really necessary?

shoyer

shoyer commented on Apr 14, 2015

@shoyer
Member

@cing you can use -1 to mark missing values values as long as you only use positive indices otherwise (that's pretty standard in pandas)

PH82

PH82 commented on Jul 21, 2015

@PH82
Author

@cing, The average I mention is to determine how to rank ties. Common methods for how to rank ties are:

Values Rank (Tie Method=Min) Rank (Tie Method=Max) Rank (Tie Method=Avg)
3 4 4 4
1 1 1 1
2 2 3 2.5
4 5 5 5
2 2 3 2.5
reopened this on Jul 21, 2015
cing

cing commented on Jul 21, 2015

@cing

I'll have to take a look at this again, some of the test cases didn't succeed for rolling_idxmax and rolling_idxmin. I didn't implement a rolling_idxavg though, is that truly useful?

PH82

PH82 commented on Jul 21, 2015

@PH82
Author

I would say so, for my immediate purposes no, but it is a valid tie method
and quite likely that it would be used in the future, in fact I can think
of scenarios where I would use it. The rolling_rank I think should
implement as much of the non-rolling method as possible.

28 remaining items

Loading
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Metadata

Metadata

Assignees

No one assigned

    Labels

    PerformanceMemory or execution speed performanceWindowrolling, ewma, expanding

    Type

    No type

    Projects

    No projects

    Relationships

    None yet

      Development

      Participants

      @cing@cottrell@jreback@mcherkassky@shoyer

      Issue actions

        Implement high performance rolling_rank · Issue #9481 · pandas-dev/pandas