TypeError: rank() got an unexpected keyword argument 'numeric_only' #11759

Closed

Description

@nbonnotte (Contributor)
In [19]: df = DataFrame({'a':['A1', 'A1', 'A1'], 'b':['B1','B1','B2'], 'c':1})

In [20]: df.set_index('a').groupby('b').rank(method='first')
Out[20]: 
    c
a    
A1  1
A1  2
A1  1

In [21]: df.set_index('a').groupby('c').rank(method='first')
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-21-6b8d4cae9d91> in <module>()
----> 1 df.set_index('a').groupby('c').rank(method='first')

/home/nicolas/Git/pandas/pandas/core/groupby.pyc in rank(self, axis, numeric_only, method, na_option, ascending, pct)

/home/nicolas/Git/pandas/pandas/core/groupby.pyc in wrapper(*args, **kwargs)
    618                     # mark this column as an error
    619                     try:
--> 620                         return self._aggregate_item_by_item(name, *args, **kwargs)
    621                     except (AttributeError):
    622                         raise ValueError

/home/nicolas/Git/pandas/pandas/core/groupby.pyc in _aggregate_item_by_item(self, func, *args, **kwargs)
   3076             # GH6337
   3077             if not len(result_columns) and errors is not None:
-> 3078                 raise errors
   3079 
   3080         return DataFrame(result, columns=result_columns)

TypeError: rank() got an unexpected keyword argument 'numeric_only'

I'm trying to obtain what I would get with a row_number() in SQL...

Notice that if I replace the value in the 'c' column with the string '1', then even df.set_index('a').groupby('b').rank(method='first') fails.

Am I doing something wrong?

Activity

jreback (Contributor) commented on Dec 4, 2015

you are trying to rank on a string column, which is not supported.

But should give a better message I would think.

In [20]: df.set_index('a').groupby('c').first()
Out[20]: 
    b
c    
1  B1

Added to the 0.18.0 milestone on Dec 4, 2015.

nbonnotte (Contributor, Author) commented on Dec 27, 2015

That's weird, because .rank() works with method='average' (the default value) but not with method='first'.

In [2]: df = DataFrame({'a':['A1', 'A1', 'A1'], 'b':['B1','B1','B2'], 'c':1})

In [3]: df.set_index('a').groupby('c').rank()
Out[3]: 
      b
a      
A1  1.5
A1  1.5
A1  3.0

In [4]: df.set_index('a').groupby('c').rank(method='first')
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-4-6b8d4cae9d91> in <module>()
----> 1 df.set_index('a').groupby('c').rank(method='first')

/home/nicolas/Git/pandas/pandas/core/groupby.pyc in rank(self, axis, numeric_only, method, na_option, ascending, pct)

/home/nicolas/Git/pandas/pandas/core/groupby.pyc in wrapper(*args, **kwargs)
    582                     # mark this column as an error
    583                     try:
--> 584                         return self._aggregate_item_by_item(name, *args, **kwargs)
    585                     except (AttributeError):
    586                         raise ValueError

/home/nicolas/Git/pandas/pandas/core/groupby.pyc in _aggregate_item_by_item(self, func, *args, **kwargs)
   3017             # GH6337
   3018             if not len(result_columns) and errors is not None:
-> 3019                 raise errors
   3020 
   3021         return DataFrame(result, columns=result_columns)

TypeError: rank() got an unexpected keyword argument 'numeric_only'

I'm looking into it.

nbonnotte (Contributor, Author) commented on Dec 28, 2015

I think I understand what is going on.

DataFrameGroupBy.rank is created as part of a whitelist of operators, and its signature is taken from DataFrame.rank, which uses a wrapper obtained with DataFrame._make_wrapper. There, different things are tried to produce the result.

With method='average', the first try succeeds.

With method='first', the first two tries raise an exception with the message "first not supported for non-numeric data", which is good, but then at the last try the method NDFrameGroupBy._aggregate_item_by_item is called. Things go wrong here, as it uses SeriesGroupBy.rank, the signature of which is taken from Series.rank. The parameter numeric_only does not exist there, hence the error.

There is a design flaw here:

  • either the DataFrame and Series (and Panel, I guess) versions of rank (and the like) should always have the same signature
  • or the DataFrameGroupBy.rank should not use SeriesGroupBy.rank

I'll think about a solution that is as minimalist as possible, solves the initial issue, and if possible addresses this flaw. If I can't, I'll just add a hack somewhere to solve the initial issue.
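
The mismatch described above can be reproduced without pandas. This is only a minimal sketch, not the pandas implementation: `series_rank`, `frame_rank` and `aggregate_item_by_item` are hypothetical stand-ins that mimic nothing more than the two differing signatures.

```python
def series_rank(values, method="average"):
    """Hypothetical 1-D rank. Note: there is no numeric_only parameter."""
    # Stable sort, so ties keep their original order (a 'first'-style ranking).
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for position, i in enumerate(order, start=1):
        ranks[i] = position
    return ranks


def aggregate_item_by_item(columns, **kwargs):
    """Hypothetical fallback: forwards the 2-D keyword arguments unchanged."""
    return {name: series_rank(values, **kwargs) for name, values in columns.items()}


def frame_rank(columns, numeric_only=None, method="average"):
    """Hypothetical 2-D rank whose signature includes numeric_only."""
    return aggregate_item_by_item(columns, numeric_only=numeric_only, method=method)


try:
    frame_rank({"b": ["B1", "B1", "B2"]}, method="first")
    message = None
except TypeError as exc:
    # The 1-D function rejects the keyword it was never designed to accept.
    message = str(exc)

print(message)
```

The failure comes purely from forwarding `**kwargs` between two functions whose signatures differ, which is why either option in the list above would remove it.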

jreback (Contributor) commented on Dec 28, 2015

the right way to fix this is to move Series.rank and DataFrame.rank into generic.py and make the signature uniform.

You then accept numeric_only=None in the Series.rank (and raise NotImplementedError if it's not None).

Further need to add axis as a parameter (the _get_axis_name handles the case where the axis is greater than the ndim FYI).

you can raise if ndim>2 as well
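
A rough sketch of what such a uniform signature could look like, using toy classes rather than the real pandas ones. `ToyNDFrame`, `ToySeries` and `_rank` are hypothetical names, and the ranking itself is a simplified 'first'-style ordering chosen only to keep the example short.

```python
class ToyNDFrame:
    """Hypothetical shared base class holding the single rank signature."""

    ndim = 0

    def rank(self, axis=0, method="average", numeric_only=None):
        if self.ndim > 2:
            raise NotImplementedError("rank is only sketched for ndim <= 2")
        if self.ndim == 1 and numeric_only is not None:
            # The 1-D case accepts the keyword but refuses any real value.
            raise NotImplementedError("numeric_only is not supported when ndim == 1")
        return self._rank(axis=axis, method=method)


class ToySeries(ToyNDFrame):
    """Hypothetical 1-D container that inherits the uniform signature."""

    ndim = 1

    def __init__(self, values):
        self.values = list(values)

    def _rank(self, axis, method):
        # Simplified: always ranks by sorted position, ignoring `method`.
        order = sorted(range(len(self.values)), key=self.values.__getitem__)
        ranks = [0] * len(self.values)
        for position, i in enumerate(order, start=1):
            ranks[i] = position
        return ranks


print(ToySeries([3, 1, 2]).rank(numeric_only=None))
```

With this shape, a caller that forwards `numeric_only=None` to the 1-D version no longer hits a `TypeError`, while an explicit `numeric_only=True` fails loudly.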

nbonnotte (Contributor, Author) commented on Dec 28, 2015

So now, SeriesGroupBy.rank has the right signature, and Series._make_wrapper is used, so again there is a call to ._aggregate_item_by_item()... except that this method comes from NDFrameGroupBy, and SeriesGroupBy does not inherit from NDFrameGroupBy, so now an AttributeError is raised. This is caught and transformed into a simple ValueError, with the following comment:

related to : GH #3688
try item-by-item
this can be called recursively, so need to raise ValueError if
we don't have this method to indicated to aggregate to
mark this column as an error

Indeed, the first call to _aggregate_item_by_item (the one that called SeriesGroupBy.rank... still following?) uses this ValueError to simply discard the column, and we end up with an empty dataframe with the example I gave in the beginning.

I'm going to prevent the call to SeriesGroupBy._aggregate_item_by_item (instead of asking for forgiveness), so that the exceptions can be sorted and a meaningful error message can be given to the user.
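
The idea of checking before calling, rather than catching the AttributeError afterwards, can be sketched as follows. Everything here is hypothetical: `ToySeriesGroupBy` and `call_with_fallback` are illustrative names, and the use of ValueError for the "first not supported" message is an assumption, not something this thread confirms.

```python
class ToySeriesGroupBy:
    """Hypothetical 1-D group-by: it has no item-by-item fallback method."""

    def rank(self, method="average"):
        if method == "first":
            # Assumed exception type; the message is the one quoted in the thread.
            raise ValueError("first not supported for non-numeric data")
        return [1.5, 1.5, 3.0]


def call_with_fallback(grouped, name, **kwargs):
    """Try the direct call, and fall back only if a fallback really exists."""
    try:
        return getattr(grouped, name)(**kwargs)
    except ValueError:
        # Look before leaping: without the fallback method, re-raise the
        # original, informative error instead of masking it.
        if not hasattr(grouped, "_aggregate_item_by_item"):
            raise
        return grouped._aggregate_item_by_item(name, **kwargs)


try:
    call_with_fallback(ToySeriesGroupBy(), "rank", method="first")
    surfaced = None
except ValueError as exc:
    surfaced = str(exc)

print(surfaced)
```

The design choice is that the user sees the first meaningful message instead of an empty result or an unrelated error raised further down the fallback chain.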

kuanche commented on Dec 30, 2015

Hi guys!
Dealing with the exact same issue. Any tips on what to try instead?

nbonnotte (Contributor, Author) commented on Dec 30, 2015

What are you trying to do, exactly?
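
If the goal is the SQL row_number() mentioned in the description, one possible workaround that avoids rank() altogether is `GroupBy.cumcount()`, which numbers the rows within each group starting at 0. This is a sketch under that assumption, not a fix proposed in this thread; the `row_number` column name is made up for illustration.

```python
from pandas import DataFrame

df = DataFrame({'a': ['A1', 'A1', 'A1'], 'b': ['B1', 'B1', 'B2'], 'c': 1})

# cumcount() numbers rows per group in their existing order, starting at 0,
# so adding 1 gives a row_number()-style counter. It never looks at the
# values of the other columns, so string columns are not a problem.
by_c = (df.groupby('c').cumcount() + 1).tolist()
by_b = (df.groupby('b').cumcount() + 1).tolist()

df['row_number'] = df.groupby('c').cumcount() + 1
print(df)
```

Because the numbering follows the current row order, sort the frame first (for example with `sort_values`) if a specific ORDER BY is needed.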
