Skip to content

DOC: Series.nlargest sorts index with duplicates #24268

@dataseller

Description

@dataseller

I think the description of this method is incorrect. The effect of parameter "keep" says, if we pass "first" to it, means to take the first duplicate value, while we pass "last" to it, means to take the last duplicate value.
However, its real effect is to sort the duplicate values by their indexes.
Hope we put it right soon.

Activity

chris-b1

chris-b1 commented on Dec 13, 2018

@chris-b1
Contributor

I don't think the docstring is wrong, but could be updated to note that the resulting index is sorted

s = pd.Series([3, 3, 3, 2, 1], index=['a', 'b', 'c', 'd', 'e'])

s.nlargest(2, keep='first')
Out[27]: 
a    3
b    3
dtype: int64

s.nlargest(2, keep='last')
Out[28]: 
c    3
b    3
dtype: int64


# unsorted
s = pd.Series([3, 3, 3, 2, 1], index=['e', 'd', 'c', 'b', 'a'])
s.nlargest(2, keep='last')
Out[30]: 
c    3
d    3
changed the title [-]The mistake of Series.nlargest[/-] [+]DOC: Series.nlargest sorts index with duplicates[/+] on Dec 13, 2018
added this to the Contributions Welcome milestone on Dec 13, 2018
songmoo

songmoo commented on Dec 14, 2018

@songmoo

can I try to fix it?

nfreundlich

nfreundlich commented on Dec 14, 2018

@nfreundlich

can I try to fix it?

did you take the issue?

thoo

thoo commented on Dec 24, 2018

@thoo
Contributor

@chris-b1 I think the description for keep could be more clear. Currently it said:

            - ``first`` : take the first occurrences based on the index order
            - ``last`` : take the last occurrences based on the index order

The index order might not be the same as the order of the index of the Series if the Series is not ordered.
For instance,

# unsorted
s = pd.Series([3, 3, 3, 2, 1], index=['e', 'd', 'c', 'b', 'a'])
s.nlargest(2, keep='last')
Out[30]: 
c    3
d    3

I would naturally assume the index order for keep='last' would be 'd' and 'e' instead of 'c' and 'd' by thinking of ordering index.

Omnomnominous

Omnomnominous commented on Jan 15, 2019

@Omnomnominous

@thoo I think the description is correct insofar as it is order of the items in the index array. It's possible that it could be reworded to be a bit more clear like so:

            - ``first`` : return the first n occurrences in given index order
            - ``last`` : return the last n occurrences in reverse index order

@chris-b1 One thing that I noticed while looking into this is that when you call s.nlargest(2,keep="last") it doesn't return the items sorted by index, but in reverse order by index. For example:

>>>s = pd.Series([3, 1, 3, 2, 3,], index=['e', 'a', 'c', 'b', 'd'])
>>>s.nlargest(3, keep='last')
d    3
c    3
e    3

I feel as if this is a bug in it's own way and that it should instead be returning in the original index order i.e.

//should return
e    3
c    3
d    3
added a commit that references this issue on Mar 5, 2019
076b5a8
added a commit that references this issue on Mar 9, 2019
dcf7137
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

      Participants

      @chris-b1@thoo@nfreundlich@Omnomnominous@songmoo

      Issue actions

        DOC: Series.nlargest sorts index with duplicates · Issue #24268 · pandas-dev/pandas