Undocumented feature: partial string slicing #16917

Closed
Description

@tdpetrou
Contributor

Code Sample, a copy-pastable example if possible

>>> import pandas as pd
>>> index = ['abe', 'adam', 'andrew', 'ben', 'brad', 'cal', 'chad', 'dan']
>>> data = [0] * len(index)
>>> df = pd.DataFrame(index=index, data=data, columns=['col'])
>>> df
        col
abe       0
adam      0
andrew    0
ben       0
brad      0
cal       0
chad      0
dan       0

>>> df.loc['ac':'d']
        col
adam      0
andrew    0
ben       0
brad      0
cal       0
chad      0

Problem description

Partial string slicing is documented for datetimeindexes but is nowhere to be found for string indexes. The index must be ordered for it to work. I couldn't find a single example of this anywhere online. Is this type of slicing encouraged? Should it be documented?
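For contrast, here is a minimal sketch of the documented partial string indexing on a DatetimeIndex. The series `ts` and its date range are made up for illustration. In that case a string such as "2017-02" stands for a whole period, which is not the same thing as comparing plain strings.

```python
import pandas as pd

# Hypothetical daily series covering 2017-01-01 through 2017-03-31 (90 days).
ts = pd.Series(range(90), index=pd.date_range("2017-01-01", periods=90, freq="D"))

# A partial date string selects every timestamp that falls inside that period.
feb = ts.loc["2017-02"]

# In a slice, each partial string expands to the start/end of its period,
# so the stop bound "2017-02" covers all of February.
jan_feb = ts.loc["2017-01":"2017-02"]

print(len(feb), len(jan_feb))
```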

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.1.final.0
python-bits: 64
OS: Darwin
OS-release: 15.6.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.20.3
pytest: 3.0.7
pip: 9.0.1
setuptools: 35.0.2
Cython: 0.25.2
numpy: 1.13.1
scipy: 0.19.0
xarray: None
IPython: 6.0.0
sphinx: 1.5.5
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: 1.2.0
tables: 3.4.2
numexpr: 2.6.2
feather: None
matplotlib: 2.0.2
openpyxl: 2.4.7
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: 0.9.6
lxml: 3.7.3
bs4: 4.6.0
html5lib: 0.999999999
sqlalchemy: 1.1.9
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: 0.3.0.post

Activity

gfyoung (Member) commented on Jul 14, 2017

Admittedly, this behavior is a little less intuitive because Python string comparisons aren't as easy to reason about as comparisons of datetime objects. That being said, I'm indifferent about allowing it (in which case, document it) or disallowing it (in which case, forbid it).

toobaz (Member) commented on Jul 14, 2017

Maybe I'm missing something, but I see no "partial" string slicing: I just see slicing with start/stop bounds not present in the index, no different from what happens with

In [11]: df.loc['academia':'dadaism']
Out[11]: 
        col
adam      0
andrew    0
ben       0
brad      0
cal       0
chad      0

This is something perfectly natural when you are dealing with datetimes or ints, a bit less with strings, as @gfyoung rightly suggests; still, I don't think there is anything it makes sense to forbid. The docs, by the way, never state that start and stop bounds must be present, except maybe (implicitly) when they say that they are included in the results. So unless I'm missing something we should just replace the sentence "When slicing, the start bound is included, AND the stop bound is included" with "When slicing, the start bound, AND the stop bound are included, if present.".

Analogously, a few lines later, "(note that contrary to usual python slices, both the start and the stop are included!)" should become "(note that contrary to usual python slices, both the start and the stop are included, if present!)"
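To make the point concrete, the following sketch checks that slicing the sorted example index gives the same labels as an ordinary Python string comparison against the two bounds. The variable names `by_comparison` and `by_loc` are illustrative only.

```python
import pandas as pd

index = ["abe", "adam", "andrew", "ben", "brad", "cal", "chad", "dan"]
df = pd.DataFrame({"col": 0}, index=index)

# Plain Python string comparison: keep labels with 'ac' <= label <= 'd'.
by_comparison = [label for label in index if "ac" <= label <= "d"]

# Label slicing on the sorted index, with bounds that are not in the index.
by_loc = list(df.loc["ac":"d"].index)

print(by_comparison == by_loc)
# 'dan' sorts after 'd', so it falls outside the slice; no prefix matching is involved.
print("dan" <= "d")
```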

gfyoung (Member) commented on Jul 14, 2017

@toobaz : it's a matter of perspective. In the examples provided by @tdpetrou , "ac" and "d" can be viewed as partial strings because they are all shorter than any of the string indices provided.

That being said, given my indifference, do either of you (@toobaz and @tdpetrou ) have any preference? This is really up to users at this point, unless other maintainers have strong opinions about this.

toobaz (Member) commented on Jul 14, 2017

> @toobaz : it's a matter of perspective. In the examples provided by @tdpetrou , "ac" and "d" can be viewed as partial strings because they are all shorter than any of the string indices provided.

Sure, sorry, I wasn't clear :-) I see @tdpetrou 's perspective, I was just suggesting that this is not pandas' perspective, and that there is nothing really unexpected going on.

It's like saying "positional indexing with prime numbers is undocumented" - positional indexing just works with any integer number, and you only want to document the general case.

gfyoung (Member) commented on Jul 14, 2017

@toobaz : No worries. It seems like you have no issues with allowing this behavior, so long as it's documented. Let's see what @tdpetrou has to say about it as well. Otherwise, either one of you is more than welcome to document (and test) this behavior in a PR!

tdpetrou (Contributor, Author) commented on Jul 14, 2017

I like this behavior and think it should remain. I disagree with @toobaz and think this behavior is completely unexpected. There is nowhere in the documentation that partial strings (or inexact strings or however you want to term them) work in this manner. This specific behavior only works when the index is sorted and fails with a KeyError when not.

The normal behavior of slicing with .loc is to select all index labels from start to stop, including stop. This is done without regard to lexicographic ordering. If either start or stop is not in the index, a KeyError is raised.

The specific behavior that I brought up works as follows. If the index is ordered, either increasing or decreasing, then all labels lexicographically greater than or equal to start and less than or equal to stop are selected (or vice versa if monotonic decreasing). No KeyError is raised if either bound is missing from the index.
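The two behaviors described above can be sketched as follows. The decreasing frame reuses the names from the original example; the unsorted ordering is hypothetical and chosen only for illustration.

```python
import pandas as pd

names = ["abe", "adam", "andrew", "ben", "brad", "cal", "chad", "dan"]

# Monotonic decreasing index: the bounds are given in the index's own order.
desc = pd.DataFrame({"col": 0}, index=names[::-1])
desc_labels = list(desc.loc["d":"ac"].index)

# Unsorted index (hypothetical ordering).
unsorted = pd.DataFrame({"col": 0}, index=["ben", "abe", "dan", "cal"])

# Both bounds present: everything positioned between them, whatever the sort order.
between = list(unsorted.loc["ben":"dan"].index)

# A bound that is missing from an unsorted index cannot be located.
try:
    unsorted.loc["ac":"d"]
    raised = False
except KeyError:
    raised = True

print(desc_labels)
print(between, raised)
```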

toobaz (Member) commented on Jul 14, 2017

> I like this behavior and think it should remain. I disagree with @toobaz and think this behavior is completely unexpected. There is nowhere in the documentation that partial strings (or inexact strings or however you want to term them) work in this manner. This specific behavior only works when the index is sorted and fails with a KeyError when not.

Nothing specific to strings (and "inexact" is not a better term than "partial"). Compare pd.Series(index=[1,3,2,5]).loc[0:6] (KeyError) to pd.Series(index=[1,2,3,5]).loc[0:6] (works).

But maybe you are right that this is not documented, and in this case, a PR to the docs would probably be a very good thing.
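The integer comparison can be run as a small script. As a slight adaptation, the series are filled with 0 here so the example does not depend on the default dtype of an empty Series; the variable names are illustrative.

```python
import pandas as pd

# Same indexes as in the comparison above, with placeholder values.
unsorted = pd.Series(0, index=[1, 3, 2, 5])
sorted_ = pd.Series(0, index=[1, 2, 3, 5])

# Bounds 0 and 6 are absent from both indexes.
try:
    unsorted.loc[0:6]
    outcome = "ok"
except KeyError:
    outcome = "KeyError"

print(outcome)
print(list(sorted_.loc[0:6].index))
```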

tdpetrou (Contributor, Author) commented on Jul 14, 2017

This is a very specific and different behavior than normally expected which I already outlined above. It would be helpful to the users to see an example. I am using the term 'lexicographic slicing' in my book.

toobaz (Member) commented on Jul 14, 2017

I don't follow you - I can't see what "lexicographic" means when there aren't multiple levels.

gfyoung (Member) commented on Jul 14, 2017

@tdpetrou @toobaz : We can worry about exact semantics later in the PR. Maintainers will have the last word in the end 😄

@tdpetrou : if you're interested, you should put up a PR describing this behavior both in the docstring + the documentation (under the doc/ folder). Also, tests for this behavior will be needed.

tdpetrou (Contributor, Author) commented on Jul 14, 2017

@gfyoung I'll see if I have time to do this over the weekend but anyone else is welcome to take it.

toobaz (Member) commented on Jul 14, 2017

OK, just to summarize: my understanding is that .loc[start:stop]

  1. returns everything between start and stop in the index if start and stop are in the index
  2. returns everything sorting between start and stop if the index is sorted (and start and stop can be compared against its elements)
  3. fails otherwise

Part 2. is probably undocumented/untested.
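One way to picture case 2 is as a sorted-insertion lookup on each bound. The sketch below compares `Index.searchsorted` with the positions reported by `Index.slice_locs` for the example index. It only shows that the two agree here; it is not a statement about how pandas implements label slicing internally.

```python
import pandas as pd

idx = pd.Index(["abe", "adam", "andrew", "ben", "brad", "cal", "chad", "dan"])

# Positions where the bounds would be inserted to keep the index sorted.
start = idx.searchsorted("ac", side="left")
stop = idx.searchsorted("d", side="right")

# slice_locs reports the positional bounds that a label slice resolves to.
locs = idx.slice_locs("ac", "d")

print((int(start), int(stop)), locs)
print(list(idx[start:stop]))
```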

39 remaining items


Metadata

Assignees: No one assigned
Labels: Docs; Indexing (Related to indexing on series/frames, not to indexes themselves); Usage Question
Type: No type
Projects: No projects
Participants: @jreback, @toobaz, @tdpetrou, @gfyoung