Description
Code Sample, a copy-pastable example if possible
>>> import pandas as pd
>>> index = ['abe', 'adam', 'andrew', 'ben', 'brad', 'cal', 'chad', 'dan']
>>> data = [0] * len(index)
>>> df = pd.DataFrame(index=index, data=data, columns=['col'])
>>> df
        col
abe       0
adam      0
andrew    0
ben       0
brad      0
cal       0
chad      0
dan       0
>>> df.loc['ac':'d']
        col
adam      0
andrew    0
ben       0
brad      0
cal       0
chad      0
Problem description
Partial string slicing is documented for DatetimeIndex but is nowhere to be found for string indexes. The index must be ordered for it to work. I couldn't find a single example of this anywhere online. Is this type of slicing encouraged? Should it be documented?
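The ordering requirement described above can be checked with a minimal sketch. The shuffled ordering and the variable names are illustrative assumptions, not part of the original report, and the exact behavior may vary between pandas versions.

```python
import pandas as pd

names = ['abe', 'adam', 'andrew', 'ben', 'brad', 'cal', 'chad', 'dan']

# Sorted (monotonic increasing) index: bounds absent from the index are accepted.
sorted_df = pd.DataFrame({'col': [0] * len(names)}, index=names)
sorted_labels = sorted_df.loc['ac':'d'].index.tolist()

# Hypothetical shuffled ordering of the same labels (no longer monotonic).
shuffled = ['cal', 'abe', 'dan', 'brad', 'adam', 'chad', 'andrew', 'ben']
shuffled_df = sorted_df.loc[shuffled]

# With an unordered index, a bound that is not an actual label cannot be located.
try:
    shuffled_df.loc['ac':'d']
    raised_key_error = False
except KeyError:
    raised_key_error = True

print(sorted_labels)
print(raised_key_error)
```

Calling sort_index() on the shuffled frame first should restore the sliceable ordering.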
Output of pd.show_versions()
pandas: 0.20.3
pytest: 3.0.7
pip: 9.0.1
setuptools: 35.0.2
Cython: 0.25.2
numpy: 1.13.1
scipy: 0.19.0
xarray: None
IPython: 6.0.0
sphinx: 1.5.5
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: 1.2.0
tables: 3.4.2
numexpr: 2.6.2
feather: None
matplotlib: 2.0.2
openpyxl: 2.4.7
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: 0.9.6
lxml: 3.7.3
bs4: 4.6.0
html5lib: 0.999999999
sqlalchemy: 1.1.9
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: 0.3.0.post
Activity
gfyoung commented on Jul 14, 2017
Admittedly, this behavior is a little less intuitive because Python string comparisons aren't as easily comprehensible as those of datetime objects. That being said, I'm indifferent about allowing it (in which case, document it) or disallowing it (in which case, forbid it).
toobaz commented on Jul 14, 2017
Maybe I'm missing something, but I see no "partial" string slicing: I just see slicing with start/stop bounds not present in the index, not differently from what happens with
This is something perfectly natural when you are dealing with datetimes or ints, a bit less with strings, as @gfyoung rightly suggests; still, I don't think there is anything it makes sense to forbid. The docs, by the way, never state that start and stop bounds must be present, except maybe (implicitly) when they say that they are included in the results. So unless I'm missing something we should just replace the sentence "When slicing, the start bound is included, AND the stop bound is included" with "When slicing, the start bound, AND the stop bound are included, if present.".
Analogously, a few lines later, "(note that contrary to usual python slices, both the start and the stop are included!)" should become "(note that contrary to usual python slices, both the start and the stop are included, if present!)"
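To illustrate the point about bounds that are absent from the index, here is a minimal sketch with integer and datetime labels. The data and variable names are made up for illustration, and both indexes are sorted on purpose.

```python
import pandas as pd

# Integer labels: neither 2 nor 6 is in the index, yet the slice is accepted.
int_series = pd.Series([10, 20, 30, 40], index=[1, 3, 5, 7])
int_labels = int_series.loc[2:6].index.tolist()

# Datetime labels: neither bound is an actual entry of the index.
dates = pd.to_datetime(['2017-01-01', '2017-01-10', '2017-01-20', '2017-01-30'])
date_series = pd.Series([1, 2, 3, 4], index=dates)
date_slice = date_series.loc[pd.Timestamp('2017-01-05'):pd.Timestamp('2017-01-25')]
date_labels = date_slice.index.strftime('%Y-%m-%d').tolist()

# A bound that is present is included in the result.
inclusive_labels = int_series.loc[3:5].index.tolist()

print(int_labels, date_labels, inclusive_labels)
```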
gfyoung commented on Jul 14, 2017
@toobaz : it's a matter of perspective. In the examples provided by @tdpetrou , "ac" and "d" can be viewed as partial strings because they are all shorter than any of the string indices provided.
That being said, given my indifference, do either of you (@toobaz and @tdpetrou ) have any preference? This is really up to users at this point, unless other maintainers have strong opinions about this.
toobaz commented on Jul 14, 2017
Sure, sorry, I wasn't clear :-) I see @tdpetrou 's perspective, I was just suggesting that this is not pandas' perspective, and that there is nothing really unexpected going on. It's like saying "positional indexing with prime numbers is undocumented" - positional indexing just works with any integer number, and you only want to document the general case.
gfyoung commented on Jul 14, 2017
@toobaz : No worries. It seems like you have no issues with allowing this behavior, so long as it's documented. Let's see what @tdpetrou has to say about it as well. Otherwise, either one of you is more than welcome to document (and test) this behavior in a PR!
tdpetrou commented on Jul 14, 2017
I like this behavior and think it should remain. I disagree with @toobaz and think this behavior is completely unexpected. There is nowhere in the documentation that partial strings (or inexact strings or however you want to term them) work in this manner. This specific behavior only works when the index is sorted and fails with a KeyError when not.
The normal behavior of slicing with .loc is to select all indexes from start to stop and include stop. This is done without regard to lexicographic ordering. If either start or stop is not in the index, raise a KeyError.
This specific behavior that I brought up works as such. If index is ordered, either increasing or decreasing, then select all indexes lexicographically greater than or equal to start and less than or equal to stop (or vice versa if monotonic decreasing). Do not raise KeyError if either is not in index.
toobaz commented on Jul 14, 2017
Nothing specific to strings (and "inexact" is not a better term than "partial"). Compare pd.Series(index=[1,3,2,5]).loc[0:6] (KeyError) to pd.Series(index=[1,2,3,5]).loc[0:6] (works).
But maybe you are right that this is not documented, and in this case, a PR to the docs would probably be a very good thing.
tdpetrou commented on Jul 14, 2017
This is a very specific behavior, different from what is normally expected, which I already outlined above. It would be helpful to users to see an example. I am using the term 'lexicographic slicing' in my book.
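As one possible example of the kind requested here, the sketch below covers the monotonic decreasing case mentioned earlier in the thread. It reuses the names from the original report; the variable names are illustrative assumptions.

```python
import pandas as pd

names = ['abe', 'adam', 'andrew', 'ben', 'brad', 'cal', 'chad', 'dan']

# Monotonic increasing index, as in the original report.
increasing = pd.Series(range(len(names)), index=names)
# Same labels in reverse order, i.e. monotonic decreasing.
decreasing = increasing.iloc[::-1]

# On the increasing index the smaller bound comes first ...
inc_labels = increasing.loc['ac':'d'].index.tolist()
# ... on the decreasing index the bounds are given in the opposite order.
dec_labels = decreasing.loc['d':'ac'].index.tolist()

print(inc_labels)
print(dec_labels)
```

Both slices should select the same six labels, in the order of the respective index.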
toobaz commented on Jul 14, 2017
I don't follow you - I can't see what "lexicographic" means when there aren't multiple levels.
gfyoung commented on Jul 14, 2017
@tdpetrou @toobaz : We can worry about exact semantics later in the PR. Maintainers will have the last word in the end 😄
@tdpetrou : if you're interested, you should put up a PR describing this behavior both in the docstring + the documentation (under the doc/ folder). Also, tests for this behavior will be needed.
tdpetrou commented on Jul 14, 2017
@gfyoung I'll see if I have time to do this over the weekend but anyone else is welcome to take it.
toobaz commented on Jul 14, 2017
OK, just to summarize: my understanding is that .loc[start:stop]
1. selects the elements between start and stop in the index if start and stop are in the index
2. selects the elements between start and stop if the index is sorted (and start and stop can be compared against its elements)
Part 2. is probably undocumented/untested.
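The two cases in this summary can be inspected with Index.slice_indexer, a public pandas method that turns label bounds into a positional slice. This is only a sketch: the example labels are made up, and whether .loc goes through this method in every code path is an assumption.

```python
import pandas as pd

# Case 1: unsorted index, both bounds are actual labels.
unsorted_idx = pd.Index(['ben', 'abe', 'dan', 'adam'])
case1 = unsorted_idx.slice_indexer('abe', 'dan')
case1_labels = unsorted_idx[case1].tolist()

# Case 2: sorted index, neither bound is a label; positions are found by ordering.
sorted_idx = pd.Index(['abe', 'adam', 'andrew', 'ben', 'brad', 'cal', 'chad', 'dan'])
case2 = sorted_idx.slice_indexer('ac', 'd')
case2_labels = sorted_idx[case2].tolist()

print(case1, case1_labels)
print(case2, case2_labels)
```

In case 1 the result follows the positions of the two labels, regardless of ordering; in case 2 it follows the sort order of the index.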