Closed
Labels: API Design, Enhancement, IO Data, IO Excel
Description
related #4468
Add a keyword argument to the ExcelFile parser that takes an integer or a list of integers naming the rows to interpret as header rows. If more than one row is given, interpret them as hierarchical columns (a MultiIndex).
Presumably this would also allow you to round-trip a DataFrame with hierarchical columns.
Basically, if a cell spans two columns, convert it to two cells that each hold the original cell's data, which ends up being exactly what the csv reader needs.
Activity
jreback commented on Aug 26, 2013
http://pandas.pydata.org/pandas-docs/dev/io.html#reading-columns-with-a-multiindex
(but for csv), and this might/probably needs special handling for Excel
jtratner commented on Aug 26, 2013
The only special handling would be converting merged cells into repeated entries, as in csv, so this is relatively minor.
I.e.
just needs to change to something like
under the hood
jtratner commented on Aug 26, 2013
So, really, a function that takes a merged cell and splits it into individual cells, all with the same value, would be sufficient to take advantage of csv's existing behavior.
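As a rough illustration of that idea (not code from pandas or from this thread), here is a minimal sketch of such a function. It assumes the sheet has already been read into a list of lists and that merged regions are described as (row_lo, row_hi, col_lo, col_hi) tuples with exclusive upper bounds. The name fill_merged_cells and the sample header are hypothetical.

```python
def fill_merged_cells(rows, merged_ranges):
    """Return a copy of rows with every merged range filled with its top-left value.

    rows: list of lists of cell values (hypothetical in-memory sheet).
    merged_ranges: iterable of (row_lo, row_hi, col_lo, col_hi), upper bounds exclusive.
    """
    filled = [list(row) for row in rows]  # copy so the input is left untouched
    for row_lo, row_hi, col_lo, col_hi in merged_ranges:
        value = filled[row_lo][col_lo]  # a merged cell keeps its value in the top-left corner
        for r in range(row_lo, row_hi):
            for c in range(col_lo, col_hi):
                filled[r][c] = value
    return filled


# Hypothetical two-row header where "Span2" is merged across columns 1 and 2.
header = [
    ["", "Span2", "", "Other"],
    ["id", "a", "b", "c"],
]
filled_header = fill_merged_cells(header, [(0, 1, 1, 3)])
```

After this step the header rows look like what a csv file with repeated entries would contain, so the existing csv-style MultiIndex handling could in principle take over.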
jreback commented on Aug 26, 2013
@jtratner I think that is right. In your example, header=[0,1], if it produces the output you put, should then parse to a MultiIndex. Related is the reverse (in to_excel); again, this exists in to_csv, but would need porting (then it could round-trip).
jtratner commented on Aug 26, 2013
@cancan101 interested in implementing this? It's just a minor modification of your get_effective_cell function.
cancan101 commented on Aug 26, 2013
I can take a look at this. I am equally interested in solving this for HTML files, for example: http://www.sec.gov/Archives/edgar/data/47217/000104746913006802/a2215416z10-q.htm#CCSCI
jtratner commented on Aug 26, 2013
Yeah, that's basically the same thing, you just want to end up with the following arrays.
So if you have something like:
You want to convert that into 2 cells with text 'Span2'
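For the HTML case, a minimal sketch of that conversion using only the standard library's html.parser might look like the following. The class ColspanExpander, the helper expand_colspans, and the sample markup are all hypothetical, and rowspan is deliberately ignored to keep the example short.

```python
from html.parser import HTMLParser


class ColspanExpander(HTMLParser):
    """Collect table rows, repeating each cell's text once per spanned column."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._span = None  # colspan of the cell being read; None when outside a cell
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            # A missing or empty colspan attribute counts as 1.
            self._span = int(dict(attrs).get("colspan") or 1)
            self._text = []

    def handle_data(self, data):
        if self._span is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._span is not None:
            if not self.rows:  # tolerate a cell that appears before any <tr>
                self.rows.append([])
            text = "".join(self._text).strip()
            self.rows[-1].extend([text] * self._span)
            self._span = None


def expand_colspans(html):
    parser = ColspanExpander()
    parser.feed(html)
    parser.close()
    return parser.rows


# Hypothetical table: one header cell spanning two columns.
sample = (
    "<table>"
    "<tr><th colspan='2'>Span2</th><th>Other</th></tr>"
    "<tr><td>a</td><td>b</td><td>c</td></tr>"
    "</table>"
)
expanded = expand_colspans(sample)
```

Here the spanning cell becomes two cells that both read 'Span2', so each header row has the same number of entries as the data rows beneath it.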
cancan101 commented on Aug 26, 2013
Exactly. I am going to create a similar issue to this one for HTML. FWIW, it would be great to merge the IO backends so that functionality like this can be shared. See: #4682. Shoot, closing the other issue.
jtratner commented on Aug 26, 2013
@cancan101 well, I believe they mostly are; they just pass to a TextReader, which does the majority of the work (so, for example, the ExcelFile reader has to do some magic to convert all the values to a list of lists that can be passed to TextReader). I think you could do both of these in the same issue and then refactor the MultiIndex creation methods out of read_csv into something they can all use. Check out the code around here for how it works under the hood (I think):
https://github.com/pydata/pandas/blob/master/pandas/io/parsers.py#L703
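To make the shared step concrete, here is a small sketch of turning a list of lists plus a header argument into per-column label tuples. It is only an illustration under assumed names (split_header and the sample rows are hypothetical) and is not the code at the link above.

```python
def split_header(rows, header):
    """Split rows into column-label tuples and the remaining data rows.

    rows: list of lists, with merged or spanning cells already repeated.
    header: list of row positions to use as header levels, e.g. [0, 1].
    """
    levels = [rows[i] for i in header]
    # Zipping the header rows gives one tuple of labels per column.
    columns = list(zip(*levels))
    data = [row for i, row in enumerate(rows) if i > max(header)]
    return columns, data


# Hypothetical parsed sheet with two header rows.
rows = [
    ["Span2", "Span2", "Other"],
    ["a", "b", "c"],
    [1, 2, 3],
    [4, 5, 6],
]
columns, data = split_header(rows, header=[0, 1])
```

Tuples in this shape are what pandas.MultiIndex.from_tuples accepts, which is one reason a single helper could plausibly serve the csv, Excel, and HTML readers alike.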
cancan101 commented on Aug 26, 2013
I believe that I looked at the parser backends and that some, but not all, of the parsers use TextReader. I believe that the HTML parser does not use TextReader.
cancan101 commented on Aug 26, 2013
Now comes the other reason for improving the ExcelParser and/or the HTML parser: parsing hierarchical row indexes. A good example of this would be (different link from above): http://www.sec.gov/Archives/edgar/data/47217/000104746913006802/a2215416z10-q.htm#CCSE.
This table has 4 major sections (they can be identified as lines with no other data):
Within the first of those sections are a number of line items and a section total.
A good feature of the parser would be to extract this structure from the table. Obviously this is non-trivial.
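One very simplified way to approach it, assuming section headings really are the rows whose only populated cell is the label, is sketched below. The function label_sections and the sample table are hypothetical (the values are not taken from the linked filing), and real tables would need far more care.

```python
def label_sections(rows):
    """Pair each data row with the most recent section heading.

    A row counts as a heading when every cell after the label is empty.
    Returns a list of ((section, label), values) pairs.
    """
    section = None
    labelled = []
    for label, *values in rows:
        if all(v in ("", None) for v in values):
            section = label  # heading row: a label with no other data
        else:
            labelled.append(((section, label), values))
    return labelled


# Hypothetical table with two sections and a section total.
table = [
    ["Section A", "", ""],
    ["Item 1", 10, 12],
    ["Item 2", 5, 7],
    ["Total A", 15, 19],
    ["Section B", "", ""],
    ["Item 3", 1, 2],
]
labelled = label_sections(table)
```

The (section, label) pairs could then serve as the tuples for a hierarchical row index. Telling totals apart from ordinary line items is left out of this sketch.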
ENH: read_excel MultiIndex pandas-dev#4679
Merge pull request #10967 from chris-b1/excel-read-multiindex