ENH: Excel - allow for multiple rows to be treated as hierarchical columns #4679

Description

jtratner (Contributor, Author)

related #4468

Add keyword argument to ExcelFile parser to take an integer / list for the rows to interpret as header rows. If more than one, interpret as hierarchical columns / MultiIndex.

Presumably this would also allow you to round-trip a DataFrame with hierarchical columns.

Basically, if a cell spans two columns, the parser just converts it to two cells that each hold the original cell's data, which ends up exactly like what the csv reader needs.

Activity

jreback (Contributor) commented on Aug 26, 2013

http://pandas.pydata.org/pandas-docs/dev/io.html#reading-columns-with-a-multiindex

(but that is for csv), and this probably needs special handling for excel

jtratner (Contributor, Author) commented on Aug 26, 2013

only special handling would be converting merged cells into repeated entries like csv, so this is relatively minor.

I.e.

 ____________________
| bar    |   baz     |
| A | B  | C | D | E |
|____________________|

just needs to change to something like

[['bar', 'bar', 'baz', 'baz', 'baz'], ['A', 'B', 'C', 'D', 'E']]

under the hood

jtratner (Contributor, Author) commented on Aug 26, 2013

so, really, a function that takes in a merged cell and splits it into individual cells all with the same value would be sufficient to take advantage of csv's existing behavior.
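A minimal sketch of such a splitting function, in plain Python. It assumes that a merged header cell reaches the parser as its value in the leftmost cell followed by empty strings for the rest of the span; that representation and the name `fill_merged_cells` are hypothetical, not something defined in this thread.

```python
def fill_merged_cells(row, empty=""):
    """Return a copy of `row` where each empty cell repeats the
    nearest non-empty value to its left (hypothetical helper)."""
    filled = []
    last = empty
    for value in row:
        if value != empty:
            last = value  # start of a new (possibly merged) cell
        filled.append(last)
    return filled


# Assumed raw form of the two header rows from the example above:
# 'bar' spans two columns and 'baz' spans three.
header_rows = [
    ["bar", "", "baz", "", ""],
    ["A", "B", "C", "D", "E"],
]
expanded = [fill_merged_cells(row) for row in header_rows]
print(expanded)
```

Forward-filling cannot tell a merged cell from a header cell that is genuinely blank, so a real implementation would probably rely on the sheet's merged-range information instead.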

jreback (Contributor) commented on Aug 26, 2013

@jtratner I think that is right: in your example, if the merged cells are expanded into the output you posted, then header=[0,1] should parse to a MultiIndex.

Related is the reverse direction (in to_excel); again this exists in to_csv but would need porting (then you could round-trip).
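For reference, the existing csv behaviour being described can be sketched as follows. This only exercises `to_csv` and `read_csv` with `header=[0, 1]`; the sample values are made up for illustration, and nothing here is Excel-specific.

```python
import io

import pandas as pd

# Two-level columns matching the bar/baz example above.
columns = pd.MultiIndex.from_arrays(
    [["bar", "bar", "baz", "baz", "baz"], ["A", "B", "C", "D", "E"]]
)
df = pd.DataFrame([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]], columns=columns)

# Write both header rows, then read them back as hierarchical columns.
buf = io.StringIO()
df.to_csv(buf, index=False)
buf.seek(0)
result = pd.read_csv(buf, header=[0, 1])

print(result.columns.nlevels)
```

The round trip works for csv because `to_csv` repeats the top-level label in every column it covers, which is the same shape an Excel reader would need to produce from merged cells.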

jtratner (Contributor, Author) commented on Aug 26, 2013

@cancan101 interested in implementing this? just a minor modification of your get_effective_cell function.

cancan101 (Contributor) commented on Aug 26, 2013

I can take a look at this. I am equally interested in solving this for HTML files, for example: http://www.sec.gov/Archives/edgar/data/47217/000104746913006802/a2215416z10-q.htm#CCSCI

jtratner (Contributor, Author) commented on Aug 26, 2013

Yeah, that's basically the same thing, you just want to end up with the following arrays.

>>> from pandas import MultiIndex
>>> data = [
...     ['Three months ended April 30', 'Three months ended April 30',
...      'Six months ended April 30', 'Six months ended April 30'],
...     ['2013', '2012', '2013', '2012']
... ]
>>> MultiIndex.from_arrays(data)
MultiIndex
[(u'Three months ended April 30', u'2013'), (u'Three months ended April 30', u'2012'), (u'Six months ended April 30', u'2013'), (u'Six months ended April 30', u'2012')]

So if you have something like:

<td colspan=2>Span2</td>

You want to convert that into 2 cells with text 'Span2'
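One possible way to do that conversion, sketched with the standard library's `html.parser` only. The class name `ColspanExpander` and the sample table are hypothetical, and this is not how any pandas HTML backend is known to work; it ignores `rowspan` and nested tables.

```python
from html.parser import HTMLParser


class ColspanExpander(HTMLParser):
    """Collect table rows, repeating each cell's text `colspan` times."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None outside <tr>
        self._span = None  # colspan of the cell being read, or None outside a cell
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._span = int(dict(attrs).get("colspan") or 1)
            self._text = []

    def handle_data(self, data):
        if self._span is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._span is not None:
            text = "".join(self._text).strip()
            self._row.extend([text] * self._span)
            self._span = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


html = """
<table>
  <tr><td colspan=2>Span2</td><td>Single</td></tr>
  <tr><td>a</td><td>b</td><td>c</td></tr>
</table>
"""
parser = ColspanExpander()
parser.feed(html)
parser.close()
rows = parser.rows
print(rows)
```

The resulting list of lists has the same shape as the expanded Excel header rows, so both could feed the same MultiIndex construction step.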

cancan101 (Contributor) commented on Aug 26, 2013

Exactly. I am going to create a similar issue to this one for HTML. FWIW, it would be great to merge the IO backends so that functionality like this can be shared. See: #4682. Shoot, closing the other issue.

jtratner (Contributor, Author) commented on Aug 26, 2013

@cancan101 well, I believe they mostly are: they just pass to a TextReader, which does the majority of the work (so, for example, the ExcelFile reader has to do some magic to convert all the values to a list of lists that can be passed to TextReader). I think you could do both of these in the same issue and then refactor the MultiIndex creation methods out of read_csv into something they can all use. Check out the code around here for how it works under the hood (I think):

https://github.com/pydata/pandas/blob/master/pandas/io/parsers.py#L703

cancan101 (Contributor) commented on Aug 26, 2013

I looked at the parser backends, and I believe that some, but not all, of them use TextReader. I believe the HTML parser does not use TextReader.

cancan101 (Contributor) commented on Aug 26, 2013

Now comes the other reason for improving the ExcelParser and/or the HTML parser: parsing hierarchical row indexes. A good example of this would be (different link from above): http://www.sec.gov/Archives/edgar/data/47217/000104746913006802/a2215416z10-q.htm#CCSE.

This table has 4 major sections (they can be identified as lines with no other data):

  1. Net revenue:
  2. Costs and expenses:
  3. Net earnings per share:
  4. Weighted-average shares used to compute net earnings per share:

Within the first of those sections are a number of line items and a section total.

A good feature of the parser would be to extract this structure from the table. Obviously this is non-trivial.
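A rough sketch of the simplest version of that idea, using only the rule stated above (a row with a label and no other data starts a section). The function name `label_sections`, the line-item labels, and the numbers are placeholders for illustration, not values from the linked filing, and section totals are not handled.

```python
def label_sections(rows):
    """Pair each data row's label with the most recent section heading.

    Each row is [label, value, value, ...]; a row whose values are all
    empty strings is treated as a section heading (assumed heuristic).
    """
    section = None
    index, data = [], []
    for label, *values in rows:
        if all(v == "" for v in values):
            section = label  # heading row: remember it, emit nothing
            continue
        index.append((section, label))
        data.append(values)
    return index, data


# Placeholder table: section names come from the list above,
# line items and numbers are invented.
rows = [
    ["Net revenue:", "", ""],
    ["Line item 1", "10", "12"],
    ["Line item 2", "5", "6"],
    ["Costs and expenses:", "", ""],
    ["Line item 3", "7", "8"],
]
index, data = label_sections(rows)
print(index)
```

The `(section, label)` tuples are in the form that `MultiIndex.from_tuples` accepts, so they could serve as a hierarchical row index for the remaining data.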


modified the milestones: 0.17.0, Next Major Release on Sep 3, 2015
added a commit that references this issue on Sep 9, 2015

Participants: @cancan101, @cpcloud, @jreback, @jtratner, @hayd
