Skip to content

New DataFrame feature: listify() and unlistify() #10511

Closed
@dov

Description

@dov

While building up a DataFrame in several steps, I found it difficult to add a new "perpendicular" column, i.e. a column that adds another dimension to already existing columns. To solve this problem I got the idea that this may be done in two steps:

  1. Add a new column whose values at each cell is a python list of the values the new column takes on.
  2. Unlistify the column thereby creating a new row for each element in the above lists.

I.e. I propose two new DataFrame methods, listify() and unlistify().

listify(df, column): Takes as input a dataframe and the name of a column. It will do a groupby of the df for all columns except column and generate a single row where the values in the column cell is a list of the column column values.

unlistify(df, column): Takes as input a dataframe and the name of a column. It will iterate over the values of the contents of column for each row and generate a new row for each value.

The functions may be expanded to support multiple columns. listify() may e.g. support a post processing function, that will be applied on the list.

The following python code illustrates these two functions. But obviously the functionality may be implemented more efficienctly on the C-level.

#!/usr/bin/python

import pandas as pd

def listify(df, column):
  matches = [i for i,n in enumerate(df.columns)
             if n==column]

  if len(matches)==0:
    raise Exception('Failed to find column named ' + column +'!')
  if len(matches)>1:
    raise Exception('More than one column named ' + column +'!')

  old_index = df.index
  col_idx = matches[0] + len(old_index.shape) # Since we will reset the index

  column_names = list(df.index.names) + list(df.columns)
  gb_cols = [c for c in column_names
             if c!= column]

  # Helper function to generate the squashed dataframe
  def fnc(d):
    row = list(d.values[0])
    return pd.DataFrame([row[:col_idx]
                         + [[v[col_idx] for v in list(d.values)]]
                         + row[col_idx+1:]])

  return (df
          .reset_index()
          .groupby(gb_cols)
          .apply(fnc)
          .rename(columns = lambda i : column_names[i])
          .set_index(old_index.names)
          )

def unlistify(df, column):
  matches = [i for i,n in enumerate(df.columns)
             if n==column]

  if len(matches)==0:
    raise Exception('Failed to find column named ' + column +'!')
  if len(matches)>1:
    raise Exception('More than one column named ' + column +'!')

  col_idx = matches[0]

  # Helper function to expand and repeat the column col_idx
  def fnc(d):
    row = list(d.values[0])
    bef = row[:col_idx]
    aft = row[col_idx+1:]
    col = row[col_idx]
    z = [bef + [c] + aft for c in col]
    return pd.DataFrame(z)

  col_idx += len(df.index.shape) # Since we will push reset the index
  index_names = list(df.index.names)
  column_names = list(index_names) + list(df.columns)
  return (df
          .reset_index()
          .groupby(level=0,as_index=0)
          .apply(fnc)
          .rename(columns = lambda i :column_names[i])
          .set_index(index_names)
          )


# Examples of how to listify and unlistify a column.

df = pd.DataFrame([[11,range(5),10],
                   [22,range(3),20]],
                   columns = ['A','B','C']).set_index('C')
print 'org'
print df
print '--'
df = unlistify(df,'B')
print 'unlistify(df,B)'
print df
print '--'
df = listify(df,'B')
print 'listify(df,B)'
print df

The corresponding output:

org
     A                B
C                      
10  11  [0, 1, 2, 3, 4]
20  22        [0, 1, 2]
--
unlistify(df,B)
     A  B
C        
10  11  0
10  11  1
10  11  2
10  11  3
10  11  4
20  22  0
20  22  1
20  22  2
--
listify(df,B)
     A                B
C                      
10  11  [0, 1, 2, 3, 4]
20  22        [0, 1, 2]

Activity

shoyer

shoyer commented on Jul 5, 2015

@shoyer
Member

How do you create these data frames with nested lists? My sense is that there is almost always a better way.

I'm reluctant to expand the dataframe API for new methods unless they are broadly useful -- this is a large part of why we added the .pipe method.

dov

dov commented on Jul 5, 2015

@dov
Author

My use case is the use of a dataframe through various part of an experiment, containing planning, execution, data collection and analysis. At the planning stage when building the initial data frame I still don't know what one of the columns -- containing "perpendicular" (data independent from all the other columns) data will contain. Once I know it, I want to insert the new data. I.e. I want to go from:

     A 
C                      
10  11
20  22

to:
A B
C
10 11 0
10 11 1
10 11 2
10 11 3
10 11 4
20 22 0
20 22 1
20 22 2

With the unlistify() function this is trivial:

df['B'] = [range(5),range(3)]
df = df.unlistify('B')

But perhaps there is a different simple way that I have missed?

I don't see how the pipe() method is relevant as it seems that it is just syntactic sugar for doing right-hand chaining of operators instead of left-hand function calls. I agree that it can be used together with my listify() and unlistify() implementations. If my proposal is not accepted, I will certainly use my implemtations, with or without pipe.

shoyer

shoyer commented on Jul 5, 2015

@shoyer
Member

Maybe you could show an example of how you make the dataframe? Presumably you're not reading it in from CSV. I guess my thought is that you might be able to easily create the "unlisted" dataframe in the first place.

Pipe is indeed a side point here. Mostly I mentioned it to point out that we are trying to put other libraries on equal footings to what we put in pandas proper.

shoyer

shoyer commented on Jul 5, 2015

@shoyer
Member

For example, suppose the number of items in the lists depends on the other columns, e.g., suppose df['c'] = [list(range(a, b)) for a, b in zip(df.a, df.b)]. It might seem natural to use unlistify in this context. But you might just as easily write df = pd.concat([pd.DataFrame({'a': a, 'b': b, 'c': list(range(a, b))}) for a, b in zip(df.a, df.b)].

dov

dov commented on Jul 5, 2015

@dov
Author

Thanks for the pd.concat() example. It is indeed simpler than my use of groupby(). But still if e.g. df has 10 columns before adding a new column, the use of pd.concat() becomes quite cumbersome.

In any case, I think it is a legitimate use case to start off with a a few dimensions and then unlisting new dimensions to add additional complexity. I don't think it is justified to force the user to declare all dimenions in advance. Initially I thought of adding a function to MultiIndex.add_factor() that would allow adding a new factor (like adding a new dimension to MultiIndex.from_product()). But I think the unlistify() concept is more general.

jreback

jreback commented on Jul 5, 2015

@jreback
Contributor

@dov you are just using a much less-efficient form of multi-index. You lose all performance and indexing with the list/unlistify. Mainly the mixed-list structures force you to be in object dtype. Further I am not sure that the list-structure actually allows you to do any operations easily. Everything first requires conversions, again leading to less that desirable performance / storage characteristics.

Compare to a regular multi-index

In [24]: df.reset_index().set_index(['C','B'])
Out[24]: 
       A
C  B    
10 0  11
   1  11
   2  11
   3  11
   4  11
20 0  22
   1  22
   2  22

In [28]: df2 = df.reset_index().set_index(['C','B'])

In [29]: df2.loc[10]
Out[29]: 
    A
B    
0  11
1  11
2  11
3  11
4  11

In [30]: df2.dtypes
Out[30]: 
A    int64
dtype: object

In [31]: df2.index
Out[31]: 
MultiIndex(levels=[[10, 20], [0, 1, 2, 3, 4]],
           labels=[[0, 0, 0, 0, 0, 1, 1, 1], [0, 1, 2, 3, 4, 0, 1, 2]],
           names=[u'C', u'B'])


Not sure why one would prefer a non-native structure that would have not any advantages (and several key disadvantages) over the multi-index structure.

Maybe you could shed some light on why you are not using a multi-index structure.

dov

dov commented on Jul 5, 2015

@dov
Author

I think the difference is that I see the DataFrame as something evolving in which you don't see the whole picture at the time of its construction. I will try to give an example. Let's say that you have an xy-table with a camera pictures of a plate filled with micro organisms. The field of view is much smaller than the plate. The goal is to image processing and classification of the micro organisms in the images.

Here are the steps that needs to be carried out:

  1. Create a plan containing x,y coordinates of where to snapshot the images. These may be stored in two columns of a dataframe.
  2. Take the images. The references to the images may be stored in a third column of the dataframe in 1.
  3. Do image detection. Each image generates a list of microorganisms with properties such as size x,y position etc. Each organism's properties should be stored on a separate line in a dataframe.

Of course the dataframe in 3 could be created with references back to the image dataframe of 1. But it may make more sense to expand the dataframe in 1 to make room for the detected image property.

In this sense the dataframe is like a logbook for the experiment. It grows and possibly contracts as the experiment progresses.

This is my goal. The idea of the unlistify() function was just a means of doing this. shoyer showed me the same functionality can be achieved through pd.concat().

shoyer

shoyer commented on Jul 6, 2015

@shoyer
Member

Of course the dataframe in 3 could be created with references back to the image dataframe of 1. But it may make more sense to expand the dataframe in 1 to make room for the detected image property.

I think the idiomatic way to do such an operation in pandas would be to use Database-style DataFrame joining/merging.

libbkmz

libbkmz commented on Apr 22, 2016

@libbkmz

I'll save it here for better search for this issue.

This SO answer explain how to do this thing really easily.
http://stackoverflow.com/questions/27263805/pandas-when-cell-contents-are-lists-create-a-row-for-each-element-in-the-list

jreback

jreback commented on Apr 22, 2016

@jreback
Contributor

as said above, this is not good data management, nor likely to be supported by pandas. this leads to really inefficient representations of data.

added
ReshapingConcat, Merge/Join, Stack/Unstack, Explode
on Apr 22, 2016
libbkmz

libbkmz commented on Apr 22, 2016

@libbkmz

@jreback Yeah, I agree with you. But, when I'm getting data from MongoDB with really nested structure, and I want to make some columns for data processing, this is only way to preprocess data and save shape references between origianl data and preprocessed dataset

jreback

jreback commented on Apr 22, 2016

@jreback
Contributor

@libbkmz then you need a proper layer in between. you can certainly pre-process using pandas, but de-listifying (ala exploding) encourages poor data representation in the first place :<

NirantK

NirantK commented on Jan 8, 2018

@NirantK

Hello @jreback , just wanted to ask if you want to close this issue? Do you want to tag it as won't fix for now?

chaxor

chaxor commented on Jan 17, 2018

@chaxor

I would think that the number of people that run into this problem of having lists or iterable structures within cells of a dataframe would give credence to this issue. I fully understand that it is less efficient and there are likely better dataframe architectures. However, I (and looking at the number of stackoverflow questions on this - many, many others) have run into this an astounding number of times.
Many times this is not even the developer that makes this decision but is handed some datatype which translates to a dataframe with this structure. Also, you can always make the argument of efficiency / performance for architecture differences and ultimately end up not using interpreted languages at all, but here we are. Python is used in many communities because of it's ability to do higher level functions, not because of it's performance.

Therefore, I would think it highly useful to have at least an unlistify method for dataframes, as it would make things much faster than trying to find the most efficient / least amount of lines of code stackoverflow question every time you run into this issue in different situations. That way people at least have a very simple method baked into pandas to 'correct' some of the bad architecture yielded by other data structures given to them.

jreback

jreback commented on Jan 17, 2018

@jreback
Contributor

@chaxor

I would think that the number of people that run into this problem of having lists or iterable structures within cells of a dataframe would give credence to this issue.

not sure what to make of that statement. pandas has almost 2200 issues, and no full time folks working on. People prioritize what they will. Do you want to submit a pull-request to fix this? if so great.

dmarinav

dmarinav commented on Jan 31, 2018

@dmarinav

Pandas might have 2200 issues (like any other software or app nowadays), but I believe this one is very important when dealing with unstructured data. I work with unstructured data, and listify() and unlistify() can be very handy here and save lots of valuable time.

jreback

jreback commented on Jan 31, 2018

@jreback
Contributor

@dmarinav and you are welcome to submit a fix. what folks choose to work on is pretty much up to them.

TomAugspurger

TomAugspurger commented on Feb 1, 2018

@TomAugspurger
Contributor

FWIW, I can see someone building a JSONArray on top of #19268 (probably not within pandas). I think that + a custom .json accessor would make for a useful library. Users could store nested data in that array type, and have a bunch of methods at only apply to nested data, like listify and unlistify.

summerela

summerela commented on Jun 15, 2018

@summerela

I end up having to do this kind of thing all the time.. and it's a complete PITA. Having a feature that allows us to "explode" a column containing lists into multiple rows would be wonderful.

orenovadia

orenovadia commented on Jun 12, 2019

@orenovadia

+1

I stumble upon a need for this many times (and made package with just this function).

I think that explode might be a better name for unlistify (see spark).

added this to the 0.25.0 milestone on Jul 6, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Metadata

Metadata

Assignees

No one assigned

    Labels

    API DesignReshapingConcat, Merge/Join, Stack/Unstack, Explode

    Type

    No type

    Projects

    No projects

    Relationships

    None yet

      Development

      Participants

      @dov@jreback@libbkmz@shoyer@TomAugspurger

      Issue actions

        New DataFrame feature: listify() and unlistify() · Issue #10511 · pandas-dev/pandas