Description
While building up a DataFrame in several steps, I found it difficult to add a new "perpendicular" column, i.e. a column that adds another dimension to already existing columns. To solve this problem I got the idea that this may be done in two steps:
- Add a new column whose values at each cell is a python list of the values the new column takes on.
- Unlistify the column thereby creating a new row for each element in the above lists.
I.e. I propose two new DataFrame methods, listify() and unlistify().
listify(df, column)
: Takes as input a dataframe and the name of a column. It will do a groupby of the df for all columns except column and generate a single row where the values in the column cell is a list of the column column
values.
unlistify(df, column)
: Takes as input a dataframe and the name of a column. It will iterate over the values of the contents of column
for each row and generate a new row for each value.
The functions may be expanded to support multiple columns. listify()
may e.g. support a post processing function, that will be applied on the list.
The following python code illustrates these two functions. But obviously the functionality may be implemented more efficienctly on the C-level.
#!/usr/bin/python
import pandas as pd
def listify(df, column):
matches = [i for i,n in enumerate(df.columns)
if n==column]
if len(matches)==0:
raise Exception('Failed to find column named ' + column +'!')
if len(matches)>1:
raise Exception('More than one column named ' + column +'!')
old_index = df.index
col_idx = matches[0] + len(old_index.shape) # Since we will reset the index
column_names = list(df.index.names) + list(df.columns)
gb_cols = [c for c in column_names
if c!= column]
# Helper function to generate the squashed dataframe
def fnc(d):
row = list(d.values[0])
return pd.DataFrame([row[:col_idx]
+ [[v[col_idx] for v in list(d.values)]]
+ row[col_idx+1:]])
return (df
.reset_index()
.groupby(gb_cols)
.apply(fnc)
.rename(columns = lambda i : column_names[i])
.set_index(old_index.names)
)
def unlistify(df, column):
matches = [i for i,n in enumerate(df.columns)
if n==column]
if len(matches)==0:
raise Exception('Failed to find column named ' + column +'!')
if len(matches)>1:
raise Exception('More than one column named ' + column +'!')
col_idx = matches[0]
# Helper function to expand and repeat the column col_idx
def fnc(d):
row = list(d.values[0])
bef = row[:col_idx]
aft = row[col_idx+1:]
col = row[col_idx]
z = [bef + [c] + aft for c in col]
return pd.DataFrame(z)
col_idx += len(df.index.shape) # Since we will push reset the index
index_names = list(df.index.names)
column_names = list(index_names) + list(df.columns)
return (df
.reset_index()
.groupby(level=0,as_index=0)
.apply(fnc)
.rename(columns = lambda i :column_names[i])
.set_index(index_names)
)
# Examples of how to listify and unlistify a column.
df = pd.DataFrame([[11,range(5),10],
[22,range(3),20]],
columns = ['A','B','C']).set_index('C')
print 'org'
print df
print '--'
df = unlistify(df,'B')
print 'unlistify(df,B)'
print df
print '--'
df = listify(df,'B')
print 'listify(df,B)'
print df
The corresponding output:
org
A B
C
10 11 [0, 1, 2, 3, 4]
20 22 [0, 1, 2]
--
unlistify(df,B)
A B
C
10 11 0
10 11 1
10 11 2
10 11 3
10 11 4
20 22 0
20 22 1
20 22 2
--
listify(df,B)
A B
C
10 11 [0, 1, 2, 3, 4]
20 22 [0, 1, 2]
Activity
shoyer commentedon Jul 5, 2015
How do you create these data frames with nested lists? My sense is that there is almost always a better way.
I'm reluctant to expand the dataframe API for new methods unless they are broadly useful -- this is a large part of why we added the
.pipe
method.dov commentedon Jul 5, 2015
My use case is the use of a dataframe through various part of an experiment, containing planning, execution, data collection and analysis. At the planning stage when building the initial data frame I still don't know what one of the columns -- containing "perpendicular" (data independent from all the other columns) data will contain. Once I know it, I want to insert the new data. I.e. I want to go from:
to:
A B
C
10 11 0
10 11 1
10 11 2
10 11 3
10 11 4
20 22 0
20 22 1
20 22 2
With the unlistify() function this is trivial:
But perhaps there is a different simple way that I have missed?
I don't see how the
pipe()
method is relevant as it seems that it is just syntactic sugar for doing right-hand chaining of operators instead of left-hand function calls. I agree that it can be used together with my listify() and unlistify() implementations. If my proposal is not accepted, I will certainly use my implemtations, with or without pipe.shoyer commentedon Jul 5, 2015
Maybe you could show an example of how you make the dataframe? Presumably you're not reading it in from CSV. I guess my thought is that you might be able to easily create the "unlisted" dataframe in the first place.
Pipe is indeed a side point here. Mostly I mentioned it to point out that we are trying to put other libraries on equal footings to what we put in pandas proper.
shoyer commentedon Jul 5, 2015
For example, suppose the number of items in the lists depends on the other columns, e.g., suppose
df['c'] = [list(range(a, b)) for a, b in zip(df.a, df.b)]
. It might seem natural to use unlistify in this context. But you might just as easily writedf = pd.concat([pd.DataFrame({'a': a, 'b': b, 'c': list(range(a, b))}) for a, b in zip(df.a, df.b)]
.dov commentedon Jul 5, 2015
Thanks for the
pd.concat()
example. It is indeed simpler than my use ofgroupby()
. But still if e.g. df has 10 columns before adding a new column, the use of pd.concat() becomes quite cumbersome.In any case, I think it is a legitimate use case to start off with a a few dimensions and then unlisting new dimensions to add additional complexity. I don't think it is justified to force the user to declare all dimenions in advance. Initially I thought of adding a function to
MultiIndex.add_factor()
that would allow adding a new factor (like adding a new dimension to MultiIndex.from_product()). But I think theunlistify()
concept is more general.jreback commentedon Jul 5, 2015
@dov you are just using a much less-efficient form of multi-index. You lose all performance and indexing with the list/unlistify. Mainly the mixed-list structures force you to be in
object
dtype. Further I am not sure that the list-structure actually allows you to do any operations easily. Everything first requires conversions, again leading to less that desirable performance / storage characteristics.Compare to a regular multi-index
Not sure why one would prefer a non-native structure that would have not any advantages (and several key disadvantages) over the multi-index structure.
Maybe you could shed some light on why you are not using a multi-index structure.
dov commentedon Jul 5, 2015
I think the difference is that I see the DataFrame as something evolving in which you don't see the whole picture at the time of its construction. I will try to give an example. Let's say that you have an xy-table with a camera pictures of a plate filled with micro organisms. The field of view is much smaller than the plate. The goal is to image processing and classification of the micro organisms in the images.
Here are the steps that needs to be carried out:
Of course the dataframe in 3 could be created with references back to the image dataframe of 1. But it may make more sense to expand the dataframe in 1 to make room for the detected image property.
In this sense the dataframe is like a logbook for the experiment. It grows and possibly contracts as the experiment progresses.
This is my goal. The idea of the unlistify() function was just a means of doing this. shoyer showed me the same functionality can be achieved through
pd.concat()
.shoyer commentedon Jul 6, 2015
I think the idiomatic way to do such an operation in pandas would be to use Database-style DataFrame joining/merging.
libbkmz commentedon Apr 22, 2016
I'll save it here for better search for this issue.
This SO answer explain how to do this thing really easily.
http://stackoverflow.com/questions/27263805/pandas-when-cell-contents-are-lists-create-a-row-for-each-element-in-the-list
jreback commentedon Apr 22, 2016
as said above, this is not good data management, nor likely to be supported by pandas. this leads to really inefficient representations of data.
libbkmz commentedon Apr 22, 2016
@jreback Yeah, I agree with you. But, when I'm getting data from MongoDB with really nested structure, and I want to make some columns for data processing, this is only way to preprocess data and save shape references between origianl data and preprocessed dataset
jreback commentedon Apr 22, 2016
@libbkmz then you need a proper layer in between. you can certainly pre-process using pandas, but de-listifying (ala
exploding
) encourages poor data representation in the first place :<NirantK commentedon Jan 8, 2018
Hello @jreback , just wanted to ask if you want to close this issue? Do you want to tag it as
won't fix
for now?chaxor commentedon Jan 17, 2018
I would think that the number of people that run into this problem of having lists or iterable structures within cells of a dataframe would give credence to this issue. I fully understand that it is less efficient and there are likely better dataframe architectures. However, I (and looking at the number of stackoverflow questions on this - many, many others) have run into this an astounding number of times.
Many times this is not even the developer that makes this decision but is handed some datatype which translates to a dataframe with this structure. Also, you can always make the argument of efficiency / performance for architecture differences and ultimately end up not using interpreted languages at all, but here we are. Python is used in many communities because of it's ability to do higher level functions, not because of it's performance.
Therefore, I would think it highly useful to have at least an
unlistify
method for dataframes, as it would make things much faster than trying to find the most efficient / least amount of lines of code stackoverflow question every time you run into this issue in different situations. That way people at least have a very simple method baked into pandas to 'correct' some of the bad architecture yielded by other data structures given to them.jreback commentedon Jan 17, 2018
@chaxor
not sure what to make of that statement. pandas has almost 2200 issues, and no full time folks working on. People prioritize what they will. Do you want to submit a pull-request to fix this? if so great.
dmarinav commentedon Jan 31, 2018
Pandas might have 2200 issues (like any other software or app nowadays), but I believe this one is very important when dealing with unstructured data. I work with unstructured data, and listify() and unlistify() can be very handy here and save lots of valuable time.
jreback commentedon Jan 31, 2018
@dmarinav and you are welcome to submit a fix. what folks choose to work on is pretty much up to them.
TomAugspurger commentedon Feb 1, 2018
FWIW, I can see someone building a JSONArray on top of #19268 (probably not within pandas). I think that + a custom
.json
accessor would make for a useful library. Users could store nested data in that array type, and have a bunch of methods at only apply to nested data, likelistify
andunlistify
.summerela commentedon Jun 15, 2018
I end up having to do this kind of thing all the time.. and it's a complete PITA. Having a feature that allows us to "explode" a column containing lists into multiple rows would be wonderful.
orenovadia commentedon Jun 12, 2019
+1
I stumble upon a need for this many times (and made package with just this function).
I think that
explode
might be a better name forunlistify
(see spark).