55

I have a DataFrame df with a column containing labels for each row (in addition to some relevant data for each row). I have a dictionary labeldict with keys equal to the possible labels and values equal to 2-tuples of information related to that label. I'd like to tack two new columns onto my frame, one for each part of the 2-tuple corresponding to the label for each row.

Here is the setup:

import pandas as pd
import numpy as np

np.random.seed(1)
n = 10

labels = list('abcdef')
colors = ['red', 'green', 'blue']
sizes = ['small', 'medium', 'large']

labeldict = {c: (np.random.choice(colors), np.random.choice(sizes)) for c in labels}

df = pd.DataFrame({'label': np.random.choice(labels, n), 
                   'somedata': np.random.randn(n)})

I can get what I want by running:

df['color'], df['size'] = zip(*df['label'].map(labeldict))
print df

  label  somedata  color    size
0     b  0.196643    red  medium
1     c -1.545214  green   small
2     a -0.088104  green   small
3     c  0.852239  green   small
4     b  0.677234    red  medium
5     c -0.106878  green   small
6     a  0.725274  green   small
7     d  0.934889    red  medium
8     a  1.118297  green   small
9     c  0.055613  green   small

But how can I do this if I don't want to manually type out the two columns on the left side of the assignment? I.e. how can I create multiple new columns on the fly. For example, if I had 10-tuples in labeldict instead of 2-tuples, this would be a real pain as currently written. Here are a couple things that don't work:

# set up attrlist for later use
attrlist = ['color', 'size']

# non-working idea 1)
df[attrlist] = zip(*df['label'].map(labeldict))

# non-working idea 2)
df.loc[:, attrlist] = zip(*df['label'].map(labeldict))

This does work, but seems like a hack:

for a in attrlist:
    df[a] = 0
df[attrlist] = zip(*df['label'].map(labeldict))

Better solutions?

smci
  • 32,567
  • 20
  • 113
  • 146
8one6
  • 13,078
  • 12
  • 62
  • 84
  • 2
    FYI your last method will work in 0.13 (without initially creating the column) – Jeff Dec 29 '13 at 21:55
  • Thanks Jeff. I have a [similar question](http://stackoverflow.com/questions/25896241/creating-columns-and-assigning-a-constant-vector-to-every-row) inspired by your comment here. I am having a similar problem to the one stated here – Amelio Vazquez-Reina Sep 17 '14 at 16:58
  • Seems like the answer is pythonically simple: https://stackoverflow.com/a/34074894/3494126 – Ufos Oct 08 '19 at 09:57

6 Answers6

29

Just use result_type='expand' in pandas apply

df
Out[78]: 
   a  b
0  0  1
1  2  3
2  4  5
3  6  7
4  8  9

df[['mean', 'std', 'max']]=df[['a','b']].apply(mathOperationsTuple, axis=1, result_type='expand')

df
Out[80]: 
   a  b  mean  std  max
0  0  1   0.5  0.5  1.0
1  2  3   2.5  0.5  3.0
2  4  5   4.5  0.5  5.0
3  6  7   6.5  0.5  7.0
4  8  9   8.5  0.5  9.0

and here some copy paste code

import pandas as pd
import numpy as np

df = pd.DataFrame(np.arange(10).reshape(5,2), columns=['a','b'])
print('df',df, sep='\n')
print()
def mathOperationsTuple(arr):
    return np.mean(arr), np.std(arr), np.amax(arr)

df[['mean', 'std', 'max']]=df[['a','b']].apply(mathOperationsTuple, axis=1, result_type='expand')
print('df',df, sep='\n')
Markus Dutschke
  • 9,341
  • 4
  • 63
  • 58
  • Use of `apply` was what I wanted but although this answer was helpful it made the assignment and function interdependent based on the order of the columns used as input and output. I ended up using the approach from @EricNess instead to capture this mapping entirely within the helper function. – deaks Feb 12 '20 at 23:24
  • Better use this anser: https://stackoverflow.com/a/76098907/7128154 – Markus Dutschke Apr 25 '23 at 08:08
13

You can use merge instead:

>>> ld = pd.DataFrame(labeldict).T
>>> ld.columns = ['color', 'size']
>>> ld.index.name = 'label'
>>> df.merge(ld.reset_index(), on='label')
  label  somedata  color    size
0     b  1.462108    red  medium
1     c -2.060141  green   small
2     c  1.133769  green   small
3     c  0.042214  green   small
4     e -0.322417    red  medium
5     e -1.099891    red  medium
6     e -0.877858    red  medium
7     e  0.582815    red  medium
8     f -0.384054    red   large
9     d -0.172428    red  medium
alko
  • 46,136
  • 12
  • 94
  • 102
  • 4
    you can get away without the reset_index using merge's left_index and right_on. – Andy Hayden Dec 30 '13 at 00:50
  • Note that `merge` generates a new dataframe instead of adding columns to an existing one. Also if there is no index column in the dataframe of your problem, you can't use merge. – Louis Yang Dec 19 '20 at 09:18
8

Instead of doing what you're doing with labeldict, you could make that information into a DataFrame and then join it with your original one:

>>> labeldf = pandas.DataFrame([(np.random.choice(colors), np.random.choice(sizes)) for c in labels], columns=['color', 'size'], index=labels)
>>> df.join(labeldf, on='label')
  label  somedata  color    size
0     a -1.709973    red  medium
1     b  0.099109   blue  medium
2     a -0.427323    red  medium
3     b  0.474995   blue  medium
4     b -2.819208   blue  medium
5     d -0.998888    red   small
6     b  0.713357   blue  medium
7     d  0.331989    red   small
8     e -0.906240  green   large
9     c -0.501916   blue   large
BrenBarn
  • 242,874
  • 37
  • 412
  • 384
3

If you want to add multiple columns to a DataFrame as part of a method chain, you can use apply. The first step is to create a function that will transform a row represented as a Series into the form you want. Then you can call apply to use this function on each row.

def append_label_attributes(row: pd.Series, labelmap: dict) -> pd.Series:
    result = row.copy()
    result['color'] = labelmap[result['label']][0]
    result['size'] = labelmap[result['label']][1]
    return result

df = (
    pd.DataFrame(
    {
        'label': np.random.choice(labels, n),
        'somedata': np.random.randn(n)}
    )
    .apply(append_label_attributes, axis='columns', labelmap=labeldict)
)
Eric Ness
  • 10,119
  • 15
  • 48
  • 51
2

This should work:

df[['color','size']] = list(df['label'].apply(labeldict))
AfterFray
  • 1,751
  • 3
  • 17
  • 22
Margalit
  • 59
  • 1
0

pandas apply method support this out of the box

code

demo.py

import pandas as pd
import numpy as np

df = pd.DataFrame(np.arange(10).reshape(5, 2), columns=['a', 'b'])
df2 = df[['a', 'b']].apply(lambda ser: pd.Series({
    'mean': ser.mean(),
    'std': ser.std(ddof=0),
    'max': np.amax(ser.to_numpy())
}), axis=1)
df2 = pd.concat((df, df2), axis=1)

output

>>> from demo import df, df2
>>> df
   a  b
0  0  1
1  2  3
2  4  5
3  6  7
4  8  9
>>> df2
   a  b  mean  std  max
0  0  1   0.5  0.5  1.0
1  2  3   2.5  0.5  3.0
2  4  5   4.5  0.5  5.0
3  6  7   6.5  0.5  7.0
4  8  9   8.5  0.5  9.0

background

A few years after writing my first answer to this question I got a bit better overview of the mechanics in pandas.

The func argument in apply takes a function: pd.Series -> Any whereas from the type of Any the exact processing of the returned value into the DataFrame is deduced.

Objects passed to the function are Series objects whose index is either the DataFrame’s index (axis=0) or the DataFrame’s columns (axis=1). By default (result_type=None), the final return type is inferred from the return type of the applied function.

If you pass a lambda function: pd.Series -> pd.Series, you get exactly the required behavior.

Note

apply can also be applied on pd.groupby outputs, achieving a more flexible alternative to .aggregate. You can find an example in the question: Python pandas groupby aggregate on multiple columns, then pivot

Markus Dutschke
  • 9,341
  • 4
  • 63
  • 58