python – How to apply a function to two columns of Pandas dataframe

The Question :

441 people think this question is useful

Suppose I have a df which has columns of 'ID', 'col_1', 'col_2'. And I define a function :

f = lambda x, y : my_function_expression.

Now I want to apply the f to df‘s two columns 'col_1', 'col_2' to element-wise calculate a new column 'col_3' , somewhat like :

df['col_3'] = df[['col_1','col_2']].apply(f)  
# Pandas gives : TypeError: ('<lambda>() takes exactly 2 arguments (1 given)'

How to do ?

** Add detail sample as below ***

import pandas as pd

df = pd.DataFrame({'ID':['1','2','3'], 'col_1': [0,2,3], 'col_2':[1,4,5]})
mylist = ['a','b','c','d','e','f']

def get_sublist(sta,end):
    return mylist[sta:end+1]

#df['col_3'] = df[['col_1','col_2']].apply(get_sublist,axis=1)
# expect above to output df as below 

  ID  col_1  col_2            col_3
0  1      0      1       ['a', 'b']
1  2      2      4  ['c', 'd', 'e']
2  3      3      5  ['d', 'e', 'f']

The Question Comments :
  • can you apply f directly to columns: df[‘col_3’] = f(df[‘col_1’],df[‘col_2’])
  • would be useful to know what f is doing
  • no, df[‘col_3’] = f(df[‘col_1’],df[‘col_2’]) not work. For f only accepts scalar input , not vector inputs. OK, you can assume f = lambda x,y : x+y . (of course, my real f is not that simple, otherwise i can directly df[‘col_3’] = df[‘col_1’] + df[‘col_2’] )
  • I found a related Q&A at below url, but my issue is calculating a new column by two existing columns, not 2 from 1 . stackoverflow.com/questions/12356501/…
  • I think my response stackoverflow.com/a/52854800/5447172 answers this in the most Pythonic / Pandanic way, with no workarounds or numeric indexing. It produces exactly the output you required in your example.

The Answer 1

352 people think this answer is useful

Here’s an example using apply on the dataframe, which I am calling with axis = 1.

Note the difference is that instead of trying to pass two values to the function f, rewrite the function to accept a pandas Series object, and then index the Series to get the values needed.

In [49]: df
Out[49]: 
          0         1
0  1.000000  0.000000
1 -0.494375  0.570994
2  1.000000  0.000000
3  1.876360 -0.229738
4  1.000000  0.000000

In [50]: def f(x):    
   ....:  return x[0] + x[1]  
   ....:  

In [51]: df.apply(f, axis=1) #passes a Series object, row-wise
Out[51]: 
0    1.000000
1    0.076619
2    1.000000
3    1.646622
4    1.000000

Depending on your use case, it is sometimes helpful to create a pandas group object, and then use apply on the group.

The Answer 2

256 people think this answer is useful

There is a clean, one-line way of doing this in Pandas:

df['col_3'] = df.apply(lambda x: f(x.col_1, x.col_2), axis=1)

This allows f to be a user-defined function with multiple input values, and uses (safe) column names rather than (unsafe) numeric indices to access the columns.

Example with data (based on original question):

import pandas as pd

df = pd.DataFrame({'ID':['1', '2', '3'], 'col_1': [0, 2, 3], 'col_2':[1, 4, 5]})
mylist = ['a', 'b', 'c', 'd', 'e', 'f']

def get_sublist(sta,end):
    return mylist[sta:end+1]

df['col_3'] = df.apply(lambda x: get_sublist(x.col_1, x.col_2), axis=1)

Output of print(df):

  ID  col_1  col_2      col_3
0  1      0      1     [a, b]
1  2      2      4  [c, d, e]
2  3      3      5  [d, e, f]

If your column names contain spaces or share a name with an existing dataframe attribute, you can index with square brackets:

df['col_3'] = df.apply(lambda x: f(x['col 1'], x['col 2']), axis=1)

The Answer 3

103 people think this answer is useful

A simple solution is:

df['col_3'] = df[['col_1','col_2']].apply(lambda x: f(*x), axis=1)

The Answer 4

43 people think this answer is useful

A interesting question! my answer as below:

import pandas as pd

def sublst(row):
    return lst[row['J1']:row['J2']]

df = pd.DataFrame({'ID':['1','2','3'], 'J1': [0,2,3], 'J2':[1,4,5]})
print df
lst = ['a','b','c','d','e','f']

df['J3'] = df.apply(sublst,axis=1)
print df

Output:

  ID  J1  J2
0  1   0   1
1  2   2   4
2  3   3   5
  ID  J1  J2      J3
0  1   0   1     [a]
1  2   2   4  [c, d]
2  3   3   5  [d, e]

I changed the column name to ID,J1,J2,J3 to ensure ID < J1 < J2 < J3, so the column display in right sequence.

One more brief version:

import pandas as pd

df = pd.DataFrame({'ID':['1','2','3'], 'J1': [0,2,3], 'J2':[1,4,5]})
print df
lst = ['a','b','c','d','e','f']

df['J3'] = df.apply(lambda row:lst[row['J1']:row['J2']],axis=1)
print df

The Answer 5

25 people think this answer is useful

The method you are looking for is Series.combine. However, it seems some care has to be taken around datatypes. In your example, you would (as I did when testing the answer) naively call

df['col_3'] = df.col_1.combine(df.col_2, func=get_sublist)

However, this throws the error:

ValueError: setting an array element with a sequence.

My best guess is that it seems to expect the result to be of the same type as the series calling the method (df.col_1 here). However, the following works:

df['col_3'] = df.col_1.astype(object).combine(df.col_2, func=get_sublist)

df

   ID   col_1   col_2   col_3
0   1   0   1   [a, b]
1   2   2   4   [c, d, e]
2   3   3   5   [d, e, f]

The Answer 6

12 people think this answer is useful

The way you have written f it needs two inputs. If you look at the error message it says you are not providing two inputs to f, just one. The error message is correct.
The mismatch is because df[[‘col1′,’col2’]] returns a single dataframe with two columns, not two separate columns.

You need to change your f so that it takes a single input, keep the above data frame as input, then break it up into x,y inside the function body. Then do whatever you need and return a single value.

You need this function signature because the syntax is .apply(f) So f needs to take the single thing = dataframe and not two things which is what your current f expects.

Since you haven’t provided the body of f I can’t help in anymore detail – but this should provide the way out without fundamentally changing your code or using some other methods rather than apply

The Answer 7

12 people think this answer is useful

I’m going to put in a vote for np.vectorize. It allows you to just shoot over x number of columns and not deal with the dataframe in the function, so it’s great for functions you don’t control or doing something like sending 2 columns and a constant into a function (i.e. col_1, col_2, ‘foo’).

import numpy as np
import pandas as pd

df = pd.DataFrame({'ID':['1','2','3'], 'col_1': [0,2,3], 'col_2':[1,4,5]})
mylist = ['a','b','c','d','e','f']

def get_sublist(sta,end):
    return mylist[sta:end+1]

#df['col_3'] = df[['col_1','col_2']].apply(get_sublist,axis=1)
# expect above to output df as below 

df.loc[:,'col_3'] = np.vectorize(get_sublist, otypes=["O"]) (df['col_1'], df['col_2'])


df

ID  col_1   col_2   col_3
0   1   0   1   [a, b]
1   2   2   4   [c, d, e]
2   3   3   5   [d, e, f]

The Answer 8

12 people think this answer is useful

Returning a list from apply is a dangerous operation as the resulting object is not guaranteed to be either a Series or a DataFrame. And exceptions might be raised in certain cases. Let’s walk through a simple example:

df = pd.DataFrame(data=np.random.randint(0, 5, (5,3)),
                  columns=['a', 'b', 'c'])
df
   a  b  c
0  4  0  0
1  2  0  1
2  2  2  2
3  1  2  2
4  3  0  0

There are three possible outcomes with returning a list from apply

1) If the length of the returned list is not equal to the number of columns, then a Series of lists is returned.

df.apply(lambda x: list(range(2)), axis=1)  # returns a Series
0    [0, 1]
1    [0, 1]
2    [0, 1]
3    [0, 1]
4    [0, 1]
dtype: object

2) When the length of the returned list is equal to the number of columns then a DataFrame is returned and each column gets the corresponding value in the list.

df.apply(lambda x: list(range(3)), axis=1) # returns a DataFrame
   a  b  c
0  0  1  2
1  0  1  2
2  0  1  2
3  0  1  2
4  0  1  2

3) If the length of the returned list equals the number of columns for the first row but has at least one row where the list has a different number of elements than number of columns a ValueError is raised.

i = 0
def f(x):
    global i
    if i == 0:
        i += 1
        return list(range(3))
    return list(range(4))

df.apply(f, axis=1) 
ValueError: Shape of passed values is (5, 4), indices imply (5, 3)

Answering the problem without apply

Using apply with axis=1 is very slow. It is possible to get much better performance (especially on larger datasets) with basic iterative methods.

Create larger dataframe

df1 = df.sample(100000, replace=True).reset_index(drop=True)

Timings

# apply is slow with axis=1
%timeit df1.apply(lambda x: mylist[x['col_1']: x['col_2']+1], axis=1)
2.59 s ± 76.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# zip - similar to @Thomas
%timeit [mylist[v1:v2+1] for v1, v2 in zip(df1.col_1, df1.col_2)]  
29.5 ms ± 534 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

@Thomas answer

%timeit list(map(get_sublist, df1['col_1'],df1['col_2']))
34 ms ± 459 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

The Answer 9

7 people think this answer is useful

I’m sure this isn’t as fast as the solutions using Pandas or Numpy operations, but if you don’t want to rewrite your function you can use map. Using the original example data –

import pandas as pd

df = pd.DataFrame({'ID':['1','2','3'], 'col_1': [0,2,3], 'col_2':[1,4,5]})
mylist = ['a','b','c','d','e','f']

def get_sublist(sta,end):
    return mylist[sta:end+1]

df['col_3'] = list(map(get_sublist,df['col_1'],df['col_2']))
#In Python 2 don't convert above to list

We could pass as many arguments as we wanted into the function this way. The output is what we wanted

ID  col_1  col_2      col_3
0  1      0      1     [a, b]
1  2      2      4  [c, d, e]
2  3      3      5  [d, e, f]

The Answer 10

2 people think this answer is useful

My example to your questions:

def get_sublist(row, col1, col2):
    return mylist[row[col1]:row[col2]+1]
df.apply(get_sublist, axis=1, col1='col_1', col2='col_2')

The Answer 11

2 people think this answer is useful

If you have a huge data-set, then you can use an easy but faster(execution time) way of doing this using swifter:

import pandas as pd
import swifter

def fnc(m,x,c):
    return m*x+c

df = pd.DataFrame({"m": [1,2,3,4,5,6], "c": [1,1,1,1,1,1], "x":[5,3,6,2,6,1]})
df["y"] = df.swifter.apply(lambda x: fnc(x.m, x.x, x.c), axis=1)

The Answer 12

1 people think this answer is useful

I suppose you don’t want to change get_sublist function, and just want to use DataFrame’s apply method to do the job. To get the result you want, I’ve wrote two help functions: get_sublist_list and unlist. As the function name suggest, first get the list of sublist, second extract that sublist from that list. Finally, We need to call apply function to apply those two functions to the df[['col_1','col_2']] DataFrame subsequently.

import pandas as pd

df = pd.DataFrame({'ID':['1','2','3'], 'col_1': [0,2,3], 'col_2':[1,4,5]})
mylist = ['a','b','c','d','e','f']

def get_sublist(sta,end):
    return mylist[sta:end+1]

def get_sublist_list(cols):
    return [get_sublist(cols[0],cols[1])]

def unlist(list_of_lists):
    return list_of_lists[0]

df['col_3'] = df[['col_1','col_2']].apply(get_sublist_list,axis=1).apply(unlist)

df

If you don’t use [] to enclose the get_sublist function, then the get_sublist_list function will return a plain list, it’ll raise ValueError: could not broadcast input array from shape (3) into shape (2), as @Ted Petrou had mentioned.

Add a Comment