
Pandas Split Column

Given the following data frame:

import pandas as pd
import numpy as np

df = pd.DataFrame({
       'A' : ['a', 'b','c', 'd'],
       'B' : ['Y>`abcd', 'abcd','efgh', 'Y>`efgh']
    })

Solution 1:

You can use str.extract with fillna, then remove column B with drop:

df[['C','D']] = df['B'].str.extract('(.*)>`(.*)', expand=True)
df['D'] = df['D'].fillna(df['B'])
df['C'] = df['C'].fillna('')
df = df.drop('B', axis=1)

print(df)

   A  C     D
0  a  Y  abcd
1  b     abcd
2  c     efgh
3  d  Y  efgh
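
To see why the two fillna calls are needed, it may help to look at what str.extract returns before any filling. The sketch below is only an illustration: the names raw and filled are hypothetical, and the sample frame is the one from the question. Rows that do not contain the separator do not match the pattern, so both capture groups come back as missing values.

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['a', 'b', 'c', 'd'],
    'B': ['Y>`abcd', 'abcd', 'efgh', 'Y>`efgh'],
})

# Unnamed capture groups produce integer column labels 0 and 1.
# Rows without the '>`' separator do not match, so both groups are missing there.
raw = df['B'].str.extract(r'(.*)>`(.*)', expand=True)
print(raw)

# Repair the non-matching rows: the prefix becomes an empty string,
# and the value falls back to the original content of column B.
filled = pd.DataFrame({
    'C': raw[0].fillna(''),
    'D': raw[1].fillna(df['B']),
})
print(filled)
```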

The next solution uses str.split with a mask and numpy.where:

df[['C','D']] = df['B'].str.split('>`', expand=True)
mask = pd.notnull(df['D'])
df['D'] = df['D'].fillna(df['C'])
df['C'] = np.where(mask, df['C'], '')
df = df.drop('B', axis=1)
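
The mask exists because str.split with expand=True puts an unsplit value into the first column and leaves the second one missing. A minimal sketch of that intermediate state, assuming the sample data from the question (the names split, has_sep and result are illustrative only):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'B': ['Y>`abcd', 'abcd', 'efgh', 'Y>`efgh']})

# Rows without the separator yield a single piece: it lands in column 0
# and column 1 is left missing.
split = df['B'].str.split('>`', expand=True)
print(split)

# Remember which rows really contained the separator before filling anything.
has_sep = split[1].notna()

result = pd.DataFrame({
    # Keep the prefix only where a separator was found.
    'C': np.where(has_sep, split[0], ''),
    # Where column 1 is missing, the whole value sits in column 0.
    'D': split[1].fillna(split[0]),
})
print(result)
```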

Timings:

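On a large DataFrame the extract solution (b) is about 100 times faster than the apply-based solution (a); on a small one it is about 1.5 times faster. In the timings, a is Solution 2, b is the str.extract solution, c is Solution 3 and d is the str.split solution with a mask: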

len(df)=4:

In [438]: %timeit a(df)
100 loops, best of 3: 2.96 ms per loop

In [439]: %timeit b(df1)
1000 loops, best of 3: 1.86 ms per loop

In [440]: %timeit c(df2)
The slowest run took 4.44 times longer than the fastest. This could mean that an intermediate result is being cached
1000 loops, best of 3: 1.89 ms per loop

In [441]: %timeit d(df3)
The slowest run took 4.62 times longer than the fastest. This could mean that an intermediate result is being cached
1000 loops, best of 3: 1.82 ms per loop

len(df)=4k:

In [443]: %timeit a(df)
1 loops, best of 3: 799 ms per loop

In [444]: %timeit b(df1)
The slowest run took 4.19 times longer than the fastest. This could mean that an intermediate result is being cached
100 loops, best of 3: 7.37 ms per loop

In [445]: %timeit c(df2)
1 loops, best of 3: 552 ms per loop

In [446]: %timeit d(df3)
100 loops, best of 3: 9.55 ms per loop

Code:

import pandas as pd
import numpy as np  # needed by d(df), which calls np.where

df = pd.DataFrame({
       'A' : ['a', 'b','c', 'd'],
       'B' : ['Y>`abcd', 'abcd','efgh', 'Y>`efgh']
    })
# for the 4k-row test
df = pd.concat([df]*1000).reset_index(drop=True)
df1,df2,df3 = df.copy(),df.copy(),df.copy()

def b(df):
    df[['C','D']] = df['B'].str.extract('(.*)>`(.*)', expand=True)
    df['D'] = df['D'].fillna(df['B'])
    df['C'] = df['C'].fillna('')
    df = df.drop('B', axis=1)
    return df

def a(df):
    df = pd.concat([df, df.B.str.split('>').apply(
    lambda l: pd.Series({'C': l[0], 'D': l[1][1: ]}) if len(l) == 2 else \
        pd.Series({'C': '', 'D': l[0]}))], axis=1)
    del df['B']
    return df

def c(df):
    df[['C','D']] = df['B'].str.split('>`').apply(lambda x: pd.Series(['']*(2-len(x)) + x))
    df = df.drop('B', axis=1)
    return df

def d(df):
    df[['C','D']] = df['B'].str.split('>`', expand=True)
    mask = pd.notnull(df['D'])
    df['D'] = df['D'].fillna(df['C'])
    df['C'] = np.where(mask, df['C'], '')
    df = df.drop('B', axis=1)
    return df
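
The %timeit calls above only work inside IPython. Outside of it, a rough equivalent can be sketched with the standard timeit module. The function name extract_split and the repeat/number settings below are assumptions made for illustration; the function follows the same steps as b(df) but works on a copy, so repeated calls do not keep adding columns to the input frame. Absolute numbers will differ by machine and pandas version.

```python
import timeit

import pandas as pd


def extract_split(frame):
    # Same steps as b(df), applied to a copy so the caller's frame is untouched.
    out = frame.copy()
    out[['C', 'D']] = out['B'].str.extract(r'(.*)>`(.*)', expand=True)
    out['D'] = out['D'].fillna(out['B'])
    out['C'] = out['C'].fillna('')
    return out.drop('B', axis=1)


small = pd.DataFrame({
    'A': ['a', 'b', 'c', 'd'],
    'B': ['Y>`abcd', 'abcd', 'efgh', 'Y>`efgh'],
})
large = pd.concat([small] * 1000).reset_index(drop=True)  # 4k rows

for name, frame in [('small', small), ('large', large)]:
    # Best of 3 batches, each batch making 10 calls.
    best = min(timeit.repeat(lambda: extract_split(frame), repeat=3, number=10))
    print(f'{name}: {best / 10 * 1000:.2f} ms per call')
```

The other three functions can be timed the same way by swapping them into the lambda.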

Solution 2:

Performing a str.split followed by an apply returning a pd.Series will create the new columns:

>>> df.B.str.split('>').apply(
    lambda l: pd.Series({'C': l[0], 'D': l[1][1: ]}) if len(l) == 2 else \
        pd.Series({'C': '', 'D': l[0]}))
    C   D
0   Y   abcd
1       abcd
2       efgh
3   Y   efgh

So you can concat this to the DataFrame, and del the original column:

df = pd.concat([df, df.B.str.split('>').apply(
    lambda l: pd.Series({'C': l[0], 'D': l[1][1: ]}) if len(l) == 2 else \
        pd.Series({'C': '', 'D': l[0]}))],
    axis=1)
del df['B']
>>> df
    A   C   D
0   a   Y   abcd
1   b       abcd
2   c       efgh
3   d   Y   efgh
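
The lambda is dense, so here is the same per-row logic written out as a plain function. This is only a readability sketch; split_row is a hypothetical name that does not appear in the answer. Because the split is on '>' alone, the second piece still starts with the backtick, which is why the slice [1:] is applied.

```python
def split_row(value):
    # Hypothetical helper mirroring the lambda used in this solution.
    parts = value.split('>')
    if len(parts) == 2:
        # parts[1] still begins with the backtick, so drop its first character.
        return {'C': parts[0], 'D': parts[1][1:]}
    # No separator: the prefix is empty and the whole value goes to D.
    return {'C': '', 'D': parts[0]}


print(split_row('Y>`abcd'))  # {'C': 'Y', 'D': 'abcd'}
print(split_row('efgh'))     # {'C': '', 'D': 'efgh'}
```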

Solution 3:

I would use a one-liner:

df['B'].str.split('>`').apply(lambda x: pd.Series(['']*(2-len(x)) + x))

#    0     1
# 0  Y  abcd
# 1     abcd
# 2     efgh
# 3  Y  efgh
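
The key part of the one-liner is the expression ['']*(2-len(x)) + x, which left-pads a short list with empty strings so that every row ends up with exactly two items. A pure-Python sketch of that step, with pad_left as a hypothetical helper name:

```python
def pad_left(parts, width=2):
    # Prepend empty strings until the list has `width` items.
    return [''] * (width - len(parts)) + parts


rows = ['Y>`abcd', 'abcd', 'efgh', 'Y>`efgh']
padded = [pad_left(row.split('>`')) for row in rows]
print(padded)
# [['Y', 'abcd'], ['', 'abcd'], ['', 'efgh'], ['Y', 'efgh']]
```

When the list already has two items, the multiplier is zero and nothing is prepended.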

Solution 4:

The simplest and most memory-efficient way of doing it is:

df[['C', 'D']] = df.B.str.split('>`', expand=True)
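
One thing to check before relying on this one-liner: for rows without the separator, the value ends up in C and D is left missing, which is not the same layout as the output tables in the other solutions. The sketch below shows that, and then tries str.rpartition as a possible alternative. rpartition is not part of the original answer; it is a technique swapped in here for illustration, and the names one_liner and parts are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['a', 'b', 'c', 'd'],
    'B': ['Y>`abcd', 'abcd', 'efgh', 'Y>`efgh'],
})

# The one-liner as written: unsplit values stay in C, D is missing for them.
one_liner = df.copy()
one_liner[['C', 'D']] = one_liner['B'].str.split('>`', expand=True)
print(one_liner)

# Alternative sketch: rpartition always returns three pieces
# (before, separator, after). Without a separator the first two are
# empty strings and the whole value is in the last piece.
parts = df['B'].str.rpartition('>`')
alt = df.drop('B', axis=1).assign(C=parts[0], D=parts[2])
print(alt)
```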
