Pandas Split Column
Given the following data frame:
import pandas as pd
import numpy as np
df = pd.DataFrame({
'A' : ['a', 'b','c', 'd'],
'B' : ['Y>`abcd', 'abcd','efgh', 'Y>`efgh']
})
Solution 1:
You can use str.extract with fillna, then remove column B with drop:
df[['C','D']] = df['B'].str.extract('(.*)>`(.*)', expand=True)
df['D'] = df['D'].fillna(df['B'])
df['C'] = df['C'].fillna('')
df = df.drop('B', axis=1)
print(df)
   A  C     D
0  a  Y  abcd
1  b     abcd
2  c     efgh
3  d  Y  efgh
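To see why the two fillna calls are needed, it can help to look at what str.extract returns before anything is filled in. The following is a minimal sketch on the question's frame; the names parts, c_col and d_col are only illustrative and do not appear in the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['a', 'b', 'c', 'd'],
    'B': ['Y>`abcd', 'abcd', 'efgh', 'Y>`efgh'],
})

# Rows without the separator do not match the pattern at all,
# so both capture groups come back as missing values for them.
parts = df['B'].str.extract('(.*)>`(.*)', expand=True)
print(parts)

# Unmatched rows: C becomes an empty string, D falls back to the original text.
c_col = parts[0].fillna('')
d_col = parts[1].fillna(df['B'])
print(c_col.tolist())
print(d_col.tolist())
```

The two unnamed capture groups become columns 0 and 1, which the answer assigns to C and D.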
The next solution uses str.split with a mask and numpy.where:
df[['C','D']] = df['B'].str.split('>`', expand=True)
mask = pd.notnull(df['D'])
df['D'] = df['D'].fillna(df['C'])
df['C'] = np.where(mask, df['C'], '')
df = df.drop('B', axis=1)
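The role of the mask is easier to follow when the split result is kept in a separate variable. This is a sketch of the same steps, assuming the question's frame; parts is an illustrative name, and the final column reordering is added here only so the printed frame reads A, C, D.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': ['a', 'b', 'c', 'd'],
    'B': ['Y>`abcd', 'abcd', 'efgh', 'Y>`efgh'],
})

# With expand=True, a row without the separator keeps its whole string
# in the first column and gets a missing value in the second.
parts = df['B'].str.split('>`', expand=True)

# True where the separator was present.
mask = parts[1].notna()

# Without a separator the text belongs in D and C should be empty.
df['D'] = parts[1].fillna(parts[0])
df['C'] = np.where(mask, parts[0], '')
df = df.drop('B', axis=1)[['A', 'C', 'D']]
print(df)
```

The mask has to be computed before D is filled, because afterwards D no longer contains any missing values.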
Timings:
On a large DataFrame the extract solution is roughly 100 times faster; on a small one, about 1.5 times:
len(df)=4:
In [438]: %timeit a(df)
100 loops, best of 3: 2.96 ms per loop
In [439]: %timeit b(df1)
1000 loops, best of 3: 1.86 ms per loop
In [440]: %timeit c(df2)
The slowest run took 4.44 times longer than the fastest. This could mean that an intermediate result is being cached
1000 loops, best of 3: 1.89 ms per loop
In [441]: %timeit d(df3)
The slowest run took 4.62 times longer than the fastest. This could mean that an intermediate result is being cached
1000 loops, best of 3: 1.82 ms per loop
len(df)=4k:
In [443]: %timeit a(df)
1 loops, best of 3: 799 ms per loop
In [444]: %timeit b(df1)
The slowest run took 4.19 times longer than the fastest. This could mean that an intermediate result is being cached
100 loops, best of 3: 7.37 ms per loop
In [445]: %timeit c(df2)
1 loops, best of 3: 552 ms per loop
In [446]: %timeit d(df3)
100 loops, best of 3: 9.55 ms per loop
Code:
import pandas as pd
import numpy as np
df = pd.DataFrame({
'A' : ['a', 'b','c', 'd'],
'B' : ['Y>`abcd', 'abcd','efgh', 'Y>`efgh']
})
#for test 4k
df = pd.concat([df]*1000).reset_index(drop=True)
df1,df2,df3 = df.copy(),df.copy(),df.copy()
def b(df):
    df[['C','D']] = df['B'].str.extract('(.*)>`(.*)', expand=True)
    df['D'] = df['D'].fillna(df['B'])
    df['C'] = df['C'].fillna('')
    df = df.drop('B', axis=1)
    return df
def a(df):
    df = pd.concat([df, df.B.str.split('>').apply(
        lambda l: pd.Series({'C': l[0], 'D': l[1][1:]}) if len(l) == 2 else \
        pd.Series({'C': '', 'D': l[0]}))], axis=1)
    del df['B']
    return df
def c(df):
    df[['C','D']] = df['B'].str.split('>`').apply(lambda x: pd.Series(['']*(2-len(x)) + x))
    df = df.drop('B', axis=1)
    return df
def d(df):
    df[['C','D']] = df['B'].str.split('>`', expand=True)
    mask = pd.notnull(df['D'])
    df['D'] = df['D'].fillna(df['C'])
    df['C'] = np.where(mask, df['C'], '')
    df = df.drop('B', axis=1)
    return df
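The %timeit magic only works inside IPython. As a rough way to repeat the comparison from a plain script, the standard-library timeit module can be used instead. This is a sketch, not the original benchmark: with_extract and with_apply are illustrative names for cut-down versions of b and c, and the numbers will vary with the machine and the pandas version.

```python
import timeit

import pandas as pd

base = pd.DataFrame({
    'A': ['a', 'b', 'c', 'd'],
    'B': ['Y>`abcd', 'abcd', 'efgh', 'Y>`efgh'],
})
big = pd.concat([base] * 1000).reset_index(drop=True)  # 4k rows


def with_extract(df):
    # Vectorized: one regex pass over the whole column.
    out = df['B'].str.extract('(.*)>`(.*)', expand=True)
    return out[0].fillna(''), out[1].fillna(df['B'])


def with_apply(df):
    # Builds one pd.Series object per row.
    return df['B'].str.split('>`').apply(
        lambda x: pd.Series([''] * (2 - len(x)) + x))


for func in (with_extract, with_apply):
    seconds = timeit.timeit(lambda: func(big), number=3) / 3
    print(f'{func.__name__}: {seconds * 1000:.2f} ms per call')
```

The per-row construction of a pd.Series in the apply-based version is the likely reason it scales so much worse than the vectorized string methods.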
Solution 2:
Performing a str.split followed by an apply that returns a pd.Series will create the new columns:
>>> df.B.str.split('>').apply(
...     lambda l: pd.Series({'C': l[0], 'D': l[1][1:]}) if len(l) == 2 else \
...     pd.Series({'C': '', 'D': l[0]}))
   C     D
0  Y  abcd
1     abcd
2     efgh
3  Y  efgh
So you can concat this to the DataFrame and del the original column:
df = pd.concat([df, df.B.str.split('>').apply(
    lambda l: pd.Series({'C': l[0], 'D': l[1][1:]}) if len(l) == 2 else \
    pd.Series({'C': '', 'D': l[0]}))],
    axis=1)
del df['B']
>>> df
   A  C     D
0  a  Y  abcd
1  b     abcd
2  c     efgh
3  d  Y  efgh
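The conditional lambda is fairly dense. Written as a named function, its two branches are easier to see. This is a sketch with the same logic; split_row and new_cols are hypothetical names, not part of the original answer.

```python
import pandas as pd


def split_row(parts):
    # parts is the list that str.split('>') produced for one row.
    if len(parts) == 2:
        # 'Y>`abcd' -> ['Y', '`abcd']; the [1:] slice drops the leading backtick.
        return pd.Series({'C': parts[0], 'D': parts[1][1:]})
    # No separator: the whole string goes to D and C stays empty.
    return pd.Series({'C': '', 'D': parts[0]})


df = pd.DataFrame({
    'A': ['a', 'b', 'c', 'd'],
    'B': ['Y>`abcd', 'abcd', 'efgh', 'Y>`efgh'],
})

new_cols = df['B'].str.split('>').apply(split_row)
print(new_cols)
```

Because the split is on '>' alone, the backtick stays attached to the second piece, which is why the slice is needed.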
Solution 3:
I would use a one-liner:
df['B'].str.split('>`').apply(lambda x: pd.Series(['']*(2-len(x)) + x))
#    0     1
# 0  Y  abcd
# 1     abcd
# 2     efgh
# 3  Y  efgh
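The core of this one-liner is the list expression ['']*(2-len(x)) + x, which pads short lists on the left with empty strings. It can be tried on plain lists without pandas. In the sketch below, pad_left and its width parameter are hypothetical names used only for illustration.

```python
def pad_left(parts, width=2):
    # Prepend empty strings until the list has `width` items.
    # If the list is already long enough, [''] * 0 adds nothing.
    return [''] * (width - len(parts)) + parts


print(pad_left(['Y', 'abcd']))  # ['Y', 'abcd']
print(pad_left(['abcd']))       # ['', 'abcd']
```

Padding on the left is what puts a lone value into the second column rather than the first.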
Solution 4:
The simplest and most memory-efficient way of doing it is:
df[['C', 'D']] = df.B.str.split('>`', expand=True)
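It is worth checking what this line produces on the question's data before relying on it. The sketch below only runs the line and inspects the result; no_sep is an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['a', 'b', 'c', 'd'],
    'B': ['Y>`abcd', 'abcd', 'efgh', 'Y>`efgh'],
})

df[['C', 'D']] = df['B'].str.split('>`', expand=True)
print(df)

# Rows without the separator keep their whole string in C
# and have a missing value in D.
no_sep = df['D'].isna()
print(no_sep.tolist())
```

On rows without the separator the value lands in C rather than D, so this line alone does not match the output of the other solutions. It becomes equivalent once it is combined with the mask and fillna steps from Solution 1.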