Expand Pandas Dataframe Based On Range In A Column
I have a pandas dataframe like this:
Name   SICs
Agric  0100-0199
Agric  0910-0919
Agric  2048-2048
Food   2000-2009
Food   2010-2019
Soda   2097-2097
The SICs column gives a ran
Solution 1:
Quick and dirty but I think this gets you to what you need:
from io import StringIO

import pandas as pd

players = StringIO(u"""Name,SICs
Agric,0100-0199
Agric,0210-0211
Food,2048-2048
Soda,1198-1200""")
# DataFrame.from_csv was removed in pandas 1.0; read_csv is its replacement
df = pd.read_csv(players, sep=",")

df2 = pd.DataFrame(columns=('Name', 'SIC'))
count = 0
for idx, r in df.iterrows():
    # Split "0100-0199" into its two endpoints
    data = r['SICs'].split("-")
    # range() excludes the stop value, hence the +1
    for i in range(int(data[0]), int(data[1]) + 1):
        df2.loc[count] = (r['Name'], i)
        count += 1
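Assigning to df2.loc one row at a time gets slow as the frame grows. A minimal sketch of the same loop that collects the rows in a plain list and builds the frame once at the end; the expand_ranges helper name is made up for illustration and is not part of pandas:

```python
import pandas as pd


def expand_ranges(df):
    """Return one (Name, SIC) row per code in each inclusive 'lo-hi' range."""
    rows = []
    for name, sic_range in zip(df["Name"], df["SICs"]):
        lo, hi = sic_range.split("-")
        # range() excludes the stop value, hence the +1
        rows.extend((name, code) for code in range(int(lo), int(hi) + 1))
    # Build the frame once instead of growing it row by row
    return pd.DataFrame(rows, columns=["Name", "SIC"])


df = pd.DataFrame({
    "Name": ["Agric", "Agric", "Food", "Soda"],
    "SICs": ["0100-0199", "0210-0211", "2048-2048", "1198-1200"],
})
df2 = expand_ranges(df)
```

This still loops in Python, but it avoids the repeated reallocation that comes with enlarging a DataFrame through .loc.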
Solution 2:
The neatest way I found (building on Andy Hayden's answer):
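import numpy as np
import pandas as pd

# Extract the lower and upper bound of each SIC range
df = df.set_index("Name")
df = df['SICs'].str.extract(r"(\d+)-(\d+)")
df.columns = ['min', 'max']
df = df.astype('int')
# Enumerate the codes of each range into a wide table (one row per range)
enumerated_codes = [np.arange(row['min'], row['max'] + 1) for _, row in df.iterrows()]
df = pd.DataFrame.from_records(data=enumerated_codes, index=df.index)
# Convert from wide to long; shorter ranges are padded with NaN,
# so drop the padding and restore the integer dtype
df = df.stack().dropna().astype('int').reset_index(1, drop=True)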
It is, however, slow because of the Python-level loop over the rows. A vectorised solution would be amazing, but I can't find one.
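One way to avoid the row loop is to let NumPy repeat each row by the length of its range and add a running offset. This is only a sketch: it assumes every range has the form lo-hi with lo <= hi, and the variable names (bounds, lengths, offsets, expanded) are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Name": ["Agric", "Agric", "Food", "Soda"],
    "SICs": ["0100-0199", "0210-0211", "2048-2048", "1198-1200"],
})

# Parse both endpoints of every range in one pass
bounds = df["SICs"].str.extract(r"(\d+)-(\d+)").astype(int)
lo = bounds[0].to_numpy()
hi = bounds[1].to_numpy()

# Number of codes in each (inclusive) range
lengths = hi - lo + 1

# Position of every output row inside its own range: 0, 1, 2, ...
starts = np.cumsum(lengths) - lengths
offsets = np.arange(lengths.sum()) - np.repeat(starts, lengths)

expanded = pd.DataFrame({
    "Name": np.repeat(df["Name"].to_numpy(), lengths),
    "SIC": np.repeat(lo, lengths) + offsets,
})
```

All the work happens in np.repeat, np.cumsum and np.arange, so the cost no longer depends on a Python loop over rows.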
Solution 3:
You can use str.extract to get strings from a regular expression:
In [11]: df
Out[11]:
   Name       SICs
0  Agri  0100-0199
1  Agri  0910-0919
2  Food  2000-2009
First move Name into the index, since that's the thing we want to keep:
In [12]: df1 = df.set_index("Name")
In [13]: df1
Out[13]:
           SICs
Name
Agri  0100-0199
Agri  0910-0919
Food  2000-2009
In [14]: df1['SICs'].str.extract(r"(\d+)-(\d+)")
Out[14]:
         0     1
Name
Agri  0100  0199
Agri  0910  0919
Food  2000  2009
Then flatten this with stack (which adds a MultiIndex):
In [15]: df1['SICs'].str.extract(r"(\d+)-(\d+)").stack()
Out[15]:
Name
Agri  0    0100
      1    0199
      0    0910
      1    0919
Food  0    2000
      1    2009
dtype: object
If you must, you can remove the 0-1 level of the MultiIndex:
In [16]: df1['SICs'].str.extract(r"(\d+)-(\d+)").stack().reset_index(1, drop=True)
Out[16]:
Name
Agri    0100
Agri    0199
Agri    0910
Agri    0919
Food    2000
Food    2009
dtype: object
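The steps above stop at the two endpoints of each range. A hedged sketch of one way to carry on to the full expansion, assuming pandas 0.25 or later (for Series.explode) and assuming the codes should stay zero-padded to four digits as in the input; the names bounds, ranges, codes and result are illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Agri", "Agri", "Food"],
    "SICs": ["0100-0199", "0910-0919", "2000-2009"],
})

df1 = df.set_index("Name")
bounds = df1["SICs"].str.extract(r"(\d+)-(\d+)").astype(int)

# One list of codes per input row (inclusive of the upper bound)
ranges = pd.Series(
    [list(range(lo, hi + 1)) for lo, hi in zip(bounds[0], bounds[1])],
    index=bounds.index,
)

# explode() turns each list element into its own row and repeats the index
codes = ranges.explode().astype(int).astype(str).str.zfill(4)

# Back to an ordinary two-column frame
result = codes.rename("SIC").reset_index()
```

Drop the astype(str).str.zfill(4) step if integer codes are preferred over the padded strings.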