Manipulating Data Frames Based On Different Columns
I have a data frame df with two columns called Rule_ID and Location. It has data like - Rule_ID Location [u'2c78g',u'df567',u'5ty78'] US [u'2c78g',u'd67g
Solution 1:
Here's one way, using apply to count the unique rule IDs per location:
In [235]: df.groupby('Location')['Rule_ID'].apply(lambda x: len(set(x.sum())))
Out[235]:
Location
India 3
Japan 3
US 4
Name: Rule_ID, dtype: int64
To count how often each rule ID occurs in each location:
In [236]: (df.groupby('Location')
             .apply(lambda x: pd.Series(x['Rule_ID'].sum()))
             .reset_index()
             .groupby(['Location', 0]).size())
Out[236]:
Location  0
India     2c78g     1
          d67gh     1
          df890o    1
Japan     5ty78     1
          d67gh     1
          df890o    1
US        2c78g     2
          5ty78     2
          df567     1
          df890o    1
dtype: int64
Details
Calling x.sum() on a column of lists concatenates them, so you can get the unique count by taking the length of the set of the combined list.
In [237]: df.groupby('Location')['Rule_ID'].apply(lambda x: x.sum())
Out[237]:
Location
India                         [2c78g, d67gh, df890o]
Japan                         [d67gh, df890o, 5ty78]
US       [2c78g, df567, 5ty78, 2c78g, 5ty78, df890o]
Name: Rule_ID, dtype: object
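The same idea can be sketched without relying on sum() over lists, which builds a new list at every step. This is only an illustration, not the answerer's method: it swaps in itertools.chain to flatten each group, the sample frame is hypothetical (reconstructed to be consistent with the outputs shown here, since the question's table is truncated), and count_unique_rules is a made-up helper name.

```python
from itertools import chain

import pandas as pd

# Hypothetical sample data, chosen to be consistent with the outputs in
# this answer; the original table in the question is truncated.
df = pd.DataFrame({
    "Rule_ID": [
        ["2c78g", "df567", "5ty78"],
        ["2c78g", "d67gh", "df890o"],
        ["d67gh", "df890o", "5ty78"],
        ["2c78g", "5ty78", "df890o"],
    ],
    "Location": ["US", "India", "Japan", "US"],
})


def count_unique_rules(lists):
    """Flatten one group's lists and count the distinct rule IDs."""
    return len(set(chain.from_iterable(lists)))


# One unique count per Location, indexed by Location.
unique_counts = df.groupby("Location")["Rule_ID"].apply(count_unique_rules)
print(unique_counts)
```

On this sample data the counts should agree with the len(set(x.sum())) version above; the difference is only in how the lists are flattened.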
Applying pd.Series to the combined list creates one row per element; you can then group by Location and the rule ID and take the size of each group.
In [240]: df.groupby('Location').apply(lambda x: pd.Series(x['Rule_ID'].sum()))
Out[240]:
Location
India     0     2c78g
          1     d67gh
          2    df890o
Japan     0     d67gh
          1    df890o
          2     5ty78
US        0     2c78g
          1     df567
          2     5ty78
          3     2c78g
          4     5ty78
          5    df890o
dtype: object
Solution 2:
You need to transform your data frame to long format (unnest the Rule_ID column), after which it is straightforward to summarize:
df_long = pd.DataFrame({
    "Rule_ID": [e for s in df.Rule_ID for e in s],
    "Location": df.Location.repeat(df.Rule_ID.str.len())
})

df_long.groupby('Location').Rule_ID.nunique()
# Location
# India    3
# Japan    3
# US       4
# Name: Rule_ID, dtype: int64

df_long.groupby(['Rule_ID', 'Location']).size()
# Rule_ID    Location
# u'2c78g'   India       1
#            US          2
# u'5ty78'   Japan       1
#            US          2
# u'd67gh'   India       1
#            Japan       1
# u'df567'   US          1
# u'df890o'  India       1
#            Japan       1
#            US          1
# dtype: int64
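On newer pandas versions (0.25 and later), the unnesting step can probably be written more compactly with DataFrame.explode. The sketch below is not part of the original answer; the sample frame is hypothetical, reconstructed to be consistent with the outputs above because the question's table is truncated.

```python
import pandas as pd

# Hypothetical sample data, chosen to be consistent with the outputs in
# the answers above; the original table in the question is truncated.
df = pd.DataFrame({
    "Rule_ID": [
        ["2c78g", "df567", "5ty78"],
        ["2c78g", "d67gh", "df890o"],
        ["d67gh", "df890o", "5ty78"],
        ["2c78g", "5ty78", "df890o"],
    ],
    "Location": ["US", "India", "Japan", "US"],
})

# explode() emits one row per list element and repeats Location as needed.
df_long = df.explode("Rule_ID")

# Number of distinct rule IDs seen in each location.
unique_per_location = df_long.groupby("Location")["Rule_ID"].nunique()

# Number of times each rule ID occurs in each location.
pair_counts = df_long.groupby(["Location", "Rule_ID"]).size()

print(unique_per_location)
print(pair_counts)
```

Note that explode keeps the original row index, so the index of df_long contains duplicates; call reset_index(drop=True) if a unique index is needed later.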