
Manipulating Data Frames Based On Different Columns

I have a data frame df with two columns called Rule_ID and Location. It has data like - Rule_ID Location [u'2c78g',u'df567',u'5ty78'] US [u'2c78g',u'd67g

Solution 1:

Here's one way

Using apply

In [235]: df.groupby('Location')['Rule_ID'].apply(lambda x: len(set(x.sum())))
Out[235]:
Location
India    3
Japan    3
US       4
Name: Rule_ID, dtype: int64

To count how often each rule appears within each location:

In [236]: (df.groupby('Location')
             .apply(lambda x: pd.Series(x['Rule_ID'].sum()))
             .reset_index()
             .groupby(['Location', 0]).size())
Out[236]:
Location  0
India     2c78g     1
          d67gh     1
          df890o    1
Japan     5ty78     1
          d67gh     1
          df890o    1
US        2c78g     2
          5ty78     2
          df567     1
          df890o    1
dtype: int64

Details

Calling x.sum() on a Series of lists concatenates them into a single list. Wrapping that list in set() and taking its len() gives the number of unique IDs.

In [237]: df.groupby('Location')['Rule_ID'].apply(lambda x: x.sum())
Out[237]:
Location
India                         [2c78g, d67gh, df890o]
Japan                         [d67gh, df890o, 5ty78]
US       [2c78g, df567, 5ty78, 2c78g, 5ty78, df890o]
Name: Rule_ID, dtype: object
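The concatenate-then-deduplicate idea can be tried on a small frame. The sketch below is only an illustration: the sample rows are reconstructed from the outputs shown in this answer (the question's original data is truncated), and `itertools.chain` stands in for `x.sum()` as one way to flatten the lists without relying on `sum` over object values. The name `unique_counts` is made up for the example.

```python
from itertools import chain

import pandas as pd

# Sample data reconstructed from the outputs in this answer (an assumption).
df = pd.DataFrame({
    "Rule_ID": [
        ["2c78g", "df567", "5ty78"],
        ["2c78g", "d67gh", "df890o"],
        ["d67gh", "df890o", "5ty78"],
        ["2c78g", "5ty78", "df890o"],
    ],
    "Location": ["US", "India", "Japan", "US"],
})

# Adding two Python lists concatenates them, which is why summing a
# Series of lists yields one long list per group.
assert ["2c78g"] + ["df567"] == ["2c78g", "df567"]

# Flatten each group's lists, deduplicate with set(), and count.
unique_counts = df.groupby("Location")["Rule_ID"].apply(
    lambda lists: len(set(chain.from_iterable(lists)))
)
print(unique_counts)
```

For long groups, chaining is likely to be cheaper than repeated list addition, because each `+` builds a new list.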

Wrapping the concatenated list in pd.Series gives each ID its own row. Resetting the index and grouping by location and ID then lets size() count the occurrences.

In [240]: df.groupby('Location').apply(lambda x: pd.Series(x['Rule_ID'].sum()))
Out[240]:
Location
India     0     2c78g
          1     d67gh
          2    df890o
Japan     0     d67gh
          1    df890o
          2     5ty78
US        0     2c78g
          1     df567
          2     5ty78
          3     2c78g
          4     5ty78
          5    df890o
dtype: object
dtype: object
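On pandas 0.25 or later, `DataFrame.explode` produces the same one-row-per-ID layout directly. This is an alternative to the `pd.Series` approach above, not part of the original answer. The sample rows are again reconstructed from the outputs shown here, and `pair_counts` is an illustrative name.

```python
import pandas as pd

# Sample data reconstructed from the outputs in this answer (an assumption).
df = pd.DataFrame({
    "Rule_ID": [
        ["2c78g", "df567", "5ty78"],
        ["2c78g", "d67gh", "df890o"],
        ["d67gh", "df890o", "5ty78"],
        ["2c78g", "5ty78", "df890o"],
    ],
    "Location": ["US", "India", "Japan", "US"],
})

# explode() repeats the other columns once per list element,
# so every (Location, Rule_ID) pair ends up on its own row.
exploded = df.explode("Rule_ID")

# Count how many times each rule occurs in each location.
pair_counts = exploded.groupby(["Location", "Rule_ID"]).size()
print(pair_counts)
```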

Solution 2:

You need to transform your data frame to long format (unnest the Rule_ID column), after which it is straightforward to summarize:

df_long = pd.DataFrame({
        "Rule_ID": [e for s in df.Rule_ID for e in s],
        "Location": df.Location.repeat(df.Rule_ID.str.len())
    })

df_long.groupby('Location').Rule_ID.nunique()

#Location
#India    3
#Japan    3
#US       4
#Name: Rule_ID, dtype: int64

df_long.groupby(['Rule_ID', 'Location']).size()

#Rule_ID    Location
#u'2c78g'   India       1
#           US          2
#u'5ty78'   Japan       1
#           US          2
#u'd67gh'   India       1
#           Japan       1
#u'df567'   US          1
#u'df890o'  India       1
#           Japan       1
#           US          1
#dtype: int64
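To see how the unnesting step lines up, it may help to look at the intermediate pieces separately. The following is a sketch on sample rows reconstructed from the outputs in this thread (an assumption); `lengths`, `locations` and `rule_ids` are illustrative names, and `.to_numpy()` is used only to give the long frame a fresh index.

```python
import pandas as pd

# Sample data reconstructed from the outputs in this thread (an assumption).
df = pd.DataFrame({
    "Rule_ID": [
        ["2c78g", "df567", "5ty78"],
        ["2c78g", "d67gh", "df890o"],
        ["d67gh", "df890o", "5ty78"],
        ["2c78g", "5ty78", "df890o"],
    ],
    "Location": ["US", "India", "Japan", "US"],
})

# .str.len() also works on list values: it returns each list's length.
lengths = df["Rule_ID"].str.len()

# Repeat every Location once per rule in its row, so it stays aligned
# with the flattened rule list built below.
locations = df["Location"].repeat(lengths)

# Flatten the nested lists in row order.
rule_ids = [rule for rules in df["Rule_ID"] for rule in rules]

df_long = pd.DataFrame({
    "Rule_ID": rule_ids,
    "Location": locations.to_numpy(),  # drop the repeated original index
})

print(df_long.groupby("Location")["Rule_ID"].nunique())
```

Both the flattened list and the repeated Location column follow the original row order, which is what keeps each rule paired with the right location.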
