How Do I Count The Values From A Pandas Column Which Is A List Of Strings?
Solution 1:
Solution
Best option: df.colors.explode().dropna().value_counts().
However, if you also want counts for empty lists ([]), use Method-1.B/C, similar to what was suggested by Quang Hoang in the comments.
You can use either of the following two methods.
- Method-1: Use pandas methods alone ⭐⭐⭐: explode --> dropna --> value_counts
- Method-2: Use list.extend --> pd.Series.value_counts
## Method-1
# A. If you don't want counts for empty []
df.colors.explode().dropna().value_counts()
# B. If you want counts for empty [] (classified as NaN)
df.colors.explode().value_counts(dropna=False) # returns [] as NaN
# C. If you want counts for empty [] (classified as [])
df.colors.explode().fillna('[]').value_counts() # returns [] as []
## Method-2
colors = []
_ = [colors.extend(e) for e in df.colors if len(e) > 0]
pd.Series(colors).value_counts()
Output:
green 2
blue 2
brown 2
red 1
purple 1
# NaN 1 ## For Method-1.B
# [] 1 ## For Method-1.C
dtype: int64
Dummy Data
import pandas as pd
df = pd.DataFrame({'colors':[['blue','green','brown'],
[],
['green','red','blue'],
['purple'],
['brown']]})
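The three Method-1 variants can be tried side by side on the dummy data. This is a minimal sketch; the variable names `exploded`, `counts_a`, `counts_b` and `counts_c` are illustrative and not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({'colors': [['blue', 'green', 'brown'],
                              [],
                              ['green', 'red', 'blue'],
                              ['purple'],
                              ['brown']]})

# explode() yields one row per list element; an empty list becomes a single NaN row
exploded = df['colors'].explode()

# A: ignore empty lists entirely
counts_a = exploded.dropna().value_counts()

# B: keep empty lists, counted under NaN
counts_b = exploded.value_counts(dropna=False)

# C: keep empty lists, counted under the string label '[]'
counts_c = exploded.fillna('[]').value_counts()

print(counts_a, counts_b, counts_c, sep='\n\n')
```

Note that in variant C the label is the string '[]', not an actual empty list, so it should only be used for display or reporting.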
Solution 2:
Use a Counter + chain, which is meant to do exactly this. Then construct the Series from the Counter object.
import pandas as pd
from collections import Counter
from itertools import chain
s = pd.Series([['blue','green','brown'], [], ['green','red','blue']])
pd.Series(Counter(chain.from_iterable(s)))
#blue 2
#green 2
#brown 1
#red 1
#dtype: int64
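One difference from value_counts is worth keeping in mind: a Series built from a Counter keeps the order in which values were first seen, rather than sorting by count. If a sorted result is wanted, something like the following sketch should work (the names `by_count` and `by_most_common` are illustrative):

```python
import pandas as pd
from collections import Counter
from itertools import chain

s = pd.Series([['blue', 'green', 'brown'], [], ['green', 'red', 'blue']])

counter = Counter(chain.from_iterable(s))

# Option 1: build the Series, then sort by count, largest first
by_count = pd.Series(counter).sort_values(ascending=False)

# Option 2: let Counter order the pairs, then build the Series
pairs = counter.most_common()          # list of (value, count), highest count first
by_most_common = pd.Series(dict(pairs))

print(by_count)
print(by_most_common)
```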
While explode + value_counts are the pandas way to do things, they're slower for shorter lists.
import perfplot
import pandas as pd
import numpy as np
from collections import Counter
from itertools import chain
def counter(s):
    return pd.Series(Counter(chain.from_iterable(s)))

def explode(s):
    return s.explode().value_counts()

perfplot.show(
    setup=lambda n: pd.Series([['blue','green','brown'], [], ['green','red','blue']]*n),
    kernels=[
        lambda s: counter(s),
        lambda s: explode(s),
    ],
    labels=['counter', 'explode'],
    n_range=[2 ** k for k in range(17)],
    equality_check=np.allclose,
    xlabel='~len(s)'
)
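If perfplot is not installed, a rough comparison can be made with the standard-library timeit module instead. This is only a sketch under assumed settings: the sizes in the loop and the repeat count of 20 are arbitrary choices, and timings will vary by machine and pandas version.

```python
import timeit
from collections import Counter
from itertools import chain

import pandas as pd

def counter(s):
    return pd.Series(Counter(chain.from_iterable(s)))

def explode(s):
    return s.explode().value_counts()

base = [['blue', 'green', 'brown'], [], ['green', 'red', 'blue']]

# Time both approaches on Series of increasing length (sizes chosen arbitrarily)
for n in (1, 100, 10_000):
    s = pd.Series(base * n)
    t_counter = timeit.timeit(lambda: counter(s), number=20)
    t_explode = timeit.timeit(lambda: explode(s), number=20)
    print(f"len(s)={len(s):>6}  counter={t_counter:.4f}s  explode={t_explode:.4f}s")
```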
Solution 3:
You can use Counter from the collections module:
import pandas as pd
from collections import Counter
from itertools import chain
df = pd.DataFrame({'colors':[['blue','green','brown'],
[],
['green','red','blue'],
['purple'],
['brown']]})
# Store the result under a new name so the original DataFrame is not overwritten
counts = pd.Series(Counter(chain(*df.colors)))
print(counts)
Output:
blue 2
green 2
brown 2
red 1
purple 1
dtype: int64
Solution 4:
A quick-and-dirty solution would be a plain nested loop that tallies each value in a dictionary.
You'd still have to add a condition to count the empty lists, though.
colors = df.colors.tolist()
d = {}
for l in colors:
    for c in l:
        if c not in d.keys():
            d.update({c: 1})
        else:
            current_val = d.get(c)
            d.update({c: current_val + 1})
This produces a dictionary that looks like this:
{'blue': 2, 'green': 2, 'brown': 2, 'red': 1, 'purple': 1}
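The missing condition for empty lists could be added along these lines. This is a sketch, not the answerer's code: it assumes empty lists should be tallied under the string label '[]', and it uses dict.get with a default in place of the explicit membership check.

```python
import pandas as pd

df = pd.DataFrame({'colors': [['blue', 'green', 'brown'],
                              [],
                              ['green', 'red', 'blue'],
                              ['purple'],
                              ['brown']]})

counts = {}
for row in df['colors'].tolist():
    if not row:
        # Assumption: an empty list is counted under the label '[]'
        counts['[]'] = counts.get('[]', 0) + 1
        continue
    for color in row:
        # get() returns 0 the first time a color is seen
        counts[color] = counts.get(color, 0) + 1

print(counts)
```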
Solution 5:
I would use .apply with pd.Series to accomplish this:
# 1. Expand columns and count them
df_temp = df["colors"].apply(pd.Series.value_counts)
   blue  brown  green  purple  red
0   1.0    1.0    1.0     NaN  NaN
1   NaN    NaN    NaN     NaN  NaN
2   1.0    NaN    1.0     NaN  1.0
3   NaN    NaN    NaN     1.0  NaN
4   NaN    1.0    NaN     NaN  NaN
# 2. Get the value counts from this:
df_temp.sum()
blue 2.0
brown 2.0
green 2.0
purple 1.0
red 1.0
# Alternatively, convert to a dict
df_temp.sum().to_dict()
# {'blue': 2.0, 'brown': 2.0, 'green': 2.0, 'purple': 1.0, 'red': 1.0}