Pandas: Drop Quasi-duplicates By Column Values
I have a list that, let's say, looks like this (which I'm putting into a DF): [ ['john', '1', '1', '2016'], ['john', '1', '10', '2016'], ['sally', '3', '5', '2016'], ['sally', '4',
Solution 1:
You can sort the data frame by year, month, day
and then take the first row from each name
:
df.sort_values(by = ['year', 'month', 'day']).groupby('name').first()
# month day year# name # john 1 1 2016#sally 3 5 2016
Data:
df = pd.DataFrame([['john', '1', '1', '2016'],
['john', '1', '10', '2016'],
['sally', '3', '5', '2016'],
['sally', '4', '1', '2016']],
columns = ['name', 'month', 'day', 'year'])
Solution 2:
Option 1
use pd.to_datetime
to parse ['year', 'month', 'day'] columns.
groupby('name')
then take first
df['date'] = pd.to_datetime(df[['year', 'month', 'day']])
df.sort_values(['name', 'date']).groupby('name').first()
Option 2
Same pd.to_datetime
usage.
groupby('name')
take idxmin
to locate smallest date.
df['date'] = pd.to_datetime(df[['year', 'month', 'day']])
df.ix[df.groupby('name').date.idxmin()]
Post a Comment for "Pandas: Drop Quasi-duplicates By Column Values"