Remove Cancelling Rows From Pandas Dataframe
I have a list of invoices sent out to customers. However, sometimes a bad invoice is sent, which is later cancelled. My Pandas Dataframe looks something like this, except much larg
Solution 1:
def remove_cancelled_transactions(df):
trans_neg = df.amount < 0
return df.loc[~(trans_neg | trans_neg.shift(-1))]
groups = [df.customer, df.invoice_nr, df.date, df.amount.abs()]
df.groupby(groups, as_index=False, group_keys=False) \
.apply(remove_cancelled_transactions)
Solution 2:
You can use filter
all values, where each group has values where sum is 0
and modulo by 2
is 0
:
print (df.groupby([df.customer, df.invoice_nr, df.date, df.amount.abs()])
.filter(lambda x: (len(x.amount.abs()) % 2 == 0 ) and (x.amount.sum() == 0)))
customer invoice_nr amount date
index
0 1 1 10 01-01-2016
1 1 1 -10 01-01-2016
5 2 4 12 02-01-2016
6 2 4 -12 02-01-2016
idx = df.groupby([df.customer, df.invoice_nr, df.date, df.amount.abs()])
.filter(lambda x: (len(x.amount.abs()) % 2 == 0 ) and (x.amount.sum() == 0)).index
print (idx)
Int64Index([0, 1, 5, 6], dtype='int64', name='index')
print (df.drop(idx))
customer invoice_nr amount date
index
2 1 1 11 01-01-2016
3 1 2 10 02-01-2016
4 2 3 7 01-01-2016
7 2 4 8 02-01-2016
8 2 4 4 02-01-2016
EDIT by comment:
If in real data are not duplicates for one invoice and one customer and one date, so you can use this way:
print (df)
index customer invoice_nr amount date
0 0 1 1 10 01-01-2016
1 1 1 1 -10 01-01-2016
2 2 1 1 11 01-01-2016
3 3 1 2 10 02-01-2016
4 4 2 3 7 01-01-2016
5 5 2 4 12 02-01-2016
6 6 2 4 -12 02-01-2016
7 7 2 4 8 02-01-2016
8 8 2 4 4 02-01-2016
df['amount_abs'] = df.amount.abs()
df.drop_duplicates(['customer','invoice_nr', 'date', 'amount_abs'], keep=False, inplace=True)
df.drop('amount_abs', axis=1, inplace=True)
print (df)
index customer invoice_nr amount date
2 2 1 1 11 01-01-2016
3 3 1 2 10 02-01-2016
4 4 2 3 7 01-01-2016
7 7 2 4 8 02-01-2016
8 8 2 4 4 02-01-2016
Solution 3:
What if you just do a groupby on all 3 fields? The resulting sums would net out any canceled invoices:
df2 = df.groupby(['customer','invoice_nr','date']).sum()
results in
customer invoice_nr date
1 1 2016/01/01 11
2 2016/02/01 10
2 3 2016/01/01 7
Post a Comment for "Remove Cancelling Rows From Pandas Dataframe"