Remove Cancelling Rows From Pandas Dataframe

January 31, 2023 Post a Comment

I have a list of invoices sent out to customers. However, sometimes a bad invoice is sent, which is later cancelled. My Pandas Dataframe looks something like this, except much larg

Solution 1:

def remove_cancelled_transactions(df):
    trans_neg = df.amount < 0
    return df.loc[~(trans_neg | trans_neg.shift(-1))]

groups = [df.customer, df.invoice_nr, df.date, df.amount.abs()]
df.groupby(groups, as_index=False, group_keys=False) \
  .apply(remove_cancelled_transactions)

Solution 2:

You can use filter all values, where each group has values where sum is 0 and modulo by 2 is 0:

print (df.groupby([df.customer, df.invoice_nr, df.date, df.amount.abs()])
        .filter(lambda x: (len(x.amount.abs()) % 2 == 0 ) and (x.amount.sum() == 0)))

       customer  invoice_nr  amount        date
index                                          
0             1           1      10  01-01-2016
1             1           1     -10  01-01-2016
5             2           4      12  02-01-2016
6             2           4     -12  02-01-2016

idx = df.groupby([df.customer, df.invoice_nr, df.date, df.amount.abs()])
        .filter(lambda x: (len(x.amount.abs()) % 2 == 0 ) and (x.amount.sum() == 0)).index

print (idx)      
Int64Index([0, 1, 5, 6], dtype='int64', name='index')

print (df.drop(idx))  
       customer  invoice_nr  amount        date
index                                          
2             1           1      11  01-01-2016
3             1           2      10  02-01-2016
4             2           3       7  01-01-2016
7             2           4       8  02-01-2016
8             2           4       4  02-01-2016

EDIT by comment:

If in real data are not duplicates for one invoice and one customer and one date, so you can use this way:

 print (df)
   index  customer  invoice_nr  amount        date
0      0         1           1      10  01-01-2016
1      1         1           1     -10  01-01-2016
2      2         1           1      11  01-01-2016
3      3         1           2      10  02-01-2016
4      4         2           3       7  01-01-2016
5      5         2           4      12  02-01-2016
6      6         2           4     -12  02-01-2016
7      7         2           4       8  02-01-2016
8      8         2           4       4  02-01-2016

df['amount_abs'] = df.amount.abs()
df.drop_duplicates(['customer','invoice_nr', 'date', 'amount_abs'], keep=False, inplace=True)
df.drop('amount_abs', axis=1, inplace=True)
print (df)
   index  customer  invoice_nr  amount        date
2      2         1           1      11  01-01-2016
3      3         1           2      10  02-01-2016
4      4         2           3       7  01-01-2016
7      7         2           4       8  02-01-2016
8      8         2           4       4  02-01-2016

Solution 3:

What if you just do a groupby on all 3 fields? The resulting sums would net out any canceled invoices:

df2 = df.groupby(['customer','invoice_nr','date']).sum()

results in

customer invoice_nr date
1        1          2016/01/01      11
         2          2016/02/01      10
2        3          2016/01/01       7

Python stackoverflow Examples

Remove Cancelling Rows From Pandas Dataframe

Solution 1:

Solution 2:

Solution 3:

Post a Comment for "Remove Cancelling Rows From Pandas Dataframe"