Skip to content Skip to sidebar Skip to footer

Combine Two Dataframes Based On Ranges Which May Partially Overlap Using Pandas And Track Multiple Values

I have two big dataframes (100K rows), One has 'values', one has 'types'. I want to assign a 'type' from df2 to a column in df1 based on depth. The depths are given as depth 'From

Solution 1:

From df2 create an auxiliary Series, marking each "starting point" of a unit (a range of length 1):

units = df2.set_index('Type').apply(lambda row: pd.Series(
    range(row.From, row.To)), axis=1).stack()\
    .reset_index(level=1, drop=True)

The result is:

Type
A    0.0
A    1.0
A    2.0
A    3.0
B    4.0
C    5.0
D    6.0
D    7.0
E    8.0
E    9.0
dtype: float64

Then define a function generating Type for the current row:

def getType(row):
    gr = units[units.ge(row.From) & units.lt(row.To)].groupby(level=0)
    if gr.ngroups == 1:
        return gr.ngroup().index[0]
    txt = []
    for key, grp in gr:
        siz = grp.size
        un = 'unit' if siz == 1 else 'units'
        txt.append(f'{siz} {un} {key}')
    return ','.join(txt)

And to generate Type column, apply it to each row:

df1['Type'] = df1.apply(getType, axis=1)

The result is:

   From    To    val                Type
0   1.0   3.0  0.001                   A
1   3.0   5.0  0.005   1 unit A,1 unit B
2   5.0   7.0  0.002   1 unit C,1 unit D
3   7.0  10.0  0.001  1 unit D,2 units E

This result is a bit different from your expected result, but I think you created it in a bit inconsequent way.

I think that my solution is correct (at least more consequent), because:

  1. Row 1.0 - 3.0 is entirely within the limits of 0 4 A, so the result is just A (like in your post).
  2. Row 3.0 - 5.0 can be "divided" into:
    • 3.0 - 4.0 is within the limits of 0 4 A (1 unit),
    • 4.0 - 5.0 is within the limits of 4 5 B (also 1 unit, but you want 2 units here).
  3. Row 5.0 - 7.0 can be again "divided" into:
    • 5.0 - 6.0 is within the limits of 5 6 C (1 unit),
    • 6.0 - 7.0 is within the limits of 6 8 D (1 unit, just like you did).
  4. Row 7.0 - 10.0 can be "divided" into:
    • 7.0 - 8.0 is within the limits of 6 8 D (1 unit, just like you did),
    • 8.0 - 10.0 is within the limits of 8 10 E (2 units, not 3 as you wrote).

Post a Comment for "Combine Two Dataframes Based On Ranges Which May Partially Overlap Using Pandas And Track Multiple Values"