Combine Two Dataframes Based On Ranges Which May Partially Overlap Using Pandas And Track Multiple Values
I have two big dataframes (100K rows), One has 'values', one has 'types'. I want to assign a 'type' from df2 to a column in df1 based on depth. The depths are given as depth 'From
Solution 1:
From df2 create an auxiliary Series, marking each "starting point" of a unit (a range of length 1):
units = df2.set_index('Type').apply(lambda row: pd.Series(
range(row.From, row.To)), axis=1).stack()\
.reset_index(level=1, drop=True)
The result is:
Type
A 0.0
A 1.0
A 2.0
A 3.0
B 4.0
C 5.0
D 6.0
D 7.0
E 8.0
E 9.0
dtype: float64
Then define a function generating Type for the current row:
def getType(row):
gr = units[units.ge(row.From) & units.lt(row.To)].groupby(level=0)
if gr.ngroups == 1:
return gr.ngroup().index[0]
txt = []
for key, grp in gr:
siz = grp.size
un = 'unit' if siz == 1 else 'units'
txt.append(f'{siz} {un} {key}')
return ','.join(txt)
And to generate Type column, apply it to each row:
df1['Type'] = df1.apply(getType, axis=1)
The result is:
From To val Type
0 1.0 3.0 0.001 A
1 3.0 5.0 0.005 1 unit A,1 unit B
2 5.0 7.0 0.002 1 unit C,1 unit D
3 7.0 10.0 0.001 1 unit D,2 units E
This result is a bit different from your expected result, but I think you created it in a bit inconsequent way.
I think that my solution is correct (at least more consequent), because:
- Row 1.0 - 3.0 is entirely within the limits of 0 4 A, so the result is just A (like in your post).
- Row 3.0 - 5.0 can be "divided" into:
- 3.0 - 4.0 is within the limits of 0 4 A (1 unit),
- 4.0 - 5.0 is within the limits of 4 5 B (also 1 unit, but you want 2 units here).
- Row 5.0 - 7.0 can be again "divided" into:
- 5.0 - 6.0 is within the limits of 5 6 C (1 unit),
- 6.0 - 7.0 is within the limits of 6 8 D (1 unit, just like you did).
- Row 7.0 - 10.0 can be "divided" into:
- 7.0 - 8.0 is within the limits of 6 8 D (1 unit, just like you did),
- 8.0 - 10.0 is within the limits of 8 10 E (2 units, not 3 as you wrote).
Post a Comment for "Combine Two Dataframes Based On Ranges Which May Partially Overlap Using Pandas And Track Multiple Values"