Lengthening A Dataframe Based On Stacking Columns Within It In Pandas
I am looking for a function that achieves the following. It is best shown in an example. Consider: pd.DataFrame([ [1, 2, 3 ], [4, 5, np.nan ]], columns=['x', 'y1', 'y2']) which lo
Solution 1:
You can use stack to get this done, i.e.
pd.DataFrame(df.set_index('x').stack().reset_index(level=0).values,columns=['x','y'])
     x    y
0  1.0  2.0
1  1.0  3.0
2  4.0  5.0
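The same stack-based idea can also be written out step by step. This is only a sketch, assuming the example frame from the question. The names `stacked` and `out` are mine, and the explicit `dropna()` is an addition so the result does not depend on whether the installed pandas version drops missing values inside `stack()` by default.

```python
import numpy as np
import pandas as pd

# Example frame from the question
df = pd.DataFrame([[1, 2, 3], [4, 5, np.nan]], columns=['x', 'y1', 'y2'])

stacked = (
    df.set_index('x')   # keep x as the row label
      .stack()          # move y1, y2 into a second index level
      .dropna()         # assumption: make the NaN removal explicit
)

# Discard the y1/y2 level, name the values 'y', and turn x back into a column
out = stacked.droplevel(1).rename('y').reset_index()
print(out)
```

Unlike the one-liner above, this version does not go through `.values`, so `x` keeps its integer dtype instead of being upcast to float.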
Solution 2:
Repeat each item in the first column as many times as there are non-null values in its row. Then build the final dataframe from the non-null values in the remaining columns. You can use the DataFrame.count() method to count the non-null values per row and numpy.repeat() to repeat an array according to a matching array of counts.
>>> rest = df.loc[:, 'y1':]
>>> pd.DataFrame({'x': np.repeat(df['x'], rest.count(axis=1)).values,
...               'y': rest.values[rest.notna()]})
Demo:
>>> df
    x   y1   y2   y3   y4
0   1  2.0  3.0  NaN  6.0
1   4  5.0  NaN  9.0  3.0
2  10  NaN  NaN  NaN  NaN
3   9  NaN  NaN  6.0  NaN
4   7  6.0  NaN  NaN  NaN
>>> rest = df.loc[:, 'y1':]
>>> pd.DataFrame({'x': np.repeat(df['x'], rest.count(axis=1)).values,
...               'y': rest.values[rest.notna()]})
x y
0 1 2.0
1 1 3.0
2 1 6.0
3 4 5.0
4 4 9.0
5 4 3.0
6 9 6.0
7 7 6.0
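The two columns line up because boolean masking of a 2-D NumPy array returns values in row-major order, which is the same order in which np.repeat() emits the repeated x values. The self-contained sketch below rebuilds the demo frame and spells out the intermediate pieces; the names `counts`, `mask` and `out` are mine, not part of the original answer.

```python
import numpy as np
import pandas as pd

# Demo frame from above
df = pd.DataFrame({
    'x':  [1, 4, 10, 9, 7],
    'y1': [2, 5, np.nan, np.nan, 6],
    'y2': [3, np.nan, np.nan, np.nan, np.nan],
    'y3': [np.nan, 9, np.nan, 6, np.nan],
    'y4': [6, 3, np.nan, np.nan, np.nan],
})

rest = df.loc[:, 'y1':]            # every column from y1 onwards
counts = rest.count(axis=1)        # non-null values per row
mask = rest.notna().to_numpy()     # True where a y value is present

out = pd.DataFrame({
    # each x is repeated once per non-null y in its row
    'x': np.repeat(df['x'].to_numpy(), counts.to_numpy()),
    # the mask walks the array row by row, left to right
    'y': rest.to_numpy()[mask],
})
print(out)
```

A row with no non-null values, such as x = 10 here, gets a count of 0 and so drops out of the result entirely.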
Solution 3:
Here's one based on NumPy, as you were looking for performance -
def gather_columns(df):
    col_mask = [i.startswith('y') for i in df.columns]
    ally_vals = df.iloc[:,col_mask].values
    y_valid_mask = ~np.isnan(ally_vals)
    reps = np.count_nonzero(y_valid_mask, axis=1)
    x_vals = np.repeat(df.x.values, reps)
    y_vals = ally_vals[y_valid_mask]
    return pd.DataFrame({'x':x_vals, 'y':y_vals})
Sample run -
In [78]: df  # (added more cols for variety)
Out[78]:
   x  y1   y2   y5   y7
0  1   2  3.0  NaN  NaN
1  4   5  NaN  6.0  7.0
In [79]: gather_columns(df)
Out[79]:
   x    y
0  1  2.0
1  1  3.0
2  4  5.0
3  4  6.0
4  4  7.0
If the y columns always run from the second column through to the last, we can simply slice the dataframe by position, which skips the column-name check and should give a further performance boost, like so -
def gather_columns_v2(df):
    ally_vals = df.iloc[:,1:].values
    y_valid_mask = ~np.isnan(ally_vals)
    reps = np.count_nonzero(y_valid_mask, axis=1)
    x_vals = np.repeat(df.x.values, reps)
    y_vals = ally_vals[y_valid_mask]
    return pd.DataFrame({'x':x_vals, 'y':y_vals})
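If you want to check the performance claim on your own data, a rough sketch like the following compares the NumPy version against a stack-based one and confirms they return the same frame. The test data, the `gather_stack` helper and the name `big` are hypothetical and only for illustration; timings will vary with frame shape, share of missing values and library versions.

```python
import timeit
import numpy as np
import pandas as pd

def gather_columns_v2(df):
    ally_vals = df.iloc[:, 1:].values
    y_valid_mask = ~np.isnan(ally_vals)
    reps = np.count_nonzero(y_valid_mask, axis=1)
    x_vals = np.repeat(df.x.values, reps)
    y_vals = ally_vals[y_valid_mask]
    return pd.DataFrame({'x': x_vals, 'y': y_vals})

def gather_stack(df):
    # Hypothetical stack-based baseline for comparison
    s = df.set_index('x').stack().dropna()
    return pd.DataFrame({'x': s.index.get_level_values(0).to_numpy(),
                         'y': s.to_numpy()})

# Made-up test data: 10,000 rows, four y columns, roughly 30% missing
rng = np.random.default_rng(0)
y = rng.random((10_000, 4))
y[rng.random(y.shape) < 0.3] = np.nan
big = pd.DataFrame(y, columns=['y1', 'y2', 'y3', 'y4'])
big.insert(0, 'x', np.arange(len(big)))

same = gather_columns_v2(big).equals(gather_stack(big))
print('same result:', same)

for fn in (gather_stack, gather_columns_v2):
    seconds = timeit.timeit(lambda: fn(big), number=5)
    print(fn.__name__, round(seconds, 4))
```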