Interpolating Values From A Dataframe Based On A Column Value
Solution 1:
One good solution for making this faster is pandas.DataFrame.eval():
TL;DR
Seconds per number of rows
Rows:    100    1000   10000  1E5    1E6    1E7
apply:   0.076  0.734  7.812  -      -      -
eval:    0.056  0.053  0.058  0.087  0.338  2.887
As can be seen from these timings, eval() has a lot of setup overhead: up to 10,000 rows it takes basically the same time regardless of size. But at 10,000 rows it is already two orders of magnitude faster than apply, so it is certainly worth the overhead for large data sets.
What is it?
From the docs:
pandas.eval(expr, parser='pandas', engine=None, truediv=True,
local_dict=None, global_dict=None, resolvers=(),
level=0, target=None, inplace=None)
Evaluate a Python expression as a string using various backends.
The following arithmetic operations are supported: +, -, *, /, ** , %, // (python engine only) along with the following boolean operations: | (or), & (and), and ~ (not). Additionally, the 'pandas' parser allows the use of and, or, and not with the same semantics as the corresponding bitwise operators. Series and DataFrame objects are supported and behave as they would with plain ol’ Python evaluation.
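As a small illustration of the operations described above, here is a minimal sketch using DataFrame.eval() on a toy frame. The column names a and b and the values are made up for this example; they are not from the question.

```python
import pandas as pd

# Toy data, invented purely for illustration.
df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 30.0]})

# Arithmetic on whole columns, written as a string expression.
total = df.eval("a + b * 2")

# A comparison yields a boolean Series; multiplying by it
# keeps the value where the test is True and zeroes it elsewhere.
masked = df.eval("b * (a >= 2)")

print(total.tolist())
print(masked.tolist())
```

The second expression is the same multiply-by-a-bool pattern that the answer relies on to select a segment.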
Tricks performed for this Question:
The code below exploits the fact that the interpolation always has exactly two segments. It calculates the interpolant for both segments on every row, and then discards the unused segment by multiplying it by a bool test (i.e., by 0 or 1).
The actual expression passed to eval is:
((y2-y1) / 0.5 * (x0-0.0) + y1) * (x0 < 0.5)+((y3-y2) / 0.5 * (x0-0.5) + y2) * (x0 >= 0.5)
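To see why the expression above reproduces np.interp for breakpoints 0.0, 0.5 and 1.0, the same arithmetic can be written directly in NumPy and compared against a row-by-row reference. This is only a sketch: the random test data, the seed and the variable names left, right and masked are assumptions made for the check, not part of the original answer.

```python
import numpy as np

# Random test data in [0, 1), invented for this check.
rng = np.random.default_rng(0)
n = 1000
x0 = rng.random(n)
y1, y2, y3 = rng.random((3, n))

# Evaluate both linear segments for every row...
left = (y2 - y1) / 0.5 * (x0 - 0.0) + y1
right = (y3 - y2) / 0.5 * (x0 - 0.5) + y2

# ...then keep exactly one of them per row via the boolean masks.
masked = left * (x0 < 0.5) + right * (x0 >= 0.5)

# Reference: call np.interp once per row, as the apply version does.
reference = np.array([
    np.interp(x, [0.0, 0.5, 1.0], [a, b, c])
    for x, a, b, c in zip(x0, y1, y2, y3)
])

agree = np.allclose(masked, reference)
print(agree)
```

Because the two masks are complementary, each row receives a contribution from exactly one segment.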
Code:
import pandas as pd
import numpy as np
from timeit import timeit

xp = [0.0, 0.5, 1.0]
np.random.seed(100)

def method1():
    # Row-wise np.interp via apply (slow)
    df['interp'] = df.apply(
        lambda x: np.interp(x.x0, xp, [x.y1, x.y2, x.y3]), axis=1)

def method2():
    # One linear expression per segment, each masked by a bool test on x0
    exp = '((y%d-y%d) / %s * (x0-%s) + y%d) * (x0 %s 0.5)'
    exp_1 = exp % (2, 1, xp[1] - xp[0], xp[0], 1, '<')
    exp_2 = exp % (3, 2, xp[2] - xp[1], xp[1], 2, '>=')
    df['interp2'] = df.eval(exp_1 + '+' + exp_2)

def runit(stmt):
    # Time 10 calls of the named function and print the total in seconds
    print("%s: %.3f" % (
        stmt, timeit(stmt + '()', number=10,
                     setup='from __main__ import ' + stmt)))

def runit_size(size):
    global df
    df = pd.DataFrame(
        np.random.rand(size, 4), columns=['x0', 'y1', 'y2', 'y3'])
    print('Rows: %d' % size)
    if size <= 10000:
        # apply is only timed up to 10,000 rows
        runit('method1')
    runit('method2')

for i in (100, 1000, 10000, 100000, 1000000, 10000000):
    runit_size(i)

print(df.head())
Results:
         x0        y1        y2        y3    interp   interp2
0  0.060670  0.949837  0.608659  0.672003  0.908439  0.908439
1  0.462774  0.704273  0.181067  0.647582  0.220021  0.220021
2  0.568109  0.954138  0.796690  0.585310  0.767897  0.767897
3  0.455355  0.738452  0.812236  0.927291  0.805648  0.805648
4  0.826376  0.029957  0.772803  0.521777  0.608946  0.608946