
Interpolating Values From A Dataframe Based On A Column Value

Assuming I have the following problem: import pandas as pd import numpy as np xp = [0.0, 0.5, 1.0] np.random.seed(100) df = pd.DataFrame(np.random.rand(10, 4), columns=['x0', '

Solution 1:

One good solution for making this faster is pandas.DataFrame.eval():

TL;DR

Seconds (total for 10 calls, per the timeit setup in the code below) by number of rows
Rows:     100   1000  10000    1E5    1E6    1E7
apply:  0.076  0.734  7.812
eval:   0.056  0.053  0.058  0.087  0.338  2.887

As these timings show, eval() carries substantial setup overhead: up to 10,000 rows the run time is essentially constant. At 10,000 rows, however, it is already about two orders of magnitude faster than apply, so the overhead is certainly worth it for large data sets.

What is it?

From the docs:

pandas.eval(expr, parser='pandas', engine=None, truediv=True, 
            local_dict=None, global_dict=None, resolvers=(),
            level=0, target=None, inplace=None)

Evaluate a Python expression as a string using various backends.

The following arithmetic operations are supported: +, -, *, /, ** , %, // (python engine only) along with the following boolean operations: | (or), & (and), and ~ (not). Additionally, the 'pandas' parser allows the use of and, or, and not with the same semantics as the corresponding bitwise operators. Series and DataFrame objects are supported and behave as they would with plain ol’ Python evaluation.
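As a minimal sketch of what this looks like in practice, the snippet below evaluates an arithmetic expression and a comparison used as a 0/1 mask through DataFrame.eval(). The frame `toy` and its columns `a` and `b` are hypothetical names chosen only for illustration; they are not part of the question.

```python
import pandas as pd

# Hypothetical toy frame, used only to illustrate the expression syntax.
toy = pd.DataFrame({'a': [1.0, 2.0, 3.0], 'b': [10.0, 20.0, 30.0]})

# Columns are referenced by name inside the expression string.
total = toy.eval('a + b * 2')

# A comparison yields booleans; multiplying by it acts as a 0/1 mask.
masked = toy.eval('b * (a >= 2)')

print(total.tolist())
print(masked.tolist())
```

The second expression is the same masking idiom that the solution relies on: rows where the test is false contribute zero.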

Tricks performed for this Question:

The code below exploits the fact that xp has only three points, so the interpolation always falls in one of two segments. It calculates the interpolant for both segments and then discards the unused one by multiplying each by a boolean test (i.e., 0 or 1).

The actual expression passed to eval is:

((y2-y1) / 0.5 * (x0-0.0) + y1) * (x0 < 0.5)+((y3-y2) / 0.5 * (x0-0.5) + y2) * (x0 >= 0.5)
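To make the masking concrete, here is a small sketch that evaluates this exact expression on a hand-picked frame and compares it with a row-by-row np.interp. The frame `check` and its values are assumptions made up for this illustration so that the arithmetic is easy to follow by hand.

```python
import numpy as np
import pandas as pd

xp = [0.0, 0.5, 1.0]

# Hypothetical hand-picked values: y rises from 1 to 3, then falls to 2.
check = pd.DataFrame({
    'x0': [0.25, 0.5, 0.75],
    'y1': [1.0, 1.0, 1.0],
    'y2': [3.0, 3.0, 3.0],
    'y3': [2.0, 2.0, 2.0],
})

# The expression from the text: both segments are computed for every row,
# and each is multiplied by a comparison that is 1 only where it applies.
expr = ('((y2-y1) / 0.5 * (x0-0.0) + y1) * (x0 < 0.5)'
        '+((y3-y2) / 0.5 * (x0-0.5) + y2) * (x0 >= 0.5)')
via_eval = check.eval(expr)

# Reference result: np.interp applied one row at a time.
via_interp = check.apply(
    lambda r: np.interp(r.x0, xp, [r.y1, r.y2, r.y3]), axis=1)

print(via_eval.tolist())
print(np.allclose(via_eval, via_interp))
```

One difference worth keeping in mind: np.interp clamps to the end values when x0 lies outside [xp[0], xp[-1]], whereas the expression extrapolates linearly. With x0 drawn from np.random.rand, as in the question, the two agree.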

Code:

import pandas as pd
import numpy as np

xp = [0.0, 0.5, 1.0]

np.random.seed(100)

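def method1():
    # Baseline: call np.interp once per row via apply (slow for large frames).
    df['interp'] = df.apply(
        lambda x: np.interp(x.x0, xp, [x.y1, x.y2, x.y3]), axis=1)

def method2():
    # One linear-interpolation term per segment; the trailing comparison
    # evaluates to 0 or 1 and masks out the segment that does not apply.
    # The breakpoint is taken from xp[1] rather than hard-coded.
    exp = '((y%d-y%d) / %s * (x0-%s) + y%d) * (x0 %s %s)'
    exp_1 = exp % (2, 1, xp[1] - xp[0], xp[0], 1, '<', xp[1])
    exp_2 = exp % (3, 2, xp[2] - xp[1], xp[1], 2, '>=', xp[1])

    df['interp2'] = df.eval(exp_1 + '+' + exp_2)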

from timeit import timeit

def runit(stmt):
    print("%s: %.3f" % (
        stmt, timeit(stmt + '()', number=10,
                     setup='from __main__ import ' + stmt)))

def runit_size(size):
    global df
    df = pd.DataFrame(
        np.random.rand(size, 4), columns=['x0', 'y1', 'y2', 'y3'])

    print('Rows: %d' % size)
    if size <= 10000:
        runit('method1')
    runit('method2')

for i in (100, 1000, 10000, 100000, 1000000, 10000000):
    runit_size(i)

print(df.head())

Results:

         x0        y1        y2        y3    interp   interp2
0  0.060670  0.949837  0.608659  0.672003  0.908439  0.908439
1  0.462774  0.704273  0.181067  0.647582  0.220021  0.220021
2  0.568109  0.954138  0.796690  0.585310  0.767897  0.767897
3  0.455355  0.738452  0.812236  0.927291  0.805648  0.805648
4  0.826376  0.029957  0.772803  0.521777  0.608946  0.608946
