2 Answers

Contributor has 1895 experience points and received more than 7 likes
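If you really need a fast solution, you should use numba. The alternative to numba is cython. Both compile your Python code to make it faster (cython by generating C, numba by JIT-compiling to machine code through LLVM), but I think numba is simpler, and the two perform more or less the same.
Being compiled from C/Fortran is what makes the numpy / pandas internal functions so fast. See the pandas documentation for more information.
Let's first create the example: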
import numpy as np
import pandas as pd
from numba import njit

df = pd.DataFrame({
    'C1': [2, 10, 8, 30],
    'C2': [5, 12, 3, 25],
    'C3': [5, 2, 17, 3]
})
coeff = pd.Series([3, 3, 5, 7])
Converting the code from the question to numba then gives:
@njit
def event_v(data, coeff, rate_low=2, rate_high=2):
    # data columns: 0 -> C1, 1 -> C2, 2 -> C3; -1 means no bound was crossed
    out = -np.ones(len(data), dtype=np.int8)

    for k in range(len(data) - 1):
        next_val = data[k + 1, 0]
        c = coeff[k]

        low_bound = next_val - rate_low * c
        high_bound = next_val + rate_high * c

        # Scan the following rows until one of the bounds is crossed
        for j in range(k + 1, len(data)):
            if data[j, 1] < low_bound:
                out[k] = 0
                break

            if data[j, 2] >= high_bound:
                out[k] = 1
                break

    return out

df["C4"] = event_v(df.values, coeff.values)
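To sanity-check the compiled function, it can help to trace the same logic in plain Python on the four-row example. The sketch below is not part of the original answer: `event_reference` is a hypothetical name, and it takes plain lists rather than a DataFrame.

```python
def event_reference(c1, c2, c3, coeff, rate_low=2, rate_high=2):
    """Plain-Python version of the same loop, for checking small inputs."""
    n = len(c1)
    out = [-1] * n  # -1 means neither bound was crossed
    for k in range(n - 1):
        next_val = c1[k + 1]
        low_bound = next_val - rate_low * coeff[k]
        high_bound = next_val + rate_high * coeff[k]
        # Look at the rows after k until one bound is crossed
        for j in range(k + 1, n):
            if c2[j] < low_bound:
                out[k] = 0
                break
            if c3[j] >= high_bound:
                out[k] = 1
                break
    return out


expected = event_reference(
    c1=[2, 10, 8, 30],
    c2=[5, 12, 3, 25],
    c3=[5, 2, 17, 3],
    coeff=[3, 3, 5, 7],
)
print(expected)  # [0, 1, -1, -1]
```

On the sample data, row 0 has bounds (4, 16) and C2 drops to 3 at row 2, so it gets 0. Row 1 has bounds (2, 14) and C3 reaches 17 at row 2, so it gets 1. The last two rows never cross a bound and stay at -1.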
Testing with 10 000 rows:
n = 10_000
df = pd.DataFrame(np.random.randint(30, size=[n, 3]), columns=["C1", "C2", "C3"])
coeff = pd.Series(np.random.randint(10, size=n))
%timeit event_v(df.values, coeff.values)
3.39 ms ± 1.13 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit event(df, coeff) # Code from the question
28.4 s ± 1.02 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
That is roughly 8,400 times faster (28.4 s versus 3.39 ms).
Testing with 1 000 000 rows:
n = 1_000_000
df = pd.DataFrame(np.random.randint(30, size=[n, 3]), columns=["C1", "C2", "C3"])
coeff = pd.Series(np.random.randint(10, size=n))
%timeit event_v(df.values, coeff.values)
27.6 s ± 1.16 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
I tried running the question's code on this input, but %timeit had still not finished after more than 2 hours.
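One caveat when timing numba code outside of `%timeit`: an `@njit` function is compiled the first time it is called with a given set of argument types, so a single timed call also includes compilation. A minimal sketch of a timing helper that makes an untimed warm-up call first is shown below. `time_call` is a hypothetical helper, not something provided by numba or by this answer.

```python
import time


def time_call(fn, *args, warmup=1, repeat=5):
    """Return the best wall-clock time (seconds) of `repeat` calls to fn(*args).

    The `warmup` calls are not timed, so one-off costs such as JIT
    compilation on the first call do not end up in the measurement.
    """
    for _ in range(warmup):
        fn(*args)
    best = float("inf")
    for _ in range(repeat):
        start = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - start)
    return best


# Stand-in workload; with numba installed this would be something like
# time_call(event_v, df.values, coeff.values).
elapsed = time_call(sum, range(100_000))
print(f"best of 5: {elapsed:.6f} s")
```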

Contributor has 1780 experience points and received more than 4 likes
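next_val, low_bound and high_bound can easily be vectorized, and they are very fast to compute. The second part is not easy to vectorize, because it may require scanning the whole array for every row. A slight improvement over your implementation (one that becomes more significant for larger n) can be obtained by doing the comparisons on numpy arrays.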
def get_c4(low_bound, high_bound, c2, c3):
    # Scan the remaining rows and report which bound is crossed first
    for idx in range(len(c2)):
        if c2[idx] < low_bound:
            return 0
        if c3[idx] >= high_bound:
            return 1
    return -1

def event_new(data: pd.DataFrame, coeff, rate_low=2, rate_high=2):
    data['next_val'] = data['C1'].shift(periods=-1).ffill().astype('int')
    data['low_bound'] = (data['next_val'] - rate_low * coeff).astype('int')
    data['high_bound'] = (data['next_val'] + rate_high * coeff).astype('int')
    c2 = data['C2'].to_numpy()
    c3 = data['C3'].to_numpy()
    # x.name is the row's index label; get_loc turns it into a position
    data['C4'] = data.apply(
        lambda x: get_c4(
            x.low_bound,
            x.high_bound,
            c2[data.index.get_loc(x.name) + 1:],
            c3[data.index.get_loc(x.name) + 1:],
        ),
        axis=1,
    )
    # drop() returns a new frame, so keep the result
    data = data.drop(columns=['next_val', 'low_bound', 'high_bound'])
    return data
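The part described above as easy to vectorize, the per-row bounds, can be sketched with plain NumPy arrays. This is an illustration under assumptions rather than the answerer's code: `compute_bounds` is a hypothetical name, and the last row reuses its own C1 value, mirroring the `shift(periods=-1).ffill()` step.

```python
import numpy as np


def compute_bounds(c1, coeff, rate_low=2, rate_high=2):
    """Compute low/high bounds for every row at once (no Python loop)."""
    c1 = np.asarray(c1)
    coeff = np.asarray(coeff)
    # next_val[k] = c1[k + 1]; the last row has no successor, so it keeps c1[-1]
    next_val = np.append(c1[1:], c1[-1])
    low_bound = next_val - rate_low * coeff
    high_bound = next_val + rate_high * coeff
    return low_bound, high_bound


low, high = compute_bounds([2, 10, 8, 30], [3, 3, 5, 7])
print(low.tolist())   # [4, 2, 20, 16]
print(high.tolist())  # [16, 14, 40, 44]
```

Only this step is vectorized here. The scan that decides between 0, 1 and -1 still has to look ahead row by row, which is why it stays a loop.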
Benchmark code:
import timeit

# event is the function from the question; random_list is a helper not shown in this answer
for n in [1e2, 1e3, 1e4, 1e5, 1e6]:
    n = int(n)
    df = pd.DataFrame({'C1': random_list(n=n), 'C2': random_list(n=n), 'C3': random_list(n=n)})
    coeff = pd.Series(random_list(start=2, stop=7, n=n))
    print(f"n={n}:")
    print(f"Time org: {timeit.timeit(lambda: event(df.copy(), coeff), number=1):.3f} seconds")
    print(f"Time new: {timeit.timeit(lambda: event_new(df.copy(), coeff), number=1):.3f} seconds")
Output:
n=100:
Time org: 0.007 seconds
Time new: 0.012 seconds
n=1000:
Time org: 0.070 seconds
Time new: 0.048 seconds
n=10000:
Time org: 0.854 seconds
Time new: 0.493 seconds
n=100000:
Time org: 7.565 seconds
Time new: 4.456 seconds
n=1000000:
Time org: 216.408 seconds
Time new: 45.199 seconds
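The ratio between the two timings makes the trend easier to see. The snippet below only does arithmetic on the numbers printed above:

```python
# (n, original seconds, new seconds), copied from the benchmark output
timings = [
    (100, 0.007, 0.012),
    (1_000, 0.070, 0.048),
    (10_000, 0.854, 0.493),
    (100_000, 7.565, 4.456),
    (1_000_000, 216.408, 45.199),
]

# Speedup = original time / new time; values below 1 mean the new code is slower
speedups = {n: org / new for n, org, new in timings}
for n, ratio in speedups.items():
    print(f"n={n}: {ratio:.2f}x")
```

The new version is slower at n=100 (about 0.58x), roughly 1.5x to 1.7x faster from n=1,000 to n=100,000, and about 4.8x faster at n=1,000,000.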