
Pandas DataFrames: efficiently find the next value in one column where another column has a greater value

Python

交互式愛情 2023-09-05 20:22:11

The title describes my situation. I already have a working version of this, but it is very inefficient when scaled to large DataFrames (>1M rows). I am wondering whether anyone has a better idea for doing this.

Example with solution and code

Create a new column, next_time, that holds the next time value at which the price column is greater than in the current row.

import pandas as pd

df = pd.DataFrame({'time': [15, 30, 45, 60, 75, 90], 'price': [10.00, 10.01, 10.00, 10.01, 10.02, 9.99]})
print(df)

   time  price
0    15  10.00
1    30  10.01
2    45  10.00
3    60  10.01
4    75  10.02
5    90   9.99

series_to_concat = []
for price in df['price'].unique():
    # Rows whose price equals the current unique price
    index_equal_to_price = df[df['price'] == price].index
    # Times of all rows with a strictly greater price
    series_time_greater_than_price = df[df['price'] > price]['time']
    # Backfill so that each row picks up the next later time with a greater price
    time_greater_than_price_backfilled = series_time_greater_than_price.reindex(
        index_equal_to_price.union(series_time_greater_than_price.index)).bfill()
    series_to_concat.append(time_greater_than_price_backfilled.reindex(index_equal_to_price))

df['next_time'] = pd.concat(series_to_concat, sort=False)
print(df)

   time  price  next_time
0    15  10.00       30.0
1    30  10.01       75.0
2    45  10.00       60.0
3    60  10.01       75.0
4    75  10.02        NaN
5    90   9.99        NaN

This gets me the desired result. When scaled to large DataFrames, the computation can take several minutes. Does anyone have a better idea of how to approach this?

Edit: clarification of constraints

We can assume the DataFrame is sorted by time. Another way to word it: given any row n (Time_n, Price_n), 0 <= n <= len(df) - 1, find x such that Time_x > Time_n AND Price_x > Price_n AND there is no y such that n < y < x and Price_y > Price_n.


3 Answers

慕斯王

Contributed 1864 experience points, received more than 2 likes

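David did come up with a nice solution that finds the closest higher price at a later time. However, what I actually wanted was the next higher price at a later time. Together with a colleague, I found this solution: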

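Keep a stack of (index, price) tuples.

Iterate over all rows (index i).
While the stack is non-empty and the price at the top of the stack is lower than prices[i], pop the top entry and fill next_times at the popped index with times[i].
Push (i, prices[i]) onto the stack.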

import numpy as np
import pandas as pd

df = pd.DataFrame({'time': [15, 30, 45, 60, 75, 90], 'price': [10.00, 10.01, 10.00, 10.01, 10.02, 9.99]})
print(df)

   time  price
0    15  10.00
1    30  10.01
2    45  10.00
3    60  10.01
4    75  10.02
5    90   9.99

times = df['time'].to_numpy()
prices = df['price'].to_numpy()
stack = []  # (index, price) of rows still waiting for a greater price
next_times = np.full(len(df), np.nan)
for i in range(len(df)):
    # The current row resolves every waiting row with a strictly lower price
    while stack and prices[i] > stack[-1][1]:
        stack_time_index, stack_price = stack.pop()
        next_times[stack_time_index] = times[i]
    stack.append((i, prices[i]))
df['next_time'] = next_times
print(df)

   time  price  next_time
0    15  10.00       30.0
1    30  10.01       75.0
2    45  10.00       60.0
3    60  10.01       75.0
4    75  10.02        NaN
5    90   9.99        NaN

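This solution actually runs very fast. I am not completely sure, but I believe the complexity is close to O(n), since it is a single full pass over the DataFrame in which each row is pushed onto the stack once and popped at most once. It performs so well because the stack is naturally sorted, with the largest prices at the bottom and the smallest at the top.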
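To make the O(n) argument above concrete, here is a minimal sketch that counts stack operations on synthetic data. The function name next_greater_with_counts, the random-walk prices, and the seed are assumptions for illustration only; this variant also stores just the index on the stack instead of an (index, price) tuple.

```python
import numpy as np


def next_greater_with_counts(times, prices):
    """Monotonic-stack pass that also counts pushes and pops (illustrative)."""
    stack = []  # indices of rows still waiting for a greater price
    next_times = np.full(len(prices), np.nan)
    pushes = pops = 0
    for i in range(len(prices)):
        # Resolve every waiting row whose price is strictly lower than the current one
        while stack and prices[i] > prices[stack[-1]]:
            next_times[stack.pop()] = times[i]
            pops += 1
        stack.append(i)
        pushes += 1
    return next_times, pushes, pops


rng = np.random.default_rng(0)  # arbitrary seed, synthetic data only
n = 10_000
times = np.arange(n)
prices = rng.normal(size=n).cumsum().round(2)  # made-up random-walk prices

next_times, pushes, pops = next_greater_with_counts(times, prices)
print(pushes, pops)
```

Every row is pushed exactly once and can be popped at most once, so the inner while loop does at most n iterations in total across the whole pass, however the prices are ordered.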

Here is my test on the actual DataFrame:

print(f'{len(df):,.0f} rows with {len(df["price"].unique()):,.0f} unique prices ranging from ${df["price"].min():,.2f} to ${df["price"].max():,.2f}')

667,037 rows with 11,786 unique prices ranging from $1,857.52 to $2,022.00

def find_next_time_with_greater_price(df):
    times = df['time'].to_numpy()
    prices = df['price'].to_numpy()
    stack = []
    next_times = np.full(len(df), np.nan)
    for i in range(len(df)):
        while stack and prices[i] > stack[-1][1]:
            stack_time_index, stack_price = stack.pop()
            next_times[stack_time_index] = times[i]
        stack.append((i, prices[i]))
    return next_times

%timeit -n10 -r10 df['next_time'] = find_next_time_with_greater_price(df)

434 ms ± 11.8 ms per loop (mean ± std. dev. of 10 runs, 10 loops each)
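The timing above says nothing about correctness on larger inputs, so a randomized cross-check can be useful. The sketch below compares the stack-based function with a slow forward scan written straight from the constraint in the question. The brute_force helper, the synthetic sample, and the seed are assumptions for illustration, not part of the original test.

```python
import numpy as np
import pandas as pd


def find_next_time_with_greater_price(df):
    # Same stack-based pass as in the answer
    times = df['time'].to_numpy()
    prices = df['price'].to_numpy()
    stack = []
    next_times = np.full(len(df), np.nan)
    for i in range(len(df)):
        while stack and prices[i] > stack[-1][1]:
            stack_time_index, stack_price = stack.pop()
            next_times[stack_time_index] = times[i]
        stack.append((i, prices[i]))
    return next_times


def brute_force(df):
    # Hypothetical reference helper: for each row, scan forward
    # to the first later row with a strictly greater price
    times = df['time'].to_numpy()
    prices = df['price'].to_numpy()
    out = np.full(len(df), np.nan)
    for n in range(len(df)):
        later = np.flatnonzero(prices[n + 1:] > prices[n])
        if later.size:
            out[n] = times[n + 1 + later[0]]
    return out


rng = np.random.default_rng(42)  # arbitrary seed
sample = pd.DataFrame({
    'time': np.arange(2_000) * 15,  # sorted, strictly increasing times
    'price': rng.integers(0, 50, size=2_000) / 100 + 10,  # many repeated prices
})

fast = find_next_time_with_greater_price(sample)
slow = brute_force(sample)
print(np.array_equal(fast, slow, equal_nan=True))
```

The sample deliberately contains many duplicate prices, because ties are where a non-strict comparison would give a different answer.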

2023-09-05

哆啦的時光機

Contributed 1779 experience points, received more than 6 likes

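This returned a result for me in under 7 seconds on a variation of the DataFrame with 1,000,000 rows and 162,000 unique prices. So, since you are running it on 660,000 rows with 12,000 unique prices, I think the speed-up would be 100x-1000x.

What makes your problem more complicated is that the closest higher price must occur at a later time. I had to tackle the problem from a couple of different angles (as you mentioned in your comment about my np.where(), breaking it out into a few different steps).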

import numpy as np
import pandas as pd

df = pd.DataFrame({'time': [15, 30, 45, 60, 75, 90], 'price': [10.00, 10.01, 10.00, 10.01, 10.02, 9.99]})

def bisect_right(a, x, lo=0, hi=None):
    # Index of the first element in sorted array a that is strictly greater than x
    if lo < 0:
        raise ValueError('lo must be non-negative')
    if hi is None:
        hi = len(a)
    while lo < hi:
        mid = (lo+hi)//2
        if x < a[mid]: hi = mid
        else: lo = mid+1
    return lo

def get_closest_higher(df, col, val):
    higher_idx = bisect_right(df[col].values, val)
    return higher_idx

# Sort by price so that bisect can find the closest higher price
df = df.sort_values(['price', 'time']).reset_index(drop=True)
df['next_time'] = df['price'].apply(lambda x: get_closest_higher(df, 'price', x))
df['next_time'] = df['next_time'].map(df['time'])
# Discard matches that are not at a later time
df['next_time'] = np.where(df['next_time'] <= df['time'], np.nan, df['next_time'])

# Back in time order: prefer the immediately following row if its price is higher
df = df.sort_values('time').reset_index(drop=True)
df['next_time'] = np.where((df['price'].shift(-1) > df['price']),
                           df['time'].shift(-1),
                           df['next_time'])
df['next_time'] = df['next_time'].ffill()
df['next_time'] = np.where(df['next_time'] <= df['time'], np.nan, df['next_time'])
df

Out[1]:
   time  price  next_time
0    15  10.00       30.0
1    30  10.01       75.0
2    45  10.00       60.0
3    60  10.01       75.0
4    75  10.02        NaN
5    90   9.99        NaN
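As a side note on the lookup step only: the hand-written bisect_right is applied once per row through apply, and NumPy's np.searchsorted with side='right' computes the same insertion points for the whole column in one call. The sketch below covers just that step, under the assumption that the column is already sorted by price. The names by_price, sorted_prices, and higher_idx are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'time': [15, 30, 45, 60, 75, 90],
                   'price': [10.00, 10.01, 10.00, 10.01, 10.02, 9.99]})

# Same ordering as in the answer: by price, ties broken by time
by_price = df.sort_values(['price', 'time']).reset_index(drop=True)
sorted_prices = by_price['price'].to_numpy()

# side='right' matches bisect_right: for each value, the position of the
# first element that is strictly greater (len(array) if there is none)
higher_idx = np.searchsorted(sorted_prices, sorted_prices, side='right')
print(higher_idx)
```

The remaining steps of the answer (mapping positions back to times and filtering out earlier times) would stay as they are.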

2023-09-05

喵喔喔

Contributed 1735 experience points, received more than 5 likes

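When I tested with %timeit on this sample, these solutions were faster, but when I tested on a larger DataFrame they were much slower than your solution. It would be interesting to see whether any of these 3 solutions is faster on a larger DataFrame.

I hope someone else can post a more efficient solution. Here are some different answers:

You can achieve this with a one-liner that uses next while looping through the time and price columns simultaneously with zip. The generator expression passed to next is written like a list comprehension, but with round brackets instead of square brackets, and next returns only the first value that satisfies the conditions. You also need to pass None as the second argument to next so that rows with no match return None instead of raising an error.
You need to pass axis=1 to apply so that the function is called once per row.

This should improve performance because you do not loop through the entire column: iteration stops as soon as the first matching value is returned, and apply then moves on to the next row.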
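The short-circuit behaviour of next with a generator expression can be seen in isolation. This is a small sketch in plain Python; first_match, is_greater_than_ten, and the checked list are hypothetical names used only to record which values were actually examined.

```python
def first_match(values, predicate):
    """Return the first value satisfying predicate, or None if none does."""
    return next((v for v in values if predicate(v)), None)


checked = []


def is_greater_than_ten(v):
    checked.append(v)  # record every value the generator actually tests
    return v > 10


print(first_match([3, 12, 7, 25], is_greater_than_ten))  # 12
print(checked)                                           # [3, 12]
print(first_match([1, 2, 3], lambda v: v > 10))          # None
```

Only 3 and 12 are tested in the first call; 7 and 25 are never reached. The second argument to next is what turns "no match" into None rather than a StopIteration error.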

import pandas as pd

df = pd.DataFrame({'time': [15, 30, 45, 60, 75, 90], 'price': [10.00, 10.01, 10.00, 10.01, 10.02, 9.99]})
print(df)

   time  price
0    15  10.00
1    30  10.01
2    45  10.00
3    60  10.01
4    75  10.02
5    90   9.99

df['next_time'] = (df.apply(lambda x: next((z for (y, z) in zip(df['price'], df['time'])
                                            if y > x['price'] if z > x['time']), None), axis=1))
df

Out[1]:
   time  price  next_time
0    15  10.00       30.0
1    30  10.01       75.0
2    45  10.00       60.0
3    60  10.01       75.0
4    75  10.02        NaN
5    90   9.99        NaN

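As you can see below, a list comprehension returns the same result, but in theory it should be a lot slower, because every row scans the whole column and the total number of iterations grows significantly, especially for large DataFrames.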

df['next_time'] = (df.apply(lambda x: [z for (y, z) in zip(df['price'], df['time'])
                                       if y > x['price'] if z > x['time']], axis=1)).str[0]
df

Out[2]:
   time  price  next_time
0    15  10.00       30.0
1    30  10.01       75.0
2    45  10.00       60.0
3    60  10.01       75.0
4    75  10.02        NaN
5    90   9.99        NaN

Another option is to create a function with some numpy and np.where():

import numpy as np

def closest(x):
    try:
        # First time at which each running-maximum price was reached
        lst = df.groupby(df['price'].cummax())['time'].transform('first')
        lst = np.asarray(lst)
        # Keep only times later than x and pick the one closest to x
        lst = lst[lst>x]
        idx = (np.abs(lst - x)).argmin()
        return lst[idx]
    except ValueError:
        # No later time available: argmin fails on an empty array
        return None

df['next_time'] = np.where((df['price'].shift(-1) > df['price']),
                            df['time'].shift(-1),
                            df['time'].apply(lambda x: closest(x)))

2023-09-05
