1 Answer

Unless memory overhead becomes the bottleneck, I'd expect this approach to be slower. That said, have you tried subsetting df2 using the indices returned from a groupby on df1's col2? See the example below for what I mean.
I suppose another option would be to consider a map-reduce framework (e.g., pyspark)?
import numpy as np
import pandas as pd

# two toy datasets
df1 = pd.DataFrame({'col1': np.random.choice(np.arange(10), size=20),
                    'col2': np.random.choice(np.arange(10), size=20)})
df2 = pd.DataFrame({'colOther': np.random.choice(np.arange(10), size=5),
                    'col2': np.random.choice(np.arange(10), size=5)})

# make sure we don't use values of col2 that df2 doesn't contain
df1 = df1[df1['col2'].isin(df2['col2'])]

# set col2 as the index for faster lookups with .loc
df2_col2_idx = df2.set_index('col2')

# iterate over the groups rather than merge
for i, group in df1.groupby('col1'):
    subset = df2_col2_idx.loc[group['col2'], :]
    # apply some function to the subset here
    # note 'i' is the col1 group key
    print(i, subset['colOther'].mean())
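To see that iterating over the groups really does give the same numbers as a plain merge, here is a self-contained sanity check. The toy data, the fixed seed, and the use of mean as the per-group function are all assumptions for illustration:

```python
import numpy as np
import pandas as pd

# reproducible toy data (seed is an arbitrary choice)
rng = np.random.default_rng(0)
df1 = pd.DataFrame({'col1': rng.integers(0, 10, 20),
                    'col2': rng.integers(0, 10, 20)})
df2 = pd.DataFrame({'colOther': rng.integers(0, 10, 5),
                    'col2': rng.integers(0, 10, 5)})

# drop df1 rows whose col2 value never appears in df2
df1 = df1[df1['col2'].isin(df2['col2'])]
df2_col2_idx = df2.set_index('col2')

# per-group means via the loop over groups
loop_means = {i: df2_col2_idx.loc[group['col2'], 'colOther'].mean()
              for i, group in df1.groupby('col1')}

# the same numbers via an explicit merge, for comparison
merged = df1.merge(df2, on='col2')
merge_means = merged.groupby('col1')['colOther'].mean().to_dict()
```

Both dictionaries hold one value per col1 group key; the loop just avoids materializing the full merged frame at once.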
Update: incorporating @max's suggestion from the comments to apply a function over the groups:
# some_aggregating_function is a placeholder for whatever aggregation you need
df1.groupby('col1').apply(
    lambda x: df2_col2_idx.loc[x['col2'], 'colOther'].agg(some_aggregating_function))
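A concrete, runnable version of that one-liner might look like the following; the toy frames, the fixed seed, and 'mean' as the aggregating function are assumptions standing in for your real data and function:

```python
import numpy as np
import pandas as pd

# toy data matching the example above (seed is an arbitrary choice)
rng = np.random.default_rng(0)
df1 = pd.DataFrame({'col1': rng.integers(0, 10, 20),
                    'col2': rng.integers(0, 10, 20)})
df2 = pd.DataFrame({'colOther': rng.integers(0, 10, 5),
                    'col2': rng.integers(0, 10, 5)})
df1 = df1[df1['col2'].isin(df2['col2'])]
df2_col2_idx = df2.set_index('col2')

# one aggregated value per col1 group; 'mean' stands in for
# whatever aggregating function you actually need
result = df1.groupby('col1').apply(
    lambda x: df2_col2_idx.loc[x['col2'], 'colOther'].agg('mean'))
```

The result is a Series indexed by the col1 group keys, so it can be joined back onto df1 if needed.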