1 Answer

Unless memory overhead becomes the bottleneck, I'd expect this approach to be slower. That said, have you tried subsetting df2 using the indices returned from a groupby on df1's col2? See the example below for what I mean.
I suppose another option would be to consider a map-reduce framework (e.g., pyspark)?
import numpy as np
import pandas as pd

# two toy datasets
df1 = pd.DataFrame({'col1': np.random.choice(np.arange(10), size=20),
                    'col2': np.random.choice(np.arange(10), size=20)})
df2 = pd.DataFrame({'colOther': np.random.choice(np.arange(10), size=5),
                    'col2': np.random.choice(np.arange(10), size=5)})

# make sure we don't use values of col2 that df2 doesn't contain
df1 = df1[df1['col2'].isin(df2['col2'])]

# set col2 as the index for faster lookups with .loc
df2_col2_idx = df2.set_index('col2')

# iterate over the groups rather than merge
for i, group in df1.groupby('col1'):
    subset = df2_col2_idx.loc[group['col2'], :]
    # apply some function to the subset here
    # note 'i' is the col1 group key
    print(i, subset['colOther'].mean())
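To see that iterating over the groups really does give the same numbers as a plain merge, here is a self-contained sanity check. The toy data, the fixed seed, and the use of mean as the per-group function are all assumptions for illustration:

```python
import numpy as np
import pandas as pd

# reproducible toy data (seed is an arbitrary choice)
rng = np.random.default_rng(0)
df1 = pd.DataFrame({'col1': rng.integers(0, 10, 20),
                    'col2': rng.integers(0, 10, 20)})
df2 = pd.DataFrame({'colOther': rng.integers(0, 10, 5),
                    'col2': rng.integers(0, 10, 5)})

# drop df1 rows whose col2 value never appears in df2
df1 = df1[df1['col2'].isin(df2['col2'])]
df2_col2_idx = df2.set_index('col2')

# per-group means via the loop over groups
loop_means = {i: df2_col2_idx.loc[group['col2'], 'colOther'].mean()
              for i, group in df1.groupby('col1')}

# the same numbers via an explicit merge, for comparison
merged = df1.merge(df2, on='col2')
merge_means = merged.groupby('col1')['colOther'].mean().to_dict()
```

Both dictionaries hold one value per col1 group key; the loop just avoids materializing the full merged frame at once.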
Update: incorporating @max's suggestion from the comments to apply a function over the groups:
# some_aggregating_function is a placeholder for whatever aggregation you need
df1.groupby('col1').apply(
    lambda x: df2_col2_idx.loc[x['col2'], 'colOther'].agg(some_aggregating_function))
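A concrete, runnable version of that one-liner might look like the following; the toy frames, the fixed seed, and 'mean' as the aggregating function are assumptions standing in for your real data and function:

```python
import numpy as np
import pandas as pd

# toy data matching the example above (seed is an arbitrary choice)
rng = np.random.default_rng(0)
df1 = pd.DataFrame({'col1': rng.integers(0, 10, 20),
                    'col2': rng.integers(0, 10, 20)})
df2 = pd.DataFrame({'colOther': rng.integers(0, 10, 5),
                    'col2': rng.integers(0, 10, 5)})
df1 = df1[df1['col2'].isin(df2['col2'])]
df2_col2_idx = df2.set_index('col2')

# one aggregated value per col1 group; 'mean' stands in for
# whatever aggregating function you actually need
result = df1.groupby('col1').apply(
    lambda x: df2_col2_idx.loc[x['col2'], 'colOther'].agg('mean'))
```

The result is a Series indexed by the col1 group keys, so it can be joined back onto df1 if needed.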