1 回答

TA貢獻(xiàn)1155條經(jīng)驗(yàn) 獲得超0個(gè)贊
Groupby 不使用矢量化,但它具有使用 Cython 優(yōu)化的聚合函數(shù)。
你可以取平均值:
(df["Spend"] > LIMIT).groupby(df["Name"]).mean()
df["Spend"].gt(LIMIT).groupby(df["Name"]).mean()
或者用div0 替換 NaN:
df[df["Spend"] > LIMIT].groupby("Name").size() \
.div(df.groupby("Name").size(), fill_value = 0)
df["Spend"].gt(LIMIT).groupby(df["Name"]).sum() \
.div(df.groupby("Name").size(), fill_value = 0)
以上每個(gè)都會(huì)產(chǎn)生
Name
Alice 0.0
Bob 0.5
Charles 1.0
dtype: float64
表現(xiàn)
取決于每個(gè)條件過(guò)濾的行數(shù)和行數(shù),因此最好在真實(shí)數(shù)據(jù)上進(jìn)行測(cè)試。
np.random.seed(123)
N = 100000
df = pd.DataFrame({
"Name": np.random.randint(1000, size = N),
"Spend": np.random.randint(10, size = N)
})
LIMIT = 6
In [10]: %timeit df["Spend"].gt(LIMIT).groupby(df["Name"]).mean()
6.16 ms ± 332 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [11]: %timeit df[df["Spend"] > LIMIT].groupby("Name").size().div(df.groupby("Name").size(), fill_value = 0)
6.35 ms ± 95.1 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [12]: %timeit df["Spend"].gt(LIMIT).groupby(df["Name"]).sum().div(df.groupby("Name").size(), fill_value = 0)
9.66 ms ± 365 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# RafaelC comment solution
In [13]: %timeit df.groupby("Name")["Spend"].apply(lambda s: (s > LIMIT).sum() / s.size)
400 ms ± 27.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [14]: %timeit df.groupby("Name")["Spend"].apply(lambda s: (s > LIMIT).mean())
328 ms ± 6.12 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
這個(gè) NumPy 解決方案是矢量化的,但有點(diǎn)復(fù)雜:
In [15]: %%timeit
...: i, r = pd.factorize(df["Name"])
...: a = pd.Series(np.bincount(i), index = r)
...:
...: i1, r1 = pd.factorize(df["Name"].values[df["Spend"].values > LIMIT])
...: b = pd.Series(np.bincount(i1), index = r1)
...:
...: df1 = b.div(a, fill_value = 0)
...:
5.05 ms ± 82.7 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
添加回答
舉報(bào)