1 Answer

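A few tricks make this work:
- Use pd.factorize() to convert the categorical columns into a numeric code for each category.
- Compute a value/factor f that represents each group/subgroup pair.
- Randomise it slightly with np.random.uniform(), using a minimum and maximum close to 1.
- Once a value represents the grouping, sort_values() and reset_index() give a clean, ordered index.
- Finally, assign the groups by integer remainder of that index.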
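To make the first and last steps concrete, here is a small sketch on made-up toy labels (the data is only an illustration and is not part of the answer). It shows what pd.factorize() returns and how taking a position modulo the number of bins deals rows out in round-robin order.

```python
import pandas as pd

# Toy labels, invented for illustration only.
labels = pd.Series(["B", "A", "B", "C", "A"])

# factorize returns (codes, uniques); codes follow order of first appearance.
codes, uniques = pd.factorize(labels)
print(codes.tolist())   # [0, 1, 0, 2, 1]
print(list(uniques))    # ['B', 'A', 'C']

# Round-robin assignment: row position modulo the number of bins.
bins = 2
assigned = [i % bins for i in range(len(labels))]
print(assigned)         # [0, 1, 0, 1, 0]
```

Because the codes are plain integers, they can be combined arithmetically into a single sort key, which is what the factor f does.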
import random

import numpy as np
import pandas as pd

group = list("ABCD")
subgroup = list("abcdef")
# 300 random rows: a group, a subgroup and a value between 1 and 3
df = pd.DataFrame([{"group": random.choice(group),
                    "subgroup": random.choice(subgroup),
                    "value": random.randint(1, 3)} for i in range(300)])
bins = 6
dfc = df.assign(
    # take into account concentration of group and subgroup
    # randomise a bit....
    f=((pd.factorize(df["group"])[0] + 1) * 10 +
       (pd.factorize(df["subgroup"])[0] + 1)
       * np.random.uniform(0.99, 1.01, len(df))
       ),
).sort_values("f").reset_index(drop=True).assign(
    # bin label = position in the sorted frame modulo the number of bins
    gc=lambda dfa: dfa.index % bins
).drop(columns="f")
# check distribution ... used plot for SO (needs matplotlib)
dfc.groupby(["gc", "group", "subgroup"]).count().unstack(0).plot(kind="barh")
# every group same size...
# dfc.groupby("gc").count()
# now it's easy to get each of the cuts.... 0 through 5
# dfcut0 = dfc.query("gc==0").drop(columns="gc").copy().reset_index(drop=True)
# dfcut0
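As a possible follow-up, the same idea can be wrapped in a function that returns every cut at once. This is only a sketch: the name split_balanced, the seed parameter and the demo data are assumptions introduced here, not part of the answer. It also uses np.random.default_rng so the result is reproducible.

```python
import numpy as np
import pandas as pd


def split_balanced(df, bins, seed=0):
    """Hypothetical helper: sort by a group/subgroup key, then deal rows out by remainder."""
    rng = np.random.default_rng(seed)
    # Same key as above: group code * 10 plus a slightly jittered subgroup code.
    f = ((pd.factorize(df["group"])[0] + 1) * 10
         + (pd.factorize(df["subgroup"])[0] + 1)
         * rng.uniform(0.99, 1.01, len(df)))
    ordered = (df.assign(f=f)
                 .sort_values("f")
                 .reset_index(drop=True)
                 .drop(columns="f"))
    # Bin label for each row: its position modulo the number of bins.
    gc = np.arange(len(ordered)) % bins
    return [ordered[gc == k].reset_index(drop=True) for k in range(bins)]


# Made-up demo data with the same shape as the example above.
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    "group": rng.choice(list("ABCD"), 300),
    "subgroup": rng.choice(list("abcdef"), 300),
    "value": rng.integers(1, 4, 300),
})

cuts = split_balanced(demo, bins=6)
print([len(c) for c in cuts])  # [50, 50, 50, 50, 50, 50]
```

Since rows of the same group/subgroup pair end up next to each other after sorting, dealing them out by remainder means the number of rows for any pair differs by at most one between cuts.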