1 Answer

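Edit: One way to do (something like) what you are asking is to store the IDs by class, then for each class take 70% of the IDs and put the samples with those IDs into train and the rest into the test set.
Note that this still does not guarantee identical distributions if the IDs occur a different number of times each. Moreover, since each ID can belong to multiple classes in D and must not be shared between the train and test sets, seeking identical distributions becomes a complex optimization problem. This is because each ID can only go into either train or test, and wherever it goes it adds a variable number of rows per class to that set, depending on the classes the ID has across all the rows in which it appears.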
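To make that concrete, here is a minimal sketch on a hypothetical toy frame (the name `toy` and its values are made up for illustration, not taken from the question). It counts how many distinct classes each ID touches and how many rows each ID would bring to each class in whichever set it lands:

```python
import pandas as pd

# Hypothetical toy data: ID 1 appears in classes 0 and 1,
# ID 2 only in class 0, ID 3 only in class 2.
toy = pd.DataFrame({
    'ID': [1, 1, 1, 2, 2, 3],
    'D':  [0, 1, 1, 0, 0, 2],
})

# Number of distinct classes each ID touches.
classes_per_id = toy.groupby('ID')['D'].nunique()

# Rows each ID contributes to each class, wherever the ID is assigned.
rows_per_id_and_class = pd.crosstab(toy['ID'], toy['D'])

print(classes_per_id)
print(rows_per_id_and_class)
```

Assigning ID 1 to train moves one class-0 row and two class-1 rows at once, which is why the per-class proportions cannot be tuned independently.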
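A fairly simple way to split the data while approximately balancing the distributions is to iterate over the classes in random order and consider each ID for only one of the classes it appears in. All of that ID's rows are then assigned to train or test together, and the ID is removed from consideration for the later classes.
I find that treating the ID as a column helps with this task, so I changed the code you provided as follows: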
# Given snippet (modified)
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
data = np.random.randint(0,20,size=(300, 5))
df = pd.DataFrame(data, columns=['ID', 'A', 'B', 'C', 'D'])
Proposed solution:
import random
from collections import defaultdict

classes = df.D.unique().tolist()  # get unique classes
random.shuffle(classes)  # shuffle to eliminate positional biases
ids_by_class = defaultdict(list)

# iterate over classes
temp_df = df.copy()
for c in classes:
    c_rows = temp_df.loc[temp_df['D'] == c]  # rows with given class
    ids = c_rows.ID.unique().tolist()  # IDs in these rows
    ids_by_class[c].extend(ids)
    # remove ids so they cannot be taken into account for other classes
    temp_df = temp_df[~temp_df.ID.isin(ids)]

# now construct ids split, class by class
train_ids, test_ids = [], []
for c, ids in ids_by_class.items():
    random.shuffle(ids)  # shuffling can eliminate positional biases
    # split the IDs
    split = int(len(ids)*0.7)  # split at 70%
    train_ids.extend(ids[:split])
    test_ids.extend(ids[split:])

# finally use the ids in train and test to get the
# data split from the original df
train = df.loc[df['ID'].isin(train_ids)]
test = df.loc[df['ID'].isin(test_ids)]
Let's test that the split ratio roughly matches 70/30, that no data is lost, and that no IDs are shared between the train and test dataframes:
# 1) check that elements in Train are roughly 70% and Test 30% of original df
print(f'Numbers of elements in train: {len(train)}, test: {len(test)}| Perfect split would be train: {int(len(df)*0.7)}, test: {int(len(df)*0.3)}')
# 2) check that concatenating Train and Test gives back the original df
train_test = pd.concat([train, test]).sort_values(by=['ID', 'A', 'B', 'C', 'D']) # concatenate dataframes into one, and sort
sorted_df = df.sort_values(by=['ID', 'A', 'B', 'C', 'D']) # sort original df
assert train_test.equals(sorted_df) # check equality
# 3) check that the IDs are not shared between train/test sets
train_id_set = set(train.ID.unique().tolist())
test_id_set = set(test.ID.unique().tolist())
assert len(train_id_set.intersection(test_id_set)) == 0
Example output:
Numbers of elements in train: 209, test: 91| Perfect split would be train: 210, test: 90
Numbers of elements in train: 210, test: 90| Perfect split would be train: 210, test: 90
Numbers of elements in train: 227, test: 73| Perfect split would be train: 210, test: 90
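The checks above cover the row counts and the ID separation, but not how close the class distributions in D actually are. One way to inspect that is sketched below; the helper `class_distribution` and the toy frames are hypothetical names introduced for illustration, and the same function could be called on the `train` and `test` frames from the solution.

```python
import pandas as pd

def class_distribution(train, test, col='D'):
    """Return the per-class row share in train and test, side by side."""
    shares = pd.DataFrame({
        'train': train[col].value_counts(normalize=True),
        'test': test[col].value_counts(normalize=True),
    }).fillna(0.0)  # a class missing from one set gets share 0
    shares['abs_diff'] = (shares['train'] - shares['test']).abs()
    return shares.sort_index()

# Hypothetical toy split, only to show the shape of the report.
toy_train = pd.DataFrame({'ID': [1, 1, 2, 3], 'D': [0, 0, 1, 1]})
toy_test = pd.DataFrame({'ID': [4, 5], 'D': [0, 2]})

report = class_distribution(toy_train, toy_test)
print(report)
```

Large values in `abs_diff` point to classes whose share differs most between the two sets. Because the split is random, rerunning it and comparing reports is a cheap way to pick a better-balanced split.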