
Stratifying a dataset while avoiding index contamination?

Python

泛舟湖上清波郎朗 2023-08-15 17:23:26

As a reproducible example, I have the following dataset:

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

data = np.random.randint(0,20,size=(300, 5))
df = pd.DataFrame(data, columns=['ID', 'A', 'B', 'C', 'D'])
df = df.set_index(['ID'])
df.head()

Out:
     A   B   C   D
ID
12   3  14   4   7
9    5   9   8   4
12  18  17   3  14
1    0  10   1   0
9   10   5  11   9

I need to perform a 70%-30% stratified split (on y), which I know would look like this:

# Train/Test Split
X = df.iloc[:,0:-1] # Columns A, B, and C
y = df.iloc[:,-1] # Column D
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.70, test_size = 0.30, stratify = y)

However, while I want the train and test sets to have the same (or a sufficiently similar) distribution of "D", I don't want the same "ID" to appear in both train and test. How can I do this?
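To make the problem concrete, here is a minimal sketch that counts how many IDs end up on both sides of a plain stratified split. It assumes ID is kept as an ordinary column instead of the index, and it fixes the random seeds only so the run is repeatable; neither choice comes from the original snippet.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Same shape of data as in the question; the seed is an assumption for repeatability.
rng = np.random.default_rng(0)
data = rng.integers(0, 20, size=(300, 5))
df = pd.DataFrame(data, columns=['ID', 'A', 'B', 'C', 'D'])

# Keep ID as a column here so it can be inspected after the split.
X = df[['ID', 'A', 'B', 'C']]
y = df['D']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.70, test_size=0.30, stratify=y, random_state=0
)

# IDs that occur in both sets are the contamination the question wants to avoid.
shared_ids = set(X_train['ID']) & set(X_test['ID'])
print(f'{len(shared_ids)} of {df["ID"].nunique()} IDs appear in both sets')
```

Since each ID covers about 15 rows on average, a row-level split is very likely to scatter most IDs across both sets.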


1 Answer

飲歌長嘯


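Edit: One way to do (something like) what you ask is to store the IDs by class, then for each class take 70% of the IDs, put the samples with those IDs into train, and put the rest into test.

Note that this still does not guarantee identical distributions if the IDs occur a different number of times. Moreover, since each ID can belong to several classes in D and must not be shared between the train and test sets, seeking identical distributions becomes a complex optimization problem. Each ID can be placed only in train or only in test, and it brings a variable number of classes into the set it is assigned to, depending on the classes it has across all the rows in which it occurs.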
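To see how much this matters for data shaped like the question's, the following sketch counts the rows and the distinct D classes for each ID. The seed and the names `summary`, `rows_per_id`, and `classes_per_id` are illustrative assumptions, not part of the original code.

```python
import numpy as np
import pandas as pd

# Same shape of data as in the question; the seed is an assumption for repeatability.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 20, size=(300, 5)),
                  columns=['ID', 'A', 'B', 'C', 'D'])

# For every ID: how many rows it has and how many different classes of D it spans.
rows_per_id = df.groupby('ID').size()
classes_per_id = df.groupby('ID')['D'].nunique()
summary = pd.DataFrame({'rows': rows_per_id, 'distinct_classes': classes_per_id})

print(summary.describe().loc[['min', 'mean', 'max']])
```

When most IDs span several classes, moving one ID between train and test shifts several class counts at once, which is why an exact match of the two distributions is hard to reach.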

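A fairly simple way to split the data while approximately balancing the distributions is to iterate over the classes in random order and count each ID toward only one of the classes in which it appears. The ID is then assigned to train or test with all of its rows and removed from consideration for later classes.

I find that treating ID as a column helps with this task, so I changed the code you provided as follows: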

# Given snippet (modified)
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

data = np.random.randint(0,20,size=(300, 5))
df = pd.DataFrame(data, columns=['ID', 'A', 'B', 'C', 'D'])

Proposed solution:

import random
from collections import defaultdict

classes = df.D.unique().tolist() # get unique classes,
random.shuffle(classes)          # shuffle to eliminate positional biases
ids_by_class = defaultdict(list)

# iterate over classes
temp_df = df.copy()
for c in classes:
    c_rows = temp_df.loc[temp_df['D'] == c] # rows with given class
    ids = c_rows.ID.unique().tolist()       # IDs in these rows
    ids_by_class[c].extend(ids)

    # remove ids so they cannot be taken into account for other classes
    temp_df = temp_df[~temp_df.ID.isin(ids)]

# now construct ids split, class by class
train_ids, test_ids = [], []
for c, ids in ids_by_class.items():
    random.shuffle(ids) # shuffling can eliminate positional biases

    # split the IDs
    split = int(len(ids)*0.7) # split at 70%
    train_ids.extend(ids[:split])
    test_ids.extend(ids[split:])

# finally use the ids in train and test to get the
# data split from the original df
train = df.loc[df['ID'].isin(train_ids)]
test = df.loc[df['ID'].isin(test_ids)]

Let's check that the split ratio is roughly 70/30, that no data is lost, and that no IDs are shared between the train and test dataframes:

# 1) check that elements in Train are roughly 70% and Test 30% of original df
print(f'Numbers of elements in train: {len(train)}, test: {len(test)}| Perfect split would be train: {int(len(df)*0.7)}, test: {int(len(df)*0.3)}')

# 2) check that concatenating Train and Test gives back the original df
train_test = pd.concat([train, test]).sort_values(by=['ID', 'A', 'B', 'C', 'D']) # concatenate dataframes into one, and sort
sorted_df = df.sort_values(by=['ID', 'A', 'B', 'C', 'D']) # sort original df
assert train_test.equals(sorted_df) # check equality

# 3) check that the IDs are not shared between train/test sets
train_id_set = set(train.ID.unique().tolist())
test_id_set = set(test.ID.unique().tolist())
assert len(train_id_set.intersection(test_id_set)) == 0

Example output:

Numbers of elements in train: 209, test: 91| Perfect split would be train: 210, test: 90

Numbers of elements in train: 210, test: 90| Perfect split would be train: 210, test: 90

Numbers of elements in train: 227, test: 73| Perfect split would be train: 210, test: 90

2023-08-15
