Edit: I replaced np.random.shuffle(A) with A = np.random.permutation(A). The only difference is that permutation does not modify the input array. It makes no difference in this code, but it is generally safer.
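To make that difference concrete, here is a minimal sketch on a small made-up array (the array and the fixed seed are assumptions for illustration only):

```python
import numpy as np

np.random.seed(0)  # assumption: fixed seed only so the demo is repeatable

original = np.arange(10).reshape(5, 2)

# permutation returns a shuffled copy; the rows of `original` stay in place.
shuffled_copy = np.random.permutation(original)
unchanged_after_permutation = np.array_equal(
    original, np.arange(10).reshape(5, 2)
)

# shuffle reorders the rows of its argument in place and returns None.
in_place = original.copy()
result = np.random.shuffle(in_place)

print(unchanged_after_permutation)  # True
print(result is None)               # True
```

Both functions shuffle along the first axis only, so each row stays intact and only the row order changes.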
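The idea is to sample the input randomly by shuffling its rows with numpy.random.permutation. Once the rows are shuffled, we only need to iterate over all possible test sets: a sliding window of the desired size (here, 20% of the input size). The corresponding training set is simply made up of all the remaining rows.
Because the input was shuffled, every subset approximately preserves the original class distribution, even though the windows are taken in order.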
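As a rough check of that claim, the sketch below measures how far the class proportions of each test window drift from the overall proportions. The data is synthetic and hypothetical (1000 rows, three classes, labels in the last column), and class_proportions is a helper name introduced only for this illustration:

```python
import numpy as np

np.random.seed(0)  # assumption: fixed seed for repeatability

# Hypothetical data: 1000 rows, last column is a class label in {0, 1, 2}
# drawn with probabilities 0.5, 0.3 and 0.2.
labels = np.random.choice(3, size=1000, p=[0.5, 0.3, 0.2])
data = np.column_stack([np.random.rand(1000, 2), labels])

shuffled = np.random.permutation(data)
window = int(0.2 * shuffled.shape[0])


def class_proportions(rows, n_classes=3):
    """Return the fraction of rows belonging to each class."""
    counts = np.bincount(rows[:, -1].astype(int), minlength=n_classes)
    return counts / rows.shape[0]


overall = class_proportions(data)
max_gap = 0.0
for start in range(0, shuffled.shape[0], window):
    test = shuffled[start:start + window]
    gap = np.abs(class_proportions(test) - overall).max()
    max_gap = max(max_gap, gap)

print(f"largest per-class deviation across test windows: {max_gap:.3f}")
```

The deviation is small but not zero: each window is a random sample, so its proportions match the original ones only up to sampling noise, and the noise grows as the windows get smaller.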
The following code iterates over the test/training set combinations:
import numpy as np


def csv_to_array(file):
    """Load a comma-separated numeric file into a 2-D array."""
    with open(file, 'r') as f:
        data = np.loadtxt(f, delimiter=',')
    return data


def classes_distribution(A):
    """Print the class distributions of array A."""
    # Assumes the last column holds class labels numbered 0..nb_classes-1.
    nb_classes = np.unique(A[:, -1]).shape[0]
    total_size = A.shape[0]
    for i in range(nb_classes):
        class_size = sum(row[-1] == i for row in A)
        class_p = class_size / total_size
        print(f"\t P(class_{i}) = {class_p:.3f}")


def random_samples(A, test_set_p=0.2):
    """Split the input array A into two uniformly chosen
    random sets: test/training.
    Repeat this until every row has been yielded at least
    once as part of a test set."""
    A = np.random.permutation(A)  # shuffled copy; the input is left untouched
    sample_size = int(test_set_p * A.shape[0])
    for start in range(0, A.shape[0], sample_size):
        end = start + sample_size
        yield {
            "test": A[start:end],
            # The training set is every row outside the current window.
            "train": np.append(A[:start], A[end:], 0)
        }


def main():
    ecoli = csv_to_array('ecoli.csv')
    print("Input set shape: ", ecoli.shape)
    print("Input set class distribution:")
    classes_distribution(ecoli)
    print("Training sets class distributions:")
    for iteration in random_samples(ecoli):
        test_set = iteration["test"]
        training_set = iteration["train"]
        classes_distribution(training_set)
        print("---")
        # ... Do whatever with these two sets


main()
It produces output of the following form:
Input set shape: (169, 8)
Input set class distribution:
    P(class_0) = 0.308
    P(class_1) = 0.213
    P(class_2) = 0.207
    P(class_3) = 0.118
    P(class_4) = 0.154
Training sets class distributions:
    P(class_0) = 0.316
    P(class_1) = 0.206
    P(class_2) = 0.199
    P(class_3) = 0.118
    P(class_4) = 0.162
...
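One detail worth noting: with the 169-row input shown above and test_set_p=0.2, int(0.2 * 169) is 33, so the windows start at 0, 33, 66, 99, 132 and 165, and the sixth test set holds only the last 4 rows. If evenly sized test sets are preferred, one possible variant is sketched below. It swaps the fixed-size window for numpy.array_split; even_random_samples and its n_folds parameter are hypothetical names, not part of the code above:

```python
import numpy as np


def even_random_samples(A, n_folds=5):
    """Hypothetical variant of random_samples: split the shuffled rows
    into n_folds test sets whose sizes differ by at most one row."""
    A = np.random.permutation(A)
    folds = np.array_split(np.arange(A.shape[0]), n_folds)
    for test_idx in folds:
        # Boolean mask selecting every row outside the current test fold.
        mask = np.ones(A.shape[0], dtype=bool)
        mask[test_idx] = False
        yield {"test": A[test_idx], "train": A[mask]}


# Made-up array with the same number of rows as the example input.
demo = np.arange(169 * 2).reshape(169, 2)
sizes = [len(s["test"]) for s in even_random_samples(demo)]
print(sizes)  # [34, 34, 34, 34, 33]
```

The trade-off is that the test fraction is then set by the number of folds rather than by an exact percentage.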