2 Answers

Contributor: 1998 experience points · 6+ upvotes
You can do this in a straightforward way with OneHotEncoder() and np.dot():

- Turn each element of the dataframe into a string
- Use a one-hot encoder to convert the dataframe to one-hot form over the unique vocabulary of categorical elements
- Take the dot product of the result with itself to get the co-occurrence counts
- Recreate a dataframe from the co-occurrence matrix and the feature_names of the one-hot encoder
#assuming this is your dataset
0 1 2 3
0 (-1.774, 1.145] (-3.21, 0.533] (0.0166, 2.007] (2.0, 3.997]
1 (-1.774, 1.145] (-3.21, 0.533] (2.007, 3.993] (2.0, 3.997]
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = df.astype(str)  # turn each element into a string

# get a one-hot representation of the dataframe
l = OneHotEncoder()
data = l.fit_transform(df.values)

# get the co-occurrence matrix using a dot product
co_occurrence = np.dot(data.T, data)

# get the vocab (columns and index) for the co-occurrence matrix;
# get_feature_names() adds an "x0_"-style prefix, stripped here for readability
# (newer scikit-learn versions use get_feature_names_out() instead)
vocab = [i[3:] for i in l.get_feature_names()]

# build the co-occurrence dataframe
ddf = pd.DataFrame(co_occurrence.todense(), columns=vocab, index=vocab)
print(ddf)
(-1.774, 1.145] (-3.21, 0.533] (0.0166, 2.007] \
(-1.774, 1.145] 2.0 2.0 1.0
(-3.21, 0.533] 2.0 2.0 1.0
(0.0166, 2.007] 1.0 1.0 1.0
(2.007, 3.993] 1.0 1.0 0.0
(2.0, 3.997] 2.0 2.0 1.0
(2.007, 3.993] (2.0, 3.997]
(-1.774, 1.145] 1.0 2.0
(-3.21, 0.533] 1.0 2.0
(0.0166, 2.007] 0.0 1.0
(2.007, 3.993] 1.0 1.0
(2.0, 3.997] 1.0 2.0
As you can verify from the output above, this is exactly what the co-occurrence matrix should be.
The advantage of this approach is that it scales: you can call the transform method of the fitted one-hot encoder object on new data, and most of the processing happens on sparse matrices until the final dataframe-creation step, which keeps it memory-efficient.
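To illustrate that scaling point, here is a minimal sketch (the toy data and variable names below are made up for illustration): a fitted encoder's transform can one-hot a new batch without refitting, and the sparse co-occurrence counts can simply be added together:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# toy data (made up): two rows of categorical labels
df = pd.DataFrame({0: ["a", "b"], 1: ["x", "x"]}).astype(str)
enc = OneHotEncoder(handle_unknown="ignore")
data = enc.fit_transform(df.values)

# a later batch: transform with the already-fitted encoder, no refit needed
new_batch = pd.DataFrame({0: ["a"], 1: ["x"]}).astype(str)
new_data = enc.transform(new_batch.values)

# counts stay sparse until the very end and can be accumulated batch by batch
co = (data.T @ data) + (new_data.T @ new_data)
print(co.todense())
```

Categories unseen at fit time are simply dropped by `handle_unknown="ignore"`, so streaming new batches through `transform` never changes the vocabulary.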

Contributor: 1829 experience points · 7+ upvotes
Assume your data is in a dataframe df.
You can then run two loops over the rows of the dataframe, and two more loops over the elements of each pair of rows, like this:
from collections import defaultdict

co_occurrence = defaultdict(int)
for index, row in df.iterrows():
    for index2, row2 in df.iloc[index + 1:].iterrows():
        for row_index, feature in enumerate(row):
            for feature2 in row2[row_index + 1:]:
                co_occurrence[feature, feature2] += 1
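A minimal run of the loops above on a made-up two-row dataframe (toy data, just to show what the counter holds):

```python
from collections import defaultdict
import pandas as pd

# toy data (made up): two rows, two columns of categorical labels
df = pd.DataFrame([["a", "x"], ["b", "x"]])

co_occurrence = defaultdict(int)
for index, row in df.iterrows():
    for index2, row2 in df.iloc[index + 1:].iterrows():
        for row_index, feature in enumerate(row):
            for feature2 in row2[row_index + 1:]:
                co_occurrence[feature, feature2] += 1

print(dict(co_occurrence))  # → {('a', 'x'): 1}
```

Note that this counts each feature only against later columns of later rows, and the nested row loops are quadratic in the number of rows, so the sparse one-hot approach above will be much faster on large dataframes.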