1 Answer

The documentation is inline (see the comments in the code below).
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

docs = ["the house had a tiny little mouse",
        "the cat saw the mouse",
        "the mouse ran away from the house",
        "the cat finally ate the mouse",
        "the end of the mouse story"]

tfidf_vectorizer = TfidfVectorizer(use_idf=True)
tfidf_vectorizer_vectors = tfidf_vectorizer.fit_transform(docs)
tfidf = tfidf_vectorizer_vectors.todense()
# TF-IDF of words not in a doc will be 0, so replace those entries with nan
tfidf[tfidf == 0] = np.nan
# Use numpy's nanmean, which ignores nan while computing the mean
means = np.nanmean(tfidf, axis=0)
# Convert it into a dictionary for later lookup
means = dict(zip(tfidf_vectorizer.get_feature_names(), means.tolist()[0]))

tfidf = tfidf_vectorizer_vectors.todense()
# Argsort the full dense TF-IDF matrix (descending, hence the *-1)
ordered = np.argsort(tfidf * -1)
words = tfidf_vectorizer.get_feature_names()

top_k = 5
for i, doc in enumerate(docs):
    result = {}
    # Pick the top_k entries from the argsorted row of each doc
    for t in range(top_k):
        # Look up the t-th best word and its precomputed average TF-IDF
        # (from the nanmean dictionary) and save it for later use
        result[words[ordered[i, t]]] = means[words[ordered[i, t]]]
    print(result)
Output
{'had': 0.4935620852501244, 'little': 0.4935620852501244, 'tiny': 0.4935620852501244, 'house': 0.38349121689490395, 'mouse': 0.24353457958557367}
{'saw': 0.5990921556092994, 'the': 0.4400321635416817, 'cat': 0.44898681252620987, 'mouse': 0.24353457958557367, 'ate': 0.5139230069660121}
{'away': 0.4570928721125019, 'from': 0.4570928721125019, 'ran': 0.4570928721125019, 'the': 0.4400321635416817, 'house': 0.38349121689490395}
{'ate': 0.5139230069660121, 'finally': 0.5139230069660121, 'the': 0.4400321635416817, 'cat': 0.44898681252620987, 'mouse': 0.24353457958557367}
{'end': 0.4917531872315962, 'of': 0.4917531872315962, 'story': 0.4917531872315962, 'the': 0.4400321635416817, 'mouse': 0.24353457958557367}
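As a quick sanity check of the averaging step (a minimal sketch, assuming the code above has just been run in the same session), you can recompute the mean TF-IDF of "house" over only the documents that contain it and compare it with means["house"]:

col = words.index("house")                  # column index of "house" in the vocabulary
dense = tfidf_vectorizer_vectors.todense()
scores = np.asarray(dense[:, col]).ravel()  # per-document TF-IDF of "house"
print(scores[scores > 0].mean())            # should equal means["house"] (about 0.3835)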
Let's decipher result[words[ordered[i,t]]] = means[words[ordered[i,t]]]:
If the vocabulary size is v and the number of documents is n:

- ordered is an n x v matrix. Its values are indices into the vocabulary, and each row is sorted by that document's TF-IDF scores (highest first).
- words is a list of size v containing the vocabulary; think of it as mapping an index (id) to a word.
- means is a dictionary of size v; each value is a word's average TF-IDF over the documents in which it appears.
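To make those shapes concrete, here is a minimal sketch (again assuming the variables from the code above are still in scope; n and v are only local names used for illustration):

n, v = tfidf.shape                 # n = number of documents, v = vocabulary size
print(ordered.shape)               # (n, v): each row holds vocabulary indices sorted by that doc's TF-IDF
print(len(words), len(means))      # both v: index -> word list, word -> average TF-IDF dict
top_idx = ordered[0, 0]            # index of the highest-scoring term in document 0
print(words[top_idx], means[words[top_idx]])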