1 Answer

The documentation is inline (see the comments in the code below).
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

docs = ["the house had a tiny little mouse",
        "the cat saw the mouse",
        "the mouse ran away from the house",
        "the cat finally ate the mouse",
        "the end of the mouse story"]

tfidf_vectorizer = TfidfVectorizer(use_idf=True)
tfidf_vectorizer_vectors = tfidf_vectorizer.fit_transform(docs)
tfidf = tfidf_vectorizer_vectors.todense()
# TF-IDF of words not in a doc will be 0, so replace those entries with nan
tfidf[tfidf == 0] = np.nan
# Use numpy's nanmean, which ignores nan while computing the mean
means = np.nanmean(tfidf, axis=0)
# Convert it into a dictionary for later lookup
means = dict(zip(tfidf_vectorizer.get_feature_names(), means.tolist()[0]))

tfidf = tfidf_vectorizer_vectors.todense()
# Argsort the full dense TF-IDF matrix (descending, hence the *-1)
ordered = np.argsort(tfidf * -1)
words = tfidf_vectorizer.get_feature_names()

top_k = 5
for i, doc in enumerate(docs):
    result = {}
    # Pick the top_k entries from the argsorted row of each doc
    for t in range(top_k):
        # Look up the t-th best word and its precomputed average TF-IDF
        # (from the nanmean dictionary) and save it for later use
        result[words[ordered[i, t]]] = means[words[ordered[i, t]]]
    print(result)
Output
{'had': 0.4935620852501244, 'little': 0.4935620852501244, 'tiny': 0.4935620852501244, 'house': 0.38349121689490395, 'mouse': 0.24353457958557367}
{'saw': 0.5990921556092994, 'the': 0.4400321635416817, 'cat': 0.44898681252620987, 'mouse': 0.24353457958557367, 'ate': 0.5139230069660121}
{'away': 0.4570928721125019, 'from': 0.4570928721125019, 'ran': 0.4570928721125019, 'the': 0.4400321635416817, 'house': 0.38349121689490395}
{'ate': 0.5139230069660121, 'finally': 0.5139230069660121, 'the': 0.4400321635416817, 'cat': 0.44898681252620987, 'mouse': 0.24353457958557367}
{'end': 0.4917531872315962, 'of': 0.4917531872315962, 'story': 0.4917531872315962, 'the': 0.4400321635416817, 'mouse': 0.24353457958557367}
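As a quick sanity check of the averaging step (a minimal sketch, assuming the code above has just been run in the same session), you can recompute the mean TF-IDF of "house" over only the documents that contain it and compare it with means["house"]:

col = words.index("house")                  # column index of "house" in the vocabulary
dense = tfidf_vectorizer_vectors.todense()
scores = np.asarray(dense[:, col]).ravel()  # per-document TF-IDF of "house"
print(scores[scores > 0].mean())            # should equal means["house"] (about 0.3835)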
Let's decipher result[words[ordered[i,t]]] = means[words[ordered[i,t]]]:
If the vocabulary size is v and the number of documents is n:

- ordered is an n x v matrix. Its values are indices into the vocabulary, and each row is sorted by that document's TF-IDF scores (highest first).
- words is a list of size v containing the vocabulary; think of it as mapping an index (id) to a word.
- means is a dictionary of size v; each value is a word's average TF-IDF over the documents in which it appears.
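To make those shapes concrete, here is a minimal sketch (again assuming the variables from the code above are still in scope; n and v are only local names used for illustration):

n, v = tfidf.shape                 # n = number of documents, v = vocabulary size
print(ordered.shape)               # (n, v): each row holds vocabulary indices sorted by that doc's TF-IDF
print(len(words), len(means))      # both v: index -> word list, word -> average TF-IDF dict
top_idx = ordered[0, 0]            # index of the highest-scoring term in document 0
print(words[top_idx], means[words[top_idx]])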