How do I get the average TF-IDF value of a word across a corpus?

qq_笑_17 2022-05-19 14:01:52
I am trying to get the average TF-IDF value of a word over the entire corpus. Suppose the word “stack” appears 4 times in our corpus (a few hundred documents). In the 4 documents where it is found it has the values 0.34, 0.45, 0.68, 0.78, so its average TF-IDF value over the whole corpus is 0.5625. How can I find this for all the words in the documents? I am using the scikit-learn implementation of TF-IDF. This is the code I use to get the TF-IDF values for each document:

for i in docs_test:
    feature_names=cv.get_feature_names()
    doc=docs_test[itr]
    itr += 1
    tf_idf_vector=tfidf_transformer.transform(cv.transform([doc]))
    sorted_items=sort_coo(tf_idf_vector.tocoo())
    #Extracting the top 81 keywords along with their TF-IDF scores
    keywords=extract_topn_from_vector(feature_names,sorted_items,81)

On each iteration this outputs a dictionary with 81 words and their TF-IDF scores for that document:

{'kerry': 0.396, 'paris': 0.278, 'france': 0.252 ......}

Since I only output the top 81 words, I know that not every word in a document will be covered. So what I want is the average TF-IDF value of the top 81 words of each document (words will be repeated across documents).
1 Answer

慕雪6442864

The explanation is inline, in the code comments.


from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

docs = ["the house had a tiny little mouse",
        "the cat saw the mouse",
        "the mouse ran away from the house",
        "the cat finally ate the mouse",
        "the end of the mouse story"]

tfidf_vectorizer = TfidfVectorizer(use_idf=True)
tfidf_vectorizer_vectors = tfidf_vectorizer.fit_transform(docs)

# Dense (n_docs x n_words) array of TF-IDF scores
tfidf = tfidf_vectorizer_vectors.toarray()
# TF-IDF of words not in a doc is 0, so replace those entries with nan
tfidf_nan = np.where(tfidf == 0, np.nan, tfidf)
# nanmean ignores nan, so each word is averaged only over the docs that contain it
means = np.nanmean(tfidf_nan, axis=0)
# Convert the means into a dictionary for later lookup.
# get_feature_names_out() replaces get_feature_names() in scikit-learn >= 1.0
words = tfidf_vectorizer.get_feature_names_out()
means = dict(zip(words, means.tolist()))

# Argsort each row of the zero-filled matrix, highest score first
ordered = np.argsort(-tfidf, axis=1)

top_k = 5
for i, doc in enumerate(docs):
    result = {}
    # Pick the top_k vocabulary indices for this doc
    for t in range(top_k):
        # Look up the word's corpus-wide average TF-IDF in the
        # precomputed dictionary and save it for later use
        result[words[ordered[i, t]]] = means[words[ordered[i, t]]]
    print(result)

Output

{'had': 0.4935620852501244, 'little': 0.4935620852501244, 'tiny': 0.4935620852501244, 'house': 0.38349121689490395, 'mouse': 0.24353457958557367}

{'saw': 0.5990921556092994, 'the': 0.4400321635416817, 'cat': 0.44898681252620987, 'mouse': 0.24353457958557367, 'ate': 0.5139230069660121}

{'away': 0.4570928721125019, 'from': 0.4570928721125019, 'ran': 0.4570928721125019, 'the': 0.4400321635416817, 'house': 0.38349121689490395}

{'ate': 0.5139230069660121, 'finally': 0.5139230069660121, 'the': 0.4400321635416817, 'cat': 0.44898681252620987, 'mouse': 0.24353457958557367}

{'end': 0.4917531872315962, 'of': 0.4917531872315962, 'story': 0.4917531872315962, 'the': 0.4400321635416817, 'mouse': 0.24353457958557367}
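
Note that the second dictionary above contains 'ate', although that word does not occur in "the cat saw the mouse": the document has only four distinct words, so the fifth slot is filled by a word whose score is 0. Below is a minimal sketch of one way to skip such entries. It assumes a small hand-made score matrix in place of the real TF-IDF matrix, and the names `scores`, `vocab`, and `top_k_present` are hypothetical, chosen only for this illustration.

```python
import numpy as np

# Hypothetical scores: 2 documents x 4 vocabulary words (0 means the word is absent)
vocab = ["cat", "house", "mouse", "the"]
scores = np.array([
    [0.0, 0.6, 0.3, 0.5],
    [0.7, 0.0, 0.0, 0.4],
])

def top_k_present(row, k):
    """Return indices of up to k largest non-zero entries, highest first."""
    order = np.argsort(-row, kind="stable")
    # Drop any index whose score is 0, i.e. a word not in the document
    return [int(j) for j in order[:k] if row[j] > 0]

top = [[vocab[j] for j in top_k_present(row, 3)] for row in scores]
print(top)
```

With this filter a document with fewer than k distinct words simply yields a shorter list instead of borrowing unrelated vocabulary entries.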

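Let's decipher result[words[ordered[i,t]]] = means[words[ordered[i,t]]]

If the vocabulary size is v and the number of documents is n:

  • ordered is a matrix of size n x v. Its values are indices into the vocabulary, and each row is sorted by that document's TF-IDF scores, highest first.

  • words is a list of size v holding the words of the vocabulary. Think of it as the id-to-word mapper.

  • means is a dictionary of size v in which each value is the average TF-IDF of the corresponding word.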
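
The lookup described above can be traced on a toy example. This is only a sketch under stated assumptions: the 3 x 3 score matrix is made up (it is not produced by scikit-learn), and the per-word mean is computed with a sum and a non-zero count rather than with nanmean, which should give the same average as long as every word occurs in at least one document.

```python
import numpy as np

# Hypothetical TF-IDF scores: n = 3 documents, v = 3 vocabulary words
words = ["cat", "house", "mouse"]
tfidf = np.array([
    [0.0, 0.8, 0.6],
    [0.9, 0.0, 0.4],
    [0.5, 0.4, 0.2],
])

# ordered[i] lists the vocabulary indices of document i, highest score first
ordered = np.argsort(-tfidf, axis=1, kind="stable")

# Average each column over the documents where the word occurs (non-zero);
# this assumes every word appears in at least one document
counts = np.count_nonzero(tfidf, axis=0)
means = dict(zip(words, (tfidf.sum(axis=0) / counts).tolist()))

i, t = 1, 0
word = words[ordered[i, t]]  # id -> word for the top-ranked entry of document 1
print(word, means[word])     # the word and its corpus-wide average TF-IDF
```

Here ordered[1, 0] is the index of the highest-scoring word in document 1, words turns that index into the word itself, and means supplies the average of that word's non-zero scores across all documents.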


Answered 2022-05-19