2 Answers

Contributed 1798 experience points · received 3+ likes
You can use scikit-learn's CountVectorizer for this:
from sklearn.feature_extraction.text import CountVectorizer
from gensim import matutils
from gensim.models.ldamodel import LdaModel

text = ['computer time graph', 'survey response eps', 'human system computer',
        'machine learning is very hot topic',
        'python win the race for simplicity as compared to other programming language']
# suppose these are the words that you want in your vocabulary
vocabulary = ['machine', 'python', 'learning', 'human', 'system', 'hot', 'time']
vect = CountVectorizer(vocabulary=vocabulary)
x = vect.fit_transform(text)  # rows are documents, columns are vocabulary terms
# get_feature_names_out() needs scikit-learn >= 1.0; older versions use get_feature_names()
feature_names = vect.get_feature_names_out().tolist()
# now you can use the matutils helper from gensim;
# documents_columns=False because each row of x is one document
corpus = matutils.Sparse2Corpus(x, documents_columns=False)
model = LdaModel(corpus, num_topics=3, id2word=dict(enumerate(feature_names)))
# print the topics
print(model.show_topics())
# to see the vocabulary that is being used
print(feature_names)
# ['machine', 'python', 'learning', 'human', 'system', 'hot', 'time']  <- only the words you chose are included

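Contributed 1872 experience points · received 4+ likes
LDA's approach to topic modeling is to treat each document as a collection of topics in certain proportions, and each topic, likewise, as a collection of keywords in certain proportions.
Once you give the algorithm the number of topics, it rearranges the topic distribution within the documents and the keyword distribution within the topics until it reaches a good composition of the topic-keyword distribution.
The two main inputs to the LDA topic model are the dictionary or vocabulary (id2word) and the corpus.
You can use something like this to build them: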
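import gensim.corpora as corpora
# data_lemmatized is expected to be a list of tokenized documents (lists of strings)
# Create Dictionary/Vocabulary
id2word = corpora.Dictionary(data_lemmatized)
# Create Corpus
texts = data_lemmatized
# Term Document Frequency: each document becomes a list of (token_id, count) pairs
corpus = [id2word.doc2bow(text) for text in texts]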