The tutorial on the data structures used by NLTK's Bayes classifier is great. From a higher-level perspective, we can look at it like this:
We have input sentences with sentiment labels:
training_data = [('I love this sandwich.', 'pos'),
('This is an amazing place!', 'pos'),
('I feel very good about these beers.', 'pos'),
('This is my best work.', 'pos'),
("What an awesome view", 'pos'),
('I do not like this restaurant', 'neg'),
('I am tired of this stuff.', 'neg'),
("I can't deal with this", 'neg'),
('He is my sworn enemy!', 'neg'),
('My boss is horrible.', 'neg')]
Let's treat the features as individual words, so we extract from the training data a list of all possible words (call it the vocabulary), like this:
from nltk.tokenize import word_tokenize
from itertools import chain
vocabulary = set(chain(*[word_tokenize(i[0].lower()) for i in training_data]))
Essentially, `vocabulary` here is the same as @275365's `all_words`:
>>> all_words = set(word.lower() for passage in training_data for word in word_tokenize(passage[0]))
>>> vocabulary = set(chain(*[word_tokenize(i[0].lower()) for i in training_data]))
>>> print(vocabulary == all_words)
True
From each data point (i.e. each sentence and its pos/neg tag), we want to say whether a feature (i.e. a word from the vocabulary) is present:
>>> sentence = word_tokenize('I love this sandwich.'.lower())
>>> print({i: True for i in vocabulary if i in sentence})
{'this': True, 'i': True, 'sandwich': True, 'love': True, '.': True}
But we also want to tell the classifier which vocabulary words are absent from the sentence, so for each data point we list every possible word in the vocabulary and say whether each one is present or not:
>>> sentence = word_tokenize('I love this sandwich.'.lower())
>>> x = {i: True for i in vocabulary if i in sentence}
>>> y = {i: False for i in vocabulary if i not in sentence}
>>> x.update(y)
>>> print(x)
{'love': True, 'deal': False, 'tired': False, 'feel': False, 'is': False, 'am': False, 'an': False, 'good': False, 'best': False, '!': False, 'these': False, 'what': False, '.': True, 'amazing': False, 'horrible': False, 'sworn': False, 'ca': False, 'do': False, 'sandwich': True, 'very': False, 'boss': False, 'beers': False, 'not': False, 'with': False, 'he': False, 'enemy': False, 'about': False, 'like': False, 'restaurant': False, 'this': True, 'of': False, 'work': False, "n't": False, 'i': True, 'stuff': False, 'place': False, 'my': False, 'awesome': False, 'view': False}
但是,由于這兩次遍歷詞匯表,因此這樣做更有效:
>>> sentence = word_tokenize('I love this sandwich.'.lower())
>>> x = {i: (i in sentence) for i in vocabulary}
>>> print(x)
{'love': True, 'deal': False, 'tired': False, 'feel': False, 'is': False, 'am': False, 'an': False, 'good': False, 'best': False, '!': False, 'these': False, 'what': False, '.': True, 'amazing': False, 'horrible': False, 'sworn': False, 'ca': False, 'do': False, 'sandwich': True, 'very': False, 'boss': False, 'beers': False, 'not': False, 'with': False, 'he': False, 'enemy': False, 'about': False, 'like': False, 'restaurant': False, 'this': True, 'of': False, 'work': False, "n't": False, 'i': True, 'stuff': False, 'place': False, 'my': False, 'awesome': False, 'view': False}
因此,對(duì)于每個(gè)句子,我們想告訴每個(gè)句子的分類(lèi)器哪個(gè)詞存在,哪個(gè)詞不存在,并為其賦予pos / neg標(biāo)記。我們可以稱(chēng)其為feature_set,它是一個(gè)由x(如上所示)及其標(biāo)簽組成的元組。
>>> feature_set = [({i: (i in word_tokenize(sentence.lower())) for i in vocabulary}, tag) for sentence, tag in training_data]
>>> print(feature_set)
[({'this': True, 'love': True, 'deal': False, 'tired': False, 'feel': False, 'is': False, 'am': False, 'an': False, 'sandwich': True, 'ca': False, 'best': False, '!': False, 'what': False, '.': True, 'amazing': False, 'horrible': False, 'sworn': False, 'awesome': False, 'do': False, 'good': False, 'very': False, 'boss': False, 'beers': False, 'not': False, 'with': False, 'he': False, 'enemy': False, 'about': False, 'like': False, 'restaurant': False, 'these': False, 'of': False, 'work': False, "n't": False, 'i': False, 'stuff': False, 'place': False, 'my': False, 'view': False}, 'pos'), ...]
然后,我們將feature_set中的這些功能和標(biāo)簽提供給分類(lèi)器以對(duì)其進(jìn)行訓(xùn)練:
from nltk import NaiveBayesClassifier as nbc
classifier = nbc.train(feature_set)
Now you have a trained classifier, and when you want to tag a new sentence, you have to "featurize" the new sentence to see which of its words are in the vocabulary the classifier was trained on:
>>> test_sentence = "This is the best band I've ever heard! foobar"
>>> featurized_test_sentence = {i:(i in word_tokenize(test_sentence.lower())) for i in vocabulary}
NOTE: as you can see from the step above, the naive Bayes classifier cannot handle out-of-vocabulary words; the foobar token disappears after featurization.
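To make the out-of-vocabulary behavior concrete, here is a minimal self-contained sketch (it uses a plain whitespace tokenizer instead of `word_tokenize` so it runs without NLTK data, and the sentences are placeholders):

```python
# Simple whitespace tokenizer stands in for nltk's word_tokenize here.
def tokenize(text):
    return text.lower().split()

training_sentences = ['I love this sandwich.', 'My boss is horrible.']
vocabulary = set(tok for sent in training_sentences for tok in tokenize(sent))

test_sentence = "this boss rocks! foobar"
features = {word: (word in tokenize(test_sentence)) for word in vocabulary}

# Only vocabulary words ever become feature keys, so 'foobar' is
# silently dropped from the featurized sentence.
print('foobar' in features)  # False
print(features['boss'])      # True
```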
然后,您將特征化的測(cè)試句子輸入分類(lèi)器,并要求其進(jìn)行分類(lèi):
>>> classifier.classify(featurized_test_sentence)
'pos'
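Under the hood, a naive Bayes classifier scores each label by combining the label's prior probability with the likelihood of every observed feature value. The hand-rolled sketch below illustrates that idea on toy feature dicts with Laplace smoothing; it is only an illustration of the principle, not NLTK's actual implementation:

```python
import math

# Toy training set of (boolean feature dict, label) pairs, mimicking the
# feature_set structure built above.
training = [
    ({'love': True, 'horrible': False}, 'pos'),
    ({'love': True, 'horrible': False}, 'pos'),
    ({'love': False, 'horrible': True}, 'neg'),
]

labels = {tag for _, tag in training}

def score(features, label):
    docs = [f for f, tag in training if tag == label]
    log_prob = math.log(len(docs) / len(training))  # log prior of the label
    for name, value in features.items():
        # Laplace-smoothed likelihood of this feature value given the label.
        matches = sum(1 for f in docs if f.get(name) == value)
        log_prob += math.log((matches + 1) / (len(docs) + 2))
    return log_prob

def classify(features):
    return max(labels, key=lambda lab: score(features, lab))

print(classify({'love': True, 'horrible': False}))  # 'pos'
```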
Hopefully this gives a clearer picture of how data is fed into NLTK's naive Bayes classifier for sentiment analysis. Here is the full code without the comments and walkthrough:
from nltk import NaiveBayesClassifier as nbc
from nltk.tokenize import word_tokenize
from itertools import chain
training_data = [('I love this sandwich.', 'pos'),
('This is an amazing place!', 'pos'),
('I feel very good about these beers.', 'pos'),
('This is my best work.', 'pos'),
("What an awesome view", 'pos'),
('I do not like this restaurant', 'neg'),
('I am tired of this stuff.', 'neg'),
("I can't deal with this", 'neg'),
('He is my sworn enemy!', 'neg'),
('My boss is horrible.', 'neg')]
vocabulary = set(chain(*[word_tokenize(i[0].lower()) for i in training_data]))
feature_set = [({i:(i in word_tokenize(sentence.lower())) for i in vocabulary},tag) for sentence, tag in training_data]
classifier = nbc.train(feature_set)
test_sentence = "This is the best band I've ever heard!"
featurized_test_sentence = {i:(i in word_tokenize(test_sentence.lower())) for i in vocabulary}
print("test_sent:", test_sentence)
print("tag:", classifier.classify(featurized_test_sentence))