Quick question: as part of data preprocessing, I am using string and nltk.stopwords to strip all punctuation and stopwords out of blocks of text before feeding them into some natural language processing algorithms. I have tested each component separately on a few raw text blocks (I am still getting used to this workflow), and everything looked fine.

def text_process(text):
    """
    Takes in a string of text and does the following operations:
    1. Removes punctuation.
    2. Removes stopwords.
    3. Returns a list of cleaned, "tokenized" text.
    """
    nopunc = [char for char in text.lower() if char not in string.punctuation]
    nopunc = ''.join(nopunc)
    return [word for word in nopunc.split() if word not in stopwords.words('english')]

However, when I apply this function to the text column of my dataframe (the text comes from a bunch of Pitchfork reviews), I can see that the punctuation is not actually being removed, even though the stopwords are.

Unprocessed:

pitchfork['content'].head(5)

0    “Trip-hop” eventually became a ’90s punchline,...
1    Eight years, five albums, and two EPs in, the ...
2    Minneapolis’ Uranium Club seem to revel in bei...
3    Minneapolis’ Uranium Club seem to revel in bei...
4    Kleenex began with a crash. It transpired one ...
Name: content, dtype: object

Processed:

pitchfork['content'].head(5).apply(text_process)

0    [“triphop”, eventually, became, ’90s, punchlin...
1    [eight, years, five, albums, two, eps, new, yo...
2    [minneapolis’, uranium, club, seem, revel, agg...
3    [minneapolis’, uranium, club, seem, revel, agg...
4    [kleenex, began, crash, it, transpired, one, n...
Name: content, dtype: object

Any ideas about what is going wrong here? I have looked through the documentation, but I have not seen anyone tackle it in exactly this way, so I would love to understand how to fix it. Thanks a lot!
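For reference, string.punctuation only contains the ASCII punctuation characters, so the typographic (curly) quotes that appear in the reviews never match it. A quick check in a Python session makes this visible:

import string

# string.punctuation covers ASCII punctuation only
print(string.punctuation)              # !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~

# The curly quotes from the reviews are not in that set
print('\u2019' in string.punctuation)  # False  (’, right single quote)
print('\u201c' in string.punctuation)  # False  (“, left double quote)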
1 Answer

侃侃無極
The problem here is that UTF-8 has separate code points for the left and right (curly) single and double quotation marks, and those characters are not in string.punctuation.
I would do something like this:
# Add the Unicode curly quotes (“ ” ‘ ’) to the characters being stripped
punctuation = [c for c in string.punctuation] + [u'\u201c', u'\u201d', u'\u2018', u'\u2019']
# Decode the raw text to unicode first so the curly quotes can be matched (Python 2)
nopunc = [char for char in text.decode('utf-8').lower() if char not in punctuation]
This adds the code points of the non-ASCII quotation marks to a list called punctuation, then decodes the text to Unicode so that those characters are stripped out as well.
Note: this is Python 2. If you are using Python 3, the way the Unicode values are written may be slightly different.
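For completeness, a minimal sketch of the same idea in Python 3, where strings are already Unicode and no decode step is needed (the dataframe and column names are taken from the question):

import string
from nltk.corpus import stopwords

# Curly quotes are not part of string.punctuation, so add them explicitly
extra_quotes = {'\u201c', '\u201d', '\u2018', '\u2019'}   # “ ” ‘ ’
punctuation = set(string.punctuation) | extra_quotes
stop_words = set(stopwords.words('english'))              # build the set once, not per word

def text_process(text):
    # Drop punctuation (including the curly quotes), then remove stopwords
    nopunc = ''.join(char for char in text.lower() if char not in punctuation)
    return [word for word in nopunc.split() if word not in stop_words]

Applied to the column from the question, this would be pitchfork['content'].apply(text_process). Building the stopword set once up front also avoids calling stopwords.words('english') for every word, which gets slow on a large dataframe.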