Quick question: as part of data preprocessing, I am using string and nltk.stopwords to strip all punctuation and stopwords out of blocks of text before feeding them into some natural language processing algorithms. I have tested each component separately on a few raw text blocks (I am still getting used to this workflow), and everything looked fine.

def text_process(text):
    """
    Takes in a string of text and does the following operations:
    1. Removes punctuation.
    2. Removes stopwords.
    3. Returns a list of cleaned, "tokenized" text.
    """
    nopunc = [char for char in text.lower() if char not in string.punctuation]
    nopunc = ''.join(nopunc)
    return [word for word in nopunc.split() if word not in stopwords.words('english')]

However, when I apply this function to the text column of my dataframe (the text comes from a bunch of Pitchfork reviews), I can see that the punctuation is not actually being removed, even though the stopwords are.

Unprocessed:

pitchfork['content'].head(5)

0    “Trip-hop” eventually became a ’90s punchline,...
1    Eight years, five albums, and two EPs in, the ...
2    Minneapolis’ Uranium Club seem to revel in bei...
3    Minneapolis’ Uranium Club seem to revel in bei...
4    Kleenex began with a crash. It transpired one ...
Name: content, dtype: object

Processed:

pitchfork['content'].head(5).apply(text_process)

0    [“triphop”, eventually, became, ’90s, punchlin...
1    [eight, years, five, albums, two, eps, new, yo...
2    [minneapolis’, uranium, club, seem, revel, agg...
3    [minneapolis’, uranium, club, seem, revel, agg...
4    [kleenex, began, crash, it, transpired, one, n...
Name: content, dtype: object

Any ideas about what is going wrong here? I have looked through the documentation, but I have not seen anyone tackle it in exactly this way, so I would love to understand how to fix it. Thanks a lot!
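For reference, string.punctuation only contains the ASCII punctuation characters, so the typographic (curly) quotes that appear in the reviews never match it. A quick check in a Python session makes this visible:

import string

# string.punctuation covers ASCII punctuation only
print(string.punctuation)              # !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~

# The curly quotes from the reviews are not in that set
print('\u2019' in string.punctuation)  # False  (’, right single quote)
print('\u201c' in string.punctuation)  # False  (“, left double quote)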
1 Answer

侃侃無極
The problem here is that UTF-8 has separate code points for the left and right (curly) single and double quotation marks, and those characters are not in string.punctuation.
I would do something like this:
# Add the Unicode curly quotes (“ ” ‘ ’) to the characters being stripped
punctuation = [c for c in string.punctuation] + [u'\u201c', u'\u201d', u'\u2018', u'\u2019']
# Decode the raw text to unicode first so the curly quotes can be matched (Python 2)
nopunc = [char for char in text.decode('utf-8').lower() if char not in punctuation]
This adds the code points of the non-ASCII quotation marks to a list called punctuation, then decodes the text to Unicode so that those characters are stripped out as well.
Note: this is Python 2. If you are using Python 3, the way the Unicode values are written may be slightly different.
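For completeness, a minimal sketch of the same idea in Python 3, where strings are already Unicode and no decode step is needed (the dataframe and column names are taken from the question):

import string
from nltk.corpus import stopwords

# Curly quotes are not part of string.punctuation, so add them explicitly
extra_quotes = {'\u201c', '\u201d', '\u2018', '\u2019'}   # “ ” ‘ ’
punctuation = set(string.punctuation) | extra_quotes
stop_words = set(stopwords.words('english'))              # build the set once, not per word

def text_process(text):
    # Drop punctuation (including the curly quotes), then remove stopwords
    nopunc = ''.join(char for char in text.lower() if char not in punctuation)
    return [word for word in nopunc.split() if word not in stop_words]

Applied to the column from the question, this would be pitchfork['content'].apply(text_process). Building the stopword set once up front also avoids calling stopwords.words('english') for every word, which gets slow on a large dataframe.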