How can I optimize preprocessing of all text documents instead of using a for loop that preprocesses a single document per iteration?

Python

慕哥6287543 2021-10-19 14:40:45

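I want to optimize the code below so that it can efficiently process 3,000 text documents, which will then be fed to a TF-IDF vectorizer and linkage() for clustering. So far I have read the Excel file with pandas and saved the data frame column into a list. I then iterate over each text element in the list, tokenize it, and filter the stop words out of the tokens. The filtered tokens are stored in another variable, which is in turn appended to a list. In the end I have a list of processed text elements. I think the optimization could happen where the list is created and the stop words are filtered out, and where the data is saved into two different variables: documents_no_stopwords and processed_words. If anyone could help me or suggest a direction to follow, that would be great.

    temp = 0
    df = pandas.read_excel('File.xlsx')

    for text in df['text'].tolist():
        temp = temp + 1
        preprocessing(text)
        print(temp)

    def preprocessing(word):
        tokens = tokenizer.tokenize(word)
        processed_words = []
        for w in tokens:
            if w in stop_words:
                continue
            else:
                ## a new list is created with only the nouns in them for each text document
                processed_words.append(w)
        ## This step creates a list of text documents with only the nouns in them
        documents_no_stopwords.append(' '.join(processed_words))
        processed_words = []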

1 Answer

冉冉說

First make a set of the stop words, then use a list comprehension to filter the tokens.

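    # Requires the NLTK stopwords corpus and tokenizer data to be downloaded.
    from nltk.corpus import stopwords
    from nltk.tokenize import word_tokenize

    # Build the set once, outside the function, so it is not rebuilt for every document.
    stop_words = set(stopwords.words("english"))

    def preprocessing(txt):
        tokens = word_tokenize(txt)
        # Keep only the tokens that are not stop words.
        tokens = [i for i in tokens if i not in stop_words]
        return " ".join(tokens)

    string = "Hey this is Sam. How are you?"
    print(preprocessing(string))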

Output:

    Hey Sam . How ?
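In the output above, `How` survives because the stop word list is lowercase and the membership test is case-sensitive. If that is not what you want, one option is to lowercase each token only for the comparison. The following is a minimal sketch of that idea, not part of the original answer: `STOP_WORDS` is a tiny hand-written stand-in for the NLTK list and `TOKEN_RE` is a simple regex stand-in for `word_tokenize`, both chosen so the snippet runs without any downloads.

```python
import re

# Hypothetical stand-in for stopwords.words("english"); the real NLTK list is longer.
STOP_WORDS = frozenset({"this", "is", "are", "you", "how", "the", "a"})

# Simple stand-in tokenizer: runs of word characters, or single punctuation marks.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")

def preprocess(text, stop_words=STOP_WORDS):
    """Drop stop words case-insensitively while keeping the original casing of the rest."""
    tokens = TOKEN_RE.findall(text)
    kept = [tok for tok in tokens if tok.lower() not in stop_words]
    return " ".join(kept)

result = preprocess("Hey this is Sam. How are you?")
print(result)
```

Whether to drop punctuation tokens as well depends on what the downstream vectorizer expects; this sketch leaves them in, as the original function does.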

Instead of a for loop, use df.apply as follows:

df['text'] = df['text'].apply(preprocessing)
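To make the `apply` step concrete, here is a small self-contained sketch on an in-memory DataFrame instead of `File.xlsx`. The stop word set, the whitespace tokenizer, and the `clean` column name are assumptions made for illustration; in the real pipeline the function would be the NLTK-based `preprocessing` above. Note that `apply` still calls the function once per row, so the benefit is that the function returns a value rather than appending to a global list, which removes the bookkeeping from the question.

```python
import pandas as pd

# Hypothetical stop word set; stands in for set(stopwords.words("english")).
STOP_WORDS = frozenset({"this", "is", "a", "the", "are", "of"})

def preprocess(text):
    """Return the text with stop words removed (whitespace tokenization for brevity)."""
    return " ".join(w for w in text.split() if w.lower() not in STOP_WORDS)

df = pd.DataFrame({"text": ["this is a test", "the cats are asleep"]})

# Write to a new column so the raw text stays available for inspection.
df["clean"] = df["text"].apply(preprocess)

# A plain list of cleaned documents, ready to hand to a vectorizer.
documents = df["clean"].tolist()
print(documents)
```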

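Why a set is preferable to a list

stopwords.words() contains duplicate entries: compare len(stopwords.words()) with len(set(stopwords.words())) and the set is a few hundred entries shorter. That is one reason a set is preferred here. The other is lookup cost: testing membership in a set is a hash lookup, while testing membership in a list scans the elements one by one.

Here is the performance with a list versus a set: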

    x = stopwords.words('english')       # list
    y = set(stopwords.words('english'))  # set

    %timeit with_list = [i for i in tokens if i not in x]
    # 10000 loops, best of 3: 120 μs per loop

    %timeit with_set = [j for j in tokens if j not in y]
    # 1000000 loops, best of 3: 1.16 μs per loop
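The `%timeit` lines only work inside IPython or Jupyter. For a plain script, the standard-library `timeit` module can run a comparable measurement. The sketch below uses synthetic data (the `stop_list` words and `tokens` are made up, not the NLTK list), so the absolute numbers will differ from those quoted above; only the relative gap is of interest.

```python
import timeit

# Synthetic stand-in for a stop word list (hypothetical words, not the NLTK list).
stop_list = [f"stop{i}" for i in range(500)]
stop_set = set(stop_list)

# Mostly non-stop-word tokens, so the list version scans to the end for each one.
tokens = [f"word{i}" for i in range(1000)] + stop_list[:50]

def filter_with_list():
    return [t for t in tokens if t not in stop_list]

def filter_with_set():
    return [t for t in tokens if t not in stop_set]

# Time 20 runs of each filter; timeit returns the total elapsed seconds.
list_time = timeit.timeit(filter_with_list, number=20)
set_time = timeit.timeit(filter_with_set, number=20)
print(f"list: {list_time:.4f}s  set: {set_time:.4f}s")
```

Both functions return the same tokens; only the cost of each membership test changes.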

A list comprehension is also usually faster than an equivalent for loop that calls append.
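For reference, the loop from the question and the comprehension are equivalent in what they produce. The comprehension is usually faster largely because it avoids looking up and calling `append` on every iteration. A short sketch with a hypothetical stop word set and a pre-tokenized sentence:

```python
stop_words = {"this", "is", "are", "you"}  # hypothetical, for illustration only
tokens = ["Hey", "this", "is", "Sam", ".", "How", "are", "you", "?"]

# Loop form, following the structure used in the question.
loop_result = []
for w in tokens:
    if w in stop_words:
        continue
    loop_result.append(w)

# Comprehension form, as in the answer.
comp_result = [w for w in tokens if w not in stop_words]
print(comp_result)
```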

2021-10-19
