Comparing n-grams to group duplicates

Python

偶然的你 2023-08-08 16:00:15

I am writing a script that treats two lines as duplicates if three consecutive words match between them. Suppose my current dataset is:

1. A Course of Pure Mathematics by G. H. Hardy
2. Agile Software Development, Principles, Patterns, and Practices by Robert C. Martin
3. Advanced Programming in the UNIX Environment, 3rd Edition
4. Advanced Selling Strategies: Brian Tracy
5. Advanced Programming in the UNIX(R) Environment
6. Alex's Adventures in Numberland: Dispatches from the Wonderful World of Mathematics by Alex Bellos, Andy Riley
7. Advertising Secrets of the Written Word: The Ultimate Resource on How to Write Powerful Advertising
8. Agile Software Development, Principles, Patterns, and Practices
9. A Course of Pure Mathematics (Cambridge Mathematical Library) 10th Edition by G. H. Hardy
10. Alex’s Adventures in Numberland
11. Advertising Secrets of the Written Word
12. Alex's Adventures in Numberland Paperback by Alex Bellos

Here, 1 and 9 are duplicates because "course pure mathematics" matches. 2 and 8 are duplicates because "agile software development" matches. 3 and 5 are duplicates because "advanced programming unix" matches. And so on.
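To make the matching rule concrete, here is a minimal sketch that extracts the 3-word windows from titles 1 and 9 and intersects them. The `trigrams` helper and the shortened `STOP_WORDS` set are assumptions made for illustration; the rule itself (lowercase, drop punctuation and stop words, compare three consecutive words) is inferred from the examples in the question.

```python
import re

# Assumed, shortened stop-word list for illustration only.
STOP_WORDS = {"a", "of", "by", "in", "the", "and"}


def trigrams(title):
    """Return the set of 3-word windows in a title, ignoring punctuation, case, and stop words."""
    words = re.sub(r"[^\w\s]", " ", title).lower().split()
    words = [w for w in words if w not in STOP_WORDS]
    # zip over three staggered views of the list yields every run of three consecutive words.
    return set(zip(words, words[1:], words[2:]))


title_1 = "A Course of Pure Mathematics by G. H. Hardy"
title_9 = ("A Course of Pure Mathematics (Cambridge Mathematical Library) "
           "10th Edition by G. H. Hardy")

shared = trigrams(title_1) & trigrams(title_9)
print(sorted(shared))
# [('course', 'pure', 'mathematics'), ('g', 'h', 'hardy')]
```

A non-empty intersection is what marks the two titles as duplicates under this rule.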


1 Answer

寶慕林4294392


OP here. The solution appears to be:

    import re

    from nltk.util import ngrams

    # Stop words are matched after lowercasing, so only lowercase forms are needed.
    STOP_WORDS = {'i', 'a', 'about', 'an', 'and', 'are', 'as', 'at', 'be', 'by',
                  'com', 'for', 'from', 'how', 'in', 'is', 'it', 'of', 'on', 'or',
                  'that', 'the', 'this', 'to', 'was'}

    # Each entry pairs the original line with its set of 3-grams, so the two
    # can never drift out of step when a line is skipped.
    books = []

    with open('UnifiedBookList.txt') as fin:
        for line in fin:
            original = line.strip()
            # Replace punctuation with a space, then convert to lower case.
            cleaned = re.sub(r'[^\w\s]', ' ', original).lower()
            # split() also collapses runs of whitespace; drop stop words.
            tokens = [word for word in cleaned.split() if word not in STOP_WORDS]
            if len(tokens) > 2:  # at least three words are needed for a 3-gram
                books.append((original, set(ngrams(tokens, 3))))

    while books:
        first_title, first_grams = books.pop(0)
        header_printed = False
        remaining = []
        for title, grams in books:
            if first_grams & grams:  # at least one shared 3-gram
                if not header_printed:
                    print(first_title)
                    header_printed = True
                print(title)
            else:
                remaining.append((title, grams))
        # Rebuild the list instead of removing items while iterating over it,
        # which would skip the element after each removal.
        books = remaining
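To try the grouping step without a file or NLTK, here is a minimal in-memory sketch run on a few titles from the question. The names `trigrams` and `group_duplicates` are hypothetical helpers, not part of any library, and the sliding window is built with `zip` as a stand-in for `nltk.util.ngrams`.

```python
import re

# Lowercase form of the stop-word list used in the answer.
STOP_WORDS = {"i", "a", "about", "an", "and", "are", "as", "at", "be", "by",
              "com", "for", "from", "how", "in", "is", "it", "of", "on", "or",
              "that", "the", "this", "to", "was"}


def trigrams(title):
    """Set of 3-word windows after removing punctuation, case, and stop words."""
    words = re.sub(r"[^\w\s]", " ", title).lower().split()
    words = [w for w in words if w not in STOP_WORDS]
    return set(zip(words, words[1:], words[2:]))


def group_duplicates(titles):
    """Group titles that share a 3-gram with the first title of the group."""
    remaining = [(t, trigrams(t)) for t in titles]
    groups = []
    while remaining:
        first_title, first_grams = remaining.pop(0)
        group, rest = [first_title], []
        for title, grams in remaining:
            if first_grams & grams:
                group.append(title)
            else:
                rest.append((title, grams))
        remaining = rest  # unmatched titles go on to the next pass
        if len(group) > 1:  # report only titles that have a duplicate
            groups.append(group)
    return groups


titles = [
    "A Course of Pure Mathematics by G. H. Hardy",
    "Agile Software Development, Principles, Patterns, and Practices by Robert C. Martin",
    "Advanced Programming in the UNIX Environment, 3rd Edition",
    "Advanced Selling Strategies: Brian Tracy",
    "Advanced Programming in the UNIX(R) Environment",
    "Agile Software Development, Principles, Patterns, and Practices",
    "A Course of Pure Mathematics (Cambridge Mathematical Library) 10th Edition by G. H. Hardy",
]

for group in group_duplicates(titles):
    print(group)
```

Each title is compared only against the first title of a group, so matches are not chained: if B matches A and C matches B but not A, C ends up in a separate pass. Whether that is acceptable depends on the dataset.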

Answered 2023-08-08
