

Comparing n-grams to group duplicates


偶然的你 2023-08-08 16:00:15
I am writing a script that treats two lines as duplicates if three consecutive words match between them. Suppose my current dataset is:

1 A Course of Pure Mathematics by G. H. Hardy
2 Agile Software Development, Principles, Patterns, and Practices by Robert C. Martin
3 Advanced Programming in the UNIX Environment, 3rd Edition
4 Advanced Selling Strategies: Brian Tracy
5 Advanced Programming in the UNIX(R) Environment
6 Alex's Adventures in Numberland: Dispatches from the Wonderful World of Mathematics by Alex Bellos, Andy Riley
7 Advertising Secrets of the Written Word: The Ultimate Resource on How to Write Powerful Advertising
8 Agile Software Development, Principles, Patterns, and Practices
9 A Course of Pure Mathematics (Cambridge Mathematical Library) 10th Edition by G. H. Hardy
10 Alex’s Adventures in Numberland
11 Advertising Secrets of the Written Word
12 Alex's Adventures in Numberland Paperback by Alex Bellos

Here, 1 and 9 are duplicates because "course pure mathematics" matches. 2 and 8 are duplicates because "agile software development" matches. 3 and 5 are duplicates because "advanced programming unix" matches. And so on.
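To make the matching rule concrete, here is a minimal sketch of the pairwise check, using only the standard library. The stop-word list and the helper names `trigrams` and `is_duplicate` are assumptions made for illustration. The question only implies that words such as "of" and "the" are ignored.

```python
import re

# Hypothetical stop-word list for illustration; the question does not specify one.
STOP_WORDS = {"a", "an", "and", "by", "from", "in", "of", "on", "the", "to"}


def trigrams(title):
    """Return the set of 3-word windows in a title after normalisation."""
    cleaned = re.sub(r"[^\w\s]", " ", title).lower()  # punctuation -> space
    words = [w for w in cleaned.split() if w not in STOP_WORDS]
    # Titles with fewer than 3 remaining words yield an empty set.
    return {tuple(words[i:i + 3]) for i in range(len(words) - 2)}


def is_duplicate(a, b):
    """Two titles count as duplicates if they share at least one trigram."""
    return bool(trigrams(a) & trigrams(b))


print(is_duplicate(
    "A Course of Pure Mathematics by G. H. Hardy",
    "A Course of Pure Mathematics (Cambridge Mathematical Library) 10th Edition by G. H. Hardy",
))  # True
print(is_duplicate(
    "Advanced Programming in the UNIX Environment, 3rd Edition",
    "Advanced Selling Strategies: Brian Tracy",
))  # False
```

Storing each trigram as a tuple in a set makes the overlap test a single set intersection.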

1 Answer

寶慕林4294392


OP here. The solution appears to be:


import re

from nltk.util import ngrams

# Lowercase only: every line is lowercased before stop words are filtered.
STOP_WORDS = {'i', 'a', 'about', 'an', 'and', 'are', 'as', 'at', 'be', 'by', 'com',
              'for', 'from', 'how', 'in', 'is', 'it', 'of', 'on', 'or', 'that',
              'the', 'this', 'to', 'was'}

books = []  # (original line, set of 3-grams) pairs

with open('UnifiedBookList.txt') as fin:
    for line in fin:
        original = line.rstrip('\n')
        cleaned = re.sub(r'[^\w\s]', ' ', original).lower()  # punctuation -> space, lower case
        words = [w for w in cleaned.split() if w not in STOP_WORDS]  # remove stop words
        if len(words) > 2:  # at least 3 words are needed to form a 3-gram
            # Keep the original line next to its n-grams so the two cannot get out of step.
            books.append((original, set(ngrams(words, 3))))

while books:
    first_title, first_grams = books.pop(0)
    # Collect the matches first instead of removing items from the list while iterating over it.
    matches = [b for b in books if first_grams & b[1]]
    if matches:
        print(first_title)
        for title, _ in matches:
            print(title)
        books = [b for b in books if not (first_grams & b[1])]
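The final loop in the code above compares each line with every remaining line, and a group only contains lines that overlap with the first line taken from the list. For longer lists, one possible alternative is to index each n-gram by the first line that contained it and merge lines with a union-find structure. This is a sketch and not the approach used in the answer. It assumes that chained matches (A overlaps B, B overlaps C) should end up in one group, which the question does not state. `group_by_shared_ngrams` is a hypothetical helper name.

```python
from collections import defaultdict


def group_by_shared_ngrams(ngram_sets):
    """Group item indices whose n-gram sets overlap, directly or through a chain.

    ngram_sets: list of sets of hashable n-grams, one set per line.
    Returns a sorted list of index lists, one per group with 2+ members.
    """
    parent = list(range(len(ngram_sets)))

    def find(i):
        # Follow parent links to the group representative, halving the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    first_seen = {}  # n-gram -> index of the first line that contained it
    for idx, grams in enumerate(ngram_sets):
        for gram in grams:
            if gram in first_seen:
                parent[find(idx)] = find(first_seen[gram])  # merge the two groups
            else:
                first_seen[gram] = idx

    groups = defaultdict(list)
    for idx in range(len(ngram_sets)):
        groups[find(idx)].append(idx)
    return sorted(g for g in groups.values() if len(g) > 1)


# Toy n-gram sets (hypothetical data): 0 and 2 share "a b c", 2 and 3 share "x y z".
demo = [{"a b c"}, {"d e f"}, {"a b c", "x y z"}, {"x y z"}]
print(group_by_shared_ngrams(demo))  # [[0, 2, 3]]
```

Each n-gram is looked up once per occurrence, so the work grows with the total number of n-grams and not with the number of line pairs. The returned indices can be mapped back to the original titles for printing.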


Answered 2023-08-08