3 Answers

Answer 1
The solution provided below works in approximately O(n) runtime complexity, where n is the number of tokens in each sentence.
For 5 million sentences and your concepts.txt, it performs the required operations in roughly 30 seconds; see the basic test in the third section.
When it comes to space complexity, you will have to keep a nested dictionary structure (let's simplify it like that for now); say it's O(c*u), where u are unique tokens for concepts of a certain (token-wise) length, while c is the length of a concept.
It's hard to pinpoint the exact complexity, but it should be quite similar to this (for your example data and the concepts.txt you provided it's pretty accurate, but we will get to the gory details as we go through the implementation).
I assume you can split your concepts and sentences on whitespace; if that's not the case, I would advise you to take a look at spaCy, which provides a smarter way to tokenize your data.
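For instance, a minimal tokenization sketch, assuming spaCy is installed (a blank pipeline is enough here, so no trained model has to be downloaded):

import spacy

# A blank English pipeline provides just the rule-based tokenizer.
nlp = spacy.blank("en")

tokens = [token.text for token in nlp("Here is data something, and another data!")]
print(tokens)  # ['Here', 'is', 'data', 'something', ',', 'and', 'another', 'data', '!']

Unlike a plain str.split, punctuation comes out as separate tokens.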
1. Introduction
Let's take your example:
concepts = [
    ["natural language processing", "text mining", "texts", "nlp"],
    ["advanced data mining", "data mining", "data"],
    ["discourse analysis", "learning analytics", "mooc"],
]
As you said, each element in concepts has to be mapped to the first one, so, in Pythonish, it would go roughly along these lines:
for concept in concepts:
    concept[1:] = concept[0]
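In runnable form, that naive mapping is just a flat dictionary from every alternative form to the head of its concept (a sketch of the idea only, not the final solution):

# Map every alternative form to the first element of its concept list.
mapping = {alternative: concept[0] for concept in concepts for alternative in concept[1:]}
print(mapping["data mining"])  # advanced data mining
print(mapping["mooc"])         # discourse analysis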
The task would be easy if all the concepts had a token-wise length equal to one (which is not the case here) and were unique. Let's focus on the second case and one particular (slightly modified) example concept to see my point:
["advanced data mining", "data something", "data"]
Here data would be mapped to advanced data mining, BUT data something, which contains data, should be mapped before it. If I understand you correctly, you would want this sentence:
"Here is data something and another data"
to be mapped onto:
"Here is advanced data mapping and another advanced data mining"
instead of the naive approach:
"Here is advanced data mapping something and another advanced data mining"
Notice that in the second (naive) example only data was mapped, not data something.
To prioritize data something (and others fitting this pattern), I used an array structure filled with dictionaries, where concepts that come earlier in the array are those that are token-wise longer.
Continuing our example, such an array would look like this:
structure = [
    {"data": {"something": "advanced data mining"}},
    {"data": "advanced data mining"},
]
Notice that if we go through tokens in this order (e.g. first going through the first dictionary with consecutive tokens and, if no match was found, going to the second dictionary, and so on), we will get the longest concepts first.
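A quick illustration of that lookup order on the structure above:

structure = [
    {"data": {"something": "advanced data mining"}},
    {"data": "advanced data mining"},
]

# "data something" is resolved by the first (token-wise longer) dictionary...
print(structure[0]["data"]["something"])  # advanced data mining
# ...while a lone "data" falls through to the second one.
print(structure[1]["data"])               # advanced data mining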
2. Code
Okay, I hope you get the basic idea by now (if not, post a comment below and I will try to explain the unclear parts in more detail).
Disclaimer: I'm not particularly proud of this code-wise, but it gets the job done and could have been worse, I suppose.
2.1 Hierarchical dictionary
First, let's get the longest concept token-wise (excluding the first element, as it's our target and we never have to change it):
def get_longest(concepts: List[List[str]]):
    return max(len(text.split()) for concept in concepts for text in concept[1:])
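For the example concepts, the longest alternatives ("text mining", "data mining", "learning analytics") are two tokens long:

print(get_longest(concepts))  # 2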
Using this information, we can initialize our structure by creating as many dictionaries as there are concepts of different lengths (in the example above it would be 2, so it would work for all your data; concepts of any length would do, though):
def init_hierarchical_dictionaries(longest: int):
    return [(length, {}) for length in reversed(range(longest))]
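For longest equal to 2, this yields one (length, dictionary) pair per concept length, longest first. Note that the stored length counts the nesting levels before the final token, so two-token concepts live in the (1, {}) entry:

print(init_hierarchical_dictionaries(2))  # [(1, {}), (0, {})]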
Notice I'm adding the length of each concept to the array; IMO it's easier that way when it comes to traversing, though you could go without it after some changes to the implementation.
Now, with those helper functions in place, we can create the structure from the list of concepts:
def create_hierarchical_dictionaries(concepts: List[List[str]]):
    # Initialization
    longest = get_longest(concepts)
    hierarchical_dictionaries = init_hierarchical_dictionaries(longest)

    for concept in concepts:
        for text in concept[1:]:
            tokens = text.split()
            # Initialize dictionary; get the one with corresponding length.
            # The longer, the earlier it is in the hierarchy
            current_dictionary = hierarchical_dictionaries[longest - len(tokens)][1]
            # All of the tokens except the last one are another dictionary mapping to
            # the next token in concept. Use setdefault so concepts sharing a
            # prefix do not wipe out each other's nested dictionaries.
            for token in tokens[:-1]:
                current_dictionary = current_dictionary.setdefault(token, {})
            # Last token is mapped to the first concept
            current_dictionary[tokens[-1]] = concept[0].split()

    return hierarchical_dictionaries
This function will create our hierarchical dictionary; see the comments in the source code for some explanation. You may want to create a custom class keeping this thing; it should be easier to use that way.
This is exactly the same object as described in 1. Introduction.
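For the example concepts from the introduction, the function builds the following (result shown as comments, pretty-printed by hand):

hierarchical_dictionaries = create_hierarchical_dictionaries(concepts)
# [
#     (1, {"text": {"mining": ["natural", "language", "processing"]},
#          "data": {"mining": ["advanced", "data", "mining"]},
#          "learning": {"analytics": ["discourse", "analysis"]}}),
#     (0, {"texts": ["natural", "language", "processing"],
#          "nlp": ["natural", "language", "processing"],
#          "data": ["advanced", "data", "mining"],
#          "mooc": ["discourse", "analysis"]}),
# ]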
2.2 Traversing dictionaries
This part is much harder, but let's use the top-down approach this time. We will start easy:
def embed_sentences(sentences: List[str], hierarchical_dictionaries):
    return (traverse(sentence, hierarchical_dictionaries) for sentence in sentences)
Provided hierarchical dictionaries, it creates a generator which transforms each sentence according to the concepts mapping.
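Usage is a one-liner once the structure exists:

hierarchical_dictionaries = create_hierarchical_dictionaries(concepts)
for embedded in embed_sentences(["data mining and text mining"], hierarchical_dictionaries):
    print(embedded)  # advanced data mining and natural language processing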
Now the traverse function:
def traverse(sentence: str, hierarchical_dictionaries):
    # Get all tokens in the sentence
    tokens = sentence.split()
    output_sentence = []
    # Initialize index to the first token
    index = 0
    # Until any tokens left to check for concepts
    while index < len(tokens):
        # Iterate over hierarchical dictionaries (elements of the array)
        for hierarchical_dictionary_tuple in hierarchical_dictionaries:
            # New index is returned based on match and token-wise length of concept
            index, concept = traverse_through_dictionary(
                index, tokens, hierarchical_dictionary_tuple
            )
            # Concept was found in current hierarchical_dictionary_tuple, let's add it
            # to output
            if concept is not None:
                output_sentence.extend(concept)
                # No need to check other hierarchical dictionaries for matching concept
                break
        # Token (and its next tokens) do not match any concept, keep the original
        else:
            output_sentence.append(tokens[index])
        # Increment index in order to move to the next token
        index += 1
    # Join list of tokens into a sentence
    return " ".join(output_sentence)
Once again, if you're not sure what's going on, post a comment.
Using this approach, pessimistically, we will perform O(n*c!) checks, where n is the number of tokens in a sentence and c is the token-wise length of the longest concept (the factorial is in c). This case is extremely unlikely to happen in practice: each token in the sentence would have to almost perfectly fit the longest concept, and all shorter concepts would have to be prefixes of the longest one (like super data mining, super data and data).
It would be much closer to O(n) for any practical problem; as I've said before, with the data you provided in the .txt file it's O(3 * n) worst case, usually O(2 * n).
Traversing through each dictionary:
def traverse_through_dictionary(index, tokens, hierarchical_dictionary_tuple):
    # Get the level of nested dictionaries and initial dictionary
    length, current_dictionary = hierarchical_dictionary_tuple
    # inner_index will loop through tokens until match or no match was found
    inner_index = index
    for _ in range(length):
        # Get next nested dictionary and move inner_index to the next token
        current_dictionary = current_dictionary.get(tokens[inner_index])
        inner_index += 1
        # If no match was found in any level of dictionary
        # Return current index in sentence and None representing lack of concept.
        if current_dictionary is None or inner_index >= len(tokens):
            return index, None
    # If everything went fine through all nested dictionaries, check whether
    # last token corresponds to concept
    concept = current_dictionary.get(tokens[inner_index])
    if concept is None:
        return index, None
    # If so, return inner_index (we have moved length tokens, so we have to update it)
    return inner_index, concept
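A single call on the two-token dictionary from the introduction (the tuple is shown with the split target, as create_hierarchical_dictionaries stores it):

tokens = "Here is data something and another data".split()
dictionary = (1, {"data": {"something": ["advanced", "data", "mining"]}})

# Starting at token 2 ("data"), the match consumes two tokens and returns
# the new index together with the target concept:
print(traverse_through_dictionary(2, tokens, dictionary))
# (3, ['advanced', 'data', 'mining'])

# Starting at token 6 (the trailing "data"), there is no "something" after it:
print(traverse_through_dictionary(6, tokens, dictionary))
# (6, None)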
This constitutes the "meat" of my solution.
3. Results
Now, for brevity, the whole source code is provided below (concepts.txt is the one you provided):
import ast
import time
from typing import List


def get_longest(concepts: List[List[str]]):
    return max(len(text.split()) for concept in concepts for text in concept[1:])


def init_hierarchical_dictionaries(longest: int):
    return [(length, {}) for length in reversed(range(longest))]


def create_hierarchical_dictionaries(concepts: List[List[str]]):
    # Initialization
    longest = get_longest(concepts)
    hierarchical_dictionaries = init_hierarchical_dictionaries(longest)

    for concept in concepts:
        for text in concept[1:]:
            tokens = text.split()
            # Initialize dictionary; get the one with corresponding length.
            # The longer, the earlier it is in the hierarchy
            current_dictionary = hierarchical_dictionaries[longest - len(tokens)][1]
            # All of the tokens except the last one are another dictionary mapping to
            # the next token in concept. Use setdefault so concepts sharing a
            # prefix do not wipe out each other's nested dictionaries.
            for token in tokens[:-1]:
                current_dictionary = current_dictionary.setdefault(token, {})
            # Last token is mapped to the first concept
            current_dictionary[tokens[-1]] = concept[0].split()

    return hierarchical_dictionaries


def traverse_through_dictionary(index, tokens, hierarchical_dictionary_tuple):
    # Get the level of nested dictionaries and initial dictionary
    length, current_dictionary = hierarchical_dictionary_tuple
    # inner_index will loop through tokens until match or no match was found
    inner_index = index
    for _ in range(length):
        # Get next nested dictionary and move inner_index to the next token
        current_dictionary = current_dictionary.get(tokens[inner_index])
        inner_index += 1
        # If no match was found in any level of dictionary
        # Return current index in sentence and None representing lack of concept.
        if current_dictionary is None or inner_index >= len(tokens):
            return index, None
    # If everything went fine through all nested dictionaries, check whether
    # last token corresponds to concept
    concept = current_dictionary.get(tokens[inner_index])
    if concept is None:
        return index, None
    # If so, return inner_index (we have moved length tokens, so we have to update it)
    return inner_index, concept


def traverse(sentence: str, hierarchical_dictionaries):
    # Get all tokens in the sentence
    tokens = sentence.split()
    output_sentence = []
    # Initialize index to the first token
    index = 0
    # Until any tokens left to check for concepts
    while index < len(tokens):
        # Iterate over hierarchical dictionaries (elements of the array)
        for hierarchical_dictionary_tuple in hierarchical_dictionaries:
            # New index is returned based on match and token-wise length of concept
            index, concept = traverse_through_dictionary(
                index, tokens, hierarchical_dictionary_tuple
            )
            # Concept was found in current hierarchical_dictionary_tuple, let's add it
            # to output
            if concept is not None:
                output_sentence.extend(concept)
                # No need to check other hierarchical dictionaries for matching concept
                break
        # Token (and its next tokens) do not match any concept, keep the original
        else:
            output_sentence.append(tokens[index])
        # Increment index in order to move to the next token
        index += 1
    # Join list of tokens into a sentence
    return " ".join(output_sentence)


def embed_sentences(sentences: List[str], hierarchical_dictionaries):
    return (traverse(sentence, hierarchical_dictionaries) for sentence in sentences)


def sanity_check():
    concepts = [
        ["natural language processing", "text mining", "texts", "nlp"],
        ["advanced data mining", "data mining", "data"],
        ["discourse analysis", "learning analytics", "mooc"],
    ]
    sentences = [
        "data mining and text mining",
        "nlp is mainly used by discourse analysis community",
        "data mining in python is fun",
        "mooc data analysis involves texts",
        "data and data mining are both very interesting",
    ]
    targets = [
        "advanced data mining and natural language processing",
        "natural language processing is mainly used by discourse analysis community",
        "advanced data mining in python is fun",
        "discourse analysis advanced data mining analysis involves natural language processing",
        "advanced data mining and advanced data mining are both very interesting",
    ]

    hierarchical_dictionaries = create_hierarchical_dictionaries(concepts)
    results = list(embed_sentences(sentences, hierarchical_dictionaries))
    if results == targets:
        print("Correct results")
    else:
        print("Incorrect results")


def speed_check():
    with open("./concepts.txt") as f:
        concepts = ast.literal_eval(f.read())

    initial_sentences = [
        "data mining and text mining",
        "nlp is mainly used by discourse analysis community",
        "data mining in python is fun",
        "mooc data analysis involves texts",
        "data and data mining are both very interesting",
    ]
    sentences = initial_sentences.copy()
    for _ in range(1_000_000):
        sentences += initial_sentences

    start = time.time()
    hierarchical_dictionaries = create_hierarchical_dictionaries(concepts)
    middle = time.time()
    letters = []
    for result in embed_sentences(sentences, hierarchical_dictionaries):
        letters.append(result[0].capitalize())
    end = time.time()

    print(f"Time for hierarchical creation {(middle-start) * 1000.0} ms")
    print(f"Time for embedding {(end-middle) * 1000.0} ms")
    print(f"Overall time elapsed {(end-start) * 1000.0} ms")


def main():
    sanity_check()
    speed_check()


if __name__ == "__main__":
    main()
Results of the speed check are provided below:
Time for hierarchical creation 107.71822929382324 ms
Time for embedding 30460.427284240723 ms
Overall time elapsed 30568.145513534546 ms
So, for 5 million sentences (the 5 sentences you provided, concatenated 1 million times) and the concepts file you provided (1.1 MB), it takes roughly 30 seconds to perform the concept mapping, which isn't bad, I suppose.
A dictionary should take, in the worst-case scenario, as much memory as your input file (concepts.txt in this case), but will usually be lower, often much lower, as it depends on the combination of concept lengths and the unique words within them.

Answer 2
An approach using suffix arrays:
Skip this step if your data is already sanitized.
First, sanitize your data, replacing all whitespace characters with any character that you know won't be part of any concept or sentence.
Then build suffix arrays for all the sentences. This takes O(nLogn) time per sentence; there are a few algorithms that can do it in O(n) time using suffix trees.
Once you have your suffix arrays ready for all the sentences, just perform a binary search for each of your concepts.
You can further optimize your search using the LCP array.
With LCP and suffix arrays together, the time complexity of the search can be brought down to O(n).
Edit: This approach is generally used in sequence alignment on genomes and is quite popular there as well. You should easily find an implementation that suits you.
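A minimal sketch of the idea (a naive O(n^2 log n) construction for brevity, and the helper names here are mine; a real implementation would use an O(nLogn) or O(n) builder plus the LCP array, as noted above):

from typing import List

def build_suffix_array(text: str) -> List[int]:
    # Naive construction: sort suffix start positions by the suffix itself.
    return sorted(range(len(text)), key=lambda i: text[i:])

def contains(text: str, suffix_array: List[int], pattern: str) -> bool:
    # Binary search for the first suffix >= pattern, then check whether
    # that suffix actually starts with the pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + len(pattern)] < pattern:
            lo = mid + 1
        else:
            hi = mid
    return lo < len(suffix_array) and text.startswith(pattern, suffix_array[lo])

sentence = "data_mining_and_text_mining"  # whitespace already sanitized to "_"
sa = build_suffix_array(sentence)
print(contains(sentence, sa, "text_mining"))  # True
print(contains(sentence, sa, "texts"))        # False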

Answer 3
import re

concepts = [['natural language processing', 'text mining', 'texts', 'nlp'],
            ['advanced data mining', 'data mining', 'data'],
            ['discourse analysis', 'learning analytics', 'mooc']]
sentences = ['data mining and text mining',
             'nlp is mainly used by discourse analysis community',
             'data mining in python is fun',
             'mooc data analysis involves texts',
             'data and data mining are both very interesting']

replacementDict = {concept[0]: concept[1:] for concept in concepts}

finderAndReplacements = [(re.compile('(' + '|'.join(replacees) + ')'), replacement)
                         for replacement, replacees in replacementDict.items()]

def sentenceReplaced(findRegEx, replacement, sentence):
    return findRegEx.sub(replacement, sentence, count=0)

def sentencesAllReplaced(sentences, finderAndReplacements=finderAndReplacements):
    for regex, replacement in finderAndReplacements:
        sentences = [sentenceReplaced(regex, replacement, sentence) for sentence in sentences]
    return sentences

print(sentencesAllReplaced(sentences))
Setup: I preferred concepts represented as a dictionary where the keys and values are the replacements and the replacees, respectively. Stored this in replacementDict.
Compile a matching regular expression for each intended replacement group. Store it, together with its intended replacement, in the finderAndReplacements list.
The sentenceReplaced function returns the input sentence after the substitutions have been performed. (Order of application here won't matter, so parallelization should be possible if we take care to avoid race conditions.)
Finally, we loop through and find/replace for each sentence. (A great deal of parallel structure would offer improved performance.)
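A hedged sketch of that parallelization with the standard library (the chunk size and the replaceChunk helper are my own guesses, not measured choices; it also assumes the print call above is removed or guarded so that workers don't re-run it on import):

from multiprocessing import Pool

def replaceChunk(chunk):
    # Each worker runs the same find/replace pipeline on its slice of sentences.
    return sentencesAllReplaced(chunk)

if __name__ == "__main__":
    chunkSize = 10_000
    chunks = [sentences[i:i + chunkSize] for i in range(0, len(sentences), chunkSize)]
    with Pool() as pool:
        results = [s for chunk in pool.map(replaceChunk, chunks) for s in chunk]
    print(results[:5])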
I'd love to see some thorough benchmarking/testing/reporting, because I'm sure there are a lot of subtleties depending on the nature of the inputs to this task (concepts, sentences) and the hardware running it.
In the case where sentences is the dominant input component compared to the concepts replacements, I believe compiling the regular expressions will be advantageous. When sentences are few and concepts many, especially if most concepts don't appear in any sentence, compiling these matchers would be a waste. And if there are very many replacees for every replacement, the compiled method used may perform poorly or even error out... (Varying assumptions about the input parameters offer a multitude of trade-off considerations, as is often the case.)