3 Answers

Answer 1
The solution provided below works in approximately O(n) runtime complexity, where n is the number of tokens in each sentence.
For 5 million sentences and your concepts.txt, it performs the required operations in roughly 30 seconds; see the basic test in the third section.
When it comes to space complexity, you will have to keep a nested dictionary structure (let's simplify it like that for now); say it's O(c*u), where u are unique tokens for concepts of a certain (token-wise) length, while c is the length of a concept.
It's hard to pinpoint the exact complexity, but it should be quite similar to this (for your example data and the concepts.txt you provided it's pretty accurate, but we will get to the gory details as we go through the implementation).
I assume you can split your concepts and sentences on whitespace; if that's not the case, I would advise you to take a look at spaCy, which provides a smarter way to tokenize your data.
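For instance, a minimal tokenization sketch, assuming spaCy is installed (a blank pipeline is enough here, so no trained model has to be downloaded):

import spacy

# A blank English pipeline provides just the rule-based tokenizer.
nlp = spacy.blank("en")

tokens = [token.text for token in nlp("Here is data something, and another data!")]
print(tokens)  # ['Here', 'is', 'data', 'something', ',', 'and', 'another', 'data', '!']

Unlike a plain str.split, punctuation comes out as separate tokens.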
1. Introduction
Let's take your example:
concepts = [
    ["natural language processing", "text mining", "texts", "nlp"],
    ["advanced data mining", "data mining", "data"],
    ["discourse analysis", "learning analytics", "mooc"],
]
As you said, each element in concepts has to be mapped to the first one, so, in Pythonish, it would go roughly along these lines:
for concept in concepts:
    concept[1:] = concept[0]
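In runnable form, that naive mapping is just a flat dictionary from every alternative form to the head of its concept (a sketch of the idea only, not the final solution):

# Map every alternative form to the first element of its concept list.
mapping = {alternative: concept[0] for concept in concepts for alternative in concept[1:]}
print(mapping["data mining"])  # advanced data mining
print(mapping["mooc"])         # discourse analysis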
The task would be easy if all the concepts had a token-wise length equal to one (which is not the case here) and were unique. Let's focus on the second case and one particular (slightly modified) example concept to see my point:
["advanced data mining", "data something", "data"]
Here data would be mapped to advanced data mining, BUT data something, which contains data, should be mapped before it. If I understand you correctly, you would want this sentence:
"Here is data something and another data"
to be mapped onto:
"Here is advanced data mapping and another advanced data mining"
instead of the naive approach:
"Here is advanced data mapping something and another advanced data mining"
Notice that in the second (naive) example only data was mapped, not data something.
To prioritize data something (and others fitting this pattern), I used an array structure filled with dictionaries, where concepts that come earlier in the array are those that are token-wise longer.
Continuing our example, such an array would look like this:
structure = [
    {"data": {"something": "advanced data mining"}},
    {"data": "advanced data mining"},
]
Notice that if we go through tokens in this order (e.g. first going through the first dictionary with consecutive tokens and, if no match was found, going to the second dictionary, and so on), we will get the longest concepts first.
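A quick illustration of that lookup order on the structure above:

structure = [
    {"data": {"something": "advanced data mining"}},
    {"data": "advanced data mining"},
]

# "data something" is resolved by the first (token-wise longer) dictionary...
print(structure[0]["data"]["something"])  # advanced data mining
# ...while a lone "data" falls through to the second one.
print(structure[1]["data"])               # advanced data mining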
2. Code
Okay, I hope you get the basic idea by now (if not, post a comment below and I will try to explain the unclear parts in more detail).
Disclaimer: I'm not particularly proud of this code-wise, but it gets the job done and could have been worse, I suppose.
2.1 Hierarchical dictionary
First, let's get the longest concept token-wise (excluding the first element, as it's our target and we never have to change it):
def get_longest(concepts: List[List[str]]):
    return max(len(text.split()) for concept in concepts for text in concept[1:])
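For the example concepts, the longest alternatives ("text mining", "data mining", "learning analytics") are two tokens long:

print(get_longest(concepts))  # 2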
Using this information, we can initialize our structure by creating as many dictionaries as there are concepts of different lengths (in the example above it would be 2, so it would work for all your data; concepts of any length would do, though):
def init_hierarchical_dictionaries(longest: int):
    return [(length, {}) for length in reversed(range(longest))]
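For longest equal to 2, this yields one (length, dictionary) pair per concept length, longest first. Note that the stored length counts the nesting levels before the final token, so two-token concepts live in the (1, {}) entry:

print(init_hierarchical_dictionaries(2))  # [(1, {}), (0, {})]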
Notice I'm adding the length of each concept to the array; IMO it's easier that way when it comes to traversing, though you could go without it after some changes to the implementation.
Now, with those helper functions in place, we can create the structure from the list of concepts:
def create_hierarchical_dictionaries(concepts: List[List[str]]):
    # Initialization
    longest = get_longest(concepts)
    hierarchical_dictionaries = init_hierarchical_dictionaries(longest)

    for concept in concepts:
        for text in concept[1:]:
            tokens = text.split()
            # Initialize dictionary; get the one with corresponding length.
            # The longer, the earlier it is in the hierarchy
            current_dictionary = hierarchical_dictionaries[longest - len(tokens)][1]
            # All of the tokens except the last one are another dictionary mapping to
            # the next token in concept. Use setdefault so concepts sharing a
            # prefix do not wipe out each other's nested dictionaries.
            for token in tokens[:-1]:
                current_dictionary = current_dictionary.setdefault(token, {})
            # Last token is mapped to the first concept
            current_dictionary[tokens[-1]] = concept[0].split()

    return hierarchical_dictionaries
This function will create our hierarchical dictionary; see the comments in the source code for some explanation. You may want to create a custom class keeping this thing; it should be easier to use that way.
This is exactly the same object as described in 1. Introduction.
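For the example concepts from the introduction, the function builds the following (result shown as comments, pretty-printed by hand):

hierarchical_dictionaries = create_hierarchical_dictionaries(concepts)
# [
#     (1, {"text": {"mining": ["natural", "language", "processing"]},
#          "data": {"mining": ["advanced", "data", "mining"]},
#          "learning": {"analytics": ["discourse", "analysis"]}}),
#     (0, {"texts": ["natural", "language", "processing"],
#          "nlp": ["natural", "language", "processing"],
#          "data": ["advanced", "data", "mining"],
#          "mooc": ["discourse", "analysis"]}),
# ]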
2.2 Traversing dictionaries
This part is much harder, but let's use the top-down approach this time. We will start easy:
def embed_sentences(sentences: List[str], hierarchical_dictionaries):
    return (traverse(sentence, hierarchical_dictionaries) for sentence in sentences)
Provided hierarchical dictionaries, it creates a generator which transforms each sentence according to the concepts mapping.
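Usage is a one-liner once the structure exists:

hierarchical_dictionaries = create_hierarchical_dictionaries(concepts)
for embedded in embed_sentences(["data mining and text mining"], hierarchical_dictionaries):
    print(embedded)  # advanced data mining and natural language processing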
Now the traverse function:
def traverse(sentence: str, hierarchical_dictionaries):
    # Get all tokens in the sentence
    tokens = sentence.split()
    output_sentence = []
    # Initialize index to the first token
    index = 0
    # Until any tokens left to check for concepts
    while index < len(tokens):
        # Iterate over hierarchical dictionaries (elements of the array)
        for hierarchical_dictionary_tuple in hierarchical_dictionaries:
            # New index is returned based on match and token-wise length of concept
            index, concept = traverse_through_dictionary(
                index, tokens, hierarchical_dictionary_tuple
            )
            # Concept was found in current hierarchical_dictionary_tuple, let's add it
            # to output
            if concept is not None:
                output_sentence.extend(concept)
                # No need to check other hierarchical dictionaries for matching concept
                break
        # Token (and its next tokens) do not match any concept, keep the original
        else:
            output_sentence.append(tokens[index])
        # Increment index in order to move to the next token
        index += 1
    # Join list of tokens into a sentence
    return " ".join(output_sentence)
Once again, if you're not sure what's going on, post a comment.
Using this approach, pessimistically, we will perform O(n*c!) checks, where n is the number of tokens in a sentence and c is the token-wise length of the longest concept (the factorial is in c). This case is extremely unlikely to happen in practice: each token in the sentence would have to almost perfectly fit the longest concept, and all shorter concepts would have to be prefixes of the longest one (like super data mining, super data and data).
It would be much closer to O(n) for any practical problem; as I've said before, with the data you provided in the .txt file it's O(3 * n) worst case, usually O(2 * n).
Traversing through each dictionary:
def traverse_through_dictionary(index, tokens, hierarchical_dictionary_tuple):
    # Get the level of nested dictionaries and initial dictionary
    length, current_dictionary = hierarchical_dictionary_tuple
    # inner_index will loop through tokens until match or no match was found
    inner_index = index
    for _ in range(length):
        # Get next nested dictionary and move inner_index to the next token
        current_dictionary = current_dictionary.get(tokens[inner_index])
        inner_index += 1
        # If no match was found in any level of dictionary
        # Return current index in sentence and None representing lack of concept.
        if current_dictionary is None or inner_index >= len(tokens):
            return index, None
    # If everything went fine through all nested dictionaries, check whether
    # last token corresponds to concept
    concept = current_dictionary.get(tokens[inner_index])
    if concept is None:
        return index, None
    # If so, return inner_index (we have moved length tokens, so we have to update it)
    return inner_index, concept
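A single call on the two-token dictionary from the introduction (the tuple is shown with the split target, as create_hierarchical_dictionaries stores it):

tokens = "Here is data something and another data".split()
dictionary = (1, {"data": {"something": ["advanced", "data", "mining"]}})

# Starting at token 2 ("data"), the match consumes two tokens and returns
# the new index together with the target concept:
print(traverse_through_dictionary(2, tokens, dictionary))
# (3, ['advanced', 'data', 'mining'])

# Starting at token 6 (the trailing "data"), there is no "something" after it:
print(traverse_through_dictionary(6, tokens, dictionary))
# (6, None)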
This constitutes the "meat" of my solution.
3. Results
Now, for brevity, the whole source code is provided below (concepts.txt is the one you provided):
import ast
import time
from typing import List


def get_longest(concepts: List[List[str]]):
    return max(len(text.split()) for concept in concepts for text in concept[1:])


def init_hierarchical_dictionaries(longest: int):
    return [(length, {}) for length in reversed(range(longest))]


def create_hierarchical_dictionaries(concepts: List[List[str]]):
    # Initialization
    longest = get_longest(concepts)
    hierarchical_dictionaries = init_hierarchical_dictionaries(longest)

    for concept in concepts:
        for text in concept[1:]:
            tokens = text.split()
            # Initialize dictionary; get the one with corresponding length.
            # The longer, the earlier it is in the hierarchy
            current_dictionary = hierarchical_dictionaries[longest - len(tokens)][1]
            # All of the tokens except the last one are another dictionary mapping to
            # the next token in concept. Use setdefault so concepts sharing a
            # prefix do not wipe out each other's nested dictionaries.
            for token in tokens[:-1]:
                current_dictionary = current_dictionary.setdefault(token, {})
            # Last token is mapped to the first concept
            current_dictionary[tokens[-1]] = concept[0].split()

    return hierarchical_dictionaries


def traverse_through_dictionary(index, tokens, hierarchical_dictionary_tuple):
    # Get the level of nested dictionaries and initial dictionary
    length, current_dictionary = hierarchical_dictionary_tuple
    # inner_index will loop through tokens until match or no match was found
    inner_index = index
    for _ in range(length):
        # Get next nested dictionary and move inner_index to the next token
        current_dictionary = current_dictionary.get(tokens[inner_index])
        inner_index += 1
        # If no match was found in any level of dictionary
        # Return current index in sentence and None representing lack of concept.
        if current_dictionary is None or inner_index >= len(tokens):
            return index, None
    # If everything went fine through all nested dictionaries, check whether
    # last token corresponds to concept
    concept = current_dictionary.get(tokens[inner_index])
    if concept is None:
        return index, None
    # If so, return inner_index (we have moved length tokens, so we have to update it)
    return inner_index, concept


def traverse(sentence: str, hierarchical_dictionaries):
    # Get all tokens in the sentence
    tokens = sentence.split()
    output_sentence = []
    # Initialize index to the first token
    index = 0
    # Until any tokens left to check for concepts
    while index < len(tokens):
        # Iterate over hierarchical dictionaries (elements of the array)
        for hierarchical_dictionary_tuple in hierarchical_dictionaries:
            # New index is returned based on match and token-wise length of concept
            index, concept = traverse_through_dictionary(
                index, tokens, hierarchical_dictionary_tuple
            )
            # Concept was found in current hierarchical_dictionary_tuple, let's add it
            # to output
            if concept is not None:
                output_sentence.extend(concept)
                # No need to check other hierarchical dictionaries for matching concept
                break
        # Token (and its next tokens) do not match any concept, keep the original
        else:
            output_sentence.append(tokens[index])
        # Increment index in order to move to the next token
        index += 1
    # Join list of tokens into a sentence
    return " ".join(output_sentence)


def embed_sentences(sentences: List[str], hierarchical_dictionaries):
    return (traverse(sentence, hierarchical_dictionaries) for sentence in sentences)


def sanity_check():
    concepts = [
        ["natural language processing", "text mining", "texts", "nlp"],
        ["advanced data mining", "data mining", "data"],
        ["discourse analysis", "learning analytics", "mooc"],
    ]
    sentences = [
        "data mining and text mining",
        "nlp is mainly used by discourse analysis community",
        "data mining in python is fun",
        "mooc data analysis involves texts",
        "data and data mining are both very interesting",
    ]
    targets = [
        "advanced data mining and natural language processing",
        "natural language processing is mainly used by discourse analysis community",
        "advanced data mining in python is fun",
        "discourse analysis advanced data mining analysis involves natural language processing",
        "advanced data mining and advanced data mining are both very interesting",
    ]

    hierarchical_dictionaries = create_hierarchical_dictionaries(concepts)
    results = list(embed_sentences(sentences, hierarchical_dictionaries))
    if results == targets:
        print("Correct results")
    else:
        print("Incorrect results")


def speed_check():
    with open("./concepts.txt") as f:
        concepts = ast.literal_eval(f.read())

    initial_sentences = [
        "data mining and text mining",
        "nlp is mainly used by discourse analysis community",
        "data mining in python is fun",
        "mooc data analysis involves texts",
        "data and data mining are both very interesting",
    ]
    sentences = initial_sentences.copy()
    for _ in range(1_000_000):
        sentences += initial_sentences

    start = time.time()
    hierarchical_dictionaries = create_hierarchical_dictionaries(concepts)
    middle = time.time()
    letters = []
    for result in embed_sentences(sentences, hierarchical_dictionaries):
        letters.append(result[0].capitalize())
    end = time.time()

    print(f"Time for hierarchical creation {(middle-start) * 1000.0} ms")
    print(f"Time for embedding {(end-middle) * 1000.0} ms")
    print(f"Overall time elapsed {(end-start) * 1000.0} ms")


def main():
    sanity_check()
    speed_check()


if __name__ == "__main__":
    main()
Results of the speed check are provided below:
Time for hierarchical creation 107.71822929382324 ms
Time for embedding 30460.427284240723 ms
Overall time elapsed 30568.145513534546 ms
So, for 5 million sentences (the 5 sentences you provided, concatenated 1 million times) and the concepts file you provided (1.1 MB), it takes roughly 30 seconds to perform the concept mapping, which isn't bad, I suppose.
A dictionary should take, in the worst-case scenario, as much memory as your input file (concepts.txt in this case), but will usually be lower, often much lower, as it depends on the combination of concept lengths and the unique words within them.

Answer 2
An approach using suffix arrays:
Skip this step if your data is already sanitized.
First, sanitize your data, replacing all whitespace characters with any character that you know won't be part of any concept or sentence.
Then build suffix arrays for all the sentences. This takes O(nLogn) time per sentence; there are a few algorithms that can do it in O(n) time using suffix trees.
Once you have your suffix arrays ready for all the sentences, just perform a binary search for each of your concepts.
You can further optimize your search using the LCP array.
With LCP and suffix arrays together, the time complexity of the search can be brought down to O(n).
Edit: This approach is generally used in sequence alignment on genomes and is quite popular there as well. You should easily find an implementation that suits you.
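A minimal sketch of the idea (a naive O(n^2 log n) construction for brevity, and the helper names here are mine; a real implementation would use an O(nLogn) or O(n) builder plus the LCP array, as noted above):

from typing import List

def build_suffix_array(text: str) -> List[int]:
    # Naive construction: sort suffix start positions by the suffix itself.
    return sorted(range(len(text)), key=lambda i: text[i:])

def contains(text: str, suffix_array: List[int], pattern: str) -> bool:
    # Binary search for the first suffix >= pattern, then check whether
    # that suffix actually starts with the pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + len(pattern)] < pattern:
            lo = mid + 1
        else:
            hi = mid
    return lo < len(suffix_array) and text.startswith(pattern, suffix_array[lo])

sentence = "data_mining_and_text_mining"  # whitespace already sanitized to "_"
sa = build_suffix_array(sentence)
print(contains(sentence, sa, "text_mining"))  # True
print(contains(sentence, sa, "texts"))        # False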

Answer 3
import re

concepts = [['natural language processing', 'text mining', 'texts', 'nlp'],
            ['advanced data mining', 'data mining', 'data'],
            ['discourse analysis', 'learning analytics', 'mooc']]
sentences = ['data mining and text mining',
             'nlp is mainly used by discourse analysis community',
             'data mining in python is fun',
             'mooc data analysis involves texts',
             'data and data mining are both very interesting']

replacementDict = {concept[0]: concept[1:] for concept in concepts}

finderAndReplacements = [(re.compile('(' + '|'.join(replacees) + ')'), replacement)
                         for replacement, replacees in replacementDict.items()]

def sentenceReplaced(findRegEx, replacement, sentence):
    return findRegEx.sub(replacement, sentence, count=0)

def sentencesAllReplaced(sentences, finderAndReplacements=finderAndReplacements):
    for regex, replacement in finderAndReplacements:
        sentences = [sentenceReplaced(regex, replacement, sentence) for sentence in sentences]
    return sentences

print(sentencesAllReplaced(sentences))
Setup: I preferred concepts represented as a dictionary where the keys and values are the replacements and the replacees, respectively. Stored this in replacementDict.
Compile a matching regular expression for each intended replacement group. Store it, together with its intended replacement, in the finderAndReplacements list.
The sentenceReplaced function returns the input sentence after the substitutions have been performed. (Order of application here won't matter, so parallelization should be possible if we take care to avoid race conditions.)
Finally, we loop through and find/replace for each sentence. (A great deal of parallel structure would offer improved performance.)
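A hedged sketch of that parallelization with the standard library (the chunk size and the replaceChunk helper are my own guesses, not measured choices; it also assumes the print call above is removed or guarded so that workers don't re-run it on import):

from multiprocessing import Pool

def replaceChunk(chunk):
    # Each worker runs the same find/replace pipeline on its slice of sentences.
    return sentencesAllReplaced(chunk)

if __name__ == "__main__":
    chunkSize = 10_000
    chunks = [sentences[i:i + chunkSize] for i in range(0, len(sentences), chunkSize)]
    with Pool() as pool:
        results = [s for chunk in pool.map(replaceChunk, chunks) for s in chunk]
    print(results[:5])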
I'd love to see some thorough benchmarking/testing/reporting, because I'm sure there are a lot of subtleties depending on the nature of the inputs to this task (concepts, sentences) and the hardware running it.
In the case where sentences is the dominant input component compared to the concepts replacements, I believe compiling the regular expressions will be advantageous. When sentences are few and concepts many, especially if most concepts don't appear in any sentence, compiling these matchers would be a waste. And if there are very many replacees for every replacement, the compiled method used may perform poorly or even error out... (Varying assumptions about the input parameters offer a multitude of trade-off considerations, as is often the case.)