3 Answers

Contributed 1,859 experience points · earned 6+ upvotes
The solution provided below runs in approximately O(n) time, where n is the number of tokens in each sentence.
For 5 million sentences and the concepts.txt you provided, it performs the required operations in roughly 30 seconds; see the basic test in section 3.
When it comes to space complexity, you will have to keep a nested dictionary structure (let's simplify it like that for now); say it's O(c*u), where u is the number of unique tokens among concepts of a given (token-wise) length, while c is the length of the concept.
It's hard to pin down the exact complexity, but it will be pretty similar to this (for your example data and the concepts.txt you provided it's quite accurate, but we'll get to the gory details as we walk through the implementation).
I assume you can split your concepts and sentences on whitespace; if that's not the case, I'd suggest taking a look at spaCy, which provides a smarter way to tokenize your data.
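For reference, a minimal spaCy tokenization sketch might look like the following (assuming spaCy is installed and the en_core_web_sm model has been downloaded; both are my assumptions, not part of the original setup):

import spacy

# Load a small English pipeline (downloaded beforehand via:
# python -m spacy download en_core_web_sm).
nlp = spacy.load("en_core_web_sm")

# Tokenize a sentence into token strings instead of a naive str.split().
tokens = [token.text for token in nlp("data mining in python is fun")]
print(tokens)  # ['data', 'mining', 'in', 'python', 'is', 'fun']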
1. Introduction
Let's take your example:
concepts = [
    ["natural language processing", "text mining", "texts", "nlp"],
    ["advanced data mining", "data mining", "data"],
    ["discourse analysis", "learning analytics", "mooc"],
]
As you said, every element in the concepts has to be mapped to its first element, so, in Pythonish, it would go roughly along these lines:
for concept in concepts:
    concept[1:] = concept[0]
If all the concepts had a token-wise length equal to 1 (which is not the case here) and were unique, the task would be easy. Let's focus on the second case and one particular (slightly modified) example concept to see my point:
["advanced data mining", "data something", "data"]
Here data would be mapped to advanced data mining, BUT data something, which consists of data, should be mapped before it. If I understand you correctly, you would want the sentence:
"Here is data something and another data"
To be mapped onto:
"Here is advanced data mining and another advanced data mining"
Instead of the naive approach:
"Here is advanced data mining something and another advanced data mining"
Notice that with the naive approach we would only ever map data, never data something.
To give priority to data something (and other concepts fitting this pattern), I used an array structure filled with dictionaries, where concepts that come earlier in the array are those that are longer token-wise.
Continuing our example, such an array would look like this:
structure = [
    {"data": {"something": "advanced data mining"}},
    {"data": "advanced data mining"},
]
Notice that if we go through the tokens in this order (e.g. first going through the first dictionary with consecutive tokens and, if no match is found, moving on to the second dictionary, and so on), we will get the longest concepts first.
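To make that lookup order concrete, here is how "data something" would resolve against the structure above (plain Python; structure is the literal from the listing just shown):

# The first dictionary (longest concepts) is consulted first:
structure[0]["data"]["something"]  # -> "advanced data mining"
# Only if that chain of lookups failed would we fall back to the
# shorter concept stored in structure[1]["data"].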
2. Code
Okay, I hope you get the basic idea (if not, post a comment below and I'll try to explain the unclear parts in more detail).
Disclaimer: I'm not particularly proud of this code-wise, but it gets the job done, and it could be worse, I suppose.
2.1 Hierarchical dictionaries
First, let's get the longest concept token-wise (excluding the first element, as it's our target and we never have to change it):
from typing import List

def get_longest(concepts: List[List[str]]):
    return max(len(text.split()) for concept in concepts for text in concept[1:])
Using this information, we can initialize our structure by creating as many dictionaries as there are concepts of different lengths (in the example above it would be 2, so it would work for all of your data; concepts of any length would do, though):
def init_hierarchical_dictionaries(longest: int):
    return [(length, {}) for length in reversed(range(longest))]
Notice that I'm adding the length of each concept to the array; IMO it makes traversing easier, though you could go without it after some changes to the implementation.
Now that we have these helper functions, we can create the structure from the list of concepts:
def create_hierarchical_dictionaries(concepts: List[List[str]]):
    # Initialization
    longest = get_longest(concepts)
    hierarchical_dictionaries = init_hierarchical_dictionaries(longest)

    for concept in concepts:
        for text in concept[1:]:
            tokens = text.split()
            # Get the dictionary with the corresponding length.
            # The longer the concept, the earlier it is in the hierarchy.
            current_dictionary = hierarchical_dictionaries[longest - len(tokens)][1]
            # All of the tokens except the last one map to another dictionary
            # holding the next token of the concept. setdefault keeps concepts
            # sharing a prefix from overwriting each other's branches.
            for token in tokens[:-1]:
                current_dictionary = current_dictionary.setdefault(token, {})
            # The last token is mapped to the (tokenized) first concept
            current_dictionary[tokens[-1]] = concept[0].split()

    return hierarchical_dictionaries
This function creates our hierarchical dictionaries; see the comments in the source code for some explanation. You may want to create a custom class wrapping this thing; it should be easier to use that way.
This is exactly the same object as described in 1. Introduction.
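If you want to inspect the structure yourself, a quick check along these lines (using the functions above on the modified example concept from the introduction) should print an object matching the one shown in section 1, with the targets stored as token lists:

concepts = [["advanced data mining", "data something", "data"]]
hierarchical_dictionaries = create_hierarchical_dictionaries(concepts)
print(hierarchical_dictionaries)
# [(1, {'data': {'something': ['advanced', 'data', 'mining']}}),
#  (0, {'data': ['advanced', 'data', 'mining']})]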
2.2 Traversing the dictionaries
This part is much harder, but this time let's use a top-down approach. We'll start easy:
def embed_sentences(sentences: List[str], hierarchical_dictionaries):
    return (traverse(sentence, hierarchical_dictionaries) for sentence in sentences)
Provided with the hierarchical dictionaries, this creates a generator that transforms each sentence according to the concept mapping.
Now the traverse function:
def traverse(sentence: str, hierarchical_dictionaries):
    # Get all tokens in the sentence
    tokens = sentence.split()
    output_sentence = []
    # Initialize index to the first token
    index = 0
    # While any tokens are left to check for concepts
    while index < len(tokens):
        # Iterate over hierarchical dictionaries (elements of the array)
        for hierarchical_dictionary_tuple in hierarchical_dictionaries:
            # New index is returned based on match and token-wise length of concept
            index, concept = traverse_through_dictionary(
                index, tokens, hierarchical_dictionary_tuple
            )
            # Concept was found in current hierarchical_dictionary_tuple, let's add it
            # to the output
            if concept is not None:
                output_sentence.extend(concept)
                # No need to check the other hierarchical dictionaries for a match
                break
        # Token (and its following tokens) do not match any concept, keep the original
        else:
            output_sentence.append(tokens[index])
        # Increment index in order to move to the next token
        index += 1
    # Join the list of tokens back into a sentence
    return " ".join(output_sentence)
Once again, if you're not sure what's going on, post a comment.
Using this approach, pessimistically, we will perform O(n*c!) checks, where n is the number of tokens in a sentence and c is the token-wise length of the longest concept (note the factorial). This case is extremely unlikely to happen in practice; every token in the sentence would have to almost perfectly fit the longest concept, and additionally all the shorter concepts would have to be prefixes of the longest one (e.g. super data mining, super data and data).
For any practical problem it would be much closer to O(n); as I said earlier, with the data you provided in the .txt file the worst case is O(3 * n), and usually O(2 * n).
Going through each dictionary:
def traverse_through_dictionary(index, tokens, hierarchical_dictionary_tuple):
    # Get the level of nested dictionaries and the initial dictionary
    length, current_dictionary = hierarchical_dictionary_tuple
    # inner_index will loop through tokens until a match or no match is found
    inner_index = index
    for _ in range(length):
        # Guard against running past the end of the sentence or into a dead
        # branch (relevant for concepts longer than two tokens).
        if inner_index >= len(tokens) or current_dictionary is None:
            return index, None
        # Get the next nested dictionary and move inner_index to the next token
        current_dictionary = current_dictionary.get(tokens[inner_index])
        inner_index += 1
    # If no match was found at any level of the dictionary,
    # return the current index in the sentence and None for lack of concept.
    if current_dictionary is None or inner_index >= len(tokens):
        return index, None
    # If everything went fine through all nested dictionaries, check whether
    # the last token corresponds to a concept
    concept = current_dictionary.get(tokens[inner_index])
    if concept is None:
        return index, None
    # If so, return inner_index (we have moved `length` tokens, so we have to update it)
    return inner_index, concept
This constitutes the "meat" of my solution.
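To see the whole pipeline in action on the tricky example from the introduction, a quick sketch using the functions defined above:

concepts = [["advanced data mining", "data something", "data"]]
hierarchical_dictionaries = create_hierarchical_dictionaries(concepts)
print(traverse("Here is data something and another data", hierarchical_dictionaries))
# Here is advanced data mining and another advanced data mining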
3. Results
Now, for brevity, the whole source code is provided below (concepts.txt is the one you provided):
import ast
import time
from typing import List


def get_longest(concepts: List[List[str]]):
    return max(len(text.split()) for concept in concepts for text in concept[1:])


def init_hierarchical_dictionaries(longest: int):
    return [(length, {}) for length in reversed(range(longest))]


def create_hierarchical_dictionaries(concepts: List[List[str]]):
    # Initialization
    longest = get_longest(concepts)
    hierarchical_dictionaries = init_hierarchical_dictionaries(longest)

    for concept in concepts:
        for text in concept[1:]:
            tokens = text.split()
            # Get the dictionary with the corresponding length.
            # The longer the concept, the earlier it is in the hierarchy.
            current_dictionary = hierarchical_dictionaries[longest - len(tokens)][1]
            # All of the tokens except the last one map to another dictionary
            # holding the next token of the concept. setdefault keeps concepts
            # sharing a prefix from overwriting each other's branches.
            for token in tokens[:-1]:
                current_dictionary = current_dictionary.setdefault(token, {})
            # The last token is mapped to the (tokenized) first concept
            current_dictionary[tokens[-1]] = concept[0].split()

    return hierarchical_dictionaries


def traverse_through_dictionary(index, tokens, hierarchical_dictionary_tuple):
    # Get the level of nested dictionaries and the initial dictionary
    length, current_dictionary = hierarchical_dictionary_tuple
    # inner_index will loop through tokens until a match or no match is found
    inner_index = index
    for _ in range(length):
        # Guard against running past the end of the sentence or into a dead
        # branch (relevant for concepts longer than two tokens).
        if inner_index >= len(tokens) or current_dictionary is None:
            return index, None
        # Get the next nested dictionary and move inner_index to the next token
        current_dictionary = current_dictionary.get(tokens[inner_index])
        inner_index += 1
    # If no match was found at any level of the dictionary,
    # return the current index in the sentence and None for lack of concept.
    if current_dictionary is None or inner_index >= len(tokens):
        return index, None
    # If everything went fine through all nested dictionaries, check whether
    # the last token corresponds to a concept
    concept = current_dictionary.get(tokens[inner_index])
    if concept is None:
        return index, None
    # If so, return inner_index (we have moved `length` tokens, so we have to update it)
    return inner_index, concept


def traverse(sentence: str, hierarchical_dictionaries):
    # Get all tokens in the sentence
    tokens = sentence.split()
    output_sentence = []
    # Initialize index to the first token
    index = 0
    # While any tokens are left to check for concepts
    while index < len(tokens):
        # Iterate over hierarchical dictionaries (elements of the array)
        for hierarchical_dictionary_tuple in hierarchical_dictionaries:
            # New index is returned based on match and token-wise length of concept
            index, concept = traverse_through_dictionary(
                index, tokens, hierarchical_dictionary_tuple
            )
            # Concept was found in current hierarchical_dictionary_tuple, let's add it
            # to the output
            if concept is not None:
                output_sentence.extend(concept)
                # No need to check the other hierarchical dictionaries for a match
                break
        # Token (and its following tokens) do not match any concept, keep the original
        else:
            output_sentence.append(tokens[index])
        # Increment index in order to move to the next token
        index += 1
    # Join the list of tokens back into a sentence
    return " ".join(output_sentence)


def embed_sentences(sentences: List[str], hierarchical_dictionaries):
    return (traverse(sentence, hierarchical_dictionaries) for sentence in sentences)


def sanity_check():
    concepts = [
        ["natural language processing", "text mining", "texts", "nlp"],
        ["advanced data mining", "data mining", "data"],
        ["discourse analysis", "learning analytics", "mooc"],
    ]
    sentences = [
        "data mining and text mining",
        "nlp is mainly used by discourse analysis community",
        "data mining in python is fun",
        "mooc data analysis involves texts",
        "data and data mining are both very interesting",
    ]
    targets = [
        "advanced data mining and natural language processing",
        "natural language processing is mainly used by discourse analysis community",
        "advanced data mining in python is fun",
        "discourse analysis advanced data mining analysis involves natural language processing",
        "advanced data mining and advanced data mining are both very interesting",
    ]

    hierarchical_dictionaries = create_hierarchical_dictionaries(concepts)
    results = list(embed_sentences(sentences, hierarchical_dictionaries))

    if results == targets:
        print("Correct results")
    else:
        print("Incorrect results")


def speed_check():
    with open("./concepts.txt") as f:
        concepts = ast.literal_eval(f.read())

    initial_sentences = [
        "data mining and text mining",
        "nlp is mainly used by discourse analysis community",
        "data mining in python is fun",
        "mooc data analysis involves texts",
        "data and data mining are both very interesting",
    ]

    sentences = initial_sentences.copy()
    for i in range(1_000_000):
        sentences += initial_sentences

    start = time.time()
    hierarchical_dictionaries = create_hierarchical_dictionaries(concepts)
    middle = time.time()

    letters = []
    for result in embed_sentences(sentences, hierarchical_dictionaries):
        letters.append(result[0].capitalize())

    end = time.time()
    print(f"Time for hierarchical creation {(middle - start) * 1000.0} ms")
    print(f"Time for embedding {(end - middle) * 1000.0} ms")
    print(f"Overall time elapsed {(end - start) * 1000.0} ms")


def main():
    sanity_check()
    speed_check()


if __name__ == "__main__":
    main()
The speed check gives the following results:
Time for hierarchical creation 107.71822929382324 ms
Time for embedding 30460.427284240723 ms
Overall time elapsed 30568.145513534546 ms
So, with 5 million sentences (the 5 sentences you provided, concatenated 1 million times) and the concepts file you provided (1.1 MB), it takes roughly 30 seconds to perform the concept mapping, which isn't bad, I suppose.
In the worst case, the dictionary structure should take as much memory as the input file (concepts.txt in this case), but it will usually be lower / much lower, as it depends on the combination of concept lengths and the unique words among them.

Contributed 1,862 experience points · earned 7+ upvotes
An approach using suffix arrays:
Skip this step if your data is already sanitized.
First, sanitize your data, replacing all whitespace characters with any character that you know won't be part of any concept or sentence.
Then build a suffix array for each sentence. This takes O(nLogn) time per sentence; there are a few algorithms that can do it in O(n) time using suffix trees.
Once the suffix arrays are ready for all the sentences, just perform a binary search for each of your concepts.
You can further optimize the search using an LCP array.
Using LCP with the suffix arrays, the time complexity of the search can be brought down to O(n).
Edit: this approach is generally used in sequence alignment on genomes and is quite popular as well, so you should easily find an implementation that suits you.
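A minimal sketch of the idea in Python (using the naive O(n^2 log n) suffix-array construction for clarity rather than the faster algorithms mentioned above; the helper names are made up for illustration):

import re

def build_suffix_array(text: str):
    # Naive construction: sort suffix start positions by the suffix they begin.
    return sorted(range(len(text)), key=lambda i: text[i:])

def occurs(text: str, suffix_array, pattern: str) -> bool:
    # Binary search over the sorted suffixes: a pattern occurs in the text
    # iff it is a prefix of some suffix.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + len(pattern)] < pattern:
            lo = mid + 1
        else:
            hi = mid
    return lo < len(suffix_array) and text[suffix_array[lo]:].startswith(pattern)

# Whitespace replaced with '_', as suggested above.
sentence = re.sub(r"\s", "_", "data mining and text mining")
suffix_array = build_suffix_array(sentence)
print(occurs(sentence, suffix_array, "text_mining"))  # True
print(occurs(sentence, suffix_array, "nlp"))          # False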

Contributed 1,824 experience points · earned 5+ upvotes
import re

concepts = [['natural language processing', 'text mining', 'texts', 'nlp'],
            ['advanced data mining', 'data mining', 'data'],
            ['discourse analysis', 'learning analytics', 'mooc']]
sentences = ['data mining and text mining',
             'nlp is mainly used by discourse analysis community',
             'data mining in python is fun',
             'mooc data analysis involves texts',
             'data and data mining are both very interesting']

replacementDict = {concept[0]: concept[1:] for concept in concepts}

# re.escape guards against replacees containing regex metacharacters
finderAndReplacements = [(re.compile('(' + '|'.join(map(re.escape, replacees)) + ')'), replacement)
                         for replacement, replacees in replacementDict.items()]

def sentenceReplaced(findRegEx, replacement, sentence):
    return findRegEx.sub(replacement, sentence, count=0)

def sentencesAllReplaced(sentences, finderAndReplacements=finderAndReplacements):
    for regex, replacement in finderAndReplacements:
        sentences = [sentenceReplaced(regex, replacement, sentence) for sentence in sentences]
    return sentences

print(sentencesAllReplaced(sentences))
Setup: I preferred to represent the concepts as a dict whose keys and values are the replacement and the replacees, respectively. This is stored in replacementDict.
Compile a matching regular expression for each intended replacement group, and store it along with its intended replacement in the finderAndReplacements list.
The sentenceReplaced function returns the input sentence after the substitutions have been performed. (Order of application won't matter here, so parallelization should be possible if we take care to avoid race conditions.)
Finally, we cycle through the find/replace pairs for each sentence. (A great deal of parallel structure would offer improved performance.)
I'd love to see some thorough benchmarking/testing/reporting, because I'm sure there are a lot of subtleties to this task depending on the nature of the inputs (concepts, sentences) and the hardware running it.
In the case where the sentences are the dominant input component compared to the concepts replacees, I believe compiling the regular expressions will be advantageous. When sentences are few and concepts many, especially if most concepts don't occur in any sentence, compiling these matchers would be a waste. And if there are very many replacees for every replacement, the compiled method may perform poorly or even error out. (Varying assumptions about the input parameters offer a multitude of trade-off considerations, as is often the case.)
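As a starting point for such benchmarking, a minimal timing harness over the functions above might look like this (a sketch only; the 1-million-fold repetition mirrors the test in the first answer and is my choice, not part of this answer):

import time

repeated_sentences = sentences * 1_000_000  # roughly 5 million sentences

start = time.perf_counter()
results = sentencesAllReplaced(repeated_sentences)
elapsed = time.perf_counter() - start
print(f"Replaced {len(results)} sentences in {elapsed:.2f} s")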