2 Answers

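Contributor with 1875 experience points, 3+ upvotes
I found that the spaCy matcher returns its matches sorted by token index, even when a term found later in the text is listed earlier in the terminology list than another term. That means I can simply end each span just before the start index of the next match. Here is code that shows what I mean: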
from spacy.matcher import PhraseMatcher
import en_core_web_sm

data = u"Species:cat color:orange and white with yellow spots number feet: 4"
nlp = en_core_web_sm.load()
data = data.lower()
matcher = PhraseMatcher(nlp.vocab)
terminology_list = [u"species", u"color", u"number feet"]
patterns = list(nlp.tokenizer.pipe(terminology_list))
# spaCy v2 call signature; in spaCy v3 this is matcher.add("Terms", patterns)
matcher.add("Terms", None, *patterns)
doc = nlp(data)
matches = matcher(doc)
matched_phrases = {}
for idd, (match_id, start, end) in enumerate(matches):
    key_match = doc[start:end]
    # The value runs from the end of this match to the start of the next one,
    # or to the end of the document for the last match.
    if idd != len(matches) - 1:
        end_index = matches[idd + 1][1]
    else:
        end_index = len(doc)
    phrase = doc[end:end_index]
    if phrase.text != '':
        matched_phrases[key_match] = phrase
print(matched_phrases)
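The span arithmetic in this answer does not depend on spaCy itself, so it can be sketched with plain lists. In the sketch below, the token list and the `(start, end)` offsets are assumptions that stand in for what the tokenizer and `PhraseMatcher` would return for this input, and `values_between_matches` is a hypothetical helper name. Unlike the answer's code, it sorts the matches explicitly and drops the `:` separator tokens from each value.

```python
def values_between_matches(tokens, matches):
    """Map each matched key to the tokens that follow it, up to the next match."""
    # Sort explicitly so the result does not rely on the matcher's ordering.
    matches = sorted(matches)
    result = {}
    for i, (start, end) in enumerate(matches):
        # The value ends where the next key starts, or at the end of the text.
        next_start = matches[i + 1][0] if i + 1 < len(matches) else len(tokens)
        value = [t for t in tokens[end:next_start] if t != ":"]
        if value:
            result[" ".join(tokens[start:end])] = " ".join(value)
    return result


# Assumed tokenization of the lowercased example sentence.
tokens = ["species", ":", "cat", "color", ":", "orange", "and", "white",
          "with", "yellow", "spots", "number", "feet", ":", "4"]
# Assumed (start, end) token offsets of the three terms, deliberately unsorted.
matches = [(11, 13), (0, 1), (3, 4)]

print(values_between_matches(tokens, matches))
```

With real spaCy output, the same loop would take `[t.text for t in doc]` as the tokens and `[(start, end) for _, start, end in matcher(doc)]` as the offsets.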

Contributor with 1830 experience points, 9+ upvotes
I have an idea that does not use spaCy.
First, I split the string into tokens:
split = "Species:cat color:orange and white with yellow spots number feet: 4".replace(": ", ":").split()
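Then I iterate over the token list: a token containing a colon starts a new key, and every following token is merged into that key's value until the next key appears.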
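goal = []
key_value = None
for token in split:
    if ":" in token:
        # A new key starts here: store the pair collected so far, if any.
        if key_value:
            goal.append(key_value)
        key_value = token
    else:
        # No colon, so this token continues the current value.
        key_value += " " + token
# Store the last pair, which no later key flushed.
goal.append(key_value)
goal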
>>>
['Species:cat', 'color:orange and white with yellow spots number', 'feet:4']
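Note that in the output above the word "number" ends up in the color value, because the key "number feet" contains a space. If the key names are known in advance, as they are in the first answer's terminology list, one possible variant is to search for the keys with a regular expression and slice the text between consecutive hits. This is a minimal sketch under that assumption; `extract_fields` is a hypothetical helper name, not part of either answer.

```python
import re


def extract_fields(text, keys):
    """Split 'key: value' text into a dict, given the known key names."""
    # Try longer keys first so a multi-word key wins over any shorter overlap.
    ordered = sorted(keys, key=len, reverse=True)
    alternation = "|".join(re.escape(k) for k in ordered)
    key_pattern = re.compile(rf"\b({alternation})\s*:\s*", re.IGNORECASE)

    hits = list(key_pattern.finditer(text))
    fields = {}
    for i, hit in enumerate(hits):
        # Each value runs from the end of its key to the start of the next key.
        value_end = hits[i + 1].start() if i + 1 < len(hits) else len(text)
        fields[hit.group(1).lower()] = text[hit.end():value_end].strip()
    return fields


text = "Species:cat color:orange and white with yellow spots number feet: 4"
print(extract_fields(text, ["species", "color", "number feet"]))
# {'species': 'cat', 'color': 'orange and white with yellow spots', 'number feet': '4'}
```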