Merging related words in NLP

ibeautiful 2023-07-11 13:45:56
I want to define a new word whose count includes the counts of two (or more) different words. For example:

    Words       Frequency
0   mom         250
1   2020        151
2   the         124
3   19          82
4   mother      81
... ...         ...
10  London      6
11  life        6
12  something   6

I want to define mother as mom + mother:

    Words       Frequency
0   mother      331
1   2020        151
2   the         124
3   19          82
... ...         ...
9   London      6
10  life        6
11  something   6

This would be an alternative way of defining groups of words that carry a particular meaning (at least for my purposes). Any suggestion would be appreciated.
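For illustration, here is a minimal sketch of the merge described above, assuming the counts live in a pandas DataFrame and using a hand-written group mapping (pandas and the mapping are assumptions; the answers below discuss how such groups could be derived automatically):

import pandas as pd

df = pd.DataFrame({'Words': ['mom', '2020', 'the', '19', 'mother', 'London', 'life', 'something'],
                   'Frequency': [250, 151, 124, 82, 81, 6, 6, 6]})

# hand-written mapping: every member of a group is replaced by its group label
group_map = {'mom': 'mother', 'mother': 'mother'}

df['Words'] = df['Words'].replace(group_map)
merged = df.groupby('Words', as_index=False, sort=False)['Frequency'].sum()
print(merged)
#     Words  Frequency
# 0  mother        331
# 1    2020        151
# ...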

6 Answers

忽然笑
Contributed 1806 experience points · earned 5+ upvotes

Update: 21 October 2020


I decided to build a Python module to handle the tasks outlined in this answer. The module is called wordhoard and can be downloaded from PyPI.


I have tried using Word2vec and WordNet in projects that needed to determine the frequency of a keyword (e.g. healthcare) and its synonyms (e.g. wellness program, preventive medicine). I found that most NLP libraries did not produce the results I needed, so I decided to build my own dictionary of custom keywords and synonyms. This approach has worked for analyzing and classifying text in multiple projects.


I'm sure someone well versed in NLP techniques may have a more robust solution, but the one below is similar to what has worked for me time and again.


I coded the answer to match the word-frequency data in your question, but it can be modified to use any keyword and synonym dataset.


import string

# Python dictionary of word relationships - primary_word: synonyms
# I manually created these word relationships
word_relationship = {"father": ['dad', 'daddy', 'old man', 'pa', 'pappy', 'papa', 'pop'],
                     "mother": ["mamma", "momma", "mama", "mammy", "mummy", "mommy", "mom", "mum"]}

# This input text is from various poems about mothers and fathers
input_text = 'The hand that rocks the cradle also makes the house a home. It is the prayers of the mother ' \
             'that keeps the family strong. When I think about my mum, I just cannot help but smile; The beauty of ' \
             'her loving heart, the easy grace in her style. I will always need my mom, regardless of my age. She ' \
             'has made me laugh, made me cry. Her love will never fade. If I could write a story, It would be the ' \
             'greatest ever told. I would write about my daddy, For he had a heart of gold. For my father, my friend, ' \
             'This to me you have always been. Through the good times and the bad, Your understanding I have had.'

# convert the input text to lowercase and split the words on whitespace
wordlist = input_text.lower().split()

# remove all punctuation from the wordlist
remove_punctuation = [''.join(ch for ch in s if ch not in string.punctuation)
                      for s in wordlist]

# count the frequency of each word
wordfreq = []
for w in remove_punctuation:
    wordfreq.append(remove_punctuation.count(w))

word_frequencies = dict(zip(remove_punctuation, wordfreq))

word_matches = []

# loop through the word frequencies and the word-relationship dictionary
for word, frequency in word_frequencies.items():
    for keyword, synonym in word_relationship.items():
        match = [x for x in synonym if word == x]
        if word == keyword or match:
            match = ' '.join(map(str, match))
            # append the keyword (mother), synonym (mom) and frequency to a list
            word_matches.append([keyword, match, frequency])

# used to hold the final keywords and frequencies
final_results = {}

# list comprehension to obtain the primary keyword and its frequencies
synonym_matches = [(keyword[0], keyword[2]) for keyword in word_matches]

# iterate synonym_matches and accumulate the total frequency count for each keyword
for item in synonym_matches:
    if item[0] not in final_results:
        final_results[item[0]] = item[1]
    else:
        final_results[item[0]] = final_results[item[0]] + item[1]

print(final_results)
# output
{'mother': 3, 'father': 2}

Other approaches

Below are some other approaches along with their out-of-the-box output.


NLTK WordNet


In this example, I looked up the synonyms of the word "mother". Note that WordNet does not list "mom" or "mum" as synonyms of "mother", yet both words appear in my sample text above. Also note that the word "father" is listed as a synonym of "mother".


from nltk.corpus import wordnet

synonyms = []
word = 'mother'
for synonym in wordnet.synsets(word):
    for item in synonym.lemmas():
        if word != synonym.name() and len(synonym.lemma_names()) > 1:
            synonyms.append(item.name())

print(synonyms)
['mother', 'female_parent', 'mother', 'fuss', 'overprotect', 'beget', 'get', 'engender', 'father', 'mother', 'sire', 'generate', 'bring_forth']

PyDictionary


In this example, I used PyDictionary, which queries synonym.com, to look up the synonyms of the word "mother". The synonyms in this example include "mom" and "mum", plus additional synonyms that WordNet did not produce.


However, PyDictionary also produced a synonym list for "mum" that has nothing to do with the word "mother". PyDictionary appears to pull this list from the adjective section of the page rather than the noun section, and it is hard for a computer to distinguish between the adjective mum and the noun mum.


from PyDictionary import PyDictionary

dictionary_mother = PyDictionary('mother')
print(dictionary_mother.getSynonyms())
# output
[{'mother': ['mother-in-law', 'female parent', 'supermom', 'mum', 'parent', 'mom', 'momma', 'para I', 'mama', 'mummy', 'quadripara', 'mommy', 'quintipara', 'ma', 'puerpera', 'surrogate mother', 'mater', 'primipara', 'mammy', 'mamma']}]

dictionary_mum = PyDictionary('mum')
print(dictionary_mum.getSynonyms())
# output
[{'mum': ['incommunicative', 'silent', 'uncommunicative']}]

Some other possible approaches are using the Oxford Dictionaries API or querying thesaurus.com. Both methods also have drawbacks: for example, the Oxford Dictionaries API requires an API key and a paid subscription based on query volume, and thesaurus.com is missing potential synonyms that could be useful for grouping words.


https://www.thesaurus.com/browse/mother

synonyms: mom, parent, ancestor, creator, mommy, origin, predecessor, progenitor, source, child-bearer, forebearer, procreator

Update

Generating an exact synonym list for every potential word in a corpus is difficult and requires a multi-pronged approach. The code below uses WordNet and PyDictionary to create a superset of synonyms. Like all the other answers, this combined approach also leads to some over-counting of word frequencies. I've been trying to reduce this over-counting by combining key and value pairs in the final synonym dictionary. That problem turned out to be much harder than I expected and may require me to open a question of my own to solve. In the end, I think you need to determine which approach works best for your use case, and you will probably need to combine several approaches.


Thanks for posting this question, because it let me look at other ways of solving a complex problem.


from string import punctuation
from nltk.corpus import stopwords
from nltk.corpus import wordnet
from PyDictionary import PyDictionary

input_text = """The hand that rocks the cradle also makes the house a home. It is the prayers of the mother
         that keeps the family strong. When I think about my mum, I just cannot help but smile; The beauty of
         her loving heart, the easy grace in her style. I will always need my mom, regardless of my age. She
         has made me laugh, made me cry. Her love will never fade. If I could write a story, It would be the
         greatest ever told. I would write about my daddy, For he had a heart of gold. For my father, my friend,
         This to me you have always been. Through the good times and the bad, Your understanding I have had."""


def normalize_textual_information(text):
   # split text into tokens by white space
   token = text.split()

   # remove punctuation from each token
   table = str.maketrans('', '', punctuation)
   token = [word.translate(table) for word in token]

   # remove any tokens that are not alphabetic
   token = [word.lower() for word in token if word.isalpha()]

   # filter out English stop words
   stop_words = set(stopwords.words('english'))

   # you could add additional stops like this
   stop_words.add('cannot')
   stop_words.add('could')
   stop_words.add('would')

   token = [word for word in token if word not in stop_words]

   # filter out any short tokens
   token = [word for word in token if len(word) > 1]
   return token


def generate_word_frequencies(words):
   # list to hold word frequencies
   word_frequencies = []

   # loop through the tokens and generate a word count for each token
   for word in words:
      word_frequencies.append(words.count(word))

   # aggregate the words and word_frequencies into tuples and convert them into a dictionary
   word_frequencies = dict(zip(words, word_frequencies))

   # sort the frequency of the words from low to high
   sorted_frequencies = {key: value for key, value in
                         sorted(word_frequencies.items(), key=lambda item: item[1])}

   return sorted_frequencies


def get_synonyms_internet(word):
   dictionary = PyDictionary(word)
   synonym = dictionary.getSynonyms()
   return synonym


words = normalize_textual_information(input_text)

# synonyms gathered from WordNet
all_synsets_1 = {}
for word in words:
  for synonym in wordnet.synsets(word):
    if word != synonym.name() and len(synonym.lemma_names()) > 1:
      for item in synonym.lemmas():
        if word != item.name():
          all_synsets_1.setdefault(word, []).append(str(item.name()).lower())

# synonyms gathered from PyDictionary (synonym.com)
all_synsets_2 = {}
for word in words:
  word_synonyms = get_synonyms_internet(word)
  for synonym in word_synonyms:
    if word != synonym and synonym is not None:
      all_synsets_2.update(synonym)

# merge the two synonym dictionaries into one superset
word_relationship = {**all_synsets_1, **all_synsets_2}

frequencies = generate_word_frequencies(words)
word_matches = []
word_set = {}
duplication_check = set()

for word, frequency in frequencies.items():
   for keyword, synonym in word_relationship.items():
      match = [x for x in synonym if word == x]
      if word == keyword or match:
        match = ' '.join(map(str, match))
        if match not in word_set or match not in duplication_check or word not in duplication_check:
           duplication_check.add(word)
           duplication_check.add(match)
           word_matches.append([keyword, match, frequency])

# used to hold the final keywords and frequencies
final_results = {}

# list comprehension to obtain the primary keyword and its frequencies
synonym_matches = [(keyword[0], keyword[2]) for keyword in word_matches]

# iterate synonym_matches and accumulate the total frequency count for each keyword
for item in synonym_matches:
   if item[0] not in final_results:
     final_results[item[0]] = item[1]
   else:
     final_results[item[0]] = final_results[item[0]] + item[1]

# do something with the final results


慕斯王
Contributed 1864 experience points · earned 2+ upvotes

This is a hard problem, and the best solution depends on the use case you're trying to solve. It is hard because combining words requires understanding their semantics: you can combine mom and mother because they are semantically related.


One way to identify whether two words are semantically related is through distributed word embeddings (vectors) such as word2vec, GloVe, fastText, etc. You can compute the cosine similarity between the vector of a given word and the vectors of all other words, pick the top 5 closest words, and create a new word.


Example using word2vec


# Load a pretrained word2vec model
import gensim.downloader as api
model = api.load('word2vec-google-news-300')

# vocabulary to compare (taken from the frequency table; see the output below)
words = ['mom', 'mother', 'london', 'life', 'teach', 'teacher']

vectors = [model.get_vector(w) for w in words]
for i, w in enumerate(vectors):
   first_best_match = model.cosine_similarities(vectors[i], vectors).argsort()[::-1][1]
   second_best_match = model.cosine_similarities(vectors[i], vectors).argsort()[::-1][2]

   print (f"{words[i]} + {words[first_best_match]}")
   print (f"{words[i]} + {words[second_best_match]}")

Output:

mom + mother
mom + teacher
mother + mom
mother + teacher
london + mom
london + life
life + mother
life + mom
teach + teacher
teach + mom
teacher + teach
teacher + mother

You can try setting a threshold on the cosine similarity and selecting only the pairs whose cosine similarity is greater than that threshold.
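As a minimal sketch of that thresholding idea, reusing the model loaded above (the 0.7 cutoff, the sample counts and the merge_into helper are illustrative assumptions, not part of the original code):

# Fold the count of every word whose cosine similarity to a chosen head word
# exceeds a threshold into that head word's count.
counts = {'mom': 250, 'mother': 81, 'dad': 90, 'father': 70, 'life': 6}
threshold = 0.7

def merge_into(head, counts, model, threshold):
    merged_total = 0
    for word, freq in list(counts.items()):
        # model.similarity() returns the cosine similarity of the two word vectors
        if word == head or model.similarity(head, word) >= threshold:
            merged_total += freq
            del counts[word]
    counts[head] = merged_total
    return counts

print(merge_into('mother', counts, model, threshold))
# with the google-news vectors, 'mom' should fold into 'mother' at this cutoff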


One problem with semantic similarity is that two words can be semantically opposite and still come out as similar (man - woman), while on the other hand a pair like (man - king) is similar because the words are of the same kind.


梵蒂岡之花
Contributed 1900 experience points · earned 5+ upvotes

What you want to achieve is semantic textual similarity.

For example:

#@title Load the Universal Sentence Encoder's TF Hub module
from absl import logging

import tensorflow as tf
import tensorflow_hub as hub
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import re
import seaborn as sns

module_url = "https://tfhub.dev/google/universal-sentence-encoder/4" #@param ["https://tfhub.dev/google/universal-sentence-encoder/4", "https://tfhub.dev/google/universal-sentence-encoder-large/5"]
model = hub.load(module_url)
print ("module %s loaded" % module_url)

def embed(input):
  return model(input)


def plot_similarity(labels, features, rotation):
  corr = np.inner(features, features)
  sns.set(font_scale=1.2)
  g = sns.heatmap(
      corr,
      xticklabels=labels,
      yticklabels=labels,
      vmin=0,
      vmax=1,
      cmap="YlOrRd")
  g.set_xticklabels(labels, rotation=rotation)
  g.set_title("Semantic Textual Similarity")


def run_and_plot(messages_):
  message_embeddings_ = embed(messages_)
  plot_similarity(messages_, message_embeddings_, 90)


messages = [
    "Mother",
    "Mom",
    "Mama",
    "Dog",
    "Cat"
]

run_and_plot(messages)

(Output: heatmap of the pairwise semantic textual similarity)
http://img3.sycdn.imooc.com/64aceccf0001cbb406410465.jpg
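As a short follow-up sketch that is not part of the original answer: once the embeddings are computed, you can threshold the same inner-product matrix to decide which labels to merge (the counts and the 0.7 cutoff below are assumptions):

# Fold the count of each label into the first earlier label it is sufficiently
# similar to, reusing embed() from above. The counts are illustrative.
labels = ["Mother", "Mom", "Mama", "Dog", "Cat"]
counts = [81, 250, 5, 3, 2]

embeddings = embed(labels)
sim = np.inner(embeddings, embeddings)   # pairwise similarity matrix

merged = {}
for i, label in enumerate(labels):
    head = next((labels[j] for j in range(i) if sim[i, j] >= 0.7), label)
    merged[head] = merged.get(head, 0) + counts[i]

print(merged)   # labels judged similar to an earlier one are folded into its count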


繁星coding
Contributed 1797 experience points · earned 4+ upvotes

Another quirky approach is to use the good old PyDictionary library to tackle this. You can use the

dictionary.getSynonyms()

function to loop through all the words in your list and group them. All of the available listed synonyms are covered and mapped to one group, which lets you assign a final variable and sum the synonyms. In your example, you pick the final word "Mother", which shows the final count of the synonyms.
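A minimal sketch of that idea, following the PyDictionary('word').getSynonyms() pattern shown in the first answer (the sample counts and the choice of 'mother' as the group label are assumptions):

from PyDictionary import PyDictionary

# illustrative word counts from the question
counts = {'mom': 250, 'mother': 81, 'london': 6, 'life': 6}
group_label = 'mother'

# getSynonyms() returns a list like [{'mother': ['mom', 'mum', ...]}]
synonym_sets = PyDictionary(group_label).getSynonyms()
synonyms = set(synonym_sets[0].get(group_label, [])) if synonym_sets else set()

# sum the counts of the group label and any of its listed synonyms
grouped_total = sum(freq for word, freq in counts.items()
                    if word == group_label or word in synonyms)
print({group_label: grouped_total})
# e.g. {'mother': 331} if 'mom' appears in the returned synonym list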


慕容708150
Contributed 1831 experience points · earned 4+ upvotes

You can generate word-embedding vectors and use a clustering algorithm. In the end you will need to tune the algorithm's hyperparameters to achieve highly accurate results.


from sklearn.cluster import DBSCAN
from sklearn.decomposition import PCA

import numpy as np
import spacy

import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

# Load the large English model
nlp = spacy.load("en_core_web_lg")

tokens = nlp("dog cat banana apple teaching teacher mom mother mama mommy berlin paris")

# Generate word embedding vectors
vectors = np.array([token.vector for token in tokens])
vectors.shape
# (12, 300)

Let's use the principal component analysis algorithm to visualize the embeddings in 3-dimensional space:


pca_vecs = PCA(n_components=3).fit_transform(vectors)
pca_vecs.shape
# (12, 3)

fig = plt.figure(figsize=(6, 6))
ax = fig.add_subplot(111, projection='3d')
xs, ys, zs = pca_vecs[:, 0], pca_vecs[:, 1], pca_vecs[:, 2]
_ = ax.scatter(xs, ys, zs)

for x, y, z, label in zip(xs, ys, zs, tokens):
    ax.text(x + 0.3, y, z, str(label))

(Output: 3-D scatter plot of the PCA-projected word embeddings)
http://img1.sycdn.imooc.com//64acecfd0001ebed03360326.jpg

Let's use the DBSCAN algorithm to cluster the words:


model = DBSCAN(eps=5, min_samples=1)
model.fit(vectors)

for word, cluster in zip(tokens, model.labels_):
    print(word, '->', cluster)

Output:

dog -> 0
cat -> 0
banana -> 1
apple -> 2
teaching -> 3
teacher -> 3
mom -> 4
mother -> 4
mama -> 4
mommy -> 4
berlin -> 5
paris -> 6
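To get from these cluster labels back to the merged counts the question asks for, one option (a sketch with illustrative counts, not part of the original answer) is to sum the frequencies within each cluster and use the most frequent member as the group's name:

from collections import defaultdict

# illustrative per-word counts; in practice these come from your corpus
counts = {'mom': 250, 'mother': 81, 'mama': 5, 'mommy': 3,
          'teacher': 12, 'teaching': 4, 'berlin': 2, 'paris': 2}

# group the words by their DBSCAN cluster label
clusters = defaultdict(list)
for word, cluster in zip(tokens, model.labels_):
    if word.text in counts:
        clusters[cluster].append(word.text)

# within each cluster, sum the counts and label the group with its most frequent word
merged = {}
for cluster, members in clusters.items():
    label = max(members, key=lambda w: counts[w])
    merged[label] = sum(counts[w] for w in members)

print(merged)
# e.g. {'mom': 339, 'teacher': 16, 'berlin': 2, 'paris': 2}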


千巷貓影
Contributed 1829 experience points · earned 7+ upvotes

The idea is to use a thesaurus like the one below to identify similar words.

In short: run some knowledge-discovery algorithm that extracts knowledge based on English grammar.

Here is a thesaurus: it is 18 MB.

Here is an excerpt from the thesaurus; you can try matching word substitutions against it with some algorithm.

{"word": "ma", "key": "ma_1", "pos": "noun", "synonyms": ["mamma", "momma", "mama", "mammy", "mummy", "mommy", "mom", "mum"]}
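A minimal sketch of such a substitution match, assuming the thesaurus is stored one JSON object per line (the file name thesaurus.jsonl and the counts below are assumptions):

import json

# illustrative counts from the question
counts = {'mom': 250, 'mother': 81, 'the': 124, 'london': 6}

# build a synonym -> headword map from the thesaurus entries
synonym_to_head = {}
with open('thesaurus.jsonl', encoding='utf-8') as f:   # hypothetical file name
    for line in f:
        entry = json.loads(line)
        for syn in entry.get('synonyms', []):
            synonym_to_head.setdefault(syn, entry['word'])

# fold each word's count into its headword (words without an entry keep their own name)
merged = {}
for word, freq in counts.items():
    head = synonym_to_head.get(word, word)
    merged[head] = merged.get(head, 0) + freq

print(merged)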

For a quick fix using an external API, here is the link: it lets you do more with the API, such as getting synonyms, finding multiple definitions, finding rhyming words, and more.


