I need to connect the word 4G with mobile phones or Internet so that sentences about technology are clustered together. I have the following sentences:

    4G is the fourth generation of broadband network.
    4G is slow.
    4G is defined as the fourth generation of mobile technology
    I bought a new mobile phone.

I need the sentences above to end up in the same cluster. At the moment they do not, probably because the clustering finds no relationship between 4G and mobile. I first tried wordnet.synsets to look for synonyms that connect 4G to internet or mobile phone, but unfortunately it found no connection. I am clustering the sentences as follows:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans
    from sklearn.metrics import adjusted_rand_score
    import numpy

    texts = [
        "4G is the fourth generation of broadband network.",
        "4G is slow.",
        "4G is defined as the fourth generation of mobile technology",
        "I bought a new mobile phone.",
    ]

    # Vectorization of the sentences
    vectorizer = TfidfVectorizer(stop_words="english")
    X = vectorizer.fit_transform(texts)
    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
    words = vectorizer.get_feature_names_out()
    print("words", words)

    n_clusters = 3
    number_of_seeds_to_try = 10
    max_iter = 300
    # The n_jobs argument was removed from KMeans in scikit-learn 1.0, so it is not passed here
    model = KMeans(n_clusters=n_clusters, max_iter=max_iter, n_init=number_of_seeds_to_try).fit(X)

    labels = model.labels_
    # Indices of the most heavily weighted words in each cluster
    ordered_words = model.cluster_centers_.argsort()[:, ::-1]

    print("centers:", model.cluster_centers_)
    print("labels", labels)
    print("inertia:", model.inertia_)

    # Count how many texts fall into each cluster
    texts_per_cluster = numpy.zeros(n_clusters)
    for i_cluster in range(n_clusters):
        for label in labels:
            if label == i_cluster:
                texts_per_cluster[i_cluster] += 1

    print("Top words per cluster:")
    for i_cluster in range(n_clusters):
        print("Cluster:", i_cluster, "texts:", int(texts_per_cluster[i_cluster]))
        for term in ordered_words[i_cluster, :10]:
            print("\t" + words[term])

    print("\n")
    print("Prediction")

Any help with this would be greatly appreciated.
1 Answer

喵喵時(shí)光機(jī)
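Dense word embeddings (as from word2vec or similar algorithms) may help.
They convert words to dense, high-dimensional vectors in which words with similar meanings or usages lie close to each other. Beyond that, certain directions in the space often correspond roughly to kinds of relationships between words.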
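So word vectors trained on sufficiently representative (large and varied) text will place the vectors for '4G' and 'mobile' fairly close to each other. If your sentence representations are then built up from word vectors, that may help your clustering.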
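As a rough illustration of what "close to each other" means, here is a minimal sketch that compares vectors by cosine similarity. The three-dimensional vectors in `toy_vectors` are hand-picked assumptions, not output of any trained model; real embeddings are learned from a corpus and usually have hundreds of dimensions.

```python
import numpy as np

# Hypothetical 3-dimensional vectors, hand-picked for illustration only.
# Real word vectors would be loaded from a trained embedding model.
toy_vectors = {
    "4g": np.array([0.9, 0.8, 0.1]),
    "mobile": np.array([0.8, 0.9, 0.2]),
    "banana": np.array([0.1, 0.0, 0.9]),
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


sim_related = cosine_similarity(toy_vectors["4g"], toy_vectors["mobile"])
sim_unrelated = cosine_similarity(toy_vectors["4g"], toy_vectors["banana"])

print("4g vs mobile:", round(sim_related, 3))
print("4g vs banana:", round(sim_unrelated, 3))
```

With vectors from a real model, the same comparison is how you would check whether '4G' and 'mobile' actually ended up near each other for your corpus.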
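One quick way to model sentences with individual word vectors is to use the average of all of a sentence's word vectors as the sentence vector. That is too simple to capture many shades of meaning (especially those that come from grammar and word order), but it often works as a good baseline, particularly for broad-topic problems.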
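The averaging idea could look roughly like the sketch below. The `toy_vectors` table, the `sentence_vector` helper, and the simple regex tokenizer are all assumptions made for illustration; with a real model you would look words up in its vocabulary instead.

```python
import re

import numpy as np

# Hypothetical toy embeddings, hand-picked for illustration only.
toy_vectors = {
    "4g": np.array([0.9, 0.8, 0.1]),
    "mobile": np.array([0.8, 0.9, 0.2]),
    "phone": np.array([0.7, 0.9, 0.1]),
    "network": np.array([0.9, 0.6, 0.2]),
    "broadband": np.array([0.9, 0.5, 0.1]),
    "technology": np.array([0.6, 0.7, 0.3]),
}


def sentence_vector(sentence, vectors):
    """Average the vectors of all known words; zeros if no word is known."""
    tokens = re.findall(r"[a-z0-9]+", sentence.lower())
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        dim = len(next(iter(vectors.values())))
        return np.zeros(dim)
    return np.mean(known, axis=0)


texts = [
    "4G is the fourth generation of broadband network.",
    "4G is slow.",
    "4G is defined as the fourth generation of mobile technology",
    "I bought a new mobile phone.",
]

# One row per sentence, one column per embedding dimension.
X = np.vstack([sentence_vector(t, toy_vectors) for t in texts])
print(X.shape)
```

A dense matrix like `X` can be passed to `KMeans(...).fit(X)` in place of the TF-IDF matrix. Words missing from the vocabulary are skipped here, which is one of several possible choices.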
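Another calculation, "Word Mover's Distance", treats sentences as sets of word vectors (without averaging them) and can compute sentence-to-sentence distances that work better than simple averages, but it becomes very expensive to calculate for longer sentences.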
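To give a feel for the set-of-vectors idea, here is a sketch of the relaxed variant of Word Mover's Distance, a cheaper lower bound in which each word simply moves all of its weight to the nearest word in the other sentence. It is not the full distance, which requires solving an optimal transport problem. The `toy_vectors` table and the `relaxed_wmd` helper are assumptions for illustration, and every token is given equal weight.

```python
import numpy as np

# Hypothetical toy embeddings, hand-picked for illustration only.
toy_vectors = {
    "4g": np.array([0.9, 0.8, 0.1]),
    "mobile": np.array([0.8, 0.9, 0.2]),
    "phone": np.array([0.7, 0.9, 0.1]),
    "network": np.array([0.9, 0.6, 0.2]),
    "banana": np.array([0.1, 0.0, 0.9]),
}


def relaxed_wmd(tokens_a, tokens_b, vectors):
    """Relaxed Word Mover's Distance: a lower bound on the full distance."""
    a = np.array([vectors[t] for t in tokens_a if t in vectors])
    b = np.array([vectors[t] for t in tokens_b if t in vectors])
    if len(a) == 0 or len(b) == 0:
        raise ValueError("each sentence needs at least one known word")
    # Pairwise Euclidean distances, shape (len(a), len(b)).
    dists = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    # Each word sends its (uniform) weight to the nearest word on the other side.
    cost_a_to_b = dists.min(axis=1).mean()
    cost_b_to_a = dists.min(axis=0).mean()
    # The larger of the two one-sided costs is the tighter lower bound.
    return float(max(cost_a_to_b, cost_b_to_a))


near = relaxed_wmd(["4g", "network"], ["mobile", "phone"], toy_vectors)
far = relaxed_wmd(["4g", "network"], ["banana"], toy_vectors)
print("near:", round(near, 3), "far:", round(far, 3))
```

The pairwise distance matrix grows with the product of the two sentence lengths, and the full calculation adds an optimization step on top of it, which is where the cost for longer sentences comes from. For the full distance, gensim's `KeyedVectors.wmdistance` is one existing implementation.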