TfidfVectorizer gives stop words high weights
Given the following code:

    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer
    import urllib.request  # the lib that handles the url stuff
    from bs4 import BeautifulSoup
    import unicodedata

    def remove_control_characters(s):
        base = ""
        for ch in s:
            if unicodedata.category(ch)[0] != "C":
                base = base + ch.lower()
            else:
                base = base + " "
        return base

    moby_dick_url = 'http://www.gutenberg.org/files/2701/2701-0.txt'
    soul_of_japan = 'http://www.gutenberg.org/files/12096/12096-0.txt'

    def extract_body(url):
        with urllib.request.urlopen(url) as s:
            data = BeautifulSoup(s).body()[0].string
            stripped = remove_control_characters(data)
            return stripped

    moby = extract_body(moby_dick_url)
    bushido = extract_body(soul_of_japan)
    corpus = [moby, bushido]

    vectorizer = TfidfVectorizer(use_idf=False, smooth_idf=True)
    tf_idf = vectorizer.fit_transform(corpus)
    df_tfidf = pd.DataFrame(tf_idf.toarray(), columns=vectorizer.get_feature_names(), index=["Moby", "Bushido"])
    df_tfidf[["the", "whale"]]

I expected "whale" to get a relatively high tf-idf score in "Moby Dick" and a low score in "Bushido: The Soul of Japan", and "the" to get a low score in both. However, I get the opposite. The computed result is:

|         | the      | whale    |
|---------|----------|----------|
| Moby    | 0.707171 | 0.083146 |
| Bushido | 0.650069 | 0.000000 |

This makes no sense to me. Can anyone point out the mistake in my thinking or in my code?
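One thing that may be worth checking is the `use_idf=False` argument: with it, `TfidfVectorizer` skips the IDF step and returns L2-normalized term frequencies, so a very frequent word such as "the" is expected to dominate. The sketch below compares both settings on a toy corpus. The two sentences and the `weights` helper are made up for illustration and are not part of the code above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-ins for the two books (hypothetical text, not the real corpus).
docs = [
    "the whale swam past the ship and the whale dived",
    "the samurai followed the code and the samurai bowed",
]

def weights(use_idf):
    """Return the per-document weights of 'the' and 'whale' (illustrative helper)."""
    vec = TfidfVectorizer(use_idf=use_idf, smooth_idf=True)
    matrix = vec.fit_transform(docs).toarray()
    vocab = vec.vocabulary_  # maps each term to its column index
    return {w: matrix[:, vocab[w]] for w in ("the", "whale")}

tf_only = weights(use_idf=False)  # term frequency only, L2-normalized per row
tf_idf = weights(use_idf=True)    # term frequency times smoothed IDF

# How much weight "whale" gets relative to "the" in the first document.
ratio_tf_only = tf_only["whale"][0] / tf_only["the"][0]
ratio_tf_idf = tf_idf["whale"][0] / tf_idf["the"][0]

print("whale/the without IDF:", ratio_tf_only)
print("whale/the with IDF:   ", ratio_tf_idf)
```

With IDF enabled, "whale" gains weight relative to "the", but "the" does not drop to zero: scikit-learn's smoothed IDF is 1 for a term that appears in every document, so on a two-document corpus a stop word can still score high. Passing `stop_words="english"` or fitting on more documents are possible ways to reduce that effect.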