1 Answer

Most likely, your error comes from using the same `term_in_document` dictionary for the words of every file.
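The aliasing problem can be reproduced in miniature (illustrative data, not code from the question):

```python
shared = {}
postings = {}
for word in ["apple", "banana"]:
    # Every iteration writes into the SAME dictionary object ...
    shared[word] = 1
    # ... and then stores a reference to it, so every key aliases it
    postings[word] = shared

# Both entries are the same object, so "apple" sees "banana"'s data
print(postings["apple"] is postings["banana"])  # True
```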
A few comments:

- `len(sorted(...))` wastes resources sorting something that does not need to be sorted (sorting is not cheap), since you only use the length. Reading the files by number makes no sense either: you end up hitting the filesystem many times just to read the directory's file names, because you list the directory again every time you read a file.
- Files should be opened in a `with` statement, which closes them for us.
- Variables and functions should use `this_notation`, while classes should use `ThisNotation`.
- You iterate over the word list twice just to get the decimal logarithms.
- The logic after that is quite confusing. You seem to be computing the RMS (root mean square) of the decimal logarithms of each word's occurrence count, but you never divide by the number of words. After that, you take the logarithm again. You should define your problem better; I will edit my answer as I get new information.
```python
import re
import os
import math
import heapq

def read_file(path):
    with open(path, 'r', encoding='latin-1') as f:
        return f.read()

DELIM = r'[ \n\t0123456789;:.,/()"\'-]+'

def tokenize(text):
    return re.split(DELIM, text.lower())

def index_text_files_rr(path):
    postings = {}
    doc_lengths = {}
    files = sorted(os.listdir(path))
    for i, file in enumerate(files):
        file_path = os.path.join(path, file)
        s = read_file(file_path)
        words = tokenize(s)
        length = 0
        # We will store pairs of the word with the decimal logarithm of
        # the word count here to use it later
        words_and_logs = []
        for word in words:
            # Discard empty words
            if word != '':
                # Compute the decimal logarithm of the word count
                log = math.log10(words.count(word))
                # Add the square of the decimal logarithm to the length
                length += log ** 2
                # Store the word and decimal logarithm pair
                words_and_logs.append((word, log))
        # Square root of the sum of the squares of the decimal
        # logarithms of the word counts
        doc_lengths[i] = math.sqrt(length)
        # Iterate over our stored pairs where we already have the
        # decimal logarithms computed so we do not have to do it again.
        # Empty words were already discarded above.
        for word, log in words_and_logs:
            # Each word gets its own per-document dictionary via
            # setdefault, instead of one shared term_in_document dict
            # (guard against a zero length: every word unique -> all logs 0)
            weight = log / doc_lengths[i] if doc_lengths[i] else 0.0
            postings.setdefault(word, {})[i] = weight
    return postings
```
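The fix hinges on `dict.setdefault`, which inserts a fresh inner dictionary per word instead of sharing one (a standalone illustration with made-up data):

```python
index = {}
pairs = [(0, "hello"), (1, "hello"), (1, "world")]
for doc_id, word in pairs:
    # setdefault returns the existing inner dict for word, or inserts
    # and returns a brand-new {} -- one per word, never shared
    index.setdefault(word, {})[doc_id] = 1.0
```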
```python
def query_rr(postings, qtext):
    words = tokenize(qtext)
    # Recover the set of indexed document ids from the postings lists
    doc_ids = set()
    for docs in postings.values():
        doc_ids.update(docs)
    n_docs = len(doc_ids)
    doc_scores = {}
    for i in doc_ids:
        score = 0
        for w in words:
            # Skip query words missing from the index or this document
            if w in postings and i in postings[w]:
                tf = words.count(w)
                df = len(postings[w])
                idf = math.log10(n_docs / (df + 1))
                # Query weight times the stored document weight
                score += tf * idf * postings[w][i]
        doc_scores[i] = score
    # Ids of the ten best-scoring documents
    return heapq.nlargest(10, doc_scores, key=doc_scores.get)

postings = index_text_files_rr('docs')
print(query_rr(postings, 'hello'))
```
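As an aside, `words.count(w)` inside the scoring loop rescans the query once per term; a `collections.Counter` gets all term frequencies in a single pass (a sketch of the idea, not part of the code above):

```python
from collections import Counter

def term_frequencies(words):
    # One pass over the token list instead of one .count() scan per term
    return Counter(words)

tf = term_frequencies(["hello", "world", "hello"])
print(tf["hello"], tf["world"])  # 2 1
```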