2 Answers

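Contributed 1744 experience points · Received over 4 likes
My two ideas were (1) memory-map the file and use a multiline regular expression to search for each value, and (2) distribute the work across multiple subprocesses. I combined the two and came up with the following. It might be possible to do the mmap in the parent process and share it, but I took the simple route and just do it in each subprocess, assuming the operating system will work out efficient sharing for you.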
import mmap
import multiprocessing as mp
import os
import re


def _value_find_worker_init(filename):
    """Called when initializing mp.Pool to open an mmapped file in subprocesses.

    The file is `global mmap_file` so that the worker can find it.
    """
    global mmap_file
    filenames_fd = os.open(filename, os.O_RDONLY)
    mmap_file = mmap.mmap(filenames_fd, length=os.stat(filename).st_size,
                          access=mmap.ACCESS_READ)


def _value_find_worker(value):
    """Return a list of matching lines in `global mmap_file`"""
    # Multiline regex for findall; re.escape makes the value match literally.
    regex = b"(?m)^.*?" + re.escape(value) + b".*?$"
    return re.compile(regex).findall(mmap_file)


def find_unique():
    with open("UniqueValueList.txt", "rb") as g:
        # Skip blank lines: an empty value would match every line.
        uniqueValues = [line.strip() for line in g if line.strip()]
    with mp.Pool(initializer=_value_find_worker_init,
                 initargs=("Filenames_File.txt",)) as pool:
        matched_values = set()
        for matches in pool.imap_unordered(_value_find_worker, uniqueValues):
            matched_values.update(matches)
    with open("Filenames_With_Unique_Values.txt", "wb") as outfile:
        outfile.writelines(value + b"\n" for value in matched_values)


if __name__ == "__main__":
    # The guard keeps child processes from re-running this on spawn platforms.
    find_unique()
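
To see what the worker's multiline regex returns without setting up files or a process pool, the same pattern can be tried on a small in-memory bytes buffer. This is only a sketch: `find_lines_containing` and the sample data are hypothetical names for illustration, and `re.escape` is used so that characters such as `.` in a value are matched literally.

```python
import re


def find_lines_containing(data, value):
    """Return every full line of `data` (bytes) that contains `value` literally."""
    # (?m) makes ^ and $ match at line boundaries, so each match is one whole line.
    pattern = re.compile(b"(?m)^.*?" + re.escape(value) + b".*?$")
    return pattern.findall(data)


# Hypothetical sample standing in for the contents of the mmapped file.
sample = b"a/report_001.txt\nb/notes.txt\nc/report_002.txt\n"
print(find_lines_containing(sample, b"report"))
# → [b'a/report_001.txt', b'c/report_002.txt']
```

Because `re.findall` accepts any bytes-like object, the same pattern works unchanged on an `mmap` object.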

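Contributed 1831 experience points · Received over 10 likes
We can use the file object as an iterator. The iterator returns the lines one at a time so that each can be processed. This does not read the whole file into memory, which makes it suitable for reading large files in Python.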
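
A minimal sketch of this approach, assuming the goal is to keep the lines that contain any of a list of values. The function name `iter_matching_lines` and its parameters are hypothetical, chosen only for illustration.

```python
def iter_matching_lines(path, values):
    """Yield each line of the file at `path` that contains any of `values` (bytes)."""
    with open(path, "rb") as f:
        # Iterating over the file object reads one line at a time,
        # so the whole file is never held in memory at once.
        for line in f:
            stripped = line.rstrip(b"\r\n")
            if any(value in stripped for value in values):
                yield stripped
```

It could be called as `for line in iter_matching_lines("Filenames_File.txt", unique_values): ...`, where `unique_values` is a list of bytes values. Note that this checks every value against every line, so it trades memory for CPU time when the value list is long.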