2 回答

TA貢獻(xiàn)1829條經(jīng)驗(yàn) 獲得超6個(gè)贊
這是您的解決方案......
def cross(listMaster, listSlave, criteria="email"):
if criteria == "email":
returnUnique = listMaster[:] # create a copy of the master list
emails_in_master = set()
for item in listMaster:
emails_in_master.add(item[2]) # add the master emails to the set
for item in listSlave:
if item[2] in emails_in_master:
returnUnique.append(item)
return returnUnique
您的算法是 O(n^2),因?yàn)槟诒闅v一個(gè)列表,然后在每次迭代上述內(nèi)容時(shí)搜索另一個(gè)列表。這會(huì)導(dǎo)致指數(shù)運(yùn)行時(shí)間,這基本上是您可以獲得的最糟糕的結(jié)果。您需要嘗試使算法具有線性復(fù)雜度,以便獲得不錯(cuò)的運(yùn)行時(shí)間。
你的算法基本上如下:
loop for n: # this costs n
loop for n: # this costs n for each of the n's above
add an item or continue # so total, this is O(n * n)
你想要的是以下內(nèi)容:
loop for n: # this costs n
build a lookup
loop for n: # this costs n
add item if in lookup or continue # so total, this is O(n)
我在本地機(jī)器上將測試數(shù)據(jù)生成到 CSV 中。這是我創(chuàng)建 CSV 的方式...
>>> import csv
>>> from faker import Faker
>>> fake = Faker()
>>> with open('masters.csv', 'wb') as csvfile:
... writer = csv.writer(csvfile, delimiter=',', quotechar='"')
... for i in range(20000):
... writer.writerow([fake.name(), fake.address(), fake.email(), fake.job(), fake.ssn()])
...
>>> with open('slaves.csv', 'wb') as csvfile:
... writer = csv.writer(csvfile, delimiter=',', quotechar='"')
... for i in range(20000):
... writer.writerow([fake.name(), fake.address(), fake.email(), fake.job(), fake.ssn()])
...
一旦生成了這些(注意每個(gè)文件有 20k,因?yàn)樯?200 萬的時(shí)間太長了),我構(gòu)建了以下測試套件來比較不同的方法......
import csv
import unittest
email = lambda l: l[2]
class TestListComparison(unittest.TestCase):
@classmethod
def setUpClass(cls):
cls.masters = []
cls.slaves = []
with open('masters.csv', 'rb') as master_csv:
reader = csv.reader(master_csv, delimiter=',', quotechar='"')
cls.masters = list(reader)
with open('slaves.csv', 'rb') as slave_csv:
reader = csv.reader(slave_csv, delimiter=',', quotechar='"')
cls.slaves = list(reader)
def test_make_sure_lists_are_well_formed(self):
self.assertEqual(len(self.masters), len(self.slaves))
self.assertEqual(len(self.masters), 20000)
def test_list_combination_original(self):
emailListSlave = []
returnUnique = []
for item in self.slaves:
emailListSlave.append(email(item))
for no, item in enumerate(self.masters): # O(n)
if email(item) not in self.slaves: # O(n)
returnUnique.append(item) # O(1)
# Total complexity: O(n * n * 1) => O(n^2)
def test_list_combination_using_lookup(self):
lookup = set()
returnUnique = self.masters[:] # create a copy of masters list
for master in self.masters: # loop over the master list O(n)
lookup.add(email(master)) # add the email to the set O(1)
for slave in self.slaves: # loop over the list again O(n)
if email(slave) in lookup: # check the lookup O(1)
returnUnique.append(slave) # add the item to the list O(1)
# Total complexity: O(n + n) => O(2n) => O(n)
下面是運(yùn)行結(jié)果:
請(qǐng)注意,查找測試花費(fèi)了大約 15 毫秒,而原始算法花費(fèi)了大約 14 秒。那快了幾個(gè)數(shù)量級(jí)。

TA貢獻(xiàn)1825條經(jīng)驗(yàn) 獲得超6個(gè)贊
它如此緩慢的原因是搜索是在線性時(shí)間內(nèi)進(jìn)行的。使用關(guān)鍵字作為搜索字符串的字典。應(yīng)該有很大的不同。
添加回答
舉報(bào)