首頁猿問使用 dict.items()...

使用 dict.items() 為大型數(shù)據(jù)集優(yōu)化字典查找

Python

冉冉說 2021-10-12 10:42:10

我是新手，最近幾個(gè)月開始在 pyhton 中編碼。我有一個(gè)腳本，它需要一個(gè)蛋白質(zhì)組（2850 個(gè)字符串的 800 Kb 文件）并根據(jù)一個(gè)大型數(shù)據(jù)集（作為 id:protein_string 的字典保存在代碼中的 2300 萬個(gè)字符串的 8Gb 文件）檢查每個(gè)單獨(dú)的蛋白質(zhì)（protein_string）并報(bào)告 Ids所有相同的字符串（每個(gè)字符串最多可以報(bào)告 8500 個(gè) ID）。當(dāng)前腳本需要 4 小時(shí)才能運(yùn)行。一般可以采取什么措施來加快進(jìn)程，以及如何將我的腳本轉(zhuǎn)換為多處理或多線程（不確定差異），以便進(jìn)行比較的代碼部分？import sysfrom Bio import AlignIOfrom Bio import SeqIOfrom Bio.Seq import Seqimport timestart_time = time.time()databasefile = sys.argv[1]queryfile = sys.argv[2]file_hits = "./" + sys.argv[2].split("_protein")[0] + "_ZeNovo_hits_v1.txt"file_report = "./" + sys.argv[2].split("_protein")[0] + "_ZeNovo_report_v1.txt"format = "fasta"output_file = open(file_hits, 'w')output_file_2 = open(file_report,'w')sequences_dict = {}output_file.write("{}\t{}\n".format("protein_query", "hits"))for record in SeqIO.parse(databasefile, format): sequences_dict[record.description] = str(record.seq)print("processed database in --- {:.3f} seconds ---".format(time.time() - start_time))processed_counter = 0for record in SeqIO.parse(queryfile, format): query_seq = str(record.seq) count = 0 output_file.write("{}\t".format(record.description)) for id, seq in sequences_dict.items(): if seq == query_seq: count += 1 output_file.write("{}\t".format(id)) processed_counter += 1 output_file.write("\n") print("processed protein "+str(processed_counter)) output_file_2.write(record.description+'\t'+str(count)+'\t'+str(len(record.seq))+'\t'+str(record.seq)+'\n')output_file.close()output_file_2.close()print("Done in --- {:.3f} seconds ---".format(time.time() - start_time))

查看完整描述

2 回答

萬千封印

TA貢獻(xiàn)1891條經(jīng)驗(yàn) 獲得超3個(gè)贊

初步在我看來，它可能更有意義，而不是將您的 dict 存儲(chǔ)為{ id : seq }，將其存儲(chǔ)為{ seq : [id_list] }. 由于聽起來每個(gè)序列有很多重復(fù)，這將節(jié)省訪問特定序列的所有 ID 的時(shí)間。您可以在讀取數(shù)據(jù)時(shí)使用defaultdict帶有默認(rèn)值的 a 作為空列表來執(zhí)行此操作，并且當(dāng)您讀取 ID 和序列時(shí)，您可以使用將其添加到 dict 中sequences_dict[record.seq].append(record.description)。

讓我知道這是否有幫助，以及我是否可以提供其他幫助。

反對(duì) 回復(fù) 2021-10-12

料青山看我應(yīng)如是

TA貢獻(xiàn)1772條經(jīng)驗(yàn) 獲得超8個(gè)贊

按照 Sam Hollenbach 的建議，我可能會(huì)對(duì)您的代碼進(jìn)行以下 (4) 次更改。

import sys

from Bio import AlignIO

from Bio import SeqIO

from Bio.Seq import Seq

import time

start_time = time.time()

from collections import defaultdict

databasefile = sys.argv[1]

queryfile = sys.argv[2]

file_hits = "./" + sys.argv[2].split("_protein")[0] + "_ZeNovo_hits_v1.txt"

file_report = "./" + sys.argv[2].split("_protein")[0] + "_ZeNovo_report_v1.txt"

_format = "fasta" #(change 1)

output_file = open(file_hits, 'w')

output_file_2 = open(file_report,'w')

sequences_dict = defaultdict(list)

output_file.write("{}\t{}\n".format("protein_query", "hits"))

for record in SeqIO.parse(databasefile, _format):

sequences_dict[record.seq].append(record.description) #(change 2)

#sequences_dict[record.description] = str(record.seq)

print("processed database in --- {:.3f} seconds ---".format(time.time() - start_time))

processed_counter = 0

for record in SeqIO.parse(queryfile, _format):

query_seq = record.seq #(change 3)

count = 0

output_file.write("{}\t".format(record.description))

if query_seq in sequences_dict: #(change 4)

count = len(sequences_dict[query_seq])

output_file.write('\t'.join(sequences_dict[query_seq]) + "\n")

processed_counter += 1

print("processed protein", processed_counter)

output_file_2.write(record.description+'\t'+str(count)+

'\t'+str(len(record.seq))+'\t'+str(record.seq)+'\n')

output_file.close()

output_file_2.close()

print("Done in --- {:.3f} seconds ---".format(time.time() - start_time))

更改 #1： - 將格式變量的名稱更改為 _format（以避免與 Python 術(shù)語“格式”發(fā)生沖突，并在使用該名稱的代碼中進(jìn)行更改。

更改 #2：使用record.seq作為字典的鍵并將附加record.description到列表中（作為值）

更改 #3：無需強(qiáng)制record.seq轉(zhuǎn)換str- 它已經(jīng)是一個(gè)字符串。

更改 #4：這 3 行將比在原始代碼中迭代字典更快地定位任何匹配的記錄。

我不確定output_file.write("{}\t".format(record.description))應(yīng)該如何處理。

另外，不能說我已經(jīng)找到了一個(gè)完整的工作程序所需的所有更改。如果您在嘗試建議的更改后有任何疑問，請(qǐng)告訴我。

反對(duì) 回復(fù) 2021-10-12

2 回答
0 關(guān)注
190 瀏覽

關(guān)注

添加回答

舉報(bào)

0/150

提交

取消

使用 Ctrl+D 可將網(wǎng)站添加到書簽

微信客服

購課補(bǔ)貼
聯(lián)系客服咨詢優(yōu)惠詳情

幫助反饋 APP下載

慕課網(wǎng)APP
您的移動(dòng)學(xué)習(xí)伙伴

公眾號(hào)

掃描二維碼
關(guān)注慕課網(wǎng)微信公眾號(hào)

第七色在线视频,2021少妇久久久久久久久久,亚洲欧洲精品成人久久av18,亚洲国产精品特色大片观看完整版,孙宇晨将参加特朗普的晚宴

熱搜

最近搜索清空

使用 dict.items() 為大型數(shù)據(jù)集優(yōu)化字典查找

使用 dict.items() 為大型數(shù)據(jù)集優(yōu)化字典查找

2 回答

添加回答