
Cleaning text data extracted from a scanned .pdf

Python

人到中年有點甜 2021-12-17 15:49:19

I am writing a script that extracts text from scanned PDFs and builds a JSON dictionary to load into MongoDB later. All of the text is extracted successfully with tesseract-ocr through the textract module, but by the time Python reads it, all of the whitespace on the PDF has been converted to '\n', which makes it very hard to pull out the necessary information.

I tried a bunch of code to clean it up, but the result is still not very readable. It also gets rid of all the colons, which I think would have made identifying keys and values much easier.

stringedText = str(text)
cleanText = rmStop.replace('\n','')
splitText = re.split(r'\W+', cleanText)
caseingText = [word.lower() for word in splitText]
cleanOne = [word for word in caseingText if word != 'n']
dexStop = cleanOne.index("od260")
dexStart = cleanOne.index("sheet")
clean = cleanOne[dexStart + 1:dexStop]

I am still left with quite a bit of unclean, half-processed data, and at this point I do not know how to use it.

This is how I extract the data:

text = textract.process(filename, method="tesseract", language="eng")

I also tried nltk, which took out some of the data and made it a little easier to read, but there are still a lot of \n muddling the data. Here is the nltk code:

stringedText = str(text)
stop_words = set(stopwords.words('english'))
tokens = word_tokenize(stringedText)
rmStop = [i for i in tokens if not i in ENGLISH_STOP_WORDS]

This is what I get from the first cleanup I tried:

['n21', 'feb', '2019', 'nsequence', 'lacz', 'rp', 'n5', 'gat', 'ctc', 'tac', 'cat', 'ggc', 'gca', 'cat', 'ttc', 'ccc', 'gaa', 'aag', 'tgc', '3', 'norder', 'no', '15775199', 'nref', 'no', '207335463', 'n25', 'nmole', 'dna', 'oligo', '36', 'bases', 'nproperties', 'amount', 'of', 'oligo', 'shipped', 'to', 'ntm', '50mm', 'nacl', '66', '8', 'xc2', 'xb0c', '11', '0', '32', '6', 'david', 'cook', 'ngc', 'content', '52', '8', 'd260', 'mmoles', 'kansas', 'state', 'university', 'biotechno', 'nmolecular', 'weight', '10', '965', '1', 'nnmoles']

From that I need a JSON array that looks like:

"lacz-rp" : {
    "Date" : "21-feb-2019",
    "Sequence" : "gatctctaccatggcgcacatttccccgaaaagtgc",
    "Order No." : "15775199",
    "Ref No." : "207335463"
}


1 Answer

守著一只汪


You can convert your \n sequences into line breaks. Use the following:

formatted_text = str(text).replace('\\n', '\n')

This replaces the escaped newlines with actual newlines in the output.
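A possible alternative, offered as a sketch rather than as the answer's method: the 'xc2' and 'xb0c' tokens in the question suggest that `textract.process` returned a bytes object and that `str(text)` produced its repr, which is where the literal backslash-n sequences come from. Decoding the bytes instead yields real newlines, after which each line can be matched on its own. The sample bytes below are hypothetical, with a layout guessed from the token list in the question, and `first_match` is an illustrative helper, not part of any library.

```python
import json
import re

# Hypothetical stand-in for the bytes returned by textract.process(...);
# the line layout is an assumption reconstructed from the question's tokens.
raw = (
    b"21 Feb 2019\n"
    b"Sequence - lacZ-RP\n"
    b"5'- GAT CTC TAC CAT GGC GCA CAT TTC CCC GAA AAG TGC -3'\n"
    b"Order No. 15775199\n"
    b"Ref. No. 207335463\n"
    b"Tm (50mM NaCl): 66.8 \xc2\xb0C\n"
)

# str(raw) is the repr "b'...'", so each newline becomes a backslash plus 'n'.
assert "\\n" in str(raw)

# Decoding gives real newlines and real non-ASCII characters (the degree sign).
text = raw.decode("utf-8")
lines = [line.strip() for line in text.splitlines() if line.strip()]


def first_match(pattern, lines):
    """Return group 1 of the first line that matches pattern, or None."""
    for line in lines:
        m = re.search(pattern, line, flags=re.IGNORECASE)
        if m:
            return m.group(1)
    return None


name = first_match(r"^Sequence\W+(\S+)", lines)
date = first_match(r"^(\d{1,2} \w{3} \d{4})$", lines)
seq = first_match(r"^5'-\s*([ACGT ]+?)\s*-3'", lines)

record = {
    name.lower(): {
        "Date": date.replace(" ", "-").lower(),
        "Sequence": seq.replace(" ", "").lower(),
        "Order No.": first_match(r"^Order No\.?\s*(\d+)", lines),
        "Ref No.": first_match(r"^Ref\.? No\.?\s*(\d+)", lines),
    }
}
print(json.dumps(record, indent=2))
```

Working line by line keeps the punctuation that `re.split(r'\W+', ...)` throws away, so labels such as "Order No." stay attached to their values. Real OCR output will likely be noisier than this sample, so the patterns would need adjusting against the actual extracted text.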

2021-12-17
