Read a Word document and get the text under each heading

Python

哈士奇WWW 2021-05-12 09:41:00

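I have a Microsoft Word document, and I need to extract its text and structure it into a data frame, one row per section of the document. Each section starts with a heading, and the headings are formatted in Word as "Heading 2". For example:

    This is section one
    This is the text of the first section.

    This is the second section of the document
    This is the content of the second section.

I need the text of each section in a data frame, where column A holds the section name and column B holds the section text.

I am new to Python and am trying the docx package, but the only thing I have managed so far is to get the full text, using a function I found on Stack Overflow. The function (readDocx):

    #! python3
    from docx import Document

    def getText(filename):
        doc = Document(filename)
        fullText = []
        for para in doc.paragraphs:
            fullText.append(para.text)
        return '\n'.join(fullText)

The code that gets the text:

    import readDocx
    test = readDocx.getText('THE FILE.docx')

I was also able to find a loop that identifies the headings. The question is how to iterate over the document and get each heading and its text into a data frame:

    from docx import Document
    from docx.shared import Inches

    docs = Document("THE FILE.docx")
    for paragraph in docs.paragraphs:
        if paragraph.style.name == 'Heading 2':
            print(paragraph.text)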
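
One possible way to go from the heading loop above to a data frame is sketched below. It is not a definitive answer, only a minimal sketch under some assumptions: pandas is installed alongside python-docx, every section heading uses the "Heading 2" style, and any text before the first heading can be dropped. The names group_by_heading and docx_to_dataframe and the column names "section" and "text" are made up for illustration. The idea is to walk the paragraphs in order, start a new section whenever a heading is seen, and append every other non-empty paragraph to the section that is currently open.

```python
def group_by_heading(paragraphs, heading_style="Heading 2"):
    """Group (style_name, text) pairs into one record per heading.

    Paragraphs that appear before the first heading are skipped
    (an assumption of this sketch), and empty paragraphs are ignored.
    """
    sections = []
    current = None
    for style_name, text in paragraphs:
        if style_name == heading_style:
            # A heading opens a new section.
            current = {"section": text, "text": []}
            sections.append(current)
        elif current is not None and text.strip():
            # Any other non-empty paragraph belongs to the open section.
            current["text"].append(text)
    return [
        {"section": s["section"], "text": "\n".join(s["text"])}
        for s in sections
    ]


def docx_to_dataframe(path, heading_style="Heading 2"):
    """Read a .docx file and return one DataFrame row per section."""
    # Imported here so group_by_heading can be tried on plain tuples
    # without python-docx or pandas installed.
    import pandas as pd
    from docx import Document

    doc = Document(path)
    pairs = [(p.style.name, p.text) for p in doc.paragraphs]
    return pd.DataFrame(
        group_by_heading(pairs, heading_style),
        columns=["section", "text"],
    )
```

With the file from the question, this would be called as df = docx_to_dataframe("THE FILE.docx"). Note that doc.paragraphs only covers body paragraphs, so text inside tables would not be picked up by this sketch.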
