
How to parse text from a tag that contains "<?"

Python

嗶嗶one 2023-02-12 19:00:31

My goal is to get the text: 27. The method according to claim 23 wherein...

How do I go about retrieving the text inside a tag that contains <?. From searching Google, I believe these are called PHP short tags. I am using lxml and XPath, and they just don't seem to register it as a tag or node. I tried itertext(), but it didn't work well.

<claim id="CLM-00027" num="00027">
<claim-text> <?insert-start id="REI-00005" date="20191203" ?>27. The method according to claim 23 wherein the amorphous metal is selected from the group consisting of Zr based alloys, Ti based alloys, Al based alloys, Fe based alloys, La based alloys, Cu based alloys, Mg based alloys, Pt based alloys, and Pd based alloys. <?insert-end id="REI-00005" ?></claim-text>
</claim>


2 Answers

UYOU


Here is a snippet that uses XPath to reach the deepest "valid" tag, then from there uses getchildren() and tail to get down to the actual text.

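# "import lxml" alone does not load the etree submodule, so import it explicitly
from lxml import etree

xml = """ <claim id="CLM-00027" num="00027">
<claim-text> <?insert-start id="REI-00005" date="20191203" ?>27. The method according to claim 23 wherein the amorphous metal is selected from the group consisting of Zr based alloys, Ti based alloys, Al based alloys, Fe based alloys, La based alloys, Cu based alloys, Mg based alloys, Pt based alloys, and Pd based alloys. <?insert-end id="REI-00005" ?></claim-text>
</claim>"""

root = etree.fromstring(xml)

# XPath returns a list of the matching <claim-text> elements
e = root.xpath("/claim/claim-text")

# The first child is the <?insert-start ...?> node; the text that follows it is stored in its tail.
# getchildren() is deprecated in lxml; e[0][0].tail is equivalent.
res = e[0].getchildren()[0].tail

print(res)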

Output:

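27. The method according to claim 23 wherein the amorphous metal is selected from the group consisting of Zr based alloys, Ti based alloys, Al based alloys, Fe based alloys, La based alloys, Cu based alloys, Mg based alloys, Pt based alloys, and Pd based alloys.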
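For reference, the <?insert-start ... ?> and <?insert-end ... ?> markers are XML processing instructions, which is why lxml does not treat them as ordinary elements. The sketch below shows two alternative ways to reach the same text without relying on child position; it is only an illustration, and the variable names (claim_text, inserted_text) are my own rather than anything from the original answer.

```python
from lxml import etree

xml = """<claim id="CLM-00027" num="00027">
<claim-text> <?insert-start id="REI-00005" date="20191203" ?>27. The method according to claim 23 wherein the amorphous metal is selected from the group consisting of Zr based alloys, Ti based alloys, Al based alloys, Fe based alloys, La based alloys, Cu based alloys, Mg based alloys, Pt based alloys, and Pd based alloys. <?insert-end id="REI-00005" ?></claim-text>
</claim>"""

root = etree.fromstring(xml)

# Option 1: XPath string() concatenates the text nodes under <claim-text>;
# processing instructions contribute nothing to the result.
claim_text = root.xpath("string(/claim/claim-text)").strip()

# Option 2: select the processing instruction by its target name
# and read the text that follows it (its tail).
pi = root.xpath("/claim/claim-text/processing-instruction('insert-start')")[0]
inserted_text = pi.tail.strip()

print(claim_text)
print(inserted_text)
```

Option 1 is the simpler choice when you want all the text of the claim; option 2 may be preferable if only the text after a specific marker matters.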

Answered 2023-02-12

守著一只汪


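Access the specific child node by index. The standard-library ElementTree parser discards processing instructions by default, so the text that follows <?insert-start ...?> ends up in the .text of <claim-text>.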

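from xml.etree import ElementTree as ET

# Replace with the path to your XML file
tree = ET.parse('path_to_your.xml')

root = tree.getroot()

# root is <claim>; root[0] is its first child, <claim-text>
print(root[0].text)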

Output:

27. The method according to claim 23 wherein the amorphous metal is selected from the group consisting of Zr based alloys, Ti based alloys, Al based alloys, Fe based alloys, La based alloys, Cu based alloys, Mg based alloys, Pt based alloys, and Pd based alloys.
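A self-contained variant of the same idea, parsing from a string instead of a file, might look like the sketch below. It assumes the default standard-library parser, which drops processing instructions, so <claim-text> ends up with no children; the names xml_data and text are illustrative only.

```python
import xml.etree.ElementTree as ET

xml_data = """<claim id="CLM-00027" num="00027">
<claim-text> <?insert-start id="REI-00005" date="20191203" ?>27. The method according to claim 23 wherein the amorphous metal is selected from the group consisting of Zr based alloys, Ti based alloys, Al based alloys, Fe based alloys, La based alloys, Cu based alloys, Mg based alloys, Pt based alloys, and Pd based alloys. <?insert-end id="REI-00005" ?></claim-text>
</claim>"""

root = ET.fromstring(xml_data)

# Look the element up by name rather than by position.
claim_text = root.find("claim-text")

# With the default parser the processing instructions are discarded,
# so <claim-text> has no child nodes and all of its text is joined together.
text = "".join(claim_text.itertext()).strip()

print(len(claim_text))
print(text)
```

Looking the element up by name with find() is a little more robust than root[0] if the real file has other elements before <claim-text>.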

Answered 2023-02-12