My new project is to extract data from the Naxos Glossary of Musical Terms. It is a great resource, and I want to process its text and load it into a database for use on another, simpler website I will create. My only problem is the awful XHTML formatting: the W3C XHTML validator reports 318 errors and 54 warnings. Even an HTML tidier I found can't fix all of them.

I'm using Python 3.6.7, and the page I am parsing is an ASP page. I have tried both LXML and Python's built-in XML module, and both failed.

Can anyone suggest another tidier or module? Or will I have to fall back on some form of raw text manipulation (yuck!)?

My code:

LXML:

    from lxml import etree
    file = open("glossary.asp", "r", encoding="ISO-8859-1")
    parsed = etree.parse(file)

Error:

    Traceback (most recent call last):
      File "/media/skuzzyneon/STORE-1/naxos_dict/xslt_test.py", line 4, in <module>
        parsed = etree.parse(file)
      File "src/lxml/etree.pyx", line 3426, in lxml.etree.parse
      File "src/lxml/parser.pxi", line 1861, in lxml.etree._parseDocument
      File "src/lxml/parser.pxi", line 1881, in lxml.etree._parseFilelikeDocument
      File "src/lxml/parser.pxi", line 1776, in lxml.etree._parseDocFromFilelike
      File "src/lxml/parser.pxi", line 1187, in lxml.etree._BaseParser._parseDocFromFilelike
      File "src/lxml/parser.pxi", line 601, in lxml.etree._ParserContext._handleParseResultDoc
      File "src/lxml/parser.pxi", line 711, in lxml.etree._handleParseResult
      File "src/lxml/parser.pxi", line 640, in lxml.etree._raiseParseError
      File "/media/skuzzyneon/STORE-1/naxos_dict/glossary.asp", line 25
    lxml.etree.XMLSyntaxError: EntityRef: expecting ';', line 25, column 128

Python XML (using the tidied XHTML):

    import xml.etree.ElementTree as ET
    file = open("tidy.html", "r", encoding="ISO-8859-1")
    root = ET.fromstring(file.read())
    # Top-level elements
    print(root.findall("."))

Error:

    Traceback (most recent call last):
      File "/media/skuzzyneon/STORE-1/naxos_dict/xslt_test.py", line 4, in <module>
        root = ET.fromstring(file.read())
      File "/usr/lib/python3.6/xml/etree/ElementTree.py", line 1314, in XML
        parser.feed(text)
      File "<string>", line None
    xml.etree.ElementTree.ParseError: undefined entity: line 526, column 33
Parsing badly formed XHTML
喵喵時(shí)光機(jī)
2021-09-25 16:52:04
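
Both tracebacks come from strict XML parsers. `EntityRef: expecting ';'` is what an XML parser reports for an `&` that does not start a complete entity reference, and `undefined entity` is what `xml.etree.ElementTree` reports for a named entity that XML itself does not predefine (XML only predefines `&amp;`, `&lt;`, `&gt;`, `&quot;` and `&apos;`). A lenient HTML parser avoids both problems. The sketch below uses the standard library's `html.parser`; the class name `TagTextCollector`, the helper `extract_text`, and the sample markup are hypothetical and only illustrate the idea, since the real structure of the glossary page is not shown in the question. lxml also ships an HTML parser (`lxml.html`, or `etree.HTMLParser`) that may be a better fit if tree navigation or XPath is needed.

```python
from html.parser import HTMLParser


class TagTextCollector(HTMLParser):
    """Collect the text inside every (non-nested) element with a given tag name."""

    def __init__(self, tag):
        # convert_charrefs=True decodes known entities such as &eacute;
        # and leaves anything it cannot decode (e.g. a bare '&') untouched.
        super().__init__(convert_charrefs=True)
        self.tag = tag
        self.inside = False
        self.buffer = []
        self.results = []

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.inside = True
            self.buffer = []

    def handle_endtag(self, tag):
        if tag == self.tag and self.inside:
            self.results.append("".join(self.buffer).strip())
            self.inside = False

    def handle_data(self, data):
        if self.inside:
            self.buffer.append(data)


def extract_text(markup, tag):
    """Return the text of each <tag> element found in possibly malformed markup."""
    parser = TagTextCollector(tag)
    parser.feed(markup)
    parser.close()
    return parser.results


# Hypothetical sample: a bare ampersand, an HTML-only named entity,
# a void <br>, and a <p> that is never closed.
SAMPLE = (
    "<p><b>Theme & Variations</b> a bare ampersand<br>"
    "<b>Caf&eacute; music</b> a named HTML entity"
    "<p><b>Unclosed paragraph</b>"
)

if __name__ == "__main__":
    print(extract_text(SAMPLE, "b"))
    # ['Theme & Variations', 'Café music', 'Unclosed paragraph']
```

For the real page, the file could be read with `open("glossary.asp", encoding="ISO-8859-1").read()` as in the question and passed to `extract_text`, with the tag name adjusted to whatever element actually wraps each glossary term.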