3 answers

Contributed 1827 experience points, received over 4 likes
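Use the ElementTree library. (Note that my answer uses a core Python library, whereas the other answers use external libraries.)
To grab the first three sentences, just add these lines to your code: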
section = soup.find('section', class_="article_text post")  # find the <section> tag with class "article_text post"
txt = section.p.text  # text of the first <p> tag inside that section
print(txt)
Output:
Many people will land on this page after learning that their email address has appeared in a data breach I've called "Collection #1". Most of them won't have a tech background or be familiar with the concept of credential stuffing so I'm going to write this post for the masses and link out to more detailed material for those who want to go deeper.
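The lines above call `find` on a `soup` object, which is the Beautiful Soup style of lookup. For comparison, here is a minimal sketch of the same lookup using only the standard library's ElementTree. It assumes the page is well-formed XML/XHTML, which real-world HTML often is not, and the `html` string is a hypothetical stand-in for the actual page:

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the page; the real markup may differ.
html = """<html><body>
<section class="article_text post">
<p>First paragraph with a <a href="#">link</a> inside.</p>
<p>Second paragraph.</p>
</section>
</body></html>"""

root = ET.fromstring(html)

# ElementTree's limited XPath supports [@attr='value'] predicates.
section = root.find(".//section[@class='article_text post']")

# First <p> child of the section; itertext() also collects text inside nested tags.
first_p = section.find("p")
txt = "".join(first_p.itertext())
print(txt)
```

If the page is not well-formed, `ET.fromstring` raises `ParseError`, in which case an HTML-tolerant parser is probably the better choice.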
Hope this helps!

Contributed 1797 experience points, received over 4 likes
Another approach.
from simplified_scrapy import SimplifiedDoc, utils, req
# Basic
xml = '''<ROOT><A><B><C>The Value</C></B></A></ROOT>'''
doc = SimplifiedDoc(xml)
print (doc.select('A>B>C'))
# Multiple
xml = '''<ROOT><A><B><C>The Value 1</C></B></A><A><B><C>The Value 2</C></B></A></ROOT>'''
doc = SimplifiedDoc(xml)
# print (doc.selects('A').select('B').select('C'))
print (doc.selects('A').select('B>C'))
# Mixed structure
xml = '''<ROOT><A><other>no B</other></A><A><other></other><B>no C</B></A><A><B><C>The Value</C></B></A></ROOT>'''
doc = SimplifiedDoc(xml)
nodes = doc.selects('A').selects('B').select('C')
for node in nodes:
    for c in node:
        if c:
            print(c)
Result:
{'tag': 'C', 'html': 'The Value'}
[{'tag': 'C', 'html': 'The Value 1'}, {'tag': 'C', 'html': 'The Value 2'}]
{'tag': 'C', 'html': 'The Value'}

Contributed 1856 experience points, received over 17 likes
You can use lxml, which you can install with pip install lxml.
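As a rough sketch of what that could look like (not necessarily the code this answer originally had in mind), the snippet below pulls the text of every `A > B > C` element from a small sample document. The sample `xml` string is made up for illustration, and the import falls back to the standard library's ElementTree, whose API lxml largely mirrors, in case lxml is not installed:

```python
try:
    from lxml import etree  # third-party: pip install lxml
except ImportError:
    # Standard-library fallback; supports the subset of the API used below.
    import xml.etree.ElementTree as etree

# Illustrative sample: the middle <A> has no <B>, so it contributes nothing.
xml = '''<ROOT><A><B><C>The Value 1</C></B></A><A><other>no B</other></A><A><B><C>The Value 2</C></B></A></ROOT>'''

root = etree.fromstring(xml)

# The path is relative to the root element (<ROOT>): each A child, then B, then C.
values = [c.text for c in root.findall('./A/B/C')]
print(values)
```

Because `findall` simply skips branches that do not match the path, no explicit check for a missing `B` or `C` is needed.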