慕的地6264312
2022-06-22 18:00:09
I want to collect links such as /hmarchhak/102217 from a site (https://www.vanglaini.org/) and print them as https://www.vanglaini.org/hmarchhak/102217. Please help. This is my code:
import requests
import pandas as pd
from bs4 import BeautifulSoup

source = requests.get('https://www.vanglaini.org/').text
soup = BeautifulSoup(source, 'lxml')
for article in soup.find_all('article'):
    headline = article.a.text
    summary = article.p.text
    link = article.a.href
    print(headline)
    print(summary)
    print(link)
    print()
1 answer

慕無忌1623718
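Unless I am missing something, the headline and summary appear to be the same text. You can use :has (available with bs4 4.7.1+) to make sure each article has a descendant with an href attribute; this also seems to filter out the article tag elements that are not part of the main body, which I suspect is actually your goal.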
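import re

import requests
from bs4 import BeautifulSoup as bs

base = 'https://www.vanglaini.org'
r = requests.get(base)
soup = bs(r.content, 'lxml')

# :has([href]) keeps only <article> elements that contain a descendant with an href
for article in soup.select('article:has([href])'):
    headline = article.h5.text.strip()
    # replace each run of newlines or carriage returns with a space
    summary = re.sub(r'\n+|\r+', ' ', article.p.text.strip())
    # article.a.href looks for a child <href> tag and returns None; index the attribute instead
    link = f"{base}{article.a['href']}"
    print(headline)
    print(summary)
    print(link)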
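As an alternative to concatenating base and the href by hand, the standard library's urllib.parse.urljoin can resolve the link. This is only a minimal sketch: the helper name absolute_link and the BASE constant are my own, not part of the code above, and it assumes the site's hrefs are either root-relative (like /hmarchhak/102217) or already absolute.

```python
from urllib.parse import urljoin

# Assumed site root; the trailing slash makes relative paths resolve from the root.
BASE = "https://www.vanglaini.org/"


def absolute_link(href, base=BASE):
    """Resolve a possibly relative href against the base URL (hypothetical helper)."""
    return urljoin(base, href)


if __name__ == "__main__":
    # A root-relative href is resolved against the site root.
    print(absolute_link("/hmarchhak/102217"))
    # An href that is already absolute is returned unchanged.
    print(absolute_link("https://example.com/page"))
```

Inside the loop you would then write link = absolute_link(article.a['href']). The advantage over plain string formatting is that doubled or missing slashes and already-absolute links are handled for you.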