
Parsing different elements within a single class with BeautifulSoup

Python

RISEBY 2022-07-05 17:09:34

Background: I'm fairly experienced with Python, but a complete beginner with BeautifulSoup.

I'm trying to pull three values out of a single element. The page I'm working with consists of a series of elements like this:

<blockquote><a name="title"><p><B>Title</b> <table frame="hsides" border="1" cellspacing="0" cellpadding="2" bordercolor="darkblue"><tr><td><font face="arial" size="2" color="#0000CC"><b><I>Subtitle</I>: Top Text.</b></font></td></tr></table> Body Text.<a name="title2".... etc</blockquote>

At the moment I just dump all the text into a list like this:

page_html = soup(page, 'html.parser')
text = []
for a in page_html.select('a'):
    text.append(a.text)

This returns one result per line, each looking like:

Title Subtitle: Top Text. Body Text.

What I really want is to parse each a into one row of a DataFrame, like this:

col1   col2                 col3
Title  Subtitle: Top Text.  Body Text.

But frankly, I'm in a bit over my head.


2 Answers

湖上湖

Contributed 2003 experience points · received more than 2 likes

If all of your <a> tags have the same structure, you can use:

from bs4 import BeautifulSoup
import pandas as pd

page = '''<blockquote>
<a name="title"><p><B>Title</b> <table frame="hsides" border="1" cellspacing="0" cellpadding="2" bordercolor="darkblue"><tr><td><font face="arial" size="2" color="#0000CC"><b><I>Subtitle</I>: Top Text.</b></font></td></tr></table> Body Text.</blockquote>
'''

soup = BeautifulSoup(page, "html.parser")

rows = []
for anchor in soup.find_all('a'):
    paragraph = anchor.find('p')
    # The first <b> holds the title; the second <b> (inside the table) holds the subtitle.
    title = anchor.find('b').text
    subtitle = anchor.find_all('b')[1].text
    # Keep only the strings that are direct children of <p>, i.e. the body text.
    other = ''.join(paragraph.find_all(string=True, recursive=False)).strip()
    rows.append({'col1': title, 'col2': subtitle, 'col3': other})

# Build the DataFrame once, with one row per <a> tag.
df = pd.DataFrame(rows)
print(df)

Output:

    col1                 col2        col3
0  Title  Subtitle: Top Text.  Body Text.
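The indexing in this answer assumes every <a> tag contains a <p>, a title <b>, and a second <b> inside a table; if an entry lacks one of these, `find_all('b')[1]` or `paragraph.find_all(...)` raises an exception. A minimal sketch of a more defensive variant is shown here. The helper name `parse_entry` and the two-entry sample markup (with a second entry that has no subtitle table) are assumptions made for illustration and do not come from the original page.

```python
from bs4 import BeautifulSoup

# Sample markup is an assumption: two entries in the question's shape,
# the second one deliberately missing its subtitle table.
page = '''<blockquote>
<a name="title"><p><b>Title</b> <table><tr><td><font><b><i>Subtitle</i>: Top Text.</b></font></td></tr></table> Body Text.</p></a>
<a name="title2"><p><b>Title 2</b> Body Text 2.</p></a>
</blockquote>'''


def parse_entry(anchor):
    """Return (title, subtitle, body) for one <a> tag; missing parts become ''."""
    # Fall back to the anchor itself if there is no <p> wrapper.
    container = anchor.find('p') or anchor
    title_tag = anchor.find('b')
    subtitle_tag = anchor.select_one('table b')
    # Only strings that are direct children of the container form the body.
    body = ''.join(container.find_all(string=True, recursive=False)).strip()
    return (
        title_tag.get_text().strip() if title_tag else '',
        subtitle_tag.get_text().strip() if subtitle_tag else '',
        body,
    )


soup = BeautifulSoup(page, 'html.parser')
rows = [parse_entry(a) for a in soup.find_all('a')]
for row in rows:
    print(row)
```

Selecting the subtitle with `select_one('table b')` rather than by position means an entry without a table yields an empty string instead of an IndexError.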

2022-07-05

慕的地6264312

Contributed 1817 experience points · received more than 6 likes

Using only the HTML snippet you shared:

from bs4 import BeautifulSoup

content = '<a name="title"><p><B>Title</b> ' \
          '<table frame="hsides" border="1" cellspacing="0" cellpadding="2" bordercolor="darkblue">' \
          '<tr><td><font face="arial" size="2" color="#0000CC"><b><I>Subtitle</I>: Top Text.</b></font>' \
          '</td></tr></table> Body Text.'

soup = BeautifulSoup(content, 'html.parser')
articles = soup.find_all('a')

for article in articles:
    paragraph = article.find('p')
    print({
        # The first <b> inside the anchor is the title.
        'title': article.find('b').text,
        # The <i> inside the table holds the subtitle label only.
        'subtitle': article.select('table i')[0].text,
        # Strings that are direct children of <p> make up the body text.
        'body': ''.join(paragraph.find_all(string=True, recursive=False))
    })

Since the question is mainly about BeautifulSoup rather than pandas, I think a dictionary is enough; you can put it into a DataFrame or another data structure yourself.

Result:

{'title': 'Title', 'subtitle': 'Subtitle', 'body': ' Body Text.'}
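For the DataFrame the question asks for, one option is to collect the dictionaries in a list and pass that list to `pd.DataFrame`. The following is a sketch under a few assumptions rather than part of this answer: the variable name `records` is made up here, the table attributes are dropped from the sample for brevity, the subtitle is taken from `table b` so that it includes ": Top Text." as in the question's desired output, and the body is stripped of surrounding whitespace.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Same shape as the snippet in the question, with table attributes omitted.
content = ('<a name="title"><p><B>Title</b> '
           '<table><tr><td><font><b><I>Subtitle</I>: Top Text.</b></font>'
           '</td></tr></table> Body Text.')

soup = BeautifulSoup(content, 'html.parser')

records = []  # one dictionary per <a> tag
for article in soup.find_all('a'):
    paragraph = article.find('p')
    records.append({
        'title': article.find('b').get_text().strip(),
        # 'table b' covers the whole bold cell, not just the <i> label.
        'subtitle': article.select('table b')[0].get_text().strip(),
        # Direct string children of <p> only, so table text is excluded.
        'body': ''.join(paragraph.find_all(string=True, recursive=False)).strip(),
    })

# A list of dictionaries becomes one row per dictionary.
df = pd.DataFrame(records, columns=['title', 'subtitle', 'body'])
print(df)
```

Building the frame once from a list avoids creating a separate one-row DataFrame on every loop iteration.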

2022-07-05
