
Trying to remove HTML formatting from the source in a DataFrame

Python

德瑪西亞99 2023-09-12 19:04:16

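I have a DataFrame containing the sources of tweets. The source is formatted like this: <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> I am trying to find a way to strip the HTML and keep the URL. I am not very familiar with regular expressions and can't really find a solution. Any help would be great.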


3 Answers

手掌心

Contributed 1942 experience points, received over 3 likes

You can get the URL by first turning the tag into a BeautifulSoup object. If it is already a BeautifulSoup object, you can apply this directly:

.find("a").get("href")

If it is not, you can turn it into a BeautifulSoup object.

from bs4 import BeautifulSoup #pip install beautifulsoup4

# The source string in the format given in the question
a_tag = '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>'

soup = BeautifulSoup(a_tag, "html5lib") #pip install html5lib
print(soup.find("a").get("href"))
#output -> http://twitter.com/download/iphone

Then use this function to strip the HTML, which leaves only the text:

import re

def remove_html_tags(raw_html):
    # Non-greedy match of everything between a "<" and the next ">"
    cleanr = re.compile("<.*?>")
    clean_text = re.sub(cleanr, '', raw_html)
    return clean_text

output = remove_html_tags(a_tag)
print(output)
#output -> Twitter for iPhone
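If installing beautifulsoup4 and html5lib is not an option, the two steps above can be sketched in one pass with the standard library's html.parser. This is only a rough alternative, not the approach from the answer: the names AnchorParser and split_source are made up for illustration, and it assumes each source value holds at most one <a> tag of interest.

```python
from html.parser import HTMLParser


class AnchorParser(HTMLParser):
    """Collects the first href and all visible text from an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.href = None
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        if tag == "a" and self.href is None:
            self.href = dict(attrs).get("href")

    def handle_data(self, data):
        # Text between tags, e.g. "Twitter for iPhone"
        self.text_parts.append(data)


def split_source(raw_html):
    """Return (url, text) for a fragment such as a tweet's source field."""
    parser = AnchorParser()
    parser.feed(raw_html)
    parser.close()  # flush any buffered text
    return parser.href, "".join(parser.text_parts).strip()


source = '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>'
url, label = split_source(source)
print(url)
print(label)
```

For a DataFrame column, a function like this could be passed to Series.apply, at the cost of one Python-level call per row.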

2023-09-12

BIG陽

Contributed 1859 experience points, received over 6 likes

You can use the Python urlextract module to extract URLs from any string -

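from urlextract import URLExtract #pip install urlextract

# The source string in the format given in the question
text = '''
<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>
'''

# Remove spaces and "=" before running the extractor
text = text.replace(' ', '').replace('=','')

extractor = URLExtract()
print(extractor.find_urls(text))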

Output -

['http://twitter.com/download/iphone']
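Since the question is about a DataFrame column, the same extraction can also be sketched column-wise. The example below does not use urlextract; it relies on pandas string methods with a regular expression instead. The column names source, source_url and source_label are assumptions, as is the second sample row, which is invented for illustration. It also assumes href is always written with double quotes.

```python
import pandas as pd

# Sample data: the first row is the format from the question,
# the second row is a made-up value for illustration only.
df = pd.DataFrame({
    "source": [
        '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
        '<a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a>',
    ]
})

# Capture whatever sits between the quotes of the href attribute
df["source_url"] = df["source"].str.extract(r'href="([^"]*)"', expand=False)

# Drop every <...> tag, keeping only the visible text
df["source_label"] = df["source"].str.replace(r"<[^>]*>", "", regex=True)

print(df[["source_url", "source_label"]])
```

Rows without an href end up as NaN in source_url, so missing values may need handling afterwards.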

2023-09-12

慕姐4208626

Contributed 1852 experience points, received over 7 likes

You can split the string on the double-quote character (") and take the second element.

.split('"')[1]

https://docs.python.org/3/library/stdtypes.html?highlight=split#str.split
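To illustrate the split approach and where it can go wrong, here is a small sketch. It works when href is the first quoted attribute, as in the question's sample, but returns the wrong value if the attributes appear in a different order. The helper names url_by_split and url_by_regex and the reordered sample string are hypothetical; the regex variant is shown only as one possible way to be less order-dependent.

```python
import re

# Matches href="..." and captures the value between the quotes
HREF_RE = re.compile(r'href="([^"]*)"')


def url_by_split(source):
    """Take the second piece after splitting on double quotes."""
    parts = source.split('"')
    return parts[1] if len(parts) > 1 else None


def url_by_regex(source):
    """Look for the href attribute specifically, wherever it appears."""
    match = HREF_RE.search(source)
    return match.group(1) if match else None


sample = '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>'
# Same tag with the attributes swapped (made up to show the failure mode)
reordered = '<a rel="nofollow" href="http://twitter.com/download/iphone">Twitter for iPhone</a>'

print(url_by_split(sample))      # http://twitter.com/download/iphone
print(url_by_split(reordered))   # nofollow
print(url_by_regex(reordered))   # http://twitter.com/download/iphone
```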

2023-09-12
