Unable to scrape image URLs with BeautifulSoup

慕哥6287543 2023-07-18 15:47:34
My scraping code is:

from bs4 import BeautifulSoup
import re

root_tag=["article",{"class":"story"}]
image_tag=["img",{"":""},"org-src"]
header=["h3",{"class":"story-title"}]
news_tag=["a",{"":""},"href"]
txt_data=["p",{"":""}]

import requests
ua1 = 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'
ua2 = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit 537.36 (KHTML, like Gecko) Chrome'
headers = {'User-Agent': ua2,
           'Accept': 'text/html,application/xhtml+xml,application/xml;' \
                     'q=0.9,image/webp,*/*;q=0.8'}
session = requests.Session()
response = session.get("website-link", headers=headers)
webContent = response.content
bs = BeautifulSoup(webContent, 'lxml')
all_tab_data = bs.findAll(root_tag[0], root_tag[1])
output=[]
for div in all_tab_data:
    image_url = None
    div_img = str(div)
    match = re.search(r"(http(s?):)([/|.|\w|\s|-])*\.(?:jpg|gif|png|jpeg)", div_img)
    print(match)
    # match = re.search(r"([^\\s]+(\\.(?i)(jpg|png|gif|bmp))$)",div)
    if match != None:
        image_url = str(match.group(0))
    else:
        image_url = div.find(image_tag[0], image_tag[1]).get(image_tag[2])
    if image_url !=None:
        if image_url[0] == '/' and image_url[1] != '/':
            image_url = main_url + image_url
        if image_url[0] == '/' and image_url[1] == '/':
            image_url="https://" + image_url[2:]
    output.append(image_url)

It only produces one image_url and then fails with the error AttributeError: 'NoneType' object has no attribute 'get'

1 Answer

揚帆大魚

Contributed 1799 experience points, received more than 9 likes

You should probably reuse the parsing library rather than parsing those parts yourself. Consider this approach:


from bs4 import BeautifulSoup

# Each entry: [tag name, attribute filter, (optional) attribute to read]
root_tag = ["article", {"class": "story"}]
image_tag = ["img", {}, "org-src"]
header = ["h3", {"class": "story-title"}]
news_tag = ["a", {}, "href"]
txt_data = ["p", {}]

# Fetch the page once and cache it in a local file named 'output':
# import requests
# ua1 = 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'
# ua2 = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit 537.36 (KHTML, like Gecko) Chrome'
# headers = {'User-Agent': ua2,
#            'Accept': 'text/html,application/xhtml+xml,application/xml;' \
#                      'q=0.9,image/webp,*/*;q=0.8'}
# session = requests.Session()
# response = session.get("https://www.reuters.com/energy-environment", headers=headers)
# webContent = response.content
#
# with open('output', 'wb') as file:
#     file.write(webContent)

# Read the cached page back from disk
with open('output', 'r') as file:
    webContent = file.read()

bs = BeautifulSoup(webContent, 'html.parser')
all_tab_data = bs.find_all(root_tag[0], root_tag[1])

output = []
for div in all_tab_data:
    # Search the already-parsed article instead of running a regex over its string form
    article_images = div.find_all(image_tag[0], image_tag[1])
    # find_all returns an empty list when nothing matches, so no None check is needed here;
    # keep only the images that actually carry the 'org-src' attribute
    output.extend(i.get(image_tag[2]) for i in article_images if i.get(image_tag[2]) is not None)
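
As for the error in the question: `div.find(...)` returns `None` for any article that has no matching `<img>`, so calling `.get` on the result raises the AttributeError. Below is a minimal sketch of a guarded version that still yields one entry per article. The sample HTML, the `example.com` base URL, and the `extract_image_urls` helper name are assumptions made up for illustration, not taken from the real page. The standard library's `urljoin` is swapped in for the manual `/` and `//` prefix handling in the question.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page; "org-src" is the
# attribute name used in the question.
SAMPLE_HTML = """
<article class="story"><img org-src="/images/a.jpg"><h3 class="story-title">A</h3></article>
<article class="story"><h3 class="story-title">No image here</h3></article>
<article class="story"><img org-src="//cdn.example.com/b.png"></article>
<article class="story"><img src="https://example.com/c.gif"></article>
"""


def extract_image_urls(html, base_url, attr="org-src"):
    """Return one absolute image URL (or None) per <article class="story">."""
    soup = BeautifulSoup(html, "html.parser")
    urls = []
    for article in soup.find_all("article", class_="story"):
        img = article.find("img")  # None when the article has no <img>
        raw = img.get(attr) if img is not None else None
        # urljoin resolves both "/path" and "//host/path" against base_url
        urls.append(urljoin(base_url, raw) if raw else None)
    return urls


print(extract_image_urls(SAMPLE_HTML, "https://www.example.com/news/"))
```

With the sample markup this prints `['https://www.example.com/images/a.jpg', None, 'https://cdn.example.com/b.png', None]`: the second article has no image, and the fourth has an `<img>` without the `org-src` attribute, so both come back as `None` instead of raising.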



2023-07-18