
How do I scrape data from a web page with dynamic HTML (Python)?

JavaScript

炎炎設計 2023-05-25 17:26:12

I'm trying to figure out how to scrape the data from the following URL:

https://www.aap.org/en-us/advocacy-and-policy/aap-health-initiatives/nicuverification/Pages/NICUSearch.aspx

The data on that page all appears to be populated from a database and loaded into the web page via JavaScript. I've done something similar in the past using selenium and PhantomJS, but I can't figure out how to get these data fields in Python. As expected, I can't use pd.read_html for this kind of problem.

Is it possible to parse the result of:

    from selenium import webdriver

    url = "https://www.aap.org/en-us/advocacy-and-policy/aap-health-initiatives/nicuverification/Pages/NICUSearch.aspx"
    browser = webdriver.PhantomJS()
    browser.get(url)
    content = browser.page_source

Or is it possible to access the actual underlying data? If not, what other approaches are there, short of several hours of copying and pasting?

EDIT: Based on the answer below from @thenullptr, I have been able to access the material, but only on page 1. How can I adapt this to span all of the pages [suggestions for parsing it properly are welcome]? My end goal is to get this into a pandas DataFrame.

    import pandas as pd
    import requests
    from bs4 import BeautifulSoup

    r = requests.post(
        url='https://search.aap.org/nicu/',
        data={'SearchCriteria.Level': '1', 'X-Requested-With': 'XMLHttpRequest'},  # key: value
    )
    html = r.text

    # Parse the HTML that follows the last <script> block
    soup = BeautifulSoup(html.split("</script>")[-1].strip(), "html.parser")
    div = soup.find("div", {"id": "main"})
    div = soup.find_all("div", {"class": "blue-border panel list-group"})

    def f(x):
        # Split a panel's text into non-empty lines and drop the UI labels
        ignore_fields = ['Collapse all', 'Expand all']
        output = list(filter(bool, map(str.strip, x.text.split("\n"))))
        output = list(filter(lambda x: x not in ignore_fields, output))
        return output

    # Only the first panel of page 1 ends up in the Series
    results = pd.Series(list(map(f, div))[0])
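On the follow-up about covering every page, one possible shape for the loop is sketched below. It is a sketch under assumptions, not a confirmed solution: the form field name `Page` is hypothetical (the real name would have to be read from the requests the site makes in the browser's network tab), and the helper names `clean_lines`, `scrape_all_pages`, `fetch_page` and `to_dataframe` are made up for illustration. The loop simply asks for page 1, 2, 3, ... and stops when a page comes back empty or identical to the previous one.

```python
from typing import Callable, List


def clean_lines(text: str, ignore=("Collapse all", "Expand all")) -> List[str]:
    """Split panel text into stripped, non-empty lines, dropping UI labels."""
    lines = [line.strip() for line in text.split("\n")]
    return [line for line in lines if line and line not in ignore]


def scrape_all_pages(fetch: Callable[[int], List[List[str]]],
                     max_pages: int = 100) -> List[List[str]]:
    """Call fetch(1), fetch(2), ... until a page is empty or repeats."""
    records: List[List[str]] = []
    previous = None
    for page in range(1, max_pages + 1):
        rows = fetch(page)
        # An empty or repeated page is treated as "no more results".
        if not rows or rows == previous:
            break
        records.extend(rows)
        previous = rows
    return records


def fetch_page(page: int) -> List[List[str]]:
    """Fetch and parse one page of results from the search endpoint."""
    # Imported here so the helpers above work without third-party packages.
    import requests
    from bs4 import BeautifulSoup

    r = requests.post(
        "https://search.aap.org/nicu/",
        data={
            "SearchCriteria.Level": "1",
            "X-Requested-With": "XMLHttpRequest",
            "Page": str(page),  # ASSUMPTION: hypothetical field name
        },
        timeout=30,
    )
    r.raise_for_status()
    soup = BeautifulSoup(r.text.split("</script>")[-1].strip(), "html.parser")
    panels = soup.find_all("div", {"class": "blue-border panel list-group"})
    return [clean_lines(panel.text) for panel in panels]


def to_dataframe(records: List[List[str]]):
    """Turn one-list-per-panel records into a pandas DataFrame."""
    import pandas as pd

    # Rows of unequal length are padded with missing values by pandas.
    return pd.DataFrame(records)
```

With those pieces, `df = to_dataframe(scrape_all_pages(fetch_page))` would give one row per result panel. Passing the fetcher in as an argument keeps the paging logic testable without touching the network.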

