2 Answers

The following implementation achieves this in 16 seconds. To speed up execution, I did the following:

- Removed Selenium entirely (no clicking is needed)
- For the abstract, used BeautifulSoup's output directly and processed it afterwards
- Added multiprocessing to speed the process up significantly
from multiprocessing import Process, Manager
import requests
from bs4 import BeautifulSoup
import re
import time

start_time = time.time()

def get_no_of_pages(showing_text):
    # Parse the result count out of text like "Showing 1-200 of 1,890 results for all".
    no_of_results = int(re.findall(r"(\d+,*\d+) results for all", showing_text)[0].replace(',', ''))
    pages = no_of_results // 200 + 1
    print("total pages:", pages)
    return pages

def clean(text):
    # Strip newlines and the double spaces left over from the HTML layout.
    return text.replace("\n", '').replace("  ", '')

def get_data_from_page(url, page_number, data):
    print("getting page", page_number)
    response = requests.get(url + "start=" + str(page_number * 200))
    soup = BeautifulSoup(response.content, "lxml")

    arxiv_results = soup.find_all("li", {"class": "arxiv-result"})
    for arxiv_result in arxiv_results:
        paper = {}
        paper["titles"] = clean(arxiv_result.find("p", {"class": "title is-5 mathjax"}).text)
        links = arxiv_result.find_all("a")
        paper["arxiv_ids"] = links[0].text.replace('arXiv:', '')
        paper["arxiv_links"] = links[0].get('href')
        paper["pdf_link"] = links[1].get('href')
        paper["authors"] = clean(arxiv_result.find("p", {"class": "authors"}).text.replace('Authors:', ''))
        # The full abstract is already in the HTML; the "▽ More" marker just
        # separates the teaser from the rest, so no clicking is required.
        split_abstract = arxiv_result.find("p", {"class": "abstract mathjax"}).text.split("▽ More\n\n\n", 1)
        if len(split_abstract) == 2:
            paper["abstract"] = clean(split_abstract[1].replace("△ Less", ''))
        else:
            paper["abstract"] = clean(split_abstract[0].replace("△ Less", ''))
        paper["date"] = re.split(r"Submitted|;", arxiv_result.find("p", {"class": "is-size-7"}).text)[1]
        paper["tag"] = clean(arxiv_result.find("div", {"class": "tags is-inline-block"}).text)
        doi = arxiv_result.find("div", {"class": "tags has-addons"})
        if doi is None:
            paper["doi"] = "None"
        else:
            paper["doi"] = re.split(r'\s', doi.text)[1]
        data.append(paper)

    print(f"page {page_number} done")

if __name__ == "__main__":
    url = 'https://arxiv.org/search/?searchtype=all&query=healthcare&abstracts=show&size=200&order=-announced_date_first&'
    response = requests.get(url + "start=0")
    soup = BeautifulSoup(response.content, "lxml")
    with Manager() as manager:
        data = manager.list()  # shared list so the worker processes can append to it
        processes = []
        get_data_from_page(url, 0, data)
        showing_text = soup.find("h1", {"class": "title is-clearfix"}).text
        for i in range(1, get_no_of_pages(showing_text)):
            p = Process(target=get_data_from_page, args=(url, i, data))
            p.start()
            processes.append(p)
        for p in processes:
            p.join()
        print("Number of entries scraped:", len(data))
        stop_time = time.time()
        print("Time taken:", stop_time - start_time, "seconds")
Output:
>>> python test.py
getting page 0
page 0 done
total pages: 10
getting page 1
getting page 4
getting page 2
getting page 6
getting page 5
getting page 3
getting page 7
getting page 9
getting page 8
page 9 done
page 4 done
page 1 done
page 6 done
page 2 done
page 7 done
page 3 done
page 5 done
page 8 done
Number of entries scraped: 1890
Time taken: 15.911492586135864 seconds
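If you want to keep the results rather than just time the run, the shared data list can be copied out and written to CSV. A minimal sketch, assuming a papers.csv output file and that it is placed inside the "with Manager() as manager:" block after the joins:

import csv

rows = list(data)  # copy the entries out of the manager proxy
if rows:
    with open("papers.csv", "w", newline='', encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=rows[0].keys())
        writer.writeheader()
        writer.writerows(rows)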

You can try the Beautiful Soup approach below, depending on your requirements. There is no need to click the "More" link.
from requests import get
from bs4 import BeautifulSoup

# You can change the size parameter to retrieve more results in one shot.
url = 'https://arxiv.org/search/?query=healthcare&searchtype=all&abstracts=show&order=-announced_date_first&size=50&start=0'
response = get(url, verify=False)  # verify=False skips TLS certificate checks
soup = BeautifulSoup(response.content, "lxml")

queryresults = soup.find_all("li", attrs={"class": "arxiv-result"})
for result in queryresults:
    title = result.find("p", attrs={"class": "title is-5 mathjax"})
    print(title.text)

# If you need the full abstract content, try this (you do not need to click
# the "More" button; the full text is already present in the page).
for result in queryresults:
    abstractFullContent = result.find("span", attrs={"class": "abstract-full has-text-grey-dark mathjax"})
    print(abstractFullContent.text)
Output:
Interpretable Deep Learning for Automatic Diagnosis of 12-lead Electrocardiogram
Leveraging Technology for Healthcare and Retaining Access to Personal Health Data to Enhance Personal Health and Well-being
Towards new forms of particle sensing and manipulation and 3D imaging on a smartphone for healthcare applications
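Note that the arXiv search page only offers page sizes up to 200, so retrieving "all the results at one shot" is capped; beyond that you still have to step through the start parameter. A minimal sequential sketch under that assumption (the multiprocessing answer above does the same thing in parallel):

from requests import get
from bs4 import BeautifulSoup

# Walk every result page by stepping "start" in increments of the page size.
base = ('https://arxiv.org/search/?query=healthcare&searchtype=all'
        '&abstracts=show&order=-announced_date_first&size=200&start=')
all_results = []
start = 0
while True:
    page_soup = BeautifulSoup(get(base + str(start)).content, "lxml")
    results = page_soup.find_all("li", attrs={"class": "arxiv-result"})
    if not results:  # an empty page means we are past the last one
        break
    all_results.extend(results)
    start += 200
print(len(all_results), "results collected")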