
Is there a way to optimize the for loop? Selenium takes a very long time to scrape 38 pages

慕碼人2483693 2023-12-20 10:14:01
I am trying to scrape https://arxiv.org/search/?query=healthcare&searchtype=all with Selenium and Python. The for loop takes too long to execute. I tried scraping with a headless browser and PhantomJS, but it does not scrape the abstract field (the abstract needs to be expanded by clicking the "More" button).

import pandas as pd
import selenium
import re
import time
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver import Firefox

browser = Firefox()
url_healthcare = 'https://arxiv.org/search/?query=healthcare&searchtype=all'
browser.get(url_healthcare)

dfs = []

for i in range(1, 39):
    articles = browser.find_elements_by_tag_name('li[class="arxiv-result"]')
    for article in articles:
        title = article.find_element_by_tag_name('p[class="title is-5 mathjax"]').text
        arxiv_id = article.find_element_by_tag_name('a').text.replace('arXiv:', '')
        arxiv_link = article.find_elements_by_tag_name('a')[0].get_attribute('href')
        pdf_link = article.find_elements_by_tag_name('a')[1].get_attribute('href')
        authors = article.find_element_by_tag_name('p[class="authors"]').text.replace('Authors:', '')

        # Expand the abstract by clicking the "More" link when present
        try:
            link1 = browser.find_element_by_link_text('▽ More')
            link1.click()
        except:
            time.sleep(0.1)

        abstract = article.find_element_by_tag_name('p[class="abstract mathjax"]').text
        date = article.find_element_by_tag_name('p[class="is-size-7"]').text
        date = re.split(r"Submitted|;", date)[1]
        tag = article.find_element_by_tag_name('div[class="tags is-inline-block"]').text.replace('\n', ',')

        try:
            doi = article.find_element_by_tag_name('div[class="tags has-addons"]').text
            doi = re.split(r'\s', doi)[1]
        except NoSuchElementException:
            doi = 'None'

        all_combined = [title, arxiv_id, arxiv_link, pdf_link, authors, abstract, date, tag, doi]
        dfs.append(all_combined)

    print('Finished Extracting Page:', i)

2 Answers

qq_花開花謝_0

Contributed 1835 experience points · earned 7+ upvotes

The following implementation gets this done in 16 seconds.

To speed up execution, I made the following changes:

  • Removed Selenium entirely (no clicking required)

  • For the abstract, used BeautifulSoup's output and processed it afterwards

  • Added multiprocessing to speed the process up significantly

from multiprocessing import Process, Manager
import requests
from bs4 import BeautifulSoup
import re
import time

start_time = time.time()


def get_no_of_pages(showing_text):
    # e.g. "Showing 1-200 of 1,890 results for all: healthcare"
    no_of_results = int(re.findall(r"(\d+,*\d+) results for all", showing_text)[0].replace(',', ''))
    pages = no_of_results // 200 + 1
    print("total pages:", pages)
    return pages


def clean(text):
    return text.replace("\n", '').replace("  ", '')


def get_data_from_page(url, page_number, data):
    print("getting page", page_number)
    response = requests.get(url + "start=" + str(page_number * 200))
    soup = BeautifulSoup(response.content, "lxml")

    arxiv_results = soup.find_all("li", {"class": "arxiv-result"})

    for arxiv_result in arxiv_results:
        paper = {}
        paper["titles"] = clean(arxiv_result.find("p", {"class": "title is-5 mathjax"}).text)
        links = arxiv_result.find_all("a")
        paper["arxiv_ids"] = links[0].text.replace('arXiv:', '')
        paper["arxiv_links"] = links[0].get('href')
        paper["pdf_link"] = links[1].get('href')
        paper["authors"] = clean(arxiv_result.find("p", {"class": "authors"}).text.replace('Authors:', ''))

        # The full abstract already sits in the HTML after the "▽ More"
        # marker, so no clicking is needed.
        split_abstract = arxiv_result.find("p", {"class": "abstract mathjax"}).text.split("▽ More\n\n\n", 1)
        if len(split_abstract) == 2:
            paper["abstract"] = clean(split_abstract[1].replace("△ Less", ''))
        else:
            paper["abstract"] = clean(split_abstract[0].replace("△ Less", ''))

        paper["date"] = re.split(r"Submitted|;", arxiv_result.find("p", {"class": "is-size-7"}).text)[1]
        paper["tag"] = clean(arxiv_result.find("div", {"class": "tags is-inline-block"}).text)

        doi = arxiv_result.find("div", {"class": "tags has-addons"})
        if doi is None:
            paper["doi"] = "None"
        else:
            paper["doi"] = re.split(r'\s', doi.text)[1]

        data.append(paper)

    print(f"page {page_number} done")


if __name__ == "__main__":
    url = 'https://arxiv.org/search/?searchtype=all&query=healthcare&abstracts=show&size=200&order=-announced_date_first&'

    response = requests.get(url + "start=0")
    soup = BeautifulSoup(response.content, "lxml")

    with Manager() as manager:
        data = manager.list()  # shared list, written to by all worker processes
        processes = []
        get_data_from_page(url, 0, data)

        showing_text = soup.find("h1", {"class": "title is-clearfix"}).text
        for i in range(1, get_no_of_pages(showing_text)):
            p = Process(target=get_data_from_page, args=(url, i, data))
            p.start()
            processes.append(p)

        for p in processes:
            p.join()

        print("Number of entries scraped:", len(data))

        stop_time = time.time()
        print("Time taken:", stop_time - start_time, "seconds")

Output:


$ python test.py
getting page 0
page 0 done
total pages: 10
getting page 1
getting page 4
getting page 2
getting page 6
getting page 5
getting page 3
getting page 7
getting page 9
getting page 8
page 9 done
page 4 done
page 1 done
page 6 done
page 2 done
page 7 done
page 3 done
page 5 done
page 8 done
Number of entries scraped: 1890
Time taken: 15.911492586135864 seconds
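
One process per page is fine here because there are only about 10 pages; for much larger jobs a multiprocessing.Pool would cap the number of concurrent requests. If you want the records in the same shape as the question's dfs list, here is a minimal follow-up sketch, assuming the data list populated above and run inside the with Manager() block while it is still alive (the output file name is hypothetical):

import pandas as pd

# Minimal sketch: turn the shared list of dicts into a DataFrame.
# list(data) copies the records out of the Manager proxy first.
df = pd.DataFrame(list(data))
df.to_csv("arxiv_healthcare.csv", index=False)  # hypothetical output file
print(df.shape)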


Answered 2023-12-20
白衣非少年

Contributed 1155 experience points · earned 0 upvotes

You can try the BeautifulSoup approach below for your requirement. There is no need to click the "More" link.


from requests import get
from bs4 import BeautifulSoup

# You can change the size parameter to retrieve all the results in one shot.
url = 'https://arxiv.org/search/?query=healthcare&searchtype=all&abstracts=show&order=-announced_date_first&size=50&start=0'
response = get(url, verify=False)  # verify=False skips TLS certificate checks
soup = BeautifulSoup(response.content, "lxml")

queryresults = soup.find_all("li", attrs={"class": "arxiv-result"})

for result in queryresults:
    title = result.find("p", attrs={"class": "title is-5 mathjax"})
    print(title.text)

# If you need the full abstract content, read the expanded span instead --
# you do not need to click on the "More" button.
for result in queryresults:
    abstractFullContent = result.find("span", attrs={"class": "abstract-full has-text-grey-dark mathjax"})
    print(abstractFullContent.text)

Output:


Interpretable Deep Learning for Automatic Diagnosis of 12-lead Electrocardiogram
Leveraging Technology for Healthcare and Retaining Access to Personal Health Data to Enhance Personal Health and Well-being
Towards new forms of particle sensing and manipulation and 3D imaging on a smartphone for healthcare applications
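
The size=50 URL above only returns the first page of results. If you need all of them, here is a minimal sketch along the same lines (assuming arXiv's search UI accepts size=200 per request, which is worth verifying) that walks the start offset until an empty page comes back:

from requests import get
from bs4 import BeautifulSoup

url_tpl = ('https://arxiv.org/search/?query=healthcare&searchtype=all'
           '&abstracts=show&order=-announced_date_first&size=200&start={}')

titles = []
start = 0
while True:
    soup = BeautifulSoup(get(url_tpl.format(start)).content, "lxml")
    results = soup.find_all("li", attrs={"class": "arxiv-result"})
    if not results:  # an empty page means we are past the last result
        break
    for result in results:
        titles.append(result.find("p", attrs={"class": "title is-5 mathjax"}).text.strip())
    start += 200

print(len(titles), "titles scraped")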



Answered 2023-12-20