
Selenium: scraping according to the number of pages in each category of a website

繁花如伊 2023-09-12 19:53:35
I am scraping this site: http://www.legorafi.fr/ It works for every category (politique, etc.), but for each category I loop through the same fixed number of pages. I would like to scrape all the pages according to how many pages each category actually has on the site. This is how I loop through the pages:

import time
import requests
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.action_chains import ActionChains
import newspaper
from newspaper.utils import BeautifulSoup
from newspaper import Article

categories = ['france/politique', 'france/societe', 'monde-libre', 'france/economie/', 'culture', 'people', 'sports', 'hi-tech', 'sciences']
papers = []
driver = webdriver.Chrome(executable_path="/Users/name/Downloads/chromedriver 4")

for category in categories:
    url = 'http://www.legorafi.fr/category/' + category
    driver.get(url)
    Foo()
    time.sleep(2)
    pagesToGet = 120

pagesToGet = 120
title = []
content = []
for page in range(1, pagesToGet + 1):
    print('Processing page :', page)
    print(driver.current_url)
    raw_html = requests.get(url)
    soup = BeautifulSoup(raw_html.text, 'html.parser')
    for articles_tags in soup.findAll('div', {'class': 'articles'}):
        for article_href in articles_tags.find_all('a', href=True):
            if not str(article_href['href']).endswith('#commentaires'):
                urls_set.add(article_href['href'])
                papers.append(article_href['href'])

I want to loop through all of these categories, according to the number of pages in each one:

categories = ['france/politique', 'france/societe', 'monde-libre', 'france/economie/', 'culture', 'people', 'sports', 'hi-tech', 'sciences']

How can I do that?
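One way to discover how many pages each category has is to probe the paginated URLs directly. This is a minimal sketch, assuming (as on many WordPress sites, which Le Gorafi appears to be) that `/page/N` past the last page returns HTTP 404; the `fetch` parameter is injectable so the probe can be exercised without the network:

```python
import requests

BASE = 'http://www.legorafi.fr/category/'

def default_fetch(url):
    # Return the HTTP status code for a URL (200 = page exists, 404 = past the end).
    return requests.get(url).status_code

def find_last_page(category, fetch=default_fetch, max_pages=1000):
    # Double the page number until we overshoot, then binary-search
    # for the boundary between the last real page and the first 404.
    lo, hi = 1, 2
    while hi <= max_pages and fetch(BASE + category + '/page/' + str(hi)) == 200:
        lo, hi = hi, hi * 2
    while lo + 1 < hi:
        mid = (lo + hi) // 2
        if fetch(BASE + category + '/page/' + str(mid)) == 200:
            lo = mid
        else:
            hi = mid
    return lo
```

With the last page known per category, the inner loop becomes `for page in range(1, find_last_page(category) + 1)` instead of a hardcoded `pagesToGet = 120`.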

1 Answer

慕斯709654

Contributed 1,840 experience points · earned 5+ likes

The code below iterates through all the categories and extracts the data. It definitely needs more testing and some enhanced error handling.


P.S. Good luck with this coding project.


import time
from datetime import datetime
from random import randint

import requests
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException

from newspaper.utils import BeautifulSoup
from newspaper import Article

chrome_options = Options()
chrome_options.add_argument("--test-type")
chrome_options.add_argument('--ignore-certificate-errors')
chrome_options.add_argument('--disable-extensions')
chrome_options.add_argument('disable-infobars')
chrome_options.add_argument("--incognito")
# chrome_options.add_argument('--headless')
# A window-size argument is required in headless mode:
# chrome_options.add_argument('window-size=1920x1080')

driver = webdriver.Chrome(service=Service('/usr/local/bin/chromedriver'),
                          options=chrome_options)

papers = []
urls_set = set()


def get_articles(url):
    driver.get(url)
    while True:
        # Parse the page the browser is currently on (not the original
        # category URL), so every paginated page actually gets read.
        raw_html = requests.get(driver.current_url)
        soup = BeautifulSoup(raw_html.text, 'html.parser')
        for articles_tags in soup.findAll('div', {'class': 'articles'}):
            for article_href in articles_tags.find_all('a', href=True):
                href = article_href['href']
                if href.endswith('#commentaires') or href in urls_set:
                    continue
                urls_set.add(href)
                article = Article(href)
                article.download()
                article.parse()
                if article.url is not None:
                    # publish_date may be None or carry a timezone, so format
                    # the datetime object directly instead of re-parsing str().
                    publish_date = (article.publish_date.strftime('%Y-%m-%d')
                                    if article.publish_date else None)
                    papers.append({'url': href,
                                   'title': article.title,
                                   'publish_date': publish_date,
                                   'text': article.text.replace('\n', '')})
        try:
            next_link = driver.find_element(By.LINK_TEXT, "Suivant")
        except NoSuchElementException:
            # No "Suivant" (next) link: this category has no more pages.
            return
        driver.execute_script("arguments[0].scrollIntoView(true);", next_link)
        next_link.click()
        # Random wait to prevent the harvesting operation from
        # starting before the page has completely loaded.
        time.sleep(randint(2, 4))


legorafi_urls = {'monde-libre': 'http://www.legorafi.fr/category/monde-libre',
                 'politique': 'http://www.legorafi.fr/category/france/politique',
                 'societe': 'http://www.legorafi.fr/category/france/societe',
                 'economie': 'http://www.legorafi.fr/category/france/economie',
                 'culture': 'http://www.legorafi.fr/category/culture',
                 'people': 'http://www.legorafi.fr/category/people',
                 'sports': 'http://www.legorafi.fr/category/sports',
                 'hi-tech': 'http://www.legorafi.fr/category/hi-tech',
                 'sciences': 'http://www.legorafi.fr/category/sciences',
                 'ledito': 'http://www.legorafi.fr/category/ledito/'
                 }

for category, url in legorafi_urls.items():
    get_articles(url)

driver.quit()
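As an alternative to clicking "Suivant" until it disappears, the total page count can often be read straight from the pagination links at the bottom of a category page. A hedged sketch follows: the `page-numbers` class is an assumption based on common WordPress themes, not confirmed from the site's actual markup, so verify the selector in the browser's inspector first:

```python
from bs4 import BeautifulSoup

def max_page_number(html):
    # Collect every numeric pagination link on the page and return the
    # largest, defaulting to 1 when no pagination markup is present.
    soup = BeautifulSoup(html, 'html.parser')
    numbers = [int(a.get_text()) for a in soup.find_all('a', class_='page-numbers')
               if a.get_text().strip().isdigit()]
    return max(numbers, default=1)
```

Fetching a category's first page once and calling `max_page_number` on it would give the per-category page count up front, replacing the open-ended `while True` loop with a bounded `range`.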



Answered 2023-09-12