沈無奇 · Updated 2020-09-23

Combining Scrapy with Selenium

In this section we combine Scrapy and Selenium to crawl data on books about web crawlers from JD.com.

1. Requirements Analysis and Initial Implementation

Our goal is to use Scrapy together with Selenium to crawl every book returned by searching JD.com for "网络爬虫" (web crawler). The data looks like this:

[Figure: JD.com search results for "网络爬虫" (web crawler)]

The search returns 9,800+ results across 100 pages, and we want to capture all the books related to web crawlers. One thing to watch for: the 100 pages of results are bound to contain duplicates, so we deduplicate on each book's detail-page URL. The fields we extract for each book are:

book title;
price;
number of reviews;
shop name;
detail-page URL.
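
The record layout and the URL-based deduplication described above can be sketched in plain Python. This is only an illustration under stated assumptions: the `BookRecord` class and the `dedupe_by_url` helper are hypothetical names introduced here, and the sample values are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BookRecord:
    """One search result, using the five fields listed above."""
    book_name: str
    price: str
    comments: str
    shop_name: str
    book_detail_url: str


def dedupe_by_url(records):
    """Keep the first record seen for each detail URL, preserving order."""
    seen = set()
    unique = []
    for record in records:
        if record.book_detail_url in seen:
            continue  # same detail page already collected
        seen.add(record.book_detail_url)
        unique.append(record)
    return unique


# Made-up sample data: the first two entries share a detail URL.
sample = [
    BookRecord("Book A", "59.00", "100+", "Shop X", "//item.example/1.html"),
    BookRecord("Book A (repeat)", "59.00", "100+", "Shop X", "//item.example/1.html"),
    BookRecord("Book B", "79.00", "20+", "Shop Y", "//item.example/2.html"),
]
unique_books = dedupe_by_url(sample)
```

Keying on the detail URL rather than the title is the choice the text makes: titles can differ slightly between listings, while the detail page identifies the product.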

With the requirements clear, we can start combining Selenium and Scrapy. First, how would we crawl the data with Selenium alone? There is one catch: after clicking the search button, only 30 items are shown. More items load only once you scroll down, bringing the total to 60, and only then do you reach the pagination controls. In Selenium, the following lines scroll the page to the bottom:

height = driver.execute_script("return document.body.scrollHeight;")
driver.execute_script(f"window.scrollBy(0, {height})")
time.sleep(2)

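As you can see, the code mainly executes JavaScript statements. The first line reads the page height, that is, the position of the bottom of the page, and the second calls scrollBy() to move the scrollbar down to it. Next comes data extraction. Open the browser's developer tools (F12) to inspect the page, and every field we need can be located with an XPath expression. Based on that, I wrote a function that extracts the book data from the page source: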

from lxml import etree


def parse_book_data(html):
    etree_html = etree.HTML(html)
    # Get the list of result items
    gl_items = etree_html.xpath('//div[@id="J_goodsList"]/ul/li')
    print('Total items found: {}'.format(len(gl_items)))
    res = []
    for item in gl_items:
        # The title is the <em> text plus any nested <font> fragments
        book_name_em = item.xpath('.//div[@class="p-name"]/a/em/text()')[0]
        book_name_font = item.xpath('.//div[@class="p-name"]/a/em/font/text()')
        book_name_font = "".join(book_name_font) if book_name_font else ""
        # Book title
        book_name = f"{book_name_em}{book_name_font}"
        # URL of the book's detail page
        book_detail_url = item.xpath('.//div[@class="p-name"]/a/@href')[0]
        # Price
        price = item.xpath('.//div[@class="p-price"]/strong/i/text()')[0]
        # Number of reviews
        comments = item.xpath('.//div[@class="p-commit"]/strong/a/text()')[0]
        # Shop name (may be missing, so fall back to an empty string)
        shop_name = item.xpath('.//div[@class="p-shopnum"]/a/text()')
        shop_name = shop_name[0] if shop_name else ""

        data = {}

        data['book_name'] = book_name
        data['book_detail_url'] = book_detail_url
        data['price'] = price
        data['comments'] = comments
        data['shop_name'] = shop_name

        res.append(data)
    # Return the parsed results for this page
    print('Results for this page: {}'.format(res))
    return res
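
Indexing an XPath result with [0] raises IndexError whenever a field is missing from an item. A minimal sketch of two defensive helpers follows. Both `first_or_default` and `absolute_url` are hypothetical names, and the idea that detail links may be relative or protocol-relative (starting with `//`) is an assumption rather than something the page source above confirms.

```python
from urllib.parse import urljoin


def first_or_default(values, default=""):
    """Return the first XPath result, stripped, or `default` for an empty list."""
    return values[0].strip() if values else default


def absolute_url(href, base="https://search.jd.com/"):
    """Resolve a relative or protocol-relative href against the search page URL."""
    return urljoin(base, href)


# The product path below is a made-up example.
price = first_or_default([" 59.00 "])
missing_shop = first_or_default([])
detail_url = absolute_url("//item.jd.com/123.html")
```

Applied in the parser, `first_or_default(item.xpath(...))` would replace each bare `[0]` so that one incomplete item does not abort the whole page.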

Now, how do we use Selenium to visit the results page by page? Here is my code:

def get_page_data(driver, page):
    """
    :driver the WebDriver instance
    :page   page number to fetch
    """
    # Navigate to the requested page
    if page > 1:
        WebDriverWait(driver, 10).until(
            EC.visibility_of_element_located((By.ID, 'J_bottomPage'))
        )
        # find_element(By.XPATH, ...) replaces find_element_by_xpath(), which Selenium 4 removed
        driver.find_element(By.XPATH, f'//div[@id="J_bottomPage"]/span/a[text()="{page}"]').click()
        time.sleep(2)
    
    # Scroll to the bottom so the remaining books on the page are loaded
    height = driver.execute_script("return document.body.scrollHeight;")
    driver.execute_script(f"window.scrollBy(0, {height})")
    time.sleep(2)
    return parse_book_data(driver.page_source)
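
A single scroll followed by a fixed two-second sleep can miss items when the page loads slowly. One alternative, sketched here as an assumption rather than the author's method, is to keep scrolling until the page height stops growing. The helper name `scroll_until_stable` and its `pause` and `max_rounds` parameters are hypothetical; the function only relies on the driver's `execute_script()` method.

```python
import time


def scroll_until_stable(driver, pause=1.0, max_rounds=10):
    """Scroll to the bottom repeatedly until the page height stops growing.

    `driver` only needs an execute_script() method, as a Selenium
    WebDriver has. Returns the final page height.
    """
    last_height = driver.execute_script("return document.body.scrollHeight;")
    for _ in range(max_rounds):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give lazily loaded items time to render
        new_height = driver.execute_script("return document.body.scrollHeight;")
        if new_height == last_height:
            break  # nothing new was loaded, so the page is complete
        last_height = new_height
    return last_height
```

In get_page_data() this call would stand in for the two execute_script() lines and the sleep that follows them. The `max_rounds` cap keeps the loop from running forever on a page that never stops growing.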

The first page is reached by typing the keyword "网络爬虫" and clicking the search button, so this function does not need to navigate to it; it only scrolls to the bottom to load all the books. For page 2 onward, we use Selenium's simulated mouse click on the corresponding page number to jump to that page, then scroll to the bottom to get the full page of results. Here is the complete implementation:

import time

from selenium import webdriver
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By

from lxml import etree

def get_page_data(driver, page):
    """
    :driver the WebDriver instance
    :page   page number to fetch
    """
    # See the implementation above
    # ...
    
    
def parse_book_data(html):
    """
    Parse the book data from the page source
    """
    # See the implementation above
    # ...
    

options = webdriver.ChromeOptions()
options.add_experimental_option("excludeSwitches", ['enable-automation'])
# Selenium 3 style; Selenium 4 takes service=Service(path) instead of executable_path
driver = webdriver.Chrome(options=options, executable_path="C:/Users/Administrator/AppData/Local/Google/Chrome/Application/chromedriver.exe")
driver.maximize_window()

driver.get("https://www.jd.com/")

# Type the keyword "网络爬虫" (web crawler), then click the search button
driver.find_element(By.ID, 'key').send_keys('网络爬虫')
# The attribute value "serachbox" is kept exactly as in the original selector
driver.find_elements(By.XPATH, '//div[@role="serachbox"]/button')[0].click()

time.sleep(2)

max_page = 100
for i in range(1, max_page + 1):
    get_page_data(driver, i)

Let's look at the result of running the code. To finish quickly, I set max_page to 10 so that only 10 pages of search results are fetched, 600 items in total:

The run shows that every page yields 60 items.

2. Crawling JD Book Data with Scrapy and Selenium

Next we adapt the code above to work with the Scrapy framework. The first step is to create the project:

# Create the crawler project
PS D:\shencong\scrapy-lessons\code\chap17> scrapy startproject jd_books
# ...
# Change into the spiders directory and create the spider file with genspider
PS D:\shencong\scrapy-lessons\code\chap17\jd_books\jd_books\spiders> scrapy genspider jd www.jd.com

With the project created, write the JdBooksItem class in items.py. This is straightforward: just declare the fields defined earlier:

import scrapy


class JdBooksItem(scrapy.Item):
    # One Field per attribute extracted from the search results
    book_name = scrapy.Field()
    price = scrapy.Field()
    comments = scrapy.Field()
    shop_name = scrapy.Field()
    book_detail_url = scrapy.Field()

The hard part of this project is crawling the next page. With Selenium alone we could click the page number to move on, but that approach does not fit well in Scrapy. Analyzing JD's search URLs shows that they can be reduced to the form https://search.jd.com/Search?keyword=<search keyword>&page=<page number * 2 - 1>, so we only need to supply the keyword and the requested page number. For example, the second page of results corresponds to page=3. The figure illustrates this:

[Figure: JD search URL parameters]
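
The page-number mapping can be checked with a few lines of standard-library Python. This is a minimal sketch of the URL pattern described above; the function name `build_search_url` is hypothetical, and the keyword is percent-encoded with `urllib.parse.quote`.

```python
from urllib.parse import quote

BASE_URL = "https://search.jd.com/Search?keyword={}&page={}"


def build_search_url(keyword, page_number):
    """Build the search URL for a 1-based result page number."""
    if page_number < 1:
        raise ValueError("page_number must be >= 1")
    # The page parameter is (page number * 2 - 1): 1, 3, 5, ...
    return BASE_URL.format(quote(keyword), page_number * 2 - 1)


first_page = build_search_url("Python", 1)
third_page = build_search_url("Python", 3)
```

Here `first_page` ends in `page=1` and `third_page` in `page=5`, so consecutive result pages step the parameter by two.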

So we add two parameters to settings.py: the search keyword and the maximum number of pages to crawl:

# settings.py
# ...
KEYWORD = "网络爬虫"
MAX_PAGE = 10

We can then build the URL for each page and hand the requests to Scrapy's engine and scheduler. The Spider code is as follows:

# File: jd_books/jd_books/spiders/jd.py

from urllib.parse import quote
from scrapy import Spider, Request

from jd_books.items import JdBooksItem


class JdSpider(Spider):
    name = 'jd'
    # Search requests go to search.jd.com, so allow the whole jd.com domain
    allowed_domains = ['jd.com']
    start_urls = ['http://www.jd.com/']
    base_url = "https://search.jd.com/Search?keyword={}&page={}"

    def start_requests(self):
        keyword = self.settings.get('KEYWORD', "Python")
        # getint() falls back to 10 pages when MAX_PAGE is not set
        max_page = self.settings.getint('MAX_PAGE', 10)
        for page in range(1, max_page + 1):
            # JD's page parameter is (page number * 2 - 1)
            url = self.base_url.format(quote(keyword), page * 2 - 1)
            yield Request(url=url, callback=self.parse_books, dont_filter=True)

    def parse_books(self, response):
        goods_list = response.xpath('//div[@id="J_goodsList"]/ul/li')
        print('Books found on this page: {}'.format(len(goods_list)))
        for good in goods_list:
            book_name_em = good.xpath('.//div[@class="p-name"]/a/em/text()').extract()[0]
            book_name_font = good.xpath('.//div[@class="p-name"]/a/em/font/text()').extract()
            book_name_font = "".join(book_name_font) if book_name_font else ""
            book_name = f"{book_name_em}{book_name_font}"
            book_detail_url = good.xpath('.//div[@class="p-name"]/a/@href').extract()[0]
            price = good.xpath('.//div[@class="p-price"]/strong/i/text()').extract()[0]
            comments = good.xpath('.//div[@class="p-commit"]/strong/a/text()').extract()[0]
            # The shop name may be missing, so fall back to an empty string
            shop_name = good.xpath('.//div[@class="p-shopnum"]/a/text()').extract()
            shop_name = shop_name[0] if shop_name else ""
            
            item = JdBooksItem()
            item['book_name'] = book_name
            item['book_detail_url'] = book_detail_url
            item['price'] = price
            item['comments'] = comments
            item['shop_name'] = shop_name

            yield item

The code above simply generates the Request for each page (the start_requests() method) and parses the page data (the parse_books() method). The parsing depends entirely on getting the complete page source, so how do we use Selenium inside Scrapy to request a URL and obtain that source? The answer is a downloader middleware. We write a downloader middleware that intercepts outgoing requests; for requests for JD book data it switches to Selenium to fetch the page, then wraps the rendered source in a Response and returns it. The generated Scrapy project already contains a middlewares.py file, and we implement the idea there:

import time

from scrapy import signals
from scrapy.http.response.html import HtmlResponse

from selenium import webdriver
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By

# useful for handling different item types with a single interface
from itemadapter import is_item, ItemAdapter

options = webdriver.ChromeOptions()
# Headless mode: Chrome runs without opening a visible browser window
options.add_argument('--headless')
options.add_experimental_option("excludeSwitches", ['enable-automation'])

class JdBooksSpiderMiddleware:
    # Unchanged from the generated template
    # ...
    

class JdBooksDownloaderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    def __init__(self):
        self.driver = webdriver.Chrome(options=options, executable_path="C:/Users/Administrator/AppData/Local/Google/Chrome/Application/chromedriver.exe")

    # ...

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.

        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        if request.url.startswith("https://search.jd.com/Search"):
            # JD book search request: fetch the page with Selenium instead
            print('Fetching page with selenium: {}'.format(request.url))
            self.driver.get(request.url)
            time.sleep(2)
            # Scroll to the bottom to load the full 60 items on the page
            height = self.driver.execute_script("return document.body.scrollHeight;")
            self.driver.execute_script(f"window.scrollBy(0, {height})")
            time.sleep(2)
            # Return the rendered page source as the response
            return HtmlResponse(url=request.url, body=self.driver.page_source, request=request, encoding='utf-8', status=200)
        # Any other request falls through to Scrapy's normal downloader
        return None
        
    # ...

Next, enable this downloader middleware in settings.py:

DOWNLOADER_MIDDLEWARES = {
   'jd_books.middlewares.JdBooksDownloaderMiddleware': 543,
}

Finally, we handle storage, again using MongoDB to save the crawled data. Testing shows that JD's 100 pages of search results contain quite a few duplicates, so our item pipelines have two jobs: deduplication and saving. Here is the code:

import pymongo
from scrapy.exceptions import DropItem


class JdBooksPipeline:
    def open_spider(self, spider):
        # Database.authenticate() was removed in PyMongo 4, so the
        # credentials are passed to MongoClient instead
        self.client = pymongo.MongoClient(
            host='47.115.61.209', port=27017,
            username="admin", password="shencong1992", authSource="admin",
        )
        db = self.client.scrapy_manual
        self.collection = db.jd_books

    def process_item(self, item, spider):
        try:
            book_info = {
                'book_name': item['book_name'],
                'comments': item['comments'],
                'book_detail_url': item['book_detail_url'],
                'shop_name': item['shop_name'],
                'price': item['price'],
            }
            self.collection.insert_one(book_info)
        except Exception as e:
            print("Failed to insert data: {}".format(str(e)))
        return item

    def close_spider(self, spider):
        self.client.close()


class DuplicatePipeline:
    """
    Drop duplicate items. A duplicate raises DropItem and never
    reaches the next pipeline stage.
    """
    def __init__(self):
        self.book_url_set = set() 

    def process_item(self, item, spider):
        if item['book_detail_url'] in self.book_url_set:
            print('Duplicate search result: book={}, url={}'.format(item['book_name'], item['book_detail_url']))
            raise DropItem('duplicate book info, drop it')
        self.book_url_set.add(item['book_detail_url'])
        return item

We use the Item's book_detail_url field to decide whether an item is a duplicate. Both item pipelines also need to be enabled in settings.py, with DuplicatePipeline running before JdBooksPipeline (pipelines with lower numbers run first):

ITEM_PIPELINES = {
   'jd_books.pipelines.DuplicatePipeline': 200,
   'jd_books.pipelines.JdBooksPipeline': 300,
}
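
To see why the ordering matters, here is a small simulation of a pipeline chain in plain Python. It is a sketch only, not Scrapy's implementation: `DropItem` is a local stand-in for scrapy.exceptions.DropItem, and `DuplicateStage`, `StoreStage` and `run_pipelines` are hypothetical names. Stages run in ascending priority order, and a dropped item never reaches the later stages.

```python
class DropItem(Exception):
    """Local stand-in for scrapy.exceptions.DropItem so the sketch runs without Scrapy."""


class DuplicateStage:
    """Drops any item whose detail URL was already seen."""
    def __init__(self):
        self.seen = set()

    def process_item(self, item):
        if item["book_detail_url"] in self.seen:
            raise DropItem("duplicate")
        self.seen.add(item["book_detail_url"])
        return item


class StoreStage:
    """Collects items in a list instead of writing to MongoDB."""
    def __init__(self):
        self.saved = []

    def process_item(self, item):
        self.saved.append(item)
        return item


def run_pipelines(items, priorities):
    """Pass each item through the stages in ascending priority order."""
    ordered = sorted(priorities, key=priorities.get)
    for item in items:
        try:
            for stage in ordered:
                item = stage.process_item(item)
        except DropItem:
            continue  # a dropped item skips all remaining stages


dedupe, store = DuplicateStage(), StoreStage()
items = [{"book_detail_url": "u1"}, {"book_detail_url": "u1"}, {"book_detail_url": "u2"}]
run_pipelines(items, {store: 300, dedupe: 200})
```

With the deduplication stage at 200 and the storage stage at 300, only two of the three items are saved. Swapping the two numbers would store the duplicate before it could be dropped.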

The last step is to stop obeying the robots.txt protocol:

ROBOTSTXT_OBEY = False

That completes our project for crawling JD book data with Scrapy and Selenium. For a quick demonstration, we set the maximum page count to 10 and ran the code to see the actual crawl results.

3. Summary

In this section we combined Scrapy and Selenium to build a crawler for JD book data. The example shows how well Scrapy integrates with third-party tools, including the Splash service covered earlier.
