沈無奇 · Updated 2020-09-16


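A First Look at the Splash Service

Today we look at how the Splash service is used within the Scrapy framework. The target site is again Toutiao's hot news feed, but this time we do not have to work out how the hot news data is fetched or deal with its encryption and decryption. We simply extract the data from the rendered page, which is far less work.

1. Introducing Splash

Splash is a JavaScript rendering service: a lightweight browser with an HTTP API, built on the Twisted and Qt libraries in Python. With it we can also scrape dynamically rendered pages. The simplest and most common way to set up the service is with Docker, so let's go straight to installing and starting Splash on a cloud host.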

Install Docker (see reference [1]). The environment here is CentOS 7.8, and these steps were tested on it:

# Install the required dependency packages
[root@server2 ~]# yum install -y yum-utils device-mapper-persistent-data lvm2
# Add the Docker package repository
[root@server2 ~]# yum-config-manager --add-repo https://download.docker.com/linux/centos/docker-ce.repo
Loaded plugins: fastestmirror
adding repo from: https://download.docker.com/linux/centos/docker-ce.repo
grabbing file https://download.docker.com/linux/centos/docker-ce.repo to /etc/yum.repos.d/docker-ce.repo
repo saved to /etc/yum.repos.d/docker-ce.repo
# Install the latest version of docker
[root@server2 ~]# sudo yum install docker-ce
Loaded plugins: fastestmirror
Loading mirror speeds from cached hostfile
Package 3:docker-ce-19.03.12-3.el7.x86_64 already installed and latest version
Nothing to do

Start the docker service, after which the docker command can be used:

[root@server2 ~]# systemctl start docker

Use docker to install the Splash service:

[root@server2 ~]# sudo docker run -p 8050:8050 scrapinghub/splash
2020-08-02 12:28:27+0000 [-] Log opened.
2020-08-02 12:28:27.980032 [-] Xvfb is started: ['Xvfb', ':1020290545', '-screen', '0', '1024x768x24', '-nolisten', 'tcp']
QStandardPaths: XDG_RUNTIME_DIR not set, defaulting to '/tmp/runtime-splash'
2020-08-02 12:28:28.171896 [-] Splash version: 3.4.1
2020-08-02 12:28:28.249359 [-] Qt 5.13.1, PyQt 5.13.1, WebKit 602.1, Chromium 73.0.3683.105, sip 4.19.19, Twisted 19.7.0, Lua 5.2
2020-08-02 12:28:28.249582 [-] Python 3.6.9 (default, Nov  7 2019, 10:44:02) [GCC 8.3.0]
2020-08-02 12:28:28.249670 [-] Open files limit: 1048576
2020-08-02 12:28:28.249718 [-] Can't bump open files limit
2020-08-02 12:28:28.278146 [-] proxy profiles support is enabled, proxy profiles path: /etc/splash/proxy-profiles
2020-08-02 12:28:28.278310 [-] memory cache: enabled, private mode: enabled, js cross-domain access: disabled
2020-08-02 12:28:28.429778 [-] verbosity=1, slots=20, argument_cache_max_entries=500, max-timeout=90.0
2020-08-02 12:28:28.430058 [-] Web UI: enabled, Lua: enabled (sandbox: enabled), Webkit: enabled, Chromium: enabled
2020-08-02 12:28:28.430491 [-] Site starting on 8050
2020-08-02 12:28:28.430580 [-] Starting factory <twisted.web.server.Site object at 0x7f37918771d0>
2020-08-02 12:28:28.430855 [-] Server listening on http://0.0.0.0:8050

Note: the Splash image was already present on my machine, so docker run started it directly. On a first run, docker pulls the image from the registry before starting it, which takes a little while.

With the steps above complete, we visit port 8050 on the cloud host and look at the page it serves:

Figure: the Splash service home page

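The most important parts of this page are the URL to render and the control script that drives the rendering. Let's try it out; the demonstration video shows the steps.

In the video I simply put the Toutiao hot news URL into the URL box, changed the render wait time to 2 seconds, and clicked the [Render Me!] button. After a moment the rendered hot news page appeared. By default the script returns the HTML, a screenshot, and the request statistics (HAR), all of which show up on the result page. Next, we combine this Splash service with Scrapy to scrape the hot news we just saw.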
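The same render can be requested without the web UI, because Splash exposes it over its HTTP API. The sketch below is a minimal illustration, assuming a Splash instance reachable at `localhost:8050` (a placeholder, not the host used in this tutorial). It builds a request for the `render.html` endpoint with the 2-second wait used above. The helper names `build_render_url` and `fetch_rendered_html` are hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed address of a running Splash instance; replace with your own host.
SPLASH_BASE = "http://localhost:8050"


def build_render_url(page_url, wait=2.0, base=SPLASH_BASE):
    """Build a URL for Splash's render.html endpoint (hypothetical helper)."""
    query = urlencode({"url": page_url, "wait": wait})
    return f"{base}/render.html?{query}"


def fetch_rendered_html(page_url, wait=2.0, base=SPLASH_BASE):
    """Ask Splash to render the page and return the resulting HTML as text."""
    with urlopen(build_render_url(page_url, wait, base), timeout=60) as resp:
        return resp.read().decode("utf-8", errors="replace")
```

Calling `fetch_rendered_html("https://www.toutiao.com/ch/news_hot/")` against a live Splash instance should return the HTML after JavaScript has run, which is what the web UI displays.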

2. The Scrapy-Splash Plugin

We have just seen what Splash can do, and this is exactly what Scrapy cannot do on its own: Scrapy fetches pages without executing JavaScript, so it can scrape static pages but not content that is rendered dynamically. That gap led to the idea of combining Scrapy with Splash, and to the Scrapy-Splash plugin. The plugin works through the Splash HTTP API, so a Splash service must be running when we write and run the spider. The scrapy-splash module calls the API, passing along the page to render and the accompanying script, receives the rendered page, and hands it back to Scrapy for processing.
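To make that hand-off concrete, the sketch below shows roughly what a client sends to Splash's `execute` endpoint: a JSON body carrying the Lua script plus the values the script reads through `args`. It illustrates the HTTP API only and is not scrapy-splash's internal code. The host `localhost:8050` and the helper names are assumptions.

```python
import json
from urllib.request import Request, urlopen

# A small Lua script: load the page, wait, and return the rendered HTML.
LUA_SCRIPT = """
function main(splash, args)
    assert(splash:go(args.url))
    splash:wait(args.wait)
    return splash:html()
end
"""


def build_execute_request(page_url, wait=2.0, base="http://localhost:8050"):
    """Build (but do not send) a POST request for Splash's execute endpoint."""
    payload = {"lua_source": LUA_SCRIPT, "url": page_url, "wait": wait}
    return Request(
        f"{base}/execute",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def render_with_script(page_url, wait=2.0, base="http://localhost:8050"):
    """Send the request and return whatever the Lua script returned."""
    with urlopen(build_execute_request(page_url, wait, base), timeout=60) as resp:
        return resp.read().decode("utf-8", errors="replace")
```

Keeping request construction separate from sending makes it easy to inspect the payload without a running Splash service.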

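Let's turn to the official documentation for how the Scrapy-Splash plugin is used:

Install the scrapy-splash plugin: `pip install scrapy-splash`;

The plugin's configuration and usage in Scrapy are documented in detail on GitHub, and many articles about scrapy-splash are derived from that documentation, so we will not repeat it here and will use the plugin directly in practice. To understand how the configuration is read behind the scenes, you will need to read the plugin's source code.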

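3. Scraping Toutiao Hot News Again

In the previous part we went to great lengths to scrape Toutiao's hot news. Today the scrapy-splash plugin and the Splash service make it easy. We skip the steps for creating the Scrapy project and go straight to the key points:

First, define the items: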

import scrapy


class ToutiaoSpiderItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    source = scrapy.Field()
    comments = scrapy.Field()
    passed_time = scrapy.Field()

Then the core spider code:

from scrapy import Spider
from scrapy_splash.request import SplashRequest

from toutiao_spider.items import ToutiaoSpiderItem

script = """
function main(splash, args)
    assert(splash:go(args.url))
    splash:wait(2)
    return {
        html = splash:html(),
        png = splash:png(),
        har = splash:har(),
    }
end
"""

class TouTiaoSpider(Spider):
    name = "toutiao_spider"

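    def start_requests(self):
        splash_args = {
            "lua_source": script,
            # This is very important: it gives the page time to render
            'wait': 5,
        }
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.112 Safari/537.36'
        }
        # SplashRequest defaults to the render.html endpoint, which honors 'wait';
        # pass endpoint='execute' if the Lua script itself should be run.
        yield SplashRequest(url="https://www.toutiao.com/ch/news_hot/", args=splash_args, headers=headers)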


    def parse(self, response):
        news_list = response.css('div.wcommonFeed ul li')
        print("Found {} entries".format(len(news_list)))
        for news in news_list:
            title = news.css('div.title-box a::text').extract_first()
            source = news.css('div.footer a:nth-child(2)::text').extract_first()
            comments = news.css('div.footer a:nth-child(3)::text').extract_first()
            passed_time = news.css('div.footer span::text').extract_first()

            if not title:
                continue

            items = ToutiaoSpiderItem()
            items['title'] = title.strip()
            # split() followed by join() strips the surrounding \xa0;
            # "or ''" guards against a field that is missing from an entry
            items['source'] = "".join((source or "").split())
            items['comments'] = "".join((comments or "").split())
            items['passed_time'] = "".join((passed_time or "").split())
            print(f'Scraped: {items}')

            # yield items

Compared with the plain Scrapy code, only one thing changes: the Request we generate is replaced by the SplashRequest class from the scrapy-splash plugin. The most important argument of that class is args, which carries the Lua script to execute, the same script we saw on the Splash web page earlier. I also switched to CSS selectors to parse the page; this is no different in effect from the XPath approach.
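The split-then-join idiom in the parse method is worth a closer look. A small sketch follows, using a made-up sample string and a hypothetical helper name `squash_whitespace`: calling `str.split()` with no arguments splits on any Unicode whitespace, and the non-breaking space `\xa0` counts as whitespace.

```python
def squash_whitespace(text):
    """Remove every run of whitespace, including non-breaking spaces."""
    # str.split() with no arguments splits on any Unicode whitespace,
    # so joining the pieces drops \xa0 along with ordinary spaces.
    return "".join((text or "").split())


raw_source = "\xa0Example\xa0Daily\xa0"     # made-up sample value
print(repr(squash_whitespace(raw_source)))  # 'ExampleDaily'
print(repr(squash_whitespace(None)))        # ''
```

Note that this removes interior whitespace too, which is acceptable for short labels such as the source or comment count. For the title, where spaces between words matter, `strip()` is the safer choice, as in the spider code.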

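Finally, the configuration in settings.py. It follows the officially recommended setup, with a few changes of my own:

Set SPLASH_URL, the address of the Splash service; here its value is http://47.115.61.209:8050/;

Add the scrapy-splash spider middleware: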

SPIDER_MIDDLEWARES = {
   'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

Add the scrapy-splash downloader middlewares:

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810, 
}

In addition, scrapy-splash provides its own duplicate filter class:
```
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
```

However, crawling with this configuration produced the following error:

2020-08-02 22:07:26 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.toutiao.com/robots.txt> (referer: None)
2020-08-02 22:07:26 [scrapy.downloadermiddlewares.robotstxt] DEBUG: Forbidden by robots.txt: <GET https://www.toutiao.com/ch/news_hot/>
2020-08-02 22:07:26 [scrapy.core.engine] INFO: Closing spider (finished)

Toutiao uses the robots protocol to forbid crawling this page. To carry on with the crawl we simply ignore the protocol, by turning off the robots.txt switch in settings.py:

# Obey robots.txt rules
ROBOTSTXT_OBEY = False
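To see why the request was rejected while ROBOTSTXT_OBEY was on, it helps to look at how a robots.txt rule is evaluated. The sketch below uses the standard library's `urllib.robotparser` rather than Scrapy's own middleware, and the rules in `SAMPLE_ROBOTS_TXT` are hypothetical, written for illustration and not copied from any real site. The helper name `is_allowed` is also an assumption.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration; NOT the real contents of any site's robots.txt.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /ch/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(SAMPLE_ROBOTS_TXT, "Scrapy", "https://example.com/ch/news_hot/"))  # False
print(is_allowed(SAMPLE_ROBOTS_TXT, "Scrapy", "https://example.com/about/"))        # True
```

With the switch on, Scrapy performs a check of this kind before each request and drops requests that fail it, which is what the "Forbidden by robots.txt" log line reports.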

Here is the final run:

PS D:\shencong\scrapy-lessons\code\chap16\toutiao_spider> scrapy crawl toutiao_spider

A homework exercise: Toutiao loads more hot news only when the page is scrolled down. How can we use the Splash service to scrape more hot news entries?
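One possible direction for the exercise, offered as a sketch rather than a verified solution: have the Lua script scroll to the bottom of the page several times before returning the HTML. The helper `build_scroll_script` and its parameters are hypothetical, and whether a given number of scrolls is enough for Toutiao has not been checked here. The script relies on `splash:runjs` and `splash:wait`, and it only takes effect when sent to the `execute` endpoint, for example via `SplashRequest(..., endpoint='execute', args={'lua_source': ...})`.

```python
def build_scroll_script(scrolls=3, pause=1.0):
    """Return a Lua script that scrolls down `scrolls` times (hypothetical helper)."""
    return f"""
function main(splash, args)
    assert(splash:go(args.url))
    splash:wait({pause})
    for _ = 1, {scrolls} do
        splash:runjs("window.scrollTo(0, document.body.scrollHeight)")
        splash:wait({pause})
    end
    return {{html = splash:html()}}
end
"""


print(build_scroll_script(scrolls=5, pause=0.5))
```

Generating the script from Python keeps the scroll count and pause adjustable per request, so they can be tuned without editing the Lua text by hand.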

4. Summary

In this section we used the Splash service to scrape rendered pages. This greatly reduces the work of analyzing how a page obtains its data and makes the site easier to scrape. There is much more to Splash worth studying in depth, such as executing JavaScript actions on the rendered page. Mastering these features is what lets Splash get us the data we want.

5. References

[1] Installing Docker on CentOS 7

[2] Using Scrapy-Splash with the Scrapy framework
