沈無奇 · Updated 2020-09-11

Scraping Toutiao with Scrapy: Fetching the Daily Hot News

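Today we build a news scraper on top of the Scrapy framework. In this section we go further with Scrapy by scraping data that is served through asynchronous Ajax requests, and we also look at Scrapy's logging configuration and email sending features.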

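1. Analyzing the Toutiao hot news requests

Today's target is the hot news feed on Toutiao (今日頭條). The accompanying video shows how to find the HTTP request that the Toutiao site issues to fetch hot news.

As the video shows, a sample request to the hot news endpoint looks like this: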

https://www.toutiao.com/api/pc/feed/?category=news_hot&utm_source=toutiao&widen=1&max_behot_time=1597152177&max_behot_time_tmp=1597152177&tadrequire=true&as=A1955F33D209BD8&cp=5F32293B3DE80E1&_signature=_02B4Z6wo0090109cl1gAAIBCcqbHy0H-dDdPWZPAAIzuFTZSh6NBsUuEpf13PktqrmxS-ZD4dEDZ6Ezcpyjo31hg62slsekkigwdRlS0FHfPsOvx.KRyeJBdEf5QI8nLcwEMyziL1YdPK6VD8f

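HTTP requests like this are hard to simulate. We need to know how every parameter in the request is produced, and any encryption scheme has to be located in the front-end code and re-implemented by hand. In the URL above, the leading parameters are fixed values (apart from the max_behot_time pair discussed below), whereas as, cp, and _signature are very hard to obtain and call for strong front-end skills. Others online have analyzed and reverse-engineered how these values are generated, but that is not the focus of this tutorial.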
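
To make the parameter breakdown concrete, here is a small sketch that splits the sample URL into its query parameters with the standard library. The _signature value is replaced by a placeholder, and the grouping into "signed" keys simply follows the description above; it is an illustration, not a statement about how the server validates requests.

```python
from urllib.parse import urlsplit, parse_qs

# Shortened copy of the sample feed URL; the _signature value is a
# placeholder, not a working credential.
url = (
    "https://www.toutiao.com/api/pc/feed/?category=news_hot&utm_source=toutiao"
    "&widen=1&max_behot_time=1597152177&max_behot_time_tmp=1597152177"
    "&tadrequire=true&as=A1955F33D209BD8&cp=5F32293B3DE80E1&_signature=PLACEHOLDER"
)

query = parse_qs(urlsplit(url).query)
# parse_qs returns a list of values per key; keep the single value of each
params = {key: values[0] for key, values in query.items()}

# Keys described in the text as generated by front-end JavaScript
SIGNED_KEYS = {"as", "cp", "_signature"}
fixed = {k: v for k, v in params.items() if k not in SIGNED_KEYS}
signed = {k: v for k, v in params.items() if k in SIGNED_KEYS}

print("fixed or predictable:", sorted(fixed))
print("generated by front-end JS:", sorted(signed))
```

Separating the two groups this way makes it clear which values a scraper can hard-code and which ones it has to compute for every request.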

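One question remains: a single request returns about 10 news items, so how do we issue the follow-up requests that load more news, as in the video? Looking closely at consecutive refresh requests, we find that each response carries a field named max_behot_time.

Figure: On the first request, max_behot_time is 0.

Figure: The max_behot_time in next equals the behot_time of the last item in the response.

This tells us two things about the parameter:

On the first request for hot news, the parameter is 0;
On every later request, max_behot_time takes the value of the max_behot_time key inside the next field of the previous response. It is a Unix timestamp that marks where in the feed the next batch continues from.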
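
The two rules above amount to cursor-based pagination. The sketch below shows only that control flow; fake_fetch and FAKE_PAGES are hypothetical stand-ins for the real endpoint, with made-up timestamps, so the example runs without any network access.

```python
def crawl_feed(fetch_page, refresh_count):
    """Follow the max_behot_time cursor for at most refresh_count requests."""
    max_behot_time = 0  # rule 1: the first request uses 0
    items = []
    for _ in range(refresh_count):
        page = fetch_page(max_behot_time)
        items.extend(page["data"])
        # rule 2: the next cursor comes from the `next` field of this response
        max_behot_time = page["next"]["max_behot_time"]
    return items, max_behot_time


# Hypothetical canned responses keyed by the cursor that requests them
FAKE_PAGES = {
    0: {"data": [{"behot_time": 300}, {"behot_time": 200}],
        "next": {"max_behot_time": 200}},
    200: {"data": [{"behot_time": 150}, {"behot_time": 100}],
          "next": {"max_behot_time": 100}},
}


def fake_fetch(max_behot_time):
    return FAKE_PAGES[max_behot_time]


items, cursor = crawl_feed(fake_fetch, refresh_count=2)
print(len(items), cursor)
```

Passing the fetch function in as an argument keeps the pagination logic independent of how the HTTP request is actually made.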

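With this information, we can hand-write a scraper for Toutiao hot news using the requests library. The code is built in the following steps:

Prepare the basic variables, including the base URL, the query parameters, and the request headers;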

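import codecs
import copy
import hashlib
import json
import time
from urllib.parse import urlencode

import requests

hotnews_url = "https://www.toutiao.com/api/pc/feed/?"

params = {
    'category': 'news_hot',
    'utm_source': 'toutiao',
    'widen': 1,
    'max_behot_time': '',
    'max_behot_time_tmp': '',
}

headers = {
    'referer': 'https://www.toutiao.com/ch/news_hot/',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.112 Safari/537.36'
}
cookies = {'tt_webid': '6856365980324382215'}
max_behot_time = '0'
# Fields kept by save_to_json(). The original listing never defines key_list,
# so this choice (the same fields as the Item defined later) is an assumption.
key_list = ['title', 'abstract', 'source', 'source_url', 'comments_count', 'behot_time']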

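Note: the tt_webid value in cookies above can be found by right-clicking the page and inspecting it in the browser, although it matters little here.

Figure: Obtaining the tt_webid value.

Prepare three functions: get_request_data(), get_as_cp(), and save_to_json(). The second one is a Python port, published online, of the Toutiao JS code that generates the as and cp parameters; for now it still seems to work;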

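def get_request_data(url, headers, cookies=None):
    # Send the GET request (the caller also passes cookies) and decode the JSON body
    response = requests.get(url=url, headers=headers, cookies=cookies)
    return json.loads(response.text)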


def get_as_cp():
    # Build the as and cp parameters; ported from Toutiao's JS file home_4abea46.js
    now = round(time.time())
    e = hex(now).upper()[2:]  # current timestamp as uppercase hex
    i = hashlib.md5(str(now).encode('utf-8')).hexdigest().upper()
    if len(e) != 8:
        # Fall back to fixed values when the hex timestamp is not 8 digits long
        return {'as': '479BB4B7254C150', 'cp': '7E0AC8874BB0985'}
    n = i[:5]
    a = i[-5:]
    s = ''
    r = ''
    for k in range(5):
        s = s + n[k] + e[k]  # interleave the MD5 head with the hex timestamp
    for j in range(5):
        r = r + e[j + 3] + a[j]  # interleave the hex timestamp tail with the MD5 tail
    return {
        'as': 'A1' + s + e[-3:],
        'cp': e[0:3] + r + 'E1'
    }


def save_to_json(datas, file_path, key_list):
    """
    Append the selected fields of each record to a file, one JSON object per line
    """
    print('Writing {} news records to {}'.format(len(datas), file_path))
    with codecs.open(file_path, 'a+', 'utf-8') as f:
        for d in datas:
            # Keep only the keys listed in key_list
            cleaned_data = {key: d[key] for key in key_list if key in d}
            line = json.dumps(cleaned_data, ensure_ascii=False)
            print(line)
            f.write("{}\n".format(line))

The last step is to simulate the refresh requests. Each request uses the max_behot_time value from the previous response, which lets us fetch hot news continuously and mimics scrolling down the Toutiao page;

# Simulate scrolling down 5 times to fetch news data
refresh_count = 5
for _ in range(refresh_count):
    new_params = copy.deepcopy(params)
    zz = get_as_cp()
    new_params['as'] = zz['as']
    new_params['cp'] = zz['cp']
    new_params['max_behot_time'] = max_behot_time
    new_params['max_behot_time_tmp'] = max_behot_time
    request_url = "{}{}".format(hotnews_url, urlencode(new_params))
    print(f'max_behot_time for this request = {max_behot_time}')
    datas = get_request_data(request_url, headers=headers, cookies=cookies)
    max_behot_time = datas['next']['max_behot_time']
    save_to_json(datas['data'], "result.json", key_list)

    time.sleep(2)

Finally, the accompanying video shows the complete hot news scraping code in action.
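
Because save_to_json() appends one JSON object per line, result.json ends up in JSON Lines form rather than being a single JSON document, so it has to be read back line by line. The sketch below shows one way to do that; load_json_lines is a hypothetical helper, and the demo writes made-up records to a temporary file instead of touching a real result.json.

```python
import json
import os
import tempfile


def load_json_lines(file_path):
    """Read a file that holds one JSON object per line."""
    records = []
    with open(file_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                records.append(json.loads(line))
    return records


# Demo with a temporary file and made-up records
sample = [
    {"title": "news A", "behot_time": 1597152177},
    {"title": "news B", "behot_time": 1597152100},
]
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "result.json")
    with open(path, "a+", encoding="utf-8") as f:
        for record in sample:
            f.write("{}\n".format(json.dumps(record, ensure_ascii=False)))
    loaded = load_json_lines(path)

print(loaded == sample)
```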

2. Scraping Toutiao hot news with the Scrapy framework

We follow the same routine as before. The first step is to create the hot news project with the startproject command:

[root@server ~]# cd scrapy-test/
[root@server scrapy-test]# pyenv activate scrapy-test
pyenv-virtualenv: prompt changing will be removed from future release. configure `export PYENV_VIRTUALENV_DISABLE_PROMPT=1' to simulate the behavior.
(scrapy-test) [root@server scrapy-test]# scrapy startproject toutiao_hotnews
New Scrapy project 'toutiao_hotnews', using template directory '/root/.pyenv/versions/3.8.1/envs/scrapy-test/lib/python3.8/site-packages/scrapy/templates/project', created in:
    /root/scrapy-test/toutiao_hotnews

You can start your first spider with:
    cd toutiao_hotnews
    scrapy genspider example example.com
(scrapy-test) [root@server scrapy-test]#

Next, define the Item according to the news fields we want to scrape:

import scrapy


class ToutiaoHotnewsItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    abstract = scrapy.Field()
    source = scrapy.Field()  
    source_url = scrapy.Field()
    comments_count = scrapy.Field()
    behot_time = scrapy.Field()
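
As a rough illustration of how one raw record from the feed's data list could be reduced to these six fields, here is a standard-library sketch that uses a plain dict in place of the Scrapy Item. The sample record is made up, and the defaults for abstract and comments_count are assumptions about which keys may be missing.

```python
import time

FIELDS = ("title", "abstract", "source", "source_url", "comments_count", "behot_time")


def record_to_fields(record):
    """Map one raw feed record onto the item fields; extra keys are dropped."""
    return {
        "title": record["title"],
        "abstract": record.get("abstract", ""),              # assumed optional
        "source": record["source"],
        "source_url": record["source_url"],
        "comments_count": record.get("comments_count", 0),   # assumed optional
        # Convert the Unix timestamp into a readable local time string
        "behot_time": time.strftime(
            "%Y-%m-%d %H:%M:%S", time.localtime(record["behot_time"])
        ),
    }


raw = {
    "title": "Example headline",
    "source": "Example source",
    "source_url": "/group/123/",
    "behot_time": 1597152177,
    "extra": "ignored",
}
fields = record_to_fields(raw)
print(fields)
```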

With the Item in place, we need a new Spider. It can be generated with the genspider command or written by hand as a Python file, with the following content:

# File: toutiao_hotnews/toutiao_hotnews/spiders/hotnews.py
import copy
import hashlib
from urllib.parse import urlencode
import json
import time

from scrapy import Request, Spider

from toutiao_hotnews.items import ToutiaoHotnewsItem


hotnews_url = "https://www.toutiao.com/api/pc/feed/?"
params = {
    'category': 'news_hot',
    'utm_source': 'toutiao',
    'widen': 1,
    'max_behot_time': '',
    'max_behot_time_tmp': '',
    'as': '',
    'cp': ''
}
headers = {
    'referer': 'https://www.toutiao.com/ch/news_hot/',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.112 Safari/537.36'
}
cookies = {'tt_webid':'6856365980324382215'} 
max_behot_time = '0'

def get_as_cp():
    # Build the as and cp parameters; ported from Toutiao's JS file home_4abea46.js
    now = round(time.time())
    e = hex(now).upper()[2:]  # current timestamp as uppercase hex
    i = hashlib.md5(str(now).encode('utf-8')).hexdigest().upper()
    if len(e) != 8:
        # Fall back to fixed values when the hex timestamp is not 8 digits long
        return {'as': '479BB4B7254C150', 'cp': '7E0AC8874BB0985'}
    n = i[:5]
    a = i[-5:]
    s = ''
    r = ''
    for k in range(5):
        s = s + n[k] + e[k]  # interleave the MD5 head with the hex timestamp
    for j in range(5):
        r = r + e[j + 3] + a[j]  # interleave the hex timestamp tail with the MD5 tail
    return {
        'as': 'A1' + s + e[-3:],
        'cp': e[0:3] + r + 'E1'
    }


class HotnewsSpider(Spider):
    name = 'hotnews'
    allowed_domains = ['www.toutiao.com']
    start_urls = ['http://www.toutiao.com/']
    # Count the requests so the spider knows when to stop
    count = 0

    def _get_url(self, max_behot_time):
        new_params = copy.deepcopy(params)
        zz = get_as_cp()
        new_params['as'] = zz['as']
        new_params['cp'] = zz['cp']
        new_params['max_behot_time'] = max_behot_time
        new_params['max_behot_time_tmp'] = max_behot_time
        return  "{}{}".format(hotnews_url, urlencode(new_params))
       
    def start_requests(self):
        """
        Issue the first request
        """
        request_url = self._get_url(max_behot_time)
        self.logger.info(f"we get the request url : {request_url}")
        yield Request(request_url, headers=headers, cookies=cookies, callback=self.parse)

    def parse(self, response):
        """
        Yield items from the response, then build the next request from its result
        """
        self.count += 1
        datas = json.loads(response.text)
        data = datas['data']
        for d in data:
            item = ToutiaoHotnewsItem()
            item['title'] = d['title']
            item['abstract'] = d.get('abstract', '')
            item['source'] = d['source']
            item['source_url'] = d['source_url']
            item['comments_count'] = d.get('comments_count', 0)
            item['behot_time'] = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(d['behot_time']))
            self.logger.info(f'item={item}')
            yield item

        # getint() with a default avoids comparing against None when REFRESH_COUNT is not set
        if self.count < self.settings.getint('REFRESH_COUNT', 5):
            max_behot_time = datas['next']['max_behot_time']
            self.logger.info("we get the next max_behot_time: {}, and the count is {}".format(max_behot_time, self.count))
            yield Request(self._get_url(max_behot_time), headers=headers, cookies=cookies, callback=self.parse)

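The logic is the same as before: the first Request is built in start_requests(), and each later request uses the max_behot_time value taken from the previous response. I also use a class-level counter, count, to track the number of refreshes and cap the number of hot news requests so that the spider does not run forever. In addition, Scrapy gives every spider instance a logger, so wherever we need to log we can simply call self.logger. The corresponding logging configuration is:

# File: toutiao_hotnews/settings.py
# Remember to set a download delay
DOWNLOAD_DELAY = 5
# Number of feed requests before the spider stops; the spider reads this value,
# and 5 mirrors the refresh count of the requests-based example
REFRESH_COUNT = 5
# ...
# Whether logging is enabled; defaults to True
LOG_ENABLED = True
LOG_ENCODING = 'UTF-8'
# Log output file; if None, logs go to the console
LOG_FILE = 'toutiao_hotnews.log'
# Log level; defaults to DEBUG
LOG_LEVEL = 'INFO'
# Date format for log records
LOG_DATEFORMAT = "%Y-%m-%d %H:%M:%S"
# Redirect standard output to the log; defaults to False. If True, all stdout (such as print output) is written to the log
LOG_STDOUT = False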
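
The level and date format settings above correspond to ordinary Python logging concepts. The following standard-library sketch (no Scrapy involved) shows the effect of an INFO threshold and the same date pattern; the logger name hotnews_demo and the in-memory stream standing in for the log file are assumptions made for the demo.

```python
import io
import logging

stream = io.StringIO()  # stands in for the LOG_FILE target
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter(
    fmt="%(asctime)s [%(name)s] %(levelname)s: %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S",  # same pattern as LOG_DATEFORMAT
))

logger = logging.getLogger("hotnews_demo")  # hypothetical logger name
logger.setLevel(logging.INFO)               # same threshold as LOG_LEVEL = 'INFO'
logger.addHandler(handler)
logger.propagate = False                    # keep the demo output in our stream only

logger.debug("dropped: below the INFO threshold")
logger.info("kept: request count is %d", 1)

output = stream.getvalue()
print(output, end="")
```

With the threshold at INFO, the debug call leaves no trace, which is why raising LOG_LEVEL is an easy way to keep a spider's log file short.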

Next comes the Item Pipelines part. This time the scraped news is saved to a MySQL database. There is one more requirement: pick the 10 latest news items and send them to my own mailbox, so that the latest headlines arrive on schedule every morning. I want the email body to be HTML listing the 10 latest items, so the first step is to prepare the hot news template page:

# File: toutiao_hotnews/html_template.py
hotnews_template_html = """
<!DOCTYPE html>
<html>
<head>
	<title>Toutiao Hot News Overview</title>
</head>
<style type="text/css">
</style>
<body>
<div class="container">
<h3 style="margin-bottom: 10px">Toutiao Hot News Overview</h3>
$news_list
</div>
</body>
</html>
"""

One thing to note: Scrapy's MailSender sends the body as plain text by default (mimetype='text/plain'). To send the body as HTML, I subclass MailSender and make a small change to the original send() method:

# File: mail.py

import logging
from email.mime.text import MIMEText
from email.utils import COMMASPACE, formatdate

from scrapy.mail import MailSender
from scrapy.utils.misc import arg_to_iter

logger = logging.getLogger(__name__)

class HtmlMailSender(MailSender):
    def send(self, to, subject, body, cc=None, attachs=(), mimetype='text/plain', charset=None, _callback=None):
        from twisted.internet import reactor

        # The attachs-related branches of the original method are removed and the
        # body is always sent as HTML. attachs stays in the signature only because
        # the code below still reports len(attachs); the rest is unchanged.
        msg = MIMEText(body, 'html', 'utf-8')

        to = list(arg_to_iter(to))
        cc = list(arg_to_iter(cc))

        msg['From'] = self.mailfrom
        msg['To'] = COMMASPACE.join(to)
        msg['Date'] = formatdate(localtime=True)
        msg['Subject'] = subject
        rcpts = to[:]
        if cc:
            rcpts.extend(cc)
            msg['Cc'] = COMMASPACE.join(cc)

        if charset:
            msg.set_charset(charset)

        if _callback:
            _callback(to=to, subject=subject, body=body, cc=cc, attach=attachs, msg=msg)

        if self.debug:
            logger.debug('Debug mail sent OK: To=%(mailto)s Cc=%(mailcc)s '
                         'Subject="%(mailsubject)s" Attachs=%(mailattachs)d',
                         {'mailto': to, 'mailcc': cc, 'mailsubject': subject,
                          'mailattachs': len(attachs)})
            return

        dfd = self._sendmail(rcpts, msg.as_string().encode(charset or 'utf-8'))
        dfd.addCallbacks(
            callback=self._sent_ok,
            errback=self._sent_failed,
            callbackArgs=[to, cc, subject, len(attachs)],
            errbackArgs=[to, cc, subject, len(attachs)],
        )
        reactor.addSystemEventTrigger('before', 'shutdown', lambda: dfd)
        return dfd
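
The core of the change is building the message with MIMEText(body, 'html', 'utf-8'). The standalone sketch below uses only the standard library to show what that produces; build_html_message is a hypothetical helper, the addresses are placeholders, and nothing is actually sent.

```python
from email.mime.text import MIMEText
from email.utils import COMMASPACE, formatdate


def build_html_message(mail_from, to, subject, body):
    """Build an HTML email message without sending it."""
    msg = MIMEText(body, "html", "utf-8")  # subtype 'html' sets Content-Type: text/html
    msg["From"] = mail_from
    msg["To"] = COMMASPACE.join(to)
    msg["Date"] = formatdate(localtime=True)
    msg["Subject"] = subject
    return msg


msg = build_html_message(
    "sender@example.com",       # placeholder addresses
    ["reader@example.com"],
    "Hot news",
    "<h3>Hot news</h3>",
)
print(msg.get_content_type())
print(msg.get_content_charset())
```

Checking the content type on a locally built message like this is a quick way to confirm the body will be rendered as HTML before wiring it into a spider.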

Next is the code in our pipelines.py file:

import logging
from string import Template
from itemadapter import ItemAdapter
import pymysql


from toutiao_hotnews.mail import HtmlMailSender
from toutiao_hotnews.items import ToutiaoHotnewsItem
from toutiao_hotnews.html_template import hotnews_template_html
from toutiao_hotnews import settings

class ToutiaoHotnewsPipeline:
    logger = logging.getLogger('pipelines_log')

    def open_spider(self, spider):
        # Use our own MailSender subclass; from_settings() is a classmethod, so no instance is needed first
        self.mailer = HtmlMailSender.from_settings(spider.settings)
        # Open the database connection
        self.db = pymysql.connect(
            host=spider.settings.get('MYSQL_HOST', 'localhost'),                 
            user=spider.settings.get('MYSQL_USER', 'root'),
            password=spider.settings.get('MYSQL_PASS', '123456'),
            port=spider.settings.get('MYSQL_PORT', 3306),
            db=spider.settings.get('MYSQL_DB_NAME', 'mysql'),
            charset='utf8'
        ) 
        self.cursor = self.db.cursor()

    def process_item(self, item, spider):
        # SQL insert statement
        sql = "insert into toutiao_hotnews(title, abstract, source, source_url, comments_count, behot_time) values (%s, %s, %s, %s, %s, %s)"
        if item and isinstance(item, ToutiaoHotnewsItem):
            self.cursor.execute(sql, (item['title'], item['abstract'], item['source'], item['source_url'], item['comments_count'], item['behot_time']))
        return item

    def query_data(self, sql):
        data = {}
        try:
            self.cursor.execute(sql)
            data = self.cursor.fetchall()
        except Exception as e:
            self.logger.error('database operate error:{}'.format(str(e)))
            self.db.rollback()
        return data

    def close_spider(self, spider):
        # desc is needed so that the newest rows come first
        sql = "select title, source_url, behot_time from toutiao_hotnews order by behot_time desc limit 10"
        # Fetch the 10 latest hot news items
        data = self.query_data(sql)
        news_list = ""
        # Build the HTML body
        for i in range(len(data)):
            news_list += '<div><span>{}. <a href="https://www.toutiao.com{}">{} [{}]</a></span></div>'.format(i + 1, data[i][1], data[i][0], data[i][2])
        msg_content = Template(hotnews_template_html).substitute({"news_list": news_list})
        self.db.commit()
        self.cursor.close()
        self.db.close()
        self.logger.info("Sending the summary email")
        # The return is required; otherwise an error is raised
        return self.mailer.send(to=["2894577759@qq.com"], subject="This is a test", body=msg_content, cc=["2894577759@qq.com"])
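
The HTML body is produced by filling a $news_list placeholder with string.Template. Here is a minimal standalone sketch of that step with made-up rows and a shortened template; the html.escape calls are an addition of this sketch (titles coming from a remote feed may contain markup characters) and are not part of the pipeline listing.

```python
import html
from string import Template

# Shortened stand-in for the full template page
template = Template('<div class="container">\n<h3>Hot news</h3>\n$news_list\n</div>')

# Made-up (title, source_url) rows in place of database results
rows = [("Headline <one>", "/group/1/"), ("Headline two", "/group/2/")]

news_list = "\n".join(
    '<div><span>{}. <a href="https://www.toutiao.com{}">{}</a></span></div>'.format(
        index, html.escape(url, quote=True), html.escape(title)
    )
    for index, (title, url) in enumerate(rows, start=1)
)

# substitute() raises KeyError if a placeholder has no value
page = template.substitute(news_list=news_list)
print(page)
```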

In this pipeline, the MySQL configuration lives in settings.py and is read through spider.settings. open_spider() opens the database connection, process_item() builds the SQL statement and executes the insert, and close_spider() commits the database transaction, closes the connection, and sends the hot news summary email. Below, we enable this Pipeline in settings.py and add the database and mail server settings. Note that ROBOTSTXT_OBEY must also be set to False so that the spider can run.


ROBOTSTXT_OBEY = False

# Enable the pipeline
ITEM_PIPELINES = {
   'toutiao_hotnews.pipelines.ToutiaoHotnewsPipeline': 300,
}

# Database settings
MYSQL_HOST = "180.76.152.113"
MYSQL_PORT = 9002
MYSQL_USER = "store"
MYSQL_PASS = "your-database-password"
MYSQL_DB_NAME = "ceph_check"

# Mail settings
MAIL_HOST = 'smtp.qq.com'
MAIL_PORT = 25
MAIL_FROM = '2894577759@qq.com'
MAIL_PASS = 'your-authorization-code'
MAIL_USER = '2894577759@qq.com'
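
To try the insert-then-query flow without a MySQL server, the sketch below swaps in an in-memory SQLite database from the standard library. This is a substitution for illustration only: the table has fewer columns than the real one, the rows are made up, and sqlite3 uses ? placeholders where pymysql uses %s. It shows why selecting the newest rows calls for descending order on behot_time.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute(
    "create table toutiao_hotnews (title text, source_url text, behot_time text)"
)

# 15 made-up rows whose timestamps increase with the index
rows = [
    ("news {:02d}".format(i), "/group/{}/".format(i), "2020-08-11 21:{:02d}:00".format(i))
    for i in range(15)
]
cur.executemany(
    "insert into toutiao_hotnews(title, source_url, behot_time) values (?, ?, ?)",
    rows,
)
conn.commit()

# Newest first: the formatted timestamps sort chronologically as strings
cur.execute(
    "select title, source_url, behot_time from toutiao_hotnews "
    "order by behot_time desc limit 10"
)
latest = cur.fetchall()
conn.close()

print(latest[0][0], latest[-1][0])
```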

The accompanying video shows this Toutiao news spider in action.

3. Summary

In this section we worked through another hands-on Scrapy case and learned about Scrapy's logging configuration and email sending features. Did you get something out of it?
