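I am using Scrapy to crawl the Sina movie library. The first two pages come back correctly, but every page after that is a duplicate. At first I thought the timestamp was the problem, but the problem persists after adding a timestamp. Any help would be appreciated. (Sina movie library URL: http://ent.sina.com.cn/ku/movie_search_index.d.html?page=1&cTime=1547163277&pre=next)
import scrapy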
from sina.items import MovieItem
from scrapy_splash import SplashRequest
import time
import re


class SinaspiderSpider(scrapy.Spider):
    name = 'sinaspider'
    allowed_domains = ['ent.sina.com.cn']
    start_urls = ['http://ent.sina.com.cn/ku/movie_search_index.d.html?page=1&cTime=1546971817&pre=next']

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url, args={'images': 0, 'timeout': 3})

    def parse(self, response):
        '''
        1. Extract the article URLs from the list page and hand them to Scrapy to download and parse.
        2. Extract the next page's URL and hand it to Scrapy; once downloaded, it is passed back to parse.
        '''
        for sel in response.css('ul.tv-list li'):
            director = sel.css('.item-intro.left p:nth-child(3)::text').extract_first()
            yield {'director': director}
        href = response.css('.next-t.nextPage::attr(href)').extract_first()
        if href:
            t = str(int(time.time()*1000))
            temp = re.match(r'.*page=(\d+).*', href)
            p = int(temp.group(1))+1
            url = 'http://ent.sina.com.cn/ku/movie_search_index.d.html?page='+str(p)+'&cTime='+t+'&pre=next'
            yield SplashRequest(url, args={'images': 0, 'timeout': 3})
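Two details in the code above may be worth checking, though neither is confirmed as the cause. The sample URL carries a 10-digit cTime (seconds), while the spider sends int(time.time()*1000), which is 13 digits (milliseconds). Also, the page number is read from the next-page link's href and then incremented again. A minimal sketch of an alternative, assuming the href of .next-t.nextPage already points at the following page and that the site treats page and cTime as its own paging parameters: follow the link as the site wrote it instead of rebuilding the URL. The helper names resolve_next_url and page_number, and the href value in the usage note, are hypothetical.

```python
from urllib.parse import urljoin, urlsplit, parse_qs


def resolve_next_url(page_url, href):
    """Resolve the next-page href against the current page URL.

    The query string (page, cTime, pre) is kept exactly as the site
    emitted it; nothing is recomputed on the client side.
    """
    return urljoin(page_url, href)


def page_number(url):
    """Return the 'page' query parameter as an int, or None if it is
    missing or not numeric. Useful for logging which page was requested."""
    values = parse_qs(urlsplit(url).query).get('page')
    if not values or not values[0].isdigit():
        return None
    return int(values[0])
```

Inside parse, this would look roughly like yield SplashRequest(resolve_next_url(response.url, href), args={'images': 0, 'timeout': 3}), assuming response.url holds the list page URL. Logging page_number(...) for every request makes it easy to see whether the spider is really asking for pages 3, 4, 5 and so on, or whether the site keeps serving the same page for different requests.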
Using Scrapy to crawl the Sina movie library: only the first two pages are crawled, and everything after that is a duplicate page
慕設計0386610
2019-01-14 09:31:04