I am trying to scrape the results from the following page: http://www.peekyou.com/work/autodesk/page=1, where page=1, 2, 3, 4, ... and so on, depending on the number of results. I have a PHP file that runs the crawler once for each page number. The code (for a single page) is as follows:

```python
import sys
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.item import Item
from scrapy.http import Request
#from scrapy.crawler import CrawlerProcess

class DmozSpider(BaseSpider):
    name = "peekyou_crawler"
    start_urls = ["http://www.peekyou.com/work/autodesk/page=1"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        # Look for the "Next" link in the pagination block
        discovery = hxs.select('//div[@class="nextPage"]/table/tr[2]/td/a[contains(@title,"Next")]')
        print len(discovery)
        print "Starting the actual file"
        # Each result is contained in a div with class "resultCell"
        items = hxs.select('//div[@class="resultCell"]')
        count = 0
        for newsItem in items:
            print newsItem
            url = newsItem.select('h2/a/@href').extract()
            name = newsItem.select('h2/a/span/text()').extract()
            count = count + 1
            print count
            print url[0]
            print name[0]
            print "\n"
```

The Autodesk results span 18 pages. When I run the code to crawl all the pages, the crawler only fetches data from page 2 instead of from all of them. Similarly, when I change the company name to something else, it scrapes some pages but not the rest. I get an HTTP 200 response on every page. Also, even when I run it repeatedly, it keeps scraping the same pages, but never consistently all of them. Any idea what might be wrong with my approach, or what I'm missing?
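One common alternative to driving the spider from an external PHP script is to give the spider every page URL up front in `start_urls`, so a single run covers the whole result set. A minimal sketch of building that list, assuming the `page=N` URL pattern from the question and the stated 18-page result count:

```python
# Hypothetical sketch: build the full list of page URLs so one spider
# run requests every page, instead of launching the crawler once per
# page from an external script. BASE_URL and NUM_PAGES are taken from
# the question (18 result pages for Autodesk).
BASE_URL = "http://www.peekyou.com/work/autodesk/page=%d"
NUM_PAGES = 18

start_urls = [BASE_URL % page for page in range(1, NUM_PAGES + 1)]

print(start_urls[0])    # http://www.peekyou.com/work/autodesk/page=1
print(len(start_urls))  # 18
```

Another option, since `parse` already selects the "Next" link into `discovery`, would be to extract its `@href` and yield a `scrapy.http.Request` for it from `parse`, letting Scrapy follow the pagination itself rather than relying on an outside loop.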