Error: SyntaxError: invalid syntax. I'm at a loss — I've been searching for days and still can't find the cause.
# Environment: Eclipse, Python 3.5
# spider_main
# coding: utf-8
from baike_python import html_downloader, url_manager, html_parser, html_outputer

class SpiderMain(object):
    def __init__(self):  # constructor
        self.urls = url_manager.UrlManager()          # initialize the URL manager
        self.downloader = html_downloader.HtmlDownloader()  # initialize the page downloader
        self.parser = html_parser.HtmlParser()        # initialize the page parser
        self.outputer = html_outputer.HtmlOutputer()  # initialize the outputer

    def craw(self, root_url):  # method that runs the spider
        count = 1   # counter: total number of pages crawled
        count2 = 0  # counter: number of pages that failed
        self.urls.add_new_url(root_url)  # seed with the entry page
        while self.urls.has_new_url():   # keep crawling the links found on each page while the URL manager is non-empty
            try:  # some pages may be dead, so handle exceptions
                new_url = self.urls.get_new_url()  # take one URL from the URL manager
                print ("craw %d : %s \n"%(count,new_url)  # print the page currently being crawled
                html_cont = self.downloader.download(new_url)
                new_urls, new_data = self.parser.parse(new_url,html_cont)  # split the page into new links and page data; new_data holds the current page's link, title, and summary, and new_urls is the set of all links on the current page
                print (new_data["title"]+"\n", new_data["summary"])
                self.urls.add_new_urls(new_urls)      # store the new links in the URL manager
                self.outputer.collect_data(new_data)  # collect the page data
                if count == 10:  # limit how many pages are crawled
                    break
                count = count+1
            except Exception,e:
                count2 = count2+1
                print('e')
                print ("craw failed")

        self.outputer.output_html()
        print (str(count-count2)+" successful,"," while "+str(count2)+" failed ")

if __name__=="__main__":  # entry point
    root_url = "http://baike.baidu.com/view/21087.htm"  # entry page
    obj_spider = SpiderMain()   # create the spider object
    obj_spider.craw(root_url)   # start the spider
#html_downloader
import urllib

class HtmlDownloader(object):

    def download(self,url):
        if url is None:
            return None
        #request = urllib2.Request(url)
        response = urllib.request.urlopen(url)

        if response.getcode() != 200:
            return None
        return response.read().decode('utf-8')
Thanks for looking — hunting this bug is really frustrating.
2017-10-20
At the top of my html_downloader I use the import below instead, and it doesn't raise the error:
import urllib.request
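That matches how Python 3 packages urllib: `urllib` is a package, and the `urllib.request` submodule must be imported explicitly before `urllib.request.urlopen` can be relied on. A minimal sketch of the downloader with that import (same method body as the question's code):

```python
import urllib.request  # import the submodule explicitly, not just "import urllib"

class HtmlDownloader(object):
    def download(self, url):
        if url is None:
            return None
        # urllib.request is guaranteed to be available because it was imported above
        response = urllib.request.urlopen(url)
        if response.getcode() != 200:
            return None
        return response.read().decode('utf-8')
```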
2016-12-13
print ("craw %d : %s \n"%(count,new_url)
is missing its closing parenthesis. Change it to
print("craw %d : %s \n" % (count, new_url))
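On Python 3.5, print is a function, so the call must close its parentheses; with the parenthesis missing, the SyntaxError is reported on the following line. There is also a second Python 2-ism in the question's code: `except Exception,e:` must be spelled `except Exception as e:` in Python 3. A minimal sketch of both lines corrected (the URL is just the question's entry page reused as sample data):

```python
count = 1
new_url = "http://baike.baidu.com/view/21087.htm"

# print is a function in Python 3: the call must be fully parenthesized
print("craw %d : %s \n" % (count, new_url))

try:
    raise ValueError("simulated dead page")  # stand-in for a failed download
except Exception as e:  # Python 3 syntax; "except Exception, e:" is a SyntaxError
    print(e)            # print the exception itself, not the literal string 'e'
```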
2016-12-03
I'm not sure whether this is the problem:
 response = urllib.request.urlopen(url)
Mine is:
 response = urllib.urlopen(url)
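`urllib.urlopen` only exists on Python 2; the question's environment is Python 3.5, where the function moved to `urllib.request.urlopen`. A small version-agnostic sketch (assuming code needs to run on both versions) is:

```python
import sys

if sys.version_info[0] >= 3:
    from urllib.request import urlopen  # Python 3 location
else:
    from urllib import urlopen  # Python 2 location (urllib.urlopen)

# either way, pages can then be fetched the same way:
# response = urlopen(url)
```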
2016-12-02
Here is a screenshot of the error — it points at spider_main, line 23.