
How to yield one parsed item from one link and the other parsed items from other links into the same item list

Python

楊__羊羊 2021-12-26 15:11:01

The problem is that I've been iterating over a list of places to scrape latitude, longitude, and elevation. When I get the scraped data back, I can't link it to my current df, because the names I iterated over may have been modified or skipped. I managed to capture the name I searched for, but since it is parsed outside the link that provides the rest of the item's fields, it doesn't work properly.

import scrapy
import pandas as pd
from ..items import latlonglocItem

df = pd.read_csv('wine_df_final.csv')
df = df[pd.notnull(df.real_place)]
real_place = list(set(df.real_place))


class latlonglocSpider(scrapy.Spider):
    name = 'latlonglocs'
    start_urls = []

    for place in real_place:
        baseurl = place.replace(',', '').replace(' ', '+')
        cleaned_href = f'http://www.google.com/search?q={baseurl}+coordinates+latitude+longitude+distancesto'
        start_urls.append(cleaned_href)

    def parse(self, response):
        items = latlonglocItem()
        items['base_name'] = response.xpath('string(/html/head/title)').get().split(' coordinates')[0]
        for href in response.xpath('//*[@id="ires"]/ol/div/h3/a/@href').getall():
            if href.startswith('/url?q=https://www.distancesto'):
                yield response.follow(href, self.parse_distancesto)
            else:
                pass
        yield items

    def parse_distancesto(self, response):
        items = latlonglocItem()
        try:
            items['appellation'] = response.xpath('string(/html/body/div[3]/div/div[2]/div[3]/div[2]/p/strong)').get()
            items['latitude'] = response.xpath('string(/html/body/div[3]/div/div[2]/div[3]/div[3]/table/tbody/tr[1]/td)').get()
            items['longitude'] = response.xpath('string(/html/body/div[3]/div/div[2]/div[3]/div[3]/table/tbody/tr[2]/td)').get()
            items['elevation'] = response.xpath('string(/html/body/div[3]/div/div[2]/div[3]/div[3]/table/tbody/tr[10]/td)').get()
            yield items
        except Exception:
            pass

#output
appellation base_name elevation latitude longitude
Chalone, USA Santa Cruz, USA 56.81 35 9.23

What happens is that I parse the name I searched for, and then the spider follows a link and parses the rest of the information. However, in my dataframe the name I searched for ends up completely detached from the rest of the item's fields, and even then it is hard to find the match. I would like to pass the info to the other function so that it yields all the fields together.


2 Answers

溫溫醬


This might work. I've commented the code to explain what I'm doing and to help you follow it.

import scrapy
import pandas as pd
from ..items import latlonglocItem

df = pd.read_csv('wine_df_final.csv')
df = df[pd.notnull(df.real_place)]
real_place = list(set(df.real_place))


class latlonglocSpider(scrapy.Spider):  # latlonglocSpider is a child class of scrapy.Spider
    name = 'latlonglocs'
    start_urls = []

    for place in real_place:
        baseurl = place.replace(',', '').replace(' ', '+')
        cleaned_href = f'http://www.google.com/search?q={baseurl}+coordinates+latitude+longitude+distancesto'
        start_urls.append(cleaned_href)

    def __init__(self, *args, **kwargs):  # Constructor for our class
        # Since we define our own constructor, we need to call the parent's constructor
        super().__init__(*args, **kwargs)
        self.base_name = None  # Here is the base_name we can now use class-wide

    def parse(self, response):
        items = latlonglocItem()
        items['base_name'] = response.xpath('string(/html/head/title)').get().split(' coordinates')[0]
        self.base_name = items['base_name']  # Let's store the base_name on the instance
        for href in response.xpath('//*[@id="ires"]/ol/div/h3/a/@href').getall():
            if href.startswith('/url?q=https://www.distancesto'):
                yield response.follow(href, self.parse_distancesto)
        yield items

    def parse_distancesto(self, response):
        items = latlonglocItem()
        try:
            # If for some reason self.base_name is never assigned in parse(),
            # we want an empty string instead. The expression below means:
            # use self.base_name unless it is None or empty, in which case use "".
            items['base_name'] = self.base_name or ""
            items['appellation'] = response.xpath('string(/html/body/div[3]/div/div[2]/div[3]/div[2]/p/strong)').get()
            items['latitude'] = response.xpath('string(/html/body/div[3]/div/div[2]/div[3]/div[3]/table/tbody/tr[1]/td)').get()
            items['longitude'] = response.xpath('string(/html/body/div[3]/div/div[2]/div[3]/div[3]/table/tbody/tr[2]/td)').get()
            items['elevation'] = response.xpath('string(/html/body/div[3]/div/div[2]/div[3]/div[3]/table/tbody/tr[10]/td)').get()
            yield items
        except Exception:
            pass
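
One caveat with keeping base_name on the spider instance: Scrapy handles requests asynchronously, so parse() can run for several search pages before any distancesto response reaches parse_distancesto(). The toy simulation below is plain Python, not Scrapy; the class name, titles, and page strings are made up for illustration. It shows how every item can then end up carrying the last name that was stored.

```python
class SharedStateSpider:
    """Toy stand-in for a spider that keeps base_name on the instance."""

    def __init__(self):
        self.base_name = None

    def parse(self, search_title):
        # Mirrors the answer: remember the name, then "follow" a link.
        self.base_name = search_title.split(" coordinates")[0]
        return f"detail page for {self.base_name}"  # stands in for a Request

    def parse_distancesto(self, detail_page):
        # Reads whatever name happens to be stored at this moment.
        return {"base_name": self.base_name or "", "page": detail_page}


spider = SharedStateSpider()
titles = [
    "Chalone USA coordinates latitude longitude",
    "Santa Cruz USA coordinates latitude longitude",
]

# Both search pages are parsed before either detail response comes back.
pending = [spider.parse(title) for title in titles]
items = [spider.parse_distancesto(page) for page in pending]

for item in items:
    print(item)
```

The first item pairs the Chalone detail page with the Santa Cruz name, which is the kind of mismatch described in the question.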

2021-12-26

慕尼黑的夜晚無繁華


import scrapy
import pandas as pd
from ..items import latlonglocItem

df = pd.read_csv('wine_df_final.csv')
df = df[pd.notnull(df.real_place)]
real_place = list(set(df.real_place))


class latlonglocSpider(scrapy.Spider):  # latlonglocSpider is a child class of scrapy.Spider
    name = 'latlonglocs'
    start_urls = []

    for place in real_place:
        baseurl = place.replace(',', '').replace(' ', '+')
        cleaned_href = f'http://www.google.com/search?q={baseurl}+coordinates+latitude+longitude+distancesto'
        start_urls.append(cleaned_href)

    def __init__(self, *args, **kwargs):  # Constructor for our class
        # Since we define our own constructor, we need to call the parent's constructor
        super().__init__(*args, **kwargs)
        self.base_name = None  # Here is the base_name we can now use class-wide

    def parse(self, response):
        for href in response.xpath('//*[@id="ires"]/ol/div/h3/a/@href').getall():
            if href.startswith('/url?q=https://www.distancesto'):
                # Store the name right before following the matching link
                self.base_name = response.xpath('string(/html/head/title)').get().split(' coordinates')[0]
                yield response.follow(href, self.parse_distancesto)

    def parse_distancesto(self, response):
        items = latlonglocItem()
        try:
            # If for some reason self.base_name is never assigned in parse(),
            # we want an empty string instead. The expression below means:
            # use self.base_name unless it is None or empty, in which case use "".
            items['base_name'] = self.base_name or ""
            items['appellation'] = response.xpath('string(/html/body/div[3]/div/div[2]/div[3]/div[2]/p/strong)').get()
            items['latitude'] = response.xpath('string(/html/body/div[3]/div/div[2]/div[3]/div[3]/table/tbody/tr[1]/td)').get()
            items['longitude'] = response.xpath('string(/html/body/div[3]/div/div[2]/div[3]/div[3]/table/tbody/tr[2]/td)').get()
            items['elevation'] = response.xpath('string(/html/body/div[3]/div/div[2]/div[3]/div[3]/table/tbody/tr[10]/td)').get()
            yield items
        except Exception:
            pass

For this to work, concurrent requests must be set to 1 (Scrapy's CONCURRENT_REQUESTS setting), and base_name has to be assigned inside the loop.

2021-12-26
