
Web scraping: problem producing a clean CSV table


肥皂起泡泡 2023-04-11 15:43:56
I am trying to scrape some data from a table. I get the results I expect, but I cannot find a way to save them to a clean CSV file. My code is below, followed by the output I get and the output I want. Any suggestions?

from bs4 import BeautifulSoup
import urllib.request  # web access
import csv
import re

url = "https://wsc.nmbe.ch/family/87/Senoculidae"
page = urllib.request.urlopen(url)  # connect to website
try:
    page = urllib.request.urlopen(url)
except:
    print("Ups!")

soup = BeautifulSoup(page, 'html.parser')
regex = re.compile('^speciesTitle')
content_lis = soup.find_all('div', attrs={'class': regex})
for li in content_lis:
    con = li.get_text("#", strip=True).split("\n")[0]
    print(con)

This gives me output like:

Senoculus albidus#(F. O. Pickard-Cambridge, 1897)#|#| Brazil
Senoculus barroanus#Chickering, 1941#|#| Panama
Senoculus bucolicus#Chickering, 1941#|#| Panama

But I need something like this (CSV separated by semicolons or tabs):

Senoculus albidus;(F. O. Pickard-Cambridge, 1897);Brazil
Senoculus barroanus;Chickering, 1941;Panama
Senoculus bucolicus;Chickering, 1941;Panama

How can I remove the "|" characters and the extra spaces?

2 Answers

幕布斯6054654

Contributed 1876 experience points, received over 7 likes

Try this:


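import re
import urllib.request  # web access

from bs4 import BeautifulSoup

url = "https://wsc.nmbe.ch/family/87/Senoculidae"

try:
    page = urllib.request.urlopen(url)  # connect to the website
except OSError as err:  # urllib.error.URLError is a subclass of OSError
    raise SystemExit(f"Could not open {url}: {err}")

soup = BeautifulSoup(page, 'html.parser')

# Each species entry is a <div> whose class starts with "speciesTitle"
regex = re.compile('^speciesTitle')
content_lis = soup.find_all('div', attrs={'class': regex})

rows = []
for cl in content_lis:
    a = cl.select_one('div a strong i')        # species name
    b = cl.find(string=True, recursive=False)  # text sitting directly inside the div (author, year)
    c = cl.select_one('span')                  # distribution text, including the '|' separators
    # Keeps only the first word of the distribution, so multi-word
    # regions are truncated (see "French" in the output below).
    cc = re.findall(r"\w+", c.text)[0]
    rows.append(f'{a.get_text(strip=True)};{b.strip()};{cc}')

with open('file.csv', 'w', encoding='utf-8') as f:
    f.writelines(row + '\n' for row in rows)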

This saves a file containing:


Senoculus albidus;(F. O. Pickard-Cambridge, 1897);Brazil
Senoculus barroanus;Chickering, 1941;Panama
Senoculus bucolicus;Chickering, 1941;Panama
Senoculus cambridgei;Mello-Leitão, 1927;Brazil
Senoculus canaliculatus;F. O. Pickard-Cambridge, 1902;Mexico
Senoculus carminatus;Mello-Leitão, 1927;Brazil
Senoculus darwini;(Holmberg, 1883);Argentina
Senoculus fimbriatus;Mello-Leitão, 1927;Brazil
Senoculus gracilis;(Keyserling, 1879);Guyana
Senoculus guianensis;Caporiacco, 1947;j
Senoculus iricolor;(Simon, 1880);Brazil
Senoculus maronicus;Taczanowski, 1872;French

and so on...
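The regex in the answer above keeps only the first word of the distribution, which is why one row ends in "French". Below is a minimal sketch of one alternative, assuming the span text looks like "| | French Guiana" (inferred from the question's output, not checked against the live page). The helper names clean_distribution and rows_to_csv are hypothetical; csv.writer is the standard-library writer and takes care of the delimiter and any quoting.

```python
import csv
import io


def clean_distribution(span_text):
    """Drop the '|' separators and collapse whitespace, keeping every word."""
    return " ".join(span_text.replace("|", " ").split())


def rows_to_csv(rows, delimiter=";"):
    """Serialize rows with csv.writer (use a tab as delimiter for TSV output)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=delimiter, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical inputs: names and authors come from the output above,
# the span strings are assumed.
rows = [
    ["Senoculus albidus", "(F. O. Pickard-Cambridge, 1897)", clean_distribution("| | Brazil")],
    ["Senoculus maronicus", "Taczanowski, 1872", clean_distribution("| | French Guiana")],
]
print(rows_to_csv(rows), end="")
```

To write straight to disk instead of a string, open the file with newline="" and encoding="utf-8" and hand that file object to csv.writer.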


Answered 2023-04-11

慕哥6287543

Contributed 1831 experience points, received over 10 likes

This code is based on your sample data:


lst = [
    'Senoculus albidus#(F. O. Pickard-Cambridge, 1897)#|#| Brazil',
    'Senoculus barroanus#Chickering, 1941#|#| Panama',
    'Senoculus bucolicus#Chickering, 1941#|#| Panama'
]

# Remove the '|' characters, then split each line on '#'
lst2 = [s.replace('|', '').split('#') for s in lst]

# Strip each field, join with ';', and collapse the ';;' left by the empty field
lst3 = []
for s in lst2:
    lst3.append(';'.join([sx.strip() for sx in s]).replace(';;', ';'))

for s in lst3:
    print(s)

Output:


Senoculus albidus;(F. O. Pickard-Cambridge, 1897);Brazil
Senoculus barroanus;Chickering, 1941;Panama
Senoculus bucolicus;Chickering, 1941;Panama

--- Update based on the asker's comment ---

Add one line to your final loop:


for li in content_lis:
    con = li.get_text("#", strip=True).split("\n")[0]
    con = ';'.join(sx.strip() for sx in con.replace('|', "").split('#')).replace(';;', ';')  # add this line
    print(con)
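The replace(';;', ';') step above collapses one pair of adjacent separators per match, so a line with several empty fields in a row could still leave a stray ';'. One possible alternative, sketched below, is to drop the empty pieces before joining. The function name to_csv_line is hypothetical, and the sample strings are the ones from the question.

```python
def to_csv_line(raw, sep=";"):
    """Turn 'name#author#|#| country' into 'name;author;country'."""
    pieces = (p.replace("|", "").strip() for p in raw.split("#"))
    return sep.join(p for p in pieces if p)  # skip pieces left empty after cleaning


samples = [
    "Senoculus albidus#(F. O. Pickard-Cambridge, 1897)#|#| Brazil",
    "Senoculus barroanus#Chickering, 1941#|#| Panama",
]
for raw in samples:
    print(to_csv_line(raw))
```

The trade-off is that a genuinely empty field is dropped as well, which would shift the remaining columns to the left. If the column count has to stay fixed, filtering may not be the right choice.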

