
How do I load a data text file containing many comment lines into pandas?

Python

皈依舞 2023-09-26 15:09:56

I'm trying to read a delimited text file into a dataframe in Python. The delimiter is not recognized when I use pd.read_table. If I explicitly set sep = ' ', I get an error: Error tokenizing data. C error. Notably, the defaults work when I use np.loadtxt().

Example:

pd.read_table('http://berkeleyearth.lbl.gov/auto/Global/Land_and_Ocean_complete.txt', comment = '%', header = None)

                                    0
0  1850 1 -0.777 0.412 NaN NaN...
1  1850 2 -0.239 0.458 NaN NaN...
2  1850 3 -0.426 0.447 NaN NaN...
3  1850 4 -0.680 0.367 NaN NaN...
4  1850 5 -0.687 0.298 NaN NaN...

If I set sep = ' ', I get a different error:

pd.read_table('http://berkeleyearth.lbl.gov/auto/Global/Land_and_Ocean_complete.txt', comment = '%', header = None, sep = ' ')

ParserError: Error tokenizing data. C error: Expected 2 fields in line 78, saw 58

Looking up this error, people suggest using header = None (already done) and setting sep = explicitly, but that is exactly what causes the problem here: Python Pandas Error tokenizing data. I looked at line 78 and can't see anything wrong with it. If I set error_bad_lines=False, I get an empty df, which suggests there is a problem with every entry.

Notably, this works when I use np.loadtxt():

pd.DataFrame(np.loadtxt('http://berkeleyearth.lbl.gov/auto/Global/Land_and_Ocean_complete.txt', comments = '%'))

        0    1       2      3    4    5    6    7    8    9   10   11
0  1850.0  1.0  -0.777  0.412  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN
1  1850.0  2.0  -0.239  0.458  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN
2  1850.0  3.0  -0.426  0.447  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN
3  1850.0  4.0  -0.680  0.367  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN
4  1850.0  5.0  -0.687  0.298  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN

This suggests to me that there is nothing wrong with the file, and that the problem is in how I am calling pd.read_table(). I looked through the documentation for np.loadtxt() hoping to set sep to the same value, but it only shows delimiter=None (https://numpy.org/doc/stable/reference/generated/numpy.loadtxt.html).

I'd like to be able to import this as a pd.DataFrame and set the names, rather than having to import it as a matrix and then convert it to a pd.DataFrame.

What am I getting wrong?


2 Answers

慕娘9325324

Contributed 1783 experience points · received over 4 likes

This one is fairly tricky. Try the snippet below:

import pandas as pd

url = 'http://berkeleyearth.lbl.gov/auto/Global/Land_and_Ocean_complete.txt'

# sep=r'\s+' splits on runs of whitespace; comment='%' skips the commented lines
df = pd.read_csv(url,
                 sep=r'\s+',
                 comment='%',
                 usecols=(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11),
                 names=('Year', 'Month', 'M.Anomaly', 'M.Unc.', 'A.Anomaly',
                        'A.Unc.', '5y.Anomaly', '5y.Unc.', '10y.Anomaly', '10y.Unc.',
                        '20y.Anomaly', '20y.Unc.'))
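To see why a single-space separator trips up the parser while a whitespace pattern does not, here is a minimal sketch on an in-memory sample. The sample text and its four column names are made up for illustration and only imitate the layout of the real file ('%' comment lines followed by columns padded with a varying number of spaces).

```python
import io

import pandas as pd

# Hypothetical miniature of the file: '%' comment lines, then columns
# aligned with a varying number of spaces.
sample = """\
% Global Average Temperature Anomaly (sample)
% Year, Month, Anomaly, Unc.
  1850     1    -0.777     0.412
  1850     2    -0.239     0.458
  1850    12     0.105       NaN
"""

# Splitting on exactly one space yields an empty field for every extra space,
# which is roughly what sep=' ' asks the parser to do.
first_row = sample.splitlines()[2]
print(first_row.split(' '))
# Splitting on runs of whitespace gives the four real fields.
print(first_row.split())  # ['1850', '1', '-0.777', '0.412']

# sep=r'\s+' treats any run of whitespace as one delimiter,
# and comment='%' drops the commented lines.
df = pd.read_csv(io.StringIO(sample), sep=r'\s+', comment='%', header=None,
                 names=['Year', 'Month', 'Anomaly', 'Unc.'])
print(df)
```

Because the columns are padded to line up visually, the number of spaces between fields changes from row to row, so only a pattern that matches "one or more whitespace characters" splits every row the same way.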

2023-09-26

料青山看我應(yīng)如是

Contributed 1772 experience points · received over 8 likes

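The problem is that the file has 77 lines of comment text for 'Global Average Temperature Anomaly with Sea Ice Temperature Inferred from Air Temperatures'.

Two of those lines are headers.

There is a block of data, then two more header lines, and then a new set of data for 'Global Average Temperature Anomaly with Sea Ice Temperature Inferred from Water Temperatures'.
This solution separates the two tables in the file into separate dataframes.
It is not as neat as the other answer, but the data is correctly split into different dataframes.
The headers were a pain; it would probably be easier to create a custom header manually and skip the lines of code that separate the headers from the text.
The important point is that the air data is separated from the water (h2o) data.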

import requests
import pandas as pd

# read the file with requests
url = 'http://berkeleyearth.lbl.gov/auto/Global/Land_and_Ocean_complete.txt'
response = requests.get(url)
data = response.text

# convert data into a list of lines, removing the '% ' comment marker
data = [d.strip().replace('% ', '') for d in data.split('\n')]

# specify the data from the ranges in the file (line numbers are hard-coded)
air_header1 = data[74].split()  # first header row: one label per Anomaly/Unc. pair
air_header2 = [v.strip() for v in data[75].split(',')]  # second header row: field names

# combine the 2 parts of the header into a single header
air_header = air_header2[:2] + [f'{air_header1[i // 2]}_{v}' for i, v in enumerate(air_header2[2:])]
air_data = [v.split() for v in data[77:2125]]

h2o_header1 = data[2129].split()  # first header row of the second table
h2o_header2 = [v.strip() for v in data[2130].split(',')]  # second header row

# combine the 2 parts of the header into a single header
h2o_header = h2o_header2[:2] + [f'{h2o_header1[i // 2]}_{v}' for i, v in enumerate(h2o_header2[2:])]
h2o_data = [v.split() for v in data[2132:4180]]

# create the dataframes
air = pd.DataFrame(air_data, columns=air_header)
h2o = pd.DataFrame(h2o_data, columns=h2o_header)
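The header-combining step is easier to follow on a small example. The two header strings below are hypothetical, shortened stand-ins for the two header rows of the file (with the leading '% ' already removed); the real rows are not reproduced here.

```python
# Hypothetical header rows, shaped like the two header lines described above.
header_line1 = 'Monthly Annual Five-year'
header_line2 = 'Year, Month, Anomaly, Unc., Anomaly, Unc., Anomaly, Unc.'

groups = header_line1.split()
fields = [v.strip() for v in header_line2.split(',')]

# The first two fields stand alone; after that, each consecutive pair of
# fields (Anomaly, Unc.) is prefixed with the same group label.
combined = fields[:2] + [f'{groups[i // 2]}_{v}' for i, v in enumerate(fields[2:])]
print(combined)
```

Integer division by 2 maps positions 0 and 1 to the first label, 2 and 3 to the second, and so on, which is how one label ends up shared by an Anomaly/Unc. pair.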

Without the header code

The code can be simplified by using a manually written list of headers.

import pandas as pd
import requests

# read the file with requests
url = 'http://berkeleyearth.lbl.gov/auto/Global/Land_and_Ocean_complete.txt'
response = requests.get(url)
data = response.text

# convert data into a list
data = [d.strip().replace('% ', '') for d in data.split('\n')]

# manually created header
headers = ['Year', 'Month', 'Monthly_Anomaly', 'Monthly_Unc.',
           'Annual_Anomaly', 'Annual_Unc.',
           'Five-year_Anomaly', 'Five-year_Unc.',
           'Ten-year_Anomaly', 'Ten-year_Unc.',
           'Twenty-year_Anomaly', 'Twenty-year_Unc.']

# separate the air and h2o data
air_data = [v.split() for v in data[77:2125]]
h2o_data = [v.split() for v in data[2132:4180]]

# create the dataframes
air = pd.DataFrame(air_data, columns=headers)
h2o = pd.DataFrame(h2o_data, columns=headers)

air

   Year  Month  Monthly_Anomaly  Monthly_Unc.  Annual_Anomaly  Annual_Unc.  Five-year_Anomaly  Five-year_Unc.  Ten-year_Anomaly  Ten-year_Unc.  Twenty-year_Anomaly  Twenty-year_Unc.
0  1850      1           -0.777         0.412             NaN          NaN                NaN             NaN               NaN            NaN                  NaN               NaN
1  1850      2           -0.239         0.458             NaN          NaN                NaN             NaN               NaN            NaN                  NaN               NaN
2  1850      3           -0.426         0.447             NaN          NaN                NaN             NaN               NaN            NaN                  NaN               NaN

h2o

   Year  Month  Monthly_Anomaly  Monthly_Unc.  Annual_Anomaly  Annual_Unc.  Five-year_Anomaly  Five-year_Unc.  Ten-year_Anomaly  Ten-year_Unc.  Twenty-year_Anomaly  Twenty-year_Unc.
0  1850      1           -0.724         0.370             NaN          NaN                NaN             NaN               NaN            NaN                  NaN               NaN
1  1850      2           -0.221         0.430             NaN          NaN                NaN             NaN               NaN            NaN                  NaN               NaN
2  1850      3           -0.443         0.419             NaN          NaN                NaN             NaN               NaN            NaN                  NaN               NaN
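If the hard-coded line ranges are a concern, one possible variation is to split the text wherever a run of data lines is interrupted by comment or blank lines, and to let pd.read_csv parse each run. This is only a sketch: split_tables is a hypothetical helper (not part of pandas), the sample text is made up, and it assumes no blank or comment lines occur inside a table. It also sidesteps the fact that a dataframe built from split strings holds text rather than numbers, since read_csv infers numeric types.

```python
import io
from itertools import groupby

import pandas as pd


def split_tables(text, names, comment='%'):
    """Return one DataFrame per run of consecutive data lines in `text`.

    Hypothetical helper: a data line is any non-blank line that does not
    start with the comment character.
    """
    def is_data(line):
        stripped = line.strip()
        return bool(stripped) and not stripped.startswith(comment)

    frames = []
    # groupby yields alternating runs of data lines and non-data lines.
    for keep, block in groupby(text.splitlines(), key=is_data):
        if keep:
            buffer = io.StringIO('\n'.join(block))
            frames.append(pd.read_csv(buffer, sep=r'\s+', header=None, names=names))
    return frames


# Made-up sample with two tables separated by comment and blank lines.
sample = """\
% Table inferred from air temperatures (sample)
% Year, Month, Anomaly, Unc.
  1850     1    -0.777     0.412
  1850     2    -0.239     0.458

% Table inferred from water temperatures (sample)
% Year, Month, Anomaly, Unc.
  1850     1    -0.724     0.370
  1850     2    -0.221     0.430
"""

air, h2o = split_tables(sample, names=['Year', 'Month', 'Anomaly', 'Unc.'])
print(air.dtypes)
print(h2o)
```

For the real file, the text from response.text and the 12-item headers list would take the place of sample and the four illustrative names, provided the file really does contain exactly two such runs.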

2023-09-26
