
How do I load downloaded and unzipped text files into a pandas dataframe?

Python

繁花不似錦 2023-09-12 19:59:42

The following code downloads and unzips an archive containing thousands of text files:

zip_file_url = "https://docsia-temp.s3-sa-east-1.amazonaws.com/docsia-desafio-dataset.zip"
res = requests.get(zip_file_url, stream=True)  # fazendo o request do dado
print("fazendo o download...")
z = zipfile.ZipFile(io.BytesIO(res.content))
print("extraindo os dados")
z.extractall("./")
print("ok..")

How can these files be loaded into a pandas dataframe?


1 Answer

莫回?zé)o


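See the inline comments in the code for explanations.
The code uses the pathlib module to find the files that have been unzipped.
There are 20 article types, so the dictionary of dataframes, dd, has 20 keys.
The value for each key is a dataframe containing all the articles of that type.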
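As a small illustration of where those keys come from: the code further down takes the article type from the second-to-last component of each file path. The path below is hypothetical (the real archive layout may differ), and `PurePosixPath` is used so that nothing has to exist on disk.

```python
from pathlib import PurePosixPath

# hypothetical path to one extracted article; the actual directory layout is an assumption
example_file = PurePosixPath('data/zipped/20_newsgroups/comp.graphics/37261')

# parts splits the path into its components; [-2] is the name of the parent directory
article_type = example_file.parts[-2].replace('.', '_')
print(article_type)
```

Replacing the dots with underscores is what turns a directory name like `comp.graphics` into the key `comp_graphics`.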

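Each dataframe has 1000 rows, one row per article.

There are 20000 articles in total.
This implementation preserves the layout of each article.

When a row from the dataframe is printed, the article appears in readable form, with its line breaks and punctuation intact.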
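The reason the line breaks survive is that `readlines()` keeps the trailing newline on every line, so joining the non-blank lines leaves those newlines in the resulting string. A minimal sketch with made-up content, using `StringIO` in place of a real file:

```python
from io import StringIO

# a tiny stand-in for one article file (made-up content)
raw = "Subject: test\n\nFirst line\nSecond line\n"

with StringIO(raw) as fh:
    # readlines keeps the trailing '\n' on each line; blank lines are filtered out
    article = ' '.join(line for line in fh.readlines() if line.strip())

print(article)
```

Note that joining with `' '` puts a single space at the start of every line after the first; joining with `''` instead would avoid that if it matters.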

To create a single dataframe from the individual dataframes:

dfc = pd.concat(dd.values()).reset_index(drop=True)
This is why the 'type' column is added when each dataframe is first created: in the combined dataframe, the article type of every row remains identifiable.
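A toy version of that step, with two made-up dataframes standing in for the values of dd (the names `dd_demo` and `dfc_demo` are only for this illustration):

```python
import pandas as pd

# toy stand-in for dd: one small dataframe per article type (made-up rows)
dd_demo = {
    'sci_space': pd.DataFrame({'article': ['a1', 'a2'], 'type': 'sci_space'}),
    'rec_autos': pd.DataFrame({'article': ['b1'], 'type': 'rec_autos'}),
}

# stack the per-type dataframes and renumber the rows from 0
dfc_demo = pd.concat(dd_demo.values()).reset_index(drop=True)

# the 'type' column still says which dataframe each row came from
print(dfc_demo['type'].value_counts())
```

Without `reset_index(drop=True)`, the combined dataframe would keep each source dataframe's own index, so row labels would repeat.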

This answers the question of how to load all the files into dataframes.
For further questions about processing the text, please open a new question.

from collections import defaultdict
from io import BytesIO
from pathlib import Path
from zipfile import ZipFile

import pandas as pd
import requests

######################################################################
# download and save zipped files

# location to save files; this creates a pathlib object of the path, and
# pathlib objects have methods like rglob, parts, and is_file
save_path = Path('data/zipped')

zip_file_url = "https://docsia-temp.s3-sa-east-1.amazonaws.com/docsia-desafio-dataset.zip"

res = requests.get(zip_file_url, stream=True)

with ZipFile(BytesIO(res.content), 'r') as zip_ref:
    zip_ref.extractall(save_path)

######################################################################

# find all the files; the methods in this list comprehension are pathlib methods
files = [file for file in save_path.rglob('*') if file.is_file()]

# dict to save dataframes for each file
dd = defaultdict(list)

for file in files:
    # extract the type of article from the path (the parent directory name)
    article_type = file.parts[-2].replace('.', '_')

    # open the file
    with file.open(mode='r', encoding='utf-8', errors='ignore') as f:
        # read the non-blank lines and combine them into one string inside a list
        text = [' '.join([line for line in f.readlines() if line.strip()])]

    # create a one-row dataframe from the text
    df = pd.DataFrame(text, columns=['article'])

    # add a column for the article type
    df['type'] = article_type

    # add the dataframe to the default dict
    dd[article_type].append(df.copy())

# each value of the dict is a list of dataframes; iterate through all keys
# and create a single dataframe for each key
for k, v in dd.items():
    # for each article type, combine all the dataframes into a single dataframe
    dd[k] = pd.concat(v).reset_index(drop=True)

print(dd.keys())

[out]:
dict_keys(['alt_atheism', 'comp_graphics', 'comp_os_ms-windows_misc', 'comp_sys_ibm_pc_hardware', 'comp_sys_mac_hardware', 'comp_windows_x', 'misc_forsale', 'rec_autos', 'rec_motorcycles', 'rec_sport_baseball', 'rec_sport_hockey', 'sci_crypt', 'sci_electronics', 'sci_med', 'sci_space', 'soc_religion_christian', 'talk_politics_guns', 'talk_politics_mideast', 'talk_politics_misc', 'talk_religion_misc'])

# print the first article for the alt_atheism key
print(dd['alt_atheism'].iloc[0, 0])

[out]:
Xref: cantaloupe.srv.cs.cmu.edu alt.atheism:49960 alt.atheism.moderated:713 news.answers:7054 alt.answers:126
Path: cantaloupe.srv.cs.cmu.edu!crabapple.srv.cs.cmu.edu!bb3.andrew.cmu.edu!news.sei.cmu.edu!cis.ohio-state.edu!magnus.acs.ohio-state.edu!usenet.ins.cwru.edu!agate!spool.mu.edu!uunet!pipex!ibmpcug!mantis!mathew
From: mathew <mathew@mantis.co.uk>
Newsgroups: alt.atheism,alt.atheism.moderated,news.answers,alt.answers
Subject: Alt.Atheism FAQ: Atheist Resources
Summary: Books, addresses, music -- anything related to atheism
Keywords: FAQ, atheism, books, music, fiction, addresses, contacts
Message-ID: <19930329115719@mantis.co.uk>
Date: Mon, 29 Mar 1993 11:57:19 GMT
Expires: Thu, 29 Apr 1993 11:57:19 GMT
Followup-To: alt.atheism
Distribution: world
Organization: Mantis Consultants, Cambridge. UK.
Approved: news-answers-request@mit.edu
Supersedes: <19930301143317@mantis.co.uk>
Lines: 290
Archive-name: atheism/resources
...
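
If the files do not need to stay on disk, the archive can also be read straight from memory. The sketch below is not part of the answer above: it builds a tiny in-memory zip with made-up file names and contents to stand in for `res.content`, collects one record per file, and creates a single dataframe at the end. The names `buffer`, `records`, and `df_all` are only for this illustration.

```python
from io import BytesIO
from pathlib import PurePosixPath
from zipfile import ZipFile

import pandas as pd

# build a small in-memory archive as a stand-in for the downloaded bytes
# (file names and contents are hypothetical)
buffer = BytesIO()
with ZipFile(buffer, 'w') as zf:
    zf.writestr('dataset/sci.space/001', 'Subject: orbit\nBody text\n')
    zf.writestr('dataset/sci.space/002', 'Subject: launch\nMore text\n')
    zf.writestr('dataset/rec.autos/001', 'Subject: engine\nCar text\n')

records = []
with ZipFile(BytesIO(buffer.getvalue())) as zf:
    for info in zf.infolist():
        if info.is_dir():
            continue  # skip directory entries, keep only files
        # the parent directory name inside the archive is used as the article type
        article_type = PurePosixPath(info.filename).parts[-2].replace('.', '_')
        text = zf.read(info).decode('utf-8', errors='ignore')
        records.append({'article': text, 'type': article_type})

# one dataframe for everything, built once from the list of records
df_all = pd.DataFrame(records)
print(df_all.groupby('type').size())
```

Building the dataframe once from a list of dicts avoids creating and concatenating thousands of one-row dataframes. With real data, `BytesIO(res.content)` would take the place of `BytesIO(buffer.getvalue())`.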
