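Try this. It has been updated to work out the number of rows automatically: I take the largest value of the original DataFrame's index (the row number), which appears inside the big string.
Suppose we start from a DataFrame that has been converted to a string, using the example you provided: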
import pandas as pd

df = pd.DataFrame(columns=["really long name that goes on for a while", "another really long string", "c"] * 6,
                  data=[["some really long data", 2, 3] * 6, [4, 5, 6] * 6, [7, 8, 9] * 6])
string = str(df)
First, let's extract the column names:
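import re
import numpy as np

lst = re.split('\n', string)
# The line before the first blank line is the last data row of the first chunk.
# Its leading token is the largest index label (assumes a default 0..n-1 index).
num_rows = int(lst[lst.index('') - 1].split()[0]) + 1
lst = [i for i in lst if i != '']
# Each wrapped chunk is one header line followed by num_rows data lines.
col_names = []
for i in range(0, len(lst), num_rows + 1):
    col_names.append(lst[i])
# Column names are separated by at least two spaces, so split on wide gaps
# to keep multi-word names intact.
new_col_names = []
for i in col_names:
    new_col_names.append(re.split(r'\s{2,}', i))
final_col_names = []
for i in new_col_names:
    final_col_names += i
# Drop empty strings and the backslash that marks a wrapped line.
final_col_names = [i for i in final_col_names if i != '']
final_col_names = [i for i in final_col_names if i != '\\']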
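To see why the header is split on runs of two or more whitespace characters rather than on single spaces, here is a small standard-library sketch. The `header` literal is a hand-written stand-in for one wrapped header line (an assumption about the layout, not captured pandas output).

```python
import re

# Hand-written stand-in for one wrapped header line (assumed layout:
# names separated by two spaces, trailing backslash marking the wrap).
header = "  really long name  another really long string  c  \\"

# Splitting on single spaces breaks multi-word names into separate words.
by_single_space = [p for p in header.split(" ") if p not in ("", "\\")]

# Splitting on runs of 2+ whitespace characters keeps each name intact.
by_wide_gap = [p for p in re.split(r"\s{2,}", header.strip()) if p not in ("", "\\")]

print(by_single_space)
print(by_wide_gap)
```

This relies on column names never containing two consecutive spaces themselves.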
Then, let's get the data:
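# Remove the header lines so that only data lines remain.
for i in col_names:
    lst.remove(i)
new_lst = [re.split(r'\s{2,}', i) for i in lst]
# Drop the index label (first item) and the empty string left by the
# trailing padding on each line (last item).
new_lst = [i[1:-1] for i in new_lst]
newer_lst = []
for i in range(num_rows):
    sub_lst = []
    # Row i appears once per chunk, i.e. every num_rows data lines.
    for j in range(i, len(new_lst), num_rows):
        sub_lst += new_lst[j]
    newer_lst.append(sub_lst)
reshaped = np.reshape(newer_lst, (num_rows, len(final_col_names)))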
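Finally, we can create the reconstructed DataFrame from the data and the column names:
# All values come back as strings because they were parsed from text.
fixed_df = pd.DataFrame(data=reshaped, columns=final_col_names)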
My code does some looping, so if your original DataFrame has hundreds of thousands of rows, this approach may take a while.
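The steps above can also be folded into one pass over the wrapped chunks, which avoids deriving the row count from the index. This is only a sketch under some assumptions: `parse_wrapped_frame` is a hypothetical helper name, `sample` is a hand-written stand-in for wrapped output, the rows are not truncated with `...`, no cell is empty, and no name or value contains two consecutive spaces.

```python
import re

def parse_wrapped_frame(text):
    """Return (column_names, rows) parsed from a wrapped DataFrame string."""
    columns = []
    rows = None
    # Wrapped chunks are separated by a blank line.
    for chunk in re.split(r"\n[ \t]*\n", text.strip("\n")):
        lines = [ln for ln in chunk.split("\n") if ln.strip()]
        header, body = lines[0], lines[1:]
        # Split on wide gaps; drop empties and the wrap-marker backslash.
        columns += [p for p in re.split(r"\s{2,}", header.strip())
                    if p not in ("", "\\")]
        # Drop the leading index label from every data line.
        values = [re.split(r"\s{2,}", ln.strip())[1:] for ln in body]
        if rows is None:
            rows = [[] for _ in values]
        # Append this chunk's values to the matching row.
        for row, vals in zip(rows, values):
            row.extend(vals)
    return columns, rows

# Hand-written sample in the assumed wrapped layout (two chunks, two rows).
sample = (
    "  first name  second name  \\\n"
    "0  some data            2   \n"
    "1          4            5   \n"
    "\n"
    "   c  \n"
    "0  3  \n"
    "1  6  \n"
)

columns, rows = parse_wrapped_frame(sample)
print(columns)
print(rows)
```

The result can then be passed to `pd.DataFrame(data=rows, columns=columns)`. As with the code above, every value is still a string.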