
Python - Grouping a DataFrame based on a specific string

Python

慕尼黑的夜晚無繁華 2022-01-11 19:56:09

I am trying to combine these strings and rows according to some logic:

s1 = ['abc.txt','abc.txt','ert.txt','ert.txt','ert.txt']
s2 = [1,1,2,2,2]
s3 = ['Harry Potter','Vol 1','Lord of the Rings - Vol 1',np.nan,'Harry Potter']
df = pd.DataFrame(list(zip(s1,s2,s3)), columns=['file','id','book'])
df

Data preview:

file     id  book
abc.txt  1   Harry Potter
abc.txt  1   Vol 1
ert.txt  2   Lord of the Rings - Vol 1
ert.txt  2   NaN
ert.txt  2   Harry Potter

I have a file name column associated with an id column. I also have a 'book' column in which 'Vol 1' sits on a row of its own. I know that this 'Vol 1' is only associated with 'Harry Potter' in the given dataset. Grouping by 'file' and 'id', how can I combine 'Vol 1' into the same row where the string 'Harry Potter' appears? Note that some rows have no 'Vol 1' for Harry Potter; I only want 'Vol 1' appended when it occurs in the same file and id group.

Two attempts.

First: does not work.

if (df['book'] == 'Harry Potter' and df['book'].str.contains('Vol 1',case=False) in df.groupby(['file','id'])):
    df.groupby(['file','id'],as_index=False).first()

Second: this works, but it applies to every 'Harry Potter' string, which is not what I want.

df.loc[df['book'].str.contains('Harry Potter',case=False,na=False), 'new_book'] = 'Harry Potter - Vol 1'

This is the output I am looking for:

file     id  book
abc.txt  1   Harry Potter - Vol 1
ert.txt  2   Lord of the Rings - Vol 1
ert.txt  2   NaN
ert.txt  2   Harry Potter


3 Answers

楊__羊羊

Contributed 1943 experience points, received more than 7 likes

Start with import re (you will use it), together with the usual numpy and pandas imports.

Then create your DataFrame:

import re
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'file': ['abc.txt','abc.txt','ert.txt','ert.txt','ert.txt'],
    'id': [1, 1, 2, 2, 2],
    'book': ['Harry Potter', 'Vol 1', 'Lord of the Rings - Vol 1',
             np.nan, 'Harry Potter']})

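The first processing step is to add a column, call it book2, that holds the book value from the next row:

df["book2"] = df.book.shift(-1).fillna('')

I added fillna('') to replace NaN values with an empty string.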
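To make the effect of the shift concrete, here is a small sketch on the question's data. The variable names next_book and next_in_group are only illustrative. The second variant, which shifts inside each (file, id) group, is not part of the answer; it is an assumption about what the asker may want, so that a title never picks up a volume from a different group.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'file': ['abc.txt', 'abc.txt', 'ert.txt', 'ert.txt', 'ert.txt'],
    'id': [1, 1, 2, 2, 2],
    'book': ['Harry Potter', 'Vol 1', 'Lord of the Rings - Vol 1',
             np.nan, 'Harry Potter'],
})

# Plain shift: every row sees the book of the row below it,
# regardless of which file/id that row belongs to.
next_book = df['book'].shift(-1).fillna('')
print(next_book.tolist())
# ['Vol 1', 'Lord of the Rings - Vol 1', '', 'Harry Potter', '']

# Group-aware shift (an assumption, not in the answer): the last row
# of each (file, id) group gets '' instead of the next group's first book.
next_in_group = df.groupby(['file', 'id'])['book'].shift(-1).fillna('')
print(next_in_group.tolist())
# ['Vol 1', '', '', 'Harry Potter', '']
```

On this dataset both variants lead to the same final result, because the only 'Vol 1' row directly follows a 'Harry Potter' row in the same group.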

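Then define a function to be applied to each row:

def fn(row):
    return f"{row.book} - {row.book2}" if row.book == 'Harry Potter'\
        and re.match(r'^Vol \d+$', row.book2) else row.book

This function checks whether book == "Harry Potter" and book2 matches "Vol" followed by a sequence of digits. If so, it returns book and book2 joined with " - "; otherwise it returns book unchanged.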

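Then apply this function and save the result back into book:

df["book"] = df.apply(fn, axis=1)

All that remains is to drop:

the rows where book matches Vol \d+,

the book2 column.

The code is:

df = df.drop(df[df.book.str.match(r'^Vol \d+$').fillna(False)].index)\
    .drop(columns=['book2'])

fillna(False) is needed because str.match returns NaN for rows whose book value is NaN, and a mask containing NaN cannot be used for boolean indexing.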
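The steps above can also be collected into one function. The following is a minimal sketch, not the answerer's code: the name merge_volume_rows and its title parameter are hypothetical, it shifts within each (file, id) group rather than across the whole frame, and it drops only the volume rows that were actually merged, which is a slightly narrower rule than dropping every row matching Vol \d+.

```python
import numpy as np
import pandas as pd

def merge_volume_rows(df, title='Harry Potter'):
    """Hypothetical helper: fold a 'Vol N' row into the title row above it."""
    out = df.copy()
    # Book of the next row within the same (file, id) group; '' if none.
    nxt = out.groupby(['file', 'id'])['book'].shift(-1).fillna('')
    # Rows holding the title whose next row in the group is a volume.
    merged = out['book'].eq(title) & nxt.str.match(r'^Vol \d+$')
    out.loc[merged, 'book'] = out['book'] + ' - ' + nxt
    # The row right after a merged title is the volume row we consumed.
    consumed = merged.shift(1, fill_value=False)
    return out[~consumed].reset_index(drop=True)

df = pd.DataFrame({
    'file': ['abc.txt', 'abc.txt', 'ert.txt', 'ert.txt', 'ert.txt'],
    'id': [1, 1, 2, 2, 2],
    'book': ['Harry Potter', 'Vol 1', 'Lord of the Rings - Vol 1',
             np.nan, 'Harry Potter'],
})

result = merge_volume_rows(df)
print(result)
```

Like the original answer, this assumes the volume row comes directly after its title row.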

Answered 2022-01-11

拉莫斯之舞

Contributed 1820 experience points, received more than 10 likes

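Assuming that "Vol x" appears on the row right after its title, I would use a helper Series obtained by shifting the book column by -1. It is then enough to append that Series to the book column wherever it starts with "Vol ", and to drop the rows where the book column itself starts with "Vol ". The code could be:

b2 = df.book.shift(-1).fillna('')
df['book'] = df.book + np.where(b2.str.match('Vol [0-9]+'), ' - ' + b2, '')
print(df.drop(df.loc[df.book.fillna('').str.match('Vol [0-9]+')].index))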

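If the row order in the DataFrame is not guaranteed, but each Vol x row shares its file and id with another row, you can split the DataFrame in two, one part holding the Vol x rows and one holding the others, and update the latter from the former:

g = df.groupby(df.book.fillna('').str.match('Vol [0-9]+'))
for k, v in g:
    if k:
        df_vol = v          # rows that are only a volume label
    else:
        df = v.copy()       # all other rows; copy so it can be modified safely

for row in df_vol.iterrows():
    r = row[1]
    # append the volume to every row with the same file and id
    df.loc[(df.file == r.file)&(df.id==r.id), 'book'] += ' - ' + r['book']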
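The same split-and-update idea can be written without an explicit loop. This is a sketch under two assumptions that go beyond the answer: there is at most one Vol x row per (file, id) pair, and the volume should be appended only to 'Harry Potter' rows, as the question states. The column name vol and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'file': ['abc.txt', 'abc.txt', 'ert.txt', 'ert.txt', 'ert.txt'],
    'id': [1, 1, 2, 2, 2],
    'book': ['Harry Potter', 'Vol 1', 'Lord of the Rings - Vol 1',
             np.nan, 'Harry Potter'],
})

# Split into volume-only rows and everything else.
is_vol = df['book'].fillna('').str.match(r'Vol [0-9]+')
vols = df.loc[is_vol].rename(columns={'book': 'vol'})

# Attach each group's volume (if any) to the remaining rows.
rest = df.loc[~is_vol].merge(vols, on=['file', 'id'], how='left')

# Append the volume only where the title is 'Harry Potter'
# and the group actually has a volume row.
mask = rest['book'].eq('Harry Potter') & rest['vol'].notna()
rest.loc[mask, 'book'] = rest['book'] + ' - ' + rest['vol']

result = rest.drop(columns='vol')
print(result)
```

Because the match is made on file and id rather than on row position, this does not depend on the volume row directly following its title.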

Answered 2022-01-11

喵喵時光機

Contributed 1846 experience points, received more than 7 likes

Use merge, apply, update, and drop_duplicates.

Call set_index on file and id, then merge the 'Harry Potter' rows of df with the 'Vol 1' rows on that index; join builds the combined string, and to_frame converts the result to a DataFrame:

df.set_index(['file', 'id'], inplace=True)
df1 = df[df['book'] == 'Harry Potter'].merge(df[df['book'] == 'Vol 1'], left_index=True, right_index=True).apply(' '.join, axis=1).to_frame(name='book')

Out[2059]:
                          book
file    id
abc.txt 1   Harry Potter Vol 1

Update the original df, then call drop_duplicates and reset_index:

df.update(df1)
df.drop_duplicates().reset_index()

Out[2065]:
      file  id                       book
0  abc.txt   1         Harry Potter Vol 1
1  ert.txt   2  Lord of the Rings - Vol 1
2  ert.txt   2                        NaN
3  ert.txt   2               Harry Potter

Answered 2022-01-11
