2 Answers

Contributor with 1785 experience points · over 4 upvotes
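Does counting the sequence starts work for you? Take a cumulative sum over the rows that start a sequence (flag 2), then just blank out the ignored rows (flag 4). Like this: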
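import numpy
# True on rows that start a new sequence (flag 2)
sequence_starts = df.sequence == 2
# True on rows that should be ignored (flag 4)
sequence_ignore = df.sequence == 4
# Running count of starts; cast to float so NaN can be stored
sequence_id = sequence_starts.cumsum().astype(float)
sequence_id[sequence_ignore] = numpy.nan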
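
To see what this produces, here is a small sketch on made-up data. The sample values, the flag 1 for ordinary rows, and the column name sequence_id are assumptions for illustration; only the flags 2 (start) and 4 (ignore) come from the question. It uses Series.where rather than assigning NaN in place, which is a variant of the same idea and not a different algorithm.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: 2 starts a sequence, 4 is ignored, 1 is any other row
df = pd.DataFrame({"sequence": [2, 1, 1, 4, 2, 1, 4, 4, 2]})

sequence_starts = df["sequence"] == 2
sequence_ignore = df["sequence"] == 4

# Running count of starts, then NaN wherever the row is flagged as ignored
sequence_id = sequence_starts.cumsum().where(~sequence_ignore)

df["sequence_id"] = sequence_id
print(df)
```

Note that any rows before the first 2 would get id 0, since no sequence has started yet at that point.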

Contributor with 1829 experience points · over 6 upvotes
I can't think of anything better than the "dumb" solution of looping over the whole thing, e.g.:
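import numpy as np
# Work on the raw NumPy array; positional indexing does not depend on the DataFrame index
seq = df['sequence'].values
counter = 0
# np.float was removed from NumPy; the builtin float gives a float64 array that can hold NaN
tmp = np.empty_like(seq, dtype=float)
for i in range(len(tmp)):
    if seq[i] == 4:
        # ignored row: no sequence id
        tmp[i] = np.nan
    else:
        if seq[i] == 2:
            # a new sequence starts here
            counter += 1
        tmp[i] = counter
df['desired_Id_output'] = tmp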
Of course, this will be slow for a DataFrame with 20M rows. One way to improve on it is just-in-time compilation with numba, as follows:
import numba

@numba.njit
def foo(sequence):
    # put in appropriate modification of the above code block
    return tmp
and call it with df['sequence'].values as the argument.
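
One possible way to fill in the foo stub is sketched below; it is not necessarily what the answerer had in mind. The loop is the same as the pure-Python version, written against a plain NumPy array. The try/except fallback to an uncompiled function and the sample input are my own assumptions, added so the sketch still runs where numba is not installed.

```python
import numpy as np

try:
    import numba
    jit = numba.njit
except ImportError:
    # Assumption: fall back to plain Python so the sketch still runs without numba
    def jit(func):
        return func


@jit
def foo(sequence):
    # float64 output so ignored rows can hold NaN
    tmp = np.empty(len(sequence), dtype=np.float64)
    counter = 0
    for i in range(len(sequence)):
        if sequence[i] == 4:
            tmp[i] = np.nan
        else:
            if sequence[i] == 2:
                counter += 1
            tmp[i] = counter
    return tmp


# Hypothetical sample input; with a DataFrame this would be df['sequence'].values
sample = np.array([2, 1, 1, 4, 2, 1, 4, 4, 2])
result = foo(sample)
print(result)
```

With numba, the first call includes the compilation time, so any speedup only shows on later calls or on large inputs.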