4 Answers

Contributed 2016 experience points · received 9+ likes
It looks like you want the Jaccard distance between pairs of strings. Here is one way to do it using groupby and scipy.spatial.distance.jaccard:
from scipy.spatial.distance import jaccard

g = df.groupby(df.name.str[0])
df['diff'] = [sim for _, seqs in g.seq
              for sim in [float('nan'), jaccard(*map(list, seqs))]]
print(df)
  name  seq  diff
1   a1  bbb   NaN
2   a2  bbc   1.0
3   b1  fff   NaN
4   b2  fff   0.0
5   c1  aaa   NaN
6   c2  acg   2.0
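
Jaccard distance is normally a ratio between 0 and 1, while the diff column above holds whole counts, so it can help to see both side by side. The following is a minimal plain-Python sketch, not the answer's own method. It assumes the rows are sorted by name, that each group (the first character of name) holds exactly two non-empty sequences of equal length, and the helper names mismatch_count and pairwise_diffs are made up for illustration.

```python
from itertools import groupby

def mismatch_count(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def pairwise_diffs(rows):
    """rows: iterable of (name, seq) tuples, sorted by name.

    Returns one (count, ratio) tuple per row: (None, None) for the first
    member of each group, and the comparison result for the second.
    """
    out = []
    for _, group in groupby(rows, key=lambda r: r[0][0]):
        # Assumption: exactly two rows per group, as in the sample data.
        (_, first), (_, second) = list(group)
        count = mismatch_count(first, second)
        out.append((None, None))
        out.append((count, count / len(first)))
    return out

rows = [("a1", "bbb"), ("a2", "bbc"), ("b1", "fff"),
        ("b2", "fff"), ("c1", "aaa"), ("c2", "acg")]
result = pairwise_diffs(rows)
counts = [c for c, _ in result]
print(counts)  # [None, 1, None, 0, None, 2]
```

The counts can be assigned to a DataFrame column directly (for example df['diff'] = counts), with None showing up as a missing value.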

Contributed 1951 experience points · received 3+ likes
An alternative using Levenshtein distance:
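import Levenshtein

# Group key: the first character of each name ('a', 'b', 'c')
s = df['name'].str[0]
# Keep only the last row of each group and map its key to the group's distance;
# all other rows get NaN when assign() aligns on the index.
out = df.assign(Diff=s.drop_duplicates(keep='last').map(
    df.groupby(s)['seq'].apply(lambda x: Levenshtein.distance(x.iloc[0], x.iloc[-1]))))
print(out)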
  name  seq  Diff
1   a1  bbb   NaN
2   a2  bbc   1.0
3   b1  fff   NaN
4   b2  fff   0.0
5   c1  aaa   NaN
6   c2  acg   2.0
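
If installing the Levenshtein package is not an option, the distance itself is short to write by hand. The sketch below is a standard two-row dynamic-programming version, offered as an illustration rather than a drop-in replacement for the library; the function name levenshtein is my own. On this sample data it gives the same numbers as a position-by-position comparison, but in general the edit distance can be smaller than the positional count because it also allows insertions and deletions.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insert, delete, substitute)."""
    # prev[j] is the distance between the already-processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

pairs = [("bbb", "bbc"), ("fff", "fff"), ("aaa", "acg")]
distances = [levenshtein(a, b) for a, b in pairs]
print(distances)  # [1, 0, 2]
```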

Contributed 1865 experience points · received 7+ likes
As a first step, I recreated your data as follows:
#!/usr/bin/env python3
import pandas as pd
# Setup
data = {'name': {1: 'a1', 2: 'a2', 3: 'b1', 4: 'b2', 5: 'c1', 6: 'c2'}, 'seq': {1: 'bbb', 2: 'bbc', 3: 'fff', 4: 'fff', 5: 'aaa', 6: 'acg'}}
df = pd.DataFrame(data)
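Solution
You can iterate over the DataFrame and compare the seq value from the previous iteration with the current one. To compare two strings (stored in the seq column of the DataFrame), you can use a simple generator expression, as in this function: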
def diff_letters(a, b):
    return sum(a[i] != b[i] for i in range(len(a)))
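
Because diff_letters indexes b over range(len(a)), it raises IndexError when b is shorter than a and silently ignores trailing characters when b is longer. If the sequences might differ in length, one possible variant is sketched below; the name diff_letters_safe and the choice to count every unmatched trailing character as a difference are assumptions of mine, not part of the answer.

```python
from itertools import zip_longest

def diff_letters_safe(a, b):
    """Count differing positions between a and b.

    zip_longest pads the shorter string with None, so each extra
    character in the longer string counts as one difference.
    """
    return sum(x != y for x, y in zip_longest(a, b))

print(diff_letters_safe("bbb", "bbc"))  # 1
print(diff_letters_safe("abc", "ab"))   # 1
```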
Iterating over the DataFrame rows
diff = ['NA']
row_iterator = df.iterrows()
_, last = next(row_iterator)
# Iterate over the df and populate a list with the result of each comparison
for i, row in row_iterator:
    if i % 2 == 0:
        # even index label: second row of a pair, compare with the previous row
        diff.append(diff_letters(last['seq'], row['seq']))
    else:
        # odd index label: first row of a pair, append an NA value
        diff.append("NA")
    last = row
df['diff'] = diff
The result looks like this:
  name  seq diff
1   a1  bbb   NA
2   a2  bbc    1
3   b1  fff   NA
4   b2  fff    0
5   c1  aaa   NA
6   c2  acg    2

Contributed 1801 experience points · received 16+ likes
Check this out:
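import numpy as np
import pandas as pd

data = {'name': ['a1', 'a2', 'b1', 'b2', 'c1', 'c2'],
        'seq': ['bbb', 'bbc', 'fff', 'fff', 'aaa', 'acg']
        }
df = pd.DataFrame(data, columns=['name', 'seq'])
diffCntr = 0
df['diff'] = np.nan
i = 0
# Walk the rows two at a time: row i is the reference, row i+1 is compared to it
while i < len(df) - 1:
    diffCntr = np.nan
    item = df.at[i, 'seq']
    df.at[i, 'diff'] = diffCntr
    diffCntr = 0
    # Count the characters of the second string that do not occur in the first
    for j in df.at[i + 1, 'seq']:
        if item.find(j) < 0:
            diffCntr += 1
    df.at[i + 1, 'diff'] = diffCntr
    i += 2
df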
The result looks like this:
  name  seq  diff
0   a1  bbb   NaN
1   a2  bbc   1.0
2   b1  fff   NaN
3   b2  fff   0.0
4   c1  aaa   NaN
5   c2  acg   2.0
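
One thing to keep in mind with this approach: item.find(j) tests whether a character occurs anywhere in the first string, not whether it matches at the same position. On the sample data the two notions happen to agree, but they can diverge. The short sketch below illustrates the difference; the function names missing_chars and positional_diff are hypothetical and only serve the comparison.

```python
def missing_chars(reference, candidate):
    """Count characters of candidate that occur nowhere in reference
    (the same test as item.find(j) < 0 in the loop above)."""
    return sum(reference.find(ch) < 0 for ch in candidate)

def positional_diff(a, b):
    """Count positions at which a and b differ (strings of equal length)."""
    return sum(x != y for x, y in zip(a, b))

# Same answer on the sample pair...
print(missing_chars("aaa", "acg"), positional_diff("aaa", "acg"))  # 2 2
# ...but not when the same letters appear in a different order.
print(missing_chars("abc", "cba"), positional_diff("abc", "cba"))  # 0 2
```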