4 回答

TA貢獻1111條經驗 獲得超0個贊
import pandas as pd
df1 = pd.read_csv('1.csv')
df2 = pd.read_csv('2.csv')
out = pd.merge(df1, df2, on='customer_Email', how='left')
out.loc[~out['customer_Email'].isin(df2.drop_duplicates(subset='customer_Email', keep=False)['customer_Email'].tolist()), 'Fraud'] = None
out
給出:
customer_Email Fraud ID
0 name_0 0.0 0
1 name_1 NaN 1
2 name_1 NaN 5
3 name_2 1.0 2
4 name_3 1.0 3
5 name_4 0.0 4
6 name_1 NaN 1
7 name_1 NaN 5

TA貢獻1815條經驗 獲得超13個贊
“當 customerEmail 中存在重復項時,希望我的 Fraud 列具有空值?!?/p>
所以在你的預期輸出中你忘記添加,name_4 因為customerEmail它也是重復的
df1 = pd.DataFrame({
'customerEmail':['name0','name1','name2','name3','name4','name1'],
'Fraud':[False,True,True,True,False,False]
}
)
df2 = pd.DataFrame({
'customerEmail': ['name0', 'name1', 'name2', 'name3', 'name4', 'name1'],
'ID':[0,1,2,3,4,5]
})
df3=pd.merge(df1, df2, on='customerEmail', how='left')
#here you need to know which customers are duplicated, to fill for them rows in column Fraud
df_duplicates = df3.drop_duplicates(subset=['customerEmail'],keep='last')
print(df_duplicates)
customerEmail Fraud ID
0 name0 False 0
3 name2 True 2
4 name3 True 3
5 name4 False 4
7 name1 False 5
#now for those duplicates fill cells in column Fraud using iloc and np.nan
df_duplicates.loc[:,'Fraud'] = np.nan
print(df_duplicates)
customerEmail Fraud ID
0 name0 NaN 0
3 name2 NaN 2
4 name3 NaN 3
5 name4 NaN 4
7 name1 NaN 5
#so now you have two df's , one df_duplicates with Nans duplicates values above,
#and main df3 with original merged values
#now you need to add those df's using concat , (add column to column )
#but you dont need values with same customerEmail that you used for df_duplicated, so you can delete them using drop_duplicates
result = pd.concat([df3,df_duplicates]).drop_duplicates(subset=['customerEmail','Fraud'])
#after concat True and False values has been coverted to 1.0 and 0 , for we need to change the type to False and True again
result.Fraud = result.Fraud.astype('boolean')
print(result)
customerEmail Fraud ID
0 name0 False 0
1 name1 True 1
3 name2 True 2
4 name3 True 3
5 name4 False 4
6 name1 False 1
0 name0 <NA> 0
3 name2 <NA> 2
4 name3 <NA> 3
5 name4 <NA> 4
7 name1 <NA> 5

TA貢獻1966條經驗 獲得超4個贊
您可以使用重復函數(shù)來keep=False
獲取 df1 和/或 df2 中的重復電子郵件。
下面是對 df1 或 df2 中具有重復電子郵件的任何行添加 N/A 的代碼。
df = pd.merge(DF1, DF2, on='customerEmail', how='left')
dups_1 = set(DF1.customerEmail[DF1.customerEmail.duplicated(keep=False)])? # get duplicated emails in df1
dups_2 = set(DF2.customerEmail[DF2.customerEmail.duplicated(keep=False)])? # get duplicated emails in df2
dups = dups_1.union(dups_2)? ? # get duplicated emails in df1 or df2 (you can also use only dups_1 or only dups_2)
df["Fraud"] = df.apply(lambda row: "N/A" if row.customerEmail in dups else row.Fraud, axis=1)? # put N/A if email in dups

TA貢獻1847條經驗 獲得超11個贊
那么下面的呢?(假設customer_email在 df2 中是唯一的):
df3 = pd.merge(df1, df2, on=['customer_Email'], how="left")
df3["count"] = df3.groupby("customer_Email").cumcount()
df3.loc[df3["count"]>0,"Fraud"] = "N/A"
df3[["customer_Email","Fraud","ID"]]
輸出:
customer_Email Fraud ID
0 name_0 False 0
1 name_1 True 1
2 name_1 N/A 5
3 name_2 True 2
4 name_3 True 3
5 name_4 False 4
6 name_1 N/A 1
7 name_1 N/A 5
添加回答
舉報