2 Answers

Contributed 1836 experience points, received more than 4 upvotes
Here is a follow-up answer to your last question.
import seaborn as sns
import pandas as pd

# Load the Titanic dataset and drop every row that has a missing value
titanic = sns.load_dataset('titanic')
titanic = titanic.copy()
titanic = titanic.dropna()

# Histogram of passenger ages
titanic['age'].plot.hist(
    bins=50,
    title="Histogram of the age variable"
)
from scipy.stats import zscore

# Flag ages that lie 2.5 or more standard deviations from the mean
titanic["age_zscore"] = zscore(titanic["age"])
titanic["is_outlier"] = titanic["age_zscore"].abs() >= 2.5

# Show the flagged rows
titanic[titanic["is_outlier"]]
# Scatter plot of age against fare on the original scale
ageAndFare = titanic[["age", "fare"]]
ageAndFare.plot.scatter(x="age", y="fare")

from sklearn.preprocessing import MinMaxScaler

# Rescale both columns to [0, 1] so neither dominates the distance metric
scaler = MinMaxScaler()
ageAndFare = scaler.fit_transform(ageAndFare)
ageAndFare = pd.DataFrame(ageAndFare, columns=["age", "fare"])
ageAndFare.plot.scatter(x="age", y="fare")
from sklearn.cluster import DBSCAN

# eps is measured in the scaled [0, 1] units; tune it for your data
outlier_detection = DBSCAN(
    eps=0.5,
    metric="euclidean",
    min_samples=3,
    n_jobs=-1
)

# One cluster label per row
clusters = outlier_detection.fit_predict(ageAndFare)
clusters
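import matplotlib.pyplot as plt

# cm.get_cmap is no longer available in recent matplotlib; plt.get_cmap works
cmap = plt.get_cmap('Accent')

# Color each point by its DBSCAN cluster label
ageAndFare.plot.scatter(
    x="age",
    y="fare",
    c=clusters,
    cmap=cmap,
    colorbar=False
)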
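As a follow-up to the plot above, here is a minimal sketch of how the noise points could be pulled out of the DBSCAN labels. scikit-learn's DBSCAN assigns the label -1 to noise. The data here is synthetic (a stand-in for the age and fare columns, with two planted far-away points), and eps = 0.1 is an illustrative value I picked for this toy data, not a recommendation.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for (age, fare): one dense blob plus two planted far-away points
rng = np.random.default_rng(0)
blob = rng.normal(loc=[30.0, 50.0], scale=[5.0, 10.0], size=(200, 2))
planted = np.array([[80.0, 500.0], [2.0, 450.0]])
points = np.vstack([blob, planted])

# Rescale to [0, 1], as in the answer above
scaled = MinMaxScaler().fit_transform(points)

# eps and min_samples are illustrative values for this toy data only
labels = DBSCAN(eps=0.1, min_samples=3).fit_predict(scaled)

# DBSCAN marks noise points with the label -1
noise_mask = labels == -1
noise_points = points[noise_mask]
print(noise_points)
```

With the Titanic data you would apply the same mask to the DataFrame, for example ageAndFare[clusters == -1].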
For all the details, see this link.
https://www.mikulskibartosz.name/outlier-detection-with-scikit-learn/
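Before today I had never heard of the "Local Outlier Factor". When I googled it, what I found seemed to suggest that it is a derivative of DBSCAN. In the end, I think my first answer is actually the best way to detect outliers. DBSCAN is a clustering algorithm that happens to find outliers, which it treats as "noise". I believe DBSCAN's main purpose is clustering, not anomaly detection. In short, choosing the hyperparameters correctly takes some skill. Also, DBSCAN can be slow on very large datasets, because it implicitly needs to compute the empirical density at every sample point, which leads to quadratic worst-case time complexity.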
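For comparison, here is a minimal sketch of the Local Outlier Factor mentioned above, using scikit-learn's LocalOutlierFactor. The data is synthetic with two planted outliers, and n_neighbors = 20 is simply the library default, not a tuned value. This is only meant to show what the API looks like, not to claim it is better than the approaches above.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Synthetic data: a standard normal blob plus two planted far-away points
rng = np.random.default_rng(42)
inliers = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
planted = np.array([[8.0, 8.0], [-9.0, 7.0]])
X = np.vstack([inliers, planted])

# n_neighbors=20 is the scikit-learn default; it is not tuned here
lof = LocalOutlierFactor(n_neighbors=20)
pred = lof.fit_predict(X)  # -1 means outlier, 1 means inlier

# More negative scores indicate more anomalous points
scores = lof.negative_outlier_factor_
outlier_idx = np.where(pred == -1)[0]
print(outlier_idx)
```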

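Contributed 1802 experience points, received more than 4 upvotes
You: I want to apply k-means to eliminate any anomalies.
Actually, KMeans will not remove the anomalies; it assigns every point, outliers included, to the nearest cluster. The loss function it minimizes is the sum of squared distances from each point to its assigned cluster centroid. If you want to kick out the outliers, consider using the z-score method.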
import numpy as np
import pandas as pd
from scipy import stats

# import your data
df = pd.read_csv('C:\\Users\\your_file.csv')

# get only numerics
numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
df = df.select_dtypes(include=numerics)

# count rows in DF before kicking out records with z-score over 3
df.shape

# handle NANs (replacing them with 0 is a simplification)
df = df.fillna(0)

# keep only rows where every column has |z-score| < 3
df = df[(np.abs(stats.zscore(df)) < 3).all(axis=1)]

# count rows in DF after kicking out those records
df.shape
# the same filter on random data
df = pd.DataFrame(np.random.randn(100, 3))
df = df[(np.abs(stats.zscore(df)) < 3).all(axis=1)]
# count rows in DF after kicking out records with z-score over 3
df.shape
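To illustrate the earlier point that KMeans keeps outliers rather than removing them, here is a minimal sketch on synthetic data (two tight blobs plus one planted far-away point; all values are made up for the example). It also checks that the fitted model's inertia_ matches the sum of squared distances from each point to its assigned centroid.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data: two tight blobs and one planted far-away point
rng = np.random.default_rng(1)
a = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(50, 2))
b = rng.normal(loc=[10.0, 10.0], scale=0.5, size=(50, 2))
outlier = np.array([[30.0, 30.0]])
X = np.vstack([a, b, outlier])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# The outlier still receives an ordinary cluster label; nothing is discarded
print(km.labels_[-1])

# inertia_ is the sum of squared distances to each point's assigned centroid
assigned = km.cluster_centers_[km.labels_]
manual = ((X - assigned) ** 2).sum()
print(manual, km.inertia_)
```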
Also, when you have some free time, take a look at these links.
https://medium.com/analytics-vidhya/effect-of-outliers-on-k-means-algorithm-using-python-7ba85821ea23
https://statisticsbyjim.com/basics/outliers/