Finding K-means distances

青春有我 2022-08-16 16:33:56
I have a dataset with 13 features and 10 million rows. I want to apply k-means to remove any anomalies. My approach is to run k-means, create a new column holding the distance between each data point and its cluster centroid, plus a new column holding the mean distance, and then drop the entire row whenever its distance is greater than the mean distance. But the code I wrote doesn't seem to work.

Dataset sample: https://drive.google.com/open?id=1iB1qjnWQyvoKuN_Pa8Xk4BySzXVTwtUk

df = pd.read_csv('Final After Simple Filtering.csv',index_col=None,low_memory=True)

# Dropping columns with low feature importance
del df['AmbTemp_DegC']
del df['NacelleOrientation_Deg']
del df['MeasuredYawError']

#applying kmeans
kmeans = KMeans( n_clusters=8)
clusters= kmeans.fit_predict(df)
centroids = kmeans.cluster_centers_
distance1 = kmeans.fit_transform(df)
distance2 = distance1.mean()
df['distances']=distance1-distance2
df = df[df['distances'] >=0]
del df['distances']
df.to_csv('/content//drive/My Drive/K TEST.csv', index=False)

Error:

KeyError                                  Traceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2896             try:
-> 2897                 return self._engine.get_loc(key)
   2898             except KeyError:

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'distances'

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
9 frames
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'distances'

During handling of the above exception, another exception occurred:
2 Answers

HUH函數

Contributed 1836 experience points, received more than 4 likes

Here is a follow-up answer to your last question.


import seaborn as sns
import pandas as pd

# Load the Titanic sample dataset and drop rows with missing values
titanic = sns.load_dataset('titanic')
titanic = titanic.copy()
titanic = titanic.dropna()

# Histogram of passenger ages
titanic['age'].plot.hist(
  bins = 50,
  title = "Histogram of the age variable"
)



from scipy.stats import zscore

# Flag ages that lie 2.5 or more standard deviations from the mean
titanic["age_zscore"] = zscore(titanic["age"])
titanic["is_outlier"] = titanic["age_zscore"].abs() >= 2.5
titanic[titanic["is_outlier"]]



# Scatter plot of age against fare
ageAndFare = titanic[["age", "fare"]]
ageAndFare.plot.scatter(x = "age", y = "fare")



from sklearn.preprocessing import MinMaxScaler

# Rescale both columns to [0, 1] so that neither dominates the distance metric
scaler = MinMaxScaler()
ageAndFare = scaler.fit_transform(ageAndFare)
ageAndFare = pd.DataFrame(ageAndFare, columns = ["age", "fare"])
ageAndFare.plot.scatter(x = "age", y = "fare")



from sklearn.cluster import DBSCAN

# DBSCAN labels points that belong to no cluster as noise (label -1)
outlier_detection = DBSCAN(
  eps = 0.5,
  metric="euclidean",
  min_samples = 3,
  n_jobs = -1)
clusters = outlier_detection.fit_predict(ageAndFare)

clusters
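To turn the `clusters` array into an actual list of outliers, it may help to know that DBSCAN labels noise points `-1`. The sketch below uses synthetic data rather than the Titanic columns; the names `points`, `noise_mask`, `inliers`, and `outliers` are illustrative, and the small `eps` is an assumption chosen so the two isolated points are not absorbed into the cluster.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Synthetic stand-in for scaled two-column data (an assumption, not the Titanic data).
rng = np.random.default_rng(0)
dense = rng.normal(loc=0.5, scale=0.02, size=(200, 2))  # one tight cluster
isolated = np.array([[0.0, 1.0], [1.0, 0.0]])           # two far-away points
points = np.vstack([dense, isolated])

# eps is deliberately small here; with a large eps everything can end up in one cluster.
labels = DBSCAN(eps=0.1, min_samples=3).fit_predict(points)

# DBSCAN marks noise with the label -1.
noise_mask = labels == -1
outliers = points[noise_mask]
inliers = points[~noise_mask]
print(f"{noise_mask.sum()} noise points out of {len(points)}")
```

On data scaled to [0, 1], an `eps` of 0.5 covers a large part of the plane, so it is worth checking how many points actually receive the label `-1` before relying on the result.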



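import matplotlib.pyplot as plt

# Colour each point by its DBSCAN cluster label
# (plt.get_cmap is used because cm.get_cmap was removed in recent matplotlib releases)
cmap = plt.get_cmap('Accent')
ageAndFare.plot.scatter(
  x = "age",
  y = "fare",
  c = clusters,
  cmap = cmap,
  colorbar = False
)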

http://img1.sycdn.imooc.com//62fb569a00012f4503870264.jpg

For full details, see this link:

https://www.mikulskibartosz.name/outlier-detection-with-scikit-learn/

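Before today I had never heard of "Local Outlier Factor". When I googled it, what I found seemed to suggest that it is derived from DBSCAN. In the end, I think my first answer is actually the best way to detect outliers. DBSCAN is a clustering algorithm that happens to find outliers, which it treats as "noise". I think the main purpose of DBSCAN is clustering, not anomaly detection. Either way, choosing the hyperparameters well takes some skill. DBSCAN can also be slow on very large datasets, because it implicitly has to compute the empirical density around every sample point, which leads to quadratic worst-case time complexity.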
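For comparison, here is a minimal sketch of Local Outlier Factor using scikit-learn's `LocalOutlierFactor`. The data is synthetic and the parameter values (`n_neighbors=20`, `contamination="auto"`) are assumptions for illustration, not tuned settings for the dataset in the question.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Synthetic data (an assumption): one dense blob plus two isolated points.
rng = np.random.default_rng(0)
dense = rng.normal(loc=0.0, scale=1.0, size=(300, 2))
isolated = np.array([[8.0, 8.0], [-9.0, 7.0]])
X = np.vstack([dense, isolated])

# LOF compares each point's local density with that of its neighbours.
lof = LocalOutlierFactor(n_neighbors=20, contamination="auto")
labels = lof.fit_predict(X)             # -1 = outlier, 1 = inlier
scores = lof.negative_outlier_factor_   # more negative = more anomalous

# Keep only the rows that LOF considers inliers.
X_clean = X[labels == 1]
print(f"removed {len(X) - len(X_clean)} of {len(X)} rows")
```

Like DBSCAN, LOF relies on nearest-neighbour queries, so it may also be expensive on a dataset with millions of rows.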


Answered 2022-08-16
慕虎7371278

Contributed 1802 experience points, received more than 4 likes

You: I want to apply k-means to remove any anomalies.


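Actually, KMeans does not discard anomalies: it assigns every point, outliers included, to its nearest cluster. The loss function it minimizes is the sum of squared distances from each point to its assigned cluster centroid. If you want to remove outliers, consider using the z-score method.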


import numpy as np
import pandas as pd
from scipy import stats

# import your data
df = pd.read_csv('C:\\Users\\your_file.csv')

# get only numerics
numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
df = df.select_dtypes(include=numerics)

# count rows in DF before kicking out records with z-score over 3
df.shape

# handle NANs
df = df.fillna(0)

# keep only the rows where every column has an absolute z-score below 3
df = df[(np.abs(stats.zscore(df)) < 3).all(axis=1)]

# count rows in DF after filtering
df.shape



# the same filter applied to random data
from scipy import stats

df = pd.DataFrame(np.random.randn(100, 3))
df = df[(np.abs(stats.zscore(df)) < 3).all(axis=1)]

# count rows in DF after kicking out records with z-score over 3
df.shape
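If you would still like to try the distance-to-centroid filter described in the question, the following is a minimal sketch under stated assumptions. The `KeyError` is probably a symptom of `fit_transform` returning a 2-D array with one column per cluster (8 here), which cannot be assigned to the single column `'distances'`. Note also that the question's filter `>= 0` would keep the rows that are farther than the mean, the opposite of the stated goal. The synthetic data and the names `f1`, `f2`, `f3`, `nearest_distance`, `threshold`, and `filtered` are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic stand-in for the real data (an assumption): 3 numeric features.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(1000, 3)), columns=["f1", "f2", "f3"])

features = df.to_numpy()
kmeans = KMeans(n_clusters=8, n_init=10, random_state=0).fit(features)

# transform() returns an (n_samples, n_clusters) array of distances to every
# centroid; the row-wise minimum is the distance to the nearest centroid.
all_distances = kmeans.transform(features)
nearest_distance = all_distances.min(axis=1)

# Keep rows whose distance is at most the mean distance, as the question describes.
threshold = nearest_distance.mean()
filtered = df[nearest_distance <= threshold]
print(f"kept {len(filtered)} of {len(df)} rows")
```

Using the mean as the cutoff discards a large share of ordinary rows; a high percentile of `nearest_distance` may be a gentler threshold, although that choice is a judgment call rather than part of the original approach.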

Also, when you have some free time, take a look at these links.


https://medium.com/analytics-vidhya/effect-of-outliers-on-k-means-algorithm-using-python-7ba85821ea23


https://statisticsbyjim.com/basics/outliers/


Answered 2022-08-16