K-Means 聚類算法

標(biāo)簽：

算法

K-Means 概念定义：

K-Means 是一种基于距离的排他的聚类划分方法。

上面的 K-Means 描述中包含了几个概念：

聚类（Clustering）：K-Means 是一种聚类分析（Cluster Analysis）方法。聚类就是将数据对象分组成为多个类或者簇 (Cluster)，使得在同一个簇中的对象之间具有较高的相似度，而不同簇中的对象差别较大。
划分（Partitioning）：聚类可以基于划分，也可以基于分层。划分即将对象划分成不同的簇，而分层是将对象分等级。
排他（Exclusive）：对于一个数据对象，只能被划分到一个簇中。如果一个数据对象可以被划分到多个簇中，则称为可重叠的（Overlapping）。
距离（Distance）：基于距离的聚类是将距离近的相似的对象聚在一起。基于概率分布模型的聚类是在一组对象中，找到能符合特定分布模型的对象的集合，他们不一定是距离最近的或者最相似的，而是能完美的呈现出概率分布模型所描述的模型。

K-Means 问题描述：

给定一个 n 个对象的数据集，它可以构建数据的 k 个划分，每个划分就是一个簇，并且 k ≤ n。同时还需满足：

每个组至少包含一个对象。
每个对象必须属于且仅属于一个簇。

Simply speaking, K-Means clustering is an algorithm to classify or to group your objects based on attributes/features, into K number of groups. K is a positive integer number. The grouping is done by minimizing the sum of squares of distances between data and the corresponding cluster centroid. Thus, the purpose of K-means clustering is to classify the data.

例如，有如下包含 10 条数据的集合。集合中每项描述了一个人的身高（Height: inches）和体重（Weight: kilograms）。

Height Weight
-------------
(73.0, 72.6) 
(61.0, 54.4) 
(67.0, 99.9) 
(68.0, 97.3) 
(62.0, 59.0) 
(75.0, 81.6) 
(74.0, 77.1) 
(66.0, 97.3) 
(68.0, 93.3) 
(61.0, 59.0)

通过按照身高和体重的聚类，可以将上述 10 条数据分组成 3 类。

Height Weight
-------------
(67.0, 99.9) 
(68.0, 97.3) 
(66.0, 97.3) 
(68.0, 93.3)

(73.0, 72.6) 
(75.0, 81.6) 
(74.0, 77.1)

(61.0, 54.4) 
(62.0, 59.0) 
(61.0, 59.0)

分类结果可以描述为：中等身高并且很重、很高并且中等体重、矮并且轻。如果用图形来观察分组状况则结果一目了然。

K-Means 算法实现：

由于 K-Means 算法值针对给定的完整数据集进行操作，不需要任何特殊的训练数据，所以 K-Means 是一种无监督的机器学习方法（Unsupervised Machine Learning Technique）。

K-Means 算法最常见的实现方式是使用迭代式精化启发法的 Lloyd's algorithm。

给定划分数量 k。创建一个初始划分，从数据集中随机地选择 k 个对象，每个对象初始地代表了一个簇中心（Cluster Centroid）。对于其他对象，计算其与各个簇中心的距离，将它们划入距离最近的簇。
采用迭代的重定位技术，尝试通过对象在划分间移动来改进划分。所谓重定位技术，就是当有新的对象加入簇或者已有对象离开簇的时候，重新计算簇的平均值，然后对对象进行重新分配。这个过程不断重复，直到各簇中对象不再变化为止。

randomly assign all data items to a cluster
loop until no change in cluster assignments
compute centroids for each cluster
reassign each data item to cluster of closest centroid
end

简洁点儿的表述即为：

initialize clustering
loop
update centroids
update clustering
end loop

应用 K-Means 算法到上述身高与体重的示例，聚类过程如下图所示。

K-Means 优缺点：

当结果簇是密集的，而且簇和簇之间的区别比较明显时，K-Means 的效果较好。对于大数据集，K-Means 是相对可伸缩的和高效的，它的复杂度是 O(nkt)，n 是对象的个数，k 是簇的数目，t 是迭代的次数，通常 k << n，且 t << n，所以算法经常以局部最优结束。

K-Means 的最大问题是要求先给出 k 的个数。k 的选择一般基于经验值和多次实验结果，对于不同的数据集，k 的取值没有可借鉴性。另外，K-Means 对孤立点数据是敏感的，少量噪声数据就能对平均值造成极大的影响。

Basic K-Means - Lloyd's algorithm C# 代码实现：

Code below referenced from Machine Learning Using C# Succinctly by James McCaffrey, and article K-Means Data Clustering Using C#.

  1 using System;
  2 
  3 namespace ClusterNumeric
  4 {
  5   class ClusterNumProgram
  6   {
  7     static void Main(string[] args)
  8     {
  9       Console.WriteLine("\nBegin k-means clustering demo\n");
 10 
 11       double[][] rawData = new double[10][];
 12       rawData[0] = new double[] { 73, 72.6 };
 13       rawData[1] = new double[] { 61, 54.4 };
 14       rawData[2] = new double[] { 67, 99.9 };
 15       rawData[3] = new double[] { 68, 97.3 };
 16       rawData[4] = new double[] { 62, 59.0 };
 17       rawData[5] = new double[] { 75, 81.6 };
 18       rawData[6] = new double[] { 74, 77.1 };
 19       rawData[7] = new double[] { 66, 97.3 };
 20       rawData[8] = new double[] { 68, 93.3 };
 21       rawData[9] = new double[] { 61, 59.0 };
 22 
 23       Console.WriteLine("Raw unclustered height (in.) weight (kg.) data:\n");
 24       Console.WriteLine(" ID Height Weight");
 25       Console.WriteLine("---------------------");
 26       ShowData(rawData, 1, true, true);
 27 
 28       int numClusters = 3;
 29       Console.WriteLine("\nSetting numClusters to " + numClusters);
 30 
 31       Console.WriteLine("Starting clustering using k-means algorithm");
 32       Clusterer c = new Clusterer(numClusters);
 33       int[] clustering = c.Cluster(rawData);
 34       Console.WriteLine("Clustering complete\n");
 35 
 36       Console.WriteLine("Final clustering in internal form:\n");
 37       ShowVector(clustering, true);
 38 
 39       Console.WriteLine("Raw data by cluster:\n");
 40       Console.WriteLine(" ID Height Weight");
 41       ShowClustered(rawData, clustering, numClusters, 1);
 42 
 43       Console.WriteLine("\nEnd k-means clustering demo\n");
 44       Console.ReadLine();
 45     }
 46 
 47     static void ShowData(
 48       double[][] data, int decimals,
 49       bool indices, bool newLine)
 50     {
 51       for (int i = 0; i < data.Length; ++i)
 52       {
 53         if (indices == true)
 54           Console.Write(i.ToString().PadLeft(3) + " ");
 55 
 56         for (int j = 0; j < data[i].Length; ++j)
 57         {
 58           double v = data[i][j];
 59           Console.Write(v.ToString("F" + decimals) + "   ");
 60         }
 61 
 62         Console.WriteLine("");
 63       }
 64 
 65       if (newLine == true)
 66         Console.WriteLine("");
 67     }
 68 
 69     static void ShowVector(int[] vector, bool newLine)
 70     {
 71       for (int i = 0; i < vector.Length; ++i)
 72         Console.Write(vector[i] + " ");
 73 
 74       if (newLine == true)
 75         Console.WriteLine("\n");
 76     }
 77 
 78     static void ShowClustered(
 79       double[][] data, int[] clustering,
 80       int numClusters, int decimals)
 81     {
 82       for (int k = 0; k < numClusters; ++k)
 83       {
 84         Console.WriteLine("===================");
 85         for (int i = 0; i < data.Length; ++i)
 86         {
 87           int clusterID = clustering[i];
 88           if (clusterID != k) continue;
 89           Console.Write(i.ToString().PadLeft(3) + " ");
 90           for (int j = 0; j < data[i].Length; ++j)
 91           {
 92             double v = data[i][j];
 93             Console.Write(v.ToString("F" + decimals) + "   ");
 94           }
 95           Console.WriteLine("");
 96         }
 97         Console.WriteLine("===================");
 98       }
 99     }
100   }
101 
102   public class Clusterer
103   {
104     private int numClusters; // number of clusters 
105     private int[] clustering; // index = a tuple, value = cluster ID 
106     private double[][] centroids; // mean (vector) of each cluster 
107     private Random rnd; // for initialization 
108 
109     public Clusterer(int numClusters)
110     {
111       this.numClusters = numClusters;
112       this.centroids = new double[numClusters][];
113       this.rnd = new Random(0); // arbitrary seed 
114     }
115 
116     public int[] Cluster(double[][] data)
117     {
118       int numTuples = data.Length;
119       int numValues = data[0].Length;
120       this.clustering = new int[numTuples];
121 
122       for (int k = 0; k < numClusters; ++k) // allocate each centroid 
123         this.centroids[k] = new double[numValues];
124 
125       InitRandom(data);
126 
127       Console.WriteLine("\nInitial random clustering:");
128       for (int i = 0; i < clustering.Length; ++i)
129         Console.Write(clustering[i] + " ");
130       Console.WriteLine("\n");
131 
132       bool changed = true; // change in clustering? 
133       int maxCount = numTuples * 10; // sanity check 
134       int ct = 0;
135       while (changed == true && ct <= maxCount)
136       {
137         ++ct; // k-means typically converges very quickly 
138         UpdateCentroids(data); // no effect if fail 
139         changed = UpdateClustering(data); // no effect if fail 
140       }
141 
142       int[] result = new int[numTuples];
143       Array.Copy(this.clustering, result, clustering.Length);
144       return result;
145     }
146 
147     private void InitRandom(double[][] data)
148     {
149       int numTuples = data.Length;
150 
151       int clusterID = 0;
152       for (int i = 0; i < numTuples; ++i)
153       {
154         clustering[i] = clusterID++;
155         if (clusterID == numClusters)
156           clusterID = 0;
157       }
158       for (int i = 0; i < numTuples; ++i)
159       {
160         int r = rnd.Next(i, clustering.Length);
161         int tmp = clustering[r];
162         clustering[r] = clustering[i];
163         clustering[i] = tmp;
164       }
165     }
166 
167     private void UpdateCentroids(double[][] data)
168     {
169       int[] clusterCounts = new int[numClusters];
170       for (int i = 0; i < data.Length; ++i)
171       {
172         int clusterID = clustering[i];
173         ++clusterCounts[clusterID];
174       }
175 
176       // zero-out this.centroids so it can be used as scratch 
177       for (int k = 0; k < centroids.Length; ++k)
178         for (int j = 0; j < centroids[k].Length; ++j)
179           centroids[k][j] = 0.0;
180 
181       for (int i = 0; i < data.Length; ++i)
182       {
183         int clusterID = clustering[i];
184         for (int j = 0; j < data[i].Length; ++j)
185           centroids[clusterID][j] += data[i][j]; // accumulate sum 
186       }
187 
188       for (int k = 0; k < centroids.Length; ++k)
189         for (int j = 0; j < centroids[k].Length; ++j)
190           centroids[k][j] /= clusterCounts[k]; // danger? 
191     }
192 
193     private bool UpdateClustering(double[][] data)
194     {
195       // (re)assign each tuple to a cluster (closest centroid) 
196       // returns false if no tuple assignments change OR 
197       // if the reassignment would result in a clustering where 
198       // one or more clusters have no tuples. 
199 
200       bool changed = false; // did any tuple change cluster? 
201 
202       int[] newClustering = new int[clustering.Length]; // proposed result 
203       Array.Copy(clustering, newClustering, clustering.Length);
204 
205       double[] distances = new double[numClusters]; // from tuple to centroids
206 
207       for (int i = 0; i < data.Length; ++i) // walk through each tuple 
208       {
209         for (int k = 0; k < numClusters; ++k)
210           distances[k] = Distance(data[i], centroids[k]);
211 
212         int newClusterID = MinIndex(distances); // find closest centroid 
213         if (newClusterID != newClustering[i])
214         {
215           changed = true; // note a new clustering 
216           newClustering[i] = newClusterID; // accept update 
217         }
218       }
219 
220       if (changed == false)
221         return false; // no change so bail 
222 
223       // check proposed clustering cluster counts 
224       int[] clusterCounts = new int[numClusters];
225       for (int i = 0; i < data.Length; ++i)
226       {
227         int clusterID = newClustering[i];
228         ++clusterCounts[clusterID];
229       }
230 
231       for (int k = 0; k < numClusters; ++k)
232         if (clusterCounts[k] == 0)
233           return false; // bad clustering 
234 
235       Array.Copy(newClustering, clustering, newClustering.Length); // update 
236       return true; // good clustering and at least one change 
237     }
238 
239     // Euclidean distance between two vectors for UpdateClustering() 
240     private static double Distance(double[] tuple, double[] centroid)
241     {
242       double sumSquaredDiffs = 0.0;
243       for (int j = 0; j < tuple.Length; ++j)
244         sumSquaredDiffs += (tuple[j] - centroid[j]) * (tuple[j] - centroid[j]);
245       return Math.Sqrt(sumSquaredDiffs);
246     }
247 
248     // helper for UpdateClustering() to find closest centroid 
249     private static int MinIndex(double[] distances)
250     {
251       int indexOfMin = 0;
252       double smallDist = distances[0];
253       for (int k = 1; k < distances.Length; ++k)
254       {
255         if (distances[k] < smallDist)
256         {
257           smallDist = distances[k];
258           indexOfMin = k;
259         }
260       }
261       return indexOfMin;
262     }
263   }
264 }

运行结果如下：

點(diǎn)擊查看更多內(nèi)容

為 TA 點(diǎn)贊

若覺得本文不錯，就分享一下吧！

評論

評論

共同學(xué)習(xí)，寫下你的評論

評論加載中...

展開查看更多評論

作者其他優(yōu)質(zhì)文章

正在加載中

至尊寶的傳說

手記
篇

粉絲

82

獲贊與收藏

467

關(guān)注作者，訂閱最新文章

閱讀免費(fèi)教程

Python 算法入門教程

15個小節(jié) 29846 1149

算法入門教程

15個小節(jié) 33367 703

后端通用面試教程

41個小節(jié) 32485 366

推薦

評論

收藏

共同學(xué)習(xí)，寫下你的評論



感謝您的支持，我會繼續(xù)努力的～

掃碼打賞，你說多少就多少

贊賞金額會直接到老師賬戶

支付方式

打開微信掃一掃，即可進(jìn)行掃碼打賞哦

今天注冊有機(jī)會得

100積分直接送

付費(fèi)專欄免費(fèi)學(xué)

大額優(yōu)惠券免費(fèi)領(lǐng)

立即參與放棄機(jī)會

點(diǎn)擊
抽獎

慕課手記新用戶專享福利

恭喜你，你的運(yùn)氣太好了，居然抽中了 100個積分！

恭喜你，抽中了價值元的專欄！

太棒了，直接落到你賬戶里！

積分商城里的羅技鼠標(biāo)、機(jī)械鍵盤、
Kindle 閱讀器、小米平衡車
Apple iPad （10.2英寸）、大額優(yōu)惠券
在等著你去兌換了噢

作者：

免費(fèi)贈送

兌換碼：1111222211 復(fù)制

優(yōu)惠券可用于購買實(shí)戰(zhàn)課、體系課
無門檻使用

先去看看，有什么好東西馬上兌換我愛學(xué)習(xí)，選課去


第七色在线视频,2021少妇久久久久久久久久,亚洲欧洲精品成人久久av18,亚洲国产精品特色大片观看完整版,孙宇晨将参加特朗普的晚宴

熱搜

最近搜索清空

K-Means 聚類算法

閱讀免費(fèi)教程