K-Means 概念定义:
K-Means 是一种基于距离的排他的聚类划分方法。
上面的 K-Means 描述中包含了几个概念:
聚类(Clustering):K-Means 是一种聚类分析(Cluster Analysis)方法。聚类就是将数据对象分组成为多个类或者簇 (Cluster),使得在同一个簇中的对象之间具有较高的相似度,而不同簇中的对象差别较大。
划分(Partitioning):聚类可以基于划分,也可以基于分层。划分即将对象划分成不同的簇,而分层是将对象分等级。
排他(Exclusive):对于一个数据对象,只能被划分到一个簇中。如果一个数据对象可以被划分到多个簇中,则称为可重叠的(Overlapping)。
距离(Distance):基于距离的聚类是将距离近的相似的对象聚在一起。基于概率分布模型的聚类是在一组对象中,找到能符合特定分布模型的对象的集合,他们不一定是距离最近的或者最相似的,而是能完美的呈现出概率分布模型所描述的模型。
K-Means 问题描述:
给定一个 n 个对象的数据集,它可以构建数据的 k 个划分,每个划分就是一个簇,并且 k ≤ n。同时还需满足:
每个组至少包含一个对象。
每个对象必须属于且仅属于一个簇。
Simply speaking, K-Means clustering is an algorithm to classify or to group your objects based on attributes/features, into K number of groups. K is a positive integer number. The grouping is done by minimizing the sum of squares of distances between data and the corresponding cluster centroid. Thus, the purpose of K-means clustering is to classify the data.
例如,有如下包含 10 条数据的集合。集合中每项描述了一个人的身高(Height: inches)和体重(Weight: kilograms)。
Height Weight ------------- (73.0, 72.6) (61.0, 54.4) (67.0, 99.9) (68.0, 97.3) (62.0, 59.0) (75.0, 81.6) (74.0, 77.1) (66.0, 97.3) (68.0, 93.3) (61.0, 59.0)
通过按照身高和体重的聚类,可以将上述 10 条数据分组成 3 类。
Height Weight ------------- (67.0, 99.9) (68.0, 97.3) (66.0, 97.3) (68.0, 93.3) (73.0, 72.6) (75.0, 81.6) (74.0, 77.1) (61.0, 54.4) (62.0, 59.0) (61.0, 59.0)
分类结果可以描述为:中等身高并且很重、很高并且中等体重、矮并且轻。如果用图形来观察分组状况则结果一目了然。
K-Means 算法实现:
由于 K-Means 算法值针对给定的完整数据集进行操作,不需要任何特殊的训练数据,所以 K-Means 是一种无监督的机器学习方法(Unsupervised Machine Learning Technique)。
K-Means 算法最常见的实现方式是使用迭代式精化启发法的 Lloyd's algorithm。
给定划分数量 k。创建一个初始划分,从数据集中随机地选择 k 个对象,每个对象初始地代表了一个簇中心(Cluster Centroid)。对于其他对象,计算其与各个簇中心的距离,将它们划入距离最近的簇。
采用迭代的重定位技术,尝试通过对象在划分间移动来改进划分。所谓重定位技术,就是当有新的对象加入簇或者已有对象离开簇的时候,重新计算簇的平均值,然后对对象进行重新分配。这个过程不断重复,直到各簇中对象不再变化为止。
loop until no change in cluster assignments
compute centroids for each cluster
reassign each data item to cluster of closest centroid
end
简洁点儿的表述即为:
initialize clusteringloop
update centroids
update clustering
end loop
应用 K-Means 算法到上述身高与体重的示例,聚类过程如下图所示。
K-Means 优缺点:
当结果簇是密集的,而且簇和簇之间的区别比较明显时,K-Means 的效果较好。对于大数据集,K-Means 是相对可伸缩的和高效的,它的复杂度是 O(nkt),n 是对象的个数,k 是簇的数目,t 是迭代的次数,通常 k << n,且 t << n,所以算法经常以局部最优结束。
K-Means 的最大问题是要求先给出 k 的个数。k 的选择一般基于经验值和多次实验结果,对于不同的数据集,k 的取值没有可借鉴性。另外,K-Means 对孤立点数据是敏感的,少量噪声数据就能对平均值造成极大的影响。
Basic K-Means - Lloyd's algorithm C# 代码实现:
Code below referenced from Machine Learning Using C# Succinctly by James McCaffrey, and article K-Means Data Clustering Using C#.
1 using System;
2
3 namespace ClusterNumeric
4 {
5 class ClusterNumProgram
6 {
7 static void Main(string[] args)
8 {
9 Console.WriteLine("\nBegin k-means clustering demo\n");
10
11 double[][] rawData = new double[10][];
12 rawData[0] = new double[] { 73, 72.6 };
13 rawData[1] = new double[] { 61, 54.4 };
14 rawData[2] = new double[] { 67, 99.9 };
15 rawData[3] = new double[] { 68, 97.3 };
16 rawData[4] = new double[] { 62, 59.0 };
17 rawData[5] = new double[] { 75, 81.6 };
18 rawData[6] = new double[] { 74, 77.1 };
19 rawData[7] = new double[] { 66, 97.3 };
20 rawData[8] = new double[] { 68, 93.3 };
21 rawData[9] = new double[] { 61, 59.0 };
22
23 Console.WriteLine("Raw unclustered height (in.) weight (kg.) data:\n");
24 Console.WriteLine(" ID Height Weight");
25 Console.WriteLine("---------------------");
26 ShowData(rawData, 1, true, true);
27
28 int numClusters = 3;
29 Console.WriteLine("\nSetting numClusters to " + numClusters);
30
31 Console.WriteLine("Starting clustering using k-means algorithm");
32 Clusterer c = new Clusterer(numClusters);
33 int[] clustering = c.Cluster(rawData);
34 Console.WriteLine("Clustering complete\n");
35
36 Console.WriteLine("Final clustering in internal form:\n");
37 ShowVector(clustering, true);
38
39 Console.WriteLine("Raw data by cluster:\n");
40 Console.WriteLine(" ID Height Weight");
41 ShowClustered(rawData, clustering, numClusters, 1);
42
43 Console.WriteLine("\nEnd k-means clustering demo\n");
44 Console.ReadLine();
45 }
46
47 static void ShowData(
48 double[][] data, int decimals,
49 bool indices, bool newLine)
50 {
51 for (int i = 0; i < data.Length; ++i)
52 {
53 if (indices == true)
54 Console.Write(i.ToString().PadLeft(3) + " ");
55
56 for (int j = 0; j < data[i].Length; ++j)
57 {
58 double v = data[i][j];
59 Console.Write(v.ToString("F" + decimals) + " ");
60 }
61
62 Console.WriteLine("");
63 }
64
65 if (newLine == true)
66 Console.WriteLine("");
67 }
68
69 static void ShowVector(int[] vector, bool newLine)
70 {
71 for (int i = 0; i < vector.Length; ++i)
72 Console.Write(vector[i] + " ");
73
74 if (newLine == true)
75 Console.WriteLine("\n");
76 }
77
78 static void ShowClustered(
79 double[][] data, int[] clustering,
80 int numClusters, int decimals)
81 {
82 for (int k = 0; k < numClusters; ++k)
83 {
84 Console.WriteLine("===================");
85 for (int i = 0; i < data.Length; ++i)
86 {
87 int clusterID = clustering[i];
88 if (clusterID != k) continue;
89 Console.Write(i.ToString().PadLeft(3) + " ");
90 for (int j = 0; j < data[i].Length; ++j)
91 {
92 double v = data[i][j];
93 Console.Write(v.ToString("F" + decimals) + " ");
94 }
95 Console.WriteLine("");
96 }
97 Console.WriteLine("===================");
98 }
99 }
100 }
101
102 public class Clusterer
103 {
104 private int numClusters; // number of clusters
105 private int[] clustering; // index = a tuple, value = cluster ID
106 private double[][] centroids; // mean (vector) of each cluster
107 private Random rnd; // for initialization
108
109 public Clusterer(int numClusters)
110 {
111 this.numClusters = numClusters;
112 this.centroids = new double[numClusters][];
113 this.rnd = new Random(0); // arbitrary seed
114 }
115
116 public int[] Cluster(double[][] data)
117 {
118 int numTuples = data.Length;
119 int numValues = data[0].Length;
120 this.clustering = new int[numTuples];
121
122 for (int k = 0; k < numClusters; ++k) // allocate each centroid
123 this.centroids[k] = new double[numValues];
124
125 InitRandom(data);
126
127 Console.WriteLine("\nInitial random clustering:");
128 for (int i = 0; i < clustering.Length; ++i)
129 Console.Write(clustering[i] + " ");
130 Console.WriteLine("\n");
131
132 bool changed = true; // change in clustering?
133 int maxCount = numTuples * 10; // sanity check
134 int ct = 0;
135 while (changed == true && ct <= maxCount)
136 {
137 ++ct; // k-means typically converges very quickly
138 UpdateCentroids(data); // no effect if fail
139 changed = UpdateClustering(data); // no effect if fail
140 }
141
142 int[] result = new int[numTuples];
143 Array.Copy(this.clustering, result, clustering.Length);
144 return result;
145 }
146
147 private void InitRandom(double[][] data)
148 {
149 int numTuples = data.Length;
150
151 int clusterID = 0;
152 for (int i = 0; i < numTuples; ++i)
153 {
154 clustering[i] = clusterID++;
155 if (clusterID == numClusters)
156 clusterID = 0;
157 }
158 for (int i = 0; i < numTuples; ++i)
159 {
160 int r = rnd.Next(i, clustering.Length);
161 int tmp = clustering[r];
162 clustering[r] = clustering[i];
163 clustering[i] = tmp;
164 }
165 }
166
167 private void UpdateCentroids(double[][] data)
168 {
169 int[] clusterCounts = new int[numClusters];
170 for (int i = 0; i < data.Length; ++i)
171 {
172 int clusterID = clustering[i];
173 ++clusterCounts[clusterID];
174 }
175
176 // zero-out this.centroids so it can be used as scratch
177 for (int k = 0; k < centroids.Length; ++k)
178 for (int j = 0; j < centroids[k].Length; ++j)
179 centroids[k][j] = 0.0;
180
181 for (int i = 0; i < data.Length; ++i)
182 {
183 int clusterID = clustering[i];
184 for (int j = 0; j < data[i].Length; ++j)
185 centroids[clusterID][j] += data[i][j]; // accumulate sum
186 }
187
188 for (int k = 0; k < centroids.Length; ++k)
189 for (int j = 0; j < centroids[k].Length; ++j)
190 centroids[k][j] /= clusterCounts[k]; // danger?
191 }
192
193 private bool UpdateClustering(double[][] data)
194 {
195 // (re)assign each tuple to a cluster (closest centroid)
196 // returns false if no tuple assignments change OR
197 // if the reassignment would result in a clustering where
198 // one or more clusters have no tuples.
199
200 bool changed = false; // did any tuple change cluster?
201
202 int[] newClustering = new int[clustering.Length]; // proposed result
203 Array.Copy(clustering, newClustering, clustering.Length);
204
205 double[] distances = new double[numClusters]; // from tuple to centroids
206
207 for (int i = 0; i < data.Length; ++i) // walk through each tuple
208 {
209 for (int k = 0; k < numClusters; ++k)
210 distances[k] = Distance(data[i], centroids[k]);
211
212 int newClusterID = MinIndex(distances); // find closest centroid
213 if (newClusterID != newClustering[i])
214 {
215 changed = true; // note a new clustering
216 newClustering[i] = newClusterID; // accept update
217 }
218 }
219
220 if (changed == false)
221 return false; // no change so bail
222
223 // check proposed clustering cluster counts
224 int[] clusterCounts = new int[numClusters];
225 for (int i = 0; i < data.Length; ++i)
226 {
227 int clusterID = newClustering[i];
228 ++clusterCounts[clusterID];
229 }
230
231 for (int k = 0; k < numClusters; ++k)
232 if (clusterCounts[k] == 0)
233 return false; // bad clustering
234
235 Array.Copy(newClustering, clustering, newClustering.Length); // update
236 return true; // good clustering and at least one change
237 }
238
239 // Euclidean distance between two vectors for UpdateClustering()
240 private static double Distance(double[] tuple, double[] centroid)
241 {
242 double sumSquaredDiffs = 0.0;
243 for (int j = 0; j < tuple.Length; ++j)
244 sumSquaredDiffs += (tuple[j] - centroid[j]) * (tuple[j] - centroid[j]);
245 return Math.Sqrt(sumSquaredDiffs);
246 }
247
248 // helper for UpdateClustering() to find closest centroid
249 private static int MinIndex(double[] distances)
250 {
251 int indexOfMin = 0;
252 double smallDist = distances[0];
253 for (int k = 1; k < distances.Length; ++k)
254 {
255 if (distances[k] < smallDist)
256 {
257 smallDist = distances[k];
258 indexOfMin = k;
259 }
260 }
261 return indexOfMin;
262 }
263 }
264 }运行结果如下:
共同學(xué)習(xí),寫下你的評論
評論加載中...
作者其他優(yōu)質(zhì)文章






