"K-MEANS algorithm": meaning and origin - Chinese Encyclopedia

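Overview

The k-means algorithm takes a number k as input and partitions n data objects into k clusters such that objects within the same cluster are highly similar, while objects in different clusters have low similarity. Cluster similarity is measured against a "center object" (center of gravity) obtained as the mean of the objects in each cluster.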

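Processing flow

Basic steps of the k-means algorithm

(1) Arbitrarily select k of the n data objects as the initial cluster centers;

(2) Using the mean (center object) of each cluster, compute the distance from every object to each of these centers, and reassign each object to the cluster whose center is nearest;

(3) Recompute the mean (center object) of every cluster that has changed;

(4) Evaluate the criterion function. If a stopping condition is met, for example the function has converged, the algorithm terminates; otherwise return to step (2).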
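The four steps above can be sketched as a short MATLAB script. This is only an illustrative sketch under stated assumptions: the toy data, the fixed choice of initial centers (the text allows any k objects), the tolerance `tol`, and the iteration cap `maxIter` are all hypothetical values chosen so that the run is repeatable.

```matlab
% Toy data (hypothetical): two well-separated groups in 2-D.
X = [1.0 1.0; 1.2 0.8; 0.8 1.1; 5.0 5.0; 5.2 4.9; 4.8 5.1];
k = 2;
n = size(X, 1);

% Step (1): take k objects as the initial centers. The text picks them
% arbitrarily; fixed rows are used here so the run is repeatable.
centers = X([1 4], :);

labels = zeros(n, 1);
prevJ = Inf;
tol = 1e-9;       % assumed convergence tolerance
maxIter = 100;    % assumed safety cap on iterations

for t = 1:maxIter
    % Step (2): squared Euclidean distance from every object to every
    % center, then assign each object to the nearest center.
    D = zeros(n, k);
    for j = 1:k
        diffs = X - repmat(centers(j, :), n, 1);
        D(:, j) = sum(diffs.^2, 2);
    end
    [dmin, labels] = min(D, [], 2);

    % Step (3): recompute each center as the mean of its members.
    % An empty cluster keeps its previous center.
    for j = 1:k
        members = X(labels == j, :);
        if ~isempty(members)
            centers(j, :) = mean(members, 1);
        end
    end

    % Step (4): the criterion is the sum of squared distances to the
    % centers used for this assignment; stop once it stops improving.
    J = sum(dmin);
    if prevJ - J < tol
        break
    end
    prevJ = J;
end

disp(labels')
disp(centers)
```

On this data the first three rows end up in one cluster and the last three in the other. With less cleanly separated data the result can depend on which objects are chosen in step (1).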

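Algorithm analysis and evaluation

The k-means algorithm works as follows. First, k of the n data objects are chosen arbitrarily as the initial cluster centers. Each remaining object is then assigned to the cluster whose center it is most similar to, with similarity measured by distance. Next, the center of each resulting cluster is recomputed as the mean of all objects in that cluster. This process repeats until the criterion function converges. The mean squared error is generally used as the criterion function. The k clusters have the following property: each cluster is as compact as possible, and the clusters are as well separated from one another as possible.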
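For reference, the criterion can be written out explicitly. The notation here is an assumption and does not come from the text: C_j denotes the j-th cluster, \mu_j its mean, and x_i a data object. J is the sum of squared errors, and dividing it by n gives the mean squared error.

```latex
J = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2,
\qquad
\mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i,
\qquad
\mathrm{MSE} = \frac{J}{n}
```

Since n is fixed, minimizing J and minimizing the mean squared error select the same partition.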

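The time complexity of the algorithm is bounded above by O(n*k*t), where t is the number of iterations: each iteration computes the distance from each of the n objects to each of the k centers.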

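The k-means algorithm is an indirect clustering method based on a similarity measure between samples, and it is an unsupervised learning method. It takes k as a parameter and divides n objects into k clusters so that similarity within a cluster is high and similarity between clusters is low. Similarity is computed from the mean of the objects in a cluster, which is regarded as the cluster's center of gravity. The algorithm first randomly selects k objects, each of which represents the centroid of one cluster. Each remaining object is assigned to the most similar cluster according to its distance to each cluster centroid. The new centroid of each cluster is then computed. This process repeats until the criterion function converges.

The k-means algorithm is a fairly typical dynamic clustering algorithm with point-by-point iterative updating, and its key feature is that it uses the sum of squared errors as the criterion function. Point-by-point center update: as soon as a pixel sample is assigned to a class according to some rule, the mean of that class is recomputed, and the new mean serves as the cluster center when the next pixel is classified. Batch center update: after all pixel samples have been classified against the current set of class centers, the class means are recomputed and used as the cluster centers for the next pass.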
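The difference between the two update schemes can be illustrated with a small MATLAB sketch. The sample values are hypothetical, and the running-mean formula mu <- mu + (x - mu)/m is one common way to update a mean incrementally; the text does not say which update rule is used.

```matlab
% Hypothetical samples that all end up in one class.
samples = [2 0; 4 2; 6 4; 8 6];

% Point-by-point update: fold each sample into the class mean as soon
% as it is assigned, using mu <- mu + (x - mu) / m.
mu = zeros(1, 2);
m = 0;
for i = 1:size(samples, 1)
    m = m + 1;
    mu = mu + (samples(i, :) - mu) / m;
end

% Batch update: recompute the mean once after all samples are assigned.
muBatch = mean(samples, 1);

disp(mu)
disp(muBatch)
```

For a fixed set of members both schemes arrive at the same mean. What differs in a full clustering run is timing: with point-by-point updating, later samples are compared against centers that have already moved, so the final partition can depend on the order in which samples are visited.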

Implementation

A supplementary MATLAB implementation:

function [cid,nr,centers] = cskmeans(x,k,nc)
% CSKMEANS K-Means clustering - general method.
%
% This implements the more general k-means algorithm, where
% HMEANS is used to find the initial partition and then each
% observation is examined for further improvements in minimizing
% the within-group sum of squares.
%
% [CID,NR,CENTERS] = CSKMEANS(X,K,NC) Performs K-means
% clustering using the data given in X.
%
% INPUTS: X is the n x d matrix of data,
% where each row indicates an observation. K indicates
% the number of desired clusters. NC is a k x d matrix for the
% initial cluster centers. If NC is not specified, then the
% centers will be randomly chosen from the observations.
%
% OUTPUTS: CID provides a set of n indexes indicating cluster
% membership for each point. NR is the number of observations
% in each cluster. CENTERS is a matrix, where each row
% corresponds to a cluster center.
%
% See also CSHMEANS

% W. L. and A. R. Martinez, 9/15/01
% Computational Statistics Toolbox

warning off
[n,d] = size(x);
if nargin < 3
    % Then pick some observations to be the cluster centers.
    ind = ceil(n*rand(1,k));
    % We will add some noise to make it interesting.
    nc = x(ind,:) + randn(k,d);
end
% set up storage
% integer 1,...,k indicating cluster membership
cid = zeros(1,n);
% Make this different to get the loop started.
oldcid = ones(1,n);
% The number in each cluster.
nr = zeros(1,k);
% Set up maximum number of iterations.
maxiter = 100;
iter = 1;

while ~isequal(cid,oldcid) && iter < maxiter
    % Remember the current assignment so that the loop stops
    % once a pass leaves it unchanged.
    oldcid = cid;
    % Implement the hmeans algorithm
    % For each point, find the distance to all cluster centers
    for i = 1:n
        dist = sum((repmat(x(i,:),k,1)-nc).^2,2);
        [m,ind] = min(dist); % assign it to this cluster center
        cid(i) = ind;
    end
    % Find the new cluster centers
    for i = 1:k
        % find all points in this cluster
        ind = find(cid==i);
        % Find the number in each cluster;
        nr(i) = length(ind);
        % find the centroid; an empty cluster keeps its previous center.
        % The dimension argument keeps a one-point cluster as a row.
        if nr(i) > 0
            nc(i,:) = mean(x(ind,:),1);
        end
    end
    iter = iter + 1;
end

% Now check each observation to see if the error can be minimized some more.
% Loop through all points.
maxiter = 2;
iter = 1;
move = 1;
while iter < maxiter && move ~= 0
    move = 0;
    % Loop through all points.
    for i = 1:n
        % find the distance to all cluster centers
        dist = sum((repmat(x(i,:),k,1)-nc).^2,2);
        r = cid(i); % This is the cluster id for x
        dadj = nr./(nr+1).*dist'; % All adjusted distances
        [m,ind] = min(dadj); % minimum should be the cluster it belongs to
        if ind ~= r % if not, then move x
            cid(i) = ind;
            ic = find(cid == ind);
            nc(ind,:) = mean(x(ic,:),1);
            % Keep the counts and the center of the old cluster in sync.
            old = find(cid == r);
            nr(ind) = length(ic);
            nr(r) = length(old);
            if ~isempty(old)
                nc(r,:) = mean(x(old,:),1);
            end
            move = 1;
        end
    end
    iter = iter+1;
end

centers = nc;
if move == 0
    disp('No points were moved after the initial clustering procedure.')
else
    disp('Some points were moved after the initial clustering procedure.')
end
warning on

Entry	K-MEANS algorithm
Definition	The K-MEANS algorithm takes as input the number of clusters k and a database containing n data objects, and outputs k clusters that satisfy the minimum-variance criterion.