DATAMANIP commands


PROGRAM NAME

kkmeans - Perform K-Means Clustering

DESCRIPTION

kkmeans accepts an input data object containing vectors of equal size and performs the K-means clustering algorithm on the vectors. The length of each vector is determined by the elements (E) dimension of the input data object.

The K-means algorithm is based on minimization of the sum of the squared distances from all points in a cluster to a cluster center. The user chooses K initial cluster centers and the input vectors are iteratively distributed among the K cluster domains. New cluster centers are computed from these results, such that the sum of the squared distances from all points in a cluster to the new cluster center is minimized.
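The iteration described above can be sketched in a few lines of Python. This is only an illustration, not the kkmeans source: the function names, the rule of stopping once assignments no longer change, and leaving an empty cluster's center untouched are all assumptions made for the sketch.

```python
def squared_distance(a, b):
    """Sum of squared differences between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, centers, max_iter=50000):
    """Return (labels, centers, iterations_used) for a basic K-means run."""
    centers = [list(c) for c in centers]
    labels = None
    for iteration in range(1, max_iter + 1):
        # Assignment step: each vector goes to its nearest cluster center.
        new_labels = [
            min(range(len(centers)),
                key=lambda j: squared_distance(v, centers[j]))
            for v in vectors
        ]
        # Assumed stopping rule: no assignment changed since the last pass.
        if new_labels == labels:
            return labels, centers, iteration
        labels = new_labels
        # Update step: the mean of the members minimizes the sum of
        # squared distances to the center.
        for j in range(len(centers)):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:  # an empty cluster keeps its old center here
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers, max_iter
```

For example, kmeans([[0.0], [1.0], [10.0], [11.0]], [[0.0], [1.0]]) groups the first two and the last two vectors, with centers at 0.5 and 10.5.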

Although the K-means algorithm is not guaranteed to converge in a continuous space, it may converge in a discrete space; in either case, a practical upper limit on the number of iterations can be chosen. The user can specify the maximum number of iterations with the -n option. The default is 50000 iterations.

There are two ways to specify the initial cluster centers. If the -i2 argument is supplied, then the cluster centers are read from the specified object. The vectors are assumed to be stored along the E direction. Only the first K centers (as specified by the -k argument) will be read. If the -i2 argument is not present, then the first K vectors in the -i1 object will be used as the initial cluster centers.
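The selection rule above can be sketched as follows. The function name and arguments are hypothetical, and raising an error when fewer than K vectors are available is an assumption; this page does not say what kkmeans does in that case.

```python
def initial_centers(vectors, k, supplied=None):
    """Pick K initial centers.

    supplied stands in for the -i2 object; when it is absent, the first
    K vectors of the -i1 data (vectors) are used instead.
    """
    source = supplied if supplied is not None else vectors
    if len(source) < k:
        # Assumed behavior for the sketch only.
        raise ValueError("fewer than k vectors available for initial centers")
    # Only the first K entries are read; copies avoid aliasing the input.
    return [list(v) for v in source[:k]]
```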

Note that it is possible to specify an initial cluster center that lies far enough from all input vectors that no vectors are assigned to it during a pass of the K-means algorithm. If this happens, kkmeans will reinitialize that cluster center to the mean value of a moving pair of the existing cluster centers, thus avoiding degeneracy.
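One possible reading of the "moving pair" rule is sketched below: the empty cluster's center is replaced by the midpoint of two other centers, and the pair advances each time a reseed is needed. How kkmeans actually selects the pair is not documented here, so the function name, the pair_start counter, and the selection order are assumptions.

```python
def reseed_empty(centers, empty, pair_start):
    """Replace centers[empty] with the midpoint of two other centers.

    Returns the next pair_start so that successive calls use a different
    pair. With fewer than two other centers, the "pair" collapses to a
    single center and the reseed is not useful.
    """
    others = [j for j in range(len(centers)) if j != empty]
    a = others[pair_start % len(others)]
    b = others[(pair_start + 1) % len(others)]
    centers[empty] = [(x + y) / 2 for x, y in zip(centers[a], centers[b])]
    return pair_start + 1
```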

If no options are selected, the output object specified by the -o1 argument will contain a value segment specifying the cluster number to which each input vector was assigned. If the -map flag is also specified, then a map segment will also be generated. The final cluster centers will be stored row by row in the map. The values in the value segment can be interpreted as "pointing" to a particular row in the map where the associated cluster for that input vector can be found.
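The "pointing" relationship between the value segment and the map can be sketched as a simple row lookup. The names are hypothetical, and zero-based cluster numbers are an assumption; this page does not state the numbering base.

```python
def centers_for_vectors(labels, cluster_map):
    """Return the cluster center (map row) associated with each input vector.

    labels      : one cluster number per input vector (the value segment)
    cluster_map : one row per cluster center (the map segment)
    """
    return [cluster_map[label] for label in labels]
```

With labels [0, 2, 1, 2] and a three-row map, the second and fourth vectors both resolve to the third row of the map.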

If the -spectrum flag is specified, then the -o1 output object will contain a special map segment (regardless of the -map flag) with the additional information required for use with the spectrum program in the most general sense. This map stores not only the cluster centers, but also the number of vectors associated with each cluster and the packed upper triangle of the covariance matrix for each cluster. See the spectrum manual for additional information on how this data is used and on the additional capabilities that become available when the extra data supplied by the -spectrum flag is present.
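Because a covariance matrix is symmetric, only its upper triangle needs to be stored. The sketch below shows one way to compute and pack it. The row-by-row packing order and the division by N (rather than N-1) are assumptions; see the spectrum manual for the layout actually used.

```python
def covariance(members):
    """Covariance matrix of a list of equal-length vectors (divides by N)."""
    n = len(members)
    dim = len(members[0])
    mean = [sum(col) / n for col in zip(*members)]
    return [
        [sum((v[r] - mean[r]) * (v[c] - mean[c]) for v in members) / n
         for c in range(dim)]
        for r in range(dim)
    ]


def pack_upper(matrix):
    """Flatten the upper triangle (diagonal included), row by row."""
    dim = len(matrix)
    return [matrix[r][c] for r in range(dim) for c in range(r, dim)]
```

A vector of E elements yields E*(E+1)/2 packed values per cluster.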

The -o2 optional argument will generate an output data object containing the cluster centers (mean vectors), stored row by row in the value segment. The dimensions of the value segment will be WxHx1x1x1 where W is the number of elements in each mean vector and H is the number of clusters.

The -o3 optional argument will generate an output data object containing the cluster variances, stored row by row in the value segment. The dimensions of the value segment will be WxHx1x1x1 where W is the number of elements in each vector of variances and H is the number of clusters.

The -o4 optional argument will generate an output data object containing the cluster membership counts, stored row by row in the value segment. The dimensions of the value segment will be 1xHx1x1x1 where H is the number of clusters. The membership counts simply state the number of vectors that were present in the input object that were assigned to each of the final cluster centers.
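The quantities written to the -o2, -o3, and -o4 objects can be sketched as follows, with one row per cluster. The function name is hypothetical, and the variance normalization (division by the member count) and the zero rows for an empty cluster are assumptions.

```python
def cluster_statistics(vectors, labels, k):
    """Return (means, variances, counts), each with one entry per cluster."""
    dim = len(vectors[0])
    means, variances, counts = [], [], []
    for j in range(k):
        members = [v for v, lab in zip(vectors, labels) if lab == j]
        n = len(members)
        counts.append(n)
        if n == 0:
            # Assumed placeholder rows for a cluster with no members.
            means.append([0.0] * dim)
            variances.append([0.0] * dim)
            continue
        columns = list(zip(*members))
        mean = [sum(col) / n for col in columns]
        var = [sum((x - m) ** 2 for x in col) / n
               for col, m in zip(columns, mean)]
        means.append(mean)
        variances.append(var)
    return means, variances, counts
```

Summing one row of variances gives the trace of that cluster's covariance matrix, provided both use the same normalization.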

The statistics file (-o5) contains statistics obtained during the execution of kkmeans. This file includes the following information:

Total Number of K-means Iterations
Total Number of Clusters
Number of Vectors Per Cluster
Cluster Center Values
Cluster Center Variance Values
Trace of Covariance Matrix

Results obtained by the K-means algorithm can be influenced by the number and choice of initial cluster centers and the geometrical properties of the data.

For the -o2 and -o3 output objects, the data will be stored as type KDOUBLE. For the -o4 output object, the data will be stored as type KINT. For the -o1 output object, the value data will be stored as type KSHORT and all map data as type KDOUBLE.

kkmeans was converted from the K1.5 vkmeans program, which was written by Tom Sauer and Charlie Gage, with assistance and ideas from Dr. Don Hush, University of New Mexico, Dept. of EECE. Significant modifications were made to the algorithm by Scott Wilson during conversion to K2.

REQUIRED ARGUMENTS

-i1
type: infile
desc: input data object
-o1
type: outfile
desc: cluster number output object

OPTIONAL ARGUMENTS

-i2
type: infile
desc: cluster center input object
default: {none}
-map
type: flag
desc: generate output map
-spectrum
type: flag
desc: SPECTRUM compatible map segment
-n
type: integer
desc: max number of iterations
default: 50000
bounds: 0 < [-n] < 100000
-k
type: integer
desc: number of clusters
default: 2
bounds: value > 0
-o2
type: outfile
desc: cluster center output object
default: {none}
-o3
type: outfile
desc: cluster variance output object
default: {none}
-o4
type: outfile
desc: cluster membership count output object
default: {none}
-o5
type: outfile
desc: K-means statistics output (ASCII)
default: {none}

EXAMPLES

kkmeans -i1 image1 -n 10000 -k 6 -o1 image2 -o2 image3

This will apply the K-means clustering algorithm to image1 using the first K vectors as the initial cluster centers. The maximum number of iterations is 10000, and the number of clusters selected is 6. Image2 will contain a value segment linking each input vector to its respective cluster center, while image3 will contain the actual cluster centers.

kkmeans -i1 image1 -i2 file1 -k 8 -o1 image2 -o2 image3 -o5 file2 -map

This will apply the K-means clustering algorithm to image1 using the initial cluster centers specified in file1. The -k option specifies 8 cluster centers. An ASCII file containing the K-means statistics (file2) is created. The other output objects are as described in the previous example, except that image2 will also have a map attached containing the cluster centers.

kkmeans -i1 object1 -i2 object2 -k 8 -o1 object3 -spectrum

This will apply the K-means clustering algorithm to object1, using object2 to specify the initial cluster centers. The -k option specifies 8 cluster centers. Object3 will contain not only the mapping from vectors to clusters (in the value segment), but also an extended map segment containing the cluster centers, counts, and covariance matrices. This object can be automatically classified using the AutoClassify utilities in spectrum.

SEE ALSO

spectrum(1)

RESTRICTIONS

kkmeans will not operate on any form of COMPLEX data. Mask data is currently ignored. If map data is present, then the value data is pulled through the map prior to application of the K-means algorithm.

A maximum of 32767 clusters can be requested due to the use of the KSHORT output data type. If more clusters than this are desired, then the code can be easily modified to change the output data type to KINT.
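The 32767 limit matches the largest value of a signed 16-bit integer. The short check below illustrates this, assuming KSHORT is a signed 16-bit type (an assumption consistent with the limit quoted above, but not stated on this page).

```python
import struct

# Largest value representable in a signed 16-bit integer.
SHORT_MAX = 2 ** 15 - 1


def fits_in_short(value):
    """True if value can be packed as a signed 16-bit integer."""
    try:
        struct.pack("<h", value)
        return True
    except struct.error:
        return False
```

Here fits_in_short(32767) is true, while fits_in_short(32768) is false.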

REFERENCES

The K-means algorithm is also called the Basic Isodata algorithm in R. Duda and P. Hart, Pattern Classification and Scene Analysis, Wiley, N.Y., 1973, p. 201. ISBN 0-471-22361-1. This is a dated, but very useful reference.

COPYRIGHT

Copyright (C) 1993 - 1997, Khoral Research, Inc. ("KRI") All rights reserved.