Title: | Selection of K in K-Means Clustering |
---|---|
Description: | Selection of k in k-means clustering based on the Pham et al. paper "Selection of k in k-means clustering". |
Authors: | Daniel Rodriguez |
Maintainer: | Daniel Rodriguez <[email protected]> |
License: | GPL-3 |
Version: | 0.2.1 |
Built: | 2025-02-28 05:02:39 UTC |
Source: | https://github.com/drodriguezperez/kselection |
Selection of k in k-means clustering based on Pham et al. paper “Selection of k in k-means clustering”
This package implements the method for selecting the number of clusters for the K-means algorithm introduced by Pham, Dimov and Nguyen in their 2004 publication.
Package: | kselection |
Version: | 0.2.0 |
License: | GPL-3 |
Daniel Rodriguez [email protected]
D T Pham, S S Dimov, and C D Nguyen, "Selection of k in k-means clustering", Mechanical Engineering Science, 2004, pp. 103-119.
Get the f(K) vector.
get_f_k(obj)
obj |
the output of kselection function. |
the vector of f(K) values.
Daniel Rodriguez
num_clusters, num_clusters_all
# Create a data set with two clusters
dat <- matrix(c(rnorm(100, 2, .1), rnorm(100, 3, .1),
                rnorm(100, -2, .1), rnorm(100, -3, .1)), 200, 2)

# Get the f(K) vector
sol <- kselection(dat)
f_k <- get_f_k(sol)
Get the k_threshold
Get the threshold on f(K): if no value of f(K) falls below it, the data set
is not considered to contain more than one cluster.
get_k_threshold(obj)
obj |
the output of kselection function. |
the k_threshold value.
Daniel Rodriguez
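This topic has no example of its own. The following is a minimal sketch, assuming the kselection package is installed; it only uses functions documented in this manual and makes no claim about the printed value.

```r
library(kselection)

# Create a data set with two clusters
dat <- matrix(c(rnorm(100, 2, .1), rnorm(100, 3, .1),
                rnorm(100, -2, .1), rnorm(100, -3, .1)), 200, 2)

# Run the method with its default arguments
sol <- kselection(dat)

# Read back the threshold stored in the result object
thr <- get_k_threshold(sol)
print(thr)
```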
Selection of k in k-means clustering based on Pham et al. paper.
kselection(
  x,
  fun_cluster = stats::kmeans,
  max_centers = 15,
  k_threshold = 0.85,
  progressBar = FALSE,
  trace = FALSE,
  parallel = FALSE,
  ...
)
x |
numeric matrix of data, or an object that can be coerced to such a matrix. |
fun_cluster |
function to cluster by (e.g. kmeans). |
max_centers |
maximum number of clusters for evaluation. |
k_threshold |
maximum value of f(K) from which the existence of more than one cluster cannot be considered. |
progressBar |
show a progress bar. |
trace |
display a trace of the progress. |
parallel |
If set to true, run the clustering in parallel. A parallel backend, such as doMC, must be registered beforehand. |
... |
arguments to be passed to the kmeans method. |
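This function implements the method proposed by Pham, Dimov and Nguyen for
selecting the number of clusters for the K-means algorithm. In this method
a function $f(K)$ is used to evaluate the quality of the resulting
clustering and help decide on the optimal value of $K$ for each data set.
The $f(K)$ function is defined as

$$
f(K) = \begin{cases}
1 & \text{if } K = 1 \\
\dfrac{S_K}{\alpha_K S_{K-1}} & \text{if } S_{K-1} \neq 0,\ \forall K > 1 \\
1 & \text{if } S_{K-1} = 0,\ \forall K > 1
\end{cases}
$$

where $S_K$ is the sum of the distortions of all clusters and $\alpha_K$
is a weight factor defined as

$$
\alpha_K = \begin{cases}
1 - \dfrac{3}{4 N_d} & \text{if } K = 2 \text{ and } N_d > 1 \\
\alpha_{K-1} + \dfrac{1 - \alpha_{K-1}}{6} & \text{if } K > 2 \text{ and } N_d > 1
\end{cases}
$$

where $N_d$ is the number of dimensions of the data set.

In this definition $f(K)$ is the ratio of the real distortion to the
estimated distortion and decreases when there are areas of concentration in
the data distribution.

The values of $K$ that yield $f(K) < 0.85$ can be recommended for
clustering. If no value of $K$ yields $f(K) < 0.85$, the data set cannot be
considered to contain clusters.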
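The computation described above can be sketched in base R. This is an illustration under stated assumptions, not the package's implementation: `compute_f_k` is a hypothetical helper, the distortion for each K is taken to be `tot.withinss` from `stats::kmeans`, and the weight factor follows the definition in Pham et al. (1 - 3 / (4 * N_d) for K = 2, then the previous weight plus one sixth of its distance to 1).

```r
# Hypothetical helper: f(K) for K = 1..max_centers, following Pham et al.
# Not part of kselection; the weight factor is only defined for N_d > 1.
compute_f_k <- function(x, max_centers = 6, ...) {
  x   <- as.matrix(x)
  n_d <- ncol(x)                        # N_d: number of dimensions

  # S_K: total within-cluster distortion for each K
  s_k <- numeric(max_centers)
  for (k in seq_len(max_centers)) {
    s_k[k] <- stats::kmeans(x, centers = k, ...)$tot.withinss
  }

  f_k   <- numeric(max_centers)
  alpha <- NA_real_                     # alpha_K, defined from K = 2 onwards
  for (k in seq_len(max_centers)) {
    if (k == 1) {
      f_k[k] <- 1
      next
    }
    alpha  <- if (k == 2) 1 - 3 / (4 * n_d) else alpha + (1 - alpha) / 6
    f_k[k] <- if (s_k[k - 1] == 0) 1 else s_k[k] / (alpha * s_k[k - 1])
  }
  f_k
}

# Usage on a data set with two clusters
set.seed(1)
dat <- matrix(c(rnorm(100, 2, .1), rnorm(100, 3, .1),
                rnorm(100, -2, .1), rnorm(100, -3, .1)), 200, 2)
f_k <- compute_f_k(dat, max_centers = 6, nstart = 10)

# Candidate values of K under the default threshold of 0.85
candidates <- which(f_k < 0.85)
print(f_k)
print(candidates)
```

Extra arguments such as `nstart` are forwarded to `stats::kmeans`, mirroring how `...` is documented for `kselection`.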
an object with the results.
Daniel Rodriguez
D T Pham, S S Dimov, and C D Nguyen, "Selection of k in k-means clustering", Mechanical Engineering Science, 2004, pp. 103-119.
# Create a data set with two clusters
dat <- matrix(c(rnorm(100, 2, .1), rnorm(100, 3, .1),
                rnorm(100, -2, .1), rnorm(100, -3, .1)), 200, 2)

# Execute the method
sol <- kselection(dat)

# Get the results
k   <- num_clusters(sol) # optimal number of clusters
f_k <- get_f_k(sol)      # the f(K) vector

# Plot the results
plot(sol)

## Not run:
# Parallel
library(doMC)
registerDoMC(cores = 4)

system.time(kselection(dat, max_centers = 50, nstart = 25))
system.time(kselection(dat, max_centers = 50, nstart = 25, parallel = TRUE))
## End(Not run)
The optimal number of clusters proposed by the method.
num_clusters(obj)
obj |
the output of kselection function. |
the number of clusters proposed.
Daniel Rodriguez
# Create a data set with two clusters
dat <- matrix(c(rnorm(100, 2, .1), rnorm(100, 3, .1),
                rnorm(100, -2, .1), rnorm(100, -3, .1)), 200, 2)

# Get the optimal number of clusters
sol <- kselection(dat)
k   <- num_clusters(sol)
The numbers of clusters that could be recommended according to the method's threshold.
num_clusters_all(obj)
obj |
the output of kselection function. |
an array with the numbers of clusters that could be recommended.
Daniel Rodriguez
# Create a data set with two clusters
dat <- matrix(c(rnorm(100, 2, .1), rnorm(100, 3, .1),
                rnorm(100, -2, .1), rnorm(100, -3, .1)), 200, 2)

# Get all the numbers of clusters that could be recommended
sol <- kselection(dat)
k   <- num_clusters_all(sol)
Set the k_threshold
Set the threshold on f(K): if no value of f(K) falls below it, the data set
is not considered to contain more than one cluster.
set_k_threshold(obj, k_threshold)
obj |
the output of kselection function. |
k_threshold |
maximum value of f(K) from which the existence of more than one cluster cannot be considered. |
the output of kselection function with the new k_threshold.
Daniel Rodriguez
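This topic has no example of its own. The following is a minimal sketch, assuming the kselection package is installed. The threshold value 0.7 is arbitrary and chosen only for illustration; the updated object is reassigned because the function is documented as returning the result with the new k_threshold.

```r
library(kselection)

# Create a data set with two clusters
dat <- matrix(c(rnorm(100, 2, .1), rnorm(100, 3, .1),
                rnorm(100, -2, .1), rnorm(100, -3, .1)), 200, 2)

# Run the method with the default threshold
sol <- kselection(dat)

# Apply a stricter threshold (0.7 is an arbitrary illustrative value)
sol_strict <- set_k_threshold(sol, 0.7)

# Inspect the new threshold and the numbers of clusters it recommends
new_thr <- get_k_threshold(sol_strict)
print(new_thr)
print(num_clusters_all(sol_strict))
```

Changing the threshold on an existing result avoids rerunning the clustering when only the acceptance criterion changes.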