Title: | Leader Clustering Algorithm |
---|---|
Description: | The leader clustering algorithm provides a means for clustering a set of data points. Unlike many other clustering algorithms it does not require the user to specify the number of clusters, but instead requires the approximate radius of a cluster as its primary tuning parameter. The package provides a fast implementation of this algorithm in n-dimensions using Lp-distances (with special cases for p=1,2, and infinity) as well as for spatial data using the Haversine formula, which takes latitude/longitude pairs as inputs and clusters based on great circle distances. |
Authors: | Taylor B. Arnold |
Maintainer: | Taylor B. Arnold <[email protected]> |
License: | LGPL-2 |
Version: | 1.5 |
Built: | 2024-10-27 05:49:44 UTC |
Source: | https://github.com/statsmaths/leadercluster |
Takes a matrix of coordinates and outputs cluster ids from running the leader algorithm. The coordinates can either be on points in the space R^n, or latitude/longitude pairs. A radius delta must be provided.
leaderCluster( points, radius, weights = rep(1, nrow(points)), max_iter = 10L, distance = c("Lp", "L1", "L2", "Linf", "haversine"), p = 2 )
leaderCluster( points, radius, weights = rep(1, nrow(points)), max_iter = 10L, distance = c("Lp", "L1", "L2", "Linf", "haversine"), p = 2 )
points |
A matrix, or something which can be coerced into a
matrix, of coordinates with rows representing points
and columns representing dimensions. If using
|
radius |
A scalar value giving the radius of the resulting
clusters; this is the main tuning parameter for the
algorithm. When using the |
weights |
An vector of weights, one per row of points, to apply to the clustering algorithm. |
max_iter |
Maximum number of times to iterate the algorithm; can safely set to 1 in many instances. See Details. |
distance |
The method to be used for calculating distances between
points. If this is set to |
p |
When using |
The value for delta defines an approximate radius of each cluster. As the algorithm runs, a point within a distance delta from the centroid of a cluster will be labeled with the coorisponding cluster. As centroid clusters move, it is possible for the final radius of each cluster to be slightly larger than delta.
Unlike many other iterative clustering algorithms, the leader algorithm typically provides reasonable clusters after just a single pass. When speed is of concern, the max_iter value may be safely set to 1. However, the algorithm typically fully converges in only a few cycles; also, a convergent solution will usually have a smaller number of clusters than a solution with only one pass.
The algorithm scales nicely, and can fit a model with 100s of columns and 100k's of rows in (on a relatively modest machine) under a minute. However, the processing time decays significantly if the radius is too small, since the number of clusters will be very high.
A list containing a vector of cluster ids, a matrix of cluster centroids, the number of clusters, and the number iterations.
Taylor B. Arnold, [email protected]
J. A. Hartigan. Clustering Algorithms. John Wiley & Sons, New York, 1975.
points <- 1:10 out <- leaderCluster(points, radius=2, distance="Lp", max_iter=1L) par(mar = c(0,0,0,0), mfrow = c(1,3)) set.seed(1) points <- matrix(runif(100*2), ncol=2) for(r in c(0.1, 0.2, 0.4)) { out <- leaderCluster(points = points, radius = r, distance="L2")$cluster_id cols <- rainbow(length(unique(out)))[out] plot(points, pch = 19, cex = 0.7, col = cols, axes = FALSE) points(points[!duplicated(out),,drop=FALSE], cex = 2, col = unique(cols)) box() }
points <- 1:10 out <- leaderCluster(points, radius=2, distance="Lp", max_iter=1L) par(mar = c(0,0,0,0), mfrow = c(1,3)) set.seed(1) points <- matrix(runif(100*2), ncol=2) for(r in c(0.1, 0.2, 0.4)) { out <- leaderCluster(points = points, radius = r, distance="L2")$cluster_id cols <- rainbow(length(unique(out)))[out] plot(points, pch = 19, cex = 0.7, col = cols, axes = FALSE) points(points[!duplicated(out),,drop=FALSE], cex = 2, col = unique(cols)) box() }