Minimum sum-of-squares clustering

https://wiki.bordeaux.inria.fr/realopt/uploads/CIMINLP/hansen_CIMINLP.pdf

See the PDF, page 4, for the problem description. Find \(C\) cluster centers \(z_j\) to minimize the sum of distances from each point to its cluster center.

MSSC is similar to the problem we are trying to solve in the \(kn\) to \(k\) groups of \(n\) matching problem.  Given \(kn\) points, we would like to find the \(n\) cluster centers that minimize the distance of any point \(p_i\) to its nearest center \(z_j\) (the centroid of that cluster).   For fixed \(n\) and dimensionality \(d\) , it has been shown MSSC is solvable in \(O((N)^{nd+1})\) , for \(N\) input points.  For general \(d\) and input parameter \(n\) (as in our case), the problem is shown to be NP-hard.

MSSC, however, differs from our problem statement, since we require the size of any cluster to be size \(k\) such that the cluster center \(z_i\) provides a good representation for \(k\) points.  The points in each cluster may be divided up to form \(n\) groups, where each member comes from a unique cluster, thereby removing bias between the groups.  Also, our initial problem statement proposes a greedy method, such that the smallest clusters are found first, whereas MSSC minimizes the sizes of all clusters.

Leave a Comment

Filed under Algorithms

Leave a Reply

Your email address will not be published. Required fields are marked *