Dimensionality's Blessing:
Clustering Images by Underlying Distribution

Some image
Some image
Some image
Some image
Some image
Some image
Some image

Introduction

Many high dimensional vector distances tend to a constant. This is typically considered a negative ``contrast-loss'' phenomenon that hinders clustering and other machine learning techniques. We reinterpret ``contrast-loss'' as a blessing. Re-deriving ``contrast-loss'' using the law of large numbers, we show it results in a distribution's instances concentrating on a thin ``hyper-shell''. The hollow centre means apparently chaotically overlapping distributions are actually intrinsically separable. We use this to develop distribution-clustering, an elegant algorithm for grouping of data points by their (unknown) underlying distribution. Distribution-clustering, creates notably clean clusters from raw unlabeled data, estimates the number of clusters for itself and is inherently robust to ``outliers'' which form their own clusters. This enables trawling for patterns in unorganized data and may be key to enabling machine intelligence. Paper, Supplementary, Matlab Code, Data.

Idea

PDF cannot be displayed.

Left: Traditional view of data as overlapping spheres. Clustering such data is deemed an ill-posed problem. Right: We prove that in high dimensions, each distribution's instances form hollow rings. Clustering data by fitting ``hyper-shells'' is a well-posed problem.

High-dimensional clustering is traditional considered difficult because a ``contrast-loss'' effect causes traditional clustering metrics to lose effectiveness. We show that ``contrast-loss'' is the result of a distribution's points concentrating on the surface of a sphere. Fitting hollow ``hyper-shells'' permits fine clustering of ambiguous data.

We also show that``contrast-loss'' causes distances between instances to almost surely reflect the mean and variance of their underlying distributions and not the instance values. This ensures instances of the same distribution almost surely have identical distance to any other point. Further, this property is unique to a distribution-cluster. This permits clustering data by detecting identical rows in the affinity matrix. The process permits efficient fitting of ``hyper-shells'' and creates a distinctive pattern in the post-clustering affinity matrix.

Some image Some image

Affinity Matrix before clustering

Affinity Matrix after clustering

Results

Some image Some image
Some image Some image
Some image Some image
Some image Some image
Some image Some image
Some image Some image
Some image Some image
Some image Some image
Some image Some image
Some image Some image
Some image Some image
Some image Some image
Some image Some image
Some image Some image
Some image Some image

Links:

"RepMatch: Robust Feature Matching and Pose for Reconstructing Modern Cities," ECCV2016. Ultra-robust feature matching paper that started us on the road of statistical separability;

"GMS: Grid-based Motion Statistics for Fast, Ultra-robust Feature Correspondence," CVPR2017 Extension to real time feature matching with a statistical separability function.