CCM Colloquium: Antonietta Mira (USI) Università della Svizzera Italiana

Description

Bayesian identifications of data intrinsic dimensions

With the advent of Big Data, it is increasingly common to deal with
cases where data are defined in a high-dimensional space and little is
known a priori about their distribution. Quite often, however, the
data distribution has support on a subspace (manifold) whose
dimension, called the intrinsic dimension (ID) of the data, is much
lower than that of the embedding space. Under very weak assumptions on
the data distribution, the k-nearest-neighbor distances in the data
follow distributions which depend parametrically on the ID. This fact
was leveraged by Facco et al. (Scientific Reports, 2017) to provide an
ID estimate (TWO-NN) based on the ratio between the distances to the
first and second nearest neighbors of each point in the data. We
extended TWO-NN to the case
where the ID is not constant within the data, i.e., the distribution
has support on the union of several manifolds with different ID. This
situation may trivially occur if data sets with different ID are
merged, but, as we show, it also occurs quite naturally in several
data sets from diverse disciplines. In this case, the
nearest-neighbor distances
follow a simple mixture distribution, and within a Bayesian framework
we can robustly estimate the IDs of the manifolds, and assign each
data point to one of the manifolds. In many real-world data sets we
find widely heterogeneous dimensions, corresponding to variation in
core properties: folded vs unfolded configurations in a protein
molecular dynamics trajectory, active vs non-active regions in brain
imaging data, and firms with different financial risk in company
balance sheets.
