For an implementation with examples, see this notebook.

Diffusion maps are a non-linear dimensionality reduction algorithm. The key ideas are as follows:

  • Assume data \(x\) is sampled from \((X,\mathcal A,\mathbb P)\) according to \(\mathbb P\),
    • e.g. \(X=\mathbb R^{10000}\), but \(\mathbb P\) is supported on some low-dimensional submanifold.
  • Given a symmetric kernel (e.g. Gaussian) \(k(x,y)=k(y,x)\geq 0\), we define
    • \(d(x)=\int k(x,y)d\mathbb P(y)\) (akin to the degree of a vertex in a graph, where \(k\) would be connectivity) and
    • \(p(x,y)={k(x,y)\over d(x)}\) (akin to transition probabilities of a random walk on the data manifold).
  • Let \(K_{ij}=k(x_i,x_j)\) and \(D_{ii}=\sum_j K_{ij}\) (with zeros off the diagonal; this is an empirical estimate of \(d(x_i)\)) and define
    • \(M=D^{-1}K\) (the transition matrix of the random walk on the data; \(I-M\) is the “left (random-walk) normalized Laplacian matrix”); a minimal NumPy sketch of this construction follows the list.
  • Now \[M_{ij}^t = \sum_l\lambda_l^t\psi_l(x_i)\phi_l(x_j)=\mathbb P(X_t=x_j\mid X_0=x_i)\] is the probability that a random walk on our data reaches \(x_j\) from \(x_i\) in \(t\) steps; here \(\psi_l,\phi_l\) are the right and left eigenvectors of \(M\), respectively.
  • Define the diffusion distance (small if there are many short paths from \(x_i\) to \(x_j\) and vice versa) \[D_t(x_i,x_j)^2:=\sum_k\frac{(M^t_{ik}-M^t_{jk})^2}{\phi_0(x_k)} = \sum_{l}\lambda_l^{2t}(\psi_l(x_i)-\psi_l(x_j))^2 = |\Psi_t(x_i)-\Psi_t(x_j)|^2,\] where \(\Psi_t(x)=(\lambda_1^t\psi_1(x),\lambda_2^t\psi_2(x),\dots)\).
    • Note that we do not include \(\psi_0\), the eigenvector corresponding to eigenvalue 1: it is constant, so it contributes nothing to the distance.
  • The last equality suggests using (the first few coordinates of) \(\Psi_t(x)\) as low-dimensional embeddings, since their Euclidean distance approximates the diffusion distance on the underlying manifold; the second sketch after this list checks that identity numerically.
  • \(t\) is a free parameter; intuitively bigger \(t\) corresponds to checking similarity on larger and larger scales, ignoring the microscopic structure.
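
To make the construction concrete, here is a minimal NumPy sketch of the pipeline above (Gaussian kernel, degrees, transition matrix \(M\), eigendecomposition, embedding \(\Psi_t\)). This is not the notebook linked at the top; the function name `diffusion_map` and the parameters `eps`, `t`, `n_components` are placeholders of my choosing.

```python
import numpy as np

def diffusion_map(X, eps=1.0, t=1, n_components=2):
    """Sketch of a diffusion map: rows of X are data points, returns Psi_t."""
    # Gaussian kernel K_ij = exp(-|x_i - x_j|^2 / eps) and degrees d_i = sum_j K_ij
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-sq_dists / eps)
    d = K.sum(axis=1)

    # M = D^{-1} K is similar to the symmetric matrix S = D^{-1/2} K D^{-1/2},
    # so we diagonalize S (real eigenvalues, orthonormal eigenvectors) and map
    # its eigenvectors back to right eigenvectors of M via psi = D^{-1/2} v.
    S = K / np.sqrt(np.outer(d, d))
    lam, V = np.linalg.eigh(S)
    order = np.argsort(lam)[::-1]          # descending: lambda_0 = 1 comes first
    lam, V = lam[order], V[:, order]
    psi = V / np.sqrt(d)[:, None]

    # Embedding Psi_t(x_i) = (lambda_1^t psi_1(x_i), ..., lambda_k^t psi_k(x_i)),
    # skipping the constant eigenvector psi_0 (eigenvalue 1).
    return (lam[1:n_components + 1] ** t) * psi[:, 1:n_components + 1]
```

Increasing `eps` connects more distant points, and increasing `t` downweights the smaller eigenvalues and hence, as in the last bullet, the finer-scale structure.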
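
The identity behind the last two equalities in the definition of \(D_t\) can also be checked numerically. The snippet below (toy data and bandwidth chosen arbitrarily) computes the diffusion distance once from the definition, i.e. from the rows of \(M^t\) weighted by the stationary distribution \(\phi_0\), and once as the Euclidean distance between the full (untruncated) diffusion coordinates; the two agree up to floating-point error when the \(\psi_l\) are normalized against \(\phi_0\), as done here.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))              # toy data: 60 points in R^3
t = 3

sq = np.sum((X[:, None] - X[None, :]) ** 2, axis=-1)
K = np.exp(-sq / 2.0)
d = K.sum(axis=1)
pi = d / d.sum()                          # stationary distribution phi_0
M = K / d[:, None]

# Right eigenvectors of M, normalized so that sum_k pi_k * psi_l(x_k)^2 = 1
S = K / np.sqrt(np.outer(d, d))
lam, V = np.linalg.eigh(S)
psi = V / np.sqrt(pi)[:, None]

# Diffusion distance from the definition ...
Mt = np.linalg.matrix_power(M, t)
i, j = 0, 1
lhs = np.sum((Mt[i] - Mt[j]) ** 2 / pi)

# ... and from the (full-spectrum) embedding Psi_t; the constant psi_0
# cancels in the difference, so including it changes nothing.
Psi_t = (lam ** t) * psi
rhs = np.sum((Psi_t[i] - Psi_t[j]) ** 2)

print(np.isclose(lhs, rhs))               # True
```

Keeping only the first few coordinates turns this equality into an approximation, which is exactly the dimensionality reduction step.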