What is it

  • Propose a new algorithm
  • Compute a nonparametric maximum likelihood estimator (NPMLE) for a Gaussian mixture model
  • The catch: there is a continuum of Gaussians

Set up

  • Given: \(N\) i.i.d. samples \(X_{i}\sim \mathcal{D}\), where \(X_{i}\in\mathbb{R}^d\)
  • Estimate: the density of the distribution \(\mathcal{D}\)
  • Assume \(\mathcal{D}= \rho^\star * \mathcal{N}(0,I_{d})\) is an isotropic Gaussian mixture with density \[ (\rho^\star * \phi) (x) = \int _{\mathbb{R}^d}\phi(x-y)\rho^\star(y) \, dy \] for a standard Gaussian density \(\phi(x) = \frac{1}{(2\pi)^{d/2}} \exp(-\lVert x \rVert_{2}^2/2)\)
  • \(\rho^\star\) is the unknown mixing distribution: a probability measure on \(\mathbb{R}^d\), written above as a density \(\rho^\star:\mathbb{R}^d\to \mathbb{R}_{+}\) when it has one (see the sketch below)
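
A minimal numpy sketch of this generative model, using a made-up discrete mixing measure purely for illustration (the atoms, weights, and sample size below are arbitrary choices, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical mixing measure: rho* puts mass (0.3, 0.5, 0.2) on three atoms in R^2.
atoms = np.array([[-2.0, 0.0], [0.0, 1.0], [3.0, -1.0]])
probs = np.array([0.3, 0.5, 0.2])

# Sample X_i ~ D = rho* * N(0, I_d): draw a center from rho*, then add standard Gaussian noise.
N, d = 1000, 2
centers = atoms[rng.choice(len(atoms), size=N, p=probs)]
X = centers + rng.standard_normal((N, d))

# Density of D at a point x: (rho* * phi)(x) = sum_k probs_k * phi(x - atom_k).
def density(x):
    sq = ((x - atoms) ** 2).sum(axis=1)
    return (probs * np.exp(-0.5 * sq) / (2 * np.pi) ** (d / 2)).sum()
```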

Connection to things we’ve seen

Kernel Density Estimation

  • Given \(N\) i.i.d. samples \(X_{i}\sim \mathcal{D}\), we estimate \(\mathcal{D}\)’s density function \(f\) as \[f(x)=\frac{1}{N}\sum_{i=1}^N \frac{1}{h^d} k\left( \frac{x-X_{i}}{h} \right)\] for some kernel function \(k:\mathbb{R}^d \to \mathbb{R}\) and bandwidth \(h>0\).
  • If \(k=\phi\), a standard Gaussian, and we fix the bandwidth \(h=1\), we get a special case of an isotropic Gaussian mixture where \[\rho^\star= \frac{1}{N} \sum_{i=1}^N\delta_{X_{i}}\]
  • Rather than being limited to one “particle” at each sample point, we now allow a continuum of particles
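
To make the equivalence concrete, here is a small numpy sketch: a Gaussian KDE with bandwidth \(h=1\) is exactly the isotropic mixture whose mixing measure is the empirical distribution of the samples (sample size and evaluation point are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 1))                    # N samples, d = 1

def kde(x, X, h=1.0):
    """Gaussian KDE: (1/N) sum_i h^{-d} phi((x - X_i) / h)."""
    N, d = X.shape
    sq = ((x - X) ** 2).sum(axis=1) / h ** 2
    return (np.exp(-0.5 * sq) / (2 * np.pi) ** (d / 2)).mean() / h ** d

# With h = 1 this is the mixture (rho * phi)(x) with rho = (1/N) sum_i delta_{X_i}.
print(kde(np.zeros(1), X))
```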

Examples

  • \(\rho^\star = \phi\): the convolution of the 1D standard normal density with itself, i.e. the \(\mathcal{N}(0,2)\) density \[(\rho^\star*\phi)(x) = \frac{1}{2\sqrt{ \pi }}\exp(-x^2 / 4)\]
  • \(\rho^\star(x)=\frac{1}{10}\) for \(0\leq x\leq 10\), and \(0\) otherwise. Uniform distribution on \([0,10]\). \[(\rho^\star*\phi)(x) = \frac{1}{20}\left( \text{erf}\left( x/\sqrt{ 2 } \right) - \text{erf}\left( (x-10)/\sqrt{ 2 } \right) \right)\]
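
A quick Monte Carlo sanity check of both closed forms (the evaluation point and sample count below are arbitrary, and this check is not part of the paper):

```python
import numpy as np
from math import erf, exp, pi, sqrt

rng = np.random.default_rng(0)
x = 1.7                                              # arbitrary evaluation point

# (rho* * phi)(x) = E_{Y ~ rho*}[phi(x - Y)], estimated by Monte Carlo.
# Example 1: rho* = phi, so the convolution is the N(0, 2) density.
mc1 = np.exp(-0.5 * (x - rng.standard_normal(10 ** 6)) ** 2).mean() / sqrt(2 * pi)
print(mc1, exp(-x ** 2 / 4) / (2 * sqrt(pi)))

# Example 2: rho* uniform on [0, 10].
mc2 = np.exp(-0.5 * (x - rng.uniform(0, 10, 10 ** 6)) ** 2).mean() / sqrt(2 * pi)
print(mc2, (erf(x / sqrt(2)) - erf((x - 10) / sqrt(2))) / 20)
```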

Barron Spaces

  • This idea of going from discrete to continuum is not new
  • Take a two-layer neural network with a finite number of nodes \(N\): \[ f(x)=\frac{1}{N}\sum_{i=1}^N a_{i}\sigma(\left\langle b_{i}, x \right\rangle+c_{i})\]
  • Turn it into a continuum \[f(x)=\int a\sigma(\left\langle b, x \right\rangle+c) \, d\mu(a,b,c) \]
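
A small sketch of the same discrete-to-continuum idea for networks: the finite-width network below is a Monte Carlo average over randomly drawn parameters, and the continuum (Barron) version replaces that average by an integral against a measure \(\mu\) on \((a,b,c)\). The ReLU activation and the Gaussian parameter draws are my choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
d, N_nodes = 5, 1000

# Finite-width two-layer network: f(x) = (1/N) sum_i a_i sigma(<b_i, x> + c_i).
a = rng.standard_normal(N_nodes)
b = rng.standard_normal((N_nodes, d))
c = rng.standard_normal(N_nodes)
sigma = lambda z: np.maximum(z, 0.0)                 # ReLU, chosen for illustration

def f(x):
    return (a * sigma(b @ x + c)).mean()

# The continuum version is f(x) = \int a sigma(<b, x> + c) d mu(a, b, c);
# with (a_i, b_i, c_i) drawn i.i.d. from mu, the finite sum is a Monte Carlo estimate of it.
print(f(np.ones(d)))
```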

Learning the Gaussian Mixture

Maximum Likelihood Estimation

  • Estimate \(\rho^\star\) with \[\hat{\rho} \in \mathrm{arg}\,\min_{\rho \in \mathcal{P}(\mathbb{R}^d)}\ell_{N}(\rho)\] where \[\ell_{N}(\rho)=-\frac{1}{N}\sum_{i=1}^N \log\left( (\rho * \phi)(X_{i}) \right)\]
  • Although this is a convex problem (\(\ell_{N}\) is convex in \(\rho\)), it is infinite-dimensional
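
The objective is easy to evaluate once \(\rho\) is restricted to finitely many atoms. A minimal sketch (the helper name and the discrete parametrization \(\rho=\sum_j \omega_{j}\delta_{\mu_{j}}\) anticipate the algorithm later in these notes):

```python
import numpy as np

def nll(X, mu, w):
    """ell_N(rho) = -(1/N) sum_i log((rho * phi)(X_i)) for rho = sum_j w_j delta_{mu_j}.
    X: (N, d) samples, mu: (m, d) atom locations, w: (m,) nonnegative weights summing to 1."""
    d = X.shape[1]
    sq = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)     # ||X_i - mu_j||^2, shape (N, m)
    K = np.exp(-0.5 * sq) / (2 * np.pi) ** (d / 2)           # phi(X_i - mu_j)
    return -np.mean(np.log(K @ w))
```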

Optimality: Need a derivative

  • Define the first variation of \(\ell\) as any function \(\delta \ell(\rho):\mathbb{R}^d \to \mathbb{R}\) that satisfies \[\lim_{ \epsilon \to 0 } \frac{{\ell(\rho+\epsilon \nu) - \ell(\rho)}}{\epsilon}=\int \delta \ell(\rho) \, d\nu \] for all signed measures \(\nu\) such that \(\int d\nu = 0\).
  • For our function \(\ell_{N}\), the first variation is \[\delta \ell_{N}(\rho): x \mapsto - \frac{1}{N}\sum_{i=1}^N \frac{{\phi(x-X_{i})}}{(\rho * \phi)(X_{i})}\]
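
A direct transcription of this formula for a finitely supported \(\rho\), again a sketch with a hypothetical helper name:

```python
import numpy as np

def first_variation(x_grid, X, mu, w):
    """delta ell_N(rho)(x) = -(1/N) sum_i phi(x - X_i) / (rho * phi)(X_i),
    evaluated at each row of x_grid, for rho = sum_j w_j delta_{mu_j}."""
    d = X.shape[1]
    phi = lambda sq: np.exp(-0.5 * sq) / (2 * np.pi) ** (d / 2)
    dens = phi(((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)) @ w    # (rho * phi)(X_i), shape (N,)
    Kx = phi(((x_grid[:, None, :] - X[None, :, :]) ** 2).sum(-1))      # phi(x - X_i), shape (G, N)
    return -(Kx / dens[None, :]).mean(axis=1)
```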

Optimality: Conditions (Theorem 1)

  1. Existence: \(\hat{\rho}\) exists! \[\hat{\rho} \in \mathrm{arg}\,\min_{\rho \in \mathcal{P}(\mathbb{R}^d)}\ell_{N}(\rho)\]
  2. Condition: A distribution is an NPMLE iff
    1. \(\delta \ell_{N}(\hat{\rho})(x) + 1\geq 0\) for all \(x \in \mathbb{R}^d\)
    2. \(\delta \ell_{N}(\hat{\rho})(x) + 1 = 0\) for \(\hat{\rho}\)-almost every \(x\) (i.e., wherever \(\hat{\rho}\) puts mass)

Wasserstein-Fisher-Rao gradient descent

  • Above my head
  • But they have a continuous gradient descent scheme (gradient flow) for \(\rho\) given some initialization
  • As stated it is a continuous-time flow, not a practical algorithm on a digital computer
  • But we can discretize this scheme!

(Discretized) Algorithm

  • Discretize: represent \(\rho\) by \(m\) movable particles \(\mu_{j}\in\mathbb{R}^d\) with weights \(\omega_{j}\geq 0\), \(\sum_{j}\omega_{j}=1\), i.e. \(\rho=\sum_{j=1}^m \omega_{j}\,\delta_{\mu_{j}}\)
  • This is like fitting a KDE to the empirical distribution, but with adaptive heights and locations
  • Alternate between updating the locations and the weights (a sketch follows below)
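
A minimal numpy sketch of one such alternating update: a gradient step on the particle locations (the Wasserstein part) and a multiplicative step on the weights (the Fisher-Rao part). The step sizes, the exact form of the weight update, and the function name are my assumptions, not the paper’s exact scheme:

```python
import numpy as np

def wfr_step(X, mu, w, eta_mu=0.1, eta_w=0.5, move=True, reweight=True):
    """One alternating update for rho = sum_j w_j delta_{mu_j} (a sketch, not the paper's exact scheme)."""
    N, d = X.shape
    diff = X[:, None, :] - mu[None, :, :]                            # X_i - mu_j, shape (N, m, d)
    K = np.exp(-0.5 * (diff ** 2).sum(-1)) / (2 * np.pi) ** (d / 2)  # phi(X_i - mu_j)
    dens = K @ w                                                     # (rho * phi)(X_i)
    R = K / dens[:, None]
    if move:
        # Wasserstein part: move each particle along -grad delta ell_N(rho)(mu_j),
        # where grad delta ell_N(rho)(mu_j) = -(1/N) sum_i R_ij (X_i - mu_j).
        grad = -(R[:, :, None] * diff).mean(axis=0)                  # shape (m, d)
        mu = mu - eta_mu * grad
        diff = X[:, None, :] - mu[None, :, :]
        K = np.exp(-0.5 * (diff ** 2).sum(-1)) / (2 * np.pi) ** (d / 2)
        dens = K @ w
        R = K / dens[:, None]
    if reweight:
        # Fisher-Rao part: multiplicative update driven by delta ell_N(rho)(mu_j) + 1,
        # which is zero on the support at optimality (Theorem 1).
        fv = -R.mean(axis=0)                                         # delta ell_N(rho)(mu_j)
        w = w * np.exp(-eta_w * (fv + 1))
        w = w / w.sum()
    return mu, w
```

The `move` / `reweight` flags are only there so the same sketch can also run the two ablations compared in the experiments below.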

(Discretized) Algorithm

  • If this converges, then it converges to the NPMLE.
  • But convergence itself is not guaranteed

Experiments

Example distributions to learn

\[ \rho^\star=\frac{1}{3}(\delta_{-1}+\delta_{1}+\delta_{10}) \]

\[ \rho^\star=\frac{1}{\sqrt{ 2\pi }}\exp\left( -x^2 / 2 \right) \]

Does it converge?

  • Will compare Wasserstein-Fisher-Rao to two modifications
  • Fisher-Rao: Initialize the particles randomly \(\mu_{j}\sim \mathcal{U}\{ X_{i} \}_{i=1}^N\) and fix them, only learn the weights \(\omega_{j}\)
  • Wasserstein: Fix the weights to \(\omega_{j}=\frac{1}{N}\), and only learn the locations \(\mu_{j}\)
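
A sketch of that comparison, reusing the hypothetical `wfr_step` and `nll` helpers from above; the data, particle count, iteration budget, and step sizes are placeholder values, not the paper’s settings:

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder data: rho* = (1/3)(delta_{-1} + delta_1 + delta_10) convolved with N(0, 1).
N, T = 500, 1000
X = rng.choice([-1.0, 1.0, 10.0], size=N)[:, None] + rng.standard_normal((N, 1))

def run(m, move=True, reweight=True):
    mu = X[rng.choice(N, size=m, replace=False)].copy()  # mu_j drawn uniformly from the samples
    w = np.full(m, 1.0 / m)                              # uniform initial weights
    for _ in range(T):
        mu, w = wfr_step(X, mu, w, move=move, reweight=reweight)
    return mu, w

# Wasserstein-Fisher-Rao vs. the two modifications above.
for name, move, reweight in [("WFR", True, True), ("Fisher-Rao", False, True), ("Wasserstein", True, False)]:
    mu_hat, w_hat = run(m=100, move=move, reweight=reweight)
    print(name, nll(X, mu_hat, w_hat))
```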

Does it converge?

How many particles do you need?

  • Fix the number of iterations (\(T=1000\)) and observe the loss
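
A sketch of that experiment, reusing the hypothetical `run` and `nll` helpers above with the iteration budget held at \(T=1000\); the particle counts are placeholders:

```python
# Fixed iteration budget, varying number of particles m (placeholder values).
for m in [10, 30, 100, 300, 500]:
    mu_hat, w_hat = run(m)
    print(m, nll(X, mu_hat, w_hat))
```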

How do we know we are optimal?

  • The first variation \(\delta \ell_{N}(\hat{\rho})\) should equal \(-1\) on the support of \(\hat{\rho}\) (and be \(\geq -1\) everywhere), per Theorem 1
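
A sketch of that check, reusing the hypothetical `run` and `first_variation` helpers above (the grid range and the support threshold are arbitrary choices):

```python
import numpy as np

mu_hat, w_hat = run(m=100)                                   # fitted particles and weights (full WFR)
grid = np.linspace(-5.0, 15.0, 400)[:, None]

fv_grid = first_variation(grid, X, mu_hat, w_hat)
fv_supp = first_variation(mu_hat[w_hat > 1e-6], X, mu_hat, w_hat)

print("min over grid of delta ell_N + 1:", (fv_grid + 1).min())            # should be >= 0 (approximately)
print("max |delta ell_N + 1| on the support:", np.abs(fv_supp + 1).max())  # should be close to 0
```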

References