What is it

  • Propose a new algorithm
  • Compute a nonparametric maximum likelihood estimator (NPMLE) for a Gaussian mixture model
  • The catch: there is a continuum of Gaussians

Set up

  • Given: \(N\) i.i.d. samples \(X_{i}\sim \mathcal{D}\), where \(X_{i}\in\mathbb{R}^d\)
  • Estimate: the density of the distribution \(\mathcal{D}\)
  • Assume \(\mathcal{D}= \rho^\star * \mathcal{N}(0,I_{d})\) is an isotropic Gaussian mixture with density \[ (\rho^\star * \phi) (x) = \int _{\mathbb{R}^d}\phi(x-y)\rho^\star(y) \, dy \] for a standard Gaussian density \(\phi(x) = \frac{1}{(2\pi)^{d/2}} \exp(-\lVert x \rVert_{2}^2/2)\)
  • \(\rho^\star\) is the unknown mixing distribution: a probability measure on \(\mathbb{R}^d\), written above as a density \(\rho^\star:\mathbb{R}^d\to \mathbb{R}_{+}\) when it has one (see the sketch below)
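
A minimal numpy sketch of this generative model, using a made-up discrete mixing measure purely for illustration (the atoms, weights, and sample size below are arbitrary choices, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical mixing measure: rho* puts mass (0.3, 0.5, 0.2) on three atoms in R^2.
atoms = np.array([[-2.0, 0.0], [0.0, 1.0], [3.0, -1.0]])
probs = np.array([0.3, 0.5, 0.2])

# Sample X_i ~ D = rho* * N(0, I_d): draw a center from rho*, then add standard Gaussian noise.
N, d = 1000, 2
centers = atoms[rng.choice(len(atoms), size=N, p=probs)]
X = centers + rng.standard_normal((N, d))

# Density of D at a point x: (rho* * phi)(x) = sum_k probs_k * phi(x - atom_k).
def density(x):
    sq = ((x - atoms) ** 2).sum(axis=1)
    return (probs * np.exp(-0.5 * sq) / (2 * np.pi) ** (d / 2)).sum()
```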

Connection to things we’ve seen

Kernel Density Estimation

  • Given \(N\) i.i.d. samples \(X_{i}\sim \mathcal{D}\), we estimate \(\mathcal{D}\)’s density function \(f\) as \[f(x)=\frac{1}{N}\sum_{i=1}^N \frac{1}{h^d} k\left( \frac{x-X_{i}}{h} \right)\] for some kernel function \(k:\mathbb{R}^d \to \mathbb{R}\) and bandwidth \(h>0\).
  • If \(k=\phi\), a standard Gaussian, and we fix the bandwidth \(h=1\), we get a special case of an isotropic Gaussian mixture where \[\rho^\star= \frac{1}{N} \sum_{i=1}^N\delta_{X_{i}}\]
  • Rather than being limited to one “particle” at each sample point, we now allow a continuum of particles
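
To make the equivalence concrete, here is a small numpy sketch: a Gaussian KDE with bandwidth \(h=1\) is exactly the isotropic mixture whose mixing measure is the empirical distribution of the samples (sample size and evaluation point are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 1))                    # N samples, d = 1

def kde(x, X, h=1.0):
    """Gaussian KDE: (1/N) sum_i h^{-d} phi((x - X_i) / h)."""
    N, d = X.shape
    sq = ((x - X) ** 2).sum(axis=1) / h ** 2
    return (np.exp(-0.5 * sq) / (2 * np.pi) ** (d / 2)).mean() / h ** d

# With h = 1 this is the mixture (rho * phi)(x) with rho = (1/N) sum_i delta_{X_i}.
print(kde(np.zeros(1), X))
```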

Examples

  • \(\rho^\star = \phi\): the convolution of the 1D standard normal density with itself, i.e. the \(\mathcal{N}(0,2)\) density \[(\rho^\star*\phi)(x) = \frac{1}{2\sqrt{ \pi }}\exp(-x^2 / 4)\]
  • \(\rho^\star(x)=\frac{1}{10}\) for \(0\leq x\leq 10\), and \(0\) otherwise. Uniform distribution on \([0,10]\). \[(\rho^\star*\phi)(x) = \frac{1}{20}\left( \text{erf}\left( x/\sqrt{ 2 } \right) - \text{erf}\left( (x-10)/\sqrt{ 2 } \right) \right)\]
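
A quick Monte Carlo sanity check of both closed forms (the evaluation point and sample count below are arbitrary, and this check is not part of the paper):

```python
import numpy as np
from math import erf, exp, pi, sqrt

rng = np.random.default_rng(0)
x = 1.7                                              # arbitrary evaluation point

# (rho* * phi)(x) = E_{Y ~ rho*}[phi(x - Y)], estimated by Monte Carlo.
# Example 1: rho* = phi, so the convolution is the N(0, 2) density.
mc1 = np.exp(-0.5 * (x - rng.standard_normal(10 ** 6)) ** 2).mean() / sqrt(2 * pi)
print(mc1, exp(-x ** 2 / 4) / (2 * sqrt(pi)))

# Example 2: rho* uniform on [0, 10].
mc2 = np.exp(-0.5 * (x - rng.uniform(0, 10, 10 ** 6)) ** 2).mean() / sqrt(2 * pi)
print(mc2, (erf(x / sqrt(2)) - erf((x - 10) / sqrt(2))) / 20)
```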

Barron Spaces

  • This idea of going from discrete to continuum is not new
  • Take a two-layer neural network with a finite number of nodes \(N\): \[ f(x)=\frac{1}{N}\sum_{i=1}^N a_{i}\sigma(\left\langle b_{i}, x \right\rangle+c_{i})\]
  • Turn it into a continuum \[f(x)=\int a\sigma(\left\langle b, x \right\rangle+c) \, d\mu(a,b,c) \]
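
A small sketch of the same discrete-to-continuum idea for networks: the finite-width network below is a Monte Carlo average over randomly drawn parameters, and the continuum (Barron) version replaces that average by an integral against a measure \(\mu\) on \((a,b,c)\). The ReLU activation and the Gaussian parameter draws are my choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
d, N_nodes = 5, 1000

# Finite-width two-layer network: f(x) = (1/N) sum_i a_i sigma(<b_i, x> + c_i).
a = rng.standard_normal(N_nodes)
b = rng.standard_normal((N_nodes, d))
c = rng.standard_normal(N_nodes)
sigma = lambda z: np.maximum(z, 0.0)                 # ReLU, chosen for illustration

def f(x):
    return (a * sigma(b @ x + c)).mean()

# The continuum version is f(x) = \int a sigma(<b, x> + c) d mu(a, b, c);
# with (a_i, b_i, c_i) drawn i.i.d. from mu, the finite sum is a Monte Carlo estimate of it.
print(f(np.ones(d)))
```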

Learning the Gaussian Mixture

Maximum Likelihood Estimation

  • Estimate \(\rho^\star\) with \[\hat{\rho} \in \mathrm{arg}\,\min_{\rho \in \mathcal{P}(\mathbb{R}^d)}\ell_{N}(\rho)\] where \[\ell_{N}(\rho)=-\frac{1}{N}\sum_{i=1}^N \log\left( (\rho * \phi)(X_{i}) \right)\]
  • Although this is a convex problem (\(\ell_{N}\) is convex in \(\rho\)), it is infinite-dimensional
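
The objective is easy to evaluate once \(\rho\) is restricted to finitely many atoms. A minimal sketch (the helper name and the discrete parametrization \(\rho=\sum_j \omega_{j}\delta_{\mu_{j}}\) anticipate the algorithm later in these notes):

```python
import numpy as np

def nll(X, mu, w):
    """ell_N(rho) = -(1/N) sum_i log((rho * phi)(X_i)) for rho = sum_j w_j delta_{mu_j}.
    X: (N, d) samples, mu: (m, d) atom locations, w: (m,) nonnegative weights summing to 1."""
    d = X.shape[1]
    sq = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)     # ||X_i - mu_j||^2, shape (N, m)
    K = np.exp(-0.5 * sq) / (2 * np.pi) ** (d / 2)           # phi(X_i - mu_j)
    return -np.mean(np.log(K @ w))
```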

Optimality: Need a derivative

  • Define the first variation of \(\ell\) as any function \(\delta \ell(\rho):\mathbb{R}^d \to \mathbb{R}\) that satisfies \[\lim_{ \epsilon \to 0 } \frac{{\ell(\rho+\epsilon \nu) - \ell(\rho)}}{\epsilon}=\int \delta \ell(\rho) \, d\nu \] for all signed measures \(\nu\) such that \(\int d\nu = 0\).
  • For our function \(\ell_{N}\), the first variation is \[\delta \ell_{N}(\rho): x \mapsto - \frac{1}{N}\sum_{i=1}^N \frac{{\phi(x-X_{i})}}{(\rho * \phi)(X_{i})}\]
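
A direct transcription of this formula for a finitely supported \(\rho\), again a sketch with a hypothetical helper name:

```python
import numpy as np

def first_variation(x_grid, X, mu, w):
    """delta ell_N(rho)(x) = -(1/N) sum_i phi(x - X_i) / (rho * phi)(X_i),
    evaluated at each row of x_grid, for rho = sum_j w_j delta_{mu_j}."""
    d = X.shape[1]
    phi = lambda sq: np.exp(-0.5 * sq) / (2 * np.pi) ** (d / 2)
    dens = phi(((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)) @ w    # (rho * phi)(X_i), shape (N,)
    Kx = phi(((x_grid[:, None, :] - X[None, :, :]) ** 2).sum(-1))      # phi(x - X_i), shape (G, N)
    return -(Kx / dens[None, :]).mean(axis=1)
```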

Optimality: Conditions (Theorem 1)

  1. Existence: \(\hat{\rho}\) exists! \[\hat{\rho} \in \mathrm{arg}\,\min_{\rho \in \mathcal{P}(\mathbb{R}^d)}\ell_{N}(\rho)\]
  2. Condition: A distribution is an NPMLE iff
    1. \(\delta \ell_{N}(\hat{\rho})(x) + 1\geq 0\) for all \(x \in \mathbb{R}^d\)
    2. \(\delta \ell_{N}(\hat{\rho})(x) + 1 = 0\) for \(\hat{\rho}\)-almost every \(x\) (i.e., wherever \(\hat{\rho}\) puts mass)

Wasserstein-Fisher-Rao gradient descent

  • Above my head
  • But they have a continuous gradient descent scheme (gradient flow) for \(\rho\) given some initialization
  • As stated it is a continuous-time flow, not a practical algorithm on a digital computer
  • But we can discretize this scheme!

(Discretized) Algorithm

  • Discretize: represent \(\rho\) by \(m\) movable particles \(\mu_{j}\in\mathbb{R}^d\) with weights \(\omega_{j}\geq 0\), \(\sum_{j}\omega_{j}=1\), i.e. \(\rho=\sum_{j=1}^m \omega_{j}\,\delta_{\mu_{j}}\)
  • This is like fitting a KDE to the empirical distribution, but with adaptive heights and locations
  • Alternate between updating the locations and the weights (a sketch follows below)
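
A minimal numpy sketch of one such alternating update: a gradient step on the particle locations (the Wasserstein part) and a multiplicative step on the weights (the Fisher-Rao part). The step sizes, the exact form of the weight update, and the function name are my assumptions, not the paper’s exact scheme:

```python
import numpy as np

def wfr_step(X, mu, w, eta_mu=0.1, eta_w=0.5, move=True, reweight=True):
    """One alternating update for rho = sum_j w_j delta_{mu_j} (a sketch, not the paper's exact scheme)."""
    N, d = X.shape
    diff = X[:, None, :] - mu[None, :, :]                            # X_i - mu_j, shape (N, m, d)
    K = np.exp(-0.5 * (diff ** 2).sum(-1)) / (2 * np.pi) ** (d / 2)  # phi(X_i - mu_j)
    dens = K @ w                                                     # (rho * phi)(X_i)
    R = K / dens[:, None]
    if move:
        # Wasserstein part: move each particle along -grad delta ell_N(rho)(mu_j),
        # where grad delta ell_N(rho)(mu_j) = -(1/N) sum_i R_ij (X_i - mu_j).
        grad = -(R[:, :, None] * diff).mean(axis=0)                  # shape (m, d)
        mu = mu - eta_mu * grad
        diff = X[:, None, :] - mu[None, :, :]
        K = np.exp(-0.5 * (diff ** 2).sum(-1)) / (2 * np.pi) ** (d / 2)
        dens = K @ w
        R = K / dens[:, None]
    if reweight:
        # Fisher-Rao part: multiplicative update driven by delta ell_N(rho)(mu_j) + 1,
        # which is zero on the support at optimality (Theorem 1).
        fv = -R.mean(axis=0)                                         # delta ell_N(rho)(mu_j)
        w = w * np.exp(-eta_w * (fv + 1))
        w = w / w.sum()
    return mu, w
```

The `move` / `reweight` flags are only there so the same sketch can also run the two ablations compared in the experiments below.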

(Discretized) Algorithm

  • If this converges, then it converges to the NPMLE.
  • But convergence itself is not guaranteed

Experiments

Example distributions to learn

\[ \rho^\star=\frac{1}{3}(\delta_{-1}+\delta_{1}+\delta_{10}) \]

\[ \rho^\star=\frac{1}{\sqrt{ 2\pi }}\exp\left( -x^2 / 2 \right) \]

Does it converge?

  • Will compare Wasserstein-Fisher-Rao to two modifications
  • Fisher-Rao: Initialize the particles randomly \(\mu_{j}\sim \mathcal{U}\{ X_{i} \}_{i=1}^N\) and fix them, only learn the weights \(\omega_{j}\)
  • Wasserstein: Fix the weights to \(\omega_{j}=\frac{1}{N}\), and only learn the locations \(\mu_{j}\)
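
A sketch of that comparison, reusing the hypothetical `wfr_step` and `nll` helpers from above; the data, particle count, iteration budget, and step sizes are placeholder values, not the paper’s settings:

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder data: rho* = (1/3)(delta_{-1} + delta_1 + delta_10) convolved with N(0, 1).
N, T = 500, 1000
X = rng.choice([-1.0, 1.0, 10.0], size=N)[:, None] + rng.standard_normal((N, 1))

def run(m, move=True, reweight=True):
    mu = X[rng.choice(N, size=m, replace=False)].copy()  # mu_j drawn uniformly from the samples
    w = np.full(m, 1.0 / m)                              # uniform initial weights
    for _ in range(T):
        mu, w = wfr_step(X, mu, w, move=move, reweight=reweight)
    return mu, w

# Wasserstein-Fisher-Rao vs. the two modifications above.
for name, move, reweight in [("WFR", True, True), ("Fisher-Rao", False, True), ("Wasserstein", True, False)]:
    mu_hat, w_hat = run(m=100, move=move, reweight=reweight)
    print(name, nll(X, mu_hat, w_hat))
```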

Does it converge?

How many particles do you need?

  • Fix the number of iterations (\(T=1000\)) and observe the loss
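
A sketch of that experiment, reusing the hypothetical `run` and `nll` helpers above with the iteration budget held at \(T=1000\); the particle counts are placeholders:

```python
# Fixed iteration budget, varying number of particles m (placeholder values).
for m in [10, 30, 100, 300, 500]:
    mu_hat, w_hat = run(m)
    print(m, nll(X, mu_hat, w_hat))
```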

How do we know we are optimal?

  • The first variation \(\delta \ell_{N}(\hat{\rho})\) should equal \(-1\) on the support of \(\hat{\rho}\) (and be \(\geq -1\) everywhere), per Theorem 1
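
A sketch of that check, reusing the hypothetical `run` and `first_variation` helpers above (the grid range and the support threshold are arbitrary choices):

```python
import numpy as np

mu_hat, w_hat = run(m=100)                                   # fitted particles and weights (full WFR)
grid = np.linspace(-5.0, 15.0, 400)[:, None]

fv_grid = first_variation(grid, X, mu_hat, w_hat)
fv_supp = first_variation(mu_hat[w_hat > 1e-6], X, mu_hat, w_hat)

print("min over grid of delta ell_N + 1:", (fv_grid + 1).min())            # should be >= 0 (approximately)
print("max |delta ell_N + 1| on the support:", np.abs(fv_supp + 1).max())  # should be close to 0
```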

References