Compute a nonparametric maximum likelihood estimator (NPMLE) for a Gaussian mixture model
The catch: there is a continuum of Gaussians
Setup
Given: \(N\) i.i.d. samples \(X_{i}\sim \mathcal{D}\), where \(X_{i}\in\mathbb{R}^d\)
Estimate: the density of the distribution \(\mathcal{D}\)
Assume \(\mathcal{D}= \rho^\star * \mathcal{N}(0,I_{d})\) is an isotropic Gaussian mixture with density \[
(\rho^\star * \phi) (x) = \int _{\mathbb{R}^d}\phi(x-y)\rho^\star(y) \, dy
\] for a standard Gaussian density \(\phi(x) = \frac{1}{(2\pi)^{d/2}} \exp(-\lVert x \rVert_{2}^2/2)\)
\(\rho^\star\) is the unknown mixing distribution, written here as a density \(\rho^\star:\mathbb{R}^d\to \mathbb{R}_{+}\); in general it can be any probability measure (e.g., a sum of Dirac masses, as in the KDE example below)
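To make the generative model concrete, here is a minimal sampling sketch; the choice \(\rho^\star = \mathrm{Uniform}([0,10]^d)\) is an assumption made purely for this illustration.
```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 1000, 2

# Draw X_i ~ rho* * N(0, I_d): sample a center Y_i ~ rho*, then add standard
# Gaussian noise. rho* = Uniform([0, 10]^d) is chosen purely for illustration.
Y = rng.uniform(0.0, 10.0, size=(N, d))     # Y_i ~ rho*
X = Y + rng.standard_normal((N, d))         # X_i = Y_i + Z_i,  Z_i ~ N(0, I_d)
```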
Connection to things we’ve seen
Kernel Density Estimation
Given \(N\) i.i.d. samples \(X_{i}\sim \mathcal{D}\), we estimate \(\mathcal{D}\)’s density function \(f\) as \[f(x)=\frac{1}{N}\sum_{i=1}^N \frac{1}{h^d} k\left( \frac{x-X_{i}}{h} \right)\] for some kernel function \(k:\mathbb{R}^d \to \mathbb{R}\) and bandwidth \(h>0\).
If \(k=\phi\), a standard Gaussian, and we fix the bandwidth \(h=1\), we get a special case of an isotropic Gaussian mixture where \[\rho^\star= \frac{1}{N} \sum_{i=1}^N\delta_{X_{i}}\]
Rather than limiting ourselves to one “particle” at each sample, we now allow a continuum of particles
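To make the correspondence explicit, a short sketch (the function name is mine) evaluating the \(h=1\) Gaussian KDE, which is exactly \((\rho^\star * \phi)(x)\) for this Dirac mixing measure:
```python
import numpy as np

def kde_gaussian(x, samples):
    """Gaussian KDE with k = phi and bandwidth h = 1, evaluated at x.

    This equals (rho* * phi)(x) for rho* = (1/N) sum_i delta_{X_i}.
    """
    d = samples.shape[1]
    sq_dists = np.sum((samples - x) ** 2, axis=1)            # ||x - X_i||^2
    return np.mean(np.exp(-sq_dists / 2) / (2 * np.pi) ** (d / 2))
```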
Examples
\(\rho^\star = \phi\) (with \(d=1\)): the convolution of the standard normal density with itself, i.e. the \(\mathcal{N}(0,2)\) density \[(\rho^\star*\phi)(x) = \frac{1}{2\sqrt{ \pi }}\exp(-x^2 / 4)\]
\(\rho^\star(x)=\frac{1}{10}\) for \(0\leq x\leq 10\), and \(0\) otherwise. Uniform distribution on \([0,10]\). \[(\rho^\star*\phi)(x) = \frac{1}{20}\left( \text{erf}\left( x/\sqrt{ 2 } \right) - \text{erf}\left( (x-10)/\sqrt{ 2 } \right) \right)\]
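A quick numerical sanity check of the second closed form (a verification sketch, not part of the derivation; function names are mine):
```python
import numpy as np
from scipy.integrate import quad
from scipy.special import erf

def phi(x):
    """Standard normal density on R."""
    return np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

def closed_form(x):
    """(rho* * phi)(x) for rho* = Uniform([0, 10]), from the formula above."""
    return (erf(x / np.sqrt(2)) - erf((x - 10) / np.sqrt(2))) / 20

def by_quadrature(x):
    """Direct numerical convolution: int_0^10 phi(x - y) * (1/10) dy."""
    val, _ = quad(lambda y: phi(x - y) / 10, 0, 10)
    return val

for x in [-1.0, 0.5, 5.0, 9.7, 12.0]:
    assert abs(closed_form(x) - by_quadrature(x)) < 1e-8
```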
Barron Spaces
This idea of going from discrete to continuum is not new
Consider a two-layer neural network with a finite number of nodes \(N\): \[ f(x)=\frac{1}{N}\sum_{i=1}^N a_{i}\sigma(\left\langle b_{i}, x \right\rangle+c_{i})\]
Turn it into a continuum \[f(x)=\int a\sigma(\left\langle b, x \right\rangle+c) \, d\mu(a,b,c) \]
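A minimal sketch of the finite-width object (taking \(\sigma=\) ReLU as an illustrative choice; names are mine); the continuum version replaces the average over \(i\) with the integral against \(\mu\):
```python
import numpy as np

def two_layer(x, a, b, c):
    """Finite-width network f(x) = (1/N) sum_i a_i sigma(<b_i, x> + c_i),
    with sigma = ReLU as an illustrative choice."""
    pre = b @ x + c                    # <b_i, x> + c_i, shape (N,)
    return np.mean(a * np.maximum(pre, 0.0))

# Example: N = 3 random nodes in dimension d = 2.
rng = np.random.default_rng(0)
a, b, c = rng.standard_normal(3), rng.standard_normal((3, 2)), rng.standard_normal(3)
print(two_layer(np.array([1.0, -0.5]), a, b, c))
```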
Learning the Gaussian Mixture
Maximum Likelihood Estimation
Estimate \(\rho^\star\) with \[\hat{\rho} \in \mathrm{arg}\,\min_{\rho \in \mathcal{P}(\mathbb{R}^d)}\ell_{N}(\rho)\] where \[\ell_{N}(\rho)=-\frac{1}{N}\sum_{i=1}^N \log\left( (\rho * \phi)(X_{i}) \right)\]
Although this problem is convex, it is infinite-dimensional
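In practice one restricts to finitely many atoms; a sketch (names are mine) of \(\ell_N\) for a discrete candidate \(\rho=\sum_j w_j\,\delta_{y_j}\):
```python
import numpy as np

def nll(weights, atoms, X):
    """ell_N(rho) for a discrete candidate rho = sum_j w_j delta_{y_j}.

    weights: (m,) nonnegative, summing to 1
    atoms:   (m, d) locations y_j
    X:       (N, d) observed samples
    """
    d = X.shape[1]
    # (rho * phi)(X_i) = sum_j w_j phi(X_i - y_j)
    sq = np.sum((X[:, None, :] - atoms[None, :, :]) ** 2, axis=2)   # (N, m)
    phi_vals = np.exp(-sq / 2) / (2 * np.pi) ** (d / 2)
    mix = phi_vals @ weights                                        # (N,)
    return -np.mean(np.log(mix))
```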
Optimality: Need a derivative
Define the first variation of \(\ell\) as any function \(\delta \ell(\rho):\mathbb{R}^d \to \mathbb{R}\) that satisfies \[\lim_{ \epsilon \to 0 } \frac{{\ell(\rho+\epsilon \nu) - \ell(\rho)}}{\epsilon}=\int \delta \ell(\rho) \, d\nu \] for all signed measures \(\nu\) such that \(\int d\nu = 0\).
For our function \(\ell_{N}\), the first variation is \[\delta \ell_{N}(\rho): x \mapsto - \frac{1}{N}\sum_{i=1}^N \frac{{\phi(x-X_{i})}}{(\rho * \phi)(X_{i})}\]
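A sketch of this first variation for a discrete \(\rho\), together with a finite-difference check against the definition above (all names are mine):
```python
import numpy as np

def phi_nd(u):
    """Standard Gaussian density on R^d, applied along the last axis."""
    d = u.shape[-1]
    return np.exp(-np.sum(u**2, axis=-1) / 2) / (2 * np.pi) ** (d / 2)

def first_variation(x, weights, atoms, X):
    """delta ell_N(rho)(x) for discrete rho = sum_j w_j delta_{y_j}."""
    denom = phi_nd(X[:, None, :] - atoms[None, :, :]) @ weights  # (rho*phi)(X_i)
    return -np.mean(phi_nd(X - x) / denom)

# Finite-difference check against the definition, using the zero-mass
# perturbation nu = delta_u - delta_v, for which
# int delta ell_N(rho) d nu = delta ell_N(rho)(u) - delta ell_N(rho)(v).
rng = np.random.default_rng(0)
N, d, m = 200, 2, 5
X = rng.standard_normal((N, d))
atoms, weights = rng.standard_normal((m, d)), np.full(m, 1 / m)
u, v = rng.standard_normal(d), rng.standard_normal(d)

mix = phi_nd(X[:, None, :] - atoms[None, :, :]) @ weights    # (rho * phi)(X_i)
eps = 1e-6
perturbed = mix + eps * (phi_nd(X - u) - phi_nd(X - v))      # ((rho + eps*nu) * phi)(X_i)
fd = (-np.mean(np.log(perturbed)) + np.mean(np.log(mix))) / eps
exact = first_variation(u, weights, atoms, X) - first_variation(v, weights, atoms, X)
assert abs(fd - exact) < 1e-4
```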