In [1]:
%matplotlib inline
from matplotlib.pylab import *   # supplies the bare linspace, subplots, hist, ... used below
import numpy as np
from IPython.html.widgets import interact
from scipy import stats
import seaborn as sns
import pandas as pd

Previously, we talked about maximum likelihood estimation and maximum a-posteriori estimation. In each case, we started out with a probability density function of some kind, and we further assumed that the samples were independent and identically distributed. The idea behind robust statistics is to construct estimators that can survive the weakening of either or both of these assumptions.

The first idea to consider is the notion of location, which is a generalization of the idea of "central value". Typically, we just use an estimate of the mean for this, but we will see shortly why that is a bad idea. The general idea of location satisfies the following requirements.

Let $ X $ be a random variable with distribution $ F $, and let $\theta(X)$ be some descriptive measure of $F$. Then $\theta(X)$ is said to be a measure of location if, for any constants $a$ and $b$, we have the following:

$$ \theta(X+b) = \theta(X) + b $$

$$ \theta(-X) = -\theta(X) $$

$$ X \ge 0 \Rightarrow \theta(X) \ge 0 $$

$$ \theta(a X) = a\theta(X) $$

The first condition is called location equivariance (or shift-invariance in signal processing lingo). The fourth condition is called scale equivariance, which means that the units in which $X$ is measured should not affect the value of the location estimator. These requirements capture the idea of what we intuitively mean by the centrality of a distribution, or where most of its probability mass is located.

For example, the sample mean estimator is $ \hat{\mu}=\frac{1}{n}\sum X_i $. The first requirement is obviously satisfied because $ \frac{1}{n}\sum (X_i+b) = b + \frac{1}{n}\sum X_i = \hat{\mu}+b$. The second requirement follows from $ \frac{1}{n}\sum (-X_i) = -\hat{\mu}$, and the third is immediate. Finally, the last requirement is satisfied because $ \frac{1}{n}\sum a X_i = a \hat{\mu}$.
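
As a quick numerical sanity check (this cell is an added sketch; the sample and the constants $a$ and $b$ are arbitrary), the four requirements can be verified directly for the sample mean:

In [ ]:
# Sketch: check the four location requirements for the sample mean
# on an arbitrary sample with arbitrary constants a and b.
X = np.random.randn(1000)
a, b = 3.0, 2.0
(np.allclose(np.mean(X + b), np.mean(X) + b),   # location equivariance
 np.allclose(np.mean(-X), -np.mean(X)),         # sign change
 np.mean(abs(X)) >= 0,                          # nonnegativity for X >= 0
 np.allclose(np.mean(a*X), a*np.mean(X)))       # scale equivariance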

What do we mean by robust estimators?

Now that we have the generalized notion of centrality embodied in the location parameter, what can we do with it? The next idea to nail down is the concept of robust estimators. Previously, we assumed that our samples were all identically distributed. The key idea is that the samples might actually be coming from a distribution that is contaminated by another nearby distribution, as in the following:

$$ F(X) = (1-\epsilon) G(X) + \epsilon H(X) $$

where $ \epsilon $ is between zero and one. This means that our data samples $\lbrace X_i \rbrace$ are actually drawn from two separate distributions, $ G(X) $ and $ H(X) $; we just don't know how they are mixed together. What we really want is an estimator that captures the location of $ G(X) $ in the face of random, intermittent contamination by $ H(X) $. It can get even worse than that, because we don't know that there is only one contaminating distribution $H(X)$ out there; there may be a whole family of distributions contaminating $G(X)$ that we don't know about. This means that whatever estimators we construct have to be derived from families of distributions instead of from a single distribution, which is what we have been assuming for maximum-likelihood estimators. This is what makes robust estimation so difficult: the extended theory has to deal with spaces of distribution functions instead of the parameters of a particular probability distribution.

  • Influence function
  • Outlier detection
  • Estimates of location
    • Definition of location
  • Trimmed means
  • Winsorized means
  • Hodges-Lehmann statistics
  • Asymptotic efficiency
  • Fisher consistency
  • Robust regression
    • Least median
    • Outliers
In [2]:
n0 = stats.norm(0,1)    # nominal distribution
n1 = stats.norm(0,10)   # contaminating distribution (heavier spread)
xi = linspace(-5,5,100)

fig,ax = subplots()
ax.plot(xi,n0.pdf(xi))
ax.plot(xi,n1.pdf(xi))

def bias_coin(phead=.5):
    # generator of Bernoulli(phead) draws
    while True:
        yield int(np.random.rand() < phead)

pct_mixed = 0.1                       # contamination fraction epsilon
bias_coin_gen = bias_coin(pct_mixed)
dual_set = [n0,n1]
samples = [dual_set[next(bias_coin_gen)].rvs() for i in range(500)]
In [3]:
hist(samples,bins=20)
title('average = %3.3f, median=%3.3f pct_mixed=%3.3f'%(mean(samples),np.median(samples),pct_mixed))
Out[3]:
<matplotlib.text.Text at 0xd9b6470>
In [4]:
import sympy.stats
from sympy.abc import x
eps = sympy.symbols('epsilon')

# contaminated CDF: (1-eps)*N(0,1) plus eps*N(1,2)
mixed_cdf = sympy.stats.cdf(sympy.stats.Normal('x',0,1))(x)*(1-eps) + eps*sympy.stats.cdf(sympy.stats.Normal('x',1,2))(x)
mixed_pdf = sympy.diff(mixed_cdf,x)
In [5]:
def plot_mixed_dist(epsilon=.1):
    n1 = stats.norm(1,2)                                 # contaminating distribution h(x)
    xi = linspace(-5,5,100)
    f = sympy.lambdify(x,mixed_pdf.subs(eps,epsilon))    # numeric mixture pdf
    fig,ax = subplots()
    ax.plot(xi,[f(i) for i in xi],label='mixed',lw=2)
    ax.plot(xi,n0.pdf(xi),label='g(x)',linestyle='--')
    ax.plot(xi,n1.pdf(xi),label='h(x)',linestyle='--')
    ax.legend(loc=0)
    ax.set_title('epsilon = %2.2f'%(epsilon))
    ax.vlines(0,0,.4,linestyle='-',color='g')            # mean of g(x)
    ax.vlines(epsilon,0,.4,linestyle='-',color='b')      # mean of the mixture

interact(plot_mixed_dist,epsilon=(0,1,.05))
Out[5]:
<function __main__.plot_mixed_dist>

M-estimators

M-estimators are generalized maximum likelihood estimators. Recall that for maximum likelihood, we want to maximize the likelihood function as in the following:

$$ L_{\mu}(x_i) = \prod f_0(x_i-\mu)$$

and then to find the estimator $\hat{\mu}$ so that

$$ \hat{\mu} = \arg \max_{\mu} L_{\mu}(x_i) $$

So far, everything is the same as our usual maximum-likelihood derivation except for the fact that we don't know $f_0$, the distribution of the $\lbrace X_i\rbrace$. Making the convenient definition of

$$ \rho = -\log f_0 $$

we obtain the more convenient form of the likelihood product and the optimal $\hat{\mu}$ as

$$ \hat{\mu} = \arg \min_{\mu} \sum \rho(x_i-\mu)$$

If $\rho$ is differentiable, then differentiating this with respect to $\mu$ gives

$$ \sum \psi(x_i-\hat{\mu}) = 0 $$

with $\psi = \rho'$, and for technical reasons we will assume that $\psi$ is increasing. The key idea here is that we want to consider general $\rho$ functions that may not correspond to the maximum-likelihood estimator of any particular distribution.
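
As a quick sanity check (an added sketch, not part of the original derivation), we can solve $\sum \psi(x_i-\hat{\mu})=0$ numerically for the quadratic case $\rho(x)=x^2/2$, i.e. $\psi(x)=x$, using the contaminated samples generated above; this should recover the sample mean. The bracketing interval passed to brentq is an arbitrary choice.

In [ ]:
# Sketch: solve sum(psi(x_i - mu)) = 0 with psi(x) = x, i.e. rho(x) = x**2/2.
# The root should coincide with the sample mean; the bracket [-10, 10] is arbitrary.
from scipy.optimize import brentq

x_data = np.asarray(samples)
psi = lambda z: z
mu_hat = brentq(lambda m: np.sum(psi(x_data - m)), -10, 10)
mu_hat, np.mean(x_data)   # should agree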

The distribution of M-estimates

For a given distribution $F$, we define $\mu_0=\mu(F)$ as the solution to the following

$$ \mathbb{E}_F(\psi(x-\mu_0))= 0 $$

It is technical to show, but it turns out that $\hat{\mu} \sim \mathcal{N}(\mu_0,\frac{v}{n})$ with

$$ v = \frac{\mathbb{E}_F(\psi(x-\mu_0)^2)}{(\mathbb{E}_F(\psi^\prime(x-\mu_0)))^2} $$

Thus, we can say that $\hat{\mu}$ is asymptotically normal with asymptotic value $\mu_0$ and asymptotic variance $v$. This leads to the efficiency ratio which is defined as the following:

$$ \texttt{Eff}(\hat{\mu})= \frac{v_0}{v} $$

where $v_0$ is the asymptotic variance of the MLE; it measures how near $\hat{\mu}$ is to the optimum. For example, if two estimates have asymptotic variances $v_1$ and $v_2$ with $v_1=3v_2$, then the first estimate requires three times as many observations to obtain the same variance as the second.

For example, for the sample mean (i.e. $\hat{\mu}=\frac{1}{n} \sum X_i$) with $F=\mathcal{N}$, we have $\rho=x^2/2$ and $\psi=x$ and also $\psi'=1$. Thus, we have $v=\mathbb{V}(x)$. Alternatively, using the sample median as the estimator for the location, we have $v=\frac{1}{4 f(\mu_0)^2}$. Thus, if we have $F=\mathcal{N}(0,1)$, for the sample median, we obtain $v=\frac{2\pi}{4} \approx 1.571$. This means that the sample median takes approximately 1.6 times as many samples to obtain the same variance for the location as the sample mean.
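
A quick simulation (an added sketch; the simulation sizes are arbitrary) can check these numbers: for standard normal data, $n$ times the variance of the sample median should be near $\pi/2 \approx 1.571$, while for the sample mean it should be near one.

In [ ]:
# Sketch: Monte Carlo check of the asymptotic variances quoted above.
n, ntrials = 500, 2000                          # arbitrary simulation sizes
trials = np.random.randn(ntrials, n)            # standard normal data
(n*np.var(np.median(trials, axis=1)),           # expect ~ pi/2 ~ 1.571
 n*np.var(np.mean(trials, axis=1)))             # expect ~ 1.0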

One way to think about M-estimates is as weighted means. Most of the time, we have $\psi(0)=0$ and $\psi'(0)$ exists, so that $\psi$ is approximately linear at the origin. Using the following definition:

$$ W(x) = \begin{cases} \psi(x)/x & \text{if} \: x \neq 0 \\ \psi'(x) & \text{if} \: x =0 \end{cases} $$

We can write our earlier equation as follows:

$$ \sum W(x_i-\hat{\mu})(x_i-\hat{\mu}) = 0 $$

Solving this for $\hat{\mu} $ yields the following,

$$ \hat{\mu} = \frac{\sum w_{i} x_i}{\sum w_{i}} $$

where $w_{i}=W(x_i-\hat{\mu})$. The question that remains is how to pick the $\psi$ functions.

Huber functions

The family of Huber functions is defined by the following:

$$ \rho_k(x ) = \begin{cases} x^2 & \text{if} \: |x|\le k \\ 2 k |x|-k^2 & \text{if} \: |x| \gt k \end{cases} $$

with corresponding derivative $\rho_k'(x) = 2\psi_k(x)$, where

$$ \psi_k(x ) = \begin{cases} x & \text{if} \: |x|\le k \\ \text{sgn}(x)k & \text{if} \: |x| \gt k \end{cases} $$

where the limiting cases $k \rightarrow \infty$ and $k \rightarrow 0$ correspond to the mean and median, respectively. To see this, take $\psi_{\infty} = x$ and therefore $W(x) = 1$ and thus the defining equation results in

$$ \sum_{i=1}^{n} (x_i-\hat{\mu}) = 0 $$

and then solving this leads to $\hat{\mu} = \frac{1}{n}\sum x_i$. Note that choosing $k=0$ leads to the sample median, but that is not so straightforward to solve for.
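
As a rough numerical illustration (an added sketch reusing x_data and brentq from the earlier sketch; the particular values of $k$ are arbitrary), solving $\sum \psi_k(x_i-\hat{\mu})=0$ with a very large and a very small $k$ approximately recovers the sample mean and the sample median, respectively.

In [ ]:
# Sketch: the Huber psi interpolates between the mean (large k) and the
# median (small k); np.clip(z, -k, k) is exactly psi_k(z).
def psi_huber(z, k):
    return np.clip(z, -k, k)

est = {kval: brentq(lambda m: np.sum(psi_huber(x_data - m, kval)), -10, 10)
       for kval in (100.0, 0.01)}                # arbitrary large/small k
est, np.mean(x_data), np.median(x_data)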

In [6]:
fig,ax=subplots()
colors=['b','r']
for k in [1,2]:
    ax.plot(xi,np.ma.masked_array(xi,abs(xi)>k),color=colors[k-1])
    ax.plot(xi,np.ma.masked_array(np.sign(xi)*k,abs(xi)<k),color=colors[k-1],label='k=%d'%k)
ax.axis(ymax=2.3,ymin=-2.3)
ax.set_ylabel(r'$\psi(x)$',fontsize=28)
ax.set_xlabel(r'$x$',fontsize=24)
ax.legend(loc='best')
ax.set_title('Huber functions')
ax.grid()

The $W$ function corresponding to Huber's $\psi$ is the following:

$$ W_k(x) = \min\Big\lbrace 1, \frac{k}{|x|} \Big\rbrace $$

which is plotted in the following cell for a few values of $k$.

In [7]:
fig,ax=subplots()
ax.plot(xi,np.vstack([np.ones(xi.shape),2/abs(xi)]).min(axis=0),label='k=2')
ax.plot(xi,np.vstack([np.ones(xi.shape),1/abs(xi)]).min(axis=0),label='k=1')
ax.axis(ymax=1.1)
ax.legend(loc=0)
ax.set_title("Huber's weight function")
Out[7]:
<matplotlib.text.Text at 0x117bf510>
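
Since $W_k$ has this simple closed form, the weighted-mean fixed point derived earlier, $\hat{\mu} = \sum w_i x_i / \sum w_i$ with $w_i = W_k(x_i-\hat{\mu})$, can be iterated directly. The following added sketch runs that iteration on the x_data array from the sketches above; the scale is ignored for simplicity, and the starting value, $k$, and tolerance are arbitrary choices.

In [ ]:
# Sketch: iteratively reweighted mean using Huber's weight function W_k.
def huber_irls(x, k=1.5, tol=1e-8):
    mu = np.median(x)                                       # robust starting value
    while True:
        z = x - mu
        w = np.minimum(1.0, k/np.maximum(np.abs(z), 1e-12)) # w_i = W_k(x_i - mu)
        mu_new = np.sum(w*x)/np.sum(w)
        if abs(mu_new - mu) < tol:
            return mu_new
        mu = mu_new

huber_irls(x_data), np.mean(x_data), np.median(x_data)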

Another, more intuitive way to interpret the M-estimate is to rewrite its defining equation as follows:

$$ \hat{\mu} = \hat{\mu} +\frac{1}{n}\sum_i \psi(x_i-\hat{\mu}) = \frac{1}{n} \sum_i \zeta(x_i,\hat{\mu})$$

which for the Huber family of functions takes on the form:

$$ \zeta(x,\mu) = \begin{cases} \mu - k & \text{if} \: x \lt \mu-k \\ x & \text{if} \: \mu-k \le x \le \mu+k \\ \mu+k & \text{if} \: x \gt \mu+k \\ \end{cases} $$

Thus, the interpretation here is that $\hat{\mu}$ is the average of the truncated pseudo-observations $\zeta_i$, where observations farther than $k$ from $\hat{\mu}$ are clipped at $\hat{\mu}\pm k$.
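
This clipping view also suggests a simple fixed-point iteration, which is essentially what the huber_estim function defined further below does, minus the MAD scale estimate. A minimal added sketch with an arbitrary $k$ and iteration count:

In [ ]:
# Sketch: fixed-point iteration on the truncated pseudo-observations.
mu, kval = np.median(x_data), 1.5               # arbitrary starting value and k
for _ in range(100):                            # fixed number of sweeps
    mu = np.mean(np.clip(x_data, mu - kval, mu + kval))
mu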

Asymptotic variance

In [8]:
from sympy import mpmath, symbols, diff, Piecewise, sign, lambdify
from sympy.stats import density, cdf, Normal
from sympy.abc import k,x
In [9]:
eps = symbols('epsilon')
lpdf=diff(cdf(Normal('x',0,1))(x)*(1-eps)+ eps*cdf(Normal('x',0,10))(x),x)
p = Piecewise((x,abs(x)<k),(k*sign(x),True))
In [10]:
def closure_on_asymptotic_variance(mn=(0,0),std=(1,10)):
    from sympy.abc import k,x
    eps = symbols('epsilon')
    # pdf of the contaminated distribution (1-eps)*N(mn[0],std[0]) + eps*N(mn[1],std[1])
    lpdf=diff(cdf(Normal('x',mn[0],std[0]))(x)*(1-eps)+ eps*cdf(Normal('x',mn[1],std[1]))(x),x)
    def asymptotic_variance(kval,epsval):
        # v = E_F[psi_k(x-mu_0)**2] / E_F[psi_k'(x-mu_0)]**2, with mu_0 = 0 for the
        # symmetric default mixture
        p = Piecewise((x,abs(x)<kval),(kval*sign(x),True))   # Huber psi_k
        # psi_k' is 1 on [-k,k] and 0 outside, so E_F[psi_k'] is the pdf mass on [-k,k]
        denom=mpmath.quad(lambdify(x,lpdf.subs(eps,epsval)),[-kval,kval])**2
        numer=mpmath.quad(lambdify(x,p**2*lpdf.subs(eps,epsval)),[-np.inf,np.inf])
        return float(numer/denom)
    return asymptotic_variance
In [11]:
case2=closure_on_asymptotic_variance()
In [12]:
kvals= [.001,0.3,0.5,.7,1.00001,1.4,1.7,2,3,4,5]
In [13]:
fig2,ax=subplots()
ax.plot(kvals,[case2(k,0) for k in kvals],'-o',label='eps=0')
ax.plot(kvals,[case2(k,.05) for k in kvals],'-o',label='eps=.05')
ax.plot(kvals,[case2(k,.1) for k in kvals],'-o',label='eps=0.1')
ax.set_xlabel("k")
ax.set_ylabel("asymptotic variance")
ax.legend(loc=0)
ax.set_title(r"$\mathcal{N}(0,1) , \mathcal{N}(0,10)$ mixed",fontsize=18)
Out[13]:
<matplotlib.text.Text at 0x114e2b30>

Computable Example

In [14]:
nsamples = 500
ncols = 300
# ncols independent columns of nsamples contaminated draws each
xs = np.array([dual_set[next(bias_coin_gen)].rvs() for i in range(ncols*nsamples)]).reshape(nsamples,-1)
In [15]:
fig,ax=subplots()
ax.hist(np.mean(xs,0),20,alpha=0.8,label = 'mean')
ax.hist(np.median(xs,0),20,alpha=0.3,label ='median')
ax.legend()
Out[15]:
<matplotlib.legend.Legend at 0xd8b60b0>
In [16]:
fig,ax=subplots()
sns.violinplot(np.vstack([np.median(xs,axis=0),np.mean(xs,axis=0)]).T,ax=ax,names=['median','mean']);
In [17]:
def huber_estim(x,k=1,tol=1e-6):
    # Huber M-estimate of location; scale fixed at the normalized MAD (MADN)
    if x.ndim==2: # loop over columns
        out = []
        for i in range(x.shape[1]):
            out.append(huber_estim(x[:,i],k=k)) # recurse
        return np.array(out)
    else:
        mu = np.median(x)
        mad = np.median(abs(x-mu))*1.4826 # follow MADN convention
        while True:
            # average of the pseudo-observations clipped at mu +/- k*mad
            mu_i = np.mean(np.clip(x,mu-k*mad,mu+k*mad))
            if abs(mu-mu_i) < tol*mad: break
            mu = mu_i
        return mu_i
In [18]:
huber_est={k:huber_estim(xs,k) for k in [1,1.5,2,3]}
huber_est[0] = np.median(xs,axis=0)
huber_est[4] = np.mean(xs,axis=0)
fig,ax=subplots()
sns.violinplot(pd.DataFrame(huber_est),ax=ax)
ax.set_xticklabels(['median','1','1.5','2','3','mean']);
In [19]:
huber_est={k:huber_estim(xs,k) for k in kvals}
huber_est_df = pd.DataFrame(huber_est)
In [20]:
fig,ax=subplots()
sns.violinplot(huber_est_df,ax=ax)
Out[20]:
<matplotlib.axes._subplots.AxesSubplot at 0x11d1b0f0>
In [21]:
fig2,ax=subplots()
ax.plot(kvals,[case2(k,pct_mixed) for k in kvals],'-o',label='eps=0.1')
(huber_est_df.var()*nsamples).plot(marker='o',label='est eps=0.1',ax=ax)
ax.set_xlabel("k")
ax.set_ylabel("asymptotic variance")
ax.legend(loc=0)
Out[21]:
<matplotlib.legend.Legend at 0x11d84ef0>

References

  • Maronna, R. A., R. D. Martin, and V. J. Yohai. "Robust Statistics: Theory and Methods". 2006.