Woodbury Matrix Inverse Identity

Application in Conditional Distribution of Multivariate Normal

The Sherman-Morrison-Woodbury matrix inverse identity can be regarded as a transform between Schur complements: given V_{22.1}^{-1} one can obtain V_{11.2}^{-1} using the Woodbury matrix identity, and vice versa. Recall the Woodbury identity:

V_{11.2}^{-1}=V_{11}^{-1}+V_{11}^{-1}V_{12}V_{22.1}^{-1}V_{21}V_{11}^{-1}

and

V_{22.1}^{-1}=V_{22}^{-1}+V_{22}^{-1}V_{21}V_{11.2}^{-1}V_{12}V_{22}^{-1}
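
As a quick sanity check, here is a minimal numerical verification of both identities (a sketch assuming NumPy is available; the matrix V and the block sizes are arbitrary choices for illustration):

import numpy as np

rng = np.random.default_rng(0)
p, q = 3, 2
A = rng.standard_normal((p + q, p + q))
V = A @ A.T + (p + q) * np.eye(p + q)            # random symmetric positive-definite V
V11, V12 = V[:p, :p], V[:p, p:]
V21, V22 = V[p:, :p], V[p:, p:]

V11_2 = V11 - V12 @ np.linalg.inv(V22) @ V21     # Schur complement of V22 in V
V22_1 = V22 - V21 @ np.linalg.inv(V11) @ V12     # Schur complement of V11 in V

lhs = np.linalg.inv(V11_2)
rhs = (np.linalg.inv(V11)
       + np.linalg.inv(V11) @ V12 @ np.linalg.inv(V22_1) @ V21 @ np.linalg.inv(V11))
print(np.allclose(lhs, rhs))                     # True: first identity

lhs = np.linalg.inv(V22_1)
rhs = (np.linalg.inv(V22)
       + np.linalg.inv(V22) @ V21 @ np.linalg.inv(V11_2) @ V12 @ np.linalg.inv(V22))
print(np.allclose(lhs, rhs))                     # True: second identity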

I recently stumbled across a neat application of this while deriving full conditionals for a multivariate normal. Recall that if the data are partitioned into two blocks, Y_{1}, Y_{2}, then the variance of the conditional distribution Y_{1}|Y_{2} is the Schur complement of the block V_{22} of the total variance matrix V. That is, the conditional variance is V_{11.2}=V_{11}-V_{12}V_{22}^{-1}V_{21}, which is the variance of Y_{1} minus a term corresponding to the reduction in uncertainty about Y_{1} gained from knowledge of Y_{2} (a quick numerical check of this appears below). If, however, V_{22} itself has the form of a Schur complement, then it may be possible to exploit the Woodbury identity above to simplify the variance term considerably. I came across this when I derived two very different-looking expressions for the same conditional distribution and found them equivalent by the Woodbury identity. Consider the model

\begin{bmatrix} Y_{1}\\ Y_{2} \end{bmatrix} = \begin{bmatrix} X_{1}\\ X_{2} \end{bmatrix}\beta + \varepsilon

where

\varepsilon \sim N\left( \begin{bmatrix}0\\ 0\end{bmatrix}, \sigma^{2} \begin{bmatrix}I_{1} & 0 \\ 0 & I_{2}\end{bmatrix} \right)

\beta|\sigma^{2} \sim N(0, \sigma^{2}\Lambda^{-1}).

I was seeking the distribution Y_{1}| Y_{2},\sigma^{2} and arrived at it through two different paths. The distributions I derived looked very different, but they turned out to be equivalent upon considering the Woodbury identity.
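
Before getting into the two methods, here is a quick numerical check of the Schur-complement fact just mentioned (a sketch assuming NumPy; the joint covariance V and block sizes are arbitrary illustrative choices). It uses the standard fact that the conditional variance V_{11.2} equals the inverse of the (1,1) block of the joint precision matrix:

import numpy as np

rng = np.random.default_rng(4)
p, q = 3, 2
A = rng.standard_normal((p + q, p + q))
V = A @ A.T + (p + q) * np.eye(p + q)            # joint covariance of (Y1, Y2)
V11, V12 = V[:p, :p], V[:p, p:]
V21, V22 = V[p:, :p], V[p:, p:]

cond_var = V11 - V12 @ np.linalg.inv(V22) @ V21  # V_{11.2} = Var(Y1 | Y2)
Omega = np.linalg.inv(V)                         # joint precision matrix
print(np.allclose(cond_var, np.linalg.inv(Omega[:p, :p])))   # True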

Method 1

This simply manipulates properties of the multivariate normal. Marginalizing over \beta one gets

Cov \begin{bmatrix} Y_{1} \\ Y_{2} \end{bmatrix} = \begin{bmatrix} X_{1} \\ X_{2} \end{bmatrix} Cov(\beta) \begin{bmatrix} X_{1}^{T} & X_{2}^{T} \end{bmatrix} + Cov(\varepsilon)

Cov \begin{bmatrix}   Y_{1} \\  Y_{2}  \end{bmatrix} = \sigma^{2}\begin{bmatrix}   X_{1}\Lambda^{-1} X_{1}^{T} &  X_{1}\Lambda^{-1} X_{2}^{T} \\  X_{2}\Lambda^{-1} X_{1}^{T} &  X_{2}\Lambda^{-1} X_{2}^{T}  \end{bmatrix}  + \sigma^{2}  \begin{bmatrix}  I_{1} & 0 \\ 0 & I_{2}  \end{bmatrix}

so that the joint distribution is

\begin{bmatrix} Y_{1}\\ Y_{2} \end{bmatrix} \Big| \sigma^{2} \sim N \left( \begin{bmatrix} 0\\ 0 \end{bmatrix}, \sigma^{2} \begin{bmatrix} I_{1}+ X_{1}\Lambda^{-1} X_{1}^{T} & X_{1}\Lambda^{-1} X_{2}^{T} \\ X_{2}\Lambda^{-1} X_{1}^{T} & I_{2}+ X_{2}\Lambda^{-1} X_{2}^{T} \end{bmatrix} \right)

It follows that the conditional distribution is
Y_{1}| Y_{2},\sigma^{2} \sim N \left( X_{1}\Lambda^{-1} X_{2}^{T} \left[ I_{2} + X_{2}\Lambda^{-1} X_{2}^{T}\right]^{-1} Y_{2},\; \sigma^{2}\left( I_{1} + X_{1}\Lambda^{-1} X_{1}^{T} - X_{1}\Lambda^{-1} X_{2}^{T} \left[ I_{2} + X_{2}\Lambda^{-1} X_{2}^{T} \right]^{-1} X_{2}\Lambda^{-1} X_{1}^{T}\right)\right).
This looks a bit nasty, but notice that the variance has exactly the Schur-complement form V_{11}-V_{12}V_{22}^{-1}V_{21}, and that V_{22}=I_{2}+X_{2}\Lambda^{-1}X_{2}^{T} looks like it could itself be a Schur complement of some matrix.
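
Here is Method 1 as a short sketch in code (assuming NumPy; the dimensions, X_{1}, X_{2}, \Lambda and \sigma^{2} below are arbitrary illustrative choices). It builds the marginal covariance above and reads off the conditional mean map and the Schur-complement variance, checking them against the displayed formulas:

import numpy as np

rng = np.random.default_rng(1)
n1, n2, k = 4, 5, 3
X1 = rng.standard_normal((n1, k))
X2 = rng.standard_normal((n2, k))
A = rng.standard_normal((k, k))
Lam = A @ A.T + k * np.eye(k)                     # prior precision Lambda
Lam_inv = np.linalg.inv(Lam)
sigma2 = 1.3

# Joint covariance of (Y1, Y2) with beta marginalized out
V11 = sigma2 * (np.eye(n1) + X1 @ Lam_inv @ X1.T)
V12 = sigma2 * (X1 @ Lam_inv @ X2.T)
V22 = sigma2 * (np.eye(n2) + X2 @ Lam_inv @ X2.T)

coef1 = V12 @ np.linalg.inv(V22)                  # multiplies Y2 to give E[Y1 | Y2, sigma2]
var1 = V11 - V12 @ np.linalg.inv(V22) @ V12.T     # conditional variance V_{11.2}

B = np.linalg.inv(np.eye(n2) + X2 @ Lam_inv @ X2.T)
print(np.allclose(coef1, X1 @ Lam_inv @ X2.T @ B))               # True: sigma^2 cancels in the mean
print(np.allclose(var1, sigma2 * (np.eye(n1) + X1 @ Lam_inv @ X1.T
                                  - X1 @ Lam_inv @ X2.T @ B @ X2 @ Lam_inv @ X1.T)))  # True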

Method 2

An alternative route to this distribution is

f( Y_{1}| Y_{2},\sigma^{2} )=\int f( Y_{1}|\sigma^{2},\beta)\,\pi(\beta| Y_{2},\sigma^{2})\,d\beta

where

\beta| Y_{2},\sigma^{2}\sim N \left( ( X_{2}^{T} X_{2}+\Lambda)^{-1} X_{2}^{T} Y_{2}, \sigma^{2}( X_{2}^{T} X_{2}+\Lambda)^{-1} \right).

It follows that

Y_{1}| Y_{2} ,\sigma^{2} \sim N\left(  X_{1}( X_{2}^{T} X_{2}+\Lambda)^{-1} X_{2}^{T} Y_{2}, \sigma^{2} (I_{1} +  X_{1} ( X_{2}^{T} X_{2}+\Lambda)^{-1} X_{1}^{T}) \right)

which looks different from the distribution obtained through Method 1, and the expression for the variance is a lot neater. The two are in fact identical, by the Woodbury identity.
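
To see the equivalence numerically, here is a side-by-side comparison of the two methods (again a sketch assuming NumPy, with arbitrary illustrative choices of dimensions and matrices); both routes yield the same conditional mean map and covariance:

import numpy as np

rng = np.random.default_rng(1)
n1, n2, k = 4, 5, 3
X1 = rng.standard_normal((n1, k))
X2 = rng.standard_normal((n2, k))
A = rng.standard_normal((k, k))
Lam = A @ A.T + k * np.eye(k)
Lam_inv = np.linalg.inv(Lam)
sigma2 = 1.3

# Method 1: condition the joint normal with beta marginalized out
V11 = sigma2 * (np.eye(n1) + X1 @ Lam_inv @ X1.T)
V12 = sigma2 * (X1 @ Lam_inv @ X2.T)
V22 = sigma2 * (np.eye(n2) + X2 @ Lam_inv @ X2.T)
coef1 = V12 @ np.linalg.inv(V22)
var1 = V11 - V12 @ np.linalg.inv(V22) @ V12.T

# Method 2: integrate the likelihood of Y1 against the posterior of beta given Y2
P = np.linalg.inv(X2.T @ X2 + Lam)
coef2 = X1 @ P @ X2.T
var2 = sigma2 * (np.eye(n1) + X1 @ P @ X1.T)

print(np.allclose(coef1, coef2), np.allclose(var1, var2))        # True True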

Comparison

Mean (Submitted by Michelle Leigh)

\begin{aligned}
\left[\Lambda+ X_{2}^{T}I_{2} X_{2}\right]^{-1} X_{2}^{T}
&=\left\{\Lambda^{-1}-\Lambda^{-1} X_{2}^{T}\left[I_{2}+ X_{2}\Lambda^{-1} X_{2}^{T}\right]^{-1} X_{2}\Lambda^{-1}\right\} X_{2}^{T}\\
&=\Lambda^{-1} X_{2}^{T}\left[I_{2}+ X_{2}\Lambda^{-1} X_{2}^{T}\right]^{-1}\left[I_{2}+ X_{2}\Lambda^{-1} X_{2}^{T}\right]-\Lambda^{-1} X_{2}^{T}\left[I_{2}+ X_{2}\Lambda^{-1} X_{2}^{T}\right]^{-1} X_{2}\Lambda^{-1} X_{2}^{T}\\
&=\Lambda^{-1} X_{2}^{T}\left[I_{2}+ X_{2}\Lambda^{-1} X_{2}^{T}\right]^{-1}I_{2}
\end{aligned}

So mean1=mean2.
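
The same equality is easy to confirm numerically (a sketch assuming NumPy; X_{2} and \Lambda are arbitrary illustrative choices):

import numpy as np

rng = np.random.default_rng(2)
n2, k = 5, 3
X2 = rng.standard_normal((n2, k))
A = rng.standard_normal((k, k))
Lam = A @ A.T + k * np.eye(k)
Lam_inv = np.linalg.inv(Lam)

mean_coef_2 = np.linalg.inv(Lam + X2.T @ X2) @ X2.T                              # Method 2 form
mean_coef_1 = Lam_inv @ X2.T @ np.linalg.inv(np.eye(n2) + X2 @ Lam_inv @ X2.T)   # Method 1 form
print(np.allclose(mean_coef_1, mean_coef_2))                                     # True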

Variance

By the Woodbury Identity it follows that

\Lambda^{-1} - \Lambda^{-1} X_{2}^{T} \left[ I_{2} +  X_{2}\Lambda^{-1} X_{2}^{T} \right]^{-1}  X_{2}\Lambda^{-1} = ( X_{2}^{T}I_{2} X_{2}+\Lambda)^{-1}.

Therefore

X_{1}\Lambda^{-1} X_{1}^{T}- X_{1}\Lambda^{-1} X_{2}^{T} \left[ I_{2}+ X_{2}\Lambda^{-1} X_{2}^{T} \right]^{-1} X_{2}\Lambda^{-1} X_{1}^{T}= X_{1}( X_{2}^{T} X_{2}+\Lambda)^{-1} X_{1}^{T}

and so variance1=variance2. The trick is recognizing the form of the formulas at the top of the page; once you do, the variance can be written as a much neater expression.
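
A quick numerical confirmation of the variance equality, checking both the Woodbury step and the sandwiched version (a sketch assuming NumPy; X_{1}, X_{2} and \Lambda are arbitrary illustrative choices):

import numpy as np

rng = np.random.default_rng(3)
n1, n2, k = 4, 5, 3
X1 = rng.standard_normal((n1, k))
X2 = rng.standard_normal((n2, k))
A = rng.standard_normal((k, k))
Lam = A @ A.T + k * np.eye(k)
Lam_inv = np.linalg.inv(Lam)

M = np.linalg.inv(np.eye(n2) + X2 @ Lam_inv @ X2.T)

# Woodbury: Lambda^{-1} - Lambda^{-1} X2' [I2 + X2 Lambda^{-1} X2']^{-1} X2 Lambda^{-1}
#           = (X2' X2 + Lambda)^{-1}
print(np.allclose(Lam_inv - Lam_inv @ X2.T @ M @ X2 @ Lam_inv,
                  np.linalg.inv(X2.T @ X2 + Lam)))                # True

# Sandwiching with X1 gives the equality of the two variance expressions
print(np.allclose(X1 @ Lam_inv @ X1.T - X1 @ Lam_inv @ X2.T @ M @ X2 @ Lam_inv @ X1.T,
                  X1 @ np.linalg.inv(X2.T @ X2 + Lam) @ X1.T))    # True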
