## Variational Inference Step-by-Step (Part 3: Mean Field V.I. cont.)

Hello everyone! Thanks for following the series so far. This post will very likely be the last part of the series. Due to the depth of the topic, we will not be able to cover it from end to end. Nonetheless, by the end of this post, we expect you to at least be familiar with the derivation of mean field variational inference. Enjoy!

## Mean Field Variational Inference (cont.)

We have so far done quite some work in deriving variational inference. We arrived at the point where we split the large equation of \(\mathcal{L}\) into 3 parts and simplified it further. In this section, we will apply a few more tricks and mathematical manipulations. Hang on, we’re almost there!

The first thing we do is split the constant into two further constants: \(\mathcal{C}_1\) and \(\mathcal{C}_2\). Our equation now looks like this:

\[ \begin{aligned} \mathcal{L} = \sum_{z_1} q(z_1) \Bigg[ E_{z_2,z_3} \Big[ \ln p(x_1,x_2,x_3,z_1,z_2,z_3) \Big] + \mathcal{C}_1 + \mathcal{C}_2 \Bigg] - \sum_{z_1} q(z_1) \ln q(z_1) \end{aligned} \]

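Let's shift our focus to this part: \(E_{z_2,z_3} \big[ \ln p(X,Z) \big] + \mathcal{C}_1\), where \(X = (x_1,x_2,x_3)\) and \(Z = (z_1,z_2,z_3)\). We will argue that this part corresponds to the log of some function \(f(X,Z)\). The goal is to substitute that part of the equation with \(\ln f(X,Z)\).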

Follow the argument carefully: we know that \(E_{z_2,z_3} \big[ \ln p(X,Z) \big]\) is the expectation of the log of some function. We also know that \(\mathcal{C}_1\) is some constant. Such an expectation plus a constant can itself be written as the log of some function. But notice that \(f(X,Z)\) is not just a regular function, it is a probability function! This means that it has the property of summing to 1. How do we know that?

Well, we can first get rid of the log in front of \(f(X,Z)\). This is done by taking the exponential of both sides of the equation:

\[ \begin{aligned} \ln f(X,Z) & = E_{z_2,z_3} \Big[ \ln p(X, Z)\Big] + \mathcal{C}_1 \newline e^{\ln f(X,Z)} & = e^{E_{z_2,z_3} [ \ln p(X, Z) ] + \mathcal{C}_1} \newline f(X,Z) & = e^{\mathcal{C}_1} \, e^{E_{z_2,z_3} [ \ln p(X, Z) ]} && \text{(by exponential rule)} \newline f(X,Z) & = \mathcal{K} \, e^{E_{z_2,z_3} [ \ln p(X, Z) ]} && \text{($e^{\mathcal{C}_1}$ can be replaced by another constant $\mathcal{K}$)} \end{aligned} \]

For \(f(X,Z)\) to be a probability function that sums to 1 over \(z_1\), we need to find \(\mathcal{K}\) such that \[ \mathcal{K} = \frac{1}{\sum_{z_1} e^{E_{z_2,z_3} [ \ln p(X, Z) ]}} \]

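This is always possible, since we can choose \(\mathcal{C}_1 = \ln \mathcal{K}\) by splitting the original constant into \(\mathcal{C}_1\) and \(\mathcal{C}_2\) accordingly. We have shown that \(f(X,Z)\) is indeed a probability function!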
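The normalization step can be checked numerically. The snippet below is a minimal sketch under stated assumptions: the table `joint` (a made-up \(p(X,Z)\) with \(X\) held at its observed value and binary \(z_1, z_2, z_3\)) and the factors `q2`, `q3` are hypothetical values chosen only for illustration.

```python
import math

# Hypothetical joint p(X, z1, z2, z3) with X fixed at its observed value,
# indexed by binary (z1, z2, z3). The numbers are made up for illustration.
joint = {
    (0, 0, 0): 0.10, (0, 0, 1): 0.05, (0, 1, 0): 0.02, (0, 1, 1): 0.08,
    (1, 0, 0): 0.04, (1, 0, 1): 0.12, (1, 1, 0): 0.06, (1, 1, 1): 0.03,
}

# Assumed current variational factors q(z2) and q(z3).
q2 = [0.3, 0.7]
q3 = [0.6, 0.4]

def expected_log_joint(z1):
    """E_{z2,z3}[ln p(X, Z)] under q(z2) q(z3), for a fixed value of z1."""
    return sum(
        q2[z2] * q3[z3] * math.log(joint[(z1, z2, z3)])
        for z2 in (0, 1)
        for z3 in (0, 1)
    )

# exp(E[ln p]) for each value of z1; this does not sum to 1 yet.
unnormalized = [math.exp(expected_log_joint(z1)) for z1 in (0, 1)]

# K is one over the sum of the unnormalized values over z1.
K = 1.0 / sum(unnormalized)
f = [K * u for u in unnormalized]

print("f(z1):", f, "sum:", sum(f))
```

Whatever positive values are placed in the table, dividing by the sum over \(z_1\) makes `f` a valid distribution over \(z_1\).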

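Having established that \(f(X,Z)\) is a probability function, we can substitute \(\ln f(X,Z)\) back into the equation. Since \(\sum_{z_1} q(z_1) = 1\), the constant \(\mathcal{C}_2\) can be pulled out of the sum. Namely:

\[ \begin{aligned} \mathcal{L} & = \sum_{z_1} q(z_1) \Big[ \ln f(X,Z) + \mathcal{C}_2 \Big] - \sum_{z_1} q(z_1) \ln q(z_1) \newline & = \sum_{z_1} q(z_1) \ln \frac{f(X,Z)}{q(z_1)} + \mathcal{C}_2 \end{aligned} \]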

With the constant \(\mathcal{C}_2\) set aside, only the portion \(\sum_{z_1} q(z_1) \ln \frac{f(X,Z)}{q(z_1)}\) of \(\mathcal{L}\) is left to maximize. Looking back at the equation for the KL divergence, you should notice that this portion is actually the negative of the KL divergence between \(q(z_1)\) and \(f(X,Z)\). That is, \[ KL(q(z_1)\mid \mid f(X,Z)) = - \sum_{z_1} q(z_1) \Bigg[ \ln \frac{ f(X,Z)}{q(z_1)} \Bigg] \]

Again, we can see that **maximizing \(\mathcal{L}\) with respect to \(q(z_1)\) is equal to minimizing \(KL(q(z_1)\mid \mid f(X,Z))\)**. Remember that the minimum of the KL divergence is achieved when \(q(z_1) = f(X,Z)\). So our goal can be restated as finding the \(q(z_1)\) that equals \(f(X,Z)\). You can trace back the equations above to see how to compute \(f(X,Z)\).
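As a quick sanity check of the claim that the KL divergence is minimized when the two distributions coincide, here is a small sketch. The distribution `f` and the candidate distributions are made-up numbers, and `kl_divergence` is a helper defined only for this illustration.

```python
import math

def kl_divergence(q, f):
    """KL(q || f) for two discrete distributions given as lists."""
    return sum(qi * math.log(qi / fi) for qi, fi in zip(q, f) if qi > 0)

# Hypothetical target distribution f and a few candidate choices of q.
f = [0.2, 0.5, 0.3]
candidates = [
    [0.2, 0.5, 0.3],  # identical to f
    [0.3, 0.4, 0.3],  # slightly off
    [0.6, 0.2, 0.2],  # far off
]

kls = [kl_divergence(q, f) for q in candidates]
for q, kl in zip(candidates, kls):
    print(q, "-> KL =", kl)
```

The divergence is zero only for the candidate equal to `f` and grows as the candidate moves away from it, which is why setting \(q(z_1) = f(X,Z)\) maximizes the remaining portion of \(\mathcal{L}\).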

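This whole derivation is done with respect to \(z_1\). Fortunately, we can apply the exact same formulation to the other variables in \(Z\). Therefore, the whole variational inference can be done by solving the equations for all \(q(z_i)\) where \(z_i \in Z\). Namely, for this example, the equations that we need to solve are:

\[ \begin{aligned} q(z_1) & = \mathcal{K}_1 \exp\Big( \sum_{z_2} \sum_{z_3} q(z_2)\, q(z_3) \ln p(X, Z) \Big) \newline q(z_2) & = \mathcal{K}_2 \exp\Big( \sum_{z_1} \sum_{z_3} q(z_1)\, q(z_3) \ln p(X, Z) \Big) \newline q(z_3) & = \mathcal{K}_3 \exp\Big( \sum_{z_1} \sum_{z_2} q(z_1)\, q(z_2) \ln p(X, Z) \Big) \end{aligned} \]

Here each \(\mathcal{K}_i\) denotes the normalizing constant of the corresponding \(q(z_i)\).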

You can clearly see the pattern in the equations above! We can of course generalize this to an arbitrary number of variables in \(Z\). For each \(q(z_i)\), we would then have multiple summations over the product of all the other \(q(z_j)\), \(j \neq i\).

We now know how to compute each \(q(z_i)\) individually and later use them to compute \(q(Z)\). However, notice that when we compute \(q(z_1)\), we do not know \(q(z_2)\) and \(q(z_3)\). The same applies to every \(q(z_i)\). We are now in a ‘chicken and egg situation’ where the variables depend on each other to be computed. On top of that, we also don’t know the value of any normalizing constant \(\mathcal{K}\).

One approach to address this problem is coordinate ascent inference. It works by iteratively optimizing one variational distribution at a time while holding the others fixed. We can first initialize these distributions randomly and then run the iterative algorithm until it converges.
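To make the idea concrete, here is a minimal sketch of coordinate ascent on a toy problem with three binary latent variables. It is an illustration under stated assumptions, not a definitive implementation: the table `joint` is a made-up \(p(X,Z)\) with \(X\) held fixed, the function names (`update_factor`, `elbo`, `cavi`) are hypothetical, and the factors start from fixed values rather than random ones so that the run is reproducible.

```python
import math
from itertools import product

# Hypothetical joint p(X, z1, z2, z3) with X fixed at its observed value.
joint = {
    (0, 0, 0): 0.10, (0, 0, 1): 0.05, (0, 1, 0): 0.02, (0, 1, 1): 0.08,
    (1, 0, 0): 0.04, (1, 0, 1): 0.12, (1, 1, 0): 0.06, (1, 1, 1): 0.03,
}

def normalize(weights):
    total = sum(weights)
    return [w / total for w in weights]

def update_factor(i, q):
    """Return q(z_i) proportional to exp(E[ln p(X, Z)]) under the other factors."""
    a, b = [j for j in range(3) if j != i]
    log_weights = []
    for zi in (0, 1):
        expectation = 0.0
        for za, zb in product((0, 1), repeat=2):
            z = [0, 0, 0]
            z[i], z[a], z[b] = zi, za, zb
            expectation += q[a][za] * q[b][zb] * math.log(joint[tuple(z)])
        log_weights.append(expectation)
    # Subtract the maximum before exponentiating for numerical stability;
    # the shift cancels out in the normalization.
    shift = max(log_weights)
    return normalize([math.exp(lw - shift) for lw in log_weights])

def elbo(q):
    """Lower bound L = sum_Z q(Z) [ln p(X, Z) - ln q(Z)] with q(Z) = prod_i q(z_i)."""
    value = 0.0
    for z in product((0, 1), repeat=3):
        qz = q[0][z[0]] * q[1][z[1]] * q[2][z[2]]
        value += qz * (math.log(joint[z]) - math.log(qz))
    return value

def cavi(q, max_iters=100, tol=1e-10):
    """Update one factor at a time until the lower bound stops improving."""
    history = [elbo(q)]
    for _ in range(max_iters):
        for i in range(3):
            q[i] = update_factor(i, q)
        history.append(elbo(q))
        if abs(history[-1] - history[-2]) < tol:
            break
    return q, history

# Fixed (non-random) starting factors, chosen arbitrarily.
q_init = [[0.5, 0.5], [0.4, 0.6], [0.7, 0.3]]
q_final, elbo_history = cavi([list(factor) for factor in q_init])

print("q(z1), q(z2), q(z3):", q_final)
print("lower bound per sweep:", elbo_history)
```

Each update only needs the current values of the other two factors, which is how the circular dependency is broken. The normalizing constant never has to be known in advance because it is recovered by dividing by the sum. In this sketch the lower bound never decreases from one sweep to the next and stays below the log of the sum of the table entries, which plays the role of \(\ln p(X)\).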

Unfortunately, this series will not discuss in detail the algorithm that addresses this problem. But hopefully, after this series of posts, you’ll be better equipped to read other resources that cover the topic in more depth and perhaps in a more rigorous manner. We will end this post here and see you in another post!