Deep Learning

Class 3. MCMC and Bayesian extensions


Bayesian estimation techniques can replace the arbitrary construction of a deep learning model’s hidden layers. One way is to replicate Factor Analysis in the construction of every layer, except that now a change in one layer’s values affects the other layers. This process propagates from one layer to all layers. What makes the job more demanding is that we are still unsure whether the next layer’s number of nodes (or hidden factors) is right, just as we are unsure about the feeding layer’s node count. In fact, everything here is uncertain and dependent on everything else. This is where Bayesian methods help us in a less arbitrary manner.

Let’s recall what we learn in basic Bayesian courses. With a handful of sample data, we posit distributions A, B, and C as candidates for the population, and say we assign probability 1/3 to each. Bayesians call this the ‘Prior’. We then find another sample, which acts like a new weight on the candidate distributions; in the Bayesian world, this is the ‘Likelihood’. Combining the ‘Prior’ and the ‘Likelihood’ gives the ‘Posterior’. With yet another set of sample data we can repeat the process, placing the ‘Posterior’ in the role of the ‘Prior’. We repeat this N times, and at some point the ‘Posterior’ is hardly affected by the ‘Likelihood’ anymore. That final ‘Posterior’ is the probability assignment we were initially looking for. The same process works with candidates A, B, C, and D, or with even more candidates that fit the sample data.
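Below is a minimal sketch of this prior → likelihood → posterior loop. The three candidate distributions (A, B, C as normals with different means) and the data source are illustrative assumptions, not part of the lecture material.

```python
# Toy Bayesian updating over three candidate population distributions.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Candidate population distributions A, B, C (assumed normals, for illustration)
candidates = {"A": norm(loc=-1.0, scale=1.0),
              "B": norm(loc=0.0, scale=1.0),
              "C": norm(loc=1.0, scale=1.0)}

# Flat prior: 1/3 probability on each candidate
posterior = {name: 1.0 / 3.0 for name in candidates}

# In this toy setup the data actually come from candidate B
for step in range(1, 11):
    batch = rng.normal(loc=0.0, scale=1.0, size=20)   # new sample data
    # Likelihood of the batch under each candidate
    likelihood = {name: np.prod(dist.pdf(batch)) for name, dist in candidates.items()}
    # Posterior is proportional to prior times likelihood; it becomes the next prior
    unnorm = {name: posterior[name] * likelihood[name] for name in candidates}
    total = sum(unnorm.values())
    posterior = {name: w / total for name, w in unnorm.items()}
    print(step, {name: round(p, 3) for name, p in posterior.items()})
# After enough batches the posterior concentrates on B and barely moves.
```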

[Figure: COM503 Lecture Note 3 – MCMC Sampling]

The structure of Bayesian ‘learning’ is in fact similar to what we do with Feed Forward and Back Propagation over multiple loops. This is where Bayesian statistics meets Deep Learning.

MCMC (Monte-Carlo + Markov-Chain) simulation

The term MCMC is closely tied to Bayesian model building: it creates plausibly simulated data and lets the model learn from them by itself.

The first MC, Monte Carlo, refers to simulation under prior assumptions about the data set’s distribution. Recall from COM501: Scientific Programming that, by the LLN (Law of Large Numbers), the Monte Carlo estimate satisfies $I_M \rightarrow I$ as $M \rightarrow \infty$. In Bayesian terms, more data helps us approximate the true underlying distribution more closely when the population density is unknown. Remember that we are unsure about the number of nodes in each layer of the Autoencoder. So, as long as we can construct a convergence path, the outcome will be the best-fitting model of the data’s hidden structure.
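A minimal sketch of the LLN argument follows: the Monte Carlo average $I_M$ approaches the true value $I$ as $M$ grows. The integrand, $E[X^2]$ for $X \sim N(0,1)$ with true value 1, is an illustrative choice rather than anything from the lecture.

```python
# Monte Carlo estimate I_M of I = E[X^2] = 1 for X ~ N(0, 1), for growing M.
import numpy as np

rng = np.random.default_rng(42)
true_I = 1.0                      # E[X^2] = Var(X) = 1 for a standard normal

for M in (10, 100, 10_000, 1_000_000):
    draws = rng.normal(size=M)
    I_M = np.mean(draws ** 2)     # Monte Carlo estimate with M draws
    print(f"M = {M:>9,d}   I_M = {I_M:.4f}   |I_M - I| = {abs(I_M - true_I):.4f}")
```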

The second MC, Markov-Chain, describes how each simulated draw is generated: the next draw depends only on the current state, not on the full history of earlier draws, and in the long run the chain’s draws behave like samples from the target distribution, which is what makes the Monte Carlo convergence work. This fits best when the data themselves are close to i.i.d. For example, in image recognition each image’s digit is independent of the others: having a 5 now does not mean I will get a 6 in the next draw. And when we feed images to the model, we already preprocess them with sliding windows, which will be revisited in the image recognition part. A small illustration of the Markov property is given below.
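This is a minimal sketch of the Markov property: each new draw is generated from the current state alone, yet the long-run frequencies settle at the chain’s stationary distribution. The two-state transition matrix is an illustrative assumption, not from the lecture.

```python
# Simulate a 2-state Markov chain and compare long-run frequencies
# with its stationary distribution.
import numpy as np

rng = np.random.default_rng(7)

# Transition matrix P: row i is the distribution of the next state given state i
P = np.array([[0.9, 0.1],
              [0.3, 0.7]])
# The stationary distribution solves pi = pi P, giving pi = (0.75, 0.25) here.
state, counts = 0, np.zeros(2)
for _ in range(100_000):
    state = rng.choice(2, p=P[state])   # depends only on the current state
    counts[state] += 1

print("empirical frequencies:", counts / counts.sum())   # ~ [0.75, 0.25]
```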

Overall, MCMC simulation greatly helps us construct an ideal Autoencoder without arbitrarily experimenting with every possible configuration. One may still rely on stepwise-type searches, but given the risk of overfitting, local maxima, and exponentially growing computational cost, it is always wiser to rely on more principled tools like MCMC.

In the lecture, Gibbs sampling, the best-known MCMC technique, is presented for a sample construction of the Autoencoder. One can also rely on Metropolis-Hastings if one needs to alter a marginal distribution, for instance by truncation.
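For reference, here is a minimal Gibbs sampling sketch on a toy target (a standard bivariate normal with correlation rho), not the Autoencoder construction from the lecture: each coordinate is redrawn from its exact conditional given the other.

```python
# Gibbs sampler for a standard bivariate normal with correlation rho.
import numpy as np

rng = np.random.default_rng(123)
rho = 0.8                          # assumed correlation of the toy target
n_draws, burn_in = 20_000, 1_000

x = y = 0.0
samples = []
for t in range(n_draws):
    # Conditionals of the standard bivariate normal:
    #   x | y ~ N(rho * y, 1 - rho^2),   y | x ~ N(rho * x, 1 - rho^2)
    x = rng.normal(rho * y, np.sqrt(1 - rho ** 2))
    y = rng.normal(rho * x, np.sqrt(1 - rho ** 2))
    if t >= burn_in:
        samples.append((x, y))

samples = np.array(samples)
print("sample correlation:", np.corrcoef(samples.T)[0, 1])   # ~ 0.8
```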
