Archive for the 'classes' Category

Organic Learning vs Machine Learning

What advantages does the brain have over machine learning? What are we missing?

Duties before class (please review well before the class):

  • Read sections 11 and 12 of the review paper on representation learning.
  • Listen to the last two videos of lecture 16.
  • Please prepare a question for April 25.
  • Please try to implement and evaluate an unsupervised pre-training strategy (based on RBMs or regularized auto-encoders) for the kaggle competition.
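
For the pre-training item above, here is a minimal, hypothetical sketch of one possible greedy layer-wise strategy, using scikit-learn’s BernoulliRBM as a stand-in; the data, shapes and hyper-parameters are placeholders, not recommendations:

```python
# Hypothetical sketch: greedy layer-wise RBM pre-training, then a supervised model
# on top of the learned features. All shapes and hyper-parameters are placeholders.
import numpy as np
from sklearn.neural_network import BernoulliRBM
from sklearn.linear_model import Ridge

rng = np.random.RandomState(0)
X = rng.rand(500, 96 * 96)       # stand-in for the keypoints images, scaled to [0, 1]
y = rng.rand(500, 30)            # stand-in for the 30 keypoint coordinates

# Layer 1: fit an RBM on the raw pixels, then map the data through it.
rbm1 = BernoulliRBM(n_components=300, learning_rate=0.05, n_iter=5, random_state=0)
H1 = rbm1.fit_transform(X)

# Layer 2: fit a second RBM on the first layer's hidden representation.
rbm2 = BernoulliRBM(n_components=100, learning_rate=0.05, n_iter=5, random_state=0)
H2 = rbm2.fit_transform(H1)

# Supervised stage: a simple regressor on the pre-trained features. A full
# fine-tuning would instead initialize an MLP from rbm1.components_ and
# rbm2.components_ and backpropagate through all layers.
reg = Ridge(alpha=1.0).fit(H2, y)
print("train MSE:", np.mean((reg.predict(H2) - y) ** 2))
```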

Planned for the class:

  • We will discuss questions you have raised in posts or comments, or questions about the past exams.

Hierarchical Coordinate Frames

Hinton describes a method for recognizing objects by learning hierarchies of coordinate frames. The basic idea is to use a neural network to learn vectors that represent the poses of objects and their parts in the image, and weights that represent the spatial relationships between parts and wholes.

Hinton illustrates this with a few groups of neurons, but what might the whole architecture (of a generic object recognition system employing this method) look like?
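
As a concrete, entirely toy illustration of the core operation (not Hinton’s actual architecture): suppose each detected part outputs a pose matrix mapping its own coordinate frame to image coordinates, and the weights store a fixed part-in-whole matrix for each part. The whole is then supported when the parts’ predictions of its pose agree. All names and numbers below are made up:

```python
# Toy illustration of pose agreement between parts (hypothetical, not Hinton's system).
# Convention assumed here: a pose matrix maps a frame's own coordinates to image
# coordinates, using homogeneous 2-D transforms (3x3 matrices).
import numpy as np

def pose(theta, sx, sy, tx, ty):
    """Build a 3x3 pose matrix: rotation + scale + translation."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[sx * c, -sy * s, tx],
                     [sx * s,  sy * c, ty],
                     [0.0,     0.0,    1.0]])

# Learned (here: hand-fixed) part-in-whole relationships: where the mouth and nose
# sit inside the face's own coordinate frame.
MOUTH_IN_FACE = pose(0.0, 0.3, 0.3, 0.0, -0.4)
NOSE_IN_FACE  = pose(0.0, 0.2, 0.2, 0.0,  0.0)

# Suppose the lower layers detected these part poses in the image.
true_face = pose(0.3, 1.2, 1.2, 50.0, 80.0)
mouth_pose = true_face @ MOUTH_IN_FACE
nose_pose  = true_face @ NOSE_IN_FACE

# Each part predicts the pose of the whole it could belong to.
face_from_mouth = mouth_pose @ np.linalg.inv(MOUTH_IN_FACE)
face_from_nose  = nose_pose  @ np.linalg.inv(NOSE_IN_FACE)

# Agreement between the predictions is evidence that a face is present.
disagreement = np.linalg.norm(face_from_mouth - face_from_nose)
print("face pose predictions agree:", disagreement < 1e-6)
```

In a full system the part poses would be produced by learned units and the part-in-whole matrices would be learned weights; the sketch only shows the consistency check.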

Auto-encoders & new drop-out

1) In auto-encoder sampling using MCMC, it is mentioned that when the score is zero, we look at the smallest second derivative of the log-density. Since we are already at a local mode, is the purpose of this process to move to another nearby mode?

2) In section 9.5, it is mentioned that for auto-encoders and sparse coding, the test-set reconstruction error can be misleading, because larger capacity (more features, more training time) leads to lower reconstruction error even on the test set. If the model generalizes well to the test set, then why is this bad?

3) What is the new dropout trick that injects strong binary multiplicative noise on the hidden units (mentioned at the end of section 10.1)?


Deep net and Learning by parts

1.  Why (and how) does making the Jacobian of each layer closer to 1 in a deep net reduce the difficulty of training?

2.  In his lecture, Hinton says that a good way to recognize an object is to learn the poses and positions of its parts. In the case of recognizing a face, we learn the mouth, nose, eyes, etc. But here we use the prior that a face has a mouth, a nose, and eyes. How can this approach be generalized? For example, how can the computer learn what the parts are, and into how many parts the object should be divided?

DBM and doubled weights

In video 16.2, Hinton gives a rather high-level explanation of why we need to halve the weights of some layers when we combine RBMs to build a DBM. Can you give a more detailed explanation of why simply dividing by 2 gives a suitable geometric mean?
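
As a partial sketch of the “geometric mean” part of the answer (just the standard exponential-family identity, not Hinton’s full argument): for two models over the same variables with the same sufficient statistics $s(x)$,

$$
p_i(x) \propto e^{\theta_i^\top s(x)} \;(i = 1, 2)
\qquad\Longrightarrow\qquad
\sqrt{p_1(x)\,p_2(x)} \;\propto\; e^{\frac{1}{2}(\theta_1 + \theta_2)^\top s(x)},
$$

so taking the geometric mean of two such models amounts to averaging their natural parameters. If the two models that would otherwise both act on a given layer (once from the RBM below, once from the RBM above) share the same weights W, that average is W/2, i.e. halving the weights.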

Joint Model: DBN vs DBM

Hinton mentions that DBMs are better than DBNs, since joint training of a DBM allows each modality to improve the early layers of the other modality. He then says that you can do basically the same thing by training your deep belief net with a contrastive wake-sleep algorithm.

What are the pros and cons of using each method in a joint model? Are there any results from people using deep belief nets for this problem?

CNN and ‘argmax pooling’

In his videos, Hinton argues that convolutional neural networks are doomed because pooling loses the precise spatial relationship between high-level parts (such as the nose and mouth in the case of a face).

Could you lessen this issue by using a sort of “argmax-and-max pooling”, where you propagate both the max response from the pooling region and the location of that max relative to the boundaries of the pooling region?
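
A minimal numpy sketch of the proposed pooling (my own naming and layout; it leaves open how the stored offsets would be used and backpropagated in a real CNN layer):

```python
# Hypothetical "argmax-and-max" pooling: for each pooling region, keep both the max
# response and the (row, col) offset of that max relative to the region's corner.
import numpy as np

def argmax_and_max_pool(fmap, size):
    """fmap: (H, W) feature map; size: side of square pooling regions (H, W divisible by size)."""
    h, w = fmap.shape[0] // size, fmap.shape[1] // size
    maxes = np.zeros((h, w))
    offsets = np.zeros((h, w, 2), dtype=int)   # location of the max inside each region
    for i in range(h):
        for j in range(w):
            region = fmap[i * size:(i + 1) * size, j * size:(j + 1) * size]
            idx = np.unravel_index(np.argmax(region), region.shape)
            offsets[i, j] = idx
            maxes[i, j] = region[idx]
    return maxes, offsets

fmap = np.random.RandomState(0).rand(8, 8)
maxes, offsets = argmax_and_max_pool(fmap, size=4)
print(maxes.shape, offsets.shape)   # (2, 2) and (2, 2, 2)
```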

Plan for April 22’s class

Duties before class (please review well before the class):

  • Read section 10 of the review paper on representation learning.
  • Listen to the first two videos of lecture 16.
  • Please prepare a question for April 22.
  • Please try to implement and evaluate an unsupervised pre-training strategy (based on RBMs or regularized auto-encoders) for the kaggle competition.

Planned for the class:

  • We will discuss questions you have raised in posts or comments.

Learning a parametric mapping based on a neighbourhood graph

How can we learn a parametric mapping based on a neighbourhood graph?

Herding

What is herding?

Does herding also have mixing problems like MCMC methods?

CAE+H and PSD

1) How does the additional term in CAE+H, compared to the CAE, penalize higher-order derivatives?

2) Why does PSD sit at the intersection of probabilistic models and encoding methods?

Noise and Saturation

When training a deep auto-encoder for the purposes of semantic hashing, Hinton adds noise to the input during the fine-tuning stage, because this forces the activities of the sigmoid units in the code to become bimodal (saturate). Why does adding noise have this effect?

Approximate inference

Q1

Could you give an example of a learned inference mechanism as opposed to a general one?

Q2

What is Langevin MCMC (mentioned in section 9.2)? Since it uses the estimated second derivative of the density, does it have a significant impact on computation time (harder to compute per step, but faster to converge)?

Auto-encoders and sampling

1) Auto-encoders aren’t probabilistic models, so what makes it possible to sample from them successfully? (Not how we can sample, but why we can sample.)

2) Can you explain how we can sample from auto-encoders using an MCMC sampling algorithm?

3) What are the advantages and drawbacks of sampling from a shallow auto-encoder compared to an RBM or from a deep auto-encoder compared to a DBN (or DBM)?

AIS & Tracking the partition function

Q1 – How does the technique known as Annealed Importance Sampling work to estimate the partition function?
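
For reference, a compressed sketch of the standard AIS recipe (which the discussion can expand on): pick unnormalized intermediate distributions bridging from a tractable base model $p_0$ (with known partition function $Z_0$) to the target $p_K$,

$$
f_k(x) = f_0(x)^{\,1-\beta_k}\, f_K(x)^{\,\beta_k}, \qquad 0 = \beta_0 < \beta_1 < \dots < \beta_K = 1 .
$$

For each of $M$ independent runs, draw $x_0 \sim p_0$, then for $k = 1, \dots, K-1$ apply a transition operator that leaves $p_k$ invariant to obtain $x_k$, and accumulate the importance weight

$$
w = \prod_{k=1}^{K} \frac{f_k(x_{k-1})}{f_{k-1}(x_{k-1})}, \qquad \mathbb{E}[w] = \frac{Z_K}{Z_0},
\qquad \hat{Z}_K = Z_0 \cdot \frac{1}{M} \sum_{m=1}^{M} w^{(m)} .
$$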

Q2 – What technique(s) is/are used to track the partition function of an RBM?

Plan for April 18’s class

Duties before class (please review well before the class):

  • Read section 9 of the review paper on representation learning.
  • Listen to the videos 4, 5 and 6 (the last three) of lecture 15.
  • Please prepare a question for April 18.
  • Please try to implement and evaluate an unsupervised pre-training strategy (based on RBMs or regularized auto-encoders) for the kaggle competition.

Planned for the class:

  • We will discuss questions you have raised in posts or comments.
  • Each student will talk about what they have tried with the keypoints data.

Direct Encoding and Probabilistic Models

It is enough to answer 3 of the 5 questions below:

1) PSD: Is there an advantage to using PSD (predictive sparse decomposition) instead of sparse auto-encoders, given that PSD seems to be harder to optimize?

1.1) If you want to stack PSDs, what would you give as input to the upper layer: f_{\alpha}(x^t) or h^t?

2) In the energy function of the Boltzmann machine, where does the 1/2 in front of the quadratic term come from? (See the sketch after this list.)

3) Is there an experiment in the literature comparing greedy layer-wise unsupervised training with jointly training all the layers of a stacked auto-encoder?

4) In Hinton’s bottleneck architecture, does it make sense to use a contractive auto-encoder, because of its contraction property?
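
Regarding question 2 above, a short sketch of where the 1/2 comes from, assuming the energy is written with a full symmetric weight matrix $W$ (zero diagonal) instead of a sum over distinct pairs:

$$
E(s) = -\tfrac{1}{2}\, s^\top W s - b^\top s
     = -\tfrac{1}{2} \sum_{i,j} W_{ij} s_i s_j - \sum_i b_i s_i
     = -\sum_{i<j} W_{ij} s_i s_j - \sum_i b_i s_i ,
$$

so the 1/2 simply compensates for each pair $(i, j)$ being counted twice in the double sum.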

Stop words

In his lecture, Hinton mentions that bottleneck auto-encoders are useful for document classification. The inputs of such models are vectors of word counts in the document, from which “stop words” (common words such as ‘the’, ‘or’, ‘a’, …) are removed.

How are those words chosen? Could we imagine using, as a weighting, how frequent a word is in a given document compared to how frequent it is in the other documents? If you take a word like ‘the’, which appears a lot in your document, the fact that it is also used often across your whole database would lower its importance. This approach would help select problem-dependent stop words instead of relying on a predetermined list.
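
The weighting described above is essentially tf-idf; a minimal sketch (toy documents, bare-bones formula, no smoothing):

```python
# Minimal tf-idf sketch: a word that appears in every document (e.g. "the") gets
# idf = log(1) = 0 and therefore zero weight, without any hand-made stop-word list.
import math
from collections import Counter

docs = [
    "the cat sat on the mat".split(),
    "the dog ate the bone".split(),
    "the networks learn the representations".split(),
]

n_docs = len(docs)
df = Counter(word for doc in docs for word in set(doc))   # document frequencies

def tfidf(doc):
    tf = Counter(doc)
    return {w: tf[w] * math.log(n_docs / df[w]) for w in tf}

print(tfidf(docs[0]))   # "the" gets weight 0.0; the content words keep positive weights
```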

Direct Encoding

Q1: Why is it called direct encoding? Is there an indirect encoding? If so, what is it?

Q2: In the predictive sparse decomposition equation, how is h computed? And how does the choice of f_{\alpha} affect the behaviour of the encoder?

DAE and CAE

In the article about representation learning, it is said that DAE and CAE are tightly connected. Can you explain why?

Training Deep AE & FPCD

1) Based on Hinton’s explanations, is the unsupervised pre-training the same for both DBNs and deep auto-encoders? (In both, we train a stack of RBMs one layer after another.)

2) In FPCD, there are two sets of parameters: (1) the model that is trained, with parameters theta (the objective model), and (2) the model that is used for sampling, with parameters theta-star (the sampling model).

2-1) Is the log-likelihood gradient that is used to update both models the same? (The log-likelihood gradient is the coefficient of epsilon in the objective model and of epsilon-star in the sampling model.)

2-2) Considering the formula for theta-star in the sampling model, what gives it its better mixing property? (Which terms in the theta-star formula are responsible?)

2-3) This question is probably related to the previous one:

In the formula for theta-star, the learning rate epsilon-star satisfies epsilon-star > epsilon, which makes theta-star move faster in the direction specified by the current update; at the same time, the term eta*theta-star(t) keeps theta-star reluctant to move away from its previous value. What is the reasoning behind this setting?

Plan for April 15’s class

Duties before class (please review well before the class):

  • Read sections 7 and 8 of the review paper on representation learning.
  • Listen to the 2nd and 3rd video of lecture 15.
  • If you have not prepared a question for April 11, please prepare a question for April 15.
  • Please try to implement and evaluate an unsupervised pre-training strategy (based on RBMs or regularized auto-encoders) for the kaggle competition.

Planned for the class:

  • We will discuss questions you have raised in posts or comments.
  • Each student will talk about what they have tried with the keypoints data.

Sparse Coding

In section 6.1.3, it is mentioned that we seek the maximum a posteriori (MAP) value of h, i.e. h* = argmax_h p(h|x), instead of the usual expectation, and that this MAP value is used for learning a dictionary W. How do we find this h* in practice (is it tractable?), and why do we call W a dictionary?
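
On the “how do we find h* in practice” part: with a Laplace prior on h, the MAP problem is the convex lasso-type objective $\frac{1}{2}\|x - Wh\|^2 + \lambda \|h\|_1$, and one standard solver is ISTA (iterative soft-thresholding). A rough sketch (the dictionary here is random; in sparse coding its columns, often called atoms, would be learned):

```python
# Rough ISTA sketch for MAP inference in sparse coding:
#   h* = argmin_h 0.5 * ||x - W h||^2 + lam * ||h||_1
# W is the dictionary (one "atom" per column), assumed already learned (random here).
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def ista(W, x, lam=0.1, n_iter=200):
    L = np.linalg.norm(W, 2) ** 2          # Lipschitz constant of the smooth part
    h = np.zeros(W.shape[1])
    for _ in range(n_iter):
        grad = W.T @ (W @ h - x)           # gradient of 0.5 * ||x - W h||^2
        h = soft_threshold(h - grad / L, lam / L)
    return h

rng = np.random.RandomState(0)
W = rng.randn(20, 50)                                  # overcomplete: 50 atoms in R^20
x = W @ (rng.randn(50) * (rng.rand(50) < 0.1))         # signal with a sparse code
h_star = ista(W, x)
print("nonzeros in h*:", np.count_nonzero(h_star))
```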

Plan for April 11’s class

Duties before class (please review well before the class):

  • Read sections 5 and 6 of the review paper on representation learning.
  • Toss a coin. If heads, prepare a question for April 15; otherwise prepare a question for April 11. If you cheat, and many of you do, it will show up as the fraction of heads being far from the expected value of 0.5.
  • Please try more models on the  facial keypoints detection task.

Planned for the class:

  • We will have a quiz.
  • We will discuss questions you have raised in posts or comments.

Plan for April 8’s class

Duties before class (please review well before the class):

  • Please report a baseline experimental result on the kaggle competition data on facial keypoints detection (the model should be as simple as you want but needs to try to predict outputs from inputs).

Planned for the class:

  • Ian will finish the lecture on variational inference methods. Slides: pdf. Math: pdf, lyx.
  • We will discuss the first quiz (for marks).
  • We will discuss questions you have raised in posts or comments.

Deep sigmoid belief net vs AE

What are the advantages of using a deep sigmoid belief net (a stack of RBMs) to train the features, compared to the other auto-encoders?

What priors are exploited by this structure?

Analogy of K-means to Sparsity Constraint

The usual output of K-means is a one-hot encoding, because of the winner-take-all rule. Leave this unchanged for the learning rule, but let the output be a softmax of the negative distances from an input to each centroid (or something similar). The output will probably still be very sparse, probably more so in high-dimensional space because of the curse of dimensionality (right?).
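
A quick numerical sanity check of that last claim (toy Gaussian data and centroids, my own choice of setup):

```python
# Toy check: softmax over negative squared distances to random centroids.
# As the dimension grows, the gaps between the distances grow too, so the softmax
# output tends to stay very close to one-hot (i.e. very sparse).
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.RandomState(0)
for dim in (2, 100, 1000):
    centroids = rng.randn(10, dim)
    x = rng.randn(dim)
    d2 = ((centroids - x) ** 2).sum(axis=1)
    print(dim, "largest output:", round(softmax(-d2).max(), 3))
```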

If we think of a sparsity constraint on hidden representations, these will also end up being sparse. Both are based on the same kind of underlying linear/affine transformation during the feedforward pass. Isn’t the sparsity constraint imposing the same kind of clustering of inputs into centroids as K-means, thus making it a local generalization where different regions of the input space are basically associated with their own private set of parameters?

How does the backpropagated error affect the equivalent centroids in the network with a sparsity constraint?

Infinite Sigmoid belief nets and AEs vs PCA

What is a complementary prior, and how does it cancel out the explaining-away effect?

In the video, Geoffrey Hinton says that CD-1 is probably better than the maximum-likelihood gradient. Why?

Why don’t people do, for example, CD-1.5 to get the reconstruction and compute the reconstruction error, in order to measure how well the RBM learns the input data?

Can we say that PCA corresponds to an auto-encoder with tied weights? Assume an auto-encoder with weights W and inputs X (and no biases), where W tries to learn, by SGD, the space spanned by the principal components, and the encoded representation is h = WX. If we use tied weights, the decoder is W’ (the transpose); training pushes towards the constraint W’W = I, since with perfect reconstruction W’WX equals X.
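
A small numerical check of the projection view behind this claim (using the SVD directly rather than SGD training, so it only verifies the case where W already spans the principal subspace):

```python
# Check: a tied-weights linear auto-encoder whose encoder rows are the top-k principal
# directions (h = W x, reconstruction = W.T h) gives exactly the rank-k PCA reconstruction.
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(500, 10) @ rng.randn(10, 10)     # correlated data, shape (n, d)
Xc = X - X.mean(axis=0)

k = 3
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
W = Vt[:k]                                      # (k, d): top-k principal directions as rows

H = Xc @ W.T                                    # encoder applied to every example
X_hat = H @ W                                   # tied decoder (the transpose of W)
X_pca = Xc @ Vt[:k].T @ Vt[:k]                  # standard rank-k PCA reconstruction

print(np.allclose(X_hat, X_pca), np.allclose(W @ W.T, np.eye(k)))   # True True
```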

CD & PCA-whitened

1) In CD learning, why do we need more iterations of CD to get unbiased samples from the equilibrium distribution as the weights get bigger?

2) What is the PCA-whitening that is applied to the data as preprocessing?
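
On question 2, a bare-bones sketch of what PCA whitening usually means: center the data, rotate it onto the principal axes, and rescale each axis to roughly unit variance (the small epsilon is a common regularizer; its value here is arbitrary):

```python
# PCA whitening: after the transform, the features are decorrelated with ~unit variance.
import numpy as np

def pca_whiten(X, eps=1e-5):
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ Xc / X.shape[0]
    eigval, eigvec = np.linalg.eigh(cov)        # eigen-decomposition of the covariance
    return Xc @ eigvec / np.sqrt(eigval + eps)  # rotate, then rescale each component

rng = np.random.RandomState(0)
X = rng.randn(1000, 5) @ rng.randn(5, 5)       # correlated toy data
Xw = pca_whiten(X)
print(np.round(np.cov(Xw, rowvar=False), 2))   # approximately the identity matrix
```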

relu, smoothness assumption and Pylearn

1) What makes the ReLU scale invariant? How does this relate to images?
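
(For reference, the scale invariance in question is presumably just the positive homogeneity of the rectifier,

$$
\max(0, a x) = a \max(0, x) \quad \text{for any } a \ge 0 ,
$$

so rescaling the input, e.g. a global contrast change in an image, rescales the activations of a bias-free ReLU layer without changing which units are active.)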

2) What is problematic with the smoothness assumption? How can we get around this and still use a simple parametric or local non-parametric model (a Gaussian kernel, for example)?

A practical question about pylearn for Ian:

3) Is there a convenient way to define unsupervised pre-training followed by supervised training using a yaml file? Or do we need to write a script to do so?

Reaching equilibrium state in a RBM

Chuck Norris can sample infinitely many times, but we cannot. Say we want to reach the equilibrium state of an RBM: do we have a convergence criterion that tells us how “close” we are to the equilibrium distribution? In other words, how can we quantify the residual error if we stop after N steps?

Finding Manifolds using Backpropagation

Q1:

Hinton mentions that a linear single-layer bottleneck network gives a projection of the input onto the space obtained with PCA. However, the PCA vectors (along directions of high variance) are different from the hidden units, which tend to have similar variances and non-orthogonal directions. Is it possible to make the hidden units match the eigenvectors of PCA (by adding constraints on the weights, for example)? Would that even be interesting?

Q2:

It has been mentioned that such bottleneck architectures are useful for implementing feature hash tables. What are the other applications? Can the resulting manifold be seen as a kernel on which one could “plug” an SVM?

CNNs and Speech Recognition

We have seen how convolutional neural networks (CNN) are used for image recognition. The review paper tells us that convolutional (or time-delay) networks have also been used for speech recognition. What features of CNNs (local connectivity, shared weights, pooling) are used for the task of speech recognition?

Recursive AE and Noise Contrastive Estimation

Q1 – What are recursive auto-encoders? What makes them more general than the RNNs that they generalize?

Q2 – The review paper refers to something called “noise-contrastive estimation”; what is it?

Plan for April 4’s class

Duties before class (please review well before the class):

  • View last video of lecture 14 and first video of lecture 15.
  • Read first 4 sections of the review paper on representation learning.
  • Come up with AT LEAST ONE QUIZ QUESTION regarding the material you have read or seen. Post it to the site as New Post (not a quiz reply), with categories apr4, quiz, and a tag identifying you (your name).
  • Please report a baseline experimental result on the kaggle competition data on facial keypoints detection (the model should be as simple as you want but needs to try to predict outputs from inputs).

Planned for the class:

  • Ian will give a lecture on variational inference methods.
  • If time permits, we will discuss questions you have raised in posts or comments.

RBM as abstractors

Unsupervised pre-training by stacking RBMs, followed by supervised fine-tuning (backpropagation), is better than backpropagation alone. One could think that pipelining a data distribution through a stack of RBMs would lose more and more of the original distribution, as it would when pipelined through a stack of random weight matrices. Yet it manages to translate the original distribution into another that is more suitable for learning in the final MLP layer. Is this because RBMs learn a kind of abstraction of the input distribution, thus facilitating generalization?

Training Deep Belief Networks

Can we use dropout and weight norm constraint with deep belief networks? How?

Theoretically, is training an RBM with an infinite number of hidden units equivalent to training a DBN with an infinite number of layers? Would an infinite-depth DBN converge to something (e.g., a Boltzmann machine) in terms of representational power?

What is the mean-field approximation, and when do you need to use it?

What is the difference between the training of a DBN and that of a sigmoid belief network? Why is it easier to train and draw samples from a DBN than from a sigmoid belief network? What is the advantage of greedy layer-wise training over the wake-sleep algorithm?

Wake-sleep Algorithm

1- What makes the contrastive wake-sleep algorithm better than the wake-sleep algorithm?

2- Why do the flaws of the wake-sleep algorithm, such as the problem of estimating the posterior distribution over the hidden layer given the visible units, not make learning difficult in deep belief nets (in the generative case)?

Unsupervised pre-training

Q1) Why does unsupervised pre-training make sense considering the image-label relation?

Q2) Is it true for every data-label relation?

Fine tune with backpropagation

1. How does backpropagation work on the RBM layers at the fine-tuning stage?

2. How do we choose the hyper-parameters: learning rate, number of epochs, etc.?

3. In some cases, if we have only very little labeled data, is it possible to overfit?

Instability in DBN

When training a DBN with sampling, as described in section 6.1 of Yoshua’s book, for each layer we generate a sample using CD, then use that sample as input to the next layer and repeat the process.

But CD is a “weak” form of Gibbs sampling, and we don’t really sample from the “true” distribution. So each layer takes as input the result of a biased sampling and generates another biased sample. I would suspect that such a chain of biased sampling would lead to a significant difference between the true distribution and the learned one (like error propagation in a chaotic system, as we have seen for recurrent neural nets).

Is that a real problem for DBNs? (Why?) If so, how do we handle it?

Building a DBN out of RBMs

Once you have trained a first RBM and want to learn another RBM on top of it, Hinton says that you can initialize your second RBM to be the inverse of the first RBM (its weight matrix is the transpose of the first RBM’s weight matrix). This way, he says, the second RBM is already a sensible model of the first RBM’s hidden layer.

My question is: what do you actually gain from doing this instead of starting your second RBM from a random initialization, perhaps with a different number of hidden units than the number of visible units in the first RBM? If you initialize the second RBM from the first one, it is probably already in a very good local minimum, so I imagine it would be pretty hard to get it to learn something different.

I may be wrong, but I feel that having the second RBM simply implement the inverse mapping of the first RBM is the very last thing you want. One of the motivations behind depth is often to be able to learn more abstract features. If the features in your second hidden layer are no more abstract than the visible units, it seems to defeat the purpose of depth.

Using a DBN as a pretrained MLN

Q1:

After training a Deep Belief Network using the approximate posteriors, you can use the weights of this network to initialize a multi-layer neural network. Does this work regardless of the type of neurons that you use in your MLN?

Q2:

What are the pros and cons of using mean-field computation rather than stochastic sampling when training a DBN?

Permutation Invariance

In the permutation-invariant MNIST task, the same random permutation is applied to the pixels in each image. What is the intuition behind this task? Why would we not want to exploit information about the spatial structure of things like hand-written digits?
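
For concreteness, the setup amounts to something like the following sketch (a toy array stands in for MNIST; the key point is that one fixed permutation is reused for every image, so pixel identities stay consistent while all 2-D neighbourhood structure is scrambled):

```python
# "Permutation-invariant" MNIST setup: one fixed random permutation of the 784 pixels,
# applied identically to every image, so models cannot exploit 2-D spatial structure.
import numpy as np

rng = np.random.RandomState(0)
X = rng.rand(100, 784)            # stand-in for flattened MNIST images
perm = rng.permutation(784)       # drawn once, shared by training and test sets
X_perm = X[:, perm]
print(X_perm.shape)               # (100, 784): same data, scrambled pixel order
```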

Plan for March 28’s class

Duties before class (please review well before the class):

  • View first 3 videos of lecture 14.
  • Read section 6.1 of  YB’s book on Learning deep architectures for AI (printer-friendly version).
  • Come up with AT LEAST ONE QUIZ QUESTION regarding the material you have read or seen. Post it to the site as New Post (not a quiz reply), with categories Mar28, quiz, and a tag identifying you (your name).
  • Please get started on your journal regarding the kaggle competition data on facial keypoints detection.

Planned for the class:

  • We will discuss the first quiz (for marks).
  • We will discuss questions you have raised in posts or comments.

Benchmarks

In Learning Deep Architectures for AI, you say that:

It is good that the field is moving towards more ambitious benchmarks

We have heard about MNIST, ImageNet or Caltech101 multiple times. What are the other successful datasets? What are the problems for which we do not have “real” training sets yet?

Deep network generalization and regularization

It is said in Learning deep architectures for AI that

[…] when the training set size is “small” (e.g., MNIST, with less than hundred thousand examples), although unsupervised pre-training brings improved test error, it tends to produce larger training error. On the other hand, for much larger training sets, with better initialization of the lower hidden layers, both training and generalization error can be made significantly lower when using unsupervised pre-training.

How can we explain this?

Autoencoders and Stochastic Neurons

Q1) Consider that you have a dataset of 2’s and 5’s from MNIST. How would you discriminate 2’s from 5’s using only one auto-encoder and no labels? (Use your imagination; there is a very simple way to do it.)

Q2) Given that auto-encoders can learn the implicit dimensionality of the data (a lower-dimensional manifold), can we use them as a compression algorithm? How, and what would be the necessary conditions to minimize the information loss / reconstruction error for the compression (e.g., tied weights, type of nonlinearity, sparsity, training objective, etc.)? Would disentangling factors of variation help for compression as well?

Q3) Sigmoid belief networks have an explaining-away effect. Which of the models we have seen so far have the same problem? What is their common characteristic?

Belief nets

1. Are there any other activation functions that can be used in belief nets?

2. Is there any mathematical proof that the wake-sleep algorithm gives an optimal result?

3. How do we initialize the two sets of weights?