Clarifications about weight initialization and input scaling

In his lecture on tricks for mini-batch gradient descent (lecture 6b), Hinton presents two ideas for improving the performance of this method:

  1. Set the weights randomly before training the network, making them proportional to the square root of the fan-in. My problem is that, as he mentioned, high fan-in units tend to saturate more easily if we start with bigger weights. So shouldn't the initialization be proportional to something like sqrt(1/fan-in) instead?
  2. Scale the inputs so that each component of the training set has variance 1. However, in his example, he scales his two inputs (0.1 and 10) by two different values. How do you decide which scale to use for each input? Is this the same idea as the standardization process we discussed?

1 Response to "Clarifications about weight initialization and input scaling"

  1. Geoffroy MOURET February 25, 2013 at 10:02

    Q1) Hinton actually meant setting them within a range proportional to sqrt(1/fan-in), as you suggest.
    Q2) Yes, it is the same idea as standardization: divide each input by its own standard deviation across the training set, so that every input ends up with unit variance.
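
    Both answers can be sketched in a few lines of NumPy (the layer sizes and the toy data matrix are my own illustration; the two feature values 0.1 and 10 echo Hinton's example):

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    # Q1: draw weights with standard deviation proportional to sqrt(1/fan_in),
    # so that units with many incoming connections do not start out saturated.
    fan_in, fan_out = 256, 128
    W = rng.standard_normal((fan_in, fan_out)) / np.sqrt(fan_in)

    # Q2: standardize each input feature with its own mean and standard
    # deviation computed across the training set (axis=0), so every feature
    # ends up with zero mean and unit variance, whatever its original scale.
    X = np.array([[0.1, 10.0],
                  [0.3, 30.0],
                  [0.2, 20.0]])
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)

    print(X_std.var(axis=0))  # each column now has variance 1
    ```

    Note that each feature gets its own scale factor, which is exactly why the inputs 0.1 and 10 in the lecture are divided by two different values.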
