In his lecture concerning tricks for mini-batch gradient descent (6b), Hinton presents two ideas to improve the performance of this method:

- Set the weights randomly before training the network and make them proportional to the square root of the fan-in. My problem is that, as he mentioned, high fan-in units will tend to saturate more easily if we start with bigger weights. So, shouldn't the initialization be proportional to something like sqrt(1/fan-in)?
- Scale the inputs so that the variance of each input over the training set is 1. However, in his example, he scales his two inputs (0.1 and 10) by two different values. How do you decide which scale to use for each input? Is this the same idea as what we talked about for the standardization process?


Q1) Yes — Hinton actually means the initial weights should be proportional to sqrt(1/fan-in), i.e. 1/sqrt(fan-in), so that high fan-in units start with *smaller* weights and are less likely to saturate.
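As a minimal NumPy sketch (not from the lecture itself), the idea is that dividing by sqrt(fan-in) keeps each unit's pre-activation variance roughly constant regardless of how many connections feed into it:

```python
import numpy as np

def init_weights(fan_in, fan_out, seed=0):
    # Draw standard-normal weights, then shrink by 1/sqrt(fan_in) so that
    # the sum of fan_in weighted inputs has roughly unit variance --
    # high fan-in units get proportionally smaller initial weights.
    rng = np.random.default_rng(seed)
    return rng.standard_normal((fan_in, fan_out)) / np.sqrt(fan_in)

# A unit with fan-in 1000 starts with much smaller weights than one with fan-in 10,
# so its weighted input doesn't blow up and saturate the activation function.
W_big = init_weights(1000, 10)
W_small = init_weights(10, 10)
```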

Q2) Scale each input by its own standard deviation, computed across the training set. That is why the two inputs get two different scale factors, and it is the same idea as standardization.