**Q1**

Give some ways to exploit the **reuse** idea described in section 5.3 and explain why it is interesting in the context of learning with symbolic data.

**Q2**

In the video 4c *Another diversion: The softmax output function*, Hinton says that with the squared error measure we are depriving the network of the knowledge that the outputs should sum to 1. But isn't that constraint enforced by the softmax function rather than by the cross-entropy cost? Wouldn't this knowledge (summing to 1) still be present if one used the squared error measure on a softmax output (even if it makes less sense to do so)?

Besides this, what are the drawbacks of the squared error measure compared to cross-entropy?


Q1:

The units of the first hidden layer can share the same incoming weights. With symbolic data this is interesting because each symbol is then mapped to a single learned distributed representation that is reused wherever the symbol appears, instead of the network having to relearn a separate code for every input position.
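A minimal NumPy sketch of what this weight sharing could look like. The vocabulary size, layer width, and the two-slot setup are illustrative assumptions, not taken from the text:

```python
import numpy as np

rng = np.random.default_rng(0)

n_symbols = 24   # assumed size of the symbolic vocabulary
embed_dim = 6    # assumed width of the shared first hidden layer

# One weight matrix reused by every symbolic input slot:
# a one-hot input times W_shared is just a row lookup.
W_shared = rng.normal(scale=0.1, size=(n_symbols, embed_dim))

def encode(symbol_id):
    """Shared incoming weights map a symbol to its distributed code."""
    return W_shared[symbol_id]

# Two input slots (hypothetical names) reuse the same weights,
# so each symbol has one representation regardless of its position.
person, relation = 3, 7
hidden = np.concatenate([encode(person), encode(relation)])
print(hidden.shape)  # (12,) -- two reused 6-d codes
```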

Q2:

For the squared error, the saturation of neurons can cause a problem: with p = sigmoid(a), d/da (p − y)² = 2 (p − y) p (1 − p), so if p is near 0 or 1 the derivative is really small even when the error (p − y) is large.

But if we use a log probability (cross-entropy) cost, the p (1 − p) factor cancels out: dC/da = p − y, so we're still able to propagate a large gradient when the prediction is badly wrong.
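A quick NumPy check of the two gradients for a single sigmoid output (the value a = −6 and target y = 1 are just an illustration of a saturated, badly wrong unit):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

a, y = -6.0, 1.0
p = sigmoid(a)  # ~0.0025: saturated, and far from the target y = 1

# Squared error: d/da (p - y)^2 = 2 (p - y) p (1 - p)
grad_squared = 2 * (p - y) * p * (1 - p)

# Cross-entropy: d/da [-y log p - (1 - y) log(1 - p)] = p - y
grad_xent = p - y

print(f"p = {p:.4f}")
print(f"squared-error gradient: {grad_squared:+.6f}")  # ~ -0.0049
print(f"cross-entropy gradient: {grad_xent:+.6f}")     # ~ -0.9975
```

The squared-error gradient is roughly 200 times smaller here, even though the unit is almost maximally wrong, which is exactly the saturation problem described above.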