Posted to dev@mahout.apache.org by "Herta, Christian" <Ch...@htw-berlin.de> on 2012/02/06 15:38:43 UTC
Bug in Gradient Machine?
Hello,
yesterday I checked the code of the gradient machine to understand what's going
on there. I think I found a bug in the computation of the gradient (trunk):
In the comment it's written: "dy / dw is just w since y = x' * w + b."
This is wrong: dy/dw is x (ignoring the indices). The same mistake is made in the code.
See the corrected version below.
----
The gradient machine is a specialized version of a multi-layer perceptron (MLP).
In an MLP the gradient for computing the "weight change" for the output units is:
dE / dw_ij = dE / dz_i * dz_i / dw_ij with z_i = sum_j (w_ij * a_j)
here: i index of the output layer; j index of the hidden layer
(d stands for the partial derivatives)
here: z_i = a_i (no squashing in the output layer)
The special loss (cost) function here is E = 1 - a_g + a_b = 1 - z_g + z_b
with
g: index of the output unit with target value +1 (positive class)
b: index of a random output unit with target value 0
=>
dE / dw_gj = dE/dz_g * dz_g/dw_gj = -1 * a_j (a_j: activity of hidden unit j)
dE / dw_bj = dE/dz_b * dz_b/dw_bj = +1 * a_j
That's the same result the comment would give if it were correct:
dy/dw = x (x is here the activation of the hidden unit), times (-1) for the weights to
the output unit with target value +1.
In neural network implementations it's common to compute the gradient
numerically to test the implementation. This can be done by:
dE/dw_ij = (E(w_ij + epsilon) - E(w_ij - epsilon)) / (2 * epsilon)
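As a quick illustration of such a check (a minimal, self-contained Java sketch, not
Mahout code; all names below are made up for the example), the following verifies
numerically that for y = x' * w + b the derivative dy/dw_j is x_j, not w_j:

import java.util.Random;

public class DyDwCheck {

  // y = x' * w + b
  static double y(double[] x, double[] w, double b) {
    double sum = b;
    for (int j = 0; j < x.length; j++) {
      sum += x[j] * w[j];
    }
    return sum;
  }

  public static void main(String[] args) {
    Random gen = new Random(42);
    int n = 5;
    double[] x = new double[n];
    double[] w = new double[n];
    for (int j = 0; j < n; j++) {
      x[j] = gen.nextGaussian();
      w[j] = gen.nextGaussian();
    }
    double b = gen.nextGaussian();
    double eps = 1.0e-6;
    for (int j = 0; j < n; j++) {
      // central difference: (y(w_j + eps) - y(w_j - eps)) / (2 * eps)
      w[j] += eps;
      double yPlus = y(x, w, b);
      w[j] -= 2 * eps;
      double yMinus = y(x, w, b);
      w[j] += eps;  // restore the original weight
      double numeric = (yPlus - yMinus) / (2 * eps);
      System.out.printf("j=%d  numeric dy/dw_j=%.6f  x_j=%.6f  w_j=%.6f%n",
          j, numeric, x[j], w[j]);
    }
  }
}

The numeric derivative agrees with x_j up to floating point error and in general
differs from w_j.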
Cheers
Christian
-----------------------------------
// Note from the loss above, the gradient dloss/dy (y being the output for the label)
// is -1 for good and +1 for bad.
// dy / dw is just z since y = z' * w + b.
// Hence by the chain rule, dloss / dw_ij = dloss / dy_i * dy_i / dw_ij = -z_j (for i = g).
// For the regularization part, 0.5 * lambda * w' w, the gradient is lambda * w.
// dy / db = 1.
// gradient descent update of the weights to the
// positive (should-be) output-unit
Vector gradGood = hiddenActivations.clone();
gradGood.assign(Functions.NEGATE);
gradGood.assign(Functions.mult(-learningRate * (1.0 - regularization)));
outputWeights[good].assign(gradGood, Functions.PLUS);
outputBias.setQuick(good, outputBias.get(good) + learningRate);
// gradient descent update of the weights to the
// (random) negative (should-be) output-unit
Vector gradBad = hiddenActivations.clone();
gradBad.assign(Functions.mult(-learningRate * (1.0 + regularization)));
outputWeights[bad].assign(gradBad, Functions.PLUS);
outputBias.setQuick(bad, outputBias.get(bad) - learningRate);
// backpropagation from output to hidden layer for
// computing the deltas (errors) of the hidden units
Vector propHidden = outputWeights[good].clone();
propHidden.assign(Functions.NEGATE);
propHidden.assign(outputWeights[bad], Functions.PLUS);
// Gradient of sigmoid (logistic function) is s * (1 - s).
Vector gradSig = hiddenActivations.clone();
gradSig.assign(Functions.SIGMOIDGRADIENT);
// Multiply by the change caused by the ranking loss.
for (int i = 0; i < numHidden; i++) {
  gradSig.setQuick(i, gradSig.get(i) * propHidden.get(i));
}
// gradSig now holds the deltas (errors) of the hidden units.
// The weight change of w_ij should be proportional
// to delta_i * x_j + regularization * w_ij.
for (int i = 0; i < numHidden; i++) {
  for (int j = 0; j < numFeatures; j++) {
    double v = hiddenWeights[i].get(j);
    v -= learningRate * (gradSig.get(i) + regularization * v);
    hiddenWeights[i].setQuick(j, v);
  }
}
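The same kind of numerical check can be applied to the ranking loss itself. Below is a
rough, self-contained sketch (plain Java, not the Mahout implementation; all names are
made up for the example). It fixes the hidden activations a_j, computes
E = 1 - z_g + z_b with linear output units, and compares the central-difference
gradient of the output weights against the values derived above: -a_j for the good
unit and +a_j for the bad one.

import java.util.Random;

public class RankingLossGradientCheck {

  // Linear output units: z_i = sum_j (w_ij * a_j) + b_i (no squashing in the output layer).
  static double[] outputs(double[][] w, double[] bias, double[] a) {
    double[] z = new double[w.length];
    for (int i = 0; i < w.length; i++) {
      z[i] = bias[i];
      for (int j = 0; j < a.length; j++) {
        z[i] += w[i][j] * a[j];
      }
    }
    return z;
  }

  // Ranking loss E = 1 - z_g + z_b for the good unit g and a (random) bad unit b.
  static double loss(double[][] w, double[] bias, double[] a, int good, int bad) {
    double[] z = outputs(w, bias, a);
    return 1.0 - z[good] + z[bad];
  }

  public static void main(String[] args) {
    Random gen = new Random(1);
    int numHidden = 4;
    int numOutput = 3;
    double[] a = new double[numHidden];              // hidden activations, held fixed here
    double[][] w = new double[numOutput][numHidden];
    double[] bias = new double[numOutput];
    for (int j = 0; j < numHidden; j++) {
      a[j] = gen.nextDouble();
    }
    for (int i = 0; i < numOutput; i++) {
      bias[i] = gen.nextGaussian();
      for (int j = 0; j < numHidden; j++) {
        w[i][j] = gen.nextGaussian();
      }
    }
    int good = 0;
    int bad = 2;
    double eps = 1.0e-6;
    for (int j = 0; j < numHidden; j++) {
      // dE/dw_gj via central differences; analytically it is -a_j.
      w[good][j] += eps;
      double plus = loss(w, bias, a, good, bad);
      w[good][j] -= 2 * eps;
      double minus = loss(w, bias, a, good, bad);
      w[good][j] += eps;
      double numericGood = (plus - minus) / (2 * eps);
      // dE/dw_bj via central differences; analytically it is +a_j.
      w[bad][j] += eps;
      plus = loss(w, bias, a, good, bad);
      w[bad][j] -= 2 * eps;
      minus = loss(w, bias, a, good, bad);
      w[bad][j] += eps;
      double numericBad = (plus - minus) / (2 * eps);
      System.out.printf("j=%d  dE/dw_gj=%.6f (expect %.6f)  dE/dw_bj=%.6f (expect %.6f)%n",
          j, numericGood, -a[j], numericBad, a[j]);
    }
  }
}

A check along these lines, run against the actual GradientMachine update, would make
the w-vs-x question above easy to settle.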
Prof. Dr. Christian Herta
HTW Berlin
Wilhelminenhofstraße 75A,
12459 Berlin, Gebäude C, Raum: 613
Email: christian.herta@htw-berlin.de
Telefon: (030) 5019-3498
Fax: (030) 5019-483498