Posted to dev@mahout.apache.org by Sergey Chickin <co...@solo.by> on 2008/03/31 22:30:41 UTC

NN implementation question

In what way do we sum the gradients? Because if we just sum them as 
vectors, we won't get the direction of cost-function minimization. That 
seems to be close to true only if we choose a really small learning step, 
but in that case we lose all the speedup from map-reduce. To see that the 
sum of partial gradients doesn't give the right direction, just look at 
the paraboloid z = x^2 + y^2: take two symmetric points on it, (1, 1) and 
(-1, -1) for example, and the summed gradient will always be 0...
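
Spelling out the arithmetic in that example: the gradient of z = x^2 + y^2 is 
(2x, 2y), so at (1, 1) it is (2, 2) and at (-1, -1) it is (-2, -2), and the two 
partial gradients sum to (0, 0).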


Re: NN implementation question

Posted by Sergey Chickin <co...@solo.by>.
Ted Dunning wrote:
> I think that you have to have separate MR phases for forward and backward
> propagation.
>
> I don't know if you can combine all backward phases into a single MR.  Keeping
> them separate seems enough simpler that I would start that way.  I do think that
> you can combine weight updates and error back-prop for a single layer into a
> single MR by tagging the output records.
>
> On 3/31/08 1:57 PM, "Sergey Chickin" <co...@solo.by> wrote:
>
>   
>> So we spawn a map-reduce job for each layer of the network in the forward
>> pass and in backpropagation? If so, everything is clear and easy. But what I
>> understood from the NIPS paper was something like: we map over the training
>> examples, calculate the gradient and weight increments for each of them, and
>> sum them in the reduce stage. Anyway, I really wonder whether this approach
>> gives a speedup (don't forget that there is a cost to job creation, etc., and
>> the NIPS paper was evaluated on multicore).
>>     
>
>
>
>   
Thanks, it's now clear to me


Re: NN implementation question

Posted by Ted Dunning <td...@veoh.com>.
I think that you have to have separate MR phases for forward and backward
propagation.

I don't know if you can combine all backward phases into a single MR.  Keeping
them separate seems enough simpler that I would start that way.  I do think that
you can combine weight updates and error back-prop for a single layer into a
single MR by tagging the output records.
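
As a rough sketch of the tagging idea (plain Java with illustrative names; not
Mahout or Hadoop API): the map step for one layer emits both weight-increment
records and back-propagated-error records, distinguished by a tag in the key,
so a single combiner/reducer can sum each kind separately.

    import java.util.HashMap;
    import java.util.Map;

    // Sketch only: the "tag the output records" idea for a single layer.
    // Keys carry a tag so weight increments and back-propagated errors can
    // share one map-reduce pass; the reduce is just a per-key sum.
    public class TaggedBackpropSketch {

      // One training case at one layer.  errorFromAbove is assumed to already
      // include the activation derivative (the usual delta).
      static void map(double[] layerInput, double[] errorFromAbove,
                      double[][] weights, Map<String, Double> out) {
        int nIn = layerInput.length;
        int nOut = errorFromAbove.length;
        for (int j = 0; j < nOut; j++) {
          for (int i = 0; i < nIn; i++) {
            // weight-increment record, tagged DELTA
            emit(out, "DELTA:" + i + ":" + j, errorFromAbove[j] * layerInput[i]);
          }
        }
        for (int i = 0; i < nIn; i++) {
          double backError = 0.0;
          for (int j = 0; j < nOut; j++) {
            backError += weights[i][j] * errorFromAbove[j];
          }
          // error record for the layer below, tagged ERROR
          emit(out, "ERROR:" + i, backError);
        }
      }

      // Stand-in for the combiner/reducer: records with the same tagged key
      // are simply summed.
      static void emit(Map<String, Double> out, String taggedKey, double value) {
        out.merge(taggedKey, value, Double::sum);
      }

      public static void main(String[] args) {
        Map<String, Double> sums = new HashMap<>();
        double[][] w = { {0.1, -0.2}, {0.3, 0.4} };   // 2 inputs x 2 outputs
        map(new double[] {1.0, 0.5}, new double[] {0.2, -0.1}, w, sums);
        map(new double[] {0.0, 1.0}, new double[] {-0.3, 0.4}, w, sums);
        sums.forEach((k, v) -> System.out.println(k + " -> " + v));
      }
    }

In a real job the tag in the summed key would decide whether the value updates
a weight or feeds the back-prop phase for the next layer down.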

On 3/31/08 1:57 PM, "Sergey Chickin" <co...@solo.by> wrote:

> So we spawn a map-reduce job for each layer of the network in the forward
> pass and in backpropagation? If so, everything is clear and easy. But what I
> understood from the NIPS paper was something like: we map over the training
> examples, calculate the gradient and weight increments for each of them, and
> sum them in the reduce stage. Anyway, I really wonder whether this approach
> gives a speedup (don't forget that there is a cost to job creation, etc., and
> the NIPS paper was evaluated on multicore).


Re: NN implementation question

Posted by Sergey Chickin <co...@solo.by>.
Ted Dunning wrote:
> Can you give a bit more context to your question?
>
> If you are talking about neural network training, I don't see where the
> problem is.  It is easiest if you run one map-reduce for computing forward
> values and errors, then another map-reduce for each level of
> back-propagation.  In the back-prop phases, if you map over input cases, you
> can compute the derivative, the output weight increments, and the
> back-propagated error in the map, and sum them in the combiner/reducer.  The
> summed weight corrections could be added to the weights in conventional
> code.  Regularization in the form of weight decay or early stopping is
> easily incorporated into this framework.
>
> Given this structure, I don't see how your comment applies.  It seems that
> it is always possible for different inputs to give the model contradictory
> impetus, if only by having completely contradictory target variables for the
> same input variables.
>
> So can you expand a bit on what is worrying you?
>
>
> On 3/31/08 1:30 PM, "Sergey Chickin" <co...@solo.by> wrote:
>
>   
>> In what way do we sum the gradients? Because if we just sum them as
>> vectors, we won't get the direction of cost-function minimization. That
>> seems to be close to true only if we choose a really small learning step,
>> but in that case we lose all the speedup from map-reduce. To see that the
>> sum of partial gradients doesn't give the right direction, just look at
>> the paraboloid z = x^2 + y^2: take two symmetric points on it, (1, 1) and
>> (-1, -1) for example, and the summed gradient will always be 0...
>>
>>     
>
>
>
>   
So we spawn a map-reduce job for each layer of the network in the forward 
pass and in backpropagation? If so, everything is clear and easy. But what I 
understood from the NIPS paper was something like: we map over the training 
examples, calculate the gradient and weight increments for each of them, and 
sum them in the reduce stage. Anyway, I really wonder whether this approach 
gives a speedup (don't forget that there is a cost to job creation, etc., and 
the NIPS paper was evaluated on multicore).
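
For contrast, a minimal sketch of the "map over the examples, sum the gradients
in the reduce stage" reading of the NIPS paper (plain Java, a single linear unit
with squared error; names and setup are illustrative only, not Mahout code):

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    // Sketch of "map over training examples, sum the per-example gradients in
    // the reduce stage, then apply one weight update".  A single linear unit
    // with squared error keeps the example small.
    public class GradientSumSketch {

      // map(): gradient of 0.5 * (w.x - y)^2 for one example, taken at the
      // current weights (the same weights for every example in the batch).
      static double[] mapExample(double[] w, double[] x, double y) {
        double pred = 0.0;
        for (int i = 0; i < w.length; i++) pred += w[i] * x[i];
        double err = pred - y;
        double[] grad = new double[w.length];
        for (int i = 0; i < w.length; i++) grad[i] = err * x[i];
        return grad;
      }

      // reduce(): element-wise sum of the per-example gradients.
      static double[] reduceSum(List<double[]> grads, int dim) {
        double[] sum = new double[dim];
        for (double[] g : grads)
          for (int i = 0; i < dim; i++) sum[i] += g[i];
        return sum;
      }

      public static void main(String[] args) {
        double[] w = {0.0, 0.0};
        double eta = 0.1;                               // learning step
        double[][] xs = { {1.0, 2.0}, {2.0, 1.0} };
        double[] ys = {1.0, 0.0};

        // One "job": map over the examples, reduce, single weight update.
        List<double[]> grads = new ArrayList<>();
        for (int n = 0; n < xs.length; n++) grads.add(mapExample(w, xs[n], ys[n]));
        double[] batchGrad = reduceSum(grads, w.length);
        for (int i = 0; i < w.length; i++) w[i] -= eta * batchGrad[i];

        System.out.println(Arrays.toString(w));
      }
    }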

Re: NN implementation question

Posted by Ted Dunning <td...@veoh.com>.
Can you give a bit more context to your question?

If you are talking about neural network training, I don't see where the
problem is.  It is easiest if you run one map-reduce for computing forward
values and errors, then another map-reduce for each level of
back-propagation.  In the back-prop phases, if you map over input cases, you
can compute the derivative, the output weight increments, and the
back-propagated error in the map, and sum them in the combiner/reducer.  The
summed weight corrections could be added to the weights in conventional
code.  Regularization in the form of weight decay or early stopping is
easily incorporated into this framework.
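
A minimal sketch of that last step, i.e. the "conventional code" weight update
with weight decay (eta and lambda are illustrative names for the learning rate
and decay coefficient; this is plain Java, not Mahout API):

    import java.util.Arrays;

    // Sketch of the non-MR step: add the summed weight corrections from the
    // reducer to the weights, with weight decay as the regularizer.  The
    // corrections are assumed to already point in the descent direction.
    public class WeightUpdateSketch {

      static void applyUpdate(double[][] w, double[][] summedCorrection,
                              double eta, double lambda) {
        for (int i = 0; i < w.length; i++) {
          for (int j = 0; j < w[i].length; j++) {
            // learning step on the summed corrections, plus weight decay
            w[i][j] += eta * summedCorrection[i][j] - eta * lambda * w[i][j];
          }
        }
      }

      public static void main(String[] args) {
        double[][] w = { {0.5, -0.5}, {0.25, 0.75} };
        double[][] corr = { {0.1, 0.0}, {-0.2, 0.05} };   // pretend reducer output
        applyUpdate(w, corr, 0.1, 0.01);
        System.out.println(Arrays.deepToString(w));
      }
    }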

Given this structure, I don't see how your comment applies.  It seems that
it is always possible for different inputs to give the model contradictory
impetus, if only by having completely contradictory target variables for the
same input variables.

So can you expand a bit on what is worrying you?


On 3/31/08 1:30 PM, "Sergey Chickin" <co...@solo.by> wrote:

> In what way do we sum the gradients? Because if we just sum them as
> vectors, we won't get the direction of cost-function minimization. That
> seems to be close to true only if we choose a really small learning step,
> but in that case we lose all the speedup from map-reduce. To see that the
> sum of partial gradients doesn't give the right direction, just look at
> the paraboloid z = x^2 + y^2: take two symmetric points on it, (1, 1) and
> (-1, -1) for example, and the summed gradient will always be 0...
>