Posted to dev@mahout.apache.org by Li Li <fa...@gmail.com> on 2014/04/01 05:48:12 UTC

Re: how to implement parallel sgd in map reduce?

But I can't control the InputSplit.
What I need is:
1. split the input data into small blocks whose size is defined by me
(e.g. the maximum number of training instances one machine can deal
with).
2. randomly dispatch these small blocks to the machines (maybe HDFS
can do this for me).
3. each mapper deals with one small block (a driver sketch is below).
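
Something like the driver below is what I am after. NLineInputFormat
seems to bound the number of training lines per split, which would
cover step 1; SgdMapper and WeightAveragingReducer are just
placeholder names for pieces discussed later in this thread.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ParallelSgdDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "parallel-sgd");
    job.setJarByClass(ParallelSgdDriver.class);

    // Step 1: cap the number of training instances per split so each
    // mapper gets a block it can hold in memory.
    job.setInputFormatClass(NLineInputFormat.class);
    NLineInputFormat.setNumLinesPerSplit(job, 100000);
    NLineInputFormat.addInputPath(job, new Path(args[0]));

    // Step 3: one mapper per block runs local SGD; a single reducer
    // then averages the per-block weight vectors.
    job.setMapperClass(SgdMapper.class);
    job.setReducerClass(WeightAveragingReducer.class);
    job.setNumReduceTasks(1);

    // Weight vectors are shipped as text here just to keep the sketch simple.
    job.setOutputKeyClass(NullWritable.class);
    job.setOutputValueClass(Text.class);
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}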


On Sat, Mar 29, 2014 at 3:55 AM, Ted Dunning <te...@gmail.com> wrote:
> Yes. That is feasible.
>
> I think that you would have better luck with something like asynchronous
> SGD as described here:
>
>    http://machinelearning.wustl.edu/mlpapers/paper_files/NIPS2012_0598.pdf
>
> and here
>
>    http://www.cs.toronto.edu/~fritz/absps/georgerectified.pdf
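>
> To give the flavor of the asynchronous approach: many workers read the
> current weights, compute a gradient on their own examples, and write
> updates back without waiting on each other.  A toy single-process
> sketch of that update pattern (illustrative only, not Mahout code;
> squared loss on synthetic data just to make it concrete):
>
> import java.util.Random;
>
> public class AsyncSgdToy {
>   static final int DIM = 10;
>   static final double[] w = new double[DIM];  // shared, updated racily
>
>   public static void main(String[] args) throws InterruptedException {
>     Thread[] workers = new Thread[4];
>     for (int t = 0; t < workers.length; t++) {
>       final long seed = t;
>       workers[t] = new Thread(new Runnable() {
>         public void run() {
>           Random rnd = new Random(seed);
>           double eta = 0.01;
>           for (int step = 0; step < 100000; step++) {
>             // Fake an example (x, y); real workers would read their
>             // own shard of the training data.
>             double[] x = new double[DIM];
>             double y = 0;
>             for (int j = 0; j < DIM; j++) {
>               x[j] = rnd.nextGaussian();
>               y += 0.5 * x[j];            // true weights are all 0.5
>             }
>             // Predict with whatever weights are visible right now.
>             double pred = 0;
>             for (int j = 0; j < DIM; j++) pred += w[j] * x[j];
>             double err = pred - y;
>             // Unsynchronized update: concurrent updates can be lost,
>             // which this style of algorithm tolerates.
>             for (int j = 0; j < DIM; j++) w[j] -= eta * err * x[j];
>           }
>         }
>       });
>       workers[t].start();
>     }
>     for (Thread worker : workers) worker.join();
>     System.out.printf("w[0] = %.3f (target 0.5)%n", w[0]);
>   }
> }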
>
> It would also be good to consider looking at some of the new Scala work in
> Mahout.  Map-reduce is a difficult medium for this art.
>
> On Fri, Mar 28, 2014 at 5:21 AM, Li Li <fa...@gmail.com> wrote:
>
>> I have read "Parallelized stochastic gradient descent" (2010) by
>> Martin A. Zinkevich et al.
>> the parallel sgd is very simple:
>>
>> Define T = ⌊m/k⌋
>> Randomly partition the examples, giving T examples to each machine.
>> for all i ∈ {1, ..., k} parallel do
>>     Randomly shuffle the data on machine i.
>>     Initialize w_{i,0} = 0.
>>     for all t ∈ {1, ..., T} do
>>         Get the t-th example on the i-th machine (this machine), c_{i,t}
>>         w_{i,t} ← w_{i,t−1} − η ∂_w c_{i,t}(w_{i,t−1})
>>     end for
>> end for
>> Aggregate from all computers v = (1/k) Σ_{i=1}^{k} w_{i,T} and return v.
>>
>> It assumes that each machine does SGD optimization locally, after
>> randomly shuffling the data on that machine.
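>>
>> If I read the pseudocode right, the per-machine part is just plain
>> sequential SGD over the local block, roughly like this (squared loss
>> is picked only to make the gradient concrete; the paper's c_{i,t} can
>> be any suitable loss):
>>
>> import java.util.Collections;
>> import java.util.List;
>>
>> public class LocalSgd {
>>   // One machine's pass over its block, as in the inner loop above.
>>   public static double[] run(List<Example> block, int dim, double eta) {
>>     Collections.shuffle(block);     // "Randomly shuffle the data"
>>     double[] w = new double[dim];   // "Initialize w_{i,0} = 0"
>>     for (Example ex : block) {      // t = 1, ..., T
>>       double pred = 0;
>>       for (int j = 0; j < dim; j++) pred += w[j] * ex.x[j];
>>       double err = pred - ex.y;     // gradient factor for squared loss
>>       for (int j = 0; j < dim; j++) {
>>         w[j] -= eta * err * ex.x[j];  // w_{i,t} ← w_{i,t−1} − η ∂_w c
>>       }
>>     }
>>     return w;                       // this machine's w_{i,T}, to be averaged
>>   }
>>
>>   public static class Example {     // assumed simple (features, label) holder
>>     public double[] x;
>>     public double y;
>>   }
>> }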
>>
>> So it seems each machine has to load all of its local data into
>> memory and shuffle it to perform SGD, and then the resulting weight
>> vectors are averaged.
>>
>> How can I do this in Hadoop?
>>
>> 1. how to control the Hadoop input split size?
>>       Should I let Hadoop do this for me? Each split must be small
>> enough to fit into memory.
>> 2. do it as a batch?
>>       in the Mapper's setup(), construct a data structure to store
>> all the data of this split;
>>       in map(), just add each record to that data structure;
>>       in cleanup(), do the real SGD work (see the sketch below).
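>>
>> i.e. a mapper shaped something like this (LocalSgd and Example are
>> the illustrative classes from above; the input format, DIM, and the
>> text encoding of the weights are all just assumptions):
>>
>> import java.io.IOException;
>> import java.util.ArrayList;
>> import java.util.List;
>> import org.apache.hadoop.io.LongWritable;
>> import org.apache.hadoop.io.NullWritable;
>> import org.apache.hadoop.io.Text;
>> import org.apache.hadoop.mapreduce.Mapper;
>>
>> // Buffers the whole split in memory, then runs SGD once at the end.
>> public class SgdMapper
>>     extends Mapper<LongWritable, Text, NullWritable, Text> {
>>
>>   private static final int DIM = 10;  // assumed feature count
>>   private final List<LocalSgd.Example> block =
>>       new ArrayList<LocalSgd.Example>();
>>
>>   @Override
>>   protected void map(LongWritable key, Text value, Context context) {
>>     block.add(parse(value.toString()));  // just buffer, no learning yet
>>   }
>>
>>   @Override
>>   protected void cleanup(Context context)
>>       throws IOException, InterruptedException {
>>     double[] w = LocalSgd.run(block, DIM, 0.01);  // the real SGD work
>>     // Emit under a single key so one reducer sees every block's
>>     // weight vector and can average them.
>>     context.write(NullWritable.get(), new Text(serialize(w)));
>>   }
>>
>>   // Assumed line format: "label f1 f2 ... f10", whitespace-separated.
>>   private static LocalSgd.Example parse(String line) {
>>     String[] parts = line.trim().split("\\s+");
>>     LocalSgd.Example ex = new LocalSgd.Example();
>>     ex.y = Double.parseDouble(parts[0]);
>>     ex.x = new double[DIM];
>>     for (int j = 0; j < DIM; j++) {
>>       ex.x[j] = Double.parseDouble(parts[j + 1]);
>>     }
>>     return ex;
>>   }
>>
>>   private static String serialize(double[] w) {
>>     StringBuilder sb = new StringBuilder();
>>     for (double v : w) sb.append(v).append(' ');
>>     return sb.toString().trim();
>>   }
>> }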
>>
>> Is my method feasible?
>>