Posted to user@mahout.apache.org by Stanley Xu <we...@gmail.com> on 2011/05/05 11:16:49 UTC

Is there any way to reduce the cost of Mapper and Reducer setup and cleanup in an iterative MapReduce chain?

Dear All,

Our team is trying to implement a parallelized LDA with Gibbs sampling. We
are using the algorithm described by plda, http://code.google.com/p/plda/

The problem is that, with the Map-Reduce approach the paper describes, we
need to run a separate MapReduce job for every Gibbs sampling iteration,
and in our tests it typically takes 1000 - 2000 iterations to converge on
our data. As we know, there is a fixed cost to set up and clean up the
mappers and reducers for every job; on our cluster this overhead is about
40 seconds per iteration, so 1000 iterations means almost 12 hours of
setup and cleanup alone.
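To make the overhead concrete, here is a minimal sketch of the kind of
driver loop we are running (the mapper, reducer, and path names below are
made up for illustration, not our actual code). Every pass through the loop
submits a fresh job, so the task setup and cleanup cost is paid again on
each iteration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LdaGibbsDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    int numIterations = 1000;  // typically 1000 - 2000 iterations to converge
    for (int i = 0; i < numIterations; i++) {
      Job job = new Job(conf, "lda-gibbs-iteration-" + i);
      job.setJarByClass(LdaGibbsDriver.class);
      job.setMapperClass(GibbsSamplingMapper.class);   // hypothetical classes
      job.setReducerClass(TopicCountReducer.class);
      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(IntWritable.class);
      // each iteration reads the previous model and writes the next one
      FileInputFormat.addInputPath(job, new Path("lda/model-" + i));
      FileOutputFormat.setOutputPath(job, new Path("lda/model-" + (i + 1)));
      // every call pays the full job submission and task setup/cleanup cost,
      // roughly 40 seconds per iteration on our cluster
      if (!job.waitForCompletion(true)) {
        throw new RuntimeException("iteration " + i + " failed");
      }
    }
  }
}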

I am wondering if there is a way to reduce the cost of mapper/reducer
setup and cleanup, since I would prefer each mapper to read its local data
once and keep updating that local data inside the same mapper process
across iterations. The only other update a mapper needs comes from the
reducers, and that is a pretty small amount of data compared to the whole
dataset.
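For illustration, this is roughly the mapper-side pattern I have in mind,
written as a hypothetical sketch (the model file format, the
"lda.model.path" property, and the class itself are invented for this
example). As far as I know, stock Hadoop will only reuse a task JVM within
a single job (mapred.job.reuse.jvm.num.tasks), not across the separate jobs
of an iterative chain, which is exactly the gap I am hitting:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class GibbsSamplingMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  // The big word-topic count table; it survives from one task to the next
  // only if both tasks happen to run in the same reused JVM.
  private static Map<String, int[]> wordTopicCounts;

  @Override
  protected void setup(Context context) throws IOException {
    if (wordTopicCounts == null) {
      // expensive part: load the full model written by the previous iteration
      wordTopicCounts = new HashMap<String, int[]>();
      Configuration conf = context.getConfiguration();
      Path model = new Path(conf.get("lda.model.path"));
      FileSystem fs = FileSystem.get(conf);
      BufferedReader in =
          new BufferedReader(new InputStreamReader(fs.open(model)));
      String line;
      while ((line = in.readLine()) != null) {
        // assumed line format: word <tab> count_topic_0 <tab> count_topic_1 ...
        String[] parts = line.split("\t");
        int[] counts = new int[parts.length - 1];
        for (int t = 0; t < counts.length; t++) {
          counts[t] = Integer.parseInt(parts[t + 1]);
        }
        wordTopicCounts.put(parts[0], counts);
      }
      in.close();
    }
    // cheap part: apply the small per-iteration delta coming from the reducers
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // Gibbs resampling of this document against wordTopicCounts would go here,
    // emitting only the small (word, topic-count delta) updates to the reducers.
  }
}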

Is there any approach I could try (including changing part of Hadoop's
source code)?


Best wishes,
Stanley Xu

Re: Is there any way to reduce the cost of Mapper and Reducer setup and cleanup in an iterative MapReduce chain?

Posted by Stanley Xu <we...@gmail.com>.
Thanks a lot, Ted. Checking HaLoop and Plume now. I can always get the
answer from you. :-)


Re: Is there any way to reduce the cost of Mapper and Reducer setup and cleanup in an iterative MapReduce chain?

Posted by Ted Dunning <te...@gmail.com>.
Stanley,

The short answer is that this is a real problem.

Try this:

*Spark: Cluster Computing with Working Sets.* Matei Zaharia, Mosharaf
Chowdhury, Michael J. Franklin, Scott Shenker, Ion Stoica, in HotCloud 2010,
June 2010.

Or this http://www.iterativemapreduce.org/

http://code.google.com/p/haloop/

You may be interested in experimenting with MapReduce 2.0. That allows more
flexibility in the execution model:

http://developer.yahoo.com/blogs/hadoop/posts/2011/03/mapreduce-nextgen-scheduler/

Systems like FlumeJava (and Plume, my open-source, incomplete clone of it)
may help with flexibility:

http://www.deepdyve.com/lp/association-for-computing-machinery/flumejava-easy-efficient-data-parallel-pipelines-xtUvap2t1I

https://github.com/tdunning/Plume/commit/a5a10feaa068b33b1d929c332e4614aba50dd39a
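
To make the Spark suggestion concrete, here is a minimal sketch using
Spark's Java API (a later binding than the 2010 paper, whose examples are
in Scala; the HDFS path and the trivial per-iteration computation are just
placeholders). The point is that the working set is loaded and cached in
memory once and then reused on every iteration, instead of being re-read by
a fresh set of map tasks each time:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.Function2;

public class IterativeCachingSketch {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("iterative-caching-sketch");
    JavaSparkContext sc = new JavaSparkContext(conf);

    // Load the corpus once and keep it in memory: this is the "working set".
    JavaRDD<String> documents = sc.textFile("hdfs:///lda/corpus").cache();

    for (int i = 0; i < 1000; i++) {
      // Each pass runs over the cached in-memory partitions; there is no
      // per-iteration job submission, task JVM startup, or HDFS re-read.
      long tokens = documents.map(new Function<String, Long>() {
        public Long call(String doc) {
          return (long) doc.split("\\s+").length;
        }
      }).reduce(new Function2<Long, Long, Long>() {
        public Long call(Long a, Long b) {
          return a + b;
        }
      });
      System.out.println("iteration " + i + ": " + tokens + " tokens");
    }
    sc.stop();
  }
}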

