Posted to user@mahout.apache.org by Sebastien Bratieres <sb...@cam.ac.uk> on 2010/01/27 01:12:55 UTC

iterative MapReduce for Gibbs sampling

Hi all,

I'm working on a machine learning project with Hadoop, in which I use the
Mahout matrix package.

My application is effectively an instance of Gibbs sampling, and therefore
iterative. One iteration consists of 9 MR jobs, with some dependencies
between them and some code executed between them, and I need to iterate
thousands of times. At present one iteration takes around 7 minutes (I run
everything on Amazon Elastic MapReduce), which makes thousands of iterations
prohibitive. The bulk of the time is eaten up by Hadoop overhead, not by
actual number-crunching. Of the 9 MR jobs comprising an iteration, 6 map over
my training data, which consists of around 50,000 Wall Street Journal
sentences, i.e. a rather small dataset; 2 map over the cells of a 200*200
matrix, and the last one over 50,000 words.

I'm aware that, generally speaking, Hadoop might not be well suited to this
sort of job because of the overhead involved. What better options come to
mind for distributing an iterative computation (with "mainstream", possibly
open-source, software)?

Here are the optimizations I am thinking of in my present setting (rough
sketches of several of these follow the list):
- run some jobs in parallel using JobControl. One issue here is that I don't
know how to have the code in between MR jobs executed by the JobControl
runner, and I don't want to create fake MR jobs because that would only add
more overhead. Is there a solution you know of? (See the first sketch below.)
- avoid constructing a new Writable every time I need to collect one (tip 6
from
http://www.cloudera.com/blog/2009/12/17/7-tips-for-improving-mapreduce-performance/
; sketched below)
- using the Mahout matrix package, I need to change the cardinality of
matrices and vectors (e.g. by appending or removing an element or row). I'd
like to do this in place instead of creating a new 200*200 matrix; is there
some way to do that? (See the view sketch below.)
- run the garbage collector often, to avoid running low on memory, which
prevents the Hadoop shuffle from running in memory and makes it spill to disk
- use map-only jobs where possible: instead of relying on the default
IdentityReducer, set conf.setNumReduceTasks(0) so the shuffle/sort is skipped
entirely (my sentence data doesn't need merging/sorting; sketched below)
- JVM reuse is not available in Hadoop 0.18.3, which is what Amazon Elastic
MapReduce runs
- write my own combiner to speed up the shuffle phase (I'm not sure how much
this can achieve; sketched below)
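
To make the JobControl idea concrete, here is the kind of driver I have in
mind -- a sketch only, where the class name and the two JobConfs are
placeholders: jobs that may run concurrently go into one JobControl group,
the group runs to completion, and the in-between code then executes in the
driver before the next group is built.

    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.jobcontrol.Job;
    import org.apache.hadoop.mapred.jobcontrol.JobControl;

    public class IterationDriver {

      // Run one group of jobs to completion; JobControl is a Runnable,
      // so it needs its own thread while we poll for completion.
      static void runGroup(JobControl group) throws InterruptedException {
        Thread runner = new Thread(group);
        runner.start();
        while (!group.allFinished()) {
          Thread.sleep(1000);
        }
        group.stop();
      }

      public static void main(String[] args) throws Exception {
        JobConf confA = new JobConf();  // placeholder: configure MR job A here
        JobConf confB = new JobConf();  // placeholder: configure MR job B here

        JobControl group = new JobControl("iteration-step-1");
        Job jobA = new Job(confA);
        Job jobB = new Job(confB);
        jobB.addDependingJob(jobA);     // B starts only after A succeeds
        group.addJob(jobA);
        group.addJob(jobB);
        runGroup(group);

        // The in-between code runs here, in the driver, between groups;
        // no fake MR job is needed.
        // ... then build and run the next JobControl group.
      }
    }

This does not make JobControl itself run the in-between code, but it keeps
the parallelism within each group without fake jobs.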
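
The Writable-reuse tip would look like this in my mappers (a sketch; the
key/value types and the payload are placeholders):

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class ReusingMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

      // Allocated once per task rather than once per map() call.
      private final Text outKey = new Text();
      private final IntWritable outValue = new IntWritable();

      public void map(LongWritable offset, Text sentence,
                      OutputCollector<Text, IntWritable> output,
                      Reporter reporter) throws IOException {
        outKey.set(sentence);                // overwrite the reused instance
        outValue.set(sentence.getLength());  // placeholder payload
        output.collect(outKey, outValue);    // collect() serializes right away,
                                             // so reusing the objects is safe
      }
    }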
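
For the matrix question, the only reallocation-free pattern I can sketch is
to allocate once at the maximum cardinality and grow or shrink a view of it.
This assumes the viewPart(int[] offset, int[] size) method of the Mahout
Matrix interface; the import below matches recent trunk and may need
adjusting for older releases.

    import org.apache.mahout.math.DenseMatrix;
    import org.apache.mahout.math.Matrix;

    public class ViewDemo {
      public static void main(String[] args) {
        Matrix backing = new DenseMatrix(200, 200); // allocate once, at max size

        // "Removing" the last row just means shrinking the view.
        Matrix view = backing.viewPart(new int[] {0, 0}, new int[] {199, 200});
        view.set(0, 0, 1.0);  // writes through to the backing storage

        // "Appending" a row re-grows the view; no 200*200 copy is made.
        Matrix grown = backing.viewPart(new int[] {0, 0}, new int[] {200, 200});
        System.out.println(grown.get(0, 0));  // prints 1.0
      }
    }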
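
The map-only configuration, sketched with IdentityMapper standing in for my
real mapper:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.lib.IdentityMapper;

    public class MapOnlyJob {
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(MapOnlyJob.class);
        conf.setJobName("map-only-pass");
        conf.setMapperClass(IdentityMapper.class); // stand-in for the real one
        conf.setNumReduceTasks(0);  // no shuffle/sort; map output is written
                                    // directly to the output path
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
      }
    }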
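
And a sketch of the combiner idea, with a count-summing reducer reused as the
combiner -- this only pays off when the reduce function is associative and
commutative:

    import java.io.IOException;
    import java.util.Iterator;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    // Sums partial counts; usable as both combiner and reducer because
    // addition is associative and commutative.
    public class SumCombiner extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

      public void reduce(Text key, Iterator<IntWritable> values,
                         OutputCollector<Text, IntWritable> output,
                         Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext()) {
          sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
      }

      // In the driver:
      //   conf.setCombinerClass(SumCombiner.class);
      //   conf.setReducerClass(SumCombiner.class);
    }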

In Mahout, the Dirichlet and LDA packages follow this iterative pattern,
albeit with only one MR job per iteration, not 9. Can you give me some advice
from your experience with such iterative MR jobs? Are there places in the
Mahout code I should read? Is there any optimization I'm not thinking of, or
tuning I should consider or check for? If someone would like to look at my
code, I would be happy to share it.

Thanks!
Sebastien

Re: iterative MapReduce for Gibbs sampling

Posted by Ted Dunning <te...@gmail.com>.
Gibbs sampling can be quite difficult in a map-reduce setting because the
state is inherently chained from one iteration to the next.

Parallel processing can help if you have a relatively small number of
iterations but each iteration is expensive (LDA, Dirichlet Process
clustering), or if you want to use many chains. If your process mixes poorly,
you can be in a world of hurt.
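
To make the many-chains idea concrete, here is a rough sketch (the class
name, I/O types, and the toy update are all placeholders, not an existing
API): each map task runs one complete chain from a seed on its input line, so
Hadoop's per-job overhead is paid once per chain instead of once per
iteration.

    import java.io.IOException;
    import java.util.Random;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    // One input line = one chain seed; the whole chain runs inside map().
    public class ManyChainsMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, LongWritable, Text> {

      public void map(LongWritable offset, Text seedLine,
                      OutputCollector<LongWritable, Text> output,
                      Reporter reporter) throws IOException {
        long seed = Long.parseLong(seedLine.toString().trim());
        Random rng = new Random(seed);      // independent randomness per chain
        double state = rng.nextGaussian();  // toy state; a real sampler would
                                            // hold the full model state here
        for (int iter = 0; iter < 100000; iter++) {
          state = 0.5 * state + rng.nextGaussian(); // stand-in for a Gibbs sweep
          if (iter % 1000 == 0) {
            reporter.progress();            // keep the task from being killed
          }
        }
        output.collect(new LongWritable(seed),
                       new Text(Double.toString(state)));
      }
    }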

We have been lucky so far to have very fast mixing processes (Dirichlet
Process) or variational techniques that converge very quickly (LDA).

If you want many iterations, then Hadoop-based map-reduce may be very
difficult to make efficient. The minimum job time is probably going to be
tens of seconds. Running thousands of these is going to hurt; running tens of
thousands will hurt much worse.

-- 
Ted Dunning, CTO
DeepDyve