Posted to general@hadoop.apache.org by Mat Kelcey <ma...@gmail.com> on 2009/09/15 03:14:34 UTC

hadoop scales but is not performant?

hi all,

recently i've started playing with hadoop and my first learning
experiment has surprised me

i'm implementing a text problem: trying to extract infrequent
phrases using a mixture of probabilistic models.
it's a bit of a toy problem so the algorithm details aren't super
important, though the implementation might be...

this problem ended up being represented by a dozen or so map reduce
jobs of various types including some aggregate steps and some manual
joins.
(see the bottom of this page for details
http://matpalm.com/sip/take3_markov_chains.html)

i've implemented each step using ruby / streaming; the code is at
http://github.com/matpalm/sip if anyone cares.
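
to give a flavour of the shape of each step, a token counting pass
looks something like the cut-down sketch below (a sketch only, not the
actual repo code):

  # map.rb - cut-down sketch; emits each whitespace-separated token with a count of 1
  STDIN.each_line do |line|
    line.split.each { |token| puts "#{token}\t1" }
  end

  # reduce.rb - cut-down sketch; streaming delivers lines sorted by key,
  # so we just sum runs of the same token
  current, count = nil, 0
  STDIN.each_line do |line|
    token, value = line.chomp.split("\t")
    if token != current
      puts "#{current}\t#{count}" if current
      current, count = token, 0
    end
    count += value.to_i
  end
  puts "#{current}\t#{count}" if current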

i ran some tests using 10 ec2 medium-cpu instances across a small
100mb of gzipped text.
for validation of the results i also reimplemented the entire
algorithm as a single-threaded ruby app.

my surprise comes from finding that the ruby implementation
outperforms the 10 ec2 instances on this data size...
i ran a few samples of different sizes; the graph is at the bottom of
http://matpalm.com/sip/part4_but_does_it_scale.html

so why is this? here are my explanations in order of how confident i am...

a) 100mb is peanuts and hadoop was made for 1000x this size so the
test is invalid.
b) there is a better representation of this problem that uses fewer
map/reduce passes.
c) streaming is too slow and rewriting in java (and making use of
techniques like chaining mappers) would speed things up (a
streaming-side version of this idea is sketched below).
d) doing these steps, particularly the joins, in pig would be faster.
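
re (b) and (c), one cheap thing i could try without moving to java is
to compose consecutive map-only steps into a single streaming script
instead of paying for separate jobs. a rough, hypothetical sketch, not
code from the repo:

  # chained_map.rb - hypothetical sketch: two map steps composed in one
  # streaming pass, a poor man's version of java's ChainMapper idea
  def clean(line)                 # stand-in for a first, normalising map step
    line.downcase.gsub(/[^a-z0-9 ]/, " ")
  end

  def emit_ngrams(line, n = 3)    # stand-in for a second map step
    line.split.each_cons(n) { |gram| puts "#{gram.join(' ')}\t1" }
  end

  STDIN.each_line { |line| emit_ngrams(clean(line)) }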

my next step is to rewrite some of the steps in pig to measure the difference

does anyone have any high level comments on this?

cheers,
mat

Re: hadoop scales but is not performant?

Posted by Mat Kelcey <ma...@gmail.com>.
>>[me]
>> a) 100mb is peanuts and hadoop was made for 1000x this size so the
>> test is invalid.

>[Alan]
> Definitely the case.  At this size your start costs will swamp any
> performance benefits of parallelism.

cheers, thanks alan
was pretty sure this was the case.
mat

Re: hadoop scales but is not performant?

Posted by Alan Gates <ga...@yahoo-inc.com>.
Some thoughts, inlined.

Alan.

On Sep 14, 2009, at 6:14 PM, Mat Kelcey wrote:

> hi all,
>
> recently i've started playing with hadoop and my first learning
> experiment has surprised me
>
> i'm implementing a text problem: trying to extract infrequent
> phrases using a mixture of probabilistic models.
> it's a bit of a toy problem so the algorithm details aren't super
> important, though the implementation might be...
>
> this problem ended up being represented by a dozen or so map reduce
> jobs of various types including some aggregate steps and some manual
> joins.
> (see the bottom of this page for details
> http://matpalm.com/sip/take3_markov_chains.html)
>
> i've implemented each step using ruby / streaming, the code is at
> http://github.com/matpalm/sip if anyone cares.
>
> i ran some tests using 10 ec2 medium-cpu instances across a small
> 100mb of gzipped text.
> for validation of the results i also reimplemented the entire
> algorithm as a single threaded ruby app
>
> my surprise comes from finding that the ruby implementation
> outperforms the 10 ec2 instances on this data size...
> i ran a few samples of different sizes with the graph at the bottom of
> http://matpalm.com/sip/part4_but_does_it_scale.html
>
> so why is this? here are my explanations in order of how confident i  
> am...
>
> a) 100mb is peanuts and hadoop was made for 1000x this size so the
> test is invalid.
Definitely the case.  At this size your start costs will swamp any  
performance benefits of parallelism.

> b) there is a better representation of this problem that uses fewer
> map/reduce passes.
> c) streaming is too slow and rewriting in java (and making use of
> techniques like chaining mappers) would speed things up
Maybe.  Streaming itself is a little slower than using java.  I don't  
know what the penalty of using java versus ruby is.

> d) doing these steps, particularly the joins, in pig would be faster
Writing your joins in Pig will definitely be faster to code.  They
won't be faster to execute unless you are able to use one of Pig's
specialized join algorithms (fragment-replicate, merge, skew).  At
100mb it's hard to see that any of those will make a big difference.
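
For what it's worth, when one side of a join is small you can hand
roll the fragment-replicate idea in a streaming mapper: load the small
relation into memory and join records as they stream past.  A rough
sketch, with a made-up file name and column layout, assuming the small
side has been shipped to each task with -file:

  # join_map.rb - sketch of a map-side (fragment-replicate style) join.
  # Assumes small_side.tsv has been shipped with -file and that both inputs
  # are tab separated with the join key in the first column.
  small = {}
  File.foreach("small_side.tsv") do |line|
    key, value = line.chomp.split("\t", 2)
    small[key] = value
  end

  STDIN.each_line do |line|
    key, rest = line.chomp.split("\t", 2)
    puts "#{key}\t#{rest}\t#{small[key]}" if small.key?(key)
  end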


Re: hadoop scales but is not performant?

Posted by Mat Kelcey <ma...@gmail.com>.
> mappers would be utilised (due to the same input data size)
whoops, i mean "small input data size"

Re: hadoop scales but is not performant?

Posted by Mat Kelcey <ma...@gmail.com>.
thanks scott, some great things to think about!

the only "tuning" i did was to set mapred.reduce.tasks and
mapred.map.tasks to 30 to correspond to the capacity reported by
the html ui. i admit i did this without a deep understanding of what it
meant; i do know that when i did not specify these then only a few
mappers would be utilised (due to the same input data size)
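
for context, each streaming invocation gets built up roughly along
these lines (the jar path and script names here are placeholders, not
my actual layout):

  # run_step.rb - illustrative sketch of how one streaming step gets launched
  # (older streaming versions spell the -D generic option as -jobconf)
  STREAMING_JAR = "$HADOOP_HOME/contrib/streaming/hadoop-*-streaming.jar"

  def run_step(mapper, reducer, input, output)
    system("hadoop jar #{STREAMING_JAR} " \
           "-D mapred.map.tasks=30 -D mapred.reduce.tasks=30 " \
           "-input #{input} -output #{output} " \
           "-mapper #{mapper} -reducer #{reducer} " \
           "-file #{mapper} -file #{reducer}") or raise "step failed"
  end

  run_step("phrase_map.rb", "phrase_reduce.rb", "corpus", "out/phrase_counts")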

in relation to scheduling i was taking the simple approach of running
the streaming jobs sequentially with the default scheduler. even from
watching output scroll past it is obvious that a _lot_ of time is
being taken up in setup-related activities. this is most apparent in
the single document case. something is just not right...

i had read in http://issues.apache.org/jira/browse/HADOOP-2721 "Use
job control for tasks (and therefore for pipes and streaming)"  that
jobcontrol (specifically representing job dependencies) was not yet
available for streaming. as such i dismissed any scheduling changes.
i'll revisit this to make sure i understand what i can and can't do in
streaming. if nothing else i can try fair scheduling with my own rolled
version of dependencies. i'm orchestrating the job runs from rake and
i've got my own homebrew libraries for this type of dependency
management, though i'm also loath to roll my own versions of things.
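
for the record, the kind of home-rolled dependency management i mean is
just plain rake, something roughly like the (made-up) tasks below;
multitask lets independent jobs hit the cluster concurrently so a fair
scheduler would actually have something to interleave:

  # Rakefile sketch - task and script names are made up; run_step is the
  # launcher helper sketched above
  task :default => [:tokenise, :counts]

  task :tokenise do
    run_step("tokenise_map.rb", "identity_reduce.rb", "corpus", "out/tokens")
  end

  # unigram and bigram counts both depend only on the tokenise output,
  # so multitask runs them in parallel threads
  multitask :counts => [:unigram_counts, :bigram_counts]

  task :unigram_counts do
    run_step("unigram_map.rb", "sum_reduce.rb", "out/tokens", "out/unigrams")
  end

  task :bigram_counts do
    run_step("bigram_map.rb", "sum_reduce.rb", "out/tokens", "out/bigrams")
  end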

so lots of ideas and things to check; i'll rerun trying some of the
things you've mentioned.

thanks again for the feedback!
mat

Re: hadoop scales but is not performant?

Posted by Scott Carey <sc...@richrelevance.com>.
Some more comments:

How are you controlling your graph of different map reduce dependencies?  Doing this linearly will leave the cluster rather underutilized.  Using Pig or Cascading or the job control interface to schedule this all (which can submit parallel independent jobs), combined with the Fair Scheduler, will usually have very good results on smaller clusters in reducing overall batch latency and increasing cluster utilization.  This often allows setting the number of reduces for each job more optimally as well.

Although Pig will be a bit less performant than an optimized M/R job, for tasks like yours it can possibly go much faster with a lot less code, since it optimizes your M/R passes and scheduling and provides some alternate join strategies out of the box if used right.


Re: hadoop scales but is not performant?

Posted by Scott Carey <sc...@richrelevance.com>.
It's not entirely invalid at this scale, though the dataset generally should be (in memory) closer to the size of the available RAM of the cluster or more to be a better test.

Hadoop should still do well at this size relative to a single process.  If it is taking 45 minutes on a 10 machine cluster, it should be faster than any single process.  If this was limited primarily by startup costs, then increasing the data size wouldn't almost linearly increase the time.

It can probably be significantly tuned.
Some tips for jobs this size:
Watch the CPU / RAM / IO on a node while it is running.  Is it being well utilized?
Watch the jobs in the HTML UI on the jobtracker.  Are the total number of slots configured being used most of the time?   Or is it spending a lot of clock time with a lot of empty slots?  If so, which slots?  Are Maps or Reduces taking the most time?

Hadoop currently is a slow scheduler for tasks (it can schedule at most 1 task per second or two per node).  Using the Fair Scheduler and enabling some options can turn this up to 1 map and 1 reduce to schedule per 'tick'.  There are some big changes in the schedulers due out in 0.21 that will significantly help here.  For smaller low latency jobs this can make a big difference.

Hadoop also has some flaws in the shuffle phase that affect clusters of all sizes, but can hurt small clusters with many small map jobs.   Look at the log file output of your reduce jobs (in the jobtracker UI) and see how long the shuffle phase is taking.
There is a big change due for 0.21 that makes this a LOT faster for some cases, and low latency smaller jobs will benefit a lot too.
https://issues.apache.org/jira/browse/MAPREDUCE-318

See my comment in that ticket from the 10th of June in relation to shuffle times.  A one line change in 0.19.x and 0.20.x cut times on smaller low latency jobs on smaller clusters in half when limited by shuffle inefficiencies (specifically, fetching no more than one map output from one host every few seconds for no good reason).

Tuning several hadoop parameters might help a great deal as well.   Cloudera's configuration wizard is a good starting point to help you find which knobs are more important.


You probably also want to do some work to figure out what is taking time on your local tests.  There might be something wrong there.  It seems suspicious to me that, even at the one document size, it would take 2+ minutes versus less than one second.  Startup costs aren't that big for 2 maps and 2 reduces - on the order of a few seconds, not minutes.  A few seconds times a few jobs is still not that much.
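
A cheap way to see where that local time goes is to wrap each stage of
the single-process run in ruby's Benchmark; the stage methods below are
placeholders for whatever the local implementation actually does:

  require "benchmark"

  # time each stage of the single-process run separately to find the slow part;
  # load_and_tokenise / count_phrases / build_model are placeholder stage methods
  docs = nil
  Benchmark.bm(18) do |bm|
    bm.report("load + tokenise") { docs = load_and_tokenise("corpus") }
    bm.report("phrase counts")   { count_phrases(docs) }
    bm.report("markov model")    { build_model(docs) }
  end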


