Posted to common-user@hadoop.apache.org by Nathan Marz <na...@gmail.com> on 2010/01/03 20:30:29 UTC

Mathematics behind Hadoop-based systems

I did some analysis on the performance of Hadoop-based workflows. Some of
the results are counter-intuitive so I thought the community at large would
be interested:

http://nathanmarz.com/blog/hadoop-mathematics/

Would love to hear any feedback or comments you have.

Re: Mathematics behind Hadoop-based systems

Posted by Nathan Marz <na...@gmail.com>.
That's a great way of putting it: "increasing capacity shifts the
equilibrium". If you work through some examples, you'll find that it
doesn't take many iterations for a workflow to converge to its stable
runtime. There is a minimum capacity you need for an equilibrium to exist,
though; below that, the runtime never stabilizes.
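
To make that concrete, here's a toy sketch (the parameter names are mine,
not from the blog post): each run costs a fixed overhead plus the time to
process whatever data arrived during the previous run. When throughput
exceeds the arrival rate, the runtime converges in a handful of
iterations; otherwise it grows without bound:

    # Toy model: runtime[k+1] = overhead + arrival_rate * runtime[k] / throughput
    def simulate(overhead_hrs, arrival_rate, throughput, iterations=10):
        # arrival_rate: hours of data arriving per hour of wall-clock time
        # throughput:   hours of data processed per hour of runtime
        t = overhead_hrs  # first run has essentially no backlog
        for i in range(iterations):
            t = overhead_hrs + arrival_rate * t / throughput
            print("iteration %d: runtime = %.3f h" % (i + 1, t))
        # Fixed point T = overhead / (1 - arrival_rate / throughput),
        # which only exists when throughput > arrival_rate.

    simulate(overhead_hrs=1.0, arrival_rate=1.0, throughput=2.0)  # settles at 2 h
    simulate(overhead_hrs=1.0, arrival_rate=1.0, throughput=0.9)  # never stabilizes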

Regarding the sorting, I believe the complexity in Hadoop is more like
(n/R) * log(n/R), where R is the number of reducers, since each reducer
only sorts its own share of the keys. And in practice, I've found that
sorting doesn't take very long. But you're right - fundamentally this
model is an approximation.
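
Rough numbers, made up purely for illustration, show why that log factor
rarely bites in practice:

    import math

    def per_reducer_sort_cost(n, reducers):
        # Each reducer sorts roughly its n/R share: (n/R) * log2(n/R) comparisons.
        share = n / reducers
        return share * math.log2(share)

    n, r = 10**9, 100
    print(per_reducer_sort_cost(n, r))   # ~2.3e8 comparisons per reducer
    print(n * math.log2(n))              # ~3.0e10 for a single global sort
    # Doubling n only moves log2(n/R) from ~23.3 to ~24.3, so the sort step
    # grows close to linearly with input size.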


On Tue, Jan 5, 2010 at 5:29 PM, Yuri Pradkin <yu...@isi.edu> wrote:

> On Sunday 03 January 2010 11:30:29 Nathan Marz <na...@gmail.com>
> wrote:
> > I did some analysis on the performance of Hadoop-based workflows. Some of
> > the results are counter-intuitive so I thought the community at large
> would
> > be interested:
> >
> > http://nathanmarz.com/blog/hadoop-mathematics/
> >
> > Would love to hear any feedback or comments you have.
> >
>
> Just thinking out loud:  runtime generally is not equal to the "hours of
> data".  In your processing model it's your next iteration that will have
> this amount of data.  By equating them, you're assuming an equilibrium
> case.  But if you're in an equilibrium, what would increasing the
> capacity mean?  I think it'd mean that you process your so-far
> accumulated data faster, so your next iteration will have less data to
> process, and so on until you get to a new equilibrium (but how fast will
> you get there?).  Lesson learned: increasing capacity shifts the
> equilibrium.  It's kind of like the reaction time of your system...
> Sometimes, I imagine, as long as you're keeping up with the arrival rate
> you don't care whether a run handles a week's worth of data or a day's.
> In fact there may be some constraints on the minimum size of the input.
>
> Another comment: you're assuming that processing time is linear in the
> input size.  Of course it depends on the processing you're doing, but
> even if YOUR processing is linear, Hadoop needs to sort keys, and that is
> at best O(n*log(n)), so there is inherent non-linearity present.
>
> Similarly about the number of nodes: doubling your nodes will not double
> your service rate for a variety of reasons.
>
>  -Yuri
>



-- 
Nathan Marz
Twitter: @nathanmarz
http://nathanmarz.com

Re: Mathematics behind Hadoop-based systems

Posted by Yuri Pradkin <yu...@isi.edu>.
On Sunday 03 January 2010 11:30:29 Nathan Marz <na...@gmail.com> wrote:
> I did some analysis on the performance of Hadoop-based workflows. Some of
> the results are counter-intuitive so I thought the community at large would
> be interested:
> 
> http://nathanmarz.com/blog/hadoop-mathematics/
> 
> Would love to hear any feedback or comments you have.
> 

Just thinking out loud:  runtime generally is not equal to the "hours of
data".  In your processing model it's your next iteration that will have
this amount of data.  By equating them, you're assuming an equilibrium
case.  But if you're in an equilibrium, what would increasing the capacity
mean?  I think it'd mean that you process your so-far accumulated data
faster, so your next iteration will have less data to process, and so on
until you get to a new equilibrium (but how fast will you get there?).
Lesson learned: increasing capacity shifts the equilibrium.  It's kind of
like the reaction time of your system...  Sometimes, I imagine, as long as
you're keeping up with the arrival rate you don't care whether a run
handles a week's worth of data or a day's.  In fact there may be some
constraints on the minimum size of the input.
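
To put a number on that reaction time (toy model, my notation): if each
run takes a fixed overhead plus the time to process the data that arrived
during the previous run, then after a capacity increase the runtime
approaches the new equilibrium geometrically, closing the gap by a factor
of arrival_rate/capacity per iteration:

    def runtimes(overhead, arrival_rate, capacity, start, steps=8):
        # runtime[k+1] = overhead + arrival_rate * runtime[k] / capacity
        t, out = start, []
        for _ in range(steps):
            t = overhead + arrival_rate * t / capacity
            out.append(round(t, 2))
        return out

    # Old equilibrium with capacity at 1.5x the arrival rate: T = 1/(1 - 1/1.5) = 3 h.
    # Double the capacity to 3x and start from that old equilibrium:
    print(runtimes(overhead=1.0, arrival_rate=1.0, capacity=3.0, start=3.0))
    # [2.0, 1.67, 1.56, 1.52, 1.51, 1.5, 1.5, 1.5]
    # It reaches the new 1.5 h equilibrium within a few iterations.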

Another comment: you're assuming that processing time is linear in the input 
size.  Of course it depends on the processing you're doing, but even if YOUR 
processing is linear, Hadoop needs to sort keys, and that is at best 
O(n*log(n)), so there is inherent non-linearity present.

Similarly about the number of nodes: doubling your nodes will not double your 
service rate for a variety of reasons.
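
One back-of-the-envelope way to see it (the 10% overhead figure below is
made up): if some fraction of a job is fixed overhead that doesn't
parallelize (job setup, shuffle, stragglers), then doubling the nodes only
speeds up the remainder, in the style of Amdahl's law:

    def speedup(nodes_before, nodes_after, serial_fraction):
        # Amdahl-style estimate: only the (1 - serial_fraction) part scales
        # with the number of nodes.
        def runtime(nodes):
            return serial_fraction + (1.0 - serial_fraction) / nodes
        return runtime(nodes_before) / runtime(nodes_after)

    print(speedup(10, 20, serial_fraction=0.10))  # ~1.31x, well short of 2x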

  -Yuri