Posted to user@mahout.apache.org by Jeff Zhang <zj...@gmail.com> on 2009/11/22 07:30:17 UTC

Is there performance comparison document ?

Hi all,

Since Mahout is built on Hadoop, is there any performance comparison
between the algorithms run with Hadoop and without it?

Thank you.

Jeff Zhang

Re: Is there performance comparison document ?

Posted by Ted Dunning <te...@gmail.com>.

Pretty much not.

We need some realistic benchmarks in order to start working on performance
where it counts.

On Sat, Nov 21, 2009 at 10:30 PM, Jeff Zhang <zj...@gmail.com> wrote:

> Since Mahout is built on Hadoop, is there any performance comparison
> between the algorithms run with Hadoop and without it?
>

-- 
Ted Dunning, CTO
DeepDyve

Re: Is there performance comparison document ?

Posted by Ted Dunning <te...@gmail.com>.

Right.

Realistic benchmarks are what we really don't have yet.

Care to write a simple clustering benchmark?

On Sat, Nov 21, 2009 at 10:56 PM, Jeff Zhang <zj...@gmail.com> wrote:

> Maybe a benchmark is what I am really looking for.
>
> Hadoop has a benchmark showing that it can sort 1 TB of data in 62 seconds.
> In the same way, how long does it take Mahout's Bayes algorithms to train a
> model on, say, 1 GB of data?
>
>
> Thank you
>
> Jeff Zhang

-- 
Ted Dunning, CTO
DeepDyve

Re: Is there performance comparison document ?

Posted by Robin Anil <ro...@gmail.com>.

I have some old data from a personal experiment. A year ago, CBayes
model generation from a subset of Wikipedia (3 GB out of 17 GB) on a cluster
of 6 Pentium HT 3.0 GHz machines with 100 Mbps switched Ethernet took 15 minutes.
An additional 5 minutes were used to generate the 3 GB dataset from the 17 GB,
bringing the total time to approximately 20 minutes.
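
For a rough sense of scale, the figures above can be turned into throughput
numbers. This is a minimal sketch, assuming 1 GB = 1024 MB and that the 15
minutes is wall-clock time for model generation only (neither is stated
explicitly). The function name is made up for illustration.

```python
def throughput_mb_per_s(data_gb, minutes, machines):
    """Return (aggregate, per_machine) throughput in MB/s.

    Assumes 1 GB = 1024 MB and that `minutes` is wall-clock time.
    """
    if minutes <= 0 or machines < 1:
        raise ValueError("minutes must be positive and machines at least 1")
    aggregate = data_gb * 1024 / (minutes * 60)
    return aggregate, aggregate / machines


# Figures quoted in the message: 3 GB, 15 minutes, 6 machines.
aggregate, per_machine = throughput_mb_per_s(3, 15, 6)
print(f"aggregate: {aggregate:.2f} MB/s, per machine: {per_machine:.2f} MB/s")
```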

Note that Hadoop sorted 1 TB using 4000 quad-core/dual-core systems over
gigabit/multi-gigabit connections, so there is no comparison.

I hope this info helps.

Robin



On Sun, Nov 22, 2009 at 12:26 PM, Jeff Zhang <zj...@gmail.com> wrote:
> Maybe a benchmark is what I am really looking for.
>
> Hadoop has a benchmark showing that it can sort 1 TB of data in 62 seconds.
> In the same way, how long does it take Mahout's Bayes algorithms to train a
> model on, say, 1 GB of data?
>
>
> Thank you
>
> Jeff Zhang

Re: Is there performance comparison document ?

Posted by Jeff Zhang <zj...@gmail.com>.

Maybe a benchmark is what I am really looking for.

Hadoop has a benchmark showing that it can sort 1 TB of data in 62 seconds.
In the same way, how long does it take Mahout's Bayes algorithms to train a
model on, say, 1 GB of data?

Thank you

Jeff Zhang

Re: Is there performance comparison document ?

Posted by Ted Dunning <te...@gmail.com>.

I generally agree that Hadoop is quite a bit of overhead, but I was shocked
(and stunned) when I re-implemented a recommendation engine using Hadoop a
few years back.  The reference implementation was a sparse matrix in-memory
model that was slightly tuned for small space.  The sparse matrix was
similar to what we have now, but all of the integers were byte-wise
compressed.

The serious knock-me-back-on-my-heels moment was when I measured a quick and
dirty map-reduce version running on the same data.  Without any
consideration of memory use or optimization, the local invocation of the
map-reduce version actually ran faster than the in-memory version.

What had happened is that my memory efficiency zealotry combined with the
random access nature of my program was making accesses expensive and was
blowing the L2 cache completely.  That resulted in my program running
hundreds of times slower than it might have with good cache locality and
without the compressed integer overheads.

On the other hand, the hadoop version was reading data from disk in
completely sequential fashion and was making use of some very well written
sort and merge routines.  The result was that it was hitting cached data way
more than my other program was.
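
The access pattern described here, sequential passes plus a sort instead of
random-access updates, can be sketched on a single machine as follows. This
is only an illustration in Python of the map / sort / reduce shape, not the
code discussed in this message. The record layout (user, item) and all
function names are assumptions.

```python
from itertools import groupby
from operator import itemgetter


def map_phase(records):
    """Read records sequentially and emit (item, 1) pairs."""
    for _user, item in records:
        yield item, 1


def reduce_phase(pairs):
    """Sort pairs by key, then sum each run of equal keys in one pass."""
    ordered = sorted(pairs, key=itemgetter(0))
    for key, group in groupby(ordered, key=itemgetter(0)):
        yield key, sum(count for _key, count in group)


def item_counts(records):
    """Count how often each item occurs using map, sort, reduce."""
    return dict(reduce_phase(map_phase(records)))


if __name__ == "__main__":
    sample = [("u1", "a"), ("u2", "b"), ("u1", "b")]
    print(item_counts(sample))
```

Every stage touches its input in order, so the only non-sequential work is
inside the sort itself.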

The net result was that using hadoop on a single machine running an
out-of-core program was slightly faster than my fancy in-core answer.
Moving to multi-machine hadoop incurred a lot more overhead, but I was able
to work on hundreds of times larger data sets with about 10x the hardware in
the same time.

This story can be read many ways.  One way is to read it as saying what a
putz I am for writing a not very clever (or entirely too clever) program in
the first place.  Another reading is as another example of how disk and
memory have become more like tapes than like random access devices.  Another
reading is that the discipline of map-reduce is good for writing simple,
fast programs.

Regardless, I haven't looked back since that day.  If I have a batch program
of any scale at all, it goes into map-reduce form at my earliest
opportunity.

On Sat, Nov 21, 2009 at 10:44 PM, Sean Owen <sr...@gmail.com> wrote:

> I think we can already state the answer though: it's going to take
> much more CPU time and resources to run a computation via Hadoop than
> run it completely on one machine (non-parallelized). Hadoop is a lot
> of overhead.
>
> ...
> Are you wondering how much the overhead is, of a framework like Hadoop?
>

-- 
Ted Dunning, CTO
DeepDyve

Re: Is there performance comparison document ?

Posted by Sean Owen <sr...@gmail.com>.

I think we can already state the answer though: it's going to take
much more CPU time and resources to run a computation via Hadoop than
run it completely on one machine (non-parallelized). Hadoop is a lot
of overhead.

However some problems are too big to fit on one machine, so you have
to parallelize with Hadoop. In that case, there is no comparison --
you can't run it without Hadoop.

Also, parallelizing means you can finish the computation in fewer
wall-clock seconds. It'll take more CPU-seconds though. But then the
Hadoop runtime is just a function of how many machines you throw at it
and how parallelizable it is, so it's not much of a comparison.
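
To make the wall-clock versus CPU-seconds distinction concrete, here is a toy
cost model. It is a sketch under stated assumptions, not a measurement of
Hadoop: the work splits into a serial part and a perfectly parallel part, and
each machine pays a fixed startup overhead. The function name, the parallel
fraction, and the overhead figure are all invented for illustration.

```python
def estimate(work_cpu_s, machines, parallel_fraction=0.95,
             per_machine_overhead_s=30.0):
    """Return (wall_clock_s, total_cpu_s) under a toy cost model.

    Assumptions: the serial part runs on one machine, the parallel part
    divides evenly, and every machine pays the same fixed overhead.
    """
    if machines < 1:
        raise ValueError("machines must be at least 1")
    serial = work_cpu_s * (1.0 - parallel_fraction)
    parallel = work_cpu_s * parallel_fraction
    wall_clock = per_machine_overhead_s + serial + parallel / machines
    total_cpu = work_cpu_s + per_machine_overhead_s * machines
    return wall_clock, total_cpu


if __name__ == "__main__":
    for n in (1, 10, 100):
        wall, cpu = estimate(work_cpu_s=3600.0, machines=n)
        print(f"{n:>3} machines: wall-clock {wall:.0f} s, CPU {cpu:.0f} s")
```

In this model, adding machines always lowers wall-clock time and always raises
total CPU time, which is why the two numbers cannot be compared directly.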

Are you wondering how much the overhead is, of a framework like Hadoop?

On Sun, Nov 22, 2009 at 6:30 AM, Jeff Zhang <zj...@gmail.com> wrote:
> Hi all,
>
> Since Mahout is built on Hadoop, is there any performance comparison
> between the algorithms run with Hadoop and without it?
>
> Thank you.
>
> Jeff Zhang
>