Posted to common-user@hadoop.apache.org by Yair Gottdenker <ya...@cotendo.com> on 2008/07/01 10:14:55 UTC

RE: Hadoop - is it good for me and performance question

Thanks for your reply, Haijun.

Do you know what makes Hadoop run so slow? I have been trying to figure
it out myself, but I can't imagine anything so complicated that it
justifies Hadoop's performance and latency.



-----Original Message-----
From: Haijun Cao [mailto:haijun@kindsight.net] 
Sent: Monday, June 30, 2008 9:33 PM
To: core-user@hadoop.apache.org
Subject: RE: Hadoop - is it good for me and performance question


Not sure if this will answer your question, but here is a similar thread
regarding Hadoop performance:

http://www.mail-archive.com/core-user@hadoop.apache.org/msg02878.html

Hadoop is good for log processing if you have a lot of logs to process
and you don't need the result in real time (e.g. you can accumulate one
day's logs and process them in one batch, latency == 1 day). In other
words, it shines at batch (long latency) processing of large data sets.
It is good at scalability (scaling out), not at increasing single
core/machine performance. If your data fits in one process, then using a
distributed framework will probably slow it down.

Haijun

-----Original Message-----
From: yair gotdanker [mailto:yairgot@gmail.com] 
Sent: Sunday, June 29, 2008 4:46 AM
To: core-user@hadoop.apache.org
Subject: Hadoop - is it good for me and performance question

Hello all,



I am a newbie to Hadoop. The technology seems very interesting, but I am
not sure it suits my needs. I would really appreciate your feedback.



The problem:

I have multiple log servers, each receiving 10-100 MB/minute. The received
data is processed to produce aggregated data.
The processing should take a few minutes at most (10 min).

In addition, I ran a performance benchmark on the wordcount example from
the quickstart tutorial on my PC (pseudo-distributed, using the quickstart
configuration files) and it took about 40 seconds!
I must be missing something or doing something wrong here, because
40 seconds is way too long.
The map and reduce functions should be very fast, since there is almost no
processing done, so I guess most of the time is spent in the Hadoop
framework itself.
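
For reference, the map and reduce functions in the wordcount example are
roughly the following (a sketch along the lines of the old
org.apache.hadoop.mapred API, not copied verbatim from the bundled
example), so there really is almost no per-record work:

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WordCount {

  // Emit (word, 1) for every whitespace-separated token in the line.
  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output,
                    Reporter reporter) throws IOException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        output.collect(word, one);
      }
    }
  }

  // Sum the counts collected for each word.
  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output,
                       Reporter reporter) throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }
}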

I would appreciate any help in understanding this and in finding out how I
can increase the performance.
By the way, does anyone know of a good behind-the-scenes tutorial that
explains more about how the jobtracker and tasktrackers communicate, and
so on?

Re: Hadoop - is it good for me and performance question

Posted by tim robertson <ti...@gmail.com>.
MapReduce on Hadoop is for processing very large amounts of data; otherwise
the overhead of the framework (job scheduling, failover etc.) does not
justify it. If you are processing 10-100 MB/min, that is roughly 14-140 GB
a day (about 1,440 minutes in a day). That probably justifies its use, I
would say.

You can't get a meaningful performance estimate from a pseudo-distributed
cluster on one machine with a small amount of data - it is just not what
Hadoop is designed for.

I have recently gone through what you are doing, and then went to EC2 to do
my first real test at the weekend.
Have you considered a test run on EC2 with a 140 GB file? It takes about a
day to go from starting out to getting it running unless you already know
EC2, as there is a fair amount to read and set up, and it will cost you
around US$5 in total.

I blogged my experience here, which should help you avoid a couple of pitfalls:
http://biodivertido.blogspot.com/2008/06/hadoop-on-amazon-ec2-to-generate.html

I have subsequently found that I ran only 1 reducer, and it was the reducer
that took 50% of the time - I should have run more like 10 reducers for the
job I was doing...
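
For what it's worth, the reducer count is set on the job driver. A rough
sketch with the old JobConf API is below (the WordCount.Map and
WordCount.Reduce classes and the input/output arguments are illustrative,
reusing the wordcount sketch earlier in this thread, not my actual job):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCountDriver.class);
    conf.setJobName("wordcount");

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    conf.setMapperClass(WordCount.Map.class);
    conf.setReducerClass(WordCount.Reduce.class);

    // The default is a single reduce task; raising this spreads the
    // reduce work across the cluster instead of one machine.
    conf.setNumReduceTasks(10);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  }
}

You can also set mapred.reduce.tasks in the job configuration instead of
hard-coding it; the usual advice is roughly one to two times the number of
reduce slots in the cluster.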

Cheers

Tim



RE: Hadoop - is it good for me and performance question

Posted by Haijun Cao <ha...@kindsight.net>.
That I don't know; I would be interested to know too. Maybe you could first
establish a baseline (a program that does the word count without using
mapred)?
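
Something along these lines would do as a baseline (plain single-process
Java, no Hadoop involved; just a sketch, with the input file name taken
from the command line):

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.HashMap;
import java.util.Map;
import java.util.StringTokenizer;

public class PlainWordCount {
  public static void main(String[] args) throws Exception {
    Map<String, Integer> counts = new HashMap<String, Integer>();

    // Read the input file line by line and count every
    // whitespace-separated token.
    BufferedReader in = new BufferedReader(new FileReader(args[0]));
    String line;
    while ((line = in.readLine()) != null) {
      StringTokenizer tok = new StringTokenizer(line);
      while (tok.hasMoreTokens()) {
        String word = tok.nextToken();
        Integer count = counts.get(word);
        counts.put(word, count == null ? 1 : count + 1);
      }
    }
    in.close();

    // Print word<TAB>count, the same shape of output the wordcount job
    // produces with the default text output format.
    for (Map.Entry<String, Integer> entry : counts.entrySet()) {
      System.out.println(entry.getKey() + "\t" + entry.getValue());
    }
  }
}

Comparing its wall-clock time against the 40 seconds you saw should show
how much of that is framework overhead (task scheduling, JVM startup, etc.)
rather than the counting itself.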

Haijun
