Posted to common-user@hadoop.apache.org by yair gotdanker <ya...@gmail.com> on 2008/06/29 13:45:49 UTC

Hadoop - is it good for me and performance question

Hello all,



I am a newbie to Hadoop. The technology seems very interesting, but I am not
sure it suits my needs. I would really appreciate your feedback.



The problem:

I have multiple log servers, each receiving 10-100 MB/minute. The received
data is processed to produce aggregated data.
The processing should take a few minutes at most (10 minutes tops).
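
(For concreteness, here is a minimal sketch of what such an aggregation could
look like as a classic MapReduce job, using the old org.apache.hadoop.mapred
API of that era. The class name and the assumption that each log line starts
with a whitespace-delimited key to group on are purely illustrative.)

    // Illustrative only: count records per key from log lines of the form "<key> <rest...>".
    import java.io.IOException;
    import java.util.Iterator;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.*;

    public class LogAggregate {

      public static class Map extends MapReduceBase
          implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text key = new Text();

        public void map(LongWritable offset, Text line,
                        OutputCollector<Text, IntWritable> out, Reporter reporter)
            throws IOException {
          // Assume the first whitespace-separated token is the field to aggregate on.
          String[] fields = line.toString().split("\\s+", 2);
          if (fields.length > 0 && fields[0].length() > 0) {
            key.set(fields[0]);
            out.collect(key, ONE);
          }
        }
      }

      public static class Reduce extends MapReduceBase
          implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> out, Reporter reporter)
            throws IOException {
          int sum = 0;
          while (values.hasNext()) {
            sum += values.next().get();
          }
          out.collect(key, new IntWritable(sum));  // one aggregated record per key
        }
      }

      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(LogAggregate.class);
        conf.setJobName("log aggregation sketch");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setMapperClass(Map.class);
        conf.setCombinerClass(Reduce.class);   // combine locally to cut shuffle volume
        conf.setReducerClass(Reduce.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));   // HDFS dir with the raw logs
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));  // output dir (must not exist)
        JobClient.runJob(conf);
      }
    }

Run against a directory of raw logs in HDFS, it emits one (key, count) record
per key; swapping the IntWritable sum for whatever aggregate is actually needed
is the only part specific to the problem.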

In addition, I ran a quick performance benchmark on the wordcount example
from the quickstart tutorial on my PC (pseudo-distributed, using the
quickstart configuration files) and it took about 40 seconds!
I must be missing something or doing something wrong here, since
40 seconds is way too long.
The map/reduce functions should be very fast since there is almost no
processing done, so I guess most of the time is spent inside the Hadoop
framework itself.
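
(One way to see how much of the 40 seconds is fixed framework overhead rather
than actual work is to time a job that does nothing at all. A rough sketch,
again against the old mapred API; the paths and the class name are made up.)

    // Hypothetical no-op job to isolate per-job framework overhead.
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.*;
    import org.apache.hadoop.mapred.lib.IdentityMapper;
    import org.apache.hadoop.mapred.lib.IdentityReducer;

    public class NoOpJob {
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(NoOpJob.class);
        conf.setJobName("no-op overhead test");
        conf.setMapperClass(IdentityMapper.class);    // pass records straight through
        conf.setReducerClass(IdentityReducer.class);
        conf.setOutputKeyClass(LongWritable.class);   // TextInputFormat keys are byte offsets
        conf.setOutputValueClass(Text.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        long start = System.currentTimeMillis();
        JobClient.runJob(conf);                       // blocks until the job completes
        System.out.println("Elapsed ms: " + (System.currentTimeMillis() - start));
      }
    }

Even on a single pseudo-distributed node this still pays for job submission,
task scheduling and a fresh JVM per task, which is why a trivial wordcount does
not finish in a couple of seconds.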

I would appreciate any help in understanding this and how I can improve
the performance.
Btw: does anyone know of a good behind-the-scenes tutorial that explains in
more detail how the jobtracker and the tasktrackers communicate, and so on?

RE: Hadoop - is it good for me and performance question

Posted by Haijun Cao <ha...@kindsight.net>.
http://www.mail-archive.com/core-user@hadoop.apache.org/msg02906.html



RE: Hadoop - is it good for me and performance question

Posted by Haijun Cao <ha...@kindsight.net>.
Not sure if this will answer your question, but a similar thread
regarding hadoop performance:

http://www.mail-archive.com/core-user@hadoop.apache.org/msg02878.html

Hadoop is good for log processing if you have a lot of logs to process
and you don't need the result in real time (e.g. you can accumulate one
day's logs and process them in one batch, latency == 1 day). In other
words, it shines at batch processing of large data sets (long latency). It
is good at scalability (scaling out), not at increasing single
core/machine performance. If your data fits in one process, then using a
distributed framework will probably slow it down.
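
(To put rough numbers on it, using the rates from the question: at 100
MB/minute a single log server accumulates about 100 * 60 * 24 ≈ 144 GB per
day, so a daily batch over several servers can easily reach hundreds of
gigabytes, which is the kind of volume where a scale-out batch framework
starts to earn its overhead.)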

Haijun

-----Original Message-----
From: yair gotdanker [mailto:yairgot@gmail.com] 
Sent: Sunday, June 29, 2008 4:46 AM
To: core-user@hadoop.apache.org
Subject: Hadoop - is it good for me and performance question

Hello all,



I am newbie to hadoop, The technology seems very interesting but I am
not
sure it suit my needs.  I really appreciate your feedbacks.



The problem:

I have multiple logservers each receiving 10-100 mg/minute. The received
data is processed to produce aggregated data.
The data process time should take few minutes at top (10 min).

In addtion, I did some performance benchmark on the workcount example
provided by quickstart tutorial on my pc (pseudo-distributed, using
quickstart configurations file) and it took about 40 seconds!
I must be missing something here, I must be doing something wrong here
since
40 seconds is way too long!
Map/reduce function should be very fast since there is almost no
processing
done. So I guess most of the time spend on the hadoop framework.

I will appreciate any help  for understanding this and how can I
increase
the performance.
btw:
Does anyone know good behind the schene tutorial, that explains more on
how
the jobtracker/tasktracker communicate and so.