Posted to common-user@hadoop.apache.org by Ryan LeCompte <le...@gmail.com> on 2009/02/25 14:59:39 UTC

Using Hadoop for near real-time processing of log data

Hello all,

Is anyone using Hadoop as more of a near/almost real-time processing
of log data for their systems to aggregate stats, etc? I know that
Hadoop has generally been good at off-line processing of large amounts
of data, but I've wondered if anyone has tried using it for processing
of near real-time log data as it appears in your systems with any
success? My gut feeling is that Hadoop isn't suitable for this yet
given redundancy issues around the JobTracker/NameNode, as well as the
overhead of moving blocks around in HDFS. Thoughts?

Thanks,
Ryan

Re: Using Hadoop for near real-time processing of log data

Posted by Edward Capriolo <ed...@gmail.com>.

>>Yeah, but what's the point of using Hadoop then? i.e. we lost all the
>>parallelism?

Some jobs do not need it. For example, I am working with the Hive
subproject. If I have a table that is smaller than my block size,
having a large number of mappers or reducers is counterproductive:
Hadoop will start up mappers that never get any data. Setting the job
tracker to 'local', or setting map tasks and reduce tasks to 1, makes
the job finish faster; in my case, about 10 seconds instead of 20.
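
A minimal sketch of why the extra mappers sit idle, in plain Python
rather than Hadoop itself. It assumes a simplified model of one input
split per HDFS block and a 64 MB block size; the real split count
depends on the InputFormat and the job configuration, and the function
names here are hypothetical.

```python
import math


def useful_map_tasks(input_bytes: int, block_bytes: int) -> int:
    """Map tasks that actually receive data, assuming one split per block."""
    if block_bytes <= 0:
        raise ValueError("block_bytes must be positive")
    # Even an empty or tiny input is modelled as needing one task.
    return max(1, math.ceil(input_bytes / block_bytes))


def idle_map_tasks(requested_tasks: int, input_bytes: int, block_bytes: int) -> int:
    """Requested map tasks that would start up without any input to read."""
    return max(0, requested_tasks - useful_map_tasks(input_bytes, block_bytes))


BLOCK = 64 * 1024 * 1024        # assumed block size (64 MB)
SMALL_TABLE = 10 * 1024 * 1024  # assumed table size (10 MB), smaller than one block

print(useful_map_tasks(SMALL_TABLE, BLOCK))   # tasks with data
print(idle_map_tasks(8, SMALL_TABLE, BLOCK))  # tasks started for nothing
```

Under these assumptions, a table smaller than one block keeps a single
map task busy, so asking for eight only adds startup cost.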

If you have a small data set and a system with 8 cores, the MiniMR
cluster can possibly be used as an embedded Hadoop. For some jobs the
most efficient parallelism might be 1.

WordCount of "1 2 3 4 5 6" on the MiniMRCluster test case takes less
than two seconds.

It may not be the common case, but it may be feasible to use Hadoop in
that manner.
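
For illustration only, here is what a word count with a parallelism of 1
looks like when the map, shuffle, and reduce steps all run in a single
process. This is plain Python, not Hadoop or the MiniMRCluster API, and
the function names are hypothetical.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every whitespace-separated token."""
    return [(word, 1) for word in line.split()]


def shuffle_phase(pairs):
    """Group the emitted values by key, as the shuffle step would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the grouped counts for each word."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(lines):
    """Run map, shuffle, and reduce sequentially in one process."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return reduce_phase(shuffle_phase(pairs))


print(word_count(["1 2 3 4 5 6"]))
```

Each of the six tokens is counted once. With an input this small, the
work itself is negligible and any fixed job startup cost dominates.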

Re: Using Hadoop for near real-time processing of log data

Posted by Mikhail Yakshin <gr...@gmail.com>.

On Wed, Feb 25, 2009 at 10:09 PM, Edward Capriolo
>>> Is anyone using Hadoop as more of a near/almost real-time processing
>>> of log data for their systems to aggregate stats, etc?
>>
>> We do, although "near realtime" is a pretty relative notion and your
>> mileage may vary. For example, starting up and shutting down Hadoop
>> jobs is pretty expensive: it can take anywhere from 5-10 seconds up to
>> several minutes to get a job started, and almost the same goes for
>> job finalization. Generally, if your "near realtime" can tolerate
>> a lag of 3-5 minutes, it's possible to use Hadoop.
>
> I was thinking about this. Assuming your datasets are small, would
> running a local jobtracker, or even running the MiniMR cluster from
> the test case, be an interesting way to run small jobs confined to one
> CPU?

Yeah, but what's the point of using Hadoop then? i.e. we lost all the
parallelism?

-- 
WBR, Mikhail Yakshin

Re: Using Hadoop for near real-time processing of log data

Posted by Edward Capriolo <ed...@gmail.com>.

On Wed, Feb 25, 2009 at 1:13 PM, Mikhail Yakshin
<gr...@gmail.com> wrote:
> Hi,
>
>> Is anyone using Hadoop as more of a near/almost real-time processing
>> of log data for their systems to aggregate stats, etc?
>
> We do, although "near realtime" is a pretty relative notion and your
> mileage may vary. For example, starting up and shutting down Hadoop
> jobs is pretty expensive: it can take anywhere from 5-10 seconds up to
> several minutes to get a job started, and almost the same goes for
> job finalization. Generally, if your "near realtime" can tolerate
> a lag of 3-5 minutes, it's possible to use Hadoop.
>
> --
> WBR, Mikhail Yakshin
>

I was thinking about this. Assuming your datasets are small, would
running a local jobtracker, or even running the MiniMR cluster from
the test case, be an interesting way to run small jobs confined to one
CPU?

Re: Using Hadoop for near real-time processing of log data

Posted by Mikhail Yakshin <gr...@gmail.com>.

Hi,

> Is anyone using Hadoop as more of a near/almost real-time processing
> of log data for their systems to aggregate stats, etc?

We do, although "near realtime" is a pretty relative notion and your
mileage may vary. For example, starting up and shutting down Hadoop
jobs is pretty expensive: it can take anywhere from 5-10 seconds up to
several minutes to get a job started, and almost the same goes for
job finalization. Generally, if your "near realtime" can tolerate
a lag of 3-5 minutes, it's possible to use Hadoop.
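
The arithmetic behind that rule of thumb can be sketched as a simple
lag budget. This is a rough model in plain Python, not a measurement:
the batch interval and run time below are assumed values, only the
startup and finalization ranges come from the figures above, and the
function names are hypothetical.

```python
def worst_case_lag(batch_interval_s, startup_s, run_s, finalize_s):
    """Worst-case seconds between a log line being written and its stats being ready.

    A line written just after a batch is cut waits one full interval for the
    next batch, then for that job to start, run, and finalize.
    """
    return batch_interval_s + startup_s + run_s + finalize_s


def fits_budget(lag_budget_s, batch_interval_s, startup_s, run_s, finalize_s):
    """True if the worst-case lag stays within the tolerated lag."""
    return worst_case_lag(batch_interval_s, startup_s, run_s, finalize_s) <= lag_budget_s


# Fast case: 10 s startup and finalization, assumed 60 s batches and 30 s run.
print(worst_case_lag(60, 10, 30, 10))      # 110 seconds
print(fits_budget(180, 60, 10, 30, 10))    # within a 3-minute budget

# Slow case: startup and finalization each take 2 minutes.
print(worst_case_lag(60, 120, 30, 120))    # 330 seconds
print(fits_budget(300, 60, 120, 30, 120))  # exceeds even a 5-minute budget
```

With fast job startup the lag fits comfortably inside a few minutes;
when startup and finalization stretch to minutes, the fixed overhead
alone can use up the whole budget.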

-- 
WBR, Mikhail Yakshin

Re: Using Hadoop for near real-time processing of log data

Posted by Vadim Zaliva <kr...@gmail.com>.

On Wed, Feb 25, 2009 at 05:59, Ryan LeCompte <le...@gmail.com> wrote:
> Hello all,
>
> Is anyone using Hadoop as more of a near/almost real-time processing
> of log data for their systems to aggregate stats, etc? I know that
> Hadoop has generally been good at off-line processing of large amounts
> of data, but I've wondered if anyone has tried using it for processing
> of near real-time log data as it appears in your systems with any
> success? My gut feeling is that Hadoop isn't suitable for this yet
> given redundancy issues around the JobTracker/NameNode, as well as the
> overhead of moving blocks around in HDFS. Thoughts?

Ryan,

Several people (myself included) have asked similar questions. You may
want to search the mailing list archives for previous discussions on
the topic.

In short, you are right: Hadoop is not perfectly suited for realtime
processing.

Vadim