Posted to common-user@hadoop.apache.org by Ricky Ho <rh...@adobe.com> on 2009/11/06 01:33:07 UTC

Stream Processing and Hadoop

I think the current form of Hadoop is not designed for stream-based processing, where data streams in continuously and immediate (low-latency) processing is required.  Please correct me if I am wrong.

The main reason is that the Reduce phase cannot start until the Map phase is complete.  This forces the data stream to be broken into chunks, with processing conducted in a batch-oriented fashion.
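To make the constraint concrete: today a continuous stream would have to be cut into windows, with each window submitted as its own batch job, roughly like this (a Python-style sketch, not the real Hadoop API; the window size and helper names are made up for illustration):

from collections import defaultdict

WINDOW_SIZE = 10000  # illustrative: cut the stream into fixed-size chunks

def run_batch_job(chunk, map_fn, reduce_fn):
    # Map over the whole chunk first; nothing reaches reduce until this loop ends.
    intermediate = defaultdict(list)
    for record in chunk:
        for key, value in map_fn(record):
            intermediate[key].append(value)
    # Shuffle/sort barrier: group and sort the keys, then run reduce once per key.
    return {key: reduce_fn(key, values) for key, values in sorted(intermediate.items())}

def process_stream(stream, map_fn, reduce_fn):
    chunk = []
    for record in stream:
        chunk.append(record)
        if len(chunk) == WINDOW_SIZE:
            yield run_batch_job(chunk, map_fn, reduce_fn)  # results appear only once per window
            chunk = []

So the latency of any result is at least the time it takes to fill and process a whole window.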

But why can't we just remove that constraint and let Reduce start before Map is complete?  What would we lose?  Yes, there are some things we'd lose ...

1) Keys arriving at the same reduce task are sorted.  If we start Reduce processing before all the data arrives, we can no longer maintain the sort order, because some of the data hasn't arrived yet.

2) If a Map process crashes in the middle of processing an input file, we don't know where to resume processing.  If a Reduce process crashes, the result data can be lost as well.

But most stream-processing analytic applications don't require the above.  If my reduce function is commutative and associative, then I can perform an incremental reduce as the data streams in.
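For example, if the per-key reduce is just a count or a sum, an incremental reducer could be as simple as this (a rough Python sketch of the idea, not anything that exists in Hadoop itself):

from collections import defaultdict

class IncrementalReducer:
    # Keeps a running aggregate per key.  This is only correct when combine()
    # is commutative and associative (sum, count, max, ...), because values
    # may arrive in any order and at any time.
    def __init__(self, combine, initial):
        self.combine = combine
        self.state = defaultdict(lambda: initial)

    def update(self, key, value):
        # Fold each value in as soon as it arrives -- no waiting for all map
        # output, and no dependence on the keys being sorted.
        self.state[key] = self.combine(self.state[key], value)
        return self.state[key]

# Usage: count activity events per user as they stream in.
counts = IncrementalReducer(combine=lambda acc, v: acc + v, initial=0)
counts.update("user42", 1)
counts.update("user7", 1)
counts.update("user42", 1)   # user42 is already at 2, available immediately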

Imagine a large social network site running on a server farm, where each server has an agent process that tracks user behavior (what items are being searched, what photos are uploaded, etc.) across all the servers.

Let's say the site wants to analyze this user activity, which comes in as data streams from many servers.  So I want each server to run a Map process that emits the user key (or product key) to a group of reducers, which compute the analytics.
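Concretely, I picture each server-side agent doing something like this (sketch only; the event fields, key scheme, and number of reducers are invented for illustration):

import queue

NUM_REDUCERS = 8  # illustrative
reducer_queues = [queue.Queue() for _ in range(NUM_REDUCERS)]  # stand-ins for the reducer connections

def map_event(event):
    # Agent-side map: turn one raw activity event into (key, value) pairs.
    # e.g. event = {"user": "user42", "action": "search", "item": "camera"}
    yield ("user:" + event["user"], 1)       # per-user activity count
    if "item" in event:
        yield ("item:" + event["item"], 1)   # per-product activity count

def partition(key):
    # Same idea as Hadoop's default hash partitioner: every value for a given
    # key always lands on the same reducer.
    return hash(key) % NUM_REDUCERS

def handle_event(event):
    # Ship each pair to its reducer as soon as the event happens, instead of
    # writing map output to files for a later batch job.
    for key, value in map_event(event):
        reducer_queues[partition(key)].put((key, value))

Each reducer would then just apply its commutative/associative reduce function to whatever shows up on its queue.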

Couldn't this kind of processing be run in Map/Reduce without the Reduce having to wait for the Map to finish?

Does this make sense?  Am I missing something important?

Rgds,
Ricky

RE: Stream Processing and Hadoop

Posted by Ricky Ho <rh...@adobe.com>.
Thanks!

I think this is exactly what I am looking for.
I guess the naïve implementation described in the paper is good enough.  In fact, here is a simple prototype of stream-based map/reduce that I did a while ago:

http://horicky.blogspot.com/2008/06/exploring-erlang-with-mapreduce.html

Rgds,
Ricky

-----Original Message-----
From: Amandeep Khurana [mailto:amansk@gmail.com] 
Sent: Thursday, November 05, 2009 4:43 PM
To: common-user@hadoop.apache.org
Subject: Re: Stream Processing and Hadoop

There is a paper on this:
http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-136.html

Amandeep Khurana
Computer Science Graduate Student
University of California, Santa Cruz



Re: Stream Processing and Hadoop

Posted by Amandeep Khurana <am...@gmail.com>.
There is a paper on this:
http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-136.html

Amandeep Khurana
Computer Science Graduate Student
University of California, Santa Cruz

