Posted to common-user@hadoop.apache.org by Ricky Ho <rh...@adobe.com> on 2009/10/10 08:02:20 UTC

Parallel data stream processing

I'd like to get some Hadoop experts to verify my understanding ...

To my understanding, within a Map/Reduce cycle, the input data set is "frozen" (no change is allowed) while the output data set is "created from scratch" (it doesn't exist beforehand).  Therefore, the map/reduce model is inherently "batch-oriented".  Am I right?

I am wondering whether Hadoop is usable for processing many data streams in parallel.  For example, think of an e-commerce site that captures users' product searches in many log files and wants to run some analytics on those logs in real time.

One naïve way is to chunk the log and perform Map/Reduce in small batches.  Since the input data file must be frozen, we need to switch subsequent writes to a new log file.  However, the chunking approach is not good because the cutoff point is quite arbitrary.  Imagine I want to calculate the popularity of a product based on the frequency of searches within the last 2 hours (a sliding time window).  I don't think Hadoop can do this computation.

Of course, if we don't mind a distorted picture, we can use a jumping window (1-3 PM, 3-5 PM, ...) instead of a sliding window; then maybe it is OK.  But this is still not good, because we have to wait up to two hours before getting the next batch of results.  (E.g., at 4:59 PM we only have the result from the 1-3 PM batch.)
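
For concreteness, here is a tiny sketch of the two window types in plain Python (not Hadoop code; the sample events and the 2-hour width are made up for illustration):

    from collections import Counter

    WINDOW = 2 * 3600  # 2-hour window, in seconds

    # (timestamp, product) search events; purely made-up sample data
    events = [(1000, "camera"), (4000, "camera"), (6500, "phone"), (8200, "camera")]

    def jumping_window_counts(events, width=WINDOW):
        """Counts per fixed, non-overlapping window (1-3 PM, 3-5 PM, ...)."""
        buckets = {}
        for ts, product in events:
            bucket = ts // width  # which fixed window the event falls into
            buckets.setdefault(bucket, Counter())[product] += 1
        return buckets

    def sliding_window_counts(events, now, width=WINDOW):
        """Counts over the `width` seconds ending at `now`."""
        return Counter(p for ts, p in events if now - width < ts <= now)

    print(jumping_window_counts(events))
    print(sliding_window_counts(events, now=8200))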

It doesn't seem like Hadoop is good at handling this kind of processing: parallel processing of multiple real-time data streams.  Does anyone disagree?  The term "Hadoop streaming" is confusing because it means a completely different thing to me (i.e., using stdin and stdout as the input and output of tasks).

I'm wondering if a "mapper-only" model would work better.  In this case, there is no reducer (i.e., no grouping).  Each map task keeps a history (i.e., a sliding window) of the data it has seen and then writes the result to the output file.
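
A rough sketch of that mapper-only idea, written as a Hadoop Streaming mapper in plain Python (the tab-separated "timestamp<TAB>product" input format and the per-product output are assumptions for illustration, not an existing format):

    #!/usr/bin/env python
    # Streaming mapper: keeps an in-process sliding window of the events it
    # has seen and emits a running per-product count for the last 2 hours.
    import sys
    from collections import deque, Counter

    WINDOW = 2 * 3600
    window = deque()     # (timestamp, product) events still inside the window
    counts = Counter()

    for line in sys.stdin:
        ts_str, product = line.rstrip("\n").split("\t", 1)
        ts = float(ts_str)
        window.append((ts, product))
        counts[product] += 1
        # expire events older than 2 hours relative to the newest event
        while window and window[0][0] <= ts - WINDOW:
            _, old_product = window.popleft()
            counts[old_product] -= 1
        print("%s\t%d" % (product, counts[product]))

Note that each map task only sees its own input split, so the window and the counts are per-task rather than global.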

I heard about the "append" mode of HDFS, but I don't quite get it.  Does it simply mean a writer can write to the end of an existing HDFS file?  Or does it mean a reader can read while a writer is appending to the same HDFS file?  Is this "append" feature helpful in my situation?

Rgds,
Ricky

Re: Parallel data stream processing

Posted by Hong Tang <ht...@yahoo-inc.com>.
MapReduce is indeed inherently a batch processing model: each job's outcome is deterministically determined by the input and the operators (map, reduce, combiner), as long as the input stays immutable and the operators are deterministic and side-effect free.  Such a model allows the framework to recover from failures without having to understand the semantics of the operators (unlike SQL).  This is important because failures are bound to happen (frequently) on a large cluster assembled from commodity hardware.

A typical technique to bridge a batch system and a real-time system is to pair the batch system with an incremental processing component that computes deltas on top of some aggregated result.  The incremental processing part also serves the real-time queries, so its data are typically kept in memory.  Sometimes you have to choose an approximation algorithm for the incremental part and periodically reset its internal state with the more precise batch processing results (e.g. for top-k queries).
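
A minimal sketch of that pairing in Python (the class and method names are hypothetical, not from any Hadoop component; top-k is used as the example query):

    from collections import Counter

    class IncrementalTopK:
        """In-memory component that serves real-time top-k queries by keeping
        a delta on top of the last batch-computed baseline."""

        def __init__(self, k):
            self.k = k
            self.baseline = Counter()  # precise counts from the last batch job
            self.delta = Counter()     # counts accumulated since that job ran

        def observe(self, product):
            # called for each new search event as it arrives
            self.delta[product] += 1

        def top_k(self):
            # real-time answer = batch baseline + incremental delta
            return (self.baseline + self.delta).most_common(self.k)

        def reset_from_batch(self, batch_counts):
            # when a new MapReduce run finishes, replace the baseline with
            # its precise result and discard the (possibly drifted) delta
            self.baseline = Counter(batch_counts)
            self.delta = Counter()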

Hope this helps, Hong



Re: Parallel data stream processing

Posted by Ted Dunning <te...@gmail.com>.
On Fri, Oct 9, 2009 at 11:02 PM, Ricky Ho <rh...@adobe.com> wrote:

> ... To my understanding, within a Map/Reduce cycle, the input data set is
> "frozen" (no change is allowed) while the output data set is "created from
> scratch" (it doesn't exist beforehand).  Therefore, the map/reduce model is
> inherently "batch-oriented".  Am I right?
>

Current implementations are definitely batch oriented.  Keep reading,
though.


> I am wondering whether Hadoop is usable for processing many data streams in
> parallel.


Absolutely.


> For example, think of an e-commerce site that captures users' product
> searches in many log files and wants to run some analytics on those logs
> in real time.
>

Or consider Yahoo running their ad inventories in real-time.


> One naïve way is to chunk the log and perform Map/Reduce in small
> batches.  Since the input data file must be frozen, we need to switch
> subsequent writes to a new log file.


Which is not a big deal.  Moreover, these small chunks can be merged every
so often.


> However, the chunking approach is not good because the cutoff point is
> quite arbitrary.  Imagine I want to calculate the popularity of a product
> based on the frequency of searches within the last 2 hours (a sliding time
> window).  I don't think Hadoop can do this computation.
>

Subject to a moderate delay of 5-20 minutes, this is no problem at all for
Hadoop.  This is especially true if you are doing straightforward
aggregations that are associative and commutative.


> Of course, if we don't mind a distorted picture, we can use a jumping
> window (1-3 PM, 3-5 PM, ...) instead of a sliding window; then maybe it is
> OK.  But this is still not good, because we have to wait up to two hours
> before getting the next batch of results.  (E.g., at 4:59 PM we only have
> the result from the 1-3 PM batch.)
>

Or just process each 10-minute period into aggregate form.  Then add up the
latest 12 aggregates.  Every day, merge all the small files for the day, and
every month merge all the daily files.

There are very few businesses where a 10-minute delay is a big problem.
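
As a concrete sketch of this scheme (plain Python with made-up data; in practice each chunk aggregate would come out of a small MapReduce job): each 10-minute chunk is reduced to a per-product count, and because counting is associative and commutative the chunk aggregates can simply be added up to get the last-2-hours view.

    from collections import Counter

    WINDOW_CHUNKS = 12  # twelve 10-minute chunks = a 2-hour view

    def aggregate_chunk(events):
        """Per-product search count for one 10-minute chunk of
        (timestamp, product) events."""
        return Counter(product for _, product in events)

    def last_two_hours(chunk_aggregates):
        """Add up the latest 12 chunk aggregates.  Counter addition is
        associative and commutative, so the merge order doesn't matter."""
        total = Counter()
        for agg in chunk_aggregates[-WINDOW_CHUNKS:]:
            total += agg
        return total

    # toy usage: three chunks' worth of made-up aggregates
    chunks = [Counter({"camera": 5}),
              Counter({"camera": 2, "phone": 7}),
              Counter({"phone": 1})]
    print(last_two_hours(chunks).most_common(3))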


> It doesn't seem like Hadoop is good at handling this kind of processing:
> parallel processing of multiple real-time data streams.  Does anyone
> disagree?


It isn't entirely natural, but it isn't a problem.


> I'm wondering if a "mapper-only" model would work better.  In this case,
> there is no reducer (i.e., no grouping).  Each map task keeps a history
> (i.e., a sliding window) of the data it has seen and then writes the result
> to the output file.
>

This doesn't scale at all well.

Take a look at the Chukwa project for a well-worked example of how to
process logs in near real time with Hadoop.

Re: Parallel data stream processing

Posted by Amandeep Khurana <am...@gmail.com>.
We needed to process a stream of data too, and the best we could get to was
incremental data imports and incremental processing.  So that's probably
your best bet as of now.

-Amandeep



RE: Parallel data stream processing

Posted by Ricky Ho <rh...@adobe.com>.
Pig provides a higher-level programming interface but doesn't change the fundamental batch-oriented semantics into stream-based semantics.  As long as Pig is compiled into Map/Reduce jobs, it uses the same batch-oriented mechanism.

I am not talking about the "record boundary".  I am talking about the boundary between two consecutive map/reduce cycles within a continuous data stream.

I think Ted's suggestion of the incremental small-batch approach may be a good solution, although I am not sure how small the batch should be.  I assume there is a certain overhead to running a Hadoop job, so the batch shouldn't be too small.  There is a trade-off between the delay of the result and the batch size, and I guess in most cases this should be OK.
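
As a back-of-the-envelope way to look at that trade-off (plain Python; the figures are purely assumed for illustration):

    def worst_case_staleness(batch_interval_s, job_runtime_s):
        # an event arriving right at the start of a batch interval waits out
        # the whole interval and then the job that processes that batch
        return batch_interval_s + job_runtime_s

    # assumed figures: 10-minute batches, ~3 minutes of job startup + run time
    print(worst_case_staleness(600, 180))  # -> 780 seconds, i.e. ~13 minutes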

Rgds,
Ricky

Re: Parallel data stream processing

Posted by Jeff Zhang <zj...@gmail.com>.
I suggest you use Pig to handle your problem.  Pig is a sub-project of
Hadoop.

And you do not need to worry about the boundary problem.  Actually, Hadoop
handles that for you.

InputFormat helps you split the data, and RecordReader guarantees the record
boundaries.


Jeff Zhang

