
Posted to users@kafka.apache.org by Hussein Baghdadi <hu...@hotmail.com> on 2012/10/31 00:28:51 UTC

Scenarios of Hadoop producers and consumers




Hi,

Kafka comes with support for Hadoop, but I'm not sure what this means. Kafka is a publish-subscribe messaging system. What are some typical uses of Kafka's support for Hadoop producers and consumers?

Producers are easy to digest: a MapReduce job emitting data to Kafka. But what about Hadoop consumers? Hadoop is a batch system, not a continuously running system (like Storm or Dempsy). Say Kafka gets some data, what will happen?

Thanks for your help and time.

Re: Scenarios of Hadoop producers and consumers

Posted by David Arthur <mu...@gmail.com>.

Indeed, Hadoop is not the ideal platform for stream processing, but there are plenty of use cases for Kafka + Hadoop. I use it to consolidate log data from many different systems into HDFS. I have N systems using the log4j appender to produce to a Kafka broker, and in my Hadoop cluster I run a simple job that consumes that data and writes it out as an HDFS file. This, in effect, is what other log aggregators like Flume do. However, we already have Kafka in our stack for other pub/sub stuff, so it made sense to use it for log aggregation as well.

To answer your question about consuming in Hadoop, the RecordReader will just continue to return records until the queue is exhausted. If you could manage to produce data faster than Hadoop was reading it out (very unlikely), the Hadoop job would run forever (or at least for quite a while). I believe you end up with one RecordReader per Kafka partition, so allocating more partitions would increase your throughput to Hadoop (at least until you saturate the network between the Kafka brokers and Hadoop).
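
As a rough illustration of the behaviour described above, the following Python sketch simulates a batch read from a partitioned log. It does not use the real Kafka or Hadoop APIs: `PartitionReader`, `run_batch`, and the in-memory lists standing in for partitions are all hypothetical names made up for this example. Each reader returns records until its partition is exhausted, and there is one reader per partition.

```python
from typing import Dict, Iterator, List


class PartitionReader:
    """Hypothetical stand-in for a RecordReader bound to one partition."""

    def __init__(self, partition: List[str], start_offset: int = 0) -> None:
        self.partition = partition
        self.offset = start_offset

    def __iter__(self) -> Iterator[str]:
        # Keep returning records until the partition is exhausted.
        # len() is re-checked on every step, so records appended while
        # the reader is running are picked up as well; a producer that
        # outpaced the reader would keep this loop going indefinitely.
        while self.offset < len(self.partition):
            record = self.partition[self.offset]
            self.offset += 1
            yield record


def run_batch(topic: Dict[int, List[str]]) -> Dict[int, List[str]]:
    """Drain every partition with its own reader (one reader per partition)."""
    readers = {pid: PartitionReader(records) for pid, records in topic.items()}
    # In a real job each reader would run in its own task, in parallel;
    # here they run one after another for simplicity.
    return {pid: list(reader) for pid, reader in readers.items()}


topic = {0: ["a", "b", "c"], 1: ["d", "e"]}
print(run_batch(topic))  # {0: ['a', 'b', 'c'], 1: ['d', 'e']}
```

Because each partition gets its own reader, adding partitions adds readers that could run side by side, which is where the throughput gain mentioned above would come from.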

Hope this helps
-David

On Oct 30, 2012, at 8:40 PM, Michal Haris wrote:

> When you need your data streams to be incrementally loaded into Hadoop for
> offline batch processing and/or ad-hoc querying - some things cannot be
> computed in real time, or are expensive to compute that way. So you have a
> Hadoop job that consumes a Kafka stream, potentially formats the data, and
> saves it into HDFS.
> 
> On 30 October 2012 23:28, Hussein Baghdadi <hu...@hotmail.com> wrote:
> 
>> Hi,
>>
>> Kafka comes with support for Hadoop, but I'm not sure what this means.
>> Kafka is a publish-subscribe messaging system. What are some typical
>> uses of Kafka's support for Hadoop producers and consumers?
>>
>> Producers are easy to digest: a MapReduce job emitting data to Kafka.
>> But what about Hadoop consumers? Hadoop is a batch system, not a
>> continuously running system (like Storm or Dempsy). Say Kafka gets some
>> data, what will happen?
>>
>> Thanks for your help and time.
>>
>
> -- 
> Michal Haris
> Software Engineer
> 
> www.visualdna.com | t: +44 (0) 207 734 7033

Re: Scenarios of Hadoop producers and consumers

Posted by Michal Haris <mi...@visualdna.com>.

When you need your data streams to be incrementally loaded into Hadoop for
offline batch processing and/or ad-hoc querying - some things cannot be
computed in real time, or are expensive to compute that way. So you have a
Hadoop job that consumes a Kafka stream, potentially formats the data, and
saves it into HDFS.
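
One way to picture the incremental loading described above is the Python sketch below. It is only a simulation under stated assumptions: a plain list stands in for the Kafka stream, local files stand in for HDFS, and the job is assumed to save the last offset it read so the next run picks up only new data. The function name `incremental_load` and the offset-file layout are hypothetical and are not taken from Kafka's Hadoop support.

```python
import json
import tempfile
from pathlib import Path
from typing import List


def incremental_load(log: List[str], out_file: Path, offset_file: Path) -> int:
    """Append records not yet loaded to out_file; return how many were written."""
    # Resume from the offset saved by the previous run (0 on the first run).
    if offset_file.exists():
        start = json.loads(offset_file.read_text())["offset"]
    else:
        start = 0
    end = len(log)  # snapshot of the log end, so this run is a bounded batch

    with out_file.open("a", encoding="utf-8") as out:
        for record in log[start:end]:
            # "Formatting" step: here it only trims surrounding whitespace.
            out.write(record.strip() + "\n")

    # Save the new offset only after the output has been written.
    offset_file.write_text(json.dumps({"offset": end}))
    return end - start


with tempfile.TemporaryDirectory() as tmp:
    out_file = Path(tmp) / "events.txt"
    offset_file = Path(tmp) / "offset.json"
    log = ["click ", " view"]
    print(incremental_load(log, out_file, offset_file))  # 2
    log.append("purchase")
    print(incremental_load(log, out_file, offset_file))  # 1
```

The second run writes only the record that arrived after the first run, which is the sense in which the load is incremental.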

On 30 October 2012 23:28, Hussein Baghdadi <hu...@hotmail.com> wrote:

> Hi,
>
> Kafka comes with support for Hadoop, but I'm not sure what this means.
> Kafka is a publish-subscribe messaging system. What are some typical
> uses of Kafka's support for Hadoop producers and consumers?
>
> Producers are easy to digest: a MapReduce job emitting data to Kafka.
> But what about Hadoop consumers? Hadoop is a batch system, not a
> continuously running system (like Storm or Dempsy). Say Kafka gets some
> data, what will happen?
>
> Thanks for your help and time.
>

-- 
Michal Haris
Software Engineer

www.visualdna.com | t: +44 (0) 207 734 7033