Posted to users@kafka.apache.org by Blender Bl <bl...@aol.com> on 2014/01/14 22:16:25 UTC

Persist Queue On HDFS

Hi,


My team is trying to implement a lambda architecture.
We need to stream all our new data through Kafka to Storm and HDFS.


As I see it, there are two options:

Using Camus - not very efficient
Streaming via Storm - not very efficient

Is it possible to persist the queue's files on HDFS (with the short-circuit read setting switched on) instead of the local filesystem?


It should have performance similar to the local filesystem, shouldn't it?
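
(By the short-circuit setting I mean the client-side HDFS read config; roughly this, assuming Hadoop 2.x and a placeholder socket path:)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class ShortCircuitClient {
    public static void main(String[] args) throws Exception {
        // These two keys normally live in hdfs-site.xml on the client;
        // they are set here only to show which knobs are meant.
        Configuration conf = new Configuration();
        conf.setBoolean("dfs.client.read.shortcircuit", true);
        // Placeholder path; must match dfs.domain.socket.path on the DataNodes.
        conf.set("dfs.domain.socket.path", "/var/lib/hadoop-hdfs/dn_socket");

        // Reads through this FileSystem can then bypass the DataNode for
        // blocks stored on the local machine.
        FileSystem fs = FileSystem.get(conf);
        System.out.println("Using filesystem: " + fs.getUri());
    }
}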

Re: Persist Queue On HDFS

Posted by Rob Withers <re...@gmail.com>.
Nice idea, but it's a different sort of animal.  Going to HDFS is different: it requires aggregation of traffic, so there is the whole offset-commit-strategy concern.  When pulling traffic for per-message work, we commit after every pull, so exactly once.

The tradeoff with aggregation is whether to allow at-least-once delivery, or to accept some traffic loss under extreme conditions.  We chose the latter since we felt it occurs rarely.  So we still commit after every message, and if the rack falls over we lose a couple of hundred (thousand?) messages.  We could always replay manually from the last successful offset, since we have that info available somewhere, whereas duplication requires pruning.  Plus it seems to avoid the shutdown-fetcher syndrome.

A concurrent writer to HDFS is helpful, so there is lower latency; just split the traffic into queues.  We go to prod for the second time in 3 days.
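
Roughly, the commit-after-every-message loop looks like the following with the 0.8 high-level consumer (topic name, config values, and the HDFS writer are placeholders, not our actual code):

import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Properties;

import kafka.consumer.Consumer;
import kafka.consumer.ConsumerConfig;
import kafka.consumer.ConsumerIterator;
import kafka.consumer.KafkaStream;
import kafka.javaapi.consumer.ConsumerConnector;
import kafka.message.MessageAndMetadata;

public class PerMessageCommitConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("zookeeper.connect", "localhost:2181");   // placeholder
        props.put("group.id", "hdfs-writer");                // placeholder
        props.put("auto.commit.enable", "false");            // commit manually

        ConsumerConnector connector =
                Consumer.createJavaConsumerConnector(new ConsumerConfig(props));

        // One stream for the (placeholder) "events" topic.
        Map<String, List<KafkaStream<byte[], byte[]>>> streams =
                connector.createMessageStreams(Collections.singletonMap("events", 1));
        ConsumerIterator<byte[], byte[]> it = streams.get("events").get(0).iterator();

        while (it.hasNext()) {
            MessageAndMetadata<byte[], byte[]> msg = it.next();
            writeToHdfs(msg.message());      // hypothetical sink
            connector.commitOffsets();       // commit after every message
        }
    }

    private static void writeToHdfs(byte[] payload) {
        // stand-in for the real HDFS writer
    }
}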

Is there any way to use the high-level consumer and read chunks of traffic through the KafkaStream at a time?

Thank you,
Robert


On Jan 14, 2014, at 2:16 PM, Blender Bl <bl...@aol.com> wrote:

> 
> Hi,
> 
> 
> My team is trying to implement a lambda architecture.
> We need to stream all our new data through Kafka to Storm and HDFS.
> 
> 
> As I see it, there are two options:
> 
> Using Camus - not very efficient
> Streaming via Storm - not very efficient
> 
> Is it possible to persist the queue's files on HDFS (with the short-circuit read setting switched on) instead of the local filesystem?
> 
> 
> It should have performance similar to the local filesystem, shouldn't it?




- rob

Re: Persist Queue On HDFS

Posted by Joe Stein <jo...@stealth.ly>.
There is also the Hadoop contrib producer and consumer for HDFS:
https://github.com/apache/kafka/tree/0.8/contrib

/*******************************************
 Joe Stein
 Founder, Principal Consultant
 Big Data Open Source Security LLC
 http://www.stealth.ly
 Twitter: @allthingshadoop <http://www.twitter.com/allthingshadoop>
********************************************/


On Tue, Jan 14, 2014 at 4:16 PM, Blender Bl <bl...@aol.com> wrote:

>
> Hi,
>
>
> My team is trying to implement a lambda architecture.
> We need to stream all our new data through Kafka to Storm and HDFS.
>
>
> As I see it, there are two options:
>
> Using Camus - not very efficient
> Streaming via Storm - not very efficient
>
> Is it possible to persist the queue's files on HDFS (with the
> short-circuit read setting switched on) instead of the local filesystem?
>
>
> It should have performance similar to the local filesystem, shouldn't
> it?
>

Re: Persist Queue On HDFS

Posted by Jun Rao <ju...@gmail.com>.
The API in HDFS is quite different from what's in a regular POSIX file
system.
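
For example, Kafka writes and serves its log segments as plain local files
through java.nio (mmap for the indexes, sendfile for fetches), while an HDFS
client only gets the FileSystem stream API. A rough sketch of the two paths,
with placeholder file names:

import java.io.File;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PosixVsHdfs {
    public static void main(String[] args) throws Exception {
        // Kafka appends to a regular local file via a FileChannel and can
        // fsync, mmap, and sendfile it. That is what log.dirs points at.
        File segment = new File("/tmp/kafka-logs/demo-0/00000000000000000000.log");
        segment.getParentFile().mkdirs();
        try (RandomAccessFile raf = new RandomAccessFile(segment, "rw");
             FileChannel ch = raf.getChannel()) {
            ch.write(ByteBuffer.wrap("a message".getBytes("UTF-8")));
            ch.force(true); // fsync on flush, as the broker does
        }

        // An HDFS client goes through the FileSystem API instead: an
        // append-only output stream, with no mmap or sendfile path, so the
        // broker cannot simply point its log directories at HDFS.
        FileSystem fs = FileSystem.get(new Configuration()); // uses fs.defaultFS
        try (FSDataOutputStream out = fs.create(new Path("/tmp/demo-on-hdfs.log"))) {
            out.writeBytes("a message");
        }
    }
}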

Thanks,

Jun


On Tue, Jan 14, 2014 at 1:16 PM, Blender Bl <bl...@aol.com> wrote:

>
> Hi,
>
>
> My team is trying to implement a lambda architecture.
> We need to stream all our new data through Kafka to Storm and HDFS.
>
>
> As I see it, there are two options:
>
> Using Camus - not very efficient
> Streaming via Storm - not very efficient
>
> Is it possible to persist the queue's files on HDFS (with the
> short-circuit read setting switched on) instead of the local filesystem?
>
>
> It should have performance similar to the local filesystem, shouldn't
> it?
>