Posted to user@spark.apache.org by Tobias Pfeiffer <tg...@preferred.jp> on 2014/09/24 10:59:53 UTC

All-time stream re-processing

Hi,

I have a setup (in mind) where data is written to Kafka and this data is
persisted in HDFS (e.g., using Camus) so that I have an all-time archive of
all stream data ever received. Now I want to process that all-time archive
and when I am done with that, continue with the live stream, using Spark
Streaming. (In a perfect world, Kafka would have infinite storage and I
would always use the Kafka receiver, starting from offset 0.)
Does anyone have an idea how to realize such a setup? Would I write a
custom receiver that first reads the HDFS file and then connects to Kafka?
Is there an existing solution for that use case?
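The "infinite Kafka retention, start from offset 0" behaviour described above could be emulated by chaining the archived records and the live stream into one logical sequence, skipping live records already covered by the archive. A minimal pure-Python sketch (the record layout and names are hypothetical stand-ins for HDFS files and a Kafka topic, not an actual receiver implementation):

```python
import itertools

def replay_then_follow(archive_records, live_stream, archive_end_offset):
    """Yield every archived record first, then live records whose
    Kafka offset is at or beyond where the archive stopped."""
    live = (rec for rec in live_stream if rec["offset"] >= archive_end_offset)
    return itertools.chain(archive_records, live)

# Toy data standing in for the HDFS archive and the Kafka topic.
archive = [{"offset": 0, "value": "a"}, {"offset": 1, "value": "b"}]
stream = [{"offset": 1, "value": "b"}, {"offset": 2, "value": "c"}]

# The archive covers offsets 0-1, so the live phase starts at offset 2;
# the duplicate record at offset 1 in the stream is skipped.
merged = list(replay_then_follow(archive, stream, archive_end_offset=2))
```

A real custom receiver would do the same two phases: emit the HDFS contents, then subscribe to Kafka from the recorded end offset.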

Thanks
Tobias

Re: All-time stream re-processing

Posted by Tobias Pfeiffer <tg...@preferred.jp>.
Hi,

On Wed, Sep 24, 2014 at 7:23 PM, Dibyendu Bhattacharya <
dibyendu.bhattachary@gmail.com> wrote:

> So you have a single Kafka topic with a very high retention period (which
> determines the storage capacity of a given Kafka topic), and you want to
> process all the historical data first using Camus and then start the
> streaming process?
>

I don't necessarily want to process the historical data "using Camus", but
I want to keep it forever (longer than Kafka's retention period) and
process the stored data and the stream. (I don't really care about how the
data got into HDFS, be it Camus or something else, but I assume that Kafka
can't store it forever.)

Imagine that I receive "all tweets posted to Twitter", they go into my
Kafka instance and are archived to HDFS. Now a user logs in and I want to
display to that user a) all posts that have ever mentioned him/her and b)
continue to update that list from the current stream. (In that order.) This
happens for a number of users, so it's a process that needs to be
repeatable with different Spark operations.
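The per-user query above is a two-phase job: a batch pass over the archive, then an open-ended pass over the stream. A toy Python sketch of that shape (the post/mention structure is invented for illustration; in the real setup phase 1 would be a Spark job over HDFS and phase 2 a Spark Streaming operation):

```python
def mentions_for_user(user, archived_posts, live_posts):
    """Phase 1: scan the archive for posts mentioning the user.
    Phase 2: keep extending that list from the live stream, in order."""
    result = [p for p in archived_posts if user in p["mentions"]]
    for post in live_posts:  # in production this phase never terminates
        if user in post["mentions"]:
            result.append(post)
    return result

archive = [{"text": "hi @bob", "mentions": {"bob"}},
           {"text": "hello @alice", "mentions": {"alice"}}]
stream = [{"text": "@alice again", "mentions": {"alice"}}]

posts = mentions_for_user("alice", archive, stream)
```

The ordering requirement ("in that order") is what makes this hard: phase 2 must not start before phase 1 has drained the archive, and the handoff offset must be known.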

> The challenge is that Camus and Spark are two different consumers of the
> Kafka topic, and each maintains its consumed offsets in a different way:
> Camus stores offsets in HDFS, and the Spark consumer stores them in ZK. As
> I understand it, you need something that identifies up to which point
> Camus has pulled (for a given partition of the topic) and then starts the
> Spark receiver from there?
>

I think I need such a thing. Also, I think Camus stores those offsets, so
in theory it should be possible to consume all HDFS files, read the offset,
then start Kafka processing from that offset. That sounds very "lambda
architecture"-ish to me, so I was wondering if someone has realized a
similar setup.
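The handoff itself reduces to reading the Camus offset records out of HDFS and turning them into per-partition start offsets for the Kafka consumer. A sketch of that bookkeeping, assuming (hypothetically) that the Camus offsets can be parsed into (topic, partition, last_consumed_offset) tuples:

```python
def resume_offsets(camus_offsets):
    """Given (topic, partition, last_consumed_offset) tuples as Camus
    might record them in HDFS, return the offset each Kafka consumer
    should resume from: one past the highest consumed offset."""
    start = {}
    for topic, partition, last in camus_offsets:
        key = (topic, partition)
        start[key] = max(start.get(key, 0), last + 1)
    return start

# Several Camus runs may have written offset records per partition;
# only the highest one matters.
offsets = [("tweets", 0, 41), ("tweets", 1, 99), ("tweets", 0, 57)]
start_from = resume_offsets(offsets)
```

Starting the streaming phase from these offsets (rather than from ZK's notion of the consumer position) is what closes the gap between the batch and speed layers.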

Thanks
Tobias

Re: All-time stream re-processing

Posted by Dibyendu Bhattacharya <di...@gmail.com>.
So you have a single Kafka topic with a very high retention period (which
determines the storage capacity of a given Kafka topic), and you want to
process all the historical data first using Camus and then start the
streaming process?

The challenge is that Camus and Spark are two different consumers of the
Kafka topic, and each maintains its consumed offsets in a different way:
Camus stores offsets in HDFS, and the Spark consumer stores them in ZK. As
I understand it, you need something that identifies up to which point Camus
has pulled (for a given partition of the topic) and then starts the Spark
receiver from there?


Dib

On Wed, Sep 24, 2014 at 2:29 PM, Tobias Pfeiffer <tg...@preferred.jp> wrote:

> Hi,
>
> I have a setup (in mind) where data is written to Kafka and this data is
> persisted in HDFS (e.g., using Camus) so that I have an all-time archive of
> all stream data ever received. Now I want to process that all-time archive
> and when I am done with that, continue with the live stream, using Spark
> Streaming. (In a perfect world, Kafka would have infinite storage and I
> would always use the Kafka receiver, starting from offset 0.)
> Does anyone have an idea how to realize such a setup? Would I write a
> custom receiver that first reads the HDFS file and then connects to Kafka?
> Is there an existing solution for that use case?
>
> Thanks
> Tobias
>
>