Posted to dev@spark.apache.org by Koert Kuipers <ko...@tresata.com> on 2014/12/14 21:41:58 UTC

spark kafka batch integration

hello all,
we at tresata wrote a library to provide batch integration between spark
and kafka (distributed write of an rdd to kafka, distributed read of an rdd
from kafka). our main use cases are (in lambda architecture jargon):
* periodic appends to the immutable master dataset on hdfs from kafka using
spark
* making non-streaming data available in kafka with periodic data drops from
hdfs using spark, to facilitate merging the speed and batch layers in
spark-streaming
* distributed writes from spark-streaming

see here:
https://github.com/tresata/spark-kafka
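
to give a sense of the distributed write side, here is a minimal sketch in
scala (not the library's actual api; writeToKafka, the broker list and the
topic are placeholder names, and it uses kafka's 0.8 producer): each spark
partition opens its own producer and pushes its elements to a topic.

    import java.util.Properties
    import kafka.producer.{KeyedMessage, Producer, ProducerConfig}
    import org.apache.spark.rdd.RDD

    // hypothetical sketch: send every element of a string rdd to a kafka
    // topic, one producer per spark partition
    def writeToKafka(rdd: RDD[String], brokers: String, topic: String): Unit =
      rdd.foreachPartition { iter =>
        val props = new Properties()
        props.put("metadata.broker.list", brokers)
        props.put("serializer.class", "kafka.serializer.StringEncoder")
        val producer = new Producer[String, String](new ProducerConfig(props))
        try iter.foreach(msg => producer.send(new KeyedMessage[String, String](topic, msg)))
        finally producer.close()
      }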

best,
koert

Re: spark kafka batch integration

Posted by Cody Koeninger <co...@koeninger.org>.
For an alternative take on a similar idea, see

https://github.com/koeninger/spark-1/tree/kafkaRdd/external/kafka/src/main/scala/org/apache/spark/rdd/kafka

An advantage of the approach I'm taking is that the lower and upper offsets
of the RDD are known in advance, so it's deterministic.
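
A minimal sketch of that idea (not the actual code in the branch above;
OffsetRange and fetchRange are placeholders): each RDD partition is pinned
to a fixed offset range, so a retried or re-run task reads exactly the same
slice of the Kafka log.

    import org.apache.spark.SparkContext

    // one rdd partition per fixed offset range
    case class OffsetRange(topic: String, partition: Int, fromOffset: Long, untilOffset: Long)

    // placeholder: a real implementation would use kafka's SimpleConsumer to
    // fetch the messages in [fromOffset, untilOffset)
    def fetchRange(range: OffsetRange): Iterator[Array[Byte]] = Iterator.empty

    def kafkaRdd(sc: SparkContext, ranges: Seq[OffsetRange]) =
      sc.parallelize(ranges, ranges.size).flatMap(fetchRange)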

I haven't had a need to write to kafka from spark yet, so that's an obvious
advantage of your library.

I think the existing kafka dstream is inadequate for a number of use cases,
and would really like to see some combination of these approaches make it
into the spark codebase.


On Sun, Dec 14, 2014 at 2:41 PM, Koert Kuipers <ko...@tresata.com> wrote:
>
> hello all,
> we at tresata wrote a library to provide batch integration between spark
> and kafka (distributed write of an rdd to kafka, distributed read of an rdd
> from kafka). our main use cases are (in lambda architecture jargon):
> * periodic appends to the immutable master dataset on hdfs from kafka using
> spark
> * making non-streaming data available in kafka with periodic data drops from
> hdfs using spark, to facilitate merging the speed and batch layers in
> spark-streaming
> * distributed writes from spark-streaming
>
> see here:
> https://github.com/tresata/spark-kafka
>
> best,
> koert
>

Re: spark kafka batch integration

Posted by Koert Kuipers <ko...@tresata.com>.
thanks! i will take a look at your code. didn't realize there was already
something out there.

good point about upper offsets, i will add that feature to our version as
well if you don't mind.

i was thinking about making it transparently deterministic under task failure
(even if no upper offsets are provided) by making a call to get the latest
offsets for all partitions, and filtering the rdd based on those so that
nothing beyond those offsets ends up in the rdd. i haven't had time to test
whether that works and is robust.
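
roughly what i have in mind, as a sketch (fetchLatestOffsets is a placeholder
for an OffsetRequest against the brokers, and the message layout here is
assumed): snapshot the latest offset per partition once on the driver, then
filter so retried tasks never see anything written after the snapshot.

    import org.apache.spark.rdd.RDD

    // hypothetical: latest offset per kafka partition, fetched once up front
    // (e.g. via kafka.api.OffsetRequest against each partition leader)
    def fetchLatestOffsets(topic: String): Map[Int, Long] = Map.empty

    // assumed message layout: partition and offset travel with the payload
    case class Message(partition: Int, offset: Long, payload: Array[Byte])

    def deterministicRead(rdd: RDD[Message], topic: String): RDD[Message] = {
      val upper = fetchLatestOffsets(topic) // snapshot before any task runs
      // drop anything written after the snapshot so re-reads are identical
      rdd.filter(m => m.offset < upper.getOrElse(m.partition, 0L))
    }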

On Mon, Dec 15, 2014 at 11:39 AM, Cody Koeninger <co...@koeninger.org> wrote:
>
> For an alternative take on a similar idea, see
>
>
> https://github.com/koeninger/spark-1/tree/kafkaRdd/external/kafka/src/main/scala/org/apache/spark/rdd/kafka
>
> An advantage of the approach I'm taking is that the lower and upper
> offsets of the RDD are known in advance, so it's deterministic.
>
> I haven't had a need to write to kafka from spark yet, so that's an
> obvious advantage of your library.
>
> I think the existing kafka dstream is inadequate for a number of use
> cases, and would really like to see some combination of these approaches
> make it into the spark codebase.
>
> On Sun, Dec 14, 2014 at 2:41 PM, Koert Kuipers <ko...@tresata.com> wrote:
>
>> hello all,
>> we at tresata wrote a library to provide batch integration between spark
>> and kafka (distributed write of an rdd to kafka, distributed read of an rdd
>> from kafka). our main use cases are (in lambda architecture jargon):
>> * periodic appends to the immutable master dataset on hdfs from kafka using
>> spark
>> * making non-streaming data available in kafka with periodic data drops from
>> hdfs using spark, to facilitate merging the speed and batch layers in
>> spark-streaming
>> * distributed writes from spark-streaming
>>
>> see here:
>> https://github.com/tresata/spark-kafka
>>
>> best,
>> koert
>>
>
