Posted to user@spark.apache.org by Mohit Durgapal <du...@gmail.com> on 2015/08/10 14:51:06 UTC

spark-kafka directAPI vs receivers based API

Hi All,

I'd like to know how the direct API for Spark Streaming compares with the
earlier receiver-based API. Has anyone used the direct API approach in
production, or is it still only being used for POCs?

Also, since I'm new to Spark, could anyone point me to working example code
for both of the above APIs?

Also, in my use case I want to analyse a data stream (comma-separated
strings) and aggregate over certain fields based on their types. Ideally I
would like to push the aggregated data to a column-family-based datastore
(like HBase, which we currently use). But first I'd like to find out how to
aggregate that data and how streaming works: whether it polls and fetches
data in batches, or continuously listens to the Kafka queue for new
messages, and how I can configure my application for either case. I hope my
questions make sense.


Regards
Mohit

Re: spark-kafka directAPI vs receivers based API

Posted by Cody Koeninger <co...@koeninger.org>.

For direct stream questions:

https://github.com/koeninger/kafka-exactly-once

Yes, it is used in production.


For general Spark Streaming questions:

http://spark.apache.org/docs/latest/streaming-programming-guide.html
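
On the batching and aggregation part of the question: Spark Streaming
processes a stream as a series of small batches, one per batch interval.
With the receiver-based API (KafkaUtils.createStream) receivers consume
from Kafka continuously and buffer the data for the next batch. With the
direct API (KafkaUtils.createDirectStream) each batch reads a defined range
of offsets per Kafka partition. In both cases the aggregation runs once per
batch. The following is a minimal plain-Python sketch of that per-batch
aggregation, not Spark code. The "type,field,value" record layout, the
function names, and batching by record count (Spark batches by time) are
all assumptions made for illustration.

```python
from collections import defaultdict


def parse_record(line):
    """Split one comma-separated record into (type, field, value).

    The three-column layout is a hypothetical example, not from the thread.
    """
    record_type, field, value = line.strip().split(",")
    return record_type, field, float(value)


def aggregate_batch(lines):
    """Sum the values of one batch, keyed by (type, field)."""
    totals = defaultdict(float)
    for line in lines:
        record_type, field, value = parse_record(line)
        totals[(record_type, field)] += value
    return dict(totals)


def micro_batches(records, batch_size):
    """Yield consecutive slices of the input as stand-ins for batches.

    Spark Streaming cuts batches by time interval; slicing by count here
    only keeps the sketch deterministic.
    """
    for start in range(0, len(records), batch_size):
        yield records[start:start + batch_size]


if __name__ == "__main__":
    stream = [
        "click,page_a,1",
        "click,page_a,2",
        "view,page_b,5",
        "click,page_b,3",
    ]
    for batch in micro_batches(stream, 2):
        # In a real job this is where each batch's totals would be
        # written to the datastore (for example HBase).
        print(aggregate_batch(batch))
```

In an actual Spark Streaming job the same shape would probably be a map
from each line to a key/value pair followed by a reduce by key on the
DStream, with the write to HBase done per batch.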

