Posted to user@spark.apache.org by Bruckwald Tamás <ta...@freemail.hu> on 2016/07/05 12:15:23 UTC

Read Kafka topic in a Spark batch job

Hello,

I'm writing a Spark (v1.6.0) batch job which reads from a Kafka topic. For this I can use org.apache.spark.streaming.kafka.KafkaUtils#createRDD; however, I need to set the offsets for all the partitions and also need to store them somewhere (ZK? HDFS?) to know where to start the next batch job from.

What is the right approach to read from Kafka in a batch job?

I'm also thinking about writing a streaming job instead, which reads with auto.offset.reset=smallest and saves the checkpoint to HDFS, and then in the next run starts from that. But in this case, how can I fetch just once and stop streaming after the first batch?

I posted this question on StackOverflow recently (http://stackoverflow.com/q/38026627/4020050) but got no answer there, so I'd ask here as well, hoping to get some ideas on how to resolve this issue.

Thanks - Bruckwald

Re: Read Kafka topic in a Spark batch job

Posted by Cody Koeninger <co...@koeninger.org>.
If it's a batch job, don't use a stream.

You have to store the offsets reliably somewhere regardless.  So it sounds
like your only issue is with identifying offsets per partition?  Look at
KafkaCluster.scala, methods getEarliestLeaderOffsets /
getLatestLeaderOffsets.
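
To make that concrete, here is a minimal sketch of such a batch job, assuming Spark 1.6.0 with the spark-streaming-kafka (Kafka 0.8) artifact on the classpath; the broker list and topic name are placeholders:

```scala
import kafka.common.TopicAndPartition
import kafka.serializer.StringDecoder
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.kafka.{KafkaCluster, KafkaUtils, OffsetRange}

object KafkaBatchRead {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("kafka-batch-read"))
    // Placeholder broker list and topic name.
    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
    val topic = "my-topic"

    val kc = new KafkaCluster(kafkaParams)
    val partitions: Set[TopicAndPartition] =
      kc.getPartitions(Set(topic)).right.get

    // Earliest offsets as the starting point; a real job would instead load
    // the offsets saved by the previous run.
    val from  = kc.getEarliestLeaderOffsets(partitions).right.get
    val until = kc.getLatestLeaderOffsets(partitions).right.get

    val offsetRanges: Array[OffsetRange] = partitions.toArray.map { tp =>
      OffsetRange(tp.topic, tp.partition, from(tp).offset, until(tp).offset)
    }

    val rdd = KafkaUtils.createRDD[String, String, StringDecoder, StringDecoder](
      sc, kafkaParams, offsetRanges)
    println(s"Read ${rdd.count()} messages")

    // Persist `until` (to ZK, HDFS, or any durable store) so the next
    // batch run starts where this one stopped.
    sc.stop()
  }
}
```

The key point is that the job itself owns the offset bookkeeping: the upper bounds you read to in this run become the lower bounds of the next run.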

On Tue, Jul 5, 2016 at 7:40 AM, Bruckwald Tamás
<tamas.bruckwald@freemail.hu> wrote:

> Thanks for your answer. Unfortunately I'm bound to Kafka 0.8.2.1.
> --Bruckwald

Re: Read Kafka topic in a Spark batch job

Posted by Bruckwald Tamás <ta...@freemail.hu>.
Thanks for your answer. Unfortunately I'm bound to Kafka 0.8.2.1.
--Bruckwald

Re: Read Kafka topic in a Spark batch job

Posted by nihed mbarek <ni...@gmail.com>.
Hi,

Are you using a new version of Kafka? If so: since 0.9, the auto.offset.reset parameter takes:

   - earliest: automatically reset the offset to the earliest offset
   - latest: automatically reset the offset to the latest offset
   - none: throw an exception to the consumer if no previous offset is found
   for the consumer's group
   - anything else: throw an exception to the consumer.

https://kafka.apache.org/documentation.html
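
With the 0.9+ consumer, this is set on the consumer properties. A minimal sketch (broker and group names are placeholders, and it assumes the kafka-clients 0.9+ jar):

```scala
import java.util.Properties
import org.apache.kafka.clients.consumer.KafkaConsumer

val props = new Properties()
props.put("bootstrap.servers", "broker1:9092") // placeholder broker
props.put("group.id", "batch-reader")          // placeholder group
props.put("key.deserializer",
  "org.apache.kafka.common.serialization.StringDeserializer")
props.put("value.deserializer",
  "org.apache.kafka.common.serialization.StringDeserializer")
// Start from the earliest available offset when no committed offset exists.
props.put("auto.offset.reset", "earliest")

val consumer = new KafkaConsumer[String, String](props)
```

Note that auto.offset.reset only applies when the consumer group has no committed offset (or the committed offset is out of range); otherwise the committed offset wins.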


Regards,

On Tue, Jul 5, 2016 at 2:15 PM, Bruckwald Tamás
<tamas.bruckwald@freemail.hu> wrote:

> Hello,
>
> I'm writing a Spark (v1.6.0) batch job which reads from a Kafka topic.
> For this I can use org.apache.spark.streaming.kafka.KafkaUtils#createRDD
> however, I need to set the offsets for all the partitions and also need to
> store them somewhere (ZK? HDFS?) to know from where to start the next batch
> job.
> What is the right approach to read from Kafka in a batch job?
>
> I'm also thinking about writing a streaming job instead, which reads from
> auto.offset.reset=smallest and saves the checkpoint to HDFS and then in the
> next run it starts from that.
> But in this case how can I just fetch once and stop streaming after the
> first batch?
>
> I posted this question on StackOverflow recently (
> http://stackoverflow.com/q/38026627/4020050) but got no answer there, so
> I'd ask here as well, hoping that I get some ideas on how to resolve this
> issue.
>
> Thanks - Bruckwald
>



-- 

M'BAREK Med Nihed,
Fedora Ambassador, TUNISIA, Northern Africa
http://www.nihed.com

<http://tn.linkedin.com/in/nihed>