Posted to dev@spark.apache.org by Renyi Xiong <re...@gmail.com> on 2016/05/02 22:48:11 UTC
Re: Spark streaming Kafka receiver WriteAheadLog question
sorry, I removed others by mistake
thanks a lot, Mario, for explaining. Appreciate it.
On Sun, May 1, 2016 at 11:51 PM, Mario Ds Briggs <ma...@in.ibm.com>
wrote:
> Not sure if it was a mistake that you removed others and the group on this
> response
>
> >>
>
> the data duplication inefficiency (replication to WAL)
> <<
>
> You have covered this in 'direct mode's offset based Kafka fetch
> without the extra cost of WAL'. That was exactly what I was referring
> to.
>
> >>
> single version of the truth of the offsets processed
> <<
> From the docs at
> http://spark.apache.org/docs/latest/streaming-kafka-integration.html
>
> *Exactly-once semantics:* ... *there is a small chance some records may
> get consumed twice under some failures. This occurs because of
> inconsistencies between data reliably received by Spark Streaming and
> offsets tracked by Zookeeper.*
>
>
> thanks
> Mario
>
> From: Renyi Xiong <re...@gmail.com>
> To: Mario Ds Briggs/India/IBM@IBMIN
> Date: 01/05/2016 03:34 am
> Subject: Re: Spark streaming Kafka receiver WriteAheadLog question
> ------------------------------
>
>
>
> Hi,
>
> Thanks a lot, Cody and Mario, for your comments.
>
> Actually, my question is whether it is possible to have the benefits of
> both direct and receiver mode, i.e.:
>
> 1. direct mode's offset based Kafka fetch without the extra cost of WAL
> 2. receiver mode's Kafka pre-fetch without the extra latency of direct
> mode.
>
> Mario,
>
> I don't quite get your comment b. Did you mean the WAL is needed because
> of the nature of receiver mode? Can you explain a little bit more?
>
> thanks a lot,
> Renyi.
>
> On Tue, Apr 26, 2016 at 4:09 AM, Mario Ds Briggs
> <mario.briggs@in.ibm.com> wrote:
>
> That was my initial thought as well. But then I was wondering if this
> approach could help remove
> a - the little extra latency overhead we have with the Direct approach
> (compared to Receiver) and
> b - the data duplication inefficiency (replication to WAL) and the lack of
> a single version of the truth for the offsets processed (under some
> failures) in the Receiver approach.
>
> thanks
> Mario
>
> ----- Message from Cody Koeninger <cody@koeninger.org> on Mon, 25 Apr
> 2016 09:23:32 -0500 -----
>
> To: Renyi Xiong <renyixiong0@gmail.com>
> cc: dev <dev@spark.apache.org>
> Subject: Re: Spark streaming Kafka receiver WriteAheadLog question
>
> If you want to refer back to Kafka based on offset ranges, why not use
> createDirectStream?
>
> On Fri, Apr 22, 2016 at 11:49 PM, Renyi Xiong <renyixiong0@gmail.com>
> wrote:
> > Hi,
> >
> > Is it possible for a Kafka receiver-generated WriteAheadLogBackedBlockRDD
> > to hold the corresponding Kafka offset range, so that during recovery the
> > RDD can refer back to the Kafka queue instead of paying the cost of the
> > write ahead log?
> >
> > I guess there must be a reason here. Could anyone please help me
> > understand?
> >
> > Thanks,
> > Renyi.