Posted to dev@spark.apache.org by Renyi Xiong <re...@gmail.com> on 2016/05/02 22:48:11 UTC
Re: Spark streaming Kafka receiver WriteAheadLog question

Sorry, I removed the others by mistake.

Thanks a lot, Mario, for explaining. I appreciate it.

On Sun, May 1, 2016 at 11:51 PM, Mario Ds Briggs <ma...@in.ibm.com>
wrote:

> Not sure whether it was a mistake that you removed the others and the
> group from this response.
>
>    >>
>    the data duplication inefficiency (replication to WAL)
>    <<
>
>    You have covered this in 'direct mode's offset based Kafka fetch
>    without the extra cost of WAL'. That was exactly what I was referring
>    to.
>
>    >>
>    single version of the truth of the offsets processed
>    <<
>
>    From the docs at
>    http://spark.apache.org/docs/latest/streaming-kafka-integration.html
>
> *Exactly-once semantics:* ... *there is a small chance some records may
> get consumed twice under some failures. This occurs because of
> inconsistencies between data reliably received by Spark Streaming and
> offsets tracked by Zookeeper.*
>
>
> thanks
> Mario
>
> From: Renyi Xiong <re...@gmail.com>
> To: Mario Ds Briggs/India/IBM@IBMIN
> Date: 01/05/2016 03:34 am
> Subject: Re: Spark streaming Kafka receiver WriteAheadLog question
> ------------------------------
>
>
>
> Hi,
>
> Thanks a lot, Cody and Mario, for your comments.
>
> Actually, my question is whether it is possible to have the benefits of
> both direct and receiver mode, i.e.:
>
> 1. direct mode's offset-based Kafka fetch without the extra cost of WAL
> 2. receiver mode's Kafka pre-fetch without the extra latency of direct
> mode.
>
> Mario,
>
> I don't quite get your comment b. Did you mean the WAL is due to the
> nature of receiver mode? Can you explain a little bit more?
>
> thanks a lot,
> Renyi.
>
> On Tue, Apr 26, 2016 at 4:09 AM, Mario Ds Briggs <ma...@in.ibm.com> wrote:
>
>    That was my initial thought as well. But then I was wondering if this
>    approach could help remove
>    a - the little extra latency overhead we have with the Direct approach
>    (compared to Receiver) and
>    b - the data duplication inefficiency (replication to WAL) and the lack
>    of a single version of the truth of the offsets processed (under some
>    failures) in the Receiver approach.
>
>    thanks
>    Mario
>
>    ----- Message from Cody Koeninger <co...@koeninger.org> on Mon, 25 Apr 2016 09:23:32 -0500 -----
>
>    *To:* Renyi Xiong <re...@gmail.com>
>    *cc:* dev <de...@spark.apache.org>
>    *Subject:* Re: Spark streaming Kafka receiver WriteAheadLog question
>
>    If you want to refer back to Kafka based on offset ranges, why not use
>    createDirectStream?
>
>    On Fri, Apr 22, 2016 at 11:49 PM, Renyi Xiong <re...@gmail.com> wrote:
>    > Hi,
>    >
>    > Is it possible for the Kafka receiver-generated
>    > WriteAheadLogBackedBlockRDD to hold the corresponding Kafka offset
>    > range, so that during recovery the RDD can refer back to the Kafka
>    > queue instead of paying the cost of the write ahead log?
>    >
>    > I guess there must be a reason here. Could anyone please help me
>    > understand?
>    >
>    > Thanks,
>    > Renyi.