Posted to user@spark.apache.org by RodrigoB <ro...@aspect.com> on 2014/12/02 18:59:38 UTC

Re: Low Level Kafka Consumer for Spark

Hi Dibyendu,

What are your thoughts on keeping this solution (or not), considering that Spark Streaming v1.2 will have built-in recoverability of the received data?
https://issues.apache.org/jira/browse/SPARK-1647

I'm concerned about the added complexity and the performance overhead of writing large amounts of data into HDFS on a small batch interval.
https://docs.google.com/document/d/1vTCB5qVfyxQPlHuv8rit9-zjdttlgaSrMgfCDQlCJIM/edit?pli=1#
<http://apache-spark-user-list.1001560.n3.nabble.com/file/n20181/spark_streaming_v.png>

I think the whole solution is well designed and thought out, but I'm afraid it doesn't really fit all needs for checkpoint-based technologies like Flume or Kafka, because it hides the management of the offset from the user code. If, instead of saving the received data into HDFS, the ReceiverHandler saved some metadata (such as the offset, in the case of Kafka) specified by the custom receiver passed into the StreamingContext, then upon driver restart that metadata could be used by the custom receiver to recover the point from which it should start receiving data once more. There is a very rough sketch of what I mean after my signature.

Anyone's comments will be greatly appreciated.

Tnks,
Rod
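
P.S. To make the idea a bit more concrete, here is a rough Scala sketch. This is pure pseudocode on my part: the ReceiverHandler metadata hook and the lastCheckpointedOffset helper are hypothetical, and nothing like this exists in Spark today.

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// Hypothetical receiver that recovers from checkpointed *metadata* (the offset)
// instead of replaying data blocks written to HDFS.
class OffsetRecoveringKafkaReceiver(topic: String)
  extends Receiver[Array[Byte]](StorageLevel.MEMORY_AND_DISK_SER) {

  def onStart(): Unit = {
    // Ask the (imaginary) metadata store for the offset last checkpointed by the driver.
    val startOffset = lastCheckpointedOffset(topic)
    // ... connect with the Kafka low-level API, fetch from startOffset, call store()
    // on the messages, and periodically hand the current offset back as metadata ...
  }

  def onStop(): Unit = ()

  // Hypothetical: would read whatever offset the ReceiverHandler persisted for us.
  private def lastCheckpointedOffset(topic: String): Long = ???
}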




Re: Low Level Kafka Consumer for Spark

Posted by Luis Ángel Vicente Sánchez <la...@gmail.com>.
My main complaint about the WAL mechanism in the new reliable Kafka receiver
is that you have to enable checkpointing, and for some reason, even if
spark.cleaner.ttl is set to a reasonable value, only the metadata is
cleaned periodically. In my tests, using a folder on my filesystem as the
checkpoint folder, the receivedMetaData folder remains almost constant in
size, but the receivedData folder keeps growing; spark.cleaner.ttl was
configured to 300 seconds. My setup is roughly the snippet below.
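
This is a minimal sketch of how I am configuring it; the app name, batch interval and checkpoint path are just placeholders, and the WAL property name is the one I believe Spark 1.2 uses:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WalCheckpointSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("kafka-wal-test")                                  // placeholder
      .set("spark.streaming.receiver.writeAheadLog.enable", "true")  // the new WAL switch
      .set("spark.cleaner.ttl", "300")                               // 300 seconds, as in my tests

    val ssc = new StreamingContext(conf, Seconds(10))                // placeholder batch interval
    ssc.checkpoint("file:///tmp/spark-checkpoint")                   // local folder as checkpoint dir

    // ... create the reliable Kafka stream and the rest of the job here ...
    // Under the checkpoint folder, receivedMetaData stays roughly constant in size,
    // but receivedData keeps growing despite the cleaner TTL.
    ssc.start()
    ssc.awaitTermination()
  }
}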


Re: Low Level Kafka Consumer for Spark

Posted by Dibyendu Bhattacharya <di...@gmail.com>.
Hi,

Yes, as Jerry mentioned, SPARK-3129
(https://issues.apache.org/jira/browse/SPARK-3129) enabled the WAL feature,
which solves the driver-failure problem. The way SPARK-3129 is designed, it
solves that problem agnostic of the source of the stream (Kafka, Flume,
etc.). But with just SPARK-3129 you cannot achieve a complete solution for
data loss: you also need a reliable receiver that solves the data-loss issue
on receiver failure.

The Low Level Consumer (https://github.com/dibbhatt/kafka-spark-consumer),
for which this email thread was started, solves that problem using the Kafka
low-level API.

And SPARK-4062, which Jerry mentioned, also recently solved the same problem
using the Kafka high-level API.

On the Kafka high-level consumer API approach, I would like to mention that
Kafka 0.8 has a known issue, described in this wiki
(https://cwiki.apache.org/confluence/display/KAFKA/Consumer+Client+Re-Design),
where consumer re-balance sometimes fails; that is one of the key reasons
Kafka is rewriting the consumer API for Kafka 0.9.

I know a few folks have already faced this re-balancing issue while using
the Kafka high-level API, and if you ask my opinion, we at Pearson are still
using the Low Level Consumer, as it seems more robust and performant; we
have been using it for a few months without any issue... and also I may be a
little biased :)
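
Just to illustrate why the low-level API gives you that control, this is roughly what a fetch looks like with the Kafka 0.8 SimpleConsumer (written from memory, so treat the timeouts, buffer sizes and client id as placeholders): you name the exact offset to read from, so after a restart you simply resume from whatever offset you persisted yourself, and no consumer-group rebalancing is involved.

import kafka.api.FetchRequestBuilder
import kafka.javaapi.consumer.SimpleConsumer

object LowLevelFetchSketch {
  def fetchFrom(host: String, port: Int, topic: String, partition: Int, offset: Long): Unit = {
    val consumer = new SimpleConsumer(host, port, 10000, 64 * 1024, "sample-client")
    val request = new FetchRequestBuilder()
      .clientId("sample-client")
      .addFetch(topic, partition, offset, 100000) // the caller decides the starting offset
      .build()
    val response = consumer.fetch(request)
    // Iterate response.messageSet(topic, partition), process the messages, and keep
    // track of the next offset yourself; persist it wherever you like (ZK, HDFS, ...).
    consumer.close()
  }
}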

Regards,
Dibyendu




RE: Low Level Kafka Consumer for Spark

Posted by "Shao, Saisai" <sa...@intel.com>.
Hi Rod,

The purpose of introducing the WAL mechanism in Spark Streaming as a general solution is to let all receivers benefit from it.

Though, as you said, for external sources like Kafka that have their own checkpoint mechanism, we could store only metadata in the WAL instead of the data itself, and recover from the last committed offsets. But that requires a sophisticated Kafka receiver design involving the low-level API, and we would need to handle rebalance and fault tolerance ourselves. So right now, instead of implementing a whole new receiver, we chose to implement a simple one; though the performance is not as good, it is much easier to understand and maintain.

The design purpose and implementation of the reliable Kafka receiver can be found in SPARK-4062 (https://issues.apache.org/jira/browse/SPARK-4062). Improving the reliable Kafka receiver along the lines you mention is on our schedule for the future. Very roughly, the simple approach looks like the sketch below.
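
(This is only a conceptual illustration, not the actual SPARK-4062 code; the class and method names here are made up.) With the WAL enabled, store() returns only after the block is persisted, and the offsets are committed only after that, so anything not yet persisted is simply re-fetched after a failure instead of being lost:

import kafka.consumer.ConsumerConnector
import kafka.message.MessageAndMetadata
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// Conceptual sketch of the "simple" reliable receiver idea.
abstract class SketchReliableKafkaReceiver(connector: ConsumerConnector)
  extends Receiver[Array[Byte]](StorageLevel.MEMORY_AND_DISK_SER) {

  protected def storeBlockAndCommit(
      block: Seq[MessageAndMetadata[Array[Byte], Array[Byte]]]): Unit = {
    store(block.map(_.message()).toIterator) // blocks until the data is written to the WAL
    connector.commitOffsets()                // commit offsets only after the data is safe
  }
}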

Thanks
Jerry



Re: Low Level Kafka Consumer for Spark

Posted by RodrigoB <ro...@aspect.com>.
Dibyendu,

Just to make sure I am not misunderstood: my concerns refer to the upcoming
Spark solution, not yours. I would like to gather the perspective of someone
who implemented recovery with Kafka in a different way.

Tnks,
Rod


