Posted to issues@spark.apache.org by "Tathagata Das (JIRA)" <ji...@apache.org> on 2015/01/27 08:48:34 UTC

[jira] [Comment Edited] (SPARK-4964) Exactly-once + WAL-free Kafka Support in Spark Streaming

    [ https://issues.apache.org/jira/browse/SPARK-4964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14293129#comment-14293129 ] 

Tathagata Das edited comment on SPARK-4964 at 1/27/15 7:47 AM:
---------------------------------------------------------------

I am renaming this JIRA to "Exactly-once + WAL-free Kafka Support in Spark Streaming" because there are two problems that we are trying to solve, both of which are solved by the associated PR. See the design doc for more details.


was (Author: tdas):
I am renaming this JIRA to "Native Kafka Support" because there are two problems that we are trying to solve, both of which are solved by the associated PR.

> Exactly-once + WAL-free Kafka Support in Spark Streaming
> --------------------------------------------------------
>
>                 Key: SPARK-4964
>                 URL: https://issues.apache.org/jira/browse/SPARK-4964
>             Project: Spark
>          Issue Type: Improvement
>          Components: Streaming
>            Reporter: Cody Koeninger
>
> For background, see http://apache-spark-developers-list.1001551.n3.nabble.com/Which-committers-care-about-Kafka-td9827.html
> Requirements:
> - allow client code to implement exactly-once end-to-end semantics for Kafka messages, in cases where their output storage is either idempotent or transactional (see the sketch after this list)
> - allow client code access to Kafka offsets, rather than automatically committing them
> - do not assume Zookeeper as a repository for offsets (for the transactional case, offsets need to be stored in the same store as the data)
> - allow failure recovery without lost or duplicated messages, even in cases where a checkpoint cannot be restored (for instance, because code must be updated)
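> For the idempotent case, a minimal sketch of what client code might do; the helper and the `kvStore` below are hypothetical, not part of the proposal:
>
>     // Derive a deterministic key from the Kafka coordinates, so that
>     // replaying the same offset range overwrites the same rows instead
>     // of appending duplicates. `kvStore` is a hypothetical key-value store.
>     def deterministicKey(topic: String, partition: Int, offset: Long): String =
>       s"$topic-$partition-$offset"
>
>     // kvStore.put(deterministicKey("events", 0, 12345L), value)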
> Design:
> The basic idea is to make an RDD where each partition corresponds to a given Kafka topic, partition, starting offset, and ending offset.  That allows deterministic replay of data from Kafka (as long as there is enough log retention).
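> To illustrate the partition definition, a sketch in Scala; names like OffsetRange are illustrative only, not the final API:
>
>     // Each RDD partition is pinned to a fixed Kafka offset range, so
>     // recomputing a partition always re-reads exactly the same messages
>     // (assuming the Kafka log still retains them).
>     case class OffsetRange(
>         topic: String,
>         partition: Int,
>         fromOffset: Long,   // inclusive
>         untilOffset: Long)  // exclusive
>
>     // One Spark partition per range, e.g.:
>     // OffsetRange("events", partition = 0, fromOffset = 100L, untilOffset = 200L)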
> Client code is responsible for committing offsets, either transactionally to the same store that data is being written to, or in the case of idempotent data, after data has been written.
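> For the transactional case, a sketch of what client code might look like, reusing the hypothetical OffsetRange above; the JDBC URL and table names are made up for illustration:
>
>     import java.sql.DriverManager
>
>     def writePartition(records: Iterator[(String, String)], range: OffsetRange): Unit = {
>       val conn = DriverManager.getConnection("jdbc:postgresql://db/example")
>       try {
>         conn.setAutoCommit(false)
>         val insert = conn.prepareStatement("INSERT INTO results (k, v) VALUES (?, ?)")
>         records.foreach { case (k, v) =>
>           insert.setString(1, k); insert.setString(2, v); insert.executeUpdate()
>         }
>         // Offsets are committed in the SAME transaction as the data, so on
>         // replay after a failure either both were stored or neither was.
>         val upd = conn.prepareStatement(
>           "UPDATE offsets SET until_offset = ? WHERE topic = ? AND part = ?")
>         upd.setLong(1, range.untilOffset)
>         upd.setString(2, range.topic)
>         upd.setInt(3, range.partition)
>         upd.executeUpdate()
>         conn.commit()
>       } finally {
>         conn.close()
>       }
>     }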
> A PR with a sample implementation for both the batch and DStream cases is forthcoming.


