Posted to commits@samza.apache.org by ya...@apache.org on 2015/06/19 02:03:19 UTC

samza git commit: SAMZA-716: fixed broken link in Spark Streaming comparison page

Repository: samza
Updated Branches:
  refs/heads/master f0f94cda2 -> c04daabd3


SAMZA-716: fixed broken link in Spark Streaming comparison page


Project: http://git-wip-us.apache.org/repos/asf/samza/repo
Commit: http://git-wip-us.apache.org/repos/asf/samza/commit/c04daabd
Tree: http://git-wip-us.apache.org/repos/asf/samza/tree/c04daabd
Diff: http://git-wip-us.apache.org/repos/asf/samza/diff/c04daabd

Branch: refs/heads/master
Commit: c04daabd3875c229702a27a0ed104bdb567102e7
Parents: f0f94cd
Author: Aleksandar Bircakovic <a....@levi9.com>
Authored: Thu Jun 18 17:02:58 2015 -0700
Committer: Yan Fang <ya...@gmail.com>
Committed: Thu Jun 18 17:02:58 2015 -0700

----------------------------------------------------------------------
 docs/learn/documentation/versioned/comparisons/spark-streaming.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/samza/blob/c04daabd/docs/learn/documentation/versioned/comparisons/spark-streaming.md
----------------------------------------------------------------------
diff --git a/docs/learn/documentation/versioned/comparisons/spark-streaming.md b/docs/learn/documentation/versioned/comparisons/spark-streaming.md
index e1ccc3e..d11e8b1 100644
--- a/docs/learn/documentation/versioned/comparisons/spark-streaming.md
+++ b/docs/learn/documentation/versioned/comparisons/spark-streaming.md
@@ -42,7 +42,7 @@ Samza guarantees processing the messages as the order they appear in the partiti
 
 ### Fault-tolerance semantics
 
-Spark Streaming has different fault-tolerance semantics for different data sources. Here, for a better comparison, only discuss the semantic when using Spark Streaming with Kafka. In Spark 1.2, Spark Streaming provides at-least-once semantic in the receiver side (See the [post](https://databricks.com/blog/2015/01/15/improved-driver-fault-tolerance-and-zero-data-loss-in-spark-streaming.html])). In Spark 1.3, it uses the no-receiver approach ([more detail](https://spark.apache.org/docs/latest/streaming-kafka-integration.html#approach-2-direct-approach-no-receivers)), which provides some benefits. However, it still does not guarantee exactly-once semantics for output actions. Because the side-effecting output operations maybe repeated when the job fails and recovers from the checkpoint. If the updates in your output operations are not idempotent or transactional (such as send messages to a Kafka topic), you will get duplicated messages. Do not be confused by the "exactly-once semantic" in Spark Streaming guide. This only means a given item is only processed once and always gets the same result (Also check the "Delivery Semantics" section [posted](http://blog.cloudera.com/blog/2015/03/exactly-once-spark-streaming-from-apache-kafka/) by Cloudera).
+Spark Streaming has different fault-tolerance semantics for different data sources. Here, for a better comparison, only discuss the semantic when using Spark Streaming with Kafka. In Spark 1.2, Spark Streaming provides at-least-once semantic in the receiver side (See the [post](https://databricks.com/blog/2015/01/15/improved-driver-fault-tolerance-and-zero-data-loss-in-spark-streaming.html)). In Spark 1.3, it uses the no-receiver approach ([more detail](https://spark.apache.org/docs/latest/streaming-kafka-integration.html#approach-2-direct-approach-no-receivers)), which provides some benefits. However, it still does not guarantee exactly-once semantics for output actions. Because the side-effecting output operations maybe repeated when the job fails and recovers from the checkpoint. If the updates in your output operations are not idempotent or transactional (such as send messages to a Kafka topic), you will get duplicated messages. Do not be confused by the "exactly-once semantic" in Spark Streaming guide. This only means a given item is only processed once and always gets the same result (Also check the "Delivery Semantics" section [posted](http://blog.cloudera.com/blog/2015/03/exactly-once-spark-streaming-from-apache-kafka/) by Cloudera).
 
 Samza provides an at-least-once message delivery guarantee. When the job failure happens, it restarts the containers and reads the latest offset from the [checkpointing](../container/checkpointing.html). When a Samza job recovers from a failure, it's possible that it will process some data more than once. This happens because the job restarts at the last checkpoint, and any messages that had been processed between that checkpoint and the failure are processed again. The amount of reprocessed data can be minimized by setting a small checkpoint interval period.
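The reprocessing behavior described in the patched page (restart from the last checkpoint, replay of messages processed since then, deduplication left to an idempotent sink) can be sketched as follows. This is a minimal illustrative simulation, not Samza or Spark API code; all function and variable names here are hypothetical.

```python
# Simulate at-least-once delivery: after a failure, the job restarts from
# the last checkpoint, so messages between the checkpoint and the failure
# point are delivered a second time.
def deliver_at_least_once(messages, checkpoint_index, failure_index):
    """Return the delivery stream a consumer observes across one failure."""
    first_pass = messages[:failure_index]          # processed before the crash
    replay = messages[checkpoint_index:]           # restart from checkpoint
    return first_pass + replay

def idempotent_sink(deliveries):
    """Deduplicate by message id so replayed deliveries have no extra effect."""
    seen = set()
    out = []
    for msg_id, payload in deliveries:
        if msg_id in seen:
            continue  # duplicate caused by the replay; safe to skip
        seen.add(msg_id)
        out.append(payload)
    return out

msgs = [(1, "a"), (2, "b"), (3, "c"), (4, "d")]
# Checkpoint taken after message 1; crash after message 3:
deliveries = deliver_at_least_once(msgs, checkpoint_index=1, failure_index=3)
# Messages 2 and 3 arrive twice, but the idempotent sink emits each once.
assert idempotent_sink(deliveries) == ["a", "b", "c", "d"]
```

Shrinking the gap between `checkpoint_index` and `failure_index` (i.e. a smaller checkpoint interval) shrinks the replayed span, which mirrors the page's point that a small checkpoint interval minimizes reprocessed data.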