Posted to commits@samza.apache.org by ya...@apache.org on 2015/06/14 10:17:16 UTC

samza git commit: SAMZA-698: Update Spark Streaming and Samza comparison doc

Repository: samza
Updated Branches:
  refs/heads/0.9.0 6b85a1a88 -> e67397691


SAMZA-698: Update Spark Streaming and Samza comparison doc


Project: http://git-wip-us.apache.org/repos/asf/samza/repo
Commit: http://git-wip-us.apache.org/repos/asf/samza/commit/e6739769
Tree: http://git-wip-us.apache.org/repos/asf/samza/tree/e6739769
Diff: http://git-wip-us.apache.org/repos/asf/samza/diff/e6739769

Branch: refs/heads/0.9.0
Commit: e673976917c26cb9af45c547d445670bfa69fd1b
Parents: 6b85a1a
Author: Yan Fang <ya...@gmail.com>
Authored: Sun Jun 14 01:16:55 2015 -0700
Committer: Yan Fang <ya...@gmail.com>
Committed: Sun Jun 14 01:16:55 2015 -0700

----------------------------------------------------------------------
 .../versioned/comparisons/spark-streaming.md    | 44 +++++++++++++-------
 1 file changed, 29 insertions(+), 15 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/samza/blob/e6739769/docs/learn/documentation/versioned/comparisons/spark-streaming.md
----------------------------------------------------------------------
diff --git a/docs/learn/documentation/versioned/comparisons/spark-streaming.md b/docs/learn/documentation/versioned/comparisons/spark-streaming.md
index b8a521f..e1ccc3e 100644
--- a/docs/learn/documentation/versioned/comparisons/spark-streaming.md
+++ b/docs/learn/documentation/versioned/comparisons/spark-streaming.md
@@ -21,22 +21,32 @@ title: Spark Streaming
 
 *People generally want to know how similar systems compare. We've done our best to fairly contrast the feature sets of Samza with other systems. But we aren't experts in these frameworks, and we are, of course, totally biased. If we have goofed anything, please let us know and we will correct it.*
 
+*This overview compares Spark Streaming 1.3.1 and Samza 0.9.0. Things may change in future versions.*
+
 [Spark Streaming](http://spark.apache.org/docs/latest/streaming-programming-guide.html) is a stream processing system that uses the core [Apache Spark](http://spark.apache.org/) API. Both Samza and Spark Streaming provide data consistency, fault tolerance, a programming API, etc. Spark's approach to streaming is different from Samza's. Samza processes messages as they are received, while Spark Streaming treats streaming as a series of deterministic batch operations. Spark Streaming groups the stream into batches of a fixed duration (such as 1 second). Each batch is represented as a Resilient Distributed Dataset ([RDD](http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf)). A neverending sequence of these RDDs is called a Discretized Stream ([DStream](http://www.cs.berkeley.edu/~matei/papers/2012/hotcloud_spark_streaming.pdf)).
 
 ### Overview of Spark Streaming
 
 Before going into the comparison, here is a brief overview of the Spark Streaming application. If you already are familiar with Spark Streaming, you may skip this part. There are two main parts of a Spark Streaming application: data receiving and data processing. 
 
-* Data receiving is accomplished by a [receiver](https://spark.apache.org/docs/latest/streaming-custom-receivers.html) which receives data and stores data in Spark (though not in an RDD at this point). 
+* Data receiving is accomplished by a [receiver](https://spark.apache.org/docs/latest/streaming-custom-receivers.html) which receives data and stores it in Spark (though not in an RDD at this point). For Kafka, the 1.3 release is experimenting with a [non-receiver approach](https://spark.apache.org/docs/latest/streaming-kafka-integration.html#approach-2-direct-approach-no-receivers).
 * Data processing transfers the data stored in Spark into the DStream. You can then apply the two [operations](https://spark.apache.org/docs/latest/streaming-programming-guide.html#operations) -- transformations and output operations -- on the DStream. The operations for DStream are a little different from what you can use for the general Spark RDD because of the streaming environment.
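+
+To make this concrete, here is a minimal sketch of a Spark Streaming application in Java. The socket source, 1-second batch duration, and class name are illustrative choices, not requirements:
+
+```java
+import org.apache.spark.SparkConf;
+import org.apache.spark.streaming.Durations;
+import org.apache.spark.streaming.api.java.JavaDStream;
+import org.apache.spark.streaming.api.java.JavaStreamingContext;
+
+public class MicroBatchSketch {
+  public static void main(String[] args) throws Exception {
+    SparkConf conf = new SparkConf().setAppName("micro-batch-sketch");
+    // Group the stream into 1-second batches; each batch becomes one RDD
+    // in the DStream.
+    JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));
+
+    // A receiver-backed DStream reading lines from a socket.
+    JavaDStream<String> lines = jssc.socketTextStream("localhost", 9999);
+
+    // A transformation (applied to every RDD in the DStream) ...
+    JavaDStream<Integer> lengths = lines.map(String::length);
+    // ... and an output operation, which triggers the actual computation.
+    lengths.print();
+
+    jssc.start();
+    jssc.awaitTermination();
+  }
+}
+```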
 
-Here is an overview of the Spark Streaming's [deploy](https://spark.apache.org/docs/latest/cluster-overview.html). Spark has a SparkContext (in SparkStreaming, it’s called [StreamingContext](https://spark.apache.org/docs/1.0.0/api/scala/index.html#org.apache.spark.streaming.StreamingContext)) object in the driver program. The SparkContext talks with cluster manager (e.g. YARN, Mesos) which then allocates resources (that is, executors) for the Spark application. And executors will run tasks sent by the SparkContext ([read more](http://spark.apache.org/docs/latest/cluster-overview.html#compenents)). In YARN’s context, one executor is equivalent to one container. Tasks are what is running in the containers. The driver program runs in the client machine that submits job ([client mode](https://spark.apache.org/docs/latest/running-on-yarn.html#launching-spark-on-yarn)) or in the application manager ([cluster mode](https://spark.apache.org/docs/latest/running-on-yarn.html#launching-spark-on-yarn)). Both data receiving and data processing are tasks for executors. One receiver (receives one input stream) is a long-running task. Processing has a bunch of tasks. All the tasks are sent to the available executors.
+Here is an overview of Spark Streaming's [deployment](https://spark.apache.org/docs/latest/cluster-overview.html). Spark has a SparkContext (in Spark Streaming, it’s called a [StreamingContext](https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.streaming.StreamingContext)) object in the driver program. The SparkContext talks to the cluster manager (e.g. YARN, Mesos), which then allocates resources (that is, executors) for the Spark application. Executors run tasks sent by the SparkContext ([read more](http://spark.apache.org/docs/latest/cluster-overview.html#components)). In YARN’s context, one executor is equivalent to one container, and tasks are what run in the containers. The driver program runs on the client machine that submits the job ([client mode](https://spark.apache.org/docs/latest/running-on-yarn.html#launching-spark-on-yarn)) or in the application master ([cluster mode](https://spark.apache.org/docs/latest/running-on-yarn.html#launching-spark-on-yarn)). Both data receiving and data processing are tasks for executors. One receiver (receiving one input stream) is a long-running task, while processing consists of many short tasks. All tasks are sent to the available executors.
+
+### Ordering of Messages
+
+Spark Streaming guarantees ordered processing of RDDs in one DStream. Since each RDD is processed in parallel, there is no ordering guarantee within an RDD. This is a design tradeoff Spark made. If you want to process messages in order within an RDD, you have to process them in one thread, which loses the benefit of parallelism.
+
+Samza guarantees processing messages in the order they appear in a partition of the stream. Samza also allows you to define a deterministic ordering of messages between partitions using a [MessageChooser](../container/streams.html).
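+
+As an illustration, below is a minimal MessageChooser sketch (assuming Samza's MessageChooser contract of update()/choose() and the BaseMessageChooser convenience class; the "priority" stream name is made up for this example):
+
+```java
+import java.util.ArrayDeque;
+import java.util.Queue;
+
+import org.apache.samza.system.IncomingMessageEnvelope;
+import org.apache.samza.system.chooser.BaseMessageChooser;
+
+/**
+ * A deterministic chooser: buffered messages from the stream named
+ * "priority" are always delivered to the task before anything else.
+ */
+public class PriorityStreamChooser extends BaseMessageChooser {
+  private final Queue<IncomingMessageEnvelope> priority = new ArrayDeque<>();
+  private final Queue<IncomingMessageEnvelope> rest = new ArrayDeque<>();
+
+  @Override
+  public void update(IncomingMessageEnvelope envelope) {
+    // Called by the container as messages become available for processing.
+    String stream = envelope.getSystemStreamPartition().getStream();
+    ("priority".equals(stream) ? priority : rest).add(envelope);
+  }
+
+  @Override
+  public IncomingMessageEnvelope choose() {
+    // Return the next envelope to process, or null if nothing is buffered.
+    return priority.isEmpty() ? rest.poll() : priority.poll();
+  }
+}
+```
+
+To wire such a chooser into a job, you would also implement a MessageChooserFactory and point the task.chooser.class property at it.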
 
-### Ordering and Guarantees
+### Fault-Tolerance Semantics
 
-Spark Streaming guarantees ordered processing of batches in a DStream. Since messages are processed in batches by side-effect-free operators, the exact ordering of messages is not important in Spark Streaming. Spark Streaming does not gurantee at-least-once or at-most-once messaging semantics because in some situations it may lose data when the driver program fails (see [fault-tolerance](#fault-tolerance)). In addition, because Spark Streaming requires transformation operations to be deterministic, it is unsuitable for nondeterministic processing, e.g. a randomized machine learning algorithm.
+Spark Streaming has different fault-tolerance semantics for different data sources. Here, for a better comparison, we only discuss the semantics when using Spark Streaming with Kafka. In Spark 1.2, Spark Streaming provides at-least-once semantics on the receiver side (see this [post](https://databricks.com/blog/2015/01/15/improved-driver-fault-tolerance-and-zero-data-loss-in-spark-streaming.html)). In Spark 1.3, it uses the no-receiver approach ([more detail](https://spark.apache.org/docs/latest/streaming-kafka-integration.html#approach-2-direct-approach-no-receivers)), which provides some benefits (such as simplified parallelism and exactly-once reads from Kafka). However, it still does not guarantee exactly-once semantics for output actions, because side-effecting output operations may be repeated when the job fails and recovers from the checkpoint. If the updates in your output operations are not idempotent or transactional (such as sending messages to a Kafka topic), you will get duplicated messages. Do not be confused by the "exactly-once semantics" in the Spark Streaming guide: it only means that a given item is processed once and always produces the same result (also see the "Delivery Semantics" section of this Cloudera [post](http://blog.cloudera.com/blog/2015/03/exactly-once-spark-streaming-from-apache-kafka/)).
 
-Samza guarantees processing the messages as the order they appear in the partition of the stream. Samza also allows you to define a deterministic ordering of messages between partitions using a [MessageChooser](../container/streams.html). It provides an at-least-once message delivery guarantee. And it does not require operations to be deterministic.
+Samza provides an at-least-once message delivery guarantee. When a job fails, it restarts the containers and reads the latest offsets from its [checkpoints](../container/checkpointing.html). When a Samza job recovers from a failure, it's possible that it will process some data more than once. This happens because the job restarts at the last checkpoint, and any messages that had been processed between that checkpoint and the failure are processed again. The amount of reprocessed data can be minimized by setting a small checkpoint interval.
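+
+For example, the checkpoint behavior is plain job configuration. A sketch of the relevant properties, assuming Kafka-backed checkpoints (the 10-second interval is an illustrative choice, not a recommendation):
+
+```properties
+# Store checkpoints (the last processed offsets) in Kafka.
+task.checkpoint.factory=org.apache.samza.checkpoint.kafka.KafkaCheckpointManagerFactory
+task.checkpoint.system=kafka
+# Commit a checkpoint every 10 seconds. A smaller interval means less
+# reprocessed data after a failure, at the cost of more commit overhead.
+task.commit.ms=10000
+```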
+
+It is possible for both Spark Streaming and Samza to achieve end-to-end exactly-once semantics if you can ensure [idempotent updates or transactional updates](https://spark.apache.org/docs/latest/streaming-programming-guide.html#semantics-of-output-operations). The link points to the Spark Streaming documentation, but the same idea works in Samza as well.
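+
+For instance, an update becomes idempotent when each result is written under a key derived from the input message's position, so that replaying the same input overwrites the same row instead of appending a duplicate. A minimal sketch (the KeyValueStore class below is a stand-in for whatever external store you write to):
+
+```java
+import java.util.Map;
+import java.util.concurrent.ConcurrentHashMap;
+
+/** Stand-in for an external key-value store with upsert semantics. */
+class KeyValueStore {
+  private final Map<String, String> rows = new ConcurrentHashMap<>();
+  void put(String key, String value) { rows.put(key, value); }
+}
+
+public class IdempotentWriter {
+  private final KeyValueStore store = new KeyValueStore();
+
+  /**
+   * The key encodes the source stream, partition and offset, so reprocessing
+   * the same input message after a failure overwrites the existing row:
+   * the end result is effectively exactly-once.
+   */
+  public void write(String stream, int partition, long offset, String result) {
+    store.put(stream + "-" + partition + "-" + offset, result);
+  }
+}
+```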
 
 ### State Management
 
@@ -54,13 +64,19 @@ One of the common use cases in state management is [stream-stream join](../conta
 
 ### Partitioning and Parallelism
 
-Spark Streaming's Parallelism is achieved by splitting the job into small tasks and sending them to executors. There are two types of [parallelism in Spark Streaming](http://spark.apache.org/docs/latest/streaming-programming-guide.html#level-of-parallelism-in-data-receiving): parallelism in receiving the stream and parallelism in processing the stream. On the receiving side, one input DStream creates one receiver, and one receiver receives one input stream of data and runs as a long-running task. So in order to parallelize the receiving process, you can split one input stream into multiple input streams based on some criteria (e.g. if you are receiving a Kafka stream with some partitions, you may split this stream based on the partition). Then you can create multiple input DStreams (so multiple receivers) for these streams and the receivers will run as multiple tasks. Accordingly, you should provide enough resources by increasing the core number of the executors or bringing up more executors. Then you can combine all the input Dstreams into one DStream during the processing if necessary. On the processing side, since a DStream is a continuous sequence of RDDs, the parallelism is simply accomplished by normal RDD operations, such as map, reduceByKey, reduceByWindow (check [here] (https://spark.apache.org/docs/latest/tuning.html#level-of-parallelism)).
+Spark Streaming's parallelism is achieved by splitting the job into small tasks and sending them to executors. There are two types of [parallelism in Spark Streaming](http://spark.apache.org/docs/latest/streaming-programming-guide.html#level-of-parallelism-in-data-receiving): parallelism in receiving the stream and parallelism in processing the stream:
+* On the receiving side, one input DStream creates one receiver, and one receiver receives one input stream of data and runs as a long-running task. So in order to parallelize the receiving process, you can split one input stream into multiple input streams based on some criteria (e.g. if you are receiving a Kafka stream with several partitions, you may split this stream by partition). Then you can create multiple input DStreams (and so multiple receivers) for these streams, and the receivers will run as multiple tasks. Accordingly, you should provide enough resources by increasing the number of cores per executor or bringing up more executors. You can then combine all the input DStreams into one DStream during processing if necessary. In Spark 1.3, the Spark Streaming + Kafka integration uses the no-receiver approach (via KafkaUtils.createDirectStream). Spark Streaming creates an RDD whose partitions map one-to-one to the Kafka partitions, which simplifies parallelism on the receiving side.
+* On the processing side, since a DStream is a continuous sequence of RDDs, the parallelism is simply accomplished by normal RDD operations, such as map, reduceByKey, reduceByWindow (check [here](https://spark.apache.org/docs/latest/tuning.html#level-of-parallelism)).
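+
+As a sketch of the processing-side parallelism (the input DStream, the word-count logic, and the parallelism of 8 are assumptions for illustration; the Iterable-returning flatMap matches the Spark 1.x Java API):
+
+```java
+import java.util.Arrays;
+
+import org.apache.spark.streaming.api.java.JavaDStream;
+import org.apache.spark.streaming.api.java.JavaPairDStream;
+
+import scala.Tuple2;
+
+public class ParallelismSketch {
+  static JavaPairDStream<String, Integer> countWords(JavaDStream<String> lines) {
+    return lines
+        // Redistribute received data across more partitions before heavy work.
+        .repartition(8)
+        .flatMap(line -> Arrays.asList(line.split(" ")))
+        .mapToPair(word -> new Tuple2<>(word, 1))
+        // The second argument sets the number of parallel reduce tasks.
+        .reduceByKey((a, b) -> a + b, 8);
+  }
+}
+```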
 
 Samza’s parallelism is achieved by splitting processing into independent [tasks](../api/overview.html) which can be parallelized. You can run multiple tasks in one container or only one task per container. That depends on your workload and latency requirement. For example, if you want to quickly [reprocess a stream](../jobs/reprocessing.html), you may increase the number of containers to one task per container. It is important to notice that one container only uses [one thread](../container/event-loop.html), which maps to exactly one CPU. This design attempts to simplify  resource management and the isolation between jobs.
 
+In Samza, you have the flexibility to define what one task contains. For example, in the Kafka use case, Samza by default groups the partitions with the same partition id into one task, which makes it easy to join different streams. Out of the box, Samza also provides a grouping strategy that assigns one partition per task. This provides maximum scalability in terms of how many containers can be used to process the input streams, and is appropriate for very high volume jobs that need no grouping of the input streams.
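+
+The grouping strategy is set in the job configuration. A sketch (class names as found in Samza 0.9; double-check against the configuration reference for your version):
+
+```properties
+# Default: input partitions with the same partition id share one task,
+# which makes joining co-partitioned streams easy.
+job.systemstreampartition.grouper.factory=org.apache.samza.container.grouper.stream.GroupByPartitionFactory
+# Alternative: one task per system-stream-partition, for maximum scale-out.
+#job.systemstreampartition.grouper.factory=org.apache.samza.container.grouper.stream.GroupBySystemStreamPartitionFactory
+# How many YARN containers the tasks are spread across.
+yarn.container.count=4
+```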
+
 ### Buffering &amp; Latency
 
-Spark streaming essentially is a sequence of small batch processes. With a fast execution engine, it can reach the latency as low as one second (from their [paper](http://www.cs.berkeley.edu/~matei/papers/2012/hotcloud_spark_streaming.pdf)). If the processing is slower than receiving, the data will be queued as DStreams in memory and the queue will keep increasing. In order to run a healthy Spark streaming application, the system should be [tuned](http://spark.apache.org/docs/latest/streaming-programming-guide.html#performance-tuning) until the speed of processing is as fast as receiving.
+Spark Streaming is essentially a sequence of small batch processes. With a fast execution engine, it can reach latency as low as one second (from their [paper](http://www.cs.berkeley.edu/~matei/papers/2012/hotcloud_spark_streaming.pdf)). From their [page](https://spark.apache.org/docs/latest/streaming-programming-guide.html#level-of-parallelism-in-data-receiving), "the recommended minimum value of block interval is about 50 ms, below which the task launching overheads may be a problem."
+
+If the processing is slower than receiving, the data will be queued as DStreams in memory and the queue will keep growing. In order to run a healthy Spark Streaming application, the system should be [tuned](http://spark.apache.org/docs/latest/streaming-programming-guide.html#performance-tuning) until the speed of processing is as fast as receiving.
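+
+For example, the receiver's block interval is one of those tuning knobs, set through SparkConf. A sketch (200 is Spark's default value in milliseconds, shown here only to illustrate the mechanism):
+
+```java
+import org.apache.spark.SparkConf;
+import org.apache.spark.streaming.Durations;
+import org.apache.spark.streaming.api.java.JavaStreamingContext;
+
+public class TuningSketch {
+  public static void main(String[] args) {
+    SparkConf conf = new SparkConf()
+        .setAppName("tuning-sketch")
+        // Each receiver cuts the incoming data into blocks every
+        // blockInterval milliseconds; the blocks become the partitions of
+        // each batch's RDD, so batchInterval / blockInterval tasks process
+        // the data received by one receiver.
+        .set("spark.streaming.blockInterval", "200");
+    // The batch interval is the lower bound on end-to-end latency.
+    JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));
+  }
+}
+```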
 
 Samza jobs can have latency in the low milliseconds when running with Apache Kafka. It has a different approach to buffering. The buffering mechanism is dependent on the input and output system. For example, when using [Kafka](http://kafka.apache.org/) as the input and output system, data is actually buffered to disk. This design decision, by sacrificing a little latency, allows the buffer to absorb a large backlog of messages when a job has fallen behind in its processing.
 
@@ -68,21 +84,19 @@ Samza jobs can have latency in the low milliseconds when running with Apache Kaf
 
 There are two kinds of failures in both Spark Streaming and Samza: worker node (running executors) failure in Spark Streaming (equivalent to container failure in Samza) and driver node (running driver program) failure (equivalent to application manager (AM) failure in Samza).
 
-When a worker node fails in Spark Streaming, it will be restarted by the cluster manager. When a container fails in Samza, the application manager will work with YARN to start a new container. 
-
-When a driver node fails in Spark Streaming, Spark’s [standalone cluster mode](http://spark.apache.org/docs/latest/spark-standalone.html) will restart the driver node automatically. But it is currently not supported in YARN and Mesos. You will need other mechanisms to restart the driver node automatically. Spark Streaming can use the checkpoint in HDFS to recreate the StreamingContext. When the AM fails in Samza, YARN will handle restarting the AM. Samza will restart all the containers if the AM restarts.
+When a worker node fails in Spark Streaming, it will be restarted by the cluster manager. When a container fails in Samza, the application master will work with YARN to start a new container. When the driver node fails in Spark Streaming, YARN/Mesos/Spark Standalone will restart it automatically. Spark Streaming can use the checkpoint in HDFS to recreate the StreamingContext.
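+
+A sketch of what driver recovery looks like in code: on startup the application asks for a StreamingContext recovered from the checkpoint directory before building a fresh one (the HDFS path and app name are placeholders; JavaStreamingContextFactory is the Spark 1.3-era API for this):
+
+```java
+import org.apache.spark.SparkConf;
+import org.apache.spark.streaming.Durations;
+import org.apache.spark.streaming.api.java.JavaStreamingContext;
+import org.apache.spark.streaming.api.java.JavaStreamingContextFactory;
+
+public class RecoverableDriver {
+  public static void main(String[] args) throws Exception {
+    final String checkpointDir = "hdfs:///user/example/streaming-checkpoint";
+
+    JavaStreamingContextFactory factory = () -> {
+      SparkConf conf = new SparkConf().setAppName("recoverable-driver");
+      JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));
+      jssc.checkpoint(checkpointDir);  // where the metadata is persisted
+      // ... define the DStream pipeline here ...
+      return jssc;
+    };
+
+    // On restart, rebuild the StreamingContext (including the DStream
+    // pipeline and pending batches) from the checkpoint instead of the factory.
+    JavaStreamingContext jssc = JavaStreamingContext.getOrCreate(checkpointDir, factory);
+    jssc.start();
+    jssc.awaitTermination();
+  }
+}
+```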
 
-In terms of data lost, there is a difference between Spark Streaming and Samza. If the input stream is active streaming system, such as Flume, Kafka, Spark Streaming may lose data if the failure happens when the data is received but not yet replicated to other nodes (also see [SPARK-1647](https://issues.apache.org/jira/browse/SPARK-1647)). Samza will not lose data when the failure happens because it has the concept of [checkpointing](../container/checkpointing.html) that stores the offset of the latest processed message and always commits the checkpoint after processing the data. There is not data lost situation like Spark Streaming has. If a container fails, it reads from the latest checkpoint. When a Samza job recovers from a failure, it's possible that it will process some data more than once. This happens because the job restarts at the last checkpoint, and any messages that had been processed between that checkpoint and the failure are processed again. The amount of reprocessed data can be minimized by setting a small checkpoint interval period.
+In Samza, YARN takes care of fault tolerance. When the application master (AM) fails, YARN handles restarting it, and Samza then restarts all the containers. When a container fails, the AM brings up a new container.
 
 ### Deployment &amp; Execution
 
 Spark has a SparkContext object to talk with cluster managers, which then allocate resources for the application. Currently Spark supports three types of cluster managers: [Spark standalone](http://spark.apache.org/docs/latest/spark-standalone.html), [Apache Mesos](http://mesos.apache.org/) and [Hadoop YARN](http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html). Besides these, Spark has a script for launching in [Amazon EC2](http://spark.apache.org/docs/latest/ec2-scripts.html).
 
-Samza only supports YARN and local execution currently.
+Samza currently supports YARN and local execution. There is also third-party [Mesos support](https://github.com/banno/samza-mesos) being integrated.
 
 ### Isolation
 
-Spark Streaming and Samza have the same isolation. Spark Streaming depends on cluster managers (e.g Mesos or YARN) and Samza depend on YARN to provide processor isolation. Different applications run in different JVMs. Data cannot be shared among different applications unless it is written to external storage. Since Samza provides out-of-box Kafka integration, it is very easy to reuse the output of other Samza jobs (see [here](../introduction/concepts.html#dataflow-graphs)).
+Spark Streaming and Samza have the same isolation model. Spark Streaming depends on cluster managers (e.g. Mesos or YARN) and Samza depends on YARN/Mesos to provide processor isolation. Different applications run in different JVMs. Data cannot be shared among different applications unless it is written to external storage. Since Samza provides out-of-the-box Kafka integration, it is very easy to reuse the output of other Samza jobs (see [here](../introduction/concepts.html#dataflow-graphs)).
 
 ### Language Support
 
@@ -98,8 +112,8 @@ Although a Storm/Spark Streaming job could in principle write its output to a me
 
 ### Maturity
 
-Spark has an active user and developer community, and recently releases 1.0.0 version. It has a list of companies that use it on its [Powered by](https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark) page. Since Spark contains Spark Streaming, Spark SQL, MLlib, GraphX and Bagel, it's tough to tell what portion of companies on the list are actually using Spark Streaming, and not just Spark.
+Spark has an active user and developer community, and recently released version 1.3.1. It has a list of companies that use it on its [Powered by](https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark) page. Since Spark contains Spark Streaming, Spark SQL, MLlib, GraphX and Bagel, it's tough to tell what portion of companies on the list are actually using Spark Streaming, and not just Spark.
 
-Samza is still young, but has just released version 0.7.0. It has a responsive community and is being developed actively. That said, it is built on solid systems such as YARN and Kafka. Samza is heavily used at LinkedIn and we hope others will find it useful as well.
+Samza is still young, but has just released version 0.9.0. It has a responsive community and is being developed actively. That said, it is built on solid systems such as YARN and Kafka. Samza is heavily used at LinkedIn and [other companies](https://cwiki.apache.org/confluence/display/SAMZA/Powered+By). We hope others will find it useful as well.
 
 ## [API Overview &raquo;](../api/overview.html)