Posted to reviews@spark.apache.org by tdas <gi...@git.apache.org> on 2014/12/10 01:42:57 UTC

[GitHub] spark pull request: [SPARK-4806] Streaming doc update for 1.2

GitHub user tdas opened a pull request:

    https://github.com/apache/spark/pull/3653

    [SPARK-4806] Streaming doc update for 1.2

    Important updates to the streaming programming guide:
    - Make the fault-tolerance properties easier to understand, with information about write-ahead logs
    - Update the information about deploying a Spark Streaming application, with information about driver HA
    - Update the receiver guide to discuss reliable vs. unreliable receivers

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/tdas/spark streaming-doc-update-1.2

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/3653.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #3653
    
----
commit 192c7a72d23f1bf28644df4fa14ff8f95d7f728c
Author: Tathagata Das <ta...@gmail.com>
Date:   2014-12-08T06:08:03Z

    Added more info about Python API, and rewrote the checkpointing section.

commit e45453b7289f01058fbb5c7db115ca85446f998a
Author: Tathagata Das <ta...@gmail.com>
Date:   2014-12-09T22:57:33Z

    Update streaming guide, added deploying section.

commit 67fcffc92df70656e105cb33095aa69664221485
Author: Tathagata Das <ta...@gmail.com>
Date:   2014-12-09T22:58:10Z

    Added cluster mode + supervise example to submitting application guide.

commit a0217c02fedfc674a05ac29df456566cf55f9ffb
Author: Tathagata Das <ta...@gmail.com>
Date:   2014-12-09T22:58:29Z

    Changed Deploying menu layout

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes to enable it, or if the feature is enabled but not
working, please contact infrastructure at infrastructure@apache.org or file
a JIRA ticket with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4806] Streaming doc update for 1.2

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3653#discussion_r21652907
  
    --- Diff: docs/streaming-flume-integration.md ---
    @@ -66,9 +66,16 @@ configuring Flume agents.
     
     ## Approach 2 (Experimental): Pull-based Approach using a Custom Sink
     Instead of Flume pushing data directly to Spark Streaming, this approach runs a custom Flume sink that allows the following.
    +
     - Flume pushes data into the sink, and the data stays buffered.
    -- Spark Streaming uses transactions to pull data from the sink. Transactions succeed only after data is received and replicated by Spark Streaming.
    -This ensures that better reliability and fault-tolerance than the previous approach. However, this requires configuring Flume to run a custom sink. Here are the configuration steps.
    +- Spark Streaming uses a [reliable Flume receiver](streaming-programming-guide.html#receiver-reliability)
    +  and transactions to pull data from the sink. Transactions succeed only after data is received and
    +  replicated by Spark Streaming.
    +
    +This ensures that stronger reliability and
    --- End diff --
    
    Can cut 'that'
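
For context on the approach under review: on the Spark side, the pull-based
stream in this diff is created with `FlumeUtils.createPollingStream` from the
spark-streaming-flume module. Below is a minimal Scala sketch, assuming that
artifact is on the classpath; the application name, sink hostname, and port
are placeholders:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.flume.FlumeUtils

    val conf = new SparkConf().setAppName("FlumePollingSketch")
    val ssc = new StreamingContext(conf, Seconds(2))

    // The reliable Flume receiver pulls data transactionally from the custom
    // sink; records are acknowledged only after Spark Streaming has received
    // and replicated them, so a failure before acknowledgement just causes
    // the buffered events to be redelivered.
    val flumeStream = FlumeUtils.createPollingStream(ssc, "sink-host", 9999)
    flumeStream.map(event => new String(event.event.getBody.array())).print()

    ssc.start()
    ssc.awaitTermination()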


[GitHub] spark pull request: [SPARK-4806] Streaming doc update for 1.2

Posted by tdas <gi...@git.apache.org>.
Github user tdas commented on the pull request:

    https://github.com/apache/spark/pull/3653#issuecomment-66626293
  
    @JoshRosen I have addressed your final comments and merged it. Thank you very much.


[GitHub] spark pull request: [SPARK-4806] Streaming doc update for 1.2

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/3653#issuecomment-66559922
  
      [Test build #24340 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24340/consoleFull) for   PR 3653 at commit [`ce299e4`](https://github.com/apache/spark/commit/ce299e48474b518c6451a9579d195e0df3a221ee).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


[GitHub] spark pull request: [SPARK-4806] Streaming doc update for 1.2

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3653#discussion_r21652851
  
    --- Diff: docs/streaming-custom-receivers.md ---
    @@ -191,9 +196,68 @@ The full source code is in the example [JavaCustomReceiver.java](https://github.
     </div>
     </div>
     
    -
    -
    -### Implementing and Using a Custom Actor-based Receiver
    +## Receiver Reliability
    +As discussed in brief in the
    +[Spark Streaming Programming Guide](streaming-programming-guide.html#receiver-reliability),
    +there are two kinds of receivers based on their reliability and fault-tolerance semantics.
    +
    +1. *Reliable Receiver* - For *reliable sources* that allow sent data to be acknowledged, a
    +  *reliable receiver* correctly acknowledges to the source that the data has been received
    +  and stored in Spark reliably (that is, replicated successfully). Usually,
    +  implementing this receiver involves careful consideration of the semantics of source
    +  acknowledgements.
    +1. *Unreliable Receiver* - These are receivers for unreliable sources that do not support
    +  acknowledging. Even for reliable sources, one may implement an unreliable receiver that
    +  do not go into the complexity of acknowledging correctly.
    +
    +To implement a *reliable receiver*, you have to use `store(multiple-records)` to store data.
    +This flavour of `store` is a blocking call which returns only after all the given records have
    +been stored inside Spark. If replication is enabled receiver's configured storage level
    +(enabled by default), then this call returns after replication has completed.
    +Thus it ensures that the data is reliably stored, and the receiver can now acknowledge the
    +source appropriately. This ensures that no data is caused when the receiver fails in the middle
    +of replicating data -- the buffered data will not be acknowledged and hence will be later resent
    +by the source.
    +
    +An *unreliable receiver* does not have to implement any of this logic. It can simply receive
    +records from the source and insert them one-at-a-time using `store(single-record)`. While it does
    +not get the reliability guarantees of `store(multiple-records)`, it has the following advantages.
    +
    +- The system takes care of chunking that data into appropriate sized blocks (look for block
    +interval in the [Spark Streaming Programming Guide](streaming-programming-guide.html)).
    +- The system takes care of controlling the receiving rates if the rate limits have been specified.
    +- Because of these two, *unreliable receivers are simpler to implement than reliable receivers.
    --- End diff --
    
    Stray asterisk near `*unreliable`?
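
For reference, the reliable-receiver contract described in this diff boils
down to "store, then acknowledge". Below is a minimal Scala sketch against
the `Receiver` API; `fetchBatch` and `sendAck` are hypothetical stand-ins
for whatever a real source's client library would provide:

    import scala.collection.mutable.ArrayBuffer
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.receiver.Receiver

    class SketchReliableReceiver
      extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

      def onStart(): Unit = {
        // Receive on a dedicated thread so onStart() returns immediately.
        new Thread("Sketch Reliable Receiver") {
          override def run(): Unit = receive()
        }.start()
      }

      def onStop(): Unit = { /* close any connections opened in onStart() */ }

      private def receive(): Unit = {
        while (!isStopped()) {
          val batch: ArrayBuffer[String] = fetchBatch()
          // Blocking store: returns only after every record in the batch has
          // been stored (and replicated, per the storage level) inside Spark.
          store(batch)
          // Acknowledge only after store() returns; if the receiver fails
          // mid-replication, the batch stays unacknowledged and the source
          // resends it.
          sendAck(batch)
        }
      }

      // Hypothetical stand-ins for the source's client library.
      private def fetchBatch(): ArrayBuffer[String] = ???
      private def sendAck(batch: ArrayBuffer[String]): Unit = ???
    }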


[GitHub] spark pull request: [SPARK-4806] Streaming doc update for 1.2

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/3653#issuecomment-66395160
  
      [Test build #24276 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24276/consoleFull) for   PR 3653 at commit [`a0217c0`](https://github.com/apache/spark/commit/a0217c02fedfc674a05ac29df456566cf55f9ffb).
     * This patch **passes all tests**.
     * This patch **does not merge cleanly**.
     * This patch adds no public classes.


[GitHub] spark pull request: [SPARK-4806] Streaming doc update for 1.2

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/3653#issuecomment-66473044
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24312/
    Test PASSed.


[GitHub] spark pull request: [SPARK-4806] Streaming doc update for 1.2

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3653#discussion_r21652039
  
    --- Diff: docs/streaming-programming-guide.md ---
    @@ -1238,49 +1333,260 @@ information on different persistence levels can be found in
     ***
     
     ## Checkpointing
    -A _stateful operation_ is one which operates over multiple batches of data. This includes all
    -window-based operations and the `updateStateByKey` operation. Since stateful operations have a
    -dependency on previous batches of data, they continuously accumulate metadata over time.
    -To clear this metadata, streaming supports periodic _checkpointing_ by saving intermediate data
    -to HDFS. Note that checkpointing also incurs the cost of saving to HDFS which may cause the
    -corresponding batch to take longer to process. Hence, the interval of checkpointing needs to be
    -set carefully. At small batch sizes (say 1 second), checkpointing every batch may significantly
    -reduce operation throughput. Conversely, checkpointing too slowly causes the lineage and task
    -sizes to grow which may have detrimental effects. Typically, a checkpoint interval of 5 - 10
    -times of sliding interval of a DStream is good setting to try.
    -
    -To enable checkpointing, the developer has to provide the HDFS path to which RDD will be saved.
    -This is done by using
    +A streaming application must operate 24/7 and hence must be resilient to failures unrelated
    +to the application logic (e.g., system failures, JVM crashes, etc.). For this to be possible,
    +Spark Streaming needs to *checkpoints* enough information to a fault-
    +tolerant storage system such that it can recover from failures. There are two types of data
    +that are checkpointed.
    +
    +- *Metadata checkpointing* - Saving of the information defining the streaming computation to
    +  fault-tolerant storage like HDFS. This is used to recover from failure of the node running the
    +  driver of the streaming application (discussed in detail later). Metadata includes:
    +  +  *Configuration* - The configuration that were used to create the streaming application.
    +  +  *DStream operations* - The set of DStream operations that define the streaming application.
    +  +  *Incomplete batches* - Batches whose jobs are queued but have not completed yet.
    +- *Data checkpointing* - Saving of the generated RDDs to reliable storage. This is necessary
    +  in some *stateful* transformations that combine data across multiple batches. In such
    +  transformations, the generated RDDs depends on RDDs of previous batches, which causes the length
    +  of the dependency chain to keep increasing with time. To avoid such unbounded increase in recovery
    +   time (proportional to dependency chain), intermediate RDDs of stateful transformations are periodically
    +  *checkpointed* to reliable storage (e.g. HDFS) to cut off the dependency chains.
    +
    +To summarize, metadata checkpointing is primarily needed for recovery from driver failures,
    +whereas data or RDD checkpointing is necessary even for basic functioning if stateful
    +transformations are used.
    +
    +#### When to enable Checkpointing
    +{:.no_toc}
    +
    +Checkpointing must be enabled for applications with any of the following requirements:
    +
    +- *Usage of stateful transformations* - If either `updateStateByKey` or `reduceByKeyAndWindow` (with
    +  inverse function) is used in the application, then the checkpoint directory must be provided for
    +  allowing periodic RDD checkpointing.
    +- *Recovering from failures of the driver running the application* - Metadata checkpoints are used
    +  for to recover with progress information.
    +
    +Note that simple streaming applications without the aforementioned stateful transformations can be
    +run without enabling checkpointing. The recovery from driver failures will also be partial in
    +that case (some received but unprocessed data may be lost). This is often acceptable and many run
    +Spark Streaming applications in this way. Support for non-Hadoop environments is expected
    +to improve in the future.
    +
    +#### How to configure Checkpointing
    +{:.no_toc}
    +
    +Checkpointing can be enabled by setting a directory in a fault-tolerant,
    +reliable file system (e.g., HDFS, S3, etc.) to which the checkpoint information will be saved.
    +This is done by using `streamingContext.checkpoint(checkpointDirectory)`. This will allow you to
    +use the aforementioned stateful transformations. Additionally,
    +if you want make the application recover from driver failures, you should rewrite your
    +streaming application to have the following behavior.
    +
    +  + When the program is being started for the first time, it will create a new StreamingContext,
    +    set up all the streams and then call start().
    +  + When the program is being restarted after failure, it will re-create a StreamingContext
    +    from the checkpoint data in the checkpoint directory.
    +
    +<div class="codetabs">
    +<div data-lang="scala" markdown="1">
    +
    +This behavior is made simple by using `StreamingContext.getOrCreate`. This is used as follows.
     
     {% highlight scala %}
    -ssc.checkpoint(hdfsPath) // assuming ssc is the StreamingContext or JavaStreamingContext
    +// Function to create and setup a new StreamingContext
    +def functionToCreateContext(): StreamingContext = {
    +    val ssc = new StreamingContext(...)   // new context
    +    val lines = ssc.socketTextStream(...) // create DStreams
    +    ...
    +    ssc.checkpoint(checkpointDirectory)   // set checkpoint directory
    +    ssc
    +}
    +
    +// Get StreamingContext from checkpoint data or create a new one
    +val context = StreamingContext.getOrCreate(checkpointDirectory, functionToCreateContext _)
    +
    +// Do additional setup on context that needs to be done,
    +// irrespective of whether it is being started or restarted
    +context. ...
    +
    +// Start the context
    +context.start()
    +context.awaitTermination()
     {% endhighlight %}
     
    -The interval of checkpointing of a DStream can be set by using
    +If the `checkpointDirectory` exists, then the context will be recreated from the checkpoint data.
    +If the directory does not exist (i.e., running for the first time),
    +then the function `functionToCreateContext` will be called to create a new
    +context and set up the DStreams. See the Scala example
    +[RecoverableNetworkWordCount]({{site.SPARK_GITHUB_URL}}/tree/master/examples/src/main/scala/org/apache/spark/examples/streaming/RecoverableNetworkWordCount.scala).
    +This example appends the word counts of network data into a file.
     
    -{% highlight scala %}
    -dstream.checkpoint(checkpointInterval)
    +</div>
    +<div data-lang="java" markdown="1">
    +
    +This behavior is made simple by using `JavaStreamingContext.getOrCreate`. This is used as follows.
    +
    +{% highlight java %}
    +// Create a factory object that can create a and setup a new JavaStreamingContext
    +JavaStreamingContextFactory contextFactory = new JavaStreamingContextFactory() {
    +  @Override public JavaStreamingContext create() {
    +    JavaStreamingContext jssc = new JavaStreamingContext(...);  // new context
    +    JavaDStream<String> lines = jssc.socketTextStream(...);     // create DStreams
    +    ...
    +    jssc.checkpoint(checkpointDirectory);                       // set checkpoint directory
    +    return jssc;
    +  }
    +};
    +
    +// Get JavaStreamingContext from checkpoint data or create a new one
    +JavaStreamingContext context = JavaStreamingContext.getOrCreate(checkpointDirectory, contextFactory);
    +
    +// Do additional setup on context that needs to be done,
    +// irrespective of whether it is being started or restarted
    +context. ...
    +
    +// Start the context
    +context.start();
    +context.awaitTermination();
    +{% endhighlight %}
    +
    +If the `checkpointDirectory` exists, then the context will be recreated from the checkpoint data.
    +If the directory does not exist (i.e., running for the first time),
    +then the function `contextFactory` will be called to create a new
    +context and set up the DStreams. See the Scala example
    +[JavaRecoverableNetworkWordCount]({{site.SPARK_GITHUB_URL}}/tree/master/examples/src/main/java/org/apache/spark/examples/streaming/JavaRecoverableNetworkWordCount.java).
    +This example appends the word counts of network data into a file.
    +
    +</div>
    +<div data-lang="python" markdown="1">
    +
    +This behavior is made simple by using `StreamingContext.getOrCreate`. This is used as follows.
    +
    +{% highlight python %}
    +# Function to create and setup a new StreamingContext
    +def functionToCreateContext():
    +    sc = SparkContext(...)   # new context
    +    ssc = new StreamingContext(...)
    +    lines = ssc.socketTextStream(...) # create DStreams
    +    ...
    +    ssc.checkpoint(checkpointDirectory)   # set checkpoint directory
    +    return ssc
    +
    +# Get StreamingContext from checkpoint data or create a new one
    +context = StreamingContext.getOrCreate(checkpointDirectory, functionToCreateContext)
    +
    +# Do additional setup on context that needs to be done,
    +# irrespective of whether it is being started or restarted
    +context. ...
    +
    +# Start the context
    +context.start()
    +context.awaitTermination()
     {% endhighlight %}
     
    -For DStreams that must be checkpointed (that is, DStreams created by `updateStateByKey` and
    -`reduceByKeyAndWindow` with inverse function), the checkpoint interval of the DStream is by
    -default set to a multiple of the DStream's sliding interval such that its at least 10 seconds.
    +If the `checkpointDirectory` exists, then the context will be recreated from the checkpoint data.
    +If the directory does not exist (i.e., running for the first time),
    +then the function `functionToCreateContext` will be called to create a new
    +context and set up the DStreams. See the Python example
    +[recoverable_network_wordcount.py]({{site.SPARK_GITHUB_URL}}/tree/master/examples/src/main/python/streaming/recoverable_network_wordcount.py).
    +This example appends the word counts of network data into a file.
    +
    +You can also explicitly create a `StreamingContext` from the checkpoint data and start the
    + computation by using `StreamingContext.getOrCreate(checkpointDirectory, None)`.
    +
    +</div>
    +</div>
    +
    +In addition to using `getOrCreate` one also needs to ensure that the driver process gets
    +restarted automatically on failure. This can only be done by the deployment infrastructure that is
    +used to run the application. This is further discussed in the
    +[Deployment](#deploying-applications.html) section.
    +
    +Note that checkpointing of RDDs incurs the cost of saving to reliable storage.
    +This may cause an increase in the processing time of those batches where RDDs get checkpointed.
    +Hence, the interval of
    +checkpointing needs to be set carefully. At small batch sizes (say 1 second), checkpointing every
    +batch may significantly reduce operation throughput. Conversely, checkpointing too infrequently
    +causes the lineage and task sizes to grow which may have detrimental effects. For stateful
    +transformations that require RDD checkpointing, the default interval is a multiple of the
    +batch interval that is at least 10 seconds. It can be set by using
    +`dstream.checkpoint(checkpointInterval)`. Typically, a checkpoint interval of 5 - 10 times of
    +sliding interval of a DStream is good setting to try.
     
     ***
     
     ## Deploying Applications
    -A Spark Streaming application is deployed on a cluster in the same way as any other Spark application.
    -Please refer to the [deployment guide](cluster-overview.html) for more details.
    +This section discusses the steps to deploy a Spark Streaming application.
     
    -Note that the applications
    -that use [advanced sources](#advanced-sources) (e.g. Kafka, Flume, Twitter) are also required to package the
    -extra artifact they link to, along with their dependencies, in the JAR that is used to deploy the application.
    -For example, an application using `TwitterUtils` will have to include
    -`spark-streaming-twitter_{{site.SCALA_BINARY_VERSION}}` and all its transitive
    -dependencies in the application JAR.
    +### Requirements
    +{:.no_toc}
     
    -If a running Spark Streaming application needs to be upgraded (with new application code), then
    -there are two possible mechanism.
    +To run a Spark Streaming applications, you need to have the following.
    +
    +- *Cluster with a cluster manager* - This is the general requirement of any Spark application,
    +  and discussed in detail in the [deployment guide](cluster-overview.html).
    +
    +- *Package the application JAR* - You have to compile your streaming application into a JAR.
    +  If you are using [`spark-submit`](submitting-applications.html) to start the
    +  application, then you will not need to provide Spark and Spark Streaming in the JAR. However,
    +  if your application uses [advanced sources](#advanced-sources) (e.g. Kafka, Flume, Twitter),
    +  then you will have to package the extra artifact they link to, along with their dependencies,
    +  in the JAR that is used to deploy the application. For example, an application using `TwitterUtils`
    +  will have to include `spark-streaming-twitter_{{site.SCALA_BINARY_VERSION}}` and all its
    +  transitive dependencies in the application JAR.
    +
    +- *Configuring sufficient memory for the executors* - Since the received data must be stored in
    +  memory, the executors must be configured with sufficient memory to hold the received data. Note
    +  that if you are doing 10 minute window operations, the system has to keep at least last 10 minutes
    +  of data in memory. So the memory requirements for the application depends on the operations
    +  used in it.
    +
    +- *Configuring checkpointing* - If the stream application requires it, then a directory in the
    +  Hadoop API compatible fault-tolerant storage (e.g. HDFS, S3, etc.) must be configured as the
    +  checkpoint directory and the streaming application written in a way that checkpoint
    +  information can be used for failure recovery. See the [checkpointing](#checkpointing) section
    +  for more details.
    +
    +- *Configuring automatic restart of the application driver* - To automatically recover from a
    +  driver failure, the deployment infrastructure that is
    +  used to run the streaming application must monitor the driver process and relaunch the driver
    +  if it fails. Different [cluster managers](cluster-overview.html#cluster-manager-types)
    +  have different tools to achieve this.
    +    + *Spark Standalone* - A Spark application driver can be submitted to run within the Spark
    +      Standalone cluster (see
    +      [cluster deploy mode](spark-standalone.html#launching-spark-applications)), that is, the
    +      application driver itself runs on one of the worker nodes. Furthermore, the
    +      Standalone cluster manager can be instructed to *supervise* the driver,
    +      and relaunch it if the driver fails either due to non-zero exit code,
    +      or due to failure of the node running the driver. See *cluster mode* and *supervise* in the
    +      [Spark Standalone guide](spark-standalone.html) for more details.
    +    + *YARN* - Yarn supports a similar mechanism for automatically restarting an application.
    +      Please refer to YARN documentation for more details.
    +    + *Mesos* - [Marathon](https://github.com/mesosphere/marathon) has been used to achieve this
    +      with Mesos.
    +
    +
    +- *[Experimental in Spark 1.2] Configuring write ahead logs* - In Spark 1.2,
    +  we have introduced a new experimental feature of write ahead logs for achieved strong
    +  fault-tolerance guarantees. If enabled,  all the data received from a receiver gets written into
    +  a write ahead log in the configuration checkpoint directory. This prevents data loss on driver
    +  recovery, thus ensuring zero data loss (discussed in detail in the
    +  [Fault-tolerance Semantics](#fault-tolerance-semantics) section). This can be enabled by setting
    +  the [configuration parameter](configuration.html#spark-streaming)
    +  `spark.streaming.receiver.writeAheadLogs.enable` to `true`. However, this stronger semantics may
    +  come at the cost of the receiving throughput of individual receivers. can be corrected by running
    --- End diff --
    
    Typo here.  Looks like maybe a word was dropped at the start of this next sentence.
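
For reference, the driver-HA and write ahead log settings discussed in this
diff come together at submit time. Below is a hedged sketch for a Spark
Standalone cluster; the master URL, application class, and JAR name are
placeholders, and the write ahead log property is spelled as in the diff
under review:

    ./bin/spark-submit \
      --master spark://master-host:7077 \
      --deploy-mode cluster \
      --supervise \
      --conf spark.streaming.receiver.writeAheadLogs.enable=true \
      --class com.example.StreamingApp \
      streaming-app-assembly.jar

With `--deploy-mode cluster` the driver runs on a worker node, and
`--supervise` tells the Standalone master to relaunch it on a non-zero exit
or node failure, matching the recovery behavior the checkpointing section
above depends on.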


[GitHub] spark pull request: [SPARK-4806] Streaming doc update for 1.2

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/3653#issuecomment-66557735
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24333/
    Test PASSed.


[GitHub] spark pull request: [SPARK-4806] Streaming doc update for 1.2

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/3653#issuecomment-66473033
  
      [Test build #24312 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24312/consoleFull) for   PR 3653 at commit [`195852c`](https://github.com/apache/spark/commit/195852c8bf3a36bfcebff54b3188eac152b010b7).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


[GitHub] spark pull request: [SPARK-4806] Streaming doc update for 1.2

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/3653#issuecomment-66473670
  
      [Test build #24313 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24313/consoleFull) for   PR 3653 at commit [`aa8bb87`](https://github.com/apache/spark/commit/aa8bb8771d08968d5564be51732c5062b2a7883a).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


[GitHub] spark pull request: [SPARK-4806] Streaming doc update for 1.2

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/3653#issuecomment-66403488
  
      [Test build #24294 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24294/consoleFull) for   PR 3653 at commit [`17b99fb`](https://github.com/apache/spark/commit/17b99fbaf699c54bf75893b98c66ec5e3fde30ba).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


[GitHub] spark pull request: [SPARK-4806] Streaming doc update for 1.2

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/3653#issuecomment-66386788
  
      [Test build #24276 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24276/consoleFull) for   PR 3653 at commit [`a0217c0`](https://github.com/apache/spark/commit/a0217c02fedfc674a05ac29df456566cf55f9ffb).
     * This patch **does not merge cleanly**.


[GitHub] spark pull request: [SPARK-4806] Streaming doc update for 1.2

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/3653#issuecomment-66459924
  
      [Test build #24313 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24313/consoleFull) for   PR 3653 at commit [`aa8bb87`](https://github.com/apache/spark/commit/aa8bb8771d08968d5564be51732c5062b2a7883a).
     * This patch merges cleanly.


[GitHub] spark pull request: [SPARK-4806] Streaming doc update for 1.2

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/3653#issuecomment-66398673
  
      [Test build #24294 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24294/consoleFull) for   PR 3653 at commit [`17b99fb`](https://github.com/apache/spark/commit/17b99fbaf699c54bf75893b98c66ec5e3fde30ba).
     * This patch merges cleanly.


[GitHub] spark pull request: [SPARK-4806] Streaming doc update for 1.2

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3653#discussion_r21652896
  
    --- Diff: docs/streaming-custom-receivers.md ---
    @@ -191,9 +196,68 @@ The full source code is in the example [JavaCustomReceiver.java](https://github.
     </div>
     </div>
     
    -
    -
    -### Implementing and Using a Custom Actor-based Receiver
    +## Receiver Reliability
    +As discussed in brief in the
    +[Spark Streaming Programming Guide](streaming-programming-guide.html#receiver-reliability),
    +there are two kinds of receivers based on their reliability and fault-tolerance semantics.
    +
    +1. *Reliable Receiver* - For *reliable sources* that allow sent data to be acknowledged, a
    +  *reliable receiver* correctly acknowledges to the source that the data has been received
    +  and stored in Spark reliably (that is, replicated successfully). Usually,
    +  implementing this receiver involves careful consideration of the semantics of source
    +  acknowledgements.
    +1. *Unreliable Receiver* - These are receivers for unreliable sources that do not support
    +  acknowledging. Even for reliable sources, one may implement an unreliable receiver that
    +  do not go into the complexity of acknowledging correctly.
    +
    +To implement a *reliable receiver*, you have to use `store(multiple-records)` to store data.
    +This flavour of `store` is a blocking call which returns only after all the given records have
    +been stored inside Spark. If replication is enabled receiver's configured storage level
    +(enabled by default), then this call returns after replication has completed.
    +Thus it ensures that the data is reliably stored, and the receiver can now acknowledge the
    +source appropriately. This ensures that no data is caused when the receiver fails in the middle
    +of replicating data -- the buffered data will not be acknowledged and hence will be later resent
    +by the source.
    +
    +An *unreliable receiver* does not have to implement any of this logic. It can simply receive
    +records from the source and insert them one-at-a-time using `store(single-record)`. While it does
    +not get the reliability guarantees of `store(multiple-records)`, it has the following advantages.
    +
    +- The system takes care of chunking that data into appropriate sized blocks (look for block
    +interval in the [Spark Streaming Programming Guide](streaming-programming-guide.html)).
    +- The system takes care of controlling the receiving rates if the rate limits have been specified.
    +- Because of these two, *unreliable receivers are simpler to implement than reliable receivers.
    +
    +The following table summarizes the characteristics of both types of receivers
    +
    +<table class="table">
    +<tr>
    +  <th>Receiver Type</th>
    +  <th>Characteristics</th>
    +</tr>
    +<tr>
    +  <td><b>Unreliable Receivers</b></td>
    +  <td>
    +    Simple to implement.<br>
    +    System takes care of block generation and rate control.
    +    No fault-tolerance guarantees, can loose data on receiver failure.
    --- End diff --
    
    loose -> lose


[GitHub] spark pull request: [SPARK-4806] Streaming doc update for 1.2

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/3653#issuecomment-66395163
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24276/
    Test PASSed.


[GitHub] spark pull request: [SPARK-4806] Streaming doc update for 1.2

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/3653#issuecomment-66559405
  
      [Test build #24338 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24338/consoleFull) for   PR 3653 at commit [`ca19078`](https://github.com/apache/spark/commit/ca19078b576dc4ae7cbffc2d6806344ce5eeb8a0).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


[GitHub] spark pull request: [SPARK-4806] Streaming doc update for 1.2

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/3653#issuecomment-66403490
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24294/
    Test PASSed.


[GitHub] spark pull request: [SPARK-4806] Streaming doc update for 1.2

Posted by tdas <gi...@git.apache.org>.
Github user tdas commented on the pull request:

    https://github.com/apache/spark/pull/3653#issuecomment-66458371
  
    @JoshRosen @pwendell @andrewor14


[GitHub] spark pull request: [SPARK-4806] Streaming doc update for 1.2

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/3653#issuecomment-66458888
  
      [Test build #24312 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24312/consoleFull) for   PR 3653 at commit [`195852c`](https://github.com/apache/spark/commit/195852c8bf3a36bfcebff54b3188eac152b010b7).
     * This patch merges cleanly.


[GitHub] spark pull request: [SPARK-4806] Streaming doc update for 1.2

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/3653#issuecomment-66627047
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24370/
    Test FAILed.


[GitHub] spark pull request: [SPARK-4806] Streaming doc update for 1.2

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/3653#issuecomment-66557731
  
      [Test build #24333 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24333/consoleFull) for   PR 3653 at commit [`7787209`](https://github.com/apache/spark/commit/7787209ccf5312f43fdfd638af140e237d327c3b).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


[GitHub] spark pull request: [SPARK-4806] Streaming doc update for 1.2

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/3653#issuecomment-66556433
  
    Left a couple of minor comments, but overall this looks good to me.


[GitHub] spark pull request: [SPARK-4806] Streaming doc update for 1.2

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/3653#issuecomment-66559929
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24340/
    Test PASSed.


[GitHub] spark pull request: [SPARK-4806] Streaming doc update for 1.2

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/3653#issuecomment-66559407
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24338/
    Test PASSed.


[GitHub] spark pull request: [SPARK-4806] Streaming doc update for 1.2

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/3653#issuecomment-66473682
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24313/
    Test PASSed.


[GitHub] spark pull request: [SPARK-4806] Streaming doc update for 1.2

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/3653


[GitHub] spark pull request: [SPARK-4806] Streaming doc update for 1.2

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on the pull request:

    https://github.com/apache/spark/pull/3653#issuecomment-66520888
  
    Is this before or after @JoshRosen's extensive comments offline?


[GitHub] spark pull request: [SPARK-4806] Streaming doc update for 1.2

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3653#discussion_r21652835
  
    --- Diff: docs/streaming-custom-receivers.md ---
    @@ -191,9 +196,68 @@ The full source code is in the example [JavaCustomReceiver.java](https://github.
     </div>
     </div>
     
    -
    -
    -### Implementing and Using a Custom Actor-based Receiver
    +## Receiver Reliability
    +As discussed in brief in the
    +[Spark Streaming Programming Guide](streaming-programming-guide.html#receiver-reliability),
    +there are two kinds of receivers based on their reliability and fault-tolerance semantics.
    +
    +1. *Reliable Receiver* - For *reliable sources* that allow sent data to be acknowledged, a
    +  *reliable receiver* correctly acknowledges to the source that the data has been received
    +  and stored in Spark reliably (that is, replicated successfully). Usually,
    +  implementing this receiver involves careful consideration of the semantics of source
    +  acknowledgements.
    +1. *Unreliable Receiver* - These are receivers for unreliable sources that do not support
    +  acknowledging. Even for reliable sources, one may implement an unreliable receiver that
    +  do not go into the complexity of acknowledging correctly.
    +
    +To implement a *reliable receiver*, you have to use `store(multiple-records)` to store data.
    +This flavour of `store` is a blocking call which returns only after all the given records have
    +been stored inside Spark. If replication is enabled receiver's configured storage level
    --- End diff --
    
    Maybe a missing word here; how about "if the receiver's configured storage level uses replication (which is enabled by default), ..."

