You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by rekhajoshm <gi...@git.apache.org> on 2017/10/01 00:21:42 UTC

[GitHub] spark pull request #19404: [SPARK-21760] [Streaming] Fix for Structured stre...

GitHub user rekhajoshm opened a pull request:

    https://github.com/apache/spark/pull/19404

    [SPARK-21760] [Streaming] Fix for Structured streaming terminates with Exception

    ## What changes were proposed in this pull request?
    Updated serialize call for flush and FSDataOutputStream hflush
    
    ## How was this patch tested?
    manual test

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/rekhajoshm/spark SPARK-21760

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19404.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #19404
    
----
commit e3677c9fa9697e0d34f9df52442085a6a481c9e9
Author: Rekha Joshi <re...@gmail.com>
Date:   2015-05-05T23:10:08Z

    Merge pull request #1 from apache/master
    
    Pulling functionality from apache spark

commit 106fd8eee8f6a6f7c67cfc64f57c1161f76d8f75
Author: Rekha Joshi <re...@gmail.com>
Date:   2015-05-08T21:49:09Z

    Merge pull request #2 from apache/master
    
    pull latest from apache spark

commit 0be142d6becba7c09c6eba0b8ea1efe83d649e8c
Author: Rekha Joshi <re...@gmail.com>
Date:   2015-06-22T00:08:08Z

    Merge pull request #3 from apache/master
    
    Pulling functionality from apache spark

commit 6c6ee12fd733e3f9902e10faf92ccb78211245e3
Author: Rekha Joshi <re...@gmail.com>
Date:   2015-09-17T01:03:09Z

    Merge pull request #4 from apache/master
    
    Pulling functionality from apache spark

commit b123c601e459d1ad17511fd91dd304032154882a
Author: Rekha Joshi <re...@gmail.com>
Date:   2015-11-25T18:50:32Z

    Merge pull request #5 from apache/master
    
    pull request from apache/master

commit c73c32aadd6066e631956923725a48d98a18777e
Author: Rekha Joshi <re...@gmail.com>
Date:   2016-03-18T19:13:51Z

    Merge pull request #6 from apache/master
    
    pull latest from apache spark

commit 7dbf7320057978526635bed09dabc8cf8657a28a
Author: Rekha Joshi <re...@gmail.com>
Date:   2016-04-05T20:26:40Z

    Merge pull request #8 from apache/master
    
    pull latest from apache spark

commit 5e9d71827f8e2e4d07027281b80e4e073e7fecd1
Author: Rekha Joshi <re...@gmail.com>
Date:   2017-05-01T23:00:30Z

    Merge pull request #9 from apache/master
    
    Pull apache spark

commit 63d99b3ce5f222d7126133170a373591f0ac67dd
Author: Rekha Joshi <re...@gmail.com>
Date:   2017-09-30T22:26:44Z

    Merge pull request #10 from apache/master
    
    pull latest apache spark

commit 89cdb3bdf70ec39a09b4e598935bb20a8f64f0cb
Author: rjoshi2 <re...@gmail.com>
Date:   2017-10-01T00:15:58Z

    Fix for Structured streaming terminating with Exception

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19404: [SPARK-21760] [Streaming] Fix for Structured streaming t...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19404
  
    **[Test build #82439 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82439/testReport)** for PR 19404 at commit [`a2d5bc7`](https://github.com/apache/spark/commit/a2d5bc706987accaeb6c9516e7d4a07b5bb3104f).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19404: [SPARK-21760] [Streaming] Fix for Structured streaming t...

Posted by rekhajoshm <gi...@git.apache.org>.
Github user rekhajoshm commented on the issue:

    https://github.com/apache/spark/pull/19404
  
    Seems to be apache spark git/jenkins issue.Please retest after a while.thanks


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19404: [SPARK-21760] [Streaming] Fix for Structured streaming t...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19404
  
    **[Test build #82433 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82433/testReport)** for PR 19404 at commit [`9cd3ee6`](https://github.com/apache/spark/commit/9cd3ee69b9ddd678b77d8b78feec51ea8c55e377).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19404: [SPARK-21760] [Streaming] Fix for Structured streaming t...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19404
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82439/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19404: [SPARK-21760] [Streaming] Fix for Structured streaming t...

Posted by steveloughran <gi...@git.apache.org>.
Github user steveloughran commented on the issue:

    https://github.com/apache/spark/pull/19404
  
    Problem here is that a stream which doesn't implement hflush/hsync is required to throw an exception; it's a way of guaranteeing that if hsync/hflush does complete, the action has done what you want - HBase &c utterly depend on this.
    
    The fact that FSDataOutputStream implements Syncable and yet streams it may relay to may not is the whole reason for [HDFS-11644](https://issues.apache.org/jira/browse/HDFS-11644) and the `StreamCapabilities` method. As with Erasure Coding, even HDFS streams may not support hflush/hsync
    
    This patch is at risk of raising an exception whenever it tries to call `hflush()` on non HDFS store or HDFS with Erasure Coding enabled. IF you were targeting Hadoop 2.9+ you could just check `hasCapability("hsync")` use it if present. For Hadoop 2.6+ you'll have to call `out.hflush()` on the first attempt, if any exception (IOE, UnsupportedOperationException, RTE) is raised, catch, swallow and never try to hflush again. 
    
    Sorry, it's messy: its why I'd like that `hasCapability(`) probe up for all features which are only intermittently available. Can complicate caller code if you want to know these things, but stops you getting caught out when you really want to know the durability semantics of the FS.
    
    see also WiP [OutputStream](https://github.com/steveloughran/hadoop/blob/s3/HADOOP-13327-outputstream-trunk/hadoop-common-project/hadoop-common/src/site/markdown/filesystem/outputstream.md)
    
    (thanks for mentioning me BTW; this is one of those things that would probably work well in local tests but blow up in production somewhere)


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19404: [SPARK-21760] [Streaming] Fix for Structured streaming t...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19404
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82433/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19404: [SPARK-21760] [Streaming] Fix for Structured streaming t...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19404
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82359/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19404: [SPARK-21760] [Streaming] Fix for Structured stre...

Posted by steveloughran <gi...@git.apache.org>.
Github user steveloughran commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19404#discussion_r183480609
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/HDFSMetadataLog.scala ---
    @@ -123,6 +123,7 @@ class HDFSMetadataLog[T <: AnyRef : ClassTag](sparkSession: SparkSession, path:
               serialize(metadata, output)
               return Some(tempPath)
             } finally {
    +          output.hflush()
    --- End diff --
    
    not needed, close will do it.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19404: [SPARK-21760] [Streaming] Fix for Structured streaming t...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19404
  
    **[Test build #3940 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3940/testReport)** for PR 19404 at commit [`89cdb3b`](https://github.com/apache/spark/commit/89cdb3bdf70ec39a09b4e598935bb20a8f64f0cb).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19404: [SPARK-21760] [Streaming] Fix for Structured streaming t...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19404
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19404: [SPARK-21760] [Streaming] Fix for Structured streaming t...

Posted by steveloughran <gi...@git.apache.org>.
Github user steveloughran commented on the issue:

    https://github.com/apache/spark/pull/19404
  
    I think the sync is important, but that you just need to handle the case of "fs doesn't support it".
    
    Thinking about this a bit more, I didn't like my proposed patch. Better to have
    
    * probe for feature after open(), through a check for implementing Syncable, and then calling `hflush()`. It's the lower cost call and if you implement one, you have to implement the other.
    * if hflush fails, don't use sync, so set `syncable: Optional<Syncable>` to None
    * when checkpointing, go `syncable.map(_.hsync())`. Which is the core of your current patch
    
    you will take a perf hit on the sync, as on HDFS you won't get it returning until it has been written down the entire replication chain. But after that, you've got a guarantee of durability, which is what checkpoints tend to expect...
    
    (side topic: some JIRAs on Flink checkpointing to other stores, especially [FLINK-9061](https://issues.apache.org/jira/browse/FLINK-9061)
    



---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19404: [SPARK-21760] [Streaming] Fix for Structured streaming t...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19404
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19404: [SPARK-21760] [Streaming] Fix for Structured streaming t...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19404
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82434/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19404: [SPARK-21760] [Streaming] Fix for Structured stre...

Posted by rekhajoshm <gi...@git.apache.org>.
Github user rekhajoshm closed the pull request at:

    https://github.com/apache/spark/pull/19404


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19404: [SPARK-21760] [Streaming] Fix for Structured streaming t...

Posted by steveloughran <gi...@git.apache.org>.
Github user steveloughran commented on the issue:

    https://github.com/apache/spark/pull/19404
  
    BTW, perf wise: hflush() is required to block until the flush has got to the store (visible to others), and with hsync actually  saved to the durable store. So it will take time, but if you want durability, that's that price. Without that hsync call though, there's no guarantee anything will be written to the store. If you need this log to recover from failures: hsync is what you are going to need, even though it currently only works on HDFS and WASB.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19404: [SPARK-21760] [Streaming] Fix for Structured streaming t...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19404
  
    **[Test build #3940 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3940/testReport)** for PR 19404 at commit [`89cdb3b`](https://github.com/apache/spark/commit/89cdb3bdf70ec39a09b4e598935bb20a8f64f0cb).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19404: [SPARK-21760] [Streaming] Fix for Structured streaming t...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19404
  
    **[Test build #82434 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82434/testReport)** for PR 19404 at commit [`f945f39`](https://github.com/apache/spark/commit/f945f39438c6a1cabff3f18489dd2082c4cba24c).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19404: [SPARK-21760] [Streaming] Fix for Structured streaming t...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19404
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19404: [SPARK-21760] [Streaming] Fix for Structured streaming t...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19404
  
    **[Test build #82433 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82433/testReport)** for PR 19404 at commit [`9cd3ee6`](https://github.com/apache/spark/commit/9cd3ee69b9ddd678b77d8b78feec51ea8c55e377).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19404: [SPARK-21760] [Streaming] Fix for Structured streaming t...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19404
  
    **[Test build #82439 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82439/testReport)** for PR 19404 at commit [`a2d5bc7`](https://github.com/apache/spark/commit/a2d5bc706987accaeb6c9516e7d4a07b5bb3104f).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19404: [SPARK-21760] [Streaming] Fix for Structured streaming t...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19404
  
    **[Test build #82434 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82434/testReport)** for PR 19404 at commit [`f945f39`](https://github.com/apache/spark/commit/f945f39438c6a1cabff3f18489dd2082c4cba24c).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19404: [SPARK-21760] [Streaming] Fix for Structured streaming t...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/19404
  
    @steveloughran what do you think of this? 
    
    flushing sounds safe but is there a performance impact here if done on every `serialize`?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19404: [SPARK-21760] [Streaming] Fix for Structured streaming t...

Posted by rekhajoshm <gi...@git.apache.org>.
Github user rekhajoshm commented on the issue:

    https://github.com/apache/spark/pull/19404
  
    Thanks for the good inputs.Closing this PR.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19404: [SPARK-21760] [Streaming] Fix for Structured streaming t...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19404
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19404: [SPARK-21760] [Streaming] Fix for Structured stre...

Posted by steveloughran <gi...@git.apache.org>.
Github user steveloughran commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19404#discussion_r183482802
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/CompactibleFileStreamLog.scala ---
    @@ -139,6 +139,9 @@ abstract class CompactibleFileStreamLog[T <: AnyRef : ClassTag](
           out.write('\n')
           out.write(Serialization.write(data).getBytes(UTF_8))
         }
    +    if (out.isInstanceOf[FSDataOutputStream]) {
    +      out.asInstanceOf[FSDataOutputStream].hflush
    +    }
    --- End diff --
    
    Something like 
    
    When opening stream
    ```scala
    var syncable: Option[Syncable]] = if (out.isInstanceOf[FSDataOutputStream]) Some(out.asInstanceOf[FSDataOutputStream]) else None
    ```
    
    then when at flush time, any flush failure just downgrades to a "don't flush any more"
    
    ```scala
    try {
    syncable.map(_.hflush)
    } catch {
    case _: Exception =>  syncable = None
    }
    ```
    
    I know, risky,  but I'm not 100% sure that its always UnsupportedOperationException which gets raised. If IO is failing, it'll surface later on anyway.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org