You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by rekhajoshm <gi...@git.apache.org> on 2017/10/01 00:21:42 UTC
[GitHub] spark pull request #19404: [SPARK-21760] [Streaming] Fix for Structured stre...
GitHub user rekhajoshm opened a pull request:
https://github.com/apache/spark/pull/19404
[SPARK-21760] [Streaming] Fix for Structured streaming terminates with Exception
## What changes were proposed in this pull request?
Updated serialize call for flush and FSDataOutputStream hflush
## How was this patch tested?
manual test
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/rekhajoshm/spark SPARK-21760
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/19404.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #19404
----
commit e3677c9fa9697e0d34f9df52442085a6a481c9e9
Author: Rekha Joshi <re...@gmail.com>
Date: 2015-05-05T23:10:08Z
Merge pull request #1 from apache/master
Pulling functionality from apache spark
commit 106fd8eee8f6a6f7c67cfc64f57c1161f76d8f75
Author: Rekha Joshi <re...@gmail.com>
Date: 2015-05-08T21:49:09Z
Merge pull request #2 from apache/master
pull latest from apache spark
commit 0be142d6becba7c09c6eba0b8ea1efe83d649e8c
Author: Rekha Joshi <re...@gmail.com>
Date: 2015-06-22T00:08:08Z
Merge pull request #3 from apache/master
Pulling functionality from apache spark
commit 6c6ee12fd733e3f9902e10faf92ccb78211245e3
Author: Rekha Joshi <re...@gmail.com>
Date: 2015-09-17T01:03:09Z
Merge pull request #4 from apache/master
Pulling functionality from apache spark
commit b123c601e459d1ad17511fd91dd304032154882a
Author: Rekha Joshi <re...@gmail.com>
Date: 2015-11-25T18:50:32Z
Merge pull request #5 from apache/master
pull request from apache/master
commit c73c32aadd6066e631956923725a48d98a18777e
Author: Rekha Joshi <re...@gmail.com>
Date: 2016-03-18T19:13:51Z
Merge pull request #6 from apache/master
pull latest from apache spark
commit 7dbf7320057978526635bed09dabc8cf8657a28a
Author: Rekha Joshi <re...@gmail.com>
Date: 2016-04-05T20:26:40Z
Merge pull request #8 from apache/master
pull latest from apache spark
commit 5e9d71827f8e2e4d07027281b80e4e073e7fecd1
Author: Rekha Joshi <re...@gmail.com>
Date: 2017-05-01T23:00:30Z
Merge pull request #9 from apache/master
Pull apache spark
commit 63d99b3ce5f222d7126133170a373591f0ac67dd
Author: Rekha Joshi <re...@gmail.com>
Date: 2017-09-30T22:26:44Z
Merge pull request #10 from apache/master
pull latest apache spark
commit 89cdb3bdf70ec39a09b4e598935bb20a8f64f0cb
Author: rjoshi2 <re...@gmail.com>
Date: 2017-10-01T00:15:58Z
Fix for Structured streaming terminating with Exception
----
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19404: [SPARK-21760] [Streaming] Fix for Structured streaming t...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19404
**[Test build #82439 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82439/testReport)** for PR 19404 at commit [`a2d5bc7`](https://github.com/apache/spark/commit/a2d5bc706987accaeb6c9516e7d4a07b5bb3104f).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19404: [SPARK-21760] [Streaming] Fix for Structured streaming t...
Posted by rekhajoshm <gi...@git.apache.org>.
Github user rekhajoshm commented on the issue:
https://github.com/apache/spark/pull/19404
Seems to be apache spark git/jenkins issue.Please retest after a while.thanks
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19404: [SPARK-21760] [Streaming] Fix for Structured streaming t...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19404
**[Test build #82433 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82433/testReport)** for PR 19404 at commit [`9cd3ee6`](https://github.com/apache/spark/commit/9cd3ee69b9ddd678b77d8b78feec51ea8c55e377).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19404: [SPARK-21760] [Streaming] Fix for Structured streaming t...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19404
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82439/
Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19404: [SPARK-21760] [Streaming] Fix for Structured streaming t...
Posted by steveloughran <gi...@git.apache.org>.
Github user steveloughran commented on the issue:
https://github.com/apache/spark/pull/19404
Problem here is that a stream which doesn't implement hflush/hsync is required to throw an exception; it's a way of guaranteeing that if hsync/hflush does complete, the action has done what you want - HBase &c utterly depend on this.
The fact that FSDataOutputStream implements Syncable and yet streams it may relay to may not is the whole reason for [HDFS-11644](https://issues.apache.org/jira/browse/HDFS-11644) and the `StreamCapabilities` method. As with Erasure Coding, even HDFS streams may not support hflush/hsync
This patch is at risk of raising an exception whenever it tries to call `hflush()` on non HDFS store or HDFS with Erasure Coding enabled. IF you were targeting Hadoop 2.9+ you could just check `hasCapability("hsync")` use it if present. For Hadoop 2.6+ you'll have to call `out.hflush()` on the first attempt, if any exception (IOE, UnsupportedOperationException, RTE) is raised, catch, swallow and never try to hflush again.
Sorry, it's messy: its why I'd like that `hasCapability(`) probe up for all features which are only intermittently available. Can complicate caller code if you want to know these things, but stops you getting caught out when you really want to know the durability semantics of the FS.
see also WiP [OutputStream](https://github.com/steveloughran/hadoop/blob/s3/HADOOP-13327-outputstream-trunk/hadoop-common-project/hadoop-common/src/site/markdown/filesystem/outputstream.md)
(thanks for mentioning me BTW; this is one of those things that would probably work well in local tests but blow up in production somewhere)
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19404: [SPARK-21760] [Streaming] Fix for Structured streaming t...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19404
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82433/
Test FAILed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19404: [SPARK-21760] [Streaming] Fix for Structured streaming t...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19404
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82359/
Test FAILed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19404: [SPARK-21760] [Streaming] Fix for Structured stre...
Posted by steveloughran <gi...@git.apache.org>.
Github user steveloughran commented on a diff in the pull request:
https://github.com/apache/spark/pull/19404#discussion_r183480609
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/HDFSMetadataLog.scala ---
@@ -123,6 +123,7 @@ class HDFSMetadataLog[T <: AnyRef : ClassTag](sparkSession: SparkSession, path:
serialize(metadata, output)
return Some(tempPath)
} finally {
+ output.hflush()
--- End diff --
not needed, close will do it.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19404: [SPARK-21760] [Streaming] Fix for Structured streaming t...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19404
**[Test build #3940 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3940/testReport)** for PR 19404 at commit [`89cdb3b`](https://github.com/apache/spark/commit/89cdb3bdf70ec39a09b4e598935bb20a8f64f0cb).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19404: [SPARK-21760] [Streaming] Fix for Structured streaming t...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19404
Merged build finished. Test FAILed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19404: [SPARK-21760] [Streaming] Fix for Structured streaming t...
Posted by steveloughran <gi...@git.apache.org>.
Github user steveloughran commented on the issue:
https://github.com/apache/spark/pull/19404
I think the sync is important, but that you just need to handle the case of "fs doesn't support it".
Thinking about this a bit more, I didn't like my proposed patch. Better to have
* probe for feature after open(), through a check for implementing Syncable, and then calling `hflush()`. It's the lower cost call and if you implement one, you have to implement the other.
* if hflush fails, don't use sync, so set `syncable: Optional<Syncable>` to None
* when checkpointing, go `syncable.map(_.hsync())`. Which is the core of your current patch
you will take a perf hit on the sync, as on HDFS you won't get it returning until it has been written down the entire replication chain. But after that, you've got a guarantee of durability, which is what checkpoints tend to expect...
(side topic: some JIRAs on Flink checkpointing to other stores, especially [FLINK-9061](https://issues.apache.org/jira/browse/FLINK-9061)
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19404: [SPARK-21760] [Streaming] Fix for Structured streaming t...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19404
Merged build finished. Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19404: [SPARK-21760] [Streaming] Fix for Structured streaming t...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19404
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82434/
Test FAILed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19404: [SPARK-21760] [Streaming] Fix for Structured stre...
Posted by rekhajoshm <gi...@git.apache.org>.
Github user rekhajoshm closed the pull request at:
https://github.com/apache/spark/pull/19404
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19404: [SPARK-21760] [Streaming] Fix for Structured streaming t...
Posted by steveloughran <gi...@git.apache.org>.
Github user steveloughran commented on the issue:
https://github.com/apache/spark/pull/19404
BTW, perf wise: hflush() is required to block until the flush has got to the store (visible to others), and with hsync actually saved to the durable store. So it will take time, but if you want durability, that's that price. Without that hsync call though, there's no guarantee anything will be written to the store. If you need this log to recover from failures: hsync is what you are going to need, even though it currently only works on HDFS and WASB.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19404: [SPARK-21760] [Streaming] Fix for Structured streaming t...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19404
**[Test build #3940 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3940/testReport)** for PR 19404 at commit [`89cdb3b`](https://github.com/apache/spark/commit/89cdb3bdf70ec39a09b4e598935bb20a8f64f0cb).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19404: [SPARK-21760] [Streaming] Fix for Structured streaming t...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19404
**[Test build #82434 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82434/testReport)** for PR 19404 at commit [`f945f39`](https://github.com/apache/spark/commit/f945f39438c6a1cabff3f18489dd2082c4cba24c).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19404: [SPARK-21760] [Streaming] Fix for Structured streaming t...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19404
Merged build finished. Test FAILed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19404: [SPARK-21760] [Streaming] Fix for Structured streaming t...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19404
**[Test build #82433 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82433/testReport)** for PR 19404 at commit [`9cd3ee6`](https://github.com/apache/spark/commit/9cd3ee69b9ddd678b77d8b78feec51ea8c55e377).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19404: [SPARK-21760] [Streaming] Fix for Structured streaming t...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19404
**[Test build #82439 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82439/testReport)** for PR 19404 at commit [`a2d5bc7`](https://github.com/apache/spark/commit/a2d5bc706987accaeb6c9516e7d4a07b5bb3104f).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19404: [SPARK-21760] [Streaming] Fix for Structured streaming t...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19404
**[Test build #82434 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82434/testReport)** for PR 19404 at commit [`f945f39`](https://github.com/apache/spark/commit/f945f39438c6a1cabff3f18489dd2082c4cba24c).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19404: [SPARK-21760] [Streaming] Fix for Structured streaming t...
Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the issue:
https://github.com/apache/spark/pull/19404
@steveloughran what do you think of this?
flushing sounds safe but is there a performance impact here if done on every `serialize`?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19404: [SPARK-21760] [Streaming] Fix for Structured streaming t...
Posted by rekhajoshm <gi...@git.apache.org>.
Github user rekhajoshm commented on the issue:
https://github.com/apache/spark/pull/19404
Thanks for the good inputs.Closing this PR.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19404: [SPARK-21760] [Streaming] Fix for Structured streaming t...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19404
Merged build finished. Test FAILed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19404: [SPARK-21760] [Streaming] Fix for Structured stre...
Posted by steveloughran <gi...@git.apache.org>.
Github user steveloughran commented on a diff in the pull request:
https://github.com/apache/spark/pull/19404#discussion_r183482802
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/CompactibleFileStreamLog.scala ---
@@ -139,6 +139,9 @@ abstract class CompactibleFileStreamLog[T <: AnyRef : ClassTag](
out.write('\n')
out.write(Serialization.write(data).getBytes(UTF_8))
}
+ if (out.isInstanceOf[FSDataOutputStream]) {
+ out.asInstanceOf[FSDataOutputStream].hflush
+ }
--- End diff --
Something like
When opening stream
```scala
var syncable: Option[Syncable]] = if (out.isInstanceOf[FSDataOutputStream]) Some(out.asInstanceOf[FSDataOutputStream]) else None
```
then when at flush time, any flush failure just downgrades to a "don't flush any more"
```scala
try {
syncable.map(_.hflush)
} catch {
case _: Exception => syncable = None
}
```
I know, risky, but I'm not 100% sure that its always UnsupportedOperationException which gets raised. If IO is failing, it'll surface later on anyway.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org