You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by cloud-fan <gi...@git.apache.org> on 2017/10/25 23:09:47 UTC
[GitHub] spark pull request #19577: [SPARK-22355][SQL] Dataset.collect is not threads...
GitHub user cloud-fan opened a pull request:
https://github.com/apache/spark/pull/19577
[SPARK-22355][SQL] Dataset.collect is not threadsafe
## What changes were proposed in this pull request?
It's possible that users create a `Dataset`, and call `collect` of this `Dataset` in many threads at the same time. Currently `Dataset#collect` just call `encoder.fromRow` to convert spark rows to objects of type T, and this encoder is per-dataset. This means `Dataset#collect` is not thread-safe, because the encoder uses a projection to output the object to a re-usable row.
This PR fixes this problem, by creating a new projection when calling `Dataset#collect`, so that we have the re-usable row for each method call, instead of each Dataset.
## How was this patch tested?
N/A
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/cloud-fan/spark encoder
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/19577.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #19577
----
commit cecea8cdb36f3c5e65abd08643bd0d181d72008d
Author: Wenchen Fan <we...@databricks.com>
Date: 2017-10-25T23:02:27Z
Dataset.collect is not threadsafe
----
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19577: [SPARK-22355][SQL] Dataset.collect is not threads...
Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:
https://github.com/apache/spark/pull/19577#discussion_r147032429
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -2661,7 +2657,12 @@ class Dataset[T] private[sql](
*/
def toLocalIterator(): java.util.Iterator[T] = {
withAction("toLocalIterator", queryExecution) { plan =>
- plan.executeToIterator().map(boundEnc.fromRow).asJava
+ val objProj = GenerateSafeProjection.generate(deserializer :: Nil)
--- End diff --
It should be better to explain we keep the projection inside for thread-safe.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19577: [SPARK-22355][SQL] Dataset.collect is not threads...
Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:
https://github.com/apache/spark/pull/19577#discussion_r147037025
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -3102,7 +3103,12 @@ class Dataset[T] private[sql](
* Collect all elements from a spark plan.
*/
private def collectFromPlan(plan: SparkPlan): Array[T] = {
- plan.executeCollect().map(boundEnc.fromRow)
+ val objProj = GenerateSafeProjection.generate(deserializer :: Nil)
--- End diff --
Ok. Looks good.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19577: [SPARK-22355][SQL] Dataset.collect is not threads...
Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:
https://github.com/apache/spark/pull/19577#discussion_r147032280
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -3102,7 +3103,12 @@ class Dataset[T] private[sql](
* Collect all elements from a spark plan.
*/
private def collectFromPlan(plan: SparkPlan): Array[T] = {
- plan.executeCollect().map(boundEnc.fromRow)
+ val objProj = GenerateSafeProjection.generate(deserializer :: Nil)
--- End diff --
`fromRow` has caught `RuntimeException`. Shall we also catch it?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19577: [SPARK-22355][SQL] Dataset.collect is not threadsafe
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19577
**[Test build #83091 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83091/testReport)** for PR 19577 at commit [`dfcdb23`](https://github.com/apache/spark/commit/dfcdb2362f51b2b139340150ea581266cf61d1c3).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19577: [SPARK-22355][SQL] Dataset.collect is not threads...
Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:
https://github.com/apache/spark/pull/19577#discussion_r147035745
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -3102,7 +3103,12 @@ class Dataset[T] private[sql](
* Collect all elements from a spark plan.
*/
private def collectFromPlan(plan: SparkPlan): Array[T] = {
- plan.executeCollect().map(boundEnc.fromRow)
+ val objProj = GenerateSafeProjection.generate(deserializer :: Nil)
--- End diff --
it just rethrow the exception, not a big deal
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19577: [SPARK-22355][SQL] Dataset.collect is not threadsafe
Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/19577
Nice catch! LGTM with two minor comments.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19577: [SPARK-22355][SQL] Dataset.collect is not threads...
Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on a diff in the pull request:
https://github.com/apache/spark/pull/19577#discussion_r147194202
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -2661,7 +2657,12 @@ class Dataset[T] private[sql](
*/
def toLocalIterator(): java.util.Iterator[T] = {
withAction("toLocalIterator", queryExecution) { plan =>
- plan.executeToIterator().map(boundEnc.fromRow).asJava
+ val objProj = GenerateSafeProjection.generate(deserializer :: Nil)
--- End diff --
+1
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19577: [SPARK-22355][SQL] Dataset.collect is not threadsafe
Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on the issue:
https://github.com/apache/spark/pull/19577
Good catch, LGTM.
Here is my observation. This problem occurs since (this SpecicInternalRow)[https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateSafeProjection.scala#L193] is used per `Dataset`. This can be shared by multiple threads. To avoid this sharing, this PR assigns a Projection into a local variable as `val objProj = GenerateSafeProjection.generate(deserializer :: Nil)`.
It is not easy to create a test case for this.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19577: [SPARK-22355][SQL] Dataset.collect is not threadsafe
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19577
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83091/
Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19577: [SPARK-22355][SQL] Dataset.collect is not threadsafe
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19577
**[Test build #83063 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83063/testReport)** for PR 19577 at commit [`cecea8c`](https://github.com/apache/spark/commit/cecea8cdb36f3c5e65abd08643bd0d181d72008d).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19577: [SPARK-22355][SQL] Dataset.collect is not threadsafe
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19577
**[Test build #83063 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83063/testReport)** for PR 19577 at commit [`cecea8c`](https://github.com/apache/spark/commit/cecea8cdb36f3c5e65abd08643bd0d181d72008d).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19577: [SPARK-22355][SQL] Dataset.collect is not threadsafe
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19577
Merged build finished. Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19577: [SPARK-22355][SQL] Dataset.collect is not threadsafe
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19577
**[Test build #83091 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83091/testReport)** for PR 19577 at commit [`dfcdb23`](https://github.com/apache/spark/commit/dfcdb2362f51b2b139340150ea581266cf61d1c3).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19577: [SPARK-22355][SQL] Dataset.collect is not threadsafe
Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:
https://github.com/apache/spark/pull/19577
cc @zsxwing @viirya @kiszk
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19577: [SPARK-22355][SQL] Dataset.collect is not threads...
Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:
https://github.com/apache/spark/pull/19577
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19577: [SPARK-22355][SQL] Dataset.collect is not threadsafe
Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:
https://github.com/apache/spark/pull/19577
Thanks! Merged to master. cc @zsxwing
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19577: [SPARK-22355][SQL] Dataset.collect is not threadsafe
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19577
Merged build finished. Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19577: [SPARK-22355][SQL] Dataset.collect is not threadsafe
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19577
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83063/
Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org