You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by cloud-fan <gi...@git.apache.org> on 2017/10/25 23:09:47 UTC

[GitHub] spark pull request #19577: [SPARK-22355][SQL] Dataset.collect is not threads...

GitHub user cloud-fan opened a pull request:

    https://github.com/apache/spark/pull/19577

    [SPARK-22355][SQL] Dataset.collect is not threadsafe

    ## What changes were proposed in this pull request?
    
    It's possible that users create a `Dataset`, and call `collect` of this `Dataset` in many threads at the same time. Currently `Dataset#collect` just call `encoder.fromRow` to convert spark rows to objects of type T, and this encoder is per-dataset. This means `Dataset#collect` is not thread-safe, because the encoder uses a projection to output the object to a re-usable row.
    
    This PR fixes this problem, by creating a new projection when calling `Dataset#collect`, so that we have the re-usable row for each method call, instead of each Dataset.
    
    ## How was this patch tested?
    
    N/A

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/cloud-fan/spark encoder

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19577.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #19577
    
----
commit cecea8cdb36f3c5e65abd08643bd0d181d72008d
Author: Wenchen Fan <we...@databricks.com>
Date:   2017-10-25T23:02:27Z

    Dataset.collect is not threadsafe

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19577: [SPARK-22355][SQL] Dataset.collect is not threads...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19577#discussion_r147032429
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
    @@ -2661,7 +2657,12 @@ class Dataset[T] private[sql](
        */
       def toLocalIterator(): java.util.Iterator[T] = {
         withAction("toLocalIterator", queryExecution) { plan =>
    -      plan.executeToIterator().map(boundEnc.fromRow).asJava
    +      val objProj = GenerateSafeProjection.generate(deserializer :: Nil)
    --- End diff --
    
    It should be better to explain we keep the projection inside for thread-safe.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19577: [SPARK-22355][SQL] Dataset.collect is not threads...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19577#discussion_r147037025
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
    @@ -3102,7 +3103,12 @@ class Dataset[T] private[sql](
        * Collect all elements from a spark plan.
        */
       private def collectFromPlan(plan: SparkPlan): Array[T] = {
    -    plan.executeCollect().map(boundEnc.fromRow)
    +    val objProj = GenerateSafeProjection.generate(deserializer :: Nil)
    --- End diff --
    
    Ok. Looks good.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19577: [SPARK-22355][SQL] Dataset.collect is not threads...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19577#discussion_r147032280
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
    @@ -3102,7 +3103,12 @@ class Dataset[T] private[sql](
        * Collect all elements from a spark plan.
        */
       private def collectFromPlan(plan: SparkPlan): Array[T] = {
    -    plan.executeCollect().map(boundEnc.fromRow)
    +    val objProj = GenerateSafeProjection.generate(deserializer :: Nil)
    --- End diff --
    
    `fromRow` has caught `RuntimeException`. Shall we also catch it?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19577: [SPARK-22355][SQL] Dataset.collect is not threadsafe

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19577
  
    **[Test build #83091 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83091/testReport)** for PR 19577 at commit [`dfcdb23`](https://github.com/apache/spark/commit/dfcdb2362f51b2b139340150ea581266cf61d1c3).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19577: [SPARK-22355][SQL] Dataset.collect is not threads...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19577#discussion_r147035745
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
    @@ -3102,7 +3103,12 @@ class Dataset[T] private[sql](
        * Collect all elements from a spark plan.
        */
       private def collectFromPlan(plan: SparkPlan): Array[T] = {
    -    plan.executeCollect().map(boundEnc.fromRow)
    +    val objProj = GenerateSafeProjection.generate(deserializer :: Nil)
    --- End diff --
    
    it just rethrow the exception, not a big deal


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19577: [SPARK-22355][SQL] Dataset.collect is not threadsafe

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/19577
  
    Nice catch! LGTM with two minor comments.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19577: [SPARK-22355][SQL] Dataset.collect is not threads...

Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19577#discussion_r147194202
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
    @@ -2661,7 +2657,12 @@ class Dataset[T] private[sql](
        */
       def toLocalIterator(): java.util.Iterator[T] = {
         withAction("toLocalIterator", queryExecution) { plan =>
    -      plan.executeToIterator().map(boundEnc.fromRow).asJava
    +      val objProj = GenerateSafeProjection.generate(deserializer :: Nil)
    --- End diff --
    
    +1


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19577: [SPARK-22355][SQL] Dataset.collect is not threadsafe

Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on the issue:

    https://github.com/apache/spark/pull/19577
  
    Good catch, LGTM.
    Here is my observation. This problem occurs since (this SpecicInternalRow)[https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateSafeProjection.scala#L193] is used per `Dataset`. This can be shared by multiple threads. To avoid this sharing, this PR assigns a Projection into a local variable as `val objProj = GenerateSafeProjection.generate(deserializer :: Nil)`.  
    It is not easy to create a test case for this.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19577: [SPARK-22355][SQL] Dataset.collect is not threadsafe

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19577
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83091/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19577: [SPARK-22355][SQL] Dataset.collect is not threadsafe

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19577
  
    **[Test build #83063 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83063/testReport)** for PR 19577 at commit [`cecea8c`](https://github.com/apache/spark/commit/cecea8cdb36f3c5e65abd08643bd0d181d72008d).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19577: [SPARK-22355][SQL] Dataset.collect is not threadsafe

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19577
  
    **[Test build #83063 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83063/testReport)** for PR 19577 at commit [`cecea8c`](https://github.com/apache/spark/commit/cecea8cdb36f3c5e65abd08643bd0d181d72008d).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19577: [SPARK-22355][SQL] Dataset.collect is not threadsafe

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19577
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19577: [SPARK-22355][SQL] Dataset.collect is not threadsafe

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19577
  
    **[Test build #83091 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83091/testReport)** for PR 19577 at commit [`dfcdb23`](https://github.com/apache/spark/commit/dfcdb2362f51b2b139340150ea581266cf61d1c3).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19577: [SPARK-22355][SQL] Dataset.collect is not threadsafe

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/19577
  
    cc @zsxwing @viirya @kiszk 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19577: [SPARK-22355][SQL] Dataset.collect is not threads...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/19577


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19577: [SPARK-22355][SQL] Dataset.collect is not threadsafe

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/19577
  
    Thanks! Merged to master. cc @zsxwing 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19577: [SPARK-22355][SQL] Dataset.collect is not threadsafe

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19577
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19577: [SPARK-22355][SQL] Dataset.collect is not threadsafe

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19577
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83063/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org