You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by sameeragarwal <gi...@git.apache.org> on 2016/01/07 05:03:40 UTC

[GitHub] spark pull request: [SPARK-12662][SQL] Fix DataFrame.randomSplit t...

GitHub user sameeragarwal opened a pull request:

    https://github.com/apache/spark/pull/10626

    [SPARK-12662][SQL] Fix DataFrame.randomSplit to avoid creating overlapping splits

    https://issues.apache.org/jira/browse/SPARK-12662
    
    cc @yhuai 

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/sameeragarwal/spark randomsplit

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/10626.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #10626
    
----
commit 8da496f7b9b11290b20152d0f837f213c2571d60
Author: Sameer Agarwal <sa...@databricks.com>
Date:   2016-01-07T02:59:48Z

    Fix DataFrame.randomSplit to avoid creating overlapping splits

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12662][SQL] Fix DataFrame.randomSplit t...

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10626#discussion_r49048563
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameStatSuite.scala ---
    @@ -62,6 +62,32 @@ class DataFrameStatSuite extends QueryTest with SharedSQLContext {
         }
       }
     
    +  test("randomSplit on reordered partitions") {
    +    val n = 600
    --- End diff --
    
    what do u mean? the shuffle happens outside of catalyst, so the optimizer can't push it beneath it.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12662][SQL] Fix DataFrame.randomSplit t...

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10626#discussion_r49044672
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala ---
    @@ -1062,10 +1062,17 @@ class DataFrame private[sql](
        * @since 1.4.0
        */
       def randomSplit(weights: Array[Double], seed: Long): Array[DataFrame] = {
    +    // It is possible that the underlying dataframe doesn't guarantee the ordering of rows in its
    +    // constituent partitions each time a split is materialized which could result in
    +    // overlapping splits. To prevent this, we explicitly sort each input partition to make the
    +    // ordering deterministic.
    +    val logicalPlanWithLocalSort =
    --- End diff --
    
    to make it more concise, just call this "sorted"


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12662][SQL] Fix DataFrame.randomSplit t...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10626#issuecomment-169594446
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12662][SQL] Fix DataFrame.randomSplit t...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10626#issuecomment-169563087
  
    **[Test build #48893 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48893/consoleFull)** for PR 10626 at commit [`8da496f`](https://github.com/apache/spark/commit/8da496f7b9b11290b20152d0f837f213c2571d60).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12662][SQL] Fix DataFrame.randomSplit t...

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10626#discussion_r49041824
  
    --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveSparkSubmitSuite.scala ---
    @@ -129,6 +129,19 @@ class HiveSparkSubmitSuite
         runSparkSubmit(args)
       }
     
    +  test("SPARK-12662 fix DataFrame.randomSplit to avoid creating overlapping splits") {
    --- End diff --
    
    Do we need to start a whole new process to test this? I think we can just run randomSplit in the normal DataFrameSuite?



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12662][SQL] Fix DataFrame.randomSplit t...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10626#issuecomment-169544262
  
    **[Test build #48893 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48893/consoleFull)** for PR 10626 at commit [`8da496f`](https://github.com/apache/spark/commit/8da496f7b9b11290b20152d0f837f213c2571d60).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12662][SQL] Fix DataFrame.randomSplit t...

Posted by yhuai <gi...@git.apache.org>.

Github user yhuai commented on the pull request:

    https://github.com/apache/spark/pull/10626#issuecomment-170055361
  
    @gatorsmile where is the fix?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12662][SQL] Fix DataFrame.randomSplit t...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10626#issuecomment-169622194
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48924/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12662][SQL] Fix DataFrame.randomSplit t...

Posted by sameeragarwal <gi...@git.apache.org>.

Github user sameeragarwal commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10626#discussion_r49048290
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameStatSuite.scala ---
    @@ -62,6 +62,32 @@ class DataFrameStatSuite extends QueryTest with SharedSQLContext {
         }
       }
     
    +  test("randomSplit on reordered partitions") {
    +    val n = 600
    --- End diff --
    
    once we implement sample pushdown in catalyst, it shouldn't be: http://research.microsoft.com/pubs/76565/sig99sam.pdf :) 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12662][SQL] Fix DataFrame.randomSplit t...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10626#issuecomment-169625827
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48919/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12662][SQL] Fix DataFrame.randomSplit t...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10626#issuecomment-169622377
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48901/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12662][SQL] Fix DataFrame.randomSplit t...

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10626#discussion_r49047615
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameStatSuite.scala ---
    @@ -62,6 +62,32 @@ class DataFrameStatSuite extends QueryTest with SharedSQLContext {
         }
       }
     
    +  test("randomSplit on reordered partitions") {
    +    val n = 600
    --- End diff --
    
    ok in that case, i'd keep the old test, and just remove the verify sample size testing from this test case. This test case should only care about repeatability. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12662][SQL] Fix DataFrame.randomSplit t...

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10626#discussion_r49048710
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameStatSuite.scala ---
    @@ -62,6 +62,32 @@ class DataFrameStatSuite extends QueryTest with SharedSQLContext {
         }
       }
     
    +  test("randomSplit on reordered partitions") {
    +    val n = 600
    --- End diff --
    
    to be clear, i'm suggesting removing everything the previous test case already tests, and only keep
    ```
    // Verify that the results are deterministic across multiple runs
    val data = sparkContext.parallelize(1 to n, 2).mapPartitions(scala.util.Random.shuffle(_)).toDF("id")
    val splits = data.randomSplit(Array[Double](1, 2, 3), seed = 1)
    val firstRun = splits.toSeq.map(_.collect().toSeq)
    val secondRun = data.randomSplit(Array[Double](1, 2, 3), seed = 1).toSeq.map(_.collect().toSeq)
    assert(firstRun == secondRun)
    ```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12662][SQL] Fix DataFrame.randomSplit t...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10626#issuecomment-169600516
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48908/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12662][SQL] Fix DataFrame.randomSplit t...

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/10626


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12662][SQL] Fix DataFrame.randomSplit t...

Posted by sameeragarwal <gi...@git.apache.org>.

Github user sameeragarwal commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10626#discussion_r49047972
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameStatSuite.scala ---
    @@ -62,6 +62,32 @@ class DataFrameStatSuite extends QueryTest with SharedSQLContext {
         }
       }
     
    +  test("randomSplit on reordered partitions") {
    +    val n = 600
    --- End diff --
    
    sure, but to be fair, this new test does test a new codepath (that of inserting a sampling operator after a shuffle)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12662][SQL] Fix DataFrame.randomSplit t...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10626#issuecomment-169622213
  
    **[Test build #48901 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48901/consoleFull)** for PR 10626 at commit [`6e211ff`](https://github.com/apache/spark/commit/6e211ff69063386c2f891e688368cfefb39078e4).
     * This patch **fails from timeout after a configured wait of \`250m\`**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12662][SQL] Fix DataFrame.randomSplit t...

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10626#discussion_r49042523
  
    --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveSparkSubmitSuite.scala ---
    @@ -129,6 +129,19 @@ class HiveSparkSubmitSuite
         runSparkSubmit(args)
       }
     
    +  test("SPARK-12662 fix DataFrame.randomSplit to avoid creating overlapping splits") {
    --- End diff --
    
    ```scala
    sc.parallelize(1 to 10).mapPartitions(scala.util.Random.shuffle(_)).collect()
    ```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12662][SQL] Fix DataFrame.randomSplit t...

Posted by sameeragarwal <gi...@git.apache.org>.

Github user sameeragarwal commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10626#discussion_r49045970
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameStatSuite.scala ---
    @@ -62,6 +62,32 @@ class DataFrameStatSuite extends QueryTest with SharedSQLContext {
         }
       }
     
    +  test("randomSplit on reordered partitions") {
    +    val n = 600
    --- End diff --
    
    done


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12662][SQL] Fix DataFrame.randomSplit t...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10626#issuecomment-169565166
  
    **[Test build #48901 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48901/consoleFull)** for PR 10626 at commit [`6e211ff`](https://github.com/apache/spark/commit/6e211ff69063386c2f891e688368cfefb39078e4).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12662][SQL] Fix DataFrame.randomSplit t...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on the pull request:

    https://github.com/apache/spark/pull/10626#issuecomment-170058937
  
    @yhuai I just submitted to my original PR: https://github.com/apache/spark/pull/10630


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12662][SQL] Fix DataFrame.randomSplit t...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10626#issuecomment-169563356
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12662][SQL] Fix DataFrame.randomSplit t...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10626#issuecomment-169622191
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12662][SQL] Fix DataFrame.randomSplit t...

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10626#discussion_r49045017
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameStatSuite.scala ---
    @@ -62,6 +62,32 @@ class DataFrameStatSuite extends QueryTest with SharedSQLContext {
         }
       }
     
    +  test("randomSplit on reordered partitions") {
    +    val n = 600
    --- End diff --
    
    also you can just run it twice and make sure the result is deterministic, i.e.
    ```
    val a = df.randomSplit(...).toSeq.map(_.collect())
    val b = df.randomSplit(...).toSeq.map(_.collect())
    assert(a == b)
    ```
    
    as long as these are scala collections, I think they will work.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12662][SQL] Fix DataFrame.randomSplit t...

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10626#discussion_r49048105
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameStatSuite.scala ---
    @@ -62,6 +62,32 @@ class DataFrameStatSuite extends QueryTest with SharedSQLContext {
         }
       }
     
    +  test("randomSplit on reordered partitions") {
    +    val n = 600
    --- End diff --
    
    isn't that the same code path?



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12662][SQL] Fix DataFrame.randomSplit t...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on the pull request:

    https://github.com/apache/spark/pull/10626#issuecomment-170023482
  
    The fix is done. Thank you! It is not related to your PR. Sorry : (
    
    After the fix, the plan should be like 
    ```
    Aggregate [id#1], [id#1]
    +- Join LeftSemi, Some((id#1 <=> id#24))
       :- Sample 0.0, 0.4, false, 1
       :  +- Sort [id#1 ASC], false
       :     +- InMemoryRelation [id#1], true, 10000, StorageLevel(true, true, false, true, 1), Project [_1#0 AS id#1], None
       +- Sample 0.4, 1.0, false, 1
          +- Sort [id#24 ASC], false
             +- InMemoryRelation [id#24], true, 10000, StorageLevel(true, true, false, true, 1), Project [_1#0 AS id#1], None
    ```



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12662][SQL] Fix DataFrame.randomSplit t...

Posted by sameeragarwal <gi...@git.apache.org>.

Github user sameeragarwal commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10626#discussion_r49044550
  
    --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveSparkSubmitSuite.scala ---
    @@ -129,6 +129,19 @@ class HiveSparkSubmitSuite
         runSparkSubmit(args)
       }
     
    +  test("SPARK-12662 fix DataFrame.randomSplit to avoid creating overlapping splits") {
    --- End diff --
    
    That's neat! Converted it into a unit test in DataFrameStatSuite.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12662][SQL] Fix DataFrame.randomSplit t...

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/10626#issuecomment-169766900
  
    Thanks - I'm going to merge this.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12662][SQL] Fix DataFrame.randomSplit t...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10626#issuecomment-169591053
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12662][SQL] Fix DataFrame.randomSplit t...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on the pull request:

    https://github.com/apache/spark/pull/10626#issuecomment-169928530
  
    I might not pick up your latest code changes. Let me merge the code. Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12662][SQL] Fix DataFrame.randomSplit t...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10626#issuecomment-169600024
  
    **[Test build #48908 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48908/consoleFull)** for PR 10626 at commit [`9a77c40`](https://github.com/apache/spark/commit/9a77c40e00c664d21b297ea1a7018b6cd9e75371).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12662][SQL] Fix DataFrame.randomSplit t...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10626#issuecomment-169573854
  
    **[Test build #48908 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48908/consoleFull)** for PR 10626 at commit [`9a77c40`](https://github.com/apache/spark/commit/9a77c40e00c664d21b297ea1a7018b6cd9e75371).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12662][SQL] Fix DataFrame.randomSplit t...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10626#issuecomment-169567648
  
    **[Test build #48903 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48903/consoleFull)** for PR 10626 at commit [`eba673d`](https://github.com/apache/spark/commit/eba673d576d697d0a842bececc9863485e00e68f).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12662][SQL] Fix DataFrame.randomSplit t...

Posted by yhuai <gi...@git.apache.org>.

Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10626#discussion_r49043437
  
    --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveSparkSubmitSuite.scala ---
    @@ -129,6 +129,19 @@ class HiveSparkSubmitSuite
         runSparkSubmit(args)
       }
     
    +  test("SPARK-12662 fix DataFrame.randomSplit to avoid creating overlapping splits") {
    --- End diff --
    
    oh, right. We missed it. It is a good one.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12662][SQL] Fix DataFrame.randomSplit t...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on the pull request:

    https://github.com/apache/spark/pull/10626#issuecomment-170008457
  
    Thank you @sameeragarwal for investigating this! Sorry to bring this to you at midnight.  
    
    For helping anyone understand the problem, let me post the logical plan if we do not collect the data to the local node:
    ```
    Aggregate [id#1], [id#1]
    +- Join LeftSemi, None
       :- Filter (id#1 <=> id#1)
       :  +- Sample 0.0, 0.4, false, 1
       :     +- Sort [id#1 ASC], false
       :        +- Project [_1#0 AS id#1]
       :           +- LogicalRDD [_1#0], MapPartitionsRDD[2] at apply at Transformer.scala:22
       +- Sample 0.4, 1.0, false, 1
          +- Project
             +- Sort [id#1 ASC], false
                +- Project [_1#0 AS id#1]
                   +- LogicalRDD [_1#0], MapPartitionsRDD[2] at apply at Transformer.scala:22
    ```
    The non-empty result of DF `Intersect` is expected due to either randomness of data distribution or nondeterministic results of `Sample`. 
    
    Since this is one line change, I think I just include it and to let the tests passed. I will also add a comment to explain it in the test case. Sometimes, users might read the test case to implement their work.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12662][SQL] Fix DataFrame.randomSplit t...

Posted by sameeragarwal <gi...@git.apache.org>.

Github user sameeragarwal commented on the pull request:

    https://github.com/apache/spark/pull/10626#issuecomment-169765653
  
    All comments addressed!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12662][SQL] Fix DataFrame.randomSplit t...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10626#issuecomment-169622376
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12662][SQL] Fix DataFrame.randomSplit t...

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10626#discussion_r49044659
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameStatSuite.scala ---
    @@ -62,6 +62,32 @@ class DataFrameStatSuite extends QueryTest with SharedSQLContext {
         }
       }
     
    +  test("randomSplit on reordered partitions") {
    +    val n = 600
    --- End diff --
    
    since the tests are run so frequently, I don't think you need to try these many times ... doing it once should be enough.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12662][SQL] Fix DataFrame.randomSplit t...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10626#issuecomment-169597974
  
    **[Test build #48924 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48924/consoleFull)** for PR 10626 at commit [`1b30119`](https://github.com/apache/spark/commit/1b30119d4524bec42bec1b1aaece3f3ab691f0e4).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12662][SQL] Fix DataFrame.randomSplit t...

Posted by sameeragarwal <gi...@git.apache.org>.

Github user sameeragarwal commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10626#discussion_r49044813
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameStatSuite.scala ---
    @@ -62,6 +62,32 @@ class DataFrameStatSuite extends QueryTest with SharedSQLContext {
         }
       }
     
    +  test("randomSplit on reordered partitions") {
    +    val n = 600
    --- End diff --
    
    This is the just the size of the dataset. We do however test for 5 different seeds. Should I just test for 1?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12662][SQL] Fix DataFrame.randomSplit t...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10626#issuecomment-169625569
  
    **[Test build #48919 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48919/consoleFull)** for PR 10626 at commit [`252dbc3`](https://github.com/apache/spark/commit/252dbc3052db9e51bdd6f950f501d29004450313).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12662][SQL] Fix DataFrame.randomSplit t...

Posted by sameeragarwal <gi...@git.apache.org>.

Github user sameeragarwal commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10626#discussion_r49044814
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala ---
    @@ -1062,10 +1062,17 @@ class DataFrame private[sql](
        * @since 1.4.0
        */
       def randomSplit(weights: Array[Double], seed: Long): Array[DataFrame] = {
    +    // It is possible that the underlying dataframe doesn't guarantee the ordering of rows in its
    +    // constituent partitions each time a split is materialized which could result in
    +    // overlapping splits. To prevent this, we explicitly sort each input partition to make the
    +    // ordering deterministic.
    +    val logicalPlanWithLocalSort =
    --- End diff --
    
    done


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12662][SQL] Fix DataFrame.randomSplit t...

Posted by sameeragarwal <gi...@git.apache.org>.

Github user sameeragarwal commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10626#discussion_r49047415
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameStatSuite.scala ---
    @@ -62,6 +62,32 @@ class DataFrameStatSuite extends QueryTest with SharedSQLContext {
         }
       }
     
    +  test("randomSplit on reordered partitions") {
    +    val n = 600
    --- End diff --
    
    sure, that sounds reasonable. The only thing that the old test does that this doesn't is that it tests for multiple sampling seeds. I trust that logic might've been also tested somewhere else in the physical planner as well?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12662][SQL] Fix DataFrame.randomSplit t...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10626#issuecomment-169593877
  
    **[Test build #48903 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48903/consoleFull)** for PR 10626 at commit [`eba673d`](https://github.com/apache/spark/commit/eba673d576d697d0a842bececc9863485e00e68f).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12662][SQL] Fix DataFrame.randomSplit t...

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10626#discussion_r49047097
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameStatSuite.scala ---
    @@ -62,6 +62,32 @@ class DataFrameStatSuite extends QueryTest with SharedSQLContext {
         }
       }
     
    +  test("randomSplit on reordered partitions") {
    +    val n = 600
    --- End diff --
    
    i took another look - maybe you should just remove the original randomSplit test case and keep only this one?



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12662][SQL] Fix DataFrame.randomSplit t...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10626#issuecomment-169600515
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12662][SQL] Fix DataFrame.randomSplit t...

Posted by sameeragarwal <gi...@git.apache.org>.

Github user sameeragarwal commented on the pull request:

    https://github.com/apache/spark/pull/10626#issuecomment-169929536
  
    @gatorsmile it seems like your PR is changing the behavior of SQL intersect that this test relies on. I can take a closer look at the PR but if you think this test is using intersect in a way that is not supported in SparkSQL, we can change DataFrameStatSuite:L76 to `assert(splits(0).collect().intersect(splits(1).collect()).isEmpty)` to make this work.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12662][SQL] Fix DataFrame.randomSplit t...

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10626#discussion_r49044959
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameStatSuite.scala ---
    @@ -62,6 +62,32 @@ class DataFrameStatSuite extends QueryTest with SharedSQLContext {
         }
       }
     
    +  test("randomSplit on reordered partitions") {
    +    val n = 600
    --- End diff --
    
    yea 1 is fine.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12662][SQL] Fix DataFrame.randomSplit t...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10626#issuecomment-169591054
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48917/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12662][SQL] Fix DataFrame.randomSplit t...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10626#issuecomment-169594448
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48903/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12662][SQL] Fix DataFrame.randomSplit t...

Posted by yhuai <gi...@git.apache.org>.

Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10626#discussion_r49041900
  
    --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveSparkSubmitSuite.scala ---
    @@ -129,6 +129,19 @@ class HiveSparkSubmitSuite
         runSparkSubmit(args)
       }
     
    +  test("SPARK-12662 fix DataFrame.randomSplit to avoid creating overlapping splits") {
    --- End diff --
    
    We have not figure out a case that can trigger the problem in the local mode.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12662][SQL] Fix DataFrame.randomSplit t...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10626#issuecomment-169625824
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12662][SQL] Fix DataFrame.randomSplit t...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on the pull request:

    https://github.com/apache/spark/pull/10626#issuecomment-169918040
  
    @sameeragarwal Could you take a look at the following test failure? 
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49007/consoleFull
    
    It sounds like this is caused by this fix. If you are busy, I can work on it. Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12662][SQL] Fix DataFrame.randomSplit t...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10626#issuecomment-169563357
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48893/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12662][SQL] Fix DataFrame.randomSplit t...

Posted by sameeragarwal <gi...@git.apache.org>.

Github user sameeragarwal commented on the pull request:

    https://github.com/apache/spark/pull/10626#issuecomment-169935761
  
    @gatorsmile I pulled your changes and verified that the new intersect implementation fails even when there is a deterministic sampling operator in the plan, i.e., something along the lines of the following will now fail:
    ```scala
    val data = sparkContext.parallelize(1 to 600, 1).toDF("id")
    assert(data.sample(false, 0.1, 1).intersect(data.sample(false, 0.1, 1)).collect().isEmpty) //FAILS
    ```
    
    If that's intentional, please let me know if you'd like me to fix the test or fold the above change as part of your PR. Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12662][SQL] Fix DataFrame.randomSplit t...

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/10626#issuecomment-169931662
  
    It is best for us to use local collection's intersect rather than relying on dataframe's.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12662][SQL] Fix DataFrame.randomSplit t...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10626#issuecomment-169591670
  
    **[Test build #48919 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48919/consoleFull)** for PR 10626 at commit [`252dbc3`](https://github.com/apache/spark/commit/252dbc3052db9e51bdd6f950f501d29004450313).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12662][SQL] Fix DataFrame.randomSplit t...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10626#issuecomment-169621780
  
    **[Test build #48924 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48924/consoleFull)** for PR 10626 at commit [`1b30119`](https://github.com/apache/spark/commit/1b30119d4524bec42bec1b1aaece3f3ab691f0e4).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org