You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by bersprockets <gi...@git.apache.org> on 2018/08/11 16:19:26 UTC

[GitHub] spark pull request #22079: [SPARK-23207][SQL][BACKPORT-2.2] Shuffle+Repartit...

GitHub user bersprockets opened a pull request:

    https://github.com/apache/spark/pull/22079

    [SPARK-23207][SQL][BACKPORT-2.2] Shuffle+Repartition on a DataFrame could lead to incorrect answers

    ## What changes were proposed in this pull request?
    
    Currently shuffle repartition uses RoundRobinPartitioning, the generated result is nondeterministic since the sequence of input rows are not determined.
    
    The bug can be triggered when there is a repartition call following a shuffle (which would lead to non-deterministic row ordering), as the pattern shows below:
    upstream stage -> repartition stage -> result stage
    (-> indicate a shuffle)
    When one of the executors process goes down, some tasks on the repartition stage will be retried and generate inconsistent ordering, and some tasks of the result stage will be retried generating different data.
    
    The following code returns 931532, instead of 1000000:
    ```
    import scala.sys.process._
    
    import org.apache.spark.TaskContext
    val res = spark.range(0, 1000 * 1000, 1).repartition(200).map { x =>
      x
    }.repartition(200).map { x =>
      if (TaskContext.get.attemptNumber == 0 && TaskContext.get.partitionId < 2) {
        throw new Exception("pkill -f java".!!)
      }
      x
    }
    res.distinct().count()
    ```
    
    In this PR, we propose a most straight-forward way to fix this problem by performing a local sort before partitioning, after we make the input row ordering deterministic, the function from rows to partitions is fully deterministic too.
    
    The downside of the approach is that with extra local sort inserted, the performance of repartition() will go down, so we add a new config named `spark.sql.execution.sortBeforeRepartition` to control whether this patch is applied. The patch is default enabled to be safe-by-default, but user may choose to manually turn it off to avoid performance regression.
    
    This patch also changes the output rows ordering of repartition(), that leads to a bunch of test cases failure because they are comparing the results directly.
    
    Add unit test in ExchangeSuite.
    
    With this patch(and `spark.sql.execution.sortBeforeRepartition` set to true), the following query returns 1000000:
    ```
    import scala.sys.process._
    
    import org.apache.spark.TaskContext
    
    spark.conf.set("spark.sql.execution.sortBeforeRepartition", "true")
    
    val res = spark.range(0, 1000 * 1000, 1).repartition(200).map { x =>
      x
    }.repartition(200).map { x =>
      if (TaskContext.get.attemptNumber == 0 && TaskContext.get.partitionId < 2) {
        throw new Exception("pkill -f java".!!)
      }
      x
    }
    res.distinct().count()
    
    res7: Long = 1000000
    ```
    
    Author: Xingbo Jiang <xi...@databricks.com>
    
    ## How was this patch tested?
    
    Ran all SBT unit tests for org.apache.spark.sql.*.
    
    Ran pyspark tests for module pyspark-sql.
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/bersprockets/spark SPARK-23207

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22079.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #22079
    
----
commit efccc028bce64bf4754ce81ee16533c19b4384b2
Author: Xingbo Jiang <xi...@...>
Date:   2018-01-26T23:01:03Z

    [SPARK-23207][SQL] Shuffle+Repartition on a DataFrame could lead to incorrect answers
    
    Currently shuffle repartition uses RoundRobinPartitioning, the generated result is nondeterministic since the sequence of input rows are not determined.
    
    The bug can be triggered when there is a repartition call following a shuffle (which would lead to non-deterministic row ordering), as the pattern shows below:
    upstream stage -> repartition stage -> result stage
    (-> indicate a shuffle)
    When one of the executors process goes down, some tasks on the repartition stage will be retried and generate inconsistent ordering, and some tasks of the result stage will be retried generating different data.
    
    The following code returns 931532, instead of 1000000:
    ```
    import scala.sys.process._
    
    import org.apache.spark.TaskContext
    val res = spark.range(0, 1000 * 1000, 1).repartition(200).map { x =>
      x
    }.repartition(200).map { x =>
      if (TaskContext.get.attemptNumber == 0 && TaskContext.get.partitionId < 2) {
        throw new Exception("pkill -f java".!!)
      }
      x
    }
    res.distinct().count()
    ```
    
    In this PR, we propose a most straight-forward way to fix this problem by performing a local sort before partitioning, after we make the input row ordering deterministic, the function from rows to partitions is fully deterministic too.
    
    The downside of the approach is that with extra local sort inserted, the performance of repartition() will go down, so we add a new config named `spark.sql.execution.sortBeforeRepartition` to control whether this patch is applied. The patch is default enabled to be safe-by-default, but user may choose to manually turn it off to avoid performance regression.
    
    This patch also changes the output rows ordering of repartition(), that leads to a bunch of test cases failure because they are comparing the results directly.
    
    Add unit test in ExchangeSuite.
    
    With this patch(and `spark.sql.execution.sortBeforeRepartition` set to true), the following query returns 1000000:
    ```
    import scala.sys.process._
    
    import org.apache.spark.TaskContext
    
    spark.conf.set("spark.sql.execution.sortBeforeRepartition", "true")
    
    val res = spark.range(0, 1000 * 1000, 1).repartition(200).map { x =>
      x
    }.repartition(200).map { x =>
      if (TaskContext.get.attemptNumber == 0 && TaskContext.get.partitionId < 2) {
        throw new Exception("pkill -f java".!!)
      }
      x
    }
    res.distinct().count()
    
    res7: Long = 1000000
    ```
    
    Author: Xingbo Jiang <xi...@databricks.com>
    
    Closes #20393 from jiangxb1987/shuffle-repartition.

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22079: [SPARK-23207][SQL][BACKPORT-2.2] Shuffle+Repartition on ...

Posted by bersprockets <gi...@git.apache.org>.
Github user bersprockets commented on the issue:

    https://github.com/apache/spark/pull/22079
  
    The test "model load / save" in ChiSqSelectorSuite fails because of this line in 
    [ChiSqSelector.scala](https://github.com/apache/spark/blob/branch-2.2/mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala#L147)
    
    <pre>
    spark.createDataFrame(dataArray).repartition(1).write.parquet(Loader.dataPath(path))
    </pre>
    
    In 2.4, the line is:
    
    <pre>
    spark.createDataFrame(sc.makeRDD(dataArray, 1)).write.parquet(Loader.dataPath(path))
    </pre>
    
    If you change 2.4 to also have that line, and also remove the follow-up PR (#20426) to avoid sorting when there is one partition, this test also fails on 2.4 in the same way.
    
    So I am not sure which way to go: Update ChiSqSelector.scala to be like 2.4 (simply a one line change), or make the test accept this new order.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22079: [SPARK-23207][SPARK-22905][SQL][BACKPORT-2.2] Shuffle+Re...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/22079
  
    The PR https://github.com/apache/spark/pull/22101 has been merged. Please merge the latest one to this PR. Thanks!


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22079: [SPARK-23207][SQL][BACKPORT-2.2] Shuffle+Repartition on ...

Posted by jiangxb1987 <gi...@git.apache.org>.
Github user jiangxb1987 commented on the issue:

    https://github.com/apache/spark/pull/22079
  
    We shall also include #20088 in this backport PR.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22079: [SPARK-23207][SQL][BACKPORT-2.2] Shuffle+Repartit...

Posted by mridulm <gi...@git.apache.org>.
Github user mridulm commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22079#discussion_r209705612
  
    --- Diff: sql/catalyst/src/main/java/org/apache/spark/sql/execution/RecordBinaryComparator.java ---
    @@ -0,0 +1,70 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.execution;
    +
    +import org.apache.spark.unsafe.Platform;
    +import org.apache.spark.util.collection.unsafe.sort.RecordComparator;
    +
    +public final class RecordBinaryComparator extends RecordComparator {
    +
    +  // TODO(jiangxb) Add test suite for this.
    +  @Override
    +  public int compare(
    +      Object leftObj, long leftOff, int leftLen, Object rightObj, long rightOff, int rightLen) {
    +    int i = 0;
    +    int res = 0;
    +
    +    // If the arrays have different length, the longer one is larger.
    +    if (leftLen != rightLen) {
    +      return leftLen - rightLen;
    +    }
    +
    +    // The following logic uses `leftLen` as the length for both `leftObj` and `rightObj`, since
    +    // we have guaranteed `leftLen` == `rightLen`.
    +
    +    // check if stars align and we can get both offsets to be aligned
    +    if ((leftOff % 8) == (rightOff % 8)) {
    +      while ((leftOff + i) % 8 != 0 && i < leftLen) {
    +        res = (Platform.getByte(leftObj, leftOff + i) & 0xff) -
    +                (Platform.getByte(rightObj, rightOff + i) & 0xff);
    +        if (res != 0) return res;
    +        i += 1;
    +      }
    +    }
    +    // for architectures that support unaligned accesses, chew it up 8 bytes at a time
    +    if (Platform.unaligned() || (((leftOff + i) % 8 == 0) && ((rightOff + i) % 8 == 0))) {
    +      while (i <= leftLen - 8) {
    +        res = (int) ((Platform.getLong(leftObj, leftOff + i) -
    +                Platform.getLong(rightObj, rightOff + i)) % Integer.MAX_VALUE);
    +        if (res != 0) return res;
    --- End diff --
    
    It is possible for two objects to be unequal and yet we consider them as equal with this code, if the long values are separated by Int.MaxValue.
    Any particular reason why this code is written like this with the inaccuracy and not simply the idiomatic comparator comparison ?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22079: [SPARK-23207][SPARK-22905][SPARK-24564][SPARK-25114][SQL...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/22079
  
    Thanks! Merged to 2.2


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22079: [SPARK-23207][SPARK-22905][SPARK-24564][SPARK-25114][SQL...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22079
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95053/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22079: [SPARK-23207][SPARK-22905][SQL][BACKPORT-2.2] Shuffle+Re...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22079
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22079: [SPARK-23207][SQL][BACKPORT-2.2] Shuffle+Repartition on ...

Posted by jiangxb1987 <gi...@git.apache.org>.
Github user jiangxb1987 commented on the issue:

    https://github.com/apache/spark/pull/22079
  
    Both seems fine to me, it's just a minor improvement. Normally we don't backport a improvement, but since it's a simple and small change I'm confident it is safe to also include the change in a backport PR.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22079: [SPARK-23207][SQL][BACKPORT-2.2] Shuffle+Repartition on ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22079
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94633/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22079: [SPARK-23207][SQL][BACKPORT-2.2] Shuffle+Repartit...

Posted by bersprockets <gi...@git.apache.org>.
Github user bersprockets commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22079#discussion_r209736691
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala ---
    @@ -144,7 +144,7 @@ object ChiSqSelectorModel extends Loader[ChiSqSelectorModel] {
           val dataArray = Array.tabulate(model.selectedFeatures.length) { i =>
             Data(model.selectedFeatures(i))
           }
    -      spark.createDataFrame(dataArray).repartition(1).write.parquet(Loader.dataPath(path))
    +      spark.createDataFrame(sc.makeRDD(dataArray, 1)).write.parquet(Loader.dataPath(path))
    --- End diff --
    
    @dongjoon-hyun Thanks, I will include the other commit and also update the title.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22079: [SPARK-23207][SQL][BACKPORT-2.2] Shuffle+Repartition on ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22079
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94656/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22079: [SPARK-23207][SPARK-22905][SPARK-24564][SPARK-25114][SQL...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22079
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22079: [SPARK-23207][SPARK-22905][SPARK-24564][SPARK-25114][SQL...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/22079
  
    @bersprockets Could you please close this PR?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22079: [SPARK-23207][SPARK-22905][SPARK-24564][SPARK-25114][SQL...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22079
  
    **[Test build #95054 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95054/testReport)** for PR 22079 at commit [`2edad85`](https://github.com/apache/spark/commit/2edad85c39f0fbfea8c8361e9ee420abc6fe4202).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22079: [SPARK-23207][SPARK-22905][SQL][BACKPORT-2.2] Shuffle+Re...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22079
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94701/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22079: [SPARK-23207][SPARK-22905][SQL][BACKPORT-2.2] Shu...

Posted by jiangxb1987 <gi...@git.apache.org>.
Github user jiangxb1987 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22079#discussion_r209822194
  
    --- Diff: sql/catalyst/src/main/java/org/apache/spark/sql/execution/RecordBinaryComparator.java ---
    @@ -0,0 +1,70 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.execution;
    +
    +import org.apache.spark.unsafe.Platform;
    +import org.apache.spark.util.collection.unsafe.sort.RecordComparator;
    +
    +public final class RecordBinaryComparator extends RecordComparator {
    +
    +  // TODO(jiangxb) Add test suite for this.
    +  @Override
    +  public int compare(
    +      Object leftObj, long leftOff, int leftLen, Object rightObj, long rightOff, int rightLen) {
    +    int i = 0;
    +    int res = 0;
    +
    +    // If the arrays have different length, the longer one is larger.
    +    if (leftLen != rightLen) {
    +      return leftLen - rightLen;
    +    }
    +
    +    // The following logic uses `leftLen` as the length for both `leftObj` and `rightObj`, since
    +    // we have guaranteed `leftLen` == `rightLen`.
    +
    +    // check if stars align and we can get both offsets to be aligned
    +    if ((leftOff % 8) == (rightOff % 8)) {
    +      while ((leftOff + i) % 8 != 0 && i < leftLen) {
    +        res = (Platform.getByte(leftObj, leftOff + i) & 0xff) -
    +                (Platform.getByte(rightObj, rightOff + i) & 0xff);
    +        if (res != 0) return res;
    +        i += 1;
    +      }
    +    }
    +    // for architectures that support unaligned accesses, chew it up 8 bytes at a time
    +    if (Platform.unaligned() || (((leftOff + i) % 8 == 0) && ((rightOff + i) % 8 == 0))) {
    +      while (i <= leftLen - 8) {
    +        res = (int) ((Platform.getLong(leftObj, leftOff + i) -
    +                Platform.getLong(rightObj, rightOff + i)) % Integer.MAX_VALUE);
    +        if (res != 0) return res;
    --- End diff --
    
    As far as I can recall It's actually a mistake. At first the function returned a Long value and later I changed the return value to Integer, let me fix it. Thanks for discovering this @mridulm !


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22079: [SPARK-23207][SPARK-22905][SQL][BACKPORT-2.2] Shuffle+Re...

Posted by squito <gi...@git.apache.org>.
Github user squito commented on the issue:

    https://github.com/apache/spark/pull/22079
  
    overall I'm in favor of backporting this, and it looks like the only changes to the original were very small, so I'm in favor of this.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22079: [SPARK-23207][SQL][BACKPORT-2.2] Shuffle+Repartition on ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22079
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22079: [SPARK-23207][SPARK-22905][SPARK-24564][SPARK-251...

Posted by bersprockets <gi...@git.apache.org>.
Github user bersprockets closed the pull request at:

    https://github.com/apache/spark/pull/22079


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22079: [SPARK-23207][SPARK-22905][SQL][BACKPORT-2.2] Shuffle+Re...

Posted by bersprockets <gi...@git.apache.org>.
Github user bersprockets commented on the issue:

    https://github.com/apache/spark/pull/22079
  
    @gatorsmile So I should include all the related PRs merged to master as a single PR here? Just verifying.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22079: [SPARK-23207][SPARK-22905][SPARK-24564][SPARK-25114][SQL...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/22079
  
    cc @jiangxb1987 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22079: [SPARK-23207][SQL][BACKPORT-2.2] Shuffle+Repartition on ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22079
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22079: [SPARK-23207][SPARK-22905][SPARK-24564][SPARK-25114][SQL...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22079
  
    **[Test build #95053 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95053/testReport)** for PR 22079 at commit [`2edad85`](https://github.com/apache/spark/commit/2edad85c39f0fbfea8c8361e9ee420abc6fe4202).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22079: [SPARK-23207][SPARK-22905][SPARK-24564][SPARK-25114][SQL...

Posted by jiangxb1987 <gi...@git.apache.org>.
Github user jiangxb1987 commented on the issue:

    https://github.com/apache/spark/pull/22079
  
    LGTM, thanks!


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22079: [SPARK-23207][SQL][BACKPORT-2.2] Shuffle+Repartition on ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22079
  
    Can one of the admins verify this patch?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22079: [SPARK-23207][SPARK-22905][SQL][BACKPORT-2.2] Shuffle+Re...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22079
  
    **[Test build #94701 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94701/testReport)** for PR 22079 at commit [`81f57fe`](https://github.com/apache/spark/commit/81f57febfa5f81cd41ac7803eb9d8931df88fc5c).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22079: [SPARK-23207][SQL][BACKPORT-2.2] Shuffle+Repartit...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22079#discussion_r209731698
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala ---
    @@ -144,7 +144,7 @@ object ChiSqSelectorModel extends Loader[ChiSqSelectorModel] {
           val dataArray = Array.tabulate(model.selectedFeatures.length) { i =>
             Data(model.selectedFeatures(i))
           }
    -      spark.createDataFrame(dataArray).repartition(1).write.parquet(Loader.dataPath(path))
    +      spark.createDataFrame(sc.makeRDD(dataArray, 1)).write.parquet(Loader.dataPath(path))
    --- End diff --
    
    @jiangxb1987 and @bersprockets . SPARK-22905 consists of two commits. 
    - ChiSqSelector (https://github.com/apache/spark/pull/20088)
    - GaussianMixtureModel (https://github.com/apache/spark/pull/20113)
    
    If we want to include SPARK-22905 here, it had better be explicit and complete by putting `[SPARK-22905]` into the PR title and includes both patches.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22079: [SPARK-23207][SQL][BACKPORT-2.2] Shuffle+Repartition on ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22079
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94619/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22079: [SPARK-23207][SQL][BACKPORT-2.2] Shuffle+Repartition on ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22079
  
    **[Test build #94619 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94619/testReport)** for PR 22079 at commit [`efccc02`](https://github.com/apache/spark/commit/efccc028bce64bf4754ce81ee16533c19b4384b2).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `public final class RecordBinaryComparator extends RecordComparator `


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22079: [SPARK-23207][SQL][BACKPORT-2.2] Shuffle+Repartition on ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22079
  
    **[Test build #94656 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94656/testReport)** for PR 22079 at commit [`8d2d558`](https://github.com/apache/spark/commit/8d2d5585b2c2832cd4d88b3851607ce15180cca5).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22079: [SPARK-23207][SQL][BACKPORT-2.2] Shuffle+Repartition on ...

Posted by bersprockets <gi...@git.apache.org>.
Github user bersprockets commented on the issue:

    https://github.com/apache/spark/pull/22079
  
    Hmmm... I somehow managed to break SparkR tests but fixing a comment. It seems to have auto-retried and broke the second time too.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22079: [SPARK-23207][SPARK-22905][SQL][BACKPORT-2.2] Shuffle+Re...

Posted by bersprockets <gi...@git.apache.org>.
Github user bersprockets commented on the issue:

    https://github.com/apache/spark/pull/22079
  
    Once this is merged, I will also back-port:
    
    - [[SPARK-24564][TEST] Add test suite for RecordBinaryComparator](https://github.com/apache/spark/commit/5b0596648854c0c733b7c607661b78af7df18b89) 
    - #22101



---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22079: [SPARK-23207][SQL][BACKPORT-2.2] Shuffle+Repartition on ...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/22079
  
    cc @jiangxb1987 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22079: [SPARK-23207][SPARK-22905][SQL][BACKPORT-2.2] Shuffle+Re...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22079
  
    **[Test build #94704 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94704/testReport)** for PR 22079 at commit [`bab8e68`](https://github.com/apache/spark/commit/bab8e68bc292a3a71ee378a839fd540dcf0a72bd).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22079: [SPARK-23207][SPARK-22905][SPARK-24564][SPARK-25114][SQL...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22079
  
    **[Test build #95053 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95053/testReport)** for PR 22079 at commit [`2edad85`](https://github.com/apache/spark/commit/2edad85c39f0fbfea8c8361e9ee420abc6fe4202).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22079: [SPARK-23207][SPARK-22905][SQL][BACKPORT-2.2] Shuffle+Re...

Posted by bersprockets <gi...@git.apache.org>.
Github user bersprockets commented on the issue:

    https://github.com/apache/spark/pull/22079
  
    @jiangxb1987 
    
    Here are some of the differences from the original PR
    - I also ported the follow up PR #20426
    - I ported #20088 (for SPARK-22905) to get the tests to pass. I also ported its followup, #20113.
    - sql/catalyst/src/main/java/org/apache/spark/sql/execution/UnsafeExternalRowSorter.java line 112
    UnsafeExternalSorter.create does not take a Supplier for the 5th argument, so I put get() on the argument to directly pass a RecordComparator object.
    - sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/ShuffleExchange.scala: I access SQLConf differently (since SQLConf.get doesn't appear to work in 2.2).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22079: [SPARK-23207][SQL][BACKPORT-2.2] Shuffle+Repartition on ...

Posted by bersprockets <gi...@git.apache.org>.
Github user bersprockets commented on the issue:

    https://github.com/apache/spark/pull/22079
  
    @jiangxb1987 
    
    > We shall also include #20088 in this backport PR.
    
    I did that shortly after commenting, which allowed the tests to pass. I squashed it into the first commit, so it wasn't obvious I did it.
    
    Should I also include #20426 in this PR, or treat that separately?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22079: [SPARK-23207][SQL][BACKPORT-2.2] Shuffle+Repartition on ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22079
  
    **[Test build #94656 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94656/testReport)** for PR 22079 at commit [`8d2d558`](https://github.com/apache/spark/commit/8d2d5585b2c2832cd4d88b3851607ce15180cca5).
     * This patch **fails SparkR unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22079: [SPARK-23207][SPARK-22905][SPARK-24564][SPARK-25114][SQL...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/22079
  
    https://github.com/apache/spark/commit/d7c3aae2074b3dd3923dd754c0a3c97308c66893 Done


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22079: [SPARK-23207][SQL][BACKPORT-2.2] Shuffle+Repartition on ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22079
  
    **[Test build #94704 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94704/testReport)** for PR 22079 at commit [`bab8e68`](https://github.com/apache/spark/commit/bab8e68bc292a3a71ee378a839fd540dcf0a72bd).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22079: [SPARK-23207][SPARK-22905][SQL][BACKPORT-2.2] Shuffle+Re...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22079
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94704/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22079: [SPARK-23207][SQL][BACKPORT-2.2] Shuffle+Repartition on ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22079
  
    **[Test build #94701 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94701/testReport)** for PR 22079 at commit [`81f57fe`](https://github.com/apache/spark/commit/81f57febfa5f81cd41ac7803eb9d8931df88fc5c).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22079: [SPARK-23207][SQL][BACKPORT-2.2] Shuffle+Repartition on ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22079
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94665/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22079: [SPARK-23207][SQL][BACKPORT-2.2] Shuffle+Repartition on ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22079
  
    Can one of the admins verify this patch?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22079: [SPARK-23207][SQL][BACKPORT-2.2] Shuffle+Repartition on ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22079
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22079: [SPARK-23207][SPARK-22905][SQL][BACKPORT-2.2] Shu...

Posted by squito <gi...@git.apache.org>.
Github user squito commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22079#discussion_r209815627
  
    --- Diff: sql/catalyst/src/main/java/org/apache/spark/sql/execution/RecordBinaryComparator.java ---
    @@ -0,0 +1,70 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.execution;
    +
    +import org.apache.spark.unsafe.Platform;
    +import org.apache.spark.util.collection.unsafe.sort.RecordComparator;
    +
    +public final class RecordBinaryComparator extends RecordComparator {
    +
    +  // TODO(jiangxb) Add test suite for this.
    +  @Override
    +  public int compare(
    +      Object leftObj, long leftOff, int leftLen, Object rightObj, long rightOff, int rightLen) {
    +    int i = 0;
    +    int res = 0;
    +
    +    // If the arrays have different length, the longer one is larger.
    +    if (leftLen != rightLen) {
    +      return leftLen - rightLen;
    +    }
    +
    +    // The following logic uses `leftLen` as the length for both `leftObj` and `rightObj`, since
    +    // we have guaranteed `leftLen` == `rightLen`.
    +
    +    // check if stars align and we can get both offsets to be aligned
    +    if ((leftOff % 8) == (rightOff % 8)) {
    +      while ((leftOff + i) % 8 != 0 && i < leftLen) {
    +        res = (Platform.getByte(leftObj, leftOff + i) & 0xff) -
    +                (Platform.getByte(rightObj, rightOff + i) & 0xff);
    +        if (res != 0) return res;
    +        i += 1;
    +      }
    +    }
    +    // for architectures that support unaligned accesses, chew it up 8 bytes at a time
    +    if (Platform.unaligned() || (((leftOff + i) % 8 == 0) && ((rightOff + i) % 8 == 0))) {
    +      while (i <= leftLen - 8) {
    +        res = (int) ((Platform.getLong(leftObj, leftOff + i) -
    +                Platform.getLong(rightObj, rightOff + i)) % Integer.MAX_VALUE);
    +        if (res != 0) return res;
    --- End diff --
    
    good question, @jiangxb1987 any thoughts on this?  (it was this way in the original)


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22079: [SPARK-23207][SPARK-22905][SPARK-24564][SPARK-25114][SQL...

Posted by bersprockets <gi...@git.apache.org>.
Github user bersprockets commented on the issue:

    https://github.com/apache/spark/pull/22079
  
    @gatorsmile Weird, I don't see it on branch-2.2. Is that a sync issue?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22079: [SPARK-23207][SPARK-22905][SPARK-24564][SPARK-25114][SQL...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22079
  
    **[Test build #95054 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95054/testReport)** for PR 22079 at commit [`2edad85`](https://github.com/apache/spark/commit/2edad85c39f0fbfea8c8361e9ee420abc6fe4202).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22079: [SPARK-23207][SQL][BACKPORT-2.2] Shuffle+Repartition on ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22079
  
    **[Test build #94665 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94665/testReport)** for PR 22079 at commit [`8d2d558`](https://github.com/apache/spark/commit/8d2d5585b2c2832cd4d88b3851607ce15180cca5).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22079: [SPARK-23207][SQL][BACKPORT-2.2] Shuffle+Repartition on ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22079
  
    **[Test build #94665 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94665/testReport)** for PR 22079 at commit [`8d2d558`](https://github.com/apache/spark/commit/8d2d5585b2c2832cd4d88b3851607ce15180cca5).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22079: [SPARK-23207][SQL][BACKPORT-2.2] Shuffle+Repartition on ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22079
  
    **[Test build #94633 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94633/testReport)** for PR 22079 at commit [`495cba5`](https://github.com/apache/spark/commit/495cba55aee0223daef089fc8513962997468f77).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22079: [SPARK-23207][SPARK-22905][SQL][BACKPORT-2.2] Shuffle+Re...

Posted by bersprockets <gi...@git.apache.org>.
Github user bersprockets commented on the issue:

    https://github.com/apache/spark/pull/22079
  
    @jiangxb1987 gentle ping.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22079: [SPARK-23207][SQL][BACKPORT-2.2] Shuffle+Repartition on ...

Posted by bersprockets <gi...@git.apache.org>.
Github user bersprockets commented on the issue:

    https://github.com/apache/spark/pull/22079
  
    retest this please


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22079: [SPARK-23207][SPARK-22905][SQL][BACKPORT-2.2] Shuffle+Re...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22079
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22079: [SPARK-23207][SQL][BACKPORT-2.2] Shuffle+Repartition on ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22079
  
    **[Test build #94633 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94633/testReport)** for PR 22079 at commit [`495cba5`](https://github.com/apache/spark/commit/495cba55aee0223daef089fc8513962997468f77).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `public final class RecordBinaryComparator extends RecordComparator `


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22079: [SPARK-23207][SQL][BACKPORT-2.2] Shuffle+Repartition on ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22079
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22079: [SPARK-23207][SPARK-22905][SPARK-24564][SPARK-25114][SQL...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22079
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95054/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22079: [SPARK-23207][SQL][BACKPORT-2.2] Shuffle+Repartition on ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22079
  
    **[Test build #94619 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94619/testReport)** for PR 22079 at commit [`efccc02`](https://github.com/apache/spark/commit/efccc028bce64bf4754ce81ee16533c19b4384b2).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22079: [SPARK-23207][SQL][BACKPORT-2.2] Shuffle+Repartition on ...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/22079
  
    The original fix https://github.com/apache/spark/pull/22079/commits/efccc028bce64bf4754ce81ee16533c19b4384b2 has been merged to Spark 2.3. After 5+ months, we have not received any correctness regression that is caused by this fix, although this fix definitely will introduce a performance regression. I think we should merge it. The general risk is not very high and it resolves a serious correctness bug. 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22079: [SPARK-23207][SPARK-22905][SPARK-24564][SPARK-25114][SQL...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22079
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22079: [SPARK-23207][SPARK-22905][SPARK-24564][SPARK-25114][SQL...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/22079
  
    ok to test


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org