You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by jerryshao <gi...@git.apache.org> on 2014/09/24 06:45:32 UTC

[GitHub] spark pull request: [SPARK-3032][Shuffle] Fix key comparison integ...

GitHub user jerryshao opened a pull request:

    https://github.com/apache/spark/pull/2514

    [SPARK-3032][Shuffle] Fix key comparison integer overflow introduced sorting exception

    Previous key comparison in `ExternalSorter` will get wrong sorting result or exception when key comparison overflows, details can be seen in [SPARK-3032](https://issues.apache.org/jira/browse/SPARK-3032). Here fix this and add a unit test to prove it.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jerryshao/apache-spark SPARK-3032

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/2514.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2514
    
----
commit fa2a08f52b5cd59b0016c6bb4931c40f32985881
Author: jerryshao <sa...@intel.com>
Date:   2014-09-24T04:35:02Z

    Fix key comparison integer overflow introduced sorting exception

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3032][Shuffle] Fix key comparison integ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2514#issuecomment-56624542
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20741/consoleFull) for   PR 2514 at commit [`fa2a08f`](https://github.com/apache/spark/commit/fa2a08f52b5cd59b0016c6bb4931c40f32985881).
     * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3032][Shuffle] Fix key comparison integ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2514#issuecomment-57075989
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20928/consoleFull) for   PR 2514 at commit [`6f3c302`](https://github.com/apache/spark/commit/6f3c30263560853c4cfb5b65b74bce3e39801e05).
     * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3032][Shuffle] Fix key comparison integ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2514#issuecomment-56628035
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20741/consoleFull) for   PR 2514 at commit [`fa2a08f`](https://github.com/apache/spark/commit/fa2a08f52b5cd59b0016c6bb4931c40f32985881).
     * This patch **passes** unit tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3032][Shuffle] Fix key comparison integ...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2514#discussion_r18058855
  
    --- Diff: core/src/test/scala/org/apache/spark/util/collection/ExternalSorterSuite.scala ---
    @@ -707,4 +709,63 @@ class ExternalSorterSuite extends FunSuite with LocalSparkContext with PrivateMe
           Some(agg), Some(new HashPartitioner(FEW_PARTITIONS)), None, None)
         assertDidNotBypassMergeSort(sorter4)
       }
    +
    +  test("sort without breaking sorting contracts") {
    +    val conf = createSparkConf(true)
    +    conf.set("spark.shuffle.memoryFraction", "0.001")
    +    conf.set("spark.shuffle.manager", "sort")
    +    sc = new SparkContext("local-cluster[1,1,512]", "test", conf)
    +
    +    // Using wrongHashOrdering to show integer overflow introduced exception.
    +    val rand = new Random
    --- End diff --
    
    Just seed this at a given value that you see causes the exception to occur.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3032][Shuffle] Fix key comparison integ...

Posted by jerryshao <gi...@git.apache.org>.
Github user jerryshao commented on the pull request:

    https://github.com/apache/spark/pull/2514#issuecomment-57077291
  
    Jenkins, retest this please.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3032][Shuffle] Fix key comparison integ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2514#issuecomment-56792685
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20788/consoleFull) for   PR 2514 at commit [`01911e6`](https://github.com/apache/spark/commit/01911e6f447b161d5a7a7832af417ab0407da271).
     * This patch **fails** unit tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3032][Shuffle] Fix key comparison integ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2514#issuecomment-56787444
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20788/consoleFull) for   PR 2514 at commit [`01911e6`](https://github.com/apache/spark/commit/01911e6f447b161d5a7a7832af417ab0407da271).
     * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3032][Shuffle] Fix key comparison integ...

Posted by jerryshao <gi...@git.apache.org>.
Github user jerryshao commented on the pull request:

    https://github.com/apache/spark/pull/2514#issuecomment-56789130
  
    Hi Matei, I modify the unit test to reproduce the exception. Seems it is difficult to reproduce this exception with small dataset manually, as this exception is unreliable.
    
    Here is the reason searched from [stack overflow](http://stackoverflow.com/questions/24951257/when-does-timsort-complain-about-broken-comparator)
    
    >the exception behavior is unreliable: As long as you have small data sets (so small that a generated run may never gallop, as MIN_GALLOP is 7) or the generated runs always coincidentally generate a merge that never gallops, you will never receive the exception. Thus, without further reviewing the gallopRight method, we can come to the conclusion that you cannot rely on the exception: It may never be thrown no matter how wrong your comparator is.
    
    So here I generate 1m random integer values to reproduce the exception. Seems in my local test with  above 1000 rounds of test, this exception can always be produced. But it cannot be logically proved and still have chance to not throw exception.
    
    Also I tested with 1k random integer plus some large data like Integer.MaxValue and Integer.MinValue, hardly to reproduce this exception. And with 10k dataset, 1/3 chance will get the exception.
    
    I think unless someone familiar with TimSort can manually create effectively small dataset, potentially this unit test may fail.
    
    So would you give me some suggestions? Thanks a lot.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3032][Shuffle] Fix key comparison integ...

Posted by jerryshao <gi...@git.apache.org>.
Github user jerryshao commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2514#discussion_r17954749
  
    --- Diff: core/src/test/scala/org/apache/spark/util/collection/ExternalSorterSuite.scala ---
    @@ -707,4 +707,53 @@ class ExternalSorterSuite extends FunSuite with LocalSparkContext with PrivateMe
           Some(agg), Some(new HashPartitioner(FEW_PARTITIONS)), None, None)
         assertDidNotBypassMergeSort(sorter4)
       }
    +
    +  test("sort without breaking sorting contracts") {
    +    val conf = createSparkConf(true)
    +    conf.set("spark.shuffle.memoryFraction", "0.001")
    +    conf.set("spark.shuffle.manager", "sort")
    +    sc = new SparkContext("local-cluster[1,1,512]", "test", conf)
    +
    +    val testData = Array[String](
    +      "hierarch",         // -1732884796
    +      "variants",         // -1249574770
    +      "inwork",           // -1183663690
    +      "isohel",           // -1179291542
    +      "misused"           // 1069518484
    +      )
    +    val expected = testData.map(s => (s, 200000))
    +
    +    def createCombiner(i: Int) = ArrayBuffer(i)
    +    def mergeValue(c: ArrayBuffer[Int], i: Int) = c += i
    +    def mergeCombiners(c1: ArrayBuffer[Int], c2: ArrayBuffer[Int]) = c1 ++= c2
    +
    +    val agg = new Aggregator[String, Int, ArrayBuffer[Int]](
    +      createCombiner, mergeValue, mergeCombiners)
    +
    +    // Using wrongHashOrdering to show that integer overflow will lead to wrong sort result.
    +    val wrongHashOrdering = new Ordering[String] {
    +      override def compare(a: String, b: String) = {
    +        val h1 = a.hashCode()
    +        val h2 = b.hashCode()
    +        h1 - h2
    +      }
    +    }
    +    val sorter1 = new ExternalSorter[String, Int, ArrayBuffer[Int]](
    +      None, None, Some(wrongHashOrdering), None)
    +    sorter1.insertAll(expected.iterator)
    --- End diff --
    
    Seems these test data do not throw exception as "violates its general contract", but only give the wrong results. I'm not familiar with TimSort, probably this exception is not always happening, or the test data is not enough to reproduce the exception.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3032][Shuffle] Fix key comparison integ...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2514#discussion_r17957564
  
    --- Diff: core/src/main/scala/org/apache/spark/util/collection/ExternalSorter.scala ---
    @@ -152,7 +152,7 @@ private[spark] class ExternalSorter[K, V, C](
         override def compare(a: K, b: K): Int = {
           val h1 = if (a == null) 0 else a.hashCode()
           val h2 = if (b == null) 0 else b.hashCode()
    -      h1 - h2
    +      if (h1 < h2) -1 else if (h1 == h2) 0 else 1
    --- End diff --
    
    @mateiz per my comment, that would no longer run in Java 6 as `Integer.compare` doesn't exist before Java 7.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3032][Shuffle] Fix key comparison integ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2514#issuecomment-57078715
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20931/consoleFull) for   PR 2514 at commit [`6f3c302`](https://github.com/apache/spark/commit/6f3c30263560853c4cfb5b65b74bce3e39801e05).
     * This patch **passes** unit tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class IndexedRecordToJavaConverter extends Converter[IndexedRecord, JMap[String, Any]]`
      * `class AvroWrapperToJavaConverter extends Converter[Any, Any] `



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3032][Shuffle] Fix key comparison integ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2514#issuecomment-57076884
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20928/consoleFull) for   PR 2514 at commit [`6f3c302`](https://github.com/apache/spark/commit/6f3c30263560853c4cfb5b65b74bce3e39801e05).
     * This patch **fails** unit tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class IndexedRecordToJavaConverter extends Converter[IndexedRecord, JMap[String, Any]]`
      * `class AvroWrapperToJavaConverter extends Converter[Any, Any] `



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3032][Shuffle] Fix key comparison integ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/2514#issuecomment-57076888
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/20928/


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3032][Shuffle] Fix key comparison integ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2514#issuecomment-56629822
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20742/consoleFull) for   PR 2514 at commit [`83acb38`](https://github.com/apache/spark/commit/83acb38649ef41917130d7837ab9f4177fc3262d).
     * This patch **passes** unit tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3032][Shuffle] Fix key comparison integ...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on the pull request:

    https://github.com/apache/spark/pull/2514#issuecomment-56881218
  
    If you have a test that reproduces it in 1000 runs, it's fine. I'd improve it by seeding a random generator with a fixed seed that you saw produce the problem. TimSort is deterministic, so it will always throw it in that case (though this may break if we change the implementation).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3032][Shuffle] Fix key comparison integ...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2514#discussion_r18012559
  
    --- Diff: core/src/main/scala/org/apache/spark/util/collection/ExternalSorter.scala ---
    @@ -152,7 +152,7 @@ private[spark] class ExternalSorter[K, V, C](
         override def compare(a: K, b: K): Int = {
           val h1 = if (a == null) 0 else a.hashCode()
           val h2 = if (b == null) 0 else b.hashCode()
    -      h1 - h2
    +      if (h1 < h2) -1 else if (h1 == h2) 0 else 1
    --- End diff --
    
    Alright, we can keep it as is then.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3032][Shuffle] Fix key comparison integ...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2514#discussion_r17954596
  
    --- Diff: core/src/main/scala/org/apache/spark/util/collection/ExternalSorter.scala ---
    @@ -152,7 +152,7 @@ private[spark] class ExternalSorter[K, V, C](
         override def compare(a: K, b: K): Int = {
           val h1 = if (a == null) 0 else a.hashCode()
           val h2 = if (b == null) 0 else b.hashCode()
    -      h1 - h2
    +      if (h1 < h2) -1 else if (h1 == h2) 0 else 1
    --- End diff --
    
    Don't use this, use Integer.compare instead. Reason is Java will likely optimize the latter one. You can fix it in ExternalAppendOnlyMap too.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3032][Shuffle] Fix key comparison integ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/2514#issuecomment-57078720
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/20931/


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3032][Shuffle] Fix key comparison integ...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on the pull request:

    https://github.com/apache/spark/pull/2514#issuecomment-57205650
  
    Thanks Saisai, this looks good. I've merged it.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3032][Shuffle] Fix key comparison integ...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2514#discussion_r17954612
  
    --- Diff: core/src/test/scala/org/apache/spark/util/collection/ExternalSorterSuite.scala ---
    @@ -707,4 +707,53 @@ class ExternalSorterSuite extends FunSuite with LocalSparkContext with PrivateMe
           Some(agg), Some(new HashPartitioner(FEW_PARTITIONS)), None, None)
         assertDidNotBypassMergeSort(sorter4)
       }
    +
    +  test("sort without breaking sorting contracts") {
    +    val conf = createSparkConf(true)
    +    conf.set("spark.shuffle.memoryFraction", "0.001")
    +    conf.set("spark.shuffle.manager", "sort")
    +    sc = new SparkContext("local-cluster[1,1,512]", "test", conf)
    +
    +    val testData = Array[String](
    +      "hierarch",         // -1732884796
    +      "variants",         // -1249574770
    +      "inwork",           // -1183663690
    +      "isohel",           // -1179291542
    +      "misused"           // 1069518484
    +      )
    +    val expected = testData.map(s => (s, 200000))
    +
    +    def createCombiner(i: Int) = ArrayBuffer(i)
    +    def mergeValue(c: ArrayBuffer[Int], i: Int) = c += i
    +    def mergeCombiners(c1: ArrayBuffer[Int], c2: ArrayBuffer[Int]) = c1 ++= c2
    +
    +    val agg = new Aggregator[String, Int, ArrayBuffer[Int]](
    +      createCombiner, mergeValue, mergeCombiners)
    +
    +    // Using wrongHashOrdering to show that integer overflow will lead to wrong sort result.
    +    val wrongHashOrdering = new Ordering[String] {
    +      override def compare(a: String, b: String) = {
    +        val h1 = a.hashCode()
    +        val h2 = b.hashCode()
    +        h1 - h2
    +      }
    +    }
    +    val sorter1 = new ExternalSorter[String, Int, ArrayBuffer[Int]](
    +      None, None, Some(wrongHashOrdering), None)
    +    sorter1.insertAll(expected.iterator)
    --- End diff --
    
    Why is this expected to run at all without throwing an exception? It should just fail.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3032][Shuffle] Fix key comparison integ...

Posted by jerryshao <gi...@git.apache.org>.
Github user jerryshao commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2514#discussion_r18012867
  
    --- Diff: core/src/test/scala/org/apache/spark/util/collection/ExternalSorterSuite.scala ---
    @@ -707,4 +707,53 @@ class ExternalSorterSuite extends FunSuite with LocalSparkContext with PrivateMe
           Some(agg), Some(new HashPartitioner(FEW_PARTITIONS)), None, None)
         assertDidNotBypassMergeSort(sorter4)
       }
    +
    +  test("sort without breaking sorting contracts") {
    +    val conf = createSparkConf(true)
    +    conf.set("spark.shuffle.memoryFraction", "0.001")
    +    conf.set("spark.shuffle.manager", "sort")
    +    sc = new SparkContext("local-cluster[1,1,512]", "test", conf)
    +
    +    val testData = Array[String](
    +      "hierarch",         // -1732884796
    +      "variants",         // -1249574770
    +      "inwork",           // -1183663690
    +      "isohel",           // -1179291542
    +      "misused"           // 1069518484
    +      )
    +    val expected = testData.map(s => (s, 200000))
    +
    +    def createCombiner(i: Int) = ArrayBuffer(i)
    +    def mergeValue(c: ArrayBuffer[Int], i: Int) = c += i
    +    def mergeCombiners(c1: ArrayBuffer[Int], c2: ArrayBuffer[Int]) = c1 ++= c2
    +
    +    val agg = new Aggregator[String, Int, ArrayBuffer[Int]](
    +      createCombiner, mergeValue, mergeCombiners)
    +
    +    // Using wrongHashOrdering to show that integer overflow will lead to wrong sort result.
    +    val wrongHashOrdering = new Ordering[String] {
    +      override def compare(a: String, b: String) = {
    +        val h1 = a.hashCode()
    +        val h2 = b.hashCode()
    +        h1 - h2
    +      }
    +    }
    +    val sorter1 = new ExternalSorter[String, Int, ArrayBuffer[Int]](
    +      None, None, Some(wrongHashOrdering), None)
    +    sorter1.insertAll(expected.iterator)
    --- End diff --
    
    OK, thanks for your advice,  will do it.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3032][Shuffle] Fix key comparison integ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/2514#issuecomment-56792692
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/20788/


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3032][Shuffle] Fix key comparison integ...

Posted by jerryshao <gi...@git.apache.org>.
Github user jerryshao commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2514#discussion_r17954265
  
    --- Diff: core/src/main/scala/org/apache/spark/util/collection/ExternalSorter.scala ---
    @@ -152,7 +152,7 @@ private[spark] class ExternalSorter[K, V, C](
         override def compare(a: K, b: K): Int = {
           val h1 = if (a == null) 0 else a.hashCode()
           val h2 = if (b == null) 0 else b.hashCode()
    -      h1 - h2
    +      Integer.compare(h1, h2)
    --- End diff --
    
    OK, thanks Sean.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3032][Shuffle] Fix key comparison integ...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2514#discussion_r18012545
  
    --- Diff: core/src/test/scala/org/apache/spark/util/collection/ExternalSorterSuite.scala ---
    @@ -707,4 +707,53 @@ class ExternalSorterSuite extends FunSuite with LocalSparkContext with PrivateMe
           Some(agg), Some(new HashPartitioner(FEW_PARTITIONS)), None, None)
         assertDidNotBypassMergeSort(sorter4)
       }
    +
    +  test("sort without breaking sorting contracts") {
    +    val conf = createSparkConf(true)
    +    conf.set("spark.shuffle.memoryFraction", "0.001")
    +    conf.set("spark.shuffle.manager", "sort")
    +    sc = new SparkContext("local-cluster[1,1,512]", "test", conf)
    +
    +    val testData = Array[String](
    +      "hierarch",         // -1732884796
    +      "variants",         // -1249574770
    +      "inwork",           // -1183663690
    +      "isohel",           // -1179291542
    +      "misused"           // 1069518484
    +      )
    +    val expected = testData.map(s => (s, 200000))
    +
    +    def createCombiner(i: Int) = ArrayBuffer(i)
    +    def mergeValue(c: ArrayBuffer[Int], i: Int) = c += i
    +    def mergeCombiners(c1: ArrayBuffer[Int], c2: ArrayBuffer[Int]) = c1 ++= c2
    +
    +    val agg = new Aggregator[String, Int, ArrayBuffer[Int]](
    +      createCombiner, mergeValue, mergeCombiners)
    +
    +    // Using wrongHashOrdering to show that integer overflow will lead to wrong sort result.
    +    val wrongHashOrdering = new Ordering[String] {
    +      override def compare(a: String, b: String) = {
    +        val h1 = a.hashCode()
    +        val h2 = b.hashCode()
    +        h1 - h2
    +      }
    +    }
    +    val sorter1 = new ExternalSorter[String, Int, ArrayBuffer[Int]](
    +      None, None, Some(wrongHashOrdering), None)
    +    sorter1.insertAll(expected.iterator)
    --- End diff --
    
    Well, if the test data is not enough, please add suitable test data! You got an exception in your reported version of this error, right? I think it's fairly easy to add suitable data, just put integers that include Int.MaxValue and MinValue and such (or other extreme integers). You can use a custom class with a configurable hash code (we have one in our test suites). Otherwise this is just testing that the Ordering is different.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3032][Shuffle] Fix key comparison integ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2514#issuecomment-56628042
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/20741/


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3032][Shuffle] Fix key comparison integ...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/2514


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3032][Shuffle] Fix key comparison integ...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2514#discussion_r17954025
  
    --- Diff: core/src/main/scala/org/apache/spark/util/collection/ExternalSorter.scala ---
    @@ -152,7 +152,7 @@ private[spark] class ExternalSorter[K, V, C](
         override def compare(a: K, b: K): Int = {
           val h1 = if (a == null) 0 else a.hashCode()
           val h2 = if (b == null) 0 else b.hashCode()
    -      h1 - h2
    +      Integer.compare(h1, h2)
    --- End diff --
    
    This method does not exist in Java 6. There is an equivalent in Guava or you can just write the comparisons directly. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3032][Shuffle] Fix key comparison integ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2514#issuecomment-56629824
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/20742/


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3032][Shuffle] Fix key comparison integ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2514#issuecomment-56625977
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20742/consoleFull) for   PR 2514 at commit [`83acb38`](https://github.com/apache/spark/commit/83acb38649ef41917130d7837ab9f4177fc3262d).
     * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3032][Shuffle] Fix key comparison integ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2514#issuecomment-57077406
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20931/consoleFull) for   PR 2514 at commit [`6f3c302`](https://github.com/apache/spark/commit/6f3c30263560853c4cfb5b65b74bce3e39801e05).
     * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3032][Shuffle] Fix key comparison integ...

Posted by jerryshao <gi...@git.apache.org>.
Github user jerryshao commented on the pull request:

    https://github.com/apache/spark/pull/2514#issuecomment-57076000
  
    Hi Matei, thanks a lot for your suggestions. I've updated the code with fixed seed. Would you mind taking a look at this? Thanks a lot.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3032][Shuffle] Fix key comparison integ...

Posted by jerryshao <gi...@git.apache.org>.
Github user jerryshao commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2514#discussion_r17954666
  
    --- Diff: core/src/main/scala/org/apache/spark/util/collection/ExternalSorter.scala ---
    @@ -152,7 +152,7 @@ private[spark] class ExternalSorter[K, V, C](
         override def compare(a: K, b: K): Int = {
           val h1 = if (a == null) 0 else a.hashCode()
           val h2 = if (b == null) 0 else b.hashCode()
    -      h1 - h2
    +      if (h1 < h2) -1 else if (h1 == h2) 0 else 1
    --- End diff --
    
    So what about Java 6 compatibility?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org