You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by kanzhang <gi...@git.apache.org> on 2014/05/14 23:28:43 UTC

[GitHub] spark pull request: [SPARK-1837] NumericRange should be partitione...

GitHub user kanzhang opened a pull request:

    https://github.com/apache/spark/pull/776

    [SPARK-1837] NumericRange should be partitioned in the same way as other...

    ... sequences

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/kanzhang/spark SPARK-1837

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/776.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #776
    
----
commit 403f9b1af8f6381035f4d863b9d0720817a227be
Author: Kan Zhang <kz...@apache.org>
Date:   2014-05-14T21:25:51Z

    [SPARK-1837] NumericRange should be partitioned in the same way as other sequences

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1837] NumericRange should be partitione...

Posted by kanzhang <gi...@git.apache.org>.

Github user kanzhang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/776#discussion_r13163063
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/ParallelCollectionRDD.scala ---
    @@ -128,18 +137,17 @@ private object ParallelCollectionRDD {
               r.start, r.end + sign, r.step).asInstanceOf[Seq[T]], numSlices)
           }
           case r: Range => {
    -        (0 until numSlices).map(i => {
    -          val start = ((i * r.length.toLong) / numSlices).toInt
    -          val end = (((i + 1) * r.length.toLong) / numSlices).toInt
    -          new Range(r.start + start * r.step, r.start + end * r.step, r.step)
    +        positions(r.length, numSlices).map({
    +          case (start, end) =>
    +            new Range(r.start + start * r.step, r.start + end * r.step, r.step)
             }).asInstanceOf[Seq[Seq[T]]]
           }
           case nr: NumericRange[_] => {
             // For ranges of Long, Double, BigInteger, etc
             val slices = new ArrayBuffer[Seq[T]](numSlices)
    -        val sliceSize = (nr.size + numSlices - 1) / numSlices // Round up to catch everything
             var r = nr
    -        for (i <- 0 until numSlices) {
    +        for ((start, end) <- positions(nr.length, numSlices)) {
    +          val sliceSize = end - start
               slices += r.take(sliceSize).asInstanceOf[Seq[T]]
    --- End diff --
    
    I'd wait for Scala to fix it (or I'd fix it in Scala).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1837] NumericRange should be partitione...

Posted by kanzhang <gi...@git.apache.org>.

Github user kanzhang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/776#discussion_r13269342
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/ParallelCollectionRDD.scala ---
    @@ -117,6 +117,15 @@ private object ParallelCollectionRDD {
         if (numSlices < 1) {
           throw new IllegalArgumentException("Positive number of slices required")
         }
    +    // Sequences need to be sliced at same positions for operations
    +    // like RDD.zip() to behave as expected
    +    def positions(length: Long, numSlices: Int) = {
    --- End diff --
    
    > This needs an explicit return type (e.g. : Seq[(Int, Int)])
    For binary compatibility?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1837] NumericRange should be partitione...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/776#issuecomment-46098058
  
     Merged build triggered. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1837] NumericRange should be partitione...

Posted by mateiz <gi...@git.apache.org>.

Github user mateiz commented on the pull request:

    https://github.com/apache/spark/pull/776#issuecomment-43395120
  
    Jenkins, this is ok to test


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1837] NumericRange should be partitione...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/776#issuecomment-46099076
  
    Merged build finished. All automated tests passed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1837] NumericRange should be partitione...

Posted by mateiz <gi...@git.apache.org>.

Github user mateiz commented on the pull request:

    https://github.com/apache/spark/pull/776#issuecomment-45015310
  
    Jenkins, retest this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1837] NumericRange should be partitione...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/776#issuecomment-43395161
  
    Merged build started. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1837] NumericRange should be partitione...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/776#issuecomment-45020744
  
    All automated tests passed.
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15399/


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1837] NumericRange should be partitione...

Posted by mateiz <gi...@git.apache.org>.

Github user mateiz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/776#discussion_r13269332
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/ParallelCollectionRDD.scala ---
    @@ -128,18 +137,17 @@ private object ParallelCollectionRDD {
               r.start, r.end + sign, r.step).asInstanceOf[Seq[T]], numSlices)
           }
           case r: Range => {
    -        (0 until numSlices).map(i => {
    -          val start = ((i * r.length.toLong) / numSlices).toInt
    -          val end = (((i + 1) * r.length.toLong) / numSlices).toInt
    -          new Range(r.start + start * r.step, r.start + end * r.step, r.step)
    +        positions(r.length, numSlices).map({
    +          case (start, end) =>
    +            new Range(r.start + start * r.step, r.start + end * r.step, r.step)
             }).asInstanceOf[Seq[Seq[T]]]
           }
           case nr: NumericRange[_] => {
             // For ranges of Long, Double, BigInteger, etc
             val slices = new ArrayBuffer[Seq[T]](numSlices)
    -        val sliceSize = (nr.size + numSlices - 1) / numSlices // Round up to catch everything
             var r = nr
    -        for (i <- 0 until numSlices) {
    +        for ((start, end) <- positions(nr.length, numSlices)) {
    +          val sliceSize = end - start
               slices += r.take(sliceSize).asInstanceOf[Seq[T]]
    --- End diff --
    
    Okay, that makes sense then; I didn't realize that we were already using `drop` and `take`. In that case we should merge this patch as is and maybe create a JIRA for Double ranges so people see it's a known issue. Made one other small comment on the patch.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1837] NumericRange should be partitione...

Posted by witgo <gi...@git.apache.org>.

Github user witgo commented on a diff in the pull request:

    https://github.com/apache/spark/pull/776#discussion_r12675655
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/ParallelCollectionRDD.scala ---
    @@ -128,18 +137,17 @@ private object ParallelCollectionRDD {
               r.start, r.end + sign, r.step).asInstanceOf[Seq[T]], numSlices)
           }
           case r: Range => {
    -        (0 until numSlices).map(i => {
    -          val start = ((i * r.length.toLong) / numSlices).toInt
    -          val end = (((i + 1) * r.length.toLong) / numSlices).toInt
    -          new Range(r.start + start * r.step, r.start + end * r.step, r.step)
    +        positions(r.length, numSlices).map({
    +          case (start, end) =>
    +            new Range(r.start + start * r.step, r.start + end * r.step, r.step)
             }).asInstanceOf[Seq[Seq[T]]]
           }
           case nr: NumericRange[_] => {
             // For ranges of Long, Double, BigInteger, etc
             val slices = new ArrayBuffer[Seq[T]](numSlices)
    -        val sliceSize = (nr.size + numSlices - 1) / numSlices // Round up to catch everything
             var r = nr
    -        for (i <- 0 until numSlices) {
    +        for ((start, end) <- positions(nr.length, numSlices)) {
    +          val sliceSize = end - start
               slices += r.take(sliceSize).asInstanceOf[Seq[T]]
    --- End diff --
    
    ```
    scala> dr.zip(lr)
    res10: scala.collection.immutable.IndexedSeq[(Double, Long)] = Vector((1.0,1), (1.2,4), (1.4,7))
    
    scala> dr.take(1)
    res11: scala.collection.immutable.NumericRange[Double] = NumericRange(1.0)
    
    scala> dr.take(2)
    res12: scala.collection.immutable.NumericRange[Double] = NumericRange(1.0)
    
    scala> dr.take(3)
    res13: scala.collection.immutable.NumericRange[Double] = NumericRange(1.0, 1.2)
    
    scala> dr.take(4)
    res14: scala.collection.immutable.NumericRange[Double] = NumericRange(1.0, 1.2, 1.4, 1.5999999999999999)
    
    scala> lr.take(1)
    res15: scala.collection.immutable.NumericRange[Long] = NumericRange(1)
    
    scala> lr.take(2)
    res16: scala.collection.immutable.NumericRange[Long] = NumericRange(1, 4)
    
    scala> lr.take(3)
    res17: scala.collection.immutable.NumericRange[Long] = NumericRange(1, 4, 7)
    
    scala> lr.take(4)
    res18: scala.collection.immutable.NumericRange[Long] = NumericRange(1, 4, 7)
    ```
    `(1D to 2D).by(0.2).take(2) => NumericRange(1.0)` why ? This is a bug?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1837] NumericRange should be partitione...

Posted by kanzhang <gi...@git.apache.org>.

Github user kanzhang commented on the pull request:

    https://github.com/apache/spark/pull/776#issuecomment-46037683
  
    @mateiz could you take another look at this when you get a chance? SPARK-1817 has been marked as resolved, but the fix for the original issue depends on this patch. Thx.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1837] NumericRange should be partitione...

Posted by kanzhang <gi...@git.apache.org>.

Github user kanzhang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/776#discussion_r13315403
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/ParallelCollectionRDD.scala ---
    @@ -117,6 +117,15 @@ private object ParallelCollectionRDD {
         if (numSlices < 1) {
           throw new IllegalArgumentException("Positive number of slices required")
         }
    +    // Sequences need to be sliced at same positions for operations
    +    // like RDD.zip() to behave as expected
    +    def positions(length: Long, numSlices: Int) = {
    --- End diff --
    
    Ah, Ok. Just for my knowledge, if I had made it a private method, would it pass without explicit return type? 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1837] NumericRange should be partitione...

Posted by mateiz <gi...@git.apache.org>.

Github user mateiz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/776#discussion_r13195998
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/ParallelCollectionRDD.scala ---
    @@ -128,18 +137,17 @@ private object ParallelCollectionRDD {
               r.start, r.end + sign, r.step).asInstanceOf[Seq[T]], numSlices)
           }
           case r: Range => {
    -        (0 until numSlices).map(i => {
    -          val start = ((i * r.length.toLong) / numSlices).toInt
    -          val end = (((i + 1) * r.length.toLong) / numSlices).toInt
    -          new Range(r.start + start * r.step, r.start + end * r.step, r.step)
    +        positions(r.length, numSlices).map({
    +          case (start, end) =>
    +            new Range(r.start + start * r.step, r.start + end * r.step, r.step)
             }).asInstanceOf[Seq[Seq[T]]]
           }
           case nr: NumericRange[_] => {
             // For ranges of Long, Double, BigInteger, etc
             val slices = new ArrayBuffer[Seq[T]](numSlices)
    -        val sliceSize = (nr.size + numSlices - 1) / numSlices // Round up to catch everything
             var r = nr
    -        for (i <- 0 until numSlices) {
    +        for ((start, end) <- positions(nr.length, numSlices)) {
    +          val sliceSize = end - start
               slices += r.take(sliceSize).asInstanceOf[Seq[T]]
    --- End diff --
    
    Alright, maybe we should wait for Scala then. By the way, for your original use case, was the range you wanted always (0 to numElements)? If so you can also try RDD.zipWithIndex.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1837] NumericRange should be partitione...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/776#issuecomment-45004121
  
    Merged build started. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1837] NumericRange should be partitione...

Posted by mateiz <gi...@git.apache.org>.

Github user mateiz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/776#discussion_r13269320
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/ParallelCollectionRDD.scala ---
    @@ -117,6 +117,15 @@ private object ParallelCollectionRDD {
         if (numSlices < 1) {
           throw new IllegalArgumentException("Positive number of slices required")
         }
    +    // Sequences need to be sliced at same positions for operations
    +    // like RDD.zip() to behave as expected
    +    def positions(length: Long, numSlices: Int) = {
    --- End diff --
    
    This needs an explicit return type (e.g. `: Seq[(Int, Int)]`)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1837] NumericRange should be partitione...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/776#issuecomment-45015891
  
     Merged build triggered. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1837] NumericRange should be partitione...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/776#issuecomment-45020741
  
    Merged build finished. All automated tests passed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1837] NumericRange should be partitione...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/776#issuecomment-45014312
  
    Merged build finished. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1837] NumericRange should be partitione...

Posted by mateiz <gi...@git.apache.org>.

Github user mateiz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/776#discussion_r13269335
  
    --- Diff: core/src/test/scala/org/apache/spark/rdd/ParallelCollectionSplitSuite.scala ---
    @@ -111,6 +111,24 @@ class ParallelCollectionSplitSuite extends FunSuite with Checkers {
         assert(slices.forall(_.isInstanceOf[Range]))
       }
     
    +  test("identical slice sizes between Range and NumericRange") {
    +    val r = ParallelCollectionRDD.slice(1 to 7, 4)
    +    val nr = ParallelCollectionRDD.slice(1L to 7L, 4)
    +    assert(r.size === 4)
    +    for (i <- 0 until r.size) {
    +      assert(r(i).size === nr(i).size)
    +    }
    +  }
    +
    +  test("identical slice sizes between List and NumericRange") {
    +    val r = ParallelCollectionRDD.slice(List(1,2), 4)
    --- End diff --
    
    Space after ,


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1837] NumericRange should be partitione...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/776#issuecomment-43140579
  
    Can one of the admins verify this patch?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1837] NumericRange should be partitione...

Posted by kanzhang <gi...@git.apache.org>.

Github user kanzhang commented on the pull request:

    https://github.com/apache/spark/pull/776#issuecomment-45004023
  
    Updated patch based on @mateiz comments.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1837] NumericRange should be partitione...

Posted by kanzhang <gi...@git.apache.org>.

Github user kanzhang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/776#discussion_r13269358
  
    --- Diff: core/src/test/scala/org/apache/spark/rdd/ParallelCollectionSplitSuite.scala ---
    @@ -111,6 +111,24 @@ class ParallelCollectionSplitSuite extends FunSuite with Checkers {
         assert(slices.forall(_.isInstanceOf[Range]))
       }
     
    +  test("identical slice sizes between Range and NumericRange") {
    +    val r = ParallelCollectionRDD.slice(1 to 7, 4)
    +    val nr = ParallelCollectionRDD.slice(1L to 7L, 4)
    +    assert(r.size === 4)
    +    for (i <- 0 until r.size) {
    +      assert(r(i).size === nr(i).size)
    +    }
    +  }
    +
    +  test("identical slice sizes between List and NumericRange") {
    +    val r = ParallelCollectionRDD.slice(List(1,2), 4)
    --- End diff --
    
    Sure.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1837] NumericRange should be partitione...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/776#issuecomment-45014315
  
    
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15392/


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1837] NumericRange should be partitione...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/776#issuecomment-45004103
  
     Merged build triggered. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1837] NumericRange should be partitione...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/776#issuecomment-46098063
  
    Merged build started. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1837] NumericRange should be partitione...

Posted by mateiz <gi...@git.apache.org>.

Github user mateiz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/776#discussion_r13269334
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/ParallelCollectionRDD.scala ---
    @@ -117,6 +117,15 @@ private object ParallelCollectionRDD {
         if (numSlices < 1) {
           throw new IllegalArgumentException("Positive number of slices required")
         }
    +    // Sequences need to be sliced at same positions for operations
    +    // like RDD.zip() to behave as expected
    +    def positions(length: Long, numSlices: Int) = {
    --- End diff --
    
    Actually it would be better if this returned an Iterator, so that it doesn't materialize the whole sequence. You can do `(0 until numSlices).iterator.map(...)`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1837] NumericRange should be partitione...

Posted by kanzhang <gi...@git.apache.org>.

Github user kanzhang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/776#discussion_r13269104
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/ParallelCollectionRDD.scala ---
    @@ -128,18 +137,17 @@ private object ParallelCollectionRDD {
               r.start, r.end + sign, r.step).asInstanceOf[Seq[T]], numSlices)
           }
           case r: Range => {
    -        (0 until numSlices).map(i => {
    -          val start = ((i * r.length.toLong) / numSlices).toInt
    -          val end = (((i + 1) * r.length.toLong) / numSlices).toInt
    -          new Range(r.start + start * r.step, r.start + end * r.step, r.step)
    +        positions(r.length, numSlices).map({
    +          case (start, end) =>
    +            new Range(r.start + start * r.step, r.start + end * r.step, r.step)
             }).asInstanceOf[Seq[Seq[T]]]
           }
           case nr: NumericRange[_] => {
             // For ranges of Long, Double, BigInteger, etc
             val slices = new ArrayBuffer[Seq[T]](numSlices)
    -        val sliceSize = (nr.size + numSlices - 1) / numSlices // Round up to catch everything
             var r = nr
    -        for (i <- 0 until numSlices) {
    +        for ((start, end) <- positions(nr.length, numSlices)) {
    +          val sliceSize = end - start
               slices += r.take(sliceSize).asInstanceOf[Seq[T]]
    --- End diff --
    
    @mateiz the current implementation would lose elements for all types of numeric ranges (including Long and Double) when we zip a numeric range with other sequences, because we partition numeric ranges differently from other sequences. This patch fixes it by partitioning numeric ranges at exactly the same indexes as we would on other sequences. However, we still depend on ```take``` and ```drop``` being implemented correctly on numeric ranges for things to work. The Scala bug affects ```take``` and ```drop``` on Double ranges, but not on other numeric ranges like Long (hence, the unit tests in this patch, which is based on Long ranges, work).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1837] NumericRange should be partitione...

Posted by mateiz <gi...@git.apache.org>.

Github user mateiz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/776#discussion_r13161774
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/ParallelCollectionRDD.scala ---
    @@ -128,18 +137,17 @@ private object ParallelCollectionRDD {
               r.start, r.end + sign, r.step).asInstanceOf[Seq[T]], numSlices)
           }
           case r: Range => {
    -        (0 until numSlices).map(i => {
    -          val start = ((i * r.length.toLong) / numSlices).toInt
    -          val end = (((i + 1) * r.length.toLong) / numSlices).toInt
    -          new Range(r.start + start * r.step, r.start + end * r.step, r.step)
    +        positions(r.length, numSlices).map({
    +          case (start, end) =>
    +            new Range(r.start + start * r.step, r.start + end * r.step, r.step)
             }).asInstanceOf[Seq[Seq[T]]]
           }
           case nr: NumericRange[_] => {
             // For ranges of Long, Double, BigInteger, etc
             val slices = new ArrayBuffer[Seq[T]](numSlices)
    -        val sliceSize = (nr.size + numSlices - 1) / numSlices // Round up to catch everything
             var r = nr
    -        for (i <- 0 until numSlices) {
    +        for ((start, end) <- positions(nr.length, numSlices)) {
    +          val sliceSize = end - start
               slices += r.take(sliceSize).asInstanceOf[Seq[T]]
    --- End diff --
    
    Do you want to wait on this to be fixed by Scala, or do you want to work around it for now?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1837] NumericRange should be partitione...

Posted by kanzhang <gi...@git.apache.org>.

Github user kanzhang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/776#discussion_r12704287
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/ParallelCollectionRDD.scala ---
    @@ -128,18 +137,17 @@ private object ParallelCollectionRDD {
               r.start, r.end + sign, r.step).asInstanceOf[Seq[T]], numSlices)
           }
           case r: Range => {
    -        (0 until numSlices).map(i => {
    -          val start = ((i * r.length.toLong) / numSlices).toInt
    -          val end = (((i + 1) * r.length.toLong) / numSlices).toInt
    -          new Range(r.start + start * r.step, r.start + end * r.step, r.step)
    +        positions(r.length, numSlices).map({
    +          case (start, end) =>
    +            new Range(r.start + start * r.step, r.start + end * r.step, r.step)
             }).asInstanceOf[Seq[Seq[T]]]
           }
           case nr: NumericRange[_] => {
             // For ranges of Long, Double, BigInteger, etc
             val slices = new ArrayBuffer[Seq[T]](numSlices)
    -        val sliceSize = (nr.size + numSlices - 1) / numSlices // Round up to catch everything
             var r = nr
    -        for (i <- 0 until numSlices) {
    +        for ((start, end) <- positions(nr.length, numSlices)) {
    +          val sliceSize = end - start
               slices += r.take(sliceSize).asInstanceOf[Seq[T]]
    --- End diff --
    
    Looks like a bug to me. I couldn't google any pointer on it. You might want to post it on Scala forum?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1837] NumericRange should be partitione...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/776#issuecomment-43395959
  
    All automated tests passed.
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15060/


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1837] NumericRange should be partitione...

Posted by mateiz <gi...@git.apache.org>.

Github user mateiz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/776#discussion_r13268425
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/ParallelCollectionRDD.scala ---
    @@ -128,18 +137,17 @@ private object ParallelCollectionRDD {
               r.start, r.end + sign, r.step).asInstanceOf[Seq[T]], numSlices)
           }
           case r: Range => {
    -        (0 until numSlices).map(i => {
    -          val start = ((i * r.length.toLong) / numSlices).toInt
    -          val end = (((i + 1) * r.length.toLong) / numSlices).toInt
    -          new Range(r.start + start * r.step, r.start + end * r.step, r.step)
    +        positions(r.length, numSlices).map({
    +          case (start, end) =>
    +            new Range(r.start + start * r.step, r.start + end * r.step, r.step)
             }).asInstanceOf[Seq[Seq[T]]]
           }
           case nr: NumericRange[_] => {
             // For ranges of Long, Double, BigInteger, etc
             val slices = new ArrayBuffer[Seq[T]](numSlices)
    -        val sliceSize = (nr.size + numSlices - 1) / numSlices // Round up to catch everything
             var r = nr
    -        for (i <- 0 until numSlices) {
    +        for ((start, end) <- positions(nr.length, numSlices)) {
    +          val sliceSize = end - start
               slices += r.take(sliceSize).asInstanceOf[Seq[T]]
    --- End diff --
    
    Won't this patch make it lose numbers out of Double ranges? Whereas the current implementation works.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1837] NumericRange should be partitione...

Posted by mateiz <gi...@git.apache.org>.

Github user mateiz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/776#discussion_r13314967
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/ParallelCollectionRDD.scala ---
    @@ -117,6 +117,15 @@ private object ParallelCollectionRDD {
         if (numSlices < 1) {
           throw new IllegalArgumentException("Positive number of slices required")
         }
    +    // Sequences need to be sliced at same positions for operations
    +    // like RDD.zip() to behave as expected
    +    def positions(length: Long, numSlices: Int) = {
    --- End diff --
    
    This is just our style throughout the code. It makes it easier to avoid compatibility-breaking changes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1837] NumericRange should be partitione...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/776#issuecomment-43395160
  
     Merged build triggered. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1837] NumericRange should be partitione...

Posted by mateiz <gi...@git.apache.org>.

Github user mateiz commented on the pull request:

    https://github.com/apache/spark/pull/776#issuecomment-46100288
  
    Alright, merged this. Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1837] NumericRange should be partitione...

Posted by kanzhang <gi...@git.apache.org>.

Github user kanzhang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/776#discussion_r12706894
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/ParallelCollectionRDD.scala ---
    @@ -128,18 +137,17 @@ private object ParallelCollectionRDD {
               r.start, r.end + sign, r.step).asInstanceOf[Seq[T]], numSlices)
           }
           case r: Range => {
    -        (0 until numSlices).map(i => {
    -          val start = ((i * r.length.toLong) / numSlices).toInt
    -          val end = (((i + 1) * r.length.toLong) / numSlices).toInt
    -          new Range(r.start + start * r.step, r.start + end * r.step, r.step)
    +        positions(r.length, numSlices).map({
    +          case (start, end) =>
    +            new Range(r.start + start * r.step, r.start + end * r.step, r.step)
             }).asInstanceOf[Seq[Seq[T]]]
           }
           case nr: NumericRange[_] => {
             // For ranges of Long, Double, BigInteger, etc
             val slices = new ArrayBuffer[Seq[T]](numSlices)
    -        val sliceSize = (nr.size + numSlices - 1) / numSlices // Round up to catch everything
             var r = nr
    -        for (i <- 0 until numSlices) {
    +        for ((start, end) <- positions(nr.length, numSlices)) {
    +          val sliceSize = end - start
               slices += r.take(sliceSize).asInstanceOf[Seq[T]]
    --- End diff --
    
    This issue has been reported at Scala and is still open, https://issues.scala-lang.org/browse/SI-8518


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1837] NumericRange should be partitione...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/776#issuecomment-46099077
  
    All automated tests passed.
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15790/


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1837] NumericRange should be partitione...

Posted by kanzhang <gi...@git.apache.org>.

Github user kanzhang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/776#discussion_r13241563
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/ParallelCollectionRDD.scala ---
    @@ -128,18 +137,17 @@ private object ParallelCollectionRDD {
               r.start, r.end + sign, r.step).asInstanceOf[Seq[T]], numSlices)
           }
           case r: Range => {
    -        (0 until numSlices).map(i => {
    -          val start = ((i * r.length.toLong) / numSlices).toInt
    -          val end = (((i + 1) * r.length.toLong) / numSlices).toInt
    -          new Range(r.start + start * r.step, r.start + end * r.step, r.step)
    +        positions(r.length, numSlices).map({
    +          case (start, end) =>
    +            new Range(r.start + start * r.step, r.start + end * r.step, r.step)
             }).asInstanceOf[Seq[Seq[T]]]
           }
           case nr: NumericRange[_] => {
             // For ranges of Long, Double, BigInteger, etc
             val slices = new ArrayBuffer[Seq[T]](numSlices)
    -        val sliceSize = (nr.size + numSlices - 1) / numSlices // Round up to catch everything
             var r = nr
    -        for (i <- 0 until numSlices) {
    +        for ((start, end) <- positions(nr.length, numSlices)) {
    +          val sliceSize = end - start
               slices += r.take(sliceSize).asInstanceOf[Seq[T]]
    --- End diff --
    
    @mateiz the use case wasn't mine, it was from reporter of SPARK-1817. Btw, I think this PR can be committed independent of Scala fix. It fixes the issue for other numeric ranges (e.g., Long), and will also work on Double once the Scala fix is in.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1837] NumericRange should be partitione...

Posted by mateiz <gi...@git.apache.org>.

Github user mateiz commented on the pull request:

    https://github.com/apache/spark/pull/776#issuecomment-46097979
  
    Oh, sorry, I forgot to merge this after testing it. Jenkins, retest this please.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1837] NumericRange should be partitione...

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/776


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1837] NumericRange should be partitione...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/776#issuecomment-43395958
  
    Merged build finished. All automated tests passed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1837] NumericRange should be partitione...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/776#issuecomment-45015904
  
    Merged build started. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---