You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by kanzhang <gi...@git.apache.org> on 2014/05/14 23:28:43 UTC
[GitHub] spark pull request: [SPARK-1837] NumericRange should be partitione...
GitHub user kanzhang opened a pull request:
https://github.com/apache/spark/pull/776
[SPARK-1837] NumericRange should be partitioned in the same way as other...
... sequences
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/kanzhang/spark SPARK-1837
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/776.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #776
----
commit 403f9b1af8f6381035f4d863b9d0720817a227be
Author: Kan Zhang <kz...@apache.org>
Date: 2014-05-14T21:25:51Z
[SPARK-1837] NumericRange should be partitioned in the same way as other sequences
----
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [SPARK-1837] NumericRange should be partitione...
Posted by kanzhang <gi...@git.apache.org>.
Github user kanzhang commented on a diff in the pull request:
https://github.com/apache/spark/pull/776#discussion_r13163063
--- Diff: core/src/main/scala/org/apache/spark/rdd/ParallelCollectionRDD.scala ---
@@ -128,18 +137,17 @@ private object ParallelCollectionRDD {
r.start, r.end + sign, r.step).asInstanceOf[Seq[T]], numSlices)
}
case r: Range => {
- (0 until numSlices).map(i => {
- val start = ((i * r.length.toLong) / numSlices).toInt
- val end = (((i + 1) * r.length.toLong) / numSlices).toInt
- new Range(r.start + start * r.step, r.start + end * r.step, r.step)
+ positions(r.length, numSlices).map({
+ case (start, end) =>
+ new Range(r.start + start * r.step, r.start + end * r.step, r.step)
}).asInstanceOf[Seq[Seq[T]]]
}
case nr: NumericRange[_] => {
// For ranges of Long, Double, BigInteger, etc
val slices = new ArrayBuffer[Seq[T]](numSlices)
- val sliceSize = (nr.size + numSlices - 1) / numSlices // Round up to catch everything
var r = nr
- for (i <- 0 until numSlices) {
+ for ((start, end) <- positions(nr.length, numSlices)) {
+ val sliceSize = end - start
slices += r.take(sliceSize).asInstanceOf[Seq[T]]
--- End diff --
I'd wait for Scala to fix it (or I'd fix it in Scala).
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [SPARK-1837] NumericRange should be partitione...
Posted by kanzhang <gi...@git.apache.org>.
Github user kanzhang commented on a diff in the pull request:
https://github.com/apache/spark/pull/776#discussion_r13269342
--- Diff: core/src/main/scala/org/apache/spark/rdd/ParallelCollectionRDD.scala ---
@@ -117,6 +117,15 @@ private object ParallelCollectionRDD {
if (numSlices < 1) {
throw new IllegalArgumentException("Positive number of slices required")
}
+ // Sequences need to be sliced at same positions for operations
+ // like RDD.zip() to behave as expected
+ def positions(length: Long, numSlices: Int) = {
--- End diff --
> This needs an explicit return type (e.g. : Seq[(Int, Int)])
For binary compatibility?
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [SPARK-1837] NumericRange should be partitione...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/776#issuecomment-46098058
Merged build triggered.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [SPARK-1837] NumericRange should be partitione...
Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on the pull request:
https://github.com/apache/spark/pull/776#issuecomment-43395120
Jenkins, this is ok to test
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [SPARK-1837] NumericRange should be partitione...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/776#issuecomment-46099076
Merged build finished. All automated tests passed.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [SPARK-1837] NumericRange should be partitione...
Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on the pull request:
https://github.com/apache/spark/pull/776#issuecomment-45015310
Jenkins, retest this please
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [SPARK-1837] NumericRange should be partitione...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/776#issuecomment-43395161
Merged build started.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [SPARK-1837] NumericRange should be partitione...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/776#issuecomment-45020744
All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15399/
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [SPARK-1837] NumericRange should be partitione...
Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on a diff in the pull request:
https://github.com/apache/spark/pull/776#discussion_r13269332
--- Diff: core/src/main/scala/org/apache/spark/rdd/ParallelCollectionRDD.scala ---
@@ -128,18 +137,17 @@ private object ParallelCollectionRDD {
r.start, r.end + sign, r.step).asInstanceOf[Seq[T]], numSlices)
}
case r: Range => {
- (0 until numSlices).map(i => {
- val start = ((i * r.length.toLong) / numSlices).toInt
- val end = (((i + 1) * r.length.toLong) / numSlices).toInt
- new Range(r.start + start * r.step, r.start + end * r.step, r.step)
+ positions(r.length, numSlices).map({
+ case (start, end) =>
+ new Range(r.start + start * r.step, r.start + end * r.step, r.step)
}).asInstanceOf[Seq[Seq[T]]]
}
case nr: NumericRange[_] => {
// For ranges of Long, Double, BigInteger, etc
val slices = new ArrayBuffer[Seq[T]](numSlices)
- val sliceSize = (nr.size + numSlices - 1) / numSlices // Round up to catch everything
var r = nr
- for (i <- 0 until numSlices) {
+ for ((start, end) <- positions(nr.length, numSlices)) {
+ val sliceSize = end - start
slices += r.take(sliceSize).asInstanceOf[Seq[T]]
--- End diff --
Okay, that makes sense then; I didn't realize that we were already using `drop` and `take`. In that case we should merge this patch as is and maybe create a JIRA for Double ranges so people see it's a known issue. Made one other small comment on the patch.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [SPARK-1837] NumericRange should be partitione...
Posted by witgo <gi...@git.apache.org>.
Github user witgo commented on a diff in the pull request:
https://github.com/apache/spark/pull/776#discussion_r12675655
--- Diff: core/src/main/scala/org/apache/spark/rdd/ParallelCollectionRDD.scala ---
@@ -128,18 +137,17 @@ private object ParallelCollectionRDD {
r.start, r.end + sign, r.step).asInstanceOf[Seq[T]], numSlices)
}
case r: Range => {
- (0 until numSlices).map(i => {
- val start = ((i * r.length.toLong) / numSlices).toInt
- val end = (((i + 1) * r.length.toLong) / numSlices).toInt
- new Range(r.start + start * r.step, r.start + end * r.step, r.step)
+ positions(r.length, numSlices).map({
+ case (start, end) =>
+ new Range(r.start + start * r.step, r.start + end * r.step, r.step)
}).asInstanceOf[Seq[Seq[T]]]
}
case nr: NumericRange[_] => {
// For ranges of Long, Double, BigInteger, etc
val slices = new ArrayBuffer[Seq[T]](numSlices)
- val sliceSize = (nr.size + numSlices - 1) / numSlices // Round up to catch everything
var r = nr
- for (i <- 0 until numSlices) {
+ for ((start, end) <- positions(nr.length, numSlices)) {
+ val sliceSize = end - start
slices += r.take(sliceSize).asInstanceOf[Seq[T]]
--- End diff --
```
scala> dr.zip(lr)
res10: scala.collection.immutable.IndexedSeq[(Double, Long)] = Vector((1.0,1), (1.2,4), (1.4,7))
scala> dr.take(1)
res11: scala.collection.immutable.NumericRange[Double] = NumericRange(1.0)
scala> dr.take(2)
res12: scala.collection.immutable.NumericRange[Double] = NumericRange(1.0)
scala> dr.take(3)
res13: scala.collection.immutable.NumericRange[Double] = NumericRange(1.0, 1.2)
scala> dr.take(4)
res14: scala.collection.immutable.NumericRange[Double] = NumericRange(1.0, 1.2, 1.4, 1.5999999999999999)
scala> lr.take(1)
res15: scala.collection.immutable.NumericRange[Long] = NumericRange(1)
scala> lr.take(2)
res16: scala.collection.immutable.NumericRange[Long] = NumericRange(1, 4)
scala> lr.take(3)
res17: scala.collection.immutable.NumericRange[Long] = NumericRange(1, 4, 7)
scala> lr.take(4)
res18: scala.collection.immutable.NumericRange[Long] = NumericRange(1, 4, 7)
```
`(1D to 2D).by(0.2).take(2) => NumericRange(1.0)` why ? This is a bug?
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [SPARK-1837] NumericRange should be partitione...
Posted by kanzhang <gi...@git.apache.org>.
Github user kanzhang commented on the pull request:
https://github.com/apache/spark/pull/776#issuecomment-46037683
@mateiz could you take another look at this when you get a chance? SPARK-1817 has been marked as resolved, but the fix for the original issue depends on this patch. Thx.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [SPARK-1837] NumericRange should be partitione...
Posted by kanzhang <gi...@git.apache.org>.
Github user kanzhang commented on a diff in the pull request:
https://github.com/apache/spark/pull/776#discussion_r13315403
--- Diff: core/src/main/scala/org/apache/spark/rdd/ParallelCollectionRDD.scala ---
@@ -117,6 +117,15 @@ private object ParallelCollectionRDD {
if (numSlices < 1) {
throw new IllegalArgumentException("Positive number of slices required")
}
+ // Sequences need to be sliced at same positions for operations
+ // like RDD.zip() to behave as expected
+ def positions(length: Long, numSlices: Int) = {
--- End diff --
Ah, Ok. Just for my knowledge, if I had made it a private method, would it pass without explicit return type?
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [SPARK-1837] NumericRange should be partitione...
Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on a diff in the pull request:
https://github.com/apache/spark/pull/776#discussion_r13195998
--- Diff: core/src/main/scala/org/apache/spark/rdd/ParallelCollectionRDD.scala ---
@@ -128,18 +137,17 @@ private object ParallelCollectionRDD {
r.start, r.end + sign, r.step).asInstanceOf[Seq[T]], numSlices)
}
case r: Range => {
- (0 until numSlices).map(i => {
- val start = ((i * r.length.toLong) / numSlices).toInt
- val end = (((i + 1) * r.length.toLong) / numSlices).toInt
- new Range(r.start + start * r.step, r.start + end * r.step, r.step)
+ positions(r.length, numSlices).map({
+ case (start, end) =>
+ new Range(r.start + start * r.step, r.start + end * r.step, r.step)
}).asInstanceOf[Seq[Seq[T]]]
}
case nr: NumericRange[_] => {
// For ranges of Long, Double, BigInteger, etc
val slices = new ArrayBuffer[Seq[T]](numSlices)
- val sliceSize = (nr.size + numSlices - 1) / numSlices // Round up to catch everything
var r = nr
- for (i <- 0 until numSlices) {
+ for ((start, end) <- positions(nr.length, numSlices)) {
+ val sliceSize = end - start
slices += r.take(sliceSize).asInstanceOf[Seq[T]]
--- End diff --
Alright, maybe we should wait for Scala then. By the way, for your original use case, was the range you wanted always (0 to numElements)? If so you can also try RDD.zipWithIndex.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [SPARK-1837] NumericRange should be partitione...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/776#issuecomment-45004121
Merged build started.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [SPARK-1837] NumericRange should be partitione...
Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on a diff in the pull request:
https://github.com/apache/spark/pull/776#discussion_r13269320
--- Diff: core/src/main/scala/org/apache/spark/rdd/ParallelCollectionRDD.scala ---
@@ -117,6 +117,15 @@ private object ParallelCollectionRDD {
if (numSlices < 1) {
throw new IllegalArgumentException("Positive number of slices required")
}
+ // Sequences need to be sliced at same positions for operations
+ // like RDD.zip() to behave as expected
+ def positions(length: Long, numSlices: Int) = {
--- End diff --
This needs an explicit return type (e.g. `: Seq[(Int, Int)]`)
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [SPARK-1837] NumericRange should be partitione...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/776#issuecomment-45015891
Merged build triggered.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [SPARK-1837] NumericRange should be partitione...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/776#issuecomment-45020741
Merged build finished. All automated tests passed.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [SPARK-1837] NumericRange should be partitione...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/776#issuecomment-45014312
Merged build finished.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [SPARK-1837] NumericRange should be partitione...
Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on a diff in the pull request:
https://github.com/apache/spark/pull/776#discussion_r13269335
--- Diff: core/src/test/scala/org/apache/spark/rdd/ParallelCollectionSplitSuite.scala ---
@@ -111,6 +111,24 @@ class ParallelCollectionSplitSuite extends FunSuite with Checkers {
assert(slices.forall(_.isInstanceOf[Range]))
}
+ test("identical slice sizes between Range and NumericRange") {
+ val r = ParallelCollectionRDD.slice(1 to 7, 4)
+ val nr = ParallelCollectionRDD.slice(1L to 7L, 4)
+ assert(r.size === 4)
+ for (i <- 0 until r.size) {
+ assert(r(i).size === nr(i).size)
+ }
+ }
+
+ test("identical slice sizes between List and NumericRange") {
+ val r = ParallelCollectionRDD.slice(List(1,2), 4)
--- End diff --
Space after ,
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [SPARK-1837] NumericRange should be partitione...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/776#issuecomment-43140579
Can one of the admins verify this patch?
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [SPARK-1837] NumericRange should be partitione...
Posted by kanzhang <gi...@git.apache.org>.
Github user kanzhang commented on the pull request:
https://github.com/apache/spark/pull/776#issuecomment-45004023
Updated patch based on @mateiz comments.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [SPARK-1837] NumericRange should be partitione...
Posted by kanzhang <gi...@git.apache.org>.
Github user kanzhang commented on a diff in the pull request:
https://github.com/apache/spark/pull/776#discussion_r13269358
--- Diff: core/src/test/scala/org/apache/spark/rdd/ParallelCollectionSplitSuite.scala ---
@@ -111,6 +111,24 @@ class ParallelCollectionSplitSuite extends FunSuite with Checkers {
assert(slices.forall(_.isInstanceOf[Range]))
}
+ test("identical slice sizes between Range and NumericRange") {
+ val r = ParallelCollectionRDD.slice(1 to 7, 4)
+ val nr = ParallelCollectionRDD.slice(1L to 7L, 4)
+ assert(r.size === 4)
+ for (i <- 0 until r.size) {
+ assert(r(i).size === nr(i).size)
+ }
+ }
+
+ test("identical slice sizes between List and NumericRange") {
+ val r = ParallelCollectionRDD.slice(List(1,2), 4)
--- End diff --
Sure.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [SPARK-1837] NumericRange should be partitione...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/776#issuecomment-45014315
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15392/
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [SPARK-1837] NumericRange should be partitione...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/776#issuecomment-45004103
Merged build triggered.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [SPARK-1837] NumericRange should be partitione...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/776#issuecomment-46098063
Merged build started.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [SPARK-1837] NumericRange should be partitione...
Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on a diff in the pull request:
https://github.com/apache/spark/pull/776#discussion_r13269334
--- Diff: core/src/main/scala/org/apache/spark/rdd/ParallelCollectionRDD.scala ---
@@ -117,6 +117,15 @@ private object ParallelCollectionRDD {
if (numSlices < 1) {
throw new IllegalArgumentException("Positive number of slices required")
}
+ // Sequences need to be sliced at same positions for operations
+ // like RDD.zip() to behave as expected
+ def positions(length: Long, numSlices: Int) = {
--- End diff --
Actually it would be better if this returned an Iterator, so that it doesn't materialize the whole sequence. You can do `(0 until numSlices).iterator.map(...)`.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [SPARK-1837] NumericRange should be partitione...
Posted by kanzhang <gi...@git.apache.org>.
Github user kanzhang commented on a diff in the pull request:
https://github.com/apache/spark/pull/776#discussion_r13269104
--- Diff: core/src/main/scala/org/apache/spark/rdd/ParallelCollectionRDD.scala ---
@@ -128,18 +137,17 @@ private object ParallelCollectionRDD {
r.start, r.end + sign, r.step).asInstanceOf[Seq[T]], numSlices)
}
case r: Range => {
- (0 until numSlices).map(i => {
- val start = ((i * r.length.toLong) / numSlices).toInt
- val end = (((i + 1) * r.length.toLong) / numSlices).toInt
- new Range(r.start + start * r.step, r.start + end * r.step, r.step)
+ positions(r.length, numSlices).map({
+ case (start, end) =>
+ new Range(r.start + start * r.step, r.start + end * r.step, r.step)
}).asInstanceOf[Seq[Seq[T]]]
}
case nr: NumericRange[_] => {
// For ranges of Long, Double, BigInteger, etc
val slices = new ArrayBuffer[Seq[T]](numSlices)
- val sliceSize = (nr.size + numSlices - 1) / numSlices // Round up to catch everything
var r = nr
- for (i <- 0 until numSlices) {
+ for ((start, end) <- positions(nr.length, numSlices)) {
+ val sliceSize = end - start
slices += r.take(sliceSize).asInstanceOf[Seq[T]]
--- End diff --
@mateiz the current implementation would lose elements for all types of numeric ranges (including Long and Double) when we zip a numeric range with other sequences, because we partition numeric ranges differently from other sequences. This patch fixes it by partitioning numeric ranges at exactly the same indexes as we would on other sequences. However, we still depend on ```take``` and ```drop``` being implemented correctly on numeric ranges for things to work. The Scala bug affects ```take``` and ```drop``` on Double ranges, but not on other numeric ranges like Long (hence, the unit tests in this patch, which is based on Long ranges, work).
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [SPARK-1837] NumericRange should be partitione...
Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on a diff in the pull request:
https://github.com/apache/spark/pull/776#discussion_r13161774
--- Diff: core/src/main/scala/org/apache/spark/rdd/ParallelCollectionRDD.scala ---
@@ -128,18 +137,17 @@ private object ParallelCollectionRDD {
r.start, r.end + sign, r.step).asInstanceOf[Seq[T]], numSlices)
}
case r: Range => {
- (0 until numSlices).map(i => {
- val start = ((i * r.length.toLong) / numSlices).toInt
- val end = (((i + 1) * r.length.toLong) / numSlices).toInt
- new Range(r.start + start * r.step, r.start + end * r.step, r.step)
+ positions(r.length, numSlices).map({
+ case (start, end) =>
+ new Range(r.start + start * r.step, r.start + end * r.step, r.step)
}).asInstanceOf[Seq[Seq[T]]]
}
case nr: NumericRange[_] => {
// For ranges of Long, Double, BigInteger, etc
val slices = new ArrayBuffer[Seq[T]](numSlices)
- val sliceSize = (nr.size + numSlices - 1) / numSlices // Round up to catch everything
var r = nr
- for (i <- 0 until numSlices) {
+ for ((start, end) <- positions(nr.length, numSlices)) {
+ val sliceSize = end - start
slices += r.take(sliceSize).asInstanceOf[Seq[T]]
--- End diff --
Do you want to wait on this to be fixed by Scala, or do you want to work around it for now?
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [SPARK-1837] NumericRange should be partitione...
Posted by kanzhang <gi...@git.apache.org>.
Github user kanzhang commented on a diff in the pull request:
https://github.com/apache/spark/pull/776#discussion_r12704287
--- Diff: core/src/main/scala/org/apache/spark/rdd/ParallelCollectionRDD.scala ---
@@ -128,18 +137,17 @@ private object ParallelCollectionRDD {
r.start, r.end + sign, r.step).asInstanceOf[Seq[T]], numSlices)
}
case r: Range => {
- (0 until numSlices).map(i => {
- val start = ((i * r.length.toLong) / numSlices).toInt
- val end = (((i + 1) * r.length.toLong) / numSlices).toInt
- new Range(r.start + start * r.step, r.start + end * r.step, r.step)
+ positions(r.length, numSlices).map({
+ case (start, end) =>
+ new Range(r.start + start * r.step, r.start + end * r.step, r.step)
}).asInstanceOf[Seq[Seq[T]]]
}
case nr: NumericRange[_] => {
// For ranges of Long, Double, BigInteger, etc
val slices = new ArrayBuffer[Seq[T]](numSlices)
- val sliceSize = (nr.size + numSlices - 1) / numSlices // Round up to catch everything
var r = nr
- for (i <- 0 until numSlices) {
+ for ((start, end) <- positions(nr.length, numSlices)) {
+ val sliceSize = end - start
slices += r.take(sliceSize).asInstanceOf[Seq[T]]
--- End diff --
Looks like a bug to me. I couldn't google any pointer on it. You might want to post it on Scala forum?
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [SPARK-1837] NumericRange should be partitione...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/776#issuecomment-43395959
All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15060/
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [SPARK-1837] NumericRange should be partitione...
Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on a diff in the pull request:
https://github.com/apache/spark/pull/776#discussion_r13268425
--- Diff: core/src/main/scala/org/apache/spark/rdd/ParallelCollectionRDD.scala ---
@@ -128,18 +137,17 @@ private object ParallelCollectionRDD {
r.start, r.end + sign, r.step).asInstanceOf[Seq[T]], numSlices)
}
case r: Range => {
- (0 until numSlices).map(i => {
- val start = ((i * r.length.toLong) / numSlices).toInt
- val end = (((i + 1) * r.length.toLong) / numSlices).toInt
- new Range(r.start + start * r.step, r.start + end * r.step, r.step)
+ positions(r.length, numSlices).map({
+ case (start, end) =>
+ new Range(r.start + start * r.step, r.start + end * r.step, r.step)
}).asInstanceOf[Seq[Seq[T]]]
}
case nr: NumericRange[_] => {
// For ranges of Long, Double, BigInteger, etc
val slices = new ArrayBuffer[Seq[T]](numSlices)
- val sliceSize = (nr.size + numSlices - 1) / numSlices // Round up to catch everything
var r = nr
- for (i <- 0 until numSlices) {
+ for ((start, end) <- positions(nr.length, numSlices)) {
+ val sliceSize = end - start
slices += r.take(sliceSize).asInstanceOf[Seq[T]]
--- End diff --
Won't this patch make it lose numbers out of Double ranges? Whereas the current implementation works.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [SPARK-1837] NumericRange should be partitione...
Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on a diff in the pull request:
https://github.com/apache/spark/pull/776#discussion_r13314967
--- Diff: core/src/main/scala/org/apache/spark/rdd/ParallelCollectionRDD.scala ---
@@ -117,6 +117,15 @@ private object ParallelCollectionRDD {
if (numSlices < 1) {
throw new IllegalArgumentException("Positive number of slices required")
}
+ // Sequences need to be sliced at same positions for operations
+ // like RDD.zip() to behave as expected
+ def positions(length: Long, numSlices: Int) = {
--- End diff --
This is just our style throughout the code. It makes it easier to avoid compatibility-breaking changes.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [SPARK-1837] NumericRange should be partitione...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/776#issuecomment-43395160
Merged build triggered.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [SPARK-1837] NumericRange should be partitione...
Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on the pull request:
https://github.com/apache/spark/pull/776#issuecomment-46100288
Alright, merged this. Thanks!
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [SPARK-1837] NumericRange should be partitione...
Posted by kanzhang <gi...@git.apache.org>.
Github user kanzhang commented on a diff in the pull request:
https://github.com/apache/spark/pull/776#discussion_r12706894
--- Diff: core/src/main/scala/org/apache/spark/rdd/ParallelCollectionRDD.scala ---
@@ -128,18 +137,17 @@ private object ParallelCollectionRDD {
r.start, r.end + sign, r.step).asInstanceOf[Seq[T]], numSlices)
}
case r: Range => {
- (0 until numSlices).map(i => {
- val start = ((i * r.length.toLong) / numSlices).toInt
- val end = (((i + 1) * r.length.toLong) / numSlices).toInt
- new Range(r.start + start * r.step, r.start + end * r.step, r.step)
+ positions(r.length, numSlices).map({
+ case (start, end) =>
+ new Range(r.start + start * r.step, r.start + end * r.step, r.step)
}).asInstanceOf[Seq[Seq[T]]]
}
case nr: NumericRange[_] => {
// For ranges of Long, Double, BigInteger, etc
val slices = new ArrayBuffer[Seq[T]](numSlices)
- val sliceSize = (nr.size + numSlices - 1) / numSlices // Round up to catch everything
var r = nr
- for (i <- 0 until numSlices) {
+ for ((start, end) <- positions(nr.length, numSlices)) {
+ val sliceSize = end - start
slices += r.take(sliceSize).asInstanceOf[Seq[T]]
--- End diff --
This issue has been reported at Scala and is still open, https://issues.scala-lang.org/browse/SI-8518
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [SPARK-1837] NumericRange should be partitione...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/776#issuecomment-46099077
All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15790/
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [SPARK-1837] NumericRange should be partitione...
Posted by kanzhang <gi...@git.apache.org>.
Github user kanzhang commented on a diff in the pull request:
https://github.com/apache/spark/pull/776#discussion_r13241563
--- Diff: core/src/main/scala/org/apache/spark/rdd/ParallelCollectionRDD.scala ---
@@ -128,18 +137,17 @@ private object ParallelCollectionRDD {
r.start, r.end + sign, r.step).asInstanceOf[Seq[T]], numSlices)
}
case r: Range => {
- (0 until numSlices).map(i => {
- val start = ((i * r.length.toLong) / numSlices).toInt
- val end = (((i + 1) * r.length.toLong) / numSlices).toInt
- new Range(r.start + start * r.step, r.start + end * r.step, r.step)
+ positions(r.length, numSlices).map({
+ case (start, end) =>
+ new Range(r.start + start * r.step, r.start + end * r.step, r.step)
}).asInstanceOf[Seq[Seq[T]]]
}
case nr: NumericRange[_] => {
// For ranges of Long, Double, BigInteger, etc
val slices = new ArrayBuffer[Seq[T]](numSlices)
- val sliceSize = (nr.size + numSlices - 1) / numSlices // Round up to catch everything
var r = nr
- for (i <- 0 until numSlices) {
+ for ((start, end) <- positions(nr.length, numSlices)) {
+ val sliceSize = end - start
slices += r.take(sliceSize).asInstanceOf[Seq[T]]
--- End diff --
@mateiz the use case wasn't mine, it was from reporter of SPARK-1817. Btw, I think this PR can be committed independent of Scala fix. It fixes the issue for other numeric ranges (e.g., Long), and will also work on Double once the Scala fix is in.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [SPARK-1837] NumericRange should be partitione...
Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on the pull request:
https://github.com/apache/spark/pull/776#issuecomment-46097979
Oh, sorry, I forgot to merge this after testing it. Jenkins, retest this please.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [SPARK-1837] NumericRange should be partitione...
Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:
https://github.com/apache/spark/pull/776
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [SPARK-1837] NumericRange should be partitione...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/776#issuecomment-43395958
Merged build finished. All automated tests passed.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [SPARK-1837] NumericRange should be partitione...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/776#issuecomment-45015904
Merged build started.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---