Posted to reviews@spark.apache.org by tbertelsen <gi...@git.apache.org> on 2015/02/11 19:28:49 UTC

[GitHub] spark pull request: Fixing SPARK-5744.

GitHub user tbertelsen opened a pull request:

    https://github.com/apache/spark/pull/4534

    Fixing SPARK-5744.

    RDD.isEmpty fails when an RDD contains empty partitions.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/tbertelsen/spark master

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/4534.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #4534
    
----
commit 6f4c5c22678c9dc78417cd3c109b90b87b3b8211
Author: Tobias Bertelsen <to...@gmail.com>
Date:   2015-02-11T18:28:23Z

    Fixing SPARK-5744.
    
    RDD.isEmpty fails when an RDD contains empty partitions.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5744] [CORE] Making RDD.isEmpty robust ...

Posted by tbertelsen <gi...@git.apache.org>.
Github user tbertelsen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4534#discussion_r24523768
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala ---
    @@ -1253,9 +1253,9 @@ abstract class RDD[T: ClassTag](
     
       /**
        * @return true if and only if the RDD contains no elements at all. Note that an RDD
    -   *         may be empty even when it has at least 1 partition.
    +   *         may be empty even when it has 1 or more partitions.
        */
    -  def isEmpty(): Boolean = partitions.length == 0 || take(1).length == 0
    +  def isEmpty(): Boolean = partitions.length == 0 || mapPartitions(it => Iterator(!it.hasNext)).reduce(_&&_)
    --- End diff --
    
    I am getting more convinced that this is a bug in `take`. It is provoked by `RDD[Nothing].take(1)`. Take for example these three commands:
    
    ~~~ scala
    sc.parallelize(Seq[Nothing]()).take(1) // Fails
    sc.parallelize(Seq[Any]()).take(1)     // Array[Any] = Array()
    sc.parallelize(Seq[Int]()).take(1)     // Array[Int] = Array()
    ~~~
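The three cases differ only in what the compiler infers for the element type: a bare `Seq()` is typed `Seq[Nothing]`, the bottom type. A plain-Scala sketch (no Spark needed; `isNothingTyped` is a hypothetical helper using the implicit-parameter-with-default trick) that makes the inference visible:

```scala
// Hypothetical check: did the compiler infer the element type as Nothing?
// If no evidence T =:= Nothing can be found, the default null is used.
def isNothingTyped[T](seq: Seq[T])(implicit ev: T =:= Nothing = null): Boolean =
  ev != null

val bare  = isNothingTyped(Seq())       // true:  Seq() is Seq[Nothing]
val asAny = isNothingTyped(Seq[Any]())  // false
val asInt = isNothingTyped(Seq[Int]())  // false
```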




[GitHub] spark pull request: [SPARK-5744] [CORE] Making RDD.isEmpty robust ...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/4534#issuecomment-73965544
  
    Hmph. I think we might be able to work around this if `runJob` didn't have to return an `Array`, but it does. Even if it collects its results in another collection, they have to be copied to an `Array`. Every way I look at it, that's where this runs into the mismatch between these Scala types, JVM arrays, and covariance.
    
    All of these are fairly artificial cases. An empty RDD or partition of a normal type works fine. I think it's worth making the `EmptyRDD` change to take care of the `Seq()` case, and treating the rest as a consequence of how Scala, arrays, and these types interact with the current API.




[GitHub] spark pull request: [SPARK-5744] [CORE] Making RDD.isEmpty robust ...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/4534#issuecomment-74246841
  
    OK, I can merge my PR. I would be fine with you adding this into yours, with whatever additions you want. That is, I didn't intend to hijack your change, and I can credit you in JIRA.




[GitHub] spark pull request: Making RDD.isEmpty robust to empty partitions ...

Posted by tbertelsen <gi...@git.apache.org>.
Github user tbertelsen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4534#discussion_r24521457
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala ---
    @@ -1253,9 +1253,9 @@ abstract class RDD[T: ClassTag](
     
       /**
        * @return true if and only if the RDD contains no elements at all. Note that an RDD
    -   *         may be empty even when it has at least 1 partition.
    +   *         may be empty even when it has 1 or more partitions.
        */
    -  def isEmpty(): Boolean = partitions.length == 0 || take(1).length == 0
    +  def isEmpty(): Boolean = partitions.length == 0 || mapPartitions(it => Iterator(!it.hasNext)).reduce(_&&_)
    --- End diff --
    
    You might have a point, that this is actually a bug in `take()`.
    
    `sc.parallelize(Seq(1,2,3),1).take(1337)` works fine and returns `Array[Int] = Array(1, 2, 3)`.
    
    `sc.parallelize(Seq(),1).take(1)`, on the other hand, fails.




[GitHub] spark pull request: [SPARK-5744] [CORE] Making RDD.isEmpty robust ...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/4534#issuecomment-73973075
  
    The case of `Nothing` is actually fine to solve here; you don't have to reject it. It's `Null` I can't figure out. I think at best, just fix the `Nothing` case per above.




[GitHub] spark pull request: Fixing SPARK-5744.

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4534#discussion_r24520320
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala ---
    @@ -1253,9 +1253,9 @@ abstract class RDD[T: ClassTag](
     
       /**
        * @return true if and only if the RDD contains no elements at all. Note that an RDD
    -   *         may be empty even when it has at least 1 partition.
    +   *         may be empty even when it has 1 or more partitions.
        */
    -  def isEmpty(): Boolean = partitions.length == 0 || take(1).length == 0
    +  def isEmpty(): Boolean = partitions.length == 0 || mapPartitions(it => Iterator(!it.hasNext)).reduce(_&&_)
    --- End diff --
    
    I'll try the test case, sure, to investigate. The case of an empty partition should be handled already by `take()`, so I don't think that's it per se. 
    
    (I'm worried about this logic since it will touch every partition, and the point was to not do so. The 2 changes before this line aren't necessary.) 
    
    The exception looks more like funny business in handling `Seq()` (i.e. type `Any`) somewhere along the line. I'll look.




[GitHub] spark pull request: [SPARK-5744] [CORE] Making RDD.isEmpty robust ...

Posted by tbertelsen <gi...@git.apache.org>.
Github user tbertelsen closed the pull request at:

    https://github.com/apache/spark/pull/4534




[GitHub] spark pull request: [SPARK-5744] [CORE] Making RDD.isEmpty robust ...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/4534#issuecomment-73946196
  
    This works, so the problem is not empty partitions per se:
    
    ```
    sc.parallelize(Seq[Int](), 1).isEmpty()
    ```
    
    This also throws an exception, so it has to do with `Seq()` (whose element type is `Nothing`, not `Any`, sorry):
    
    ```
    sc.parallelize(Seq(), 1).take(1)
    ```
    
    I think the problem roughly boils down to this behavior in Scala:
    
    ```
    Seq(Seq().toArray,Seq().toArray).toArray
    java.lang.ArrayStoreException: [Ljava.lang.Object;
    ...
    ```
    
    The problem is the `Nothing` type, rather than emptiness, but I'm still scratching my head figuring out the cleanest way to deal with that. 
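The `ArrayStoreException` above ultimately reduces to JVM array covariance: the runtime array backing a `Seq[Nothing].toArray` is an `Object[]`, which then gets stored into a more specifically typed array. A minimal plain-Scala sketch of the same mechanism (no Spark or `Nothing` required):

```scala
// JVM arrays are covariant: a String[] may be referenced as Array[AnyRef].
val strings: Array[AnyRef] = Array[String]("a").asInstanceOf[Array[AnyRef]]

// The assignment type-checks (Integer <: AnyRef), but the JVM rejects the
// store into the underlying String[] at runtime: ArrayStoreException.
val threw =
  try { strings(0) = Integer.valueOf(1); false }
  catch { case _: ArrayStoreException => true }
```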




[GitHub] spark pull request: [SPARK-5744] [CORE] Making RDD.isEmpty robust ...

Posted by tbertelsen <gi...@git.apache.org>.
Github user tbertelsen commented on the pull request:

    https://github.com/apache/spark/pull/4534#issuecomment-74246192
  
    It looks like it fixes the issue with `Seq()`, but not e.g. `new Array(2)` (I have not run it, though). After our discussion I think that is perfectly fine.
    
    If you manage to create a non-empty collection of `Nothing`, then practically nothing will work in Spark, or in Scala for that matter; e.g., `(new Array(2)).toList` won't compile.
    
    Handling `RDD[Nothing]` seems like it would require quite some work, and I fail to see a use case where it brings real functionality.
    
    In summary: your fix seems like a good solution. I'll close this PR.




[GitHub] spark pull request: Fixing SPARK-5744.

Posted by markhamstra <gi...@git.apache.org>.
Github user markhamstra commented on the pull request:

    https://github.com/apache/spark/pull/4534#issuecomment-73936918
  
    Thanks for doing this, but the title of this PR isn't sufficient.  It will become the commit log message, so please update the PR title to adequately describe what you did so that other developers don't have to look into the details of the commit or look up the JIRA issue just to get an idea of what this PR is about.




[GitHub] spark pull request: [SPARK-5744] [CORE] Making RDD.isEmpty robust ...

Posted by markhamstra <gi...@git.apache.org>.
Github user markhamstra commented on the pull request:

    https://github.com/apache/spark/pull/4534#issuecomment-73941110
  
    perfect




[GitHub] spark pull request: [SPARK-5744] [CORE] Making RDD.isEmpty robust ...

Posted by tbertelsen <gi...@git.apache.org>.
Github user tbertelsen commented on the pull request:

    https://github.com/apache/spark/pull/4534#issuecomment-73959278
  
    How do we treat `null` values, e.g. `sc.parallelize(Seq(new Array[Nothing](1)))`? I have no experience with TypeTags, but couldn't we use them in a smart way?
    
    By the way, `sc.parallelize(Seq(new Array[Nothing](1)))` fails on almost anything.




[GitHub] spark pull request: SPARK-5744 [CORE] Making RDD.isEmpty robust to...

Posted by tbertelsen <gi...@git.apache.org>.
Github user tbertelsen commented on the pull request:

    https://github.com/apache/spark/pull/4534#issuecomment-73940991
  
    Sorry. Is it good now?




[GitHub] spark pull request: [SPARK-5744] [CORE] Making RDD.isEmpty robust ...

Posted by tbertelsen <gi...@git.apache.org>.
Github user tbertelsen commented on the pull request:

    https://github.com/apache/spark/pull/4534#issuecomment-73972070
  
    It compiles if you include `new`. It just becomes a `null`-filled array.
    ~~~
    scala> new Array[Nothing](1)
    res19: Array[Nothing] = Array(null)
    ~~~
    
    I think you're right about this being fairly artificial. One solution could be just not to accept `Nothing` as the type of an RDD, and to fail fast and predictably like this:
    
    ~~~
        if (implicitly[ClassTag[T]].runtimeClass == classOf[scala.runtime.Nothing$]) {
          throw new RuntimeException("RDD[Nothing] is not allowed.")
        }
    ~~~
    
    We must however be careful. Right now `sc.emptyRDD` returns an `RDD[Nothing]`, and there could easily be unforeseen consequences of prohibiting `RDD[Nothing]`.
    
    
    If we just want to provide a good error message, we could also catch the `ArrayStoreException` and wrap it in an exception with a more descriptive message, perhaps at the place where it is already wrapped in `SparkDriverExecutionException`.
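The proposed guard can be sketched outside of Spark with a hypothetical `Box` container standing in for `RDD[T]`. Note that `classOf[Nothing]` itself does not compile; the runtime class behind `ClassTag.Nothing` is `scala.runtime.Nothing$`, so the sketch compares against that (the tag is also passed explicitly in the failing case, for portability):

```scala
import scala.reflect.ClassTag

// Hypothetical container standing in for RDD[T]: reject an element type
// of Nothing up front, failing fast and predictably.
class Box[T](val items: Seq[T])(implicit tag: ClassTag[T]) {
  require(
    tag.runtimeClass != classOf[scala.runtime.Nothing$],
    "Box[Nothing] is not allowed; annotate a concrete element type.")
}

val ok = new Box(Seq[Int]())   // fine: element type Int

// A bare Seq() infers Seq[Nothing]; require throws IllegalArgumentException.
val rejected =
  try { new Box(Seq())(ClassTag.Nothing); false }
  catch { case _: IllegalArgumentException => true }
```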




[GitHub] spark pull request: Fixing SPARK-5744.

Posted by tbertelsen <gi...@git.apache.org>.
Github user tbertelsen commented on the pull request:

    https://github.com/apache/spark/pull/4534#issuecomment-73935576
  
    FYI: The method was introduced in https://github.com/apache/spark/pull/4074




[GitHub] spark pull request: Making RDD.isEmpty robust to empty partitions ...

Posted by markhamstra <gi...@git.apache.org>.
Github user markhamstra commented on the pull request:

    https://github.com/apache/spark/pull/4534#issuecomment-73940039
  
    @tbertelsen Better, but you still should include SPARK-5744 and add [CORE] to the PR title.




[GitHub] spark pull request: Fixing SPARK-5744.

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/4534#issuecomment-73936109
  
    Can one of the admins verify this patch?




[GitHub] spark pull request: [SPARK-5744] [CORE] Making RDD.isEmpty robust ...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/4534#issuecomment-73955339
  
    I think I have a solution. A `Seq[Nothing]` must be empty since there are no instances of `Nothing`. So I think this can be fixed (and optimized) by returning an `EmptyRDD` from `parallelize` when given an empty argument.
    
    ```
      def parallelize[T: ClassTag](seq: Seq[T], numSlices: Int = defaultParallelism): RDD[T] = {
        assertNotStopped()
        if (seq.isEmpty) {
          new EmptyRDD[T](this)
        } else {
          new ParallelCollectionRDD[T](this, seq, numSlices, Map[Int, Seq[String]]())
        }
      }
    ```
    
    Witness:
    
    ```
    scala> sc.parallelize(Seq(), 1).isEmpty
    res0: Boolean = true
    
    scala> sc.parallelize(Seq(), 1).take(1)
    res1: Array[Nothing] = Array()
    ```
    
    It deserves a unit test for both of these and a run through all the tests. Want to try that?
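The key observation generalizes beyond Spark: `Nothing` has no values, so a `Seq[Nothing]` is necessarily empty and the `seq.isEmpty` short-circuit never loses data. A plain-Scala analogue of the dispatch, with hypothetical stand-ins for `EmptyRDD` and `ParallelCollectionRDD`:

```scala
// Hypothetical stand-ins for the two RDD flavors, just to show the dispatch.
sealed trait FakeRDD[T] { def isEmpty: Boolean }
case class Empty[T]() extends FakeRDD[T] {
  def isEmpty: Boolean = true
}
case class Parallel[T](seq: Seq[T]) extends FakeRDD[T] {
  def isEmpty: Boolean = seq.isEmpty
}

def parallelize[T](seq: Seq[T]): FakeRDD[T] =
  if (seq.isEmpty) Empty[T]()   // a bare Seq() (i.e. Seq[Nothing]) lands
  else Parallel(seq)            // here, so no Nothing-typed array is built

val e = parallelize(Seq())      // Empty()
val p = parallelize(Seq(1, 2))  // Parallel(List(1, 2))
```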




[GitHub] spark pull request: [SPARK-5744] [CORE] Making RDD.isEmpty robust ...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/4534#issuecomment-74062847
  
    @tbertelsen Here's my proposed fix: https://github.com/srowen/spark/commit/2390a3f612eece1a2f68c7bd8edbb88647062854
    
    I discovered along the way that `histogram()` doesn't support an RDD with 0 partitions, so I included that fix.




[GitHub] spark pull request: [SPARK-5744] [CORE] Making RDD.isEmpty robust ...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/4534#issuecomment-73960255
  
    `Array[Nothing](1)` won't compile by itself, nor will `Seq[Nothing](1)`, for different reasons. `new Array[Nothing](1)` won't be accepted as a `Seq[?]` for reasons I don't 100% get.
    
    But there is actually still the same problem with the `Null` type: `sc.parallelize(Seq(null), 1).isEmpty` still fails. The change above is good but doesn't fix this case. Back to the drawing board...

