Posted to reviews@spark.apache.org by srowen <gi...@git.apache.org> on 2015/02/13 15:45:27 UTC

[GitHub] spark pull request: SPARK-5744 [CORE] RDD.isEmpty / take fails for...

GitHub user srowen opened a pull request:

    https://github.com/apache/spark/pull/4591

    SPARK-5744 [CORE] RDD.isEmpty / take fails for (empty) RDD of Nothing

    Handle `take`, `isEmpty` for `parallelize(Seq())`.
    As a knock-on effect of that fix: correctly handle 0 partitions in `histogram`. 
    Plus tests.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/srowen/spark SPARK-5744

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/4591.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #4591
    
----
commit 2390a3f612eece1a2f68c7bd8edbb88647062854
Author: Sean Owen <so...@cloudera.com>
Date:   2015-02-12T12:24:41Z

    Handle take, isEmpty for parallelize(Seq()); related: correctly handle 0 partitions in histogram. Plus tests.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: SPARK-5744 [CORE] RDD.isEmpty / take fails for...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/4591#issuecomment-74397355
  
    To recap:
    
    If you parallelize a Seq of Nothing or Null, you end up with an ArrayStoreException in the part of the code that collects results from partitions. It's a Java-Scala interoperability problem that has to be worked around; I don't think there's a complete solution.
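
    For concreteness, the failing case looks like this (a sketch; I'm not reproducing the exact stack trace):

        // Seq() is inferred as Seq[Nothing]; collecting results from its
        // partitions back on the driver throws ArrayStoreException
        sc.parallelize(Seq()).isEmpty()
        sc.parallelize(Seq()).take(1)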
    
    We can solve the Seq[Nothing] case by special-casing an empty Seq, which at least handles the "sc.parallelize(Seq())" case. I used EmptyRDD for this, since it's logical to return an empty RDD here. However, an EmptyRDD has no partitions, while some code relies on the result having one or more (empty) partitions, so I made that possible.
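
    Roughly, the special case is along these lines (a sketch of the idea, not the exact diff):

        // sketch: inside SparkContext.parallelize, special-case an empty input Seq
        if (seq.isEmpty) {
          new EmptyRDD[T](this)   // avoid building a collection RDD over Array[Nothing]
        } else {
          new ParallelCollectionRDD[T](this, seq, numSlices, Map[Int, Seq[String]]())
        }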
    
    Then I found that some code will ultimately call compute() on an EmptyRDD once it can be returned from parallelize(), and that threw an exception. I changed it to return an empty Iterator instead.
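
    Concretely, the EmptyRDD change amounts to something like:

        // EmptyRDD.compute previously threw; an empty Iterator is the sensible answer
        override def compute(split: Partition, context: TaskContext): Iterator[T] =
          Iterator.empty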
    
    That ends up putting us back where we started, except that you can now make sc.parallelize(Seq()).isEmpty() work by writing it as sc.parallelize(Seq(), 0).isEmpty().
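
    In other words:

        sc.parallelize(Seq()).isEmpty()     // still fails as before
        sc.parallelize(Seq(), 0).isEmpty()  // works with this change, returns true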
    
    (Along the way I discovered that calling histogram() on an EmptyRDD fails, so I touched that up.)
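
    For example, a call along these lines failed on an RDD with no partitions (the specific call is just illustrative):

        sc.parallelize(Seq[Double](), 0).histogram(10)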
    
    The rest is tests.
    
    Question: since it doesn't seem possible to fully fix the "Seq()" case, and the improvement here is small relative to the trouble, is it worth doing this, or should we just accept it as a known issue / corner case?
    
    I could trim this PR down to just the histogram fix and tests (that pass) and call it a day too.
    
    Thoughts? CC @mateiz @pwendell @rxin as it's kind of a core API issue.




[GitHub] spark pull request: SPARK-5744 [CORE] RDD.isEmpty / take fails for...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/4591#issuecomment-75139291
  
    I'm going to abandon this and make a new PR that just documents this behavior. The fix just raises some other issues. (However I'll probably keep the fix for `histogram` in the new PR as a souvenir for the trouble...)




[GitHub] spark pull request: SPARK-5744 [CORE] RDD.isEmpty / take fails for...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/4591#issuecomment-74411356
  
    @rxin So I should emphasize that the result of parallelizing an empty `Seq` right now is an RDD with many empty partitions, rather than no partitions. It _almost_ works to return an empty RDD with no partitions, except that I found one case in the Python API, in its own parallelize() implementation, where it builds the result for an xrange by mapping the empty partitions of parallelize([]) to the desired partitions with a generator function.
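
    The pattern, sketched in Scala purely for illustration (the real code is PySpark's parallelize; the names and range arithmetic here are assumed, not the actual implementation):

        // start from an RDD with numSlices empty partitions, then have each
        // partition generate its slice of 0 until n from its partition index
        val base = sc.parallelize(Seq[Int](), numSlices)
        val ranged = base.mapPartitionsWithIndex { (i, _) =>
          val start = i * n / numSlices
          val end = (i + 1) * n / numSlices
          (start until end).iterator
        }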
    
    Returning an empty RDD with no partitions is basically the much simpler state after my first commit in this PR. If you can think of a better way to implement that bit of the Python API that doesn't rely on getting back an RDD with many empty partitions, that could be a way forward. The nice thing is that that version solves the `parallelize(Seq())` problem for all numbers of partitions. It does introduce a small behavior change, in that you get back a 0-partition RDD rather than one with the default number of partitions.
    
    Otherwise, I tend to agree with just documenting this and moving on.
    
    I'll pause a day for more thoughts.




[GitHub] spark pull request: SPARK-5744 [CORE] RDD.isEmpty / take fails for...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/4591#issuecomment-74273762
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27441/
    Test FAILed.




[GitHub] spark pull request: SPARK-5744 [CORE] RDD.isEmpty / take fails for...

Posted by srowen <gi...@git.apache.org>.
Github user srowen closed the pull request at:

    https://github.com/apache/spark/pull/4591




[GitHub] spark pull request: SPARK-5744 [CORE] RDD.isEmpty / take fails for...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4591#issuecomment-74264066
  
      [Test build #27441 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27441/consoleFull) for   PR 4591 at commit [`2390a3f`](https://github.com/apache/spark/commit/2390a3f612eece1a2f68c7bd8edbb88647062854).
     * This patch merges cleanly.




[GitHub] spark pull request: SPARK-5744 [CORE] RDD.isEmpty / take fails for...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/4591#issuecomment-74406082
  
    It seems to me this is a corner case. We might be better off just documenting it rather than adding an empty partition to the RDD, since conceptually I'd expect an empty RDD to have no partitions.




[GitHub] spark pull request: SPARK-5744 [CORE] RDD.isEmpty / take fails for...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4591#issuecomment-74399542
  
      [Test build #27496 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27496/consoleFull) for   PR 4591 at commit [`7439b31`](https://github.com/apache/spark/commit/7439b31ee01a01d9a27fec9e0e1d93025d413ae6).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `public class JavaStatefulNetworkWordCount `





[GitHub] spark pull request: SPARK-5744 [CORE] RDD.isEmpty / take fails for...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4591#issuecomment-74273745
  
      [Test build #27441 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27441/consoleFull) for   PR 4591 at commit [`2390a3f`](https://github.com/apache/spark/commit/2390a3f612eece1a2f68c7bd8edbb88647062854).
     * This patch **fails PySpark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: SPARK-5744 [CORE] RDD.isEmpty / take fails for...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/4591#issuecomment-75171391
  
    Sounds good. Thanks Sean.
    On Feb 19, 2015 1:26 PM, "Sean Owen" <no...@github.com> wrote:
    
    > I'm going to abandon this and make a new PR that just documents this
    > behavior. The fix just raises some other issues. (However I'll probably
    > keep the fix for histogram in the new PR as a souvenir for the trouble...)
    >
    > —
    > Reply to this email directly or view it on GitHub
    > <https://github.com/apache/spark/pull/4591#issuecomment-75139291>.
    >





[GitHub] spark pull request: SPARK-5744 [CORE] RDD.isEmpty / take fails for...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4591#issuecomment-74397467
  
      [Test build #27496 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27496/consoleFull) for   PR 4591 at commit [`7439b31`](https://github.com/apache/spark/commit/7439b31ee01a01d9a27fec9e0e1d93025d413ae6).
     * This patch merges cleanly.




[GitHub] spark pull request: SPARK-5744 [CORE] RDD.isEmpty / take fails for...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/4591#issuecomment-74399544
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27496/
    Test PASSed.

