Posted to reviews@spark.apache.org by mattf <gi...@git.apache.org> on 2014/09/21 15:53:42 UTC

[GitHub] spark pull request: [SPARK-3580] add 'partitions' property to PySp...

GitHub user mattf opened a pull request:

    https://github.com/apache/spark/pull/2478

    [SPARK-3580] add 'partitions' property to PySpark RDD

    'rdd.partitions' is available in Scala and Java, primarily used for
    its size() method to get the number of partitions. PySpark instead
    has a getNumPartitions() call and no access to 'partitions'.
    
    This change adds 'partitions' to PySpark's RDD, allowing
    len(rdd.partitions) to get the number of partitions in a way
    familiar to Python developers.
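
    For reference, a minimal sketch of what such a property might look
    like. This is illustrative only, not the patch itself; it assumes
    the internal _jrdd py4j handle that PySpark's RDD already carries:

        @property
        def partitions(self):
            # The JVM side returns a java.util.List of Partition objects;
            # py4j exposes it as an iterable, so wrap it in a Python list.
            # The elements are opaque handles, mostly useful for len().
            return list(self._jrdd.partitions())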

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mattf/spark SPARK-3580

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/2478.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2478
    
----
commit 96316e4bcbce345ed9ba0f6f68c70508d102a4cc
Author: Matthew Farrellee <ma...@redhat.com>
Date:   2014-09-21T13:49:58Z

    [SPARK-3580] add 'partitions' property to PySpark RDD
    
    'rdd.partitions' is available in Scala and Java, primarily used for
    its size() method to get the number of partitions. PySpark instead
    has a getNumPartitions() call and no access to 'partitions'.
    
    This change adds 'partitions' to PySpark's RDD, allowing
    len(rdd.partitions) to get the number of partitions in a way
    familiar to Python developers.

----




[GitHub] spark pull request: [SPARK-3580] add 'partitions' property to PySp...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/2478#issuecomment-56717218
  
    I think `len(rdd)` has the potential to be confused with `rdd.count()`, since calling `len()` on a Python collection usually returns the size of that collection.
    
    I also agree that we shouldn't expose Java `Partition` objects to users.  Is there any reason to expose `Partition` objects besides allowing `len(rdd.partitions())` to work?  If not, I'm not sure that we should add this feature.
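
    To make the concern concrete (assuming an existing SparkContext sc):

        data = sc.parallelize(range(100), 4)  # 100 elements, 4 partitions
        len([1, 2, 3])  # 3: len() on a collection counts its elements
        data.count()    # 100: the element count of the RDD
        # A len(data) that returned 4 (the partition count) would break
        # the usual len() contract and surprise Python users.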




[GitHub] spark pull request: [SPARK-3580] add 'partitions' property to PySp...

Posted by mattf <gi...@git.apache.org>.
Github user mattf commented on the pull request:

    https://github.com/apache/spark/pull/2478#issuecomment-56805152
  
    > RDD._jrdd is very heavy for PipelinedRDD, but getNumPartitions() could be optimized for PipelinedRDD to avoid the creation of _jrdd (could be rdd.prev.getNumPartitions()).
    
    Very true.
    
    
    > Also, partitions() returns Java objects; it should be an implementation detail, and it's better to keep it as an internal interface.
    
    Also true. When testing this, I noticed the list should essentially be treated as full of black boxes, but that's difficult to enforce without wrapping the Java objects in a Python version of Partition.
    
    @davies @JoshRosen, what's the purpose of exposing an array of partition objects in Scala and Java?
    
    
    > I think len(rdd) has the potential to be confused with rdd.count(), since calling len() on a Python collection usually returns the size of that collection.
    
    I agree.




[GitHub] spark pull request: [SPARK-3580] add 'partitions' property to PySp...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/2478




[GitHub] spark pull request: [SPARK-3580] add 'partitions' property to PySp...

Posted by mattf <gi...@git.apache.org>.
Github user mattf commented on the pull request:

    https://github.com/apache/spark/pull/2478#issuecomment-57258363
  
    @JoshRosen a partition itself doesn't have much in the way of a user API. It wouldn't be difficult to wrap the Java objects in a Python Partition; we should then start implementing the RDD functions that take partitions in Python.
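
    A thin wrapper along these lines might look like the following.
    This is hypothetical and was never merged; the only accessor shown,
    index(), is the method the JVM Partition interface defines:

        class Partition(object):
            """Hypothetical Python-side wrapper for a JVM Partition."""

            def __init__(self, jpartition):
                self._jpartition = jpartition  # opaque py4j handle

            @property
            def index(self):
                # Spark's Partition trait defines `def index: Int`.
                return self._jpartition.index()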




[GitHub] spark pull request: [SPARK-3580] add 'partitions' property to PySp...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on the pull request:

    https://github.com/apache/spark/pull/2478#issuecomment-56552452
  
    RDD._jrdd is very heavy for PipelinedRDD, but getNumPartitions() could be optimized for PipelinedRDD to avoid the creation of _jrdd (could be rdd.prev.getNumPartitions()).
    
    Also, partitions() returns Java objects; it should be an implementation detail, and it's better to keep it as an internal interface.
    
    len(rdd.partitions()) sounds more Pythonic; how about `len(rdd)`? cc @JoshRosen
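
    A sketch of the suggested optimization, assuming PipelinedRDD keeps
    a reference to its parent RDD as `prev` (as the comment implies):

        def getNumPartitions(self):
            # A pipelined map/filter never changes the partitioning, so
            # delegate to the parent instead of forcing creation of the
            # heavy _jrdd.
            return self.prev.getNumPartitions()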




[GitHub] spark pull request: [SPARK-3580] add 'partitions' property to PySp...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2478#issuecomment-56301525
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20621/consoleFull) for PR 2478 at commit [`96316e4`](https://github.com/apache/spark/commit/96316e4bcbce345ed9ba0f6f68c70508d102a4cc).
     * This patch **passes** unit tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `abstract class LogicalPlan extends QueryPlan[LogicalPlan] with Logging `





[GitHub] spark pull request: [SPARK-3580] add 'partitions' property to PySp...

Posted by mattf <gi...@git.apache.org>.
Github user mattf commented on the pull request:

    https://github.com/apache/spark/pull/2478#issuecomment-58884300
  
    @JoshRosen @pwendell any further comment on this?




[GitHub] spark pull request: [SPARK-3580] add 'partitions' property to PySp...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2478#issuecomment-56299526
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20621/consoleFull) for PR 2478 at commit [`96316e4`](https://github.com/apache/spark/commit/96316e4bcbce345ed9ba0f6f68c70508d102a4cc).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-3580] add 'partitions' property to PySp...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/2478#issuecomment-62222172
  
    Hi @mattf,
    
    If you don't have any additional comments, do you mind closing this pull request?  Thanks!




[GitHub] spark pull request: [SPARK-3580] add 'partitions' property to PySp...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/2478#issuecomment-56858958
  
    > what's the purpose of exposing an array of partition objects in Scala and Java?
    
    In Scala / Java, I think we expose Partition objects for use in custom RDD implementations.  There are a bunch of methods like `iterator()`, `compute()`, `preferredLocations()`, etc. that take Partition objects.  Outside of this context, I'm not sure that they're useful to end-users.
    
    (If you use IntelliJ, try running "Find Usages" on Partition and look at the "Method Parameter Declaration" usages).




[GitHub] spark pull request: [SPARK-3580] add 'partitions' property to PySp...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/2478#issuecomment-58916892
  
    @mattf I'm not sure that it's worth exposing those `Partition`-accepting methods in Python, since I don't think that they're really intended to be called by users.  I guess I don't really see the benefit of adding a `Partition` wrapper to Python just so that we can write `len(partitions)`; it seems like a lot of work for not much benefit.

