Posted to reviews@spark.apache.org by kokes <gi...@git.apache.org> on 2018/06/28 07:28:33 UTC

[GitHub] spark pull request #21654: [SPARK-24671][PySpark] DataFrame length using a d...

GitHub user kokes opened a pull request:

    https://github.com/apache/spark/pull/21654

    [SPARK-24671][PySpark] DataFrame length using a dunder/magic method

    ## What changes were proposed in this pull request?
    
    Make `len(df)` work by implementing a `__len__` method on class `DataFrame`; the method just invokes `self.count()`.
    
    ## How was this patch tested?
    
    It was not, because local tests failed early on (lint-scala), before they got to PySpark and I wasn't sure how to skip them. I'm relying on Jenkins here.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/kokes/spark dflen

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21654.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #21654
    
----
commit 4d0afaf3cd046b11e8bae43dc00ddf4b1eb97732
Author: Ondrej Kokes <on...@...>
Date:   2018-06-27T19:50:58Z

    len(df) == df.count()

----
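
The behaviour proposed in this pull request can be sketched without a Spark installation. The snippet below is only an illustration: `FakeDataFrame` is a hypothetical stand-in for `pyspark.sql.DataFrame`, and its `count()` measures a local list, whereas the real `count()` is an action that runs a distributed job.

```python
class FakeDataFrame:
    """Hypothetical stand-in for pyspark.sql.DataFrame; not Spark code."""

    def __init__(self, rows):
        self._rows = list(rows)

    def count(self):
        # Assumption: the real DataFrame.count() triggers a Spark job.
        # Here it only measures a local list.
        return len(self._rows)

    def __len__(self):
        # Delegate to count(), as the pull request describes.
        return self.count()


df = FakeDataFrame([("a", 1), ("b", 2), ("c", 3)])
assert len(df) == df.count() == 3
```

With `__len__` defined this way, `len(df)` and `df.count()` always agree, which matches the commit message.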


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #21654: [SPARK-24671][PySpark] DataFrame length using a d...

Posted by rgbkrk <gi...@git.apache.org>.

Github user rgbkrk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21654#discussion_r216414567
  
    --- Diff: python/pyspark/sql/dataframe.py ---
    @@ -375,6 +375,9 @@ def _truncate(self):
             return int(self.sql_ctx.getConf(
                 "spark.sql.repl.eagerEval.truncate", "20"))
     
    +    def __len__(self):
    --- End diff --
    
    I'd argue for bringing this in, unless you think we'd be handing people a footgun where they often call `len()` on a DataFrame incidentally. As for making a plan around built-in function support, I'm happy to be part of a `_repr_*_` campaign. I wouldn't have the background to participate in others (`__lt__`, etc.), as I couldn't weigh their maintainability, performance, and utility the way I could for visual elements like reprs.


---


[GitHub] spark issue #21654: [SPARK-24671][PySpark] DataFrame length using a dunder/m...

Posted by srowen <gi...@git.apache.org>.

Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/21654
  
    Interesting. WDYT, Python people like @holdenk? This could be implemented on other classes like RDD, I guess. Any downside? Could it lead people to confuse a local collection with a distributed data structure?
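
One possible downside can be sketched in plain Python: when a class defines `__len__` but not `__bool__`, truth-value testing falls back to `__len__`. The snippet below is a hedged illustration only. `CountingFrame` is a hypothetical class, not part of PySpark, and it merely records how often `count()` is called. Whether this matters for the real `DataFrame` depends on whether that class defines `__bool__`, which this thread does not say.

```python
class CountingFrame:
    """Hypothetical class that records how often count() is called."""

    def __init__(self, n_rows):
        self._n_rows = n_rows
        self.count_calls = 0

    def count(self):
        # Assumption: in Spark this would be a distributed job.
        self.count_calls += 1
        return self._n_rows

    def __len__(self):
        return self.count()


df = CountingFrame(n_rows=5)

# No __bool__ is defined, so Python evaluates truthiness through __len__.
if df:
    pass

assert df.count_calls == 1
```

In this sketch an innocent-looking `if df:` triggers one call to `count()`, and an empty frame would evaluate as false.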


---


[GitHub] spark issue #21654: [SPARK-24671][PySpark] DataFrame length using a dunder/m...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21654
  
    Can one of the admins verify this patch?


---


[GitHub] spark issue #21654: [SPARK-24671][PySpark] DataFrame length using a dunder/m...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21654
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95823/
    Test FAILed.


---


[GitHub] spark issue #21654: [SPARK-24671][PySpark] DataFrame length using a dunder/m...

Posted by holdenk <gi...@git.apache.org>.

Github user holdenk commented on the issue:

    https://github.com/apache/spark/pull/21654
  
    Hey @kokes, this is out of sync with master; can you merge in the latest master? I'm going to follow up on the dev@ list about the plan that @HyukjinKwon wants to see (please feel free to join that discussion).


---


[GitHub] spark issue #21654: [SPARK-24671][PySpark] DataFrame length using a dunder/m...

Posted by holdenk <gi...@git.apache.org>.

Github user holdenk commented on the issue:

    https://github.com/apache/spark/pull/21654
  
    cc @rgbkrk 


---


[GitHub] spark issue #21654: [SPARK-24671][PySpark] DataFrame length using a dunder/m...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21654
  
    **[Test build #98211 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98211/testReport)** for PR 21654 at commit [`e580442`](https://github.com/apache/spark/commit/e5804422c2711b3b8f7989a909ef27ef4cacb056).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
