Posted to reviews@spark.apache.org by HyukjinKwon <gi...@git.apache.org> on 2018/09/08 05:22:33 UTC

[GitHub] spark pull request #21654: [SPARK-24671][PySpark] DataFrame length using a d...

Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21654#discussion_r216121454
  
    --- Diff: python/pyspark/sql/dataframe.py ---
    @@ -375,6 +375,9 @@ def _truncate(self):
             return int(self.sql_ctx.getConf(
                 "spark.sql.repl.eagerEval.truncate", "20"))
     
    +    def __len__(self):
    --- End diff --
    
    Would it be better to just not define this? RDD doesn't have it either. IMHO, allowing these protocol methods bit by bit isn't ideal. For example, `column.py` ended up with a weird inconsistency:
    
    ```python
    >>> iter(spark.range(1).id)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/.../spark/python/pyspark/sql/column.py", line 344, in __iter__
        raise TypeError("Column is not iterable")
    TypeError: Column is not iterable
    >>> isinstance(spark.range(1).id, collections.Iterable)
    True
    ```
    
    The restriction itself makes sense in general, though.
    
    This `__iter__` can't be removed, BTW: because we implement `__getitem__` and `__getattr__` to access columns in DataFrames (IIRC), Python's legacy iteration protocol would otherwise treat a `Column` as iterable.
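    
    To illustrate with a minimal sketch (plain Python, not Spark source; `LazyExpr` is a hypothetical stand-in for `Column`): an object that only defines `__getitem__` is already iterable through Python's legacy fallback protocol, so `Column` has to define an `__iter__` that raises, and having that `__iter__` is exactly what makes the `Iterable` ABC check above return `True`:
    
    ```python
    import itertools
    import collections.abc
    
    class LazyExpr:  # hypothetical stand-in for pyspark.sql.Column
        def __getitem__(self, key):
            return key  # like Column.__getitem__, succeeds for any key
    
    # No __iter__ defined, yet iteration works via the __getitem__
    # fallback, which calls __getitem__(0), __getitem__(1), ... and
    # would never stop, since no key raises IndexError:
    print(list(itertools.islice(LazyExpr(), 3)))  # [0, 1, 2]
    
    class SafeLazyExpr(LazyExpr):
        def __iter__(self):  # block the fallback explicitly
            raise TypeError("Column is not iterable")
    
    # ...but merely having __iter__ satisfies the Iterable ABC check,
    # which is the inconsistency shown above:
    print(isinstance(SafeLazyExpr(), collections.abc.Iterable))  # True
    ```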
    
    `__repr__` was added because it's commonly used and had a strong use case for notebooks, etc. For `len()`, however, I wouldn't add it for now: think about `if len(df): ...`, which would eagerly evaluate the whole DataFrame just to get a boolean.
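    
    A minimal sketch of that cost concern (assuming the `__len__` in the diff above would delegate to `count()`, which the truncated body doesn't show): the truthiness check would silently run a full job, whereas today the explicit idiom fetches at most one row:
    
    ```python
    from pyspark.sql import SparkSession
    
    spark = SparkSession.builder.getOrCreate()
    df = spark.range(10)
    
    # With a __len__ delegating to count(), `if len(df): ...` would
    # silently run an eager action over the whole DataFrame:
    n = df.count()
    
    # The explicit, cheap emptiness check fetches at most one row:
    if df.head(1):
        print("non-empty, without counting everything")
    ```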


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org