Posted to reviews@spark.apache.org by "BeishaoCao-db (via GitHub)" <gi...@apache.org> on 2023/04/27 18:33:36 UTC

[GitHub] [spark] BeishaoCao-db commented on a diff in pull request #40907: [SPARK-43270][PYTHON] Implement `__dir__()` in `pyspark.sql.dataframe.DataFrame` to include columns

BeishaoCao-db commented on code in PR #40907:
URL: https://github.com/apache/spark/pull/40907#discussion_r1179547179


##########
python/pyspark/sql/dataframe.py:
##########
@@ -3008,6 +3008,34 @@ def __getattr__(self, name: str) -> Column:
         jc = self._jdf.apply(name)
         return Column(jc)
 
+    def __dir__(self) -> List[str]:
+        """
+        Examples
+        --------
+        >>> from pyspark.sql.functions import lit
+
+        Create a dataframe with a column named 'id'.
+
+        >>> df = spark.range(3)
+        >>> [attr for attr in dir(df) if attr[0] == 'i'][:7] # Includes column id
+        ['id', 'inputFiles', 'intersect', 'intersectAll', 'isEmpty', 'isLocal', 'isStreaming']
+
+        Add a column named 'i_like_pancakes'.
+
+        >>> df = df.withColumn('i_like_pancakes', lit(1))
+        >>> [attr for attr in dir(df) if attr[0] == 'i'][:7] # Includes columns i_like_pancakes, id
+        ['i_like_pancakes', 'id', 'inputFiles', 'intersect', 'intersectAll', 'isEmpty', 'isLocal']
+
+        Try to add an existing column 'inputFiles'.
+
+        >>> df = df.withColumn('inputFiles', lit(2))
+        >>> [attr for attr in dir(df) if attr[0] == 'i'][:7] # Doesn't duplicate inputFiles
+        ['i_like_pancakes', 'id', 'inputFiles', 'intersect', 'intersectAll', 'isEmpty', 'isLocal']
+        """
+        attrs = list(super().__dir__())
+        attrs.extend(attr for attr in self.columns if attr not in attrs)

Review Comment:
   I don't see the point of using `hasattr`: our `__getattr__` implementation already checks whether the column exists, and we would still need to check that `attr not in attrs`.
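
   To make the point concrete, here is a minimal sketch (illustrative only, assuming the surrounding `DataFrame` class and the file's existing `List` import) of why `hasattr` cannot replace the membership check: `DataFrame.__getattr__` resolves any existing column name to a `Column`, so `hasattr(df, col)` is `True` for every column and would never filter anything out.

   ```python
   def __dir__(self) -> List[str]:
       attrs = list(super().__dir__())
       # hasattr(self, attr) is True for every column (because __getattr__
       # returns a Column for any existing column name), so only this
       # membership check prevents duplicates such as 'inputFiles'.
       attrs.extend(attr for attr in self.columns if attr not in attrs)
       return attrs
   ```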



##########
python/pyspark/sql/dataframe.py:
##########
@@ -3008,6 +3008,34 @@ def __getattr__(self, name: str) -> Column:
         jc = self._jdf.apply(name)
         return Column(jc)
 
+    def __dir__(self) -> List[str]:
+        """
+        Examples
+        --------
+        >>> from pyspark.sql.functions import lit
+
+        Create a dataframe with a column named 'id'.
+
+        >>> df = spark.range(3)
+        >>> [attr for attr in dir(df) if attr[0] == 'i'][:7] # Includes column id
+        ['id', 'inputFiles', 'intersect', 'intersectAll', 'isEmpty', 'isLocal', 'isStreaming']
+
+        Add a column named 'i_like_pancakes'.
+
+        >>> df = df.withColumn('i_like_pancakes', lit(1))
+        >>> [attr for attr in dir(df) if attr[0] == 'i'][:7] # Includes columns i_like_pancakes, id
+        ['i_like_pancakes', 'id', 'inputFiles', 'intersect', 'intersectAll', 'isEmpty', 'isLocal']
+
+        Try to add an existing column 'inputFiles'.
+
+        >>> df = df.withColumn('inputFiles', lit(2))
+        >>> [attr for attr in dir(df) if attr[0] == 'i'][:7] # Doesn't duplicate inputFiles
+        ['i_like_pancakes', 'id', 'inputFiles', 'intersect', 'intersectAll', 'isEmpty', 'isLocal']
+        """
+        attrs = list(super().__dir__())

Review Comment:
   1. Use a set, drop the membership check, and return a sorted list.
   2. Keep the list with the membership check, and return a sorted list.
   
   I submitted two commits, one for each approach; which one do you prefer?
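
   For reference, a sketch of what the two alternatives might look like (illustrative only; the actual commits may differ):

   ```python
   # Option 1: build a set, drop the membership check, return a sorted list.
   def __dir__(self) -> List[str]:
       attrs = set(super().__dir__())
       attrs.update(self.columns)
       return sorted(attrs)

   # Option 2: keep the list and the membership check, return a sorted list.
   def __dir__(self) -> List[str]:
       attrs = list(super().__dir__())
       attrs.extend(attr for attr in self.columns if attr not in attrs)
       return sorted(attrs)
   ```

   Note that the built-in `dir()` sorts whatever `__dir__()` returns, so the doctest output above is sorted either way; returning a sorted list mainly helps code that calls `__dir__()` directly.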



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

