You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2021/12/10 21:43:34 UTC

[GitHub] [spark] ravwojdyla opened a new pull request #34863: [SPARK-37601] sql.DataFrame.transform accept function parameters

ravwojdyla opened a new pull request #34863:
URL: https://github.com/apache/spark/pull/34863


   [SPARK-37601](https://issues.apache.org/jira/browse/SPARK-37601):
   
   ```python
   def foo(df: DataFrame, p: int) -> DataFrame
     ...
   
   # current:
   
   from functools import partial
   df.transform(partial(foo, p=3))
   
   # or:
   
   df.transform(lambda df: foo(df, p=3))
   
   # vs this PR:
   
   df.transform(foo, p=3)
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #34863: [SPARK-37601][PYTHON] sql.DataFrame.transform accept function parameters

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34863:
URL: https://github.com/apache/spark/pull/34863#issuecomment-991334913


   Can one of the admins verify this patch?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on pull request #34863: [SPARK-37601][PYTHON] sql.DataFrame.transform accept function parameters

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #34863:
URL: https://github.com/apache/spark/pull/34863#issuecomment-1006873344


   This is closed without merging, but SPARK-37601 is marked as resolved as 3.3.0. It looks wrong to me.
   
   ![Screen Shot 2022-01-06 at 11 26 57 AM](https://user-images.githubusercontent.com/9700541/148439825-8f2d0de3-4151-449e-818e-a151654d3625.png)
   .


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] itholic commented on a change in pull request #34863: [SPARK-37601][PYTHON] sql.DataFrame.transform accept function parameters

Posted by GitBox <gi...@apache.org>.
itholic commented on a change in pull request #34863:
URL: https://github.com/apache/spark/pull/34863#discussion_r776530895



##########
File path: python/pyspark/sql/dataframe.py
##########
@@ -3067,6 +3067,10 @@ def transform(self, func: Callable[["DataFrame"], "DataFrame"]) -> "DataFrame":
         ----------
         func : function
             a function that takes and returns a :class:`DataFrame`.
+        args : Any
+            optional positional arguments for the func function
+        kwargs: Any
+            optional keyword arguments for the func function

Review comment:
       Ah, I got it.
   
   Yeah, then it looks fine as is for now to me.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] itholic commented on a change in pull request #34863: [SPARK-37601][PYTHON] sql.DataFrame.transform accept function parameters

Posted by GitBox <gi...@apache.org>.
itholic commented on a change in pull request #34863:
URL: https://github.com/apache/spark/pull/34863#discussion_r776170974



##########
File path: python/pyspark/sql/dataframe.py
##########
@@ -3067,6 +3067,10 @@ def transform(self, func: Callable[["DataFrame"], "DataFrame"]) -> "DataFrame":
         ----------
         func : function
             a function that takes and returns a :class:`DataFrame`.
+        args : Any
+            optional positional arguments for the func function
+        kwargs: Any
+            optional keyword arguments for the func function

Review comment:
       I think they are not the Parameters of `transform` ??
   
   I think it should be described as a child of `func`.
   
   e.g.
   ```python
           func : function
               a function that takes and returns a :class:`DataFrame`.
               * args : Any
                   optional positional arguments for the func function
               * kwargs: Any
                   optional keyword arguments for the func function
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] ravwojdyla closed pull request #34863: [SPARK-37601][PYTHON] sql.DataFrame.transform accept function parameters

Posted by GitBox <gi...@apache.org>.
ravwojdyla closed pull request #34863:
URL: https://github.com/apache/spark/pull/34863


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] ravwojdyla commented on a change in pull request #34863: [SPARK-37601][PYTHON] sql.DataFrame.transform accept function parameters

Posted by GitBox <gi...@apache.org>.
ravwojdyla commented on a change in pull request #34863:
URL: https://github.com/apache/spark/pull/34863#discussion_r772645527



##########
File path: python/pyspark/sql/dataframe.py
##########
@@ -3076,15 +3080,17 @@ def transform(self, func: Callable[["DataFrame"], "DataFrame"]) -> "DataFrame":
         ...     return input_df.select([col(col_name).cast("int") for col_name in input_df.columns])
         >>> def sort_columns_asc(input_df):
         ...     return input_df.select(*sorted(input_df.columns))
-        >>> df.transform(cast_all_to_int).transform(sort_columns_asc).show()
+        >>> def mul_by(input_df, m):
+        ...     return input_df.select([(col(col_name) * m).alias(col_name) for col_name in input_df.columns])
+        >>> df.transform(cast_all_to_int).transform(sort_columns_asc).transform(mul_by, 3).show()
         +-----+---+
         |float|int|
         +-----+---+
-        |    1|  1|
-        |    2|  2|
+        |    3|  3|
+        |    6|  6|
         +-----+---+

Review comment:
       @itholic sure - will make that change.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] ravwojdyla edited a comment on pull request #34863: [SPARK-37601][PYTHON] sql.DataFrame.transform accept function parameters

Posted by GitBox <gi...@apache.org>.
ravwojdyla edited a comment on pull request #34863:
URL: https://github.com/apache/spark/pull/34863#issuecomment-1003161166


   https://github.com/apache/spark/pull/35062 ? 🤔
   
   Kinda unfortunate that the other PR got merged in hours while this one stayed open for 20 days. At this point the only thing this PR adds is tests, feel free to copy them. I'm closing this.
   
   Thanks for the review @itholic and @zero323 :)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] itholic commented on a change in pull request #34863: [SPARK-37601][PYTHON] sql.DataFrame.transform accept function parameters

Posted by GitBox <gi...@apache.org>.
itholic commented on a change in pull request #34863:
URL: https://github.com/apache/spark/pull/34863#discussion_r776530895



##########
File path: python/pyspark/sql/dataframe.py
##########
@@ -3067,6 +3067,10 @@ def transform(self, func: Callable[["DataFrame"], "DataFrame"]) -> "DataFrame":
         ----------
         func : function
             a function that takes and returns a :class:`DataFrame`.
+        args : Any
+            optional positional arguments for the func function
+        kwargs: Any
+            optional keyword arguments for the func function

Review comment:
       Ah, I got it.
   
   Yeah, then it looks fine as it for now to me.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zero323 commented on a change in pull request #34863: [SPARK-37601][PYTHON] sql.DataFrame.transform accept function parameters

Posted by GitBox <gi...@apache.org>.
zero323 commented on a change in pull request #34863:
URL: https://github.com/apache/spark/pull/34863#discussion_r776354920



##########
File path: python/pyspark/sql/dataframe.py
##########
@@ -3067,6 +3067,10 @@ def transform(self, func: Callable[["DataFrame"], "DataFrame"]) -> "DataFrame":
         ----------
         func : function
             a function that takes and returns a :class:`DataFrame`.
+        args : Any
+            optional positional arguments for the func function
+        kwargs: Any
+            optional keyword arguments for the func function

Review comment:
       But like mentioned above ‒ that's not very likely scenario :) 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] itholic commented on a change in pull request #34863: [SPARK-37601][PYTHON] sql.DataFrame.transform accept function parameters

Posted by GitBox <gi...@apache.org>.
itholic commented on a change in pull request #34863:
URL: https://github.com/apache/spark/pull/34863#discussion_r776531548



##########
File path: python/pyspark/sql/tests/test_dataframe.py
##########
@@ -1138,6 +1138,39 @@ def test_create_nan_decimal_dataframe(self):
             [Row(value=None)],
         )
 
+    def test_transform_param(self):
+        def transform_fun(df, m, *, param):
+            return df.filter(col("c1") <= param).withColumn("c1", col("c1") * m)
+
+        def assert_equals(observed):
+            actual = observed.collect()
+            self.assertEqual(
+                [
+                    {
+                        "c1": 3,
+                    },
+                    {
+                        "c1": 6,
+                    },
+                ],
+                [row.asDict() for row in actual],
+            )
+
+        sdf = self.spark.createDataFrame([(1,), (2,), (3,)], ["c1"])
+        assert_equals(sdf.transform(transform_fun, 3, param=2))
+        assert_equals(sdf.transform(lambda df: transform_fun(df, 3, param=2)))
+        from functools import partial

Review comment:
       nit: I'm not strong feeling about this, but can we move this import to the top of file as PEP recommends ??




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] itholic commented on a change in pull request #34863: [SPARK-37601][PYTHON] sql.DataFrame.transform accept function parameters

Posted by GitBox <gi...@apache.org>.
itholic commented on a change in pull request #34863:
URL: https://github.com/apache/spark/pull/34863#discussion_r776170974



##########
File path: python/pyspark/sql/dataframe.py
##########
@@ -3067,6 +3067,10 @@ def transform(self, func: Callable[["DataFrame"], "DataFrame"]) -> "DataFrame":
         ----------
         func : function
             a function that takes and returns a :class:`DataFrame`.
+        args : Any
+            optional positional arguments for the func function
+        kwargs: Any
+            optional keyword arguments for the func function

Review comment:
       I think they are not the Parameters of `transform` ??
   
   Maybe it should be described as a child of `func`.
   
   e.g.
   ```python
           func : function
               a function that takes and returns a :class:`DataFrame`.
               * args : Any
                   optional positional arguments for the func
               * kwargs: Any
                   optional keyword arguments for the func
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zero323 commented on pull request #34863: [SPARK-37601][PYTHON] sql.DataFrame.transform accept function parameters

Posted by GitBox <gi...@apache.org>.
zero323 commented on pull request #34863:
URL: https://github.com/apache/spark/pull/34863#issuecomment-1003203294


   Sorry @ravwojdyla! Seems like the other ticket emerged from [unrelated discussion](https://lists.apache.org/thread/s4s7s3fb2goy9mx4sk9d2079qx44gfx6) about `DataFrame` inheritance.  Personally, I've seen the note about modifying `transform` and was convinced it refers to this PR, so didn't follow up.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] ravwojdyla commented on a change in pull request #34863: [SPARK-37601][PYTHON] sql.DataFrame.transform accept function parameters

Posted by GitBox <gi...@apache.org>.
ravwojdyla commented on a change in pull request #34863:
URL: https://github.com/apache/spark/pull/34863#discussion_r767045781



##########
File path: python/pyspark/sql/dataframe.py
##########
@@ -3058,7 +3058,7 @@ def toDF(self, *cols: "ColumnOrName") -> "DataFrame":
         jdf = self._jdf.toDF(self._jseq(cols))
         return DataFrame(jdf, self.sql_ctx)
 
-    def transform(self, func: Callable[["DataFrame"], "DataFrame"]) -> "DataFrame":
+    def transform(self, func: Callable[..., "DataFrame"], *args: Any, **kwargs: Any) -> "DataFrame":

Review comment:
       Note: to properly type this, we would need [PEP-612](https://www.python.org/dev/peps/pep-0612/) and python 3.10. An alternative is to provide a long list of overloads for "most" cases.

##########
File path: python/pyspark/sql/dataframe.py
##########
@@ -3058,7 +3058,7 @@ def toDF(self, *cols: "ColumnOrName") -> "DataFrame":
         jdf = self._jdf.toDF(self._jseq(cols))
         return DataFrame(jdf, self.sql_ctx)
 
-    def transform(self, func: Callable[["DataFrame"], "DataFrame"]) -> "DataFrame":
+    def transform(self, func: Callable[..., "DataFrame"], *args: Any, **kwargs: Any) -> "DataFrame":

Review comment:
       Note: to properly type this, we would need [PEP-612](https://www.python.org/dev/peps/pep-0612/) and python 3.10. An alternative is to provide a long list of overloads for "most" cases (but even that would not properly type it).




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] ravwojdyla commented on a change in pull request #34863: [SPARK-37601][PYTHON] sql.DataFrame.transform accept function parameters

Posted by GitBox <gi...@apache.org>.
ravwojdyla commented on a change in pull request #34863:
URL: https://github.com/apache/spark/pull/34863#discussion_r767045781



##########
File path: python/pyspark/sql/dataframe.py
##########
@@ -3058,7 +3058,7 @@ def toDF(self, *cols: "ColumnOrName") -> "DataFrame":
         jdf = self._jdf.toDF(self._jseq(cols))
         return DataFrame(jdf, self.sql_ctx)
 
-    def transform(self, func: Callable[["DataFrame"], "DataFrame"]) -> "DataFrame":
+    def transform(self, func: Callable[..., "DataFrame"], *args: Any, **kwargs: Any) -> "DataFrame":

Review comment:
       Note: to properly type this, we would need [PEP-612](https://www.python.org/dev/peps/pep-0612/) and python 3.10. An alternative is to provide a long list of overloads for "most" cases (but even that would not properly type it).




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] ravwojdyla commented on a change in pull request #34863: [SPARK-37601][PYTHON] sql.DataFrame.transform accept function parameters

Posted by GitBox <gi...@apache.org>.
ravwojdyla commented on a change in pull request #34863:
URL: https://github.com/apache/spark/pull/34863#discussion_r772703401



##########
File path: python/pyspark/sql/dataframe.py
##########
@@ -3076,15 +3080,17 @@ def transform(self, func: Callable[["DataFrame"], "DataFrame"]) -> "DataFrame":
         ...     return input_df.select([col(col_name).cast("int") for col_name in input_df.columns])
         >>> def sort_columns_asc(input_df):
         ...     return input_df.select(*sorted(input_df.columns))
-        >>> df.transform(cast_all_to_int).transform(sort_columns_asc).show()
+        >>> def mul_by(input_df, m):
+        ...     return input_df.select([(col(col_name) * m).alias(col_name) for col_name in input_df.columns])
+        >>> df.transform(cast_all_to_int).transform(sort_columns_asc).transform(mul_by, 3).show()
         +-----+---+
         |float|int|
         +-----+---+
-        |    1|  1|
-        |    2|  2|
+        |    3|  3|
+        |    6|  6|
         +-----+---+

Review comment:
       @itholic done.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zero323 removed a comment on pull request #34863: [SPARK-37601][PYTHON] sql.DataFrame.transform accept function parameters

Posted by GitBox <gi...@apache.org>.
zero323 removed a comment on pull request #34863:
URL: https://github.com/apache/spark/pull/34863#issuecomment-1003202816


   Sorry @ravwojdyla! Seems like the other ticket emerged from [unrelated discussion](https://lists.apache.org/thread/s4s7s3fb2goy9mx4sk9d2079qx44gfx6) about `DataFrame` inheritance.  Personally, I've seen the note about modifying `transform` and was convinced it refers to this PR, so didn't follow.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #34863: [SPARK-37601][PYTHON] sql.DataFrame.transform accept function parameters

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34863:
URL: https://github.com/apache/spark/pull/34863#issuecomment-991334913


   Can one of the admins verify this patch?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] itholic commented on a change in pull request #34863: [SPARK-37601][PYTHON] sql.DataFrame.transform accept function parameters

Posted by GitBox <gi...@apache.org>.
itholic commented on a change in pull request #34863:
URL: https://github.com/apache/spark/pull/34863#discussion_r769286118



##########
File path: python/pyspark/sql/dataframe.py
##########
@@ -3076,15 +3080,17 @@ def transform(self, func: Callable[["DataFrame"], "DataFrame"]) -> "DataFrame":
         ...     return input_df.select([col(col_name).cast("int") for col_name in input_df.columns])
         >>> def sort_columns_asc(input_df):
         ...     return input_df.select(*sorted(input_df.columns))
-        >>> df.transform(cast_all_to_int).transform(sort_columns_asc).show()
+        >>> def mul_by(input_df, m):
+        ...     return input_df.select([(col(col_name) * m).alias(col_name) for col_name in input_df.columns])
+        >>> df.transform(cast_all_to_int).transform(sort_columns_asc).transform(mul_by, 3).show()
         +-----+---+
         |float|int|
         +-----+---+
-        |    1|  1|
-        |    2|  2|
+        |    3|  3|
+        |    6|  6|
         +-----+---+

Review comment:
       Can we keep the existing example and adding the new one to demonstrate the both usage ??




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] itholic commented on a change in pull request #34863: [SPARK-37601][PYTHON] sql.DataFrame.transform accept function parameters

Posted by GitBox <gi...@apache.org>.
itholic commented on a change in pull request #34863:
URL: https://github.com/apache/spark/pull/34863#discussion_r776531548



##########
File path: python/pyspark/sql/tests/test_dataframe.py
##########
@@ -1138,6 +1138,39 @@ def test_create_nan_decimal_dataframe(self):
             [Row(value=None)],
         )
 
+    def test_transform_param(self):
+        def transform_fun(df, m, *, param):
+            return df.filter(col("c1") <= param).withColumn("c1", col("c1") * m)
+
+        def assert_equals(observed):
+            actual = observed.collect()
+            self.assertEqual(
+                [
+                    {
+                        "c1": 3,
+                    },
+                    {
+                        "c1": 6,
+                    },
+                ],
+                [row.asDict() for row in actual],
+            )
+
+        sdf = self.spark.createDataFrame([(1,), (2,), (3,)], ["c1"])
+        assert_equals(sdf.transform(transform_fun, 3, param=2))
+        assert_equals(sdf.transform(lambda df: transform_fun(df, 3, param=2)))
+        from functools import partial

Review comment:
       nit: I'm not strong feeling about this, but can we move this import to the top of file as [PEP recommends](https://www.python.org/dev/peps/pep-0008/#imports) ??




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zero323 commented on pull request #34863: [SPARK-37601][PYTHON] sql.DataFrame.transform accept function parameters

Posted by GitBox <gi...@apache.org>.
zero323 commented on pull request #34863:
URL: https://github.com/apache/spark/pull/34863#issuecomment-1003202816


   Sorry @ravwojdyla! Seems like the other ticket emerged from [unrelated discussion](https://lists.apache.org/thread/s4s7s3fb2goy9mx4sk9d2079qx44gfx6) about `DataFrame` inheritance. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] ravwojdyla commented on a change in pull request #34863: [SPARK-37601][PYTHON] sql.DataFrame.transform accept function parameters

Posted by GitBox <gi...@apache.org>.
ravwojdyla commented on a change in pull request #34863:
URL: https://github.com/apache/spark/pull/34863#discussion_r767045781



##########
File path: python/pyspark/sql/dataframe.py
##########
@@ -3058,7 +3058,7 @@ def toDF(self, *cols: "ColumnOrName") -> "DataFrame":
         jdf = self._jdf.toDF(self._jseq(cols))
         return DataFrame(jdf, self.sql_ctx)
 
-    def transform(self, func: Callable[["DataFrame"], "DataFrame"]) -> "DataFrame":
+    def transform(self, func: Callable[..., "DataFrame"], *args: Any, **kwargs: Any) -> "DataFrame":

Review comment:
       Note: to properly type this, we would need [PEP-612](https://www.python.org/dev/peps/pep-0612/) and python 3.10. An alternative is to provide a long list of overloads for "most" cases.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zero323 edited a comment on pull request #34863: [SPARK-37601][PYTHON] sql.DataFrame.transform accept function parameters

Posted by GitBox <gi...@apache.org>.
zero323 edited a comment on pull request #34863:
URL: https://github.com/apache/spark/pull/34863#issuecomment-1003202816


   Sorry @ravwojdyla! Seems like the other ticket emerged from [unrelated discussion](https://lists.apache.org/thread/s4s7s3fb2goy9mx4sk9d2079qx44gfx6) about `DataFrame` inheritance.  Personally, I've seen the note about modifying `transform` and was convinced it refers to this PR, so didn't follow.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] ravwojdyla edited a comment on pull request #34863: [SPARK-37601][PYTHON] sql.DataFrame.transform accept function parameters

Posted by GitBox <gi...@apache.org>.
ravwojdyla edited a comment on pull request #34863:
URL: https://github.com/apache/spark/pull/34863#issuecomment-1003161166


   https://github.com/apache/spark/pull/35062 ? 🤔
   
   Kinda unfortunate that the other PR got merged in hours while this one stayed open for 20 days. At this point the only thing this PR adds is tests, feel free to copy them. I'm closing this.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] ravwojdyla commented on pull request #34863: [SPARK-37601][PYTHON] sql.DataFrame.transform accept function parameters

Posted by GitBox <gi...@apache.org>.
ravwojdyla commented on pull request #34863:
URL: https://github.com/apache/spark/pull/34863#issuecomment-1003161166


   https://github.com/apache/spark/pull/35062 ?? 🤔 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] itholic commented on a change in pull request #34863: [SPARK-37601][PYTHON] sql.DataFrame.transform accept function parameters

Posted by GitBox <gi...@apache.org>.
itholic commented on a change in pull request #34863:
URL: https://github.com/apache/spark/pull/34863#discussion_r776170974



##########
File path: python/pyspark/sql/dataframe.py
##########
@@ -3067,6 +3067,10 @@ def transform(self, func: Callable[["DataFrame"], "DataFrame"]) -> "DataFrame":
         ----------
         func : function
             a function that takes and returns a :class:`DataFrame`.
+        args : Any
+            optional positional arguments for the func function
+        kwargs: Any
+            optional keyword arguments for the func function

Review comment:
       I think they are not the Parameters of `transform` ??
   
   I think it should be described as a child of `func`.
   
   e.g.
   ```python
           func : function
               a function that takes and returns a :class:`DataFrame`.
               * args : Any
                   optional positional arguments for the func
               * kwargs: Any
                   optional keyword arguments for the func
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zero323 commented on a change in pull request #34863: [SPARK-37601][PYTHON] sql.DataFrame.transform accept function parameters

Posted by GitBox <gi...@apache.org>.
zero323 commented on a change in pull request #34863:
URL: https://github.com/apache/spark/pull/34863#discussion_r776353027



##########
File path: python/pyspark/sql/dataframe.py
##########
@@ -3067,6 +3067,10 @@ def transform(self, func: Callable[["DataFrame"], "DataFrame"]) -> "DataFrame":
         ----------
         func : function
             a function that takes and returns a :class:`DataFrame`.
+        args : Any
+            optional positional arguments for the func function
+        kwargs: Any
+            optional keyword arguments for the func function

Review comment:
       > I think they are not the Parameters of `transform` ??
   
   What I meant is the following (totally made up change);
   
   - Let's say we want to add `foo` argument to `transform` i..e
   
       ```python
       def transform(self, func: Callable[["DataFrame"], "DataFrame"], foo: Any) -> "DataFrame"
       ```
   
   - That's possible ATM, but if we add `*args` and `**kwargs` to the signature this becomes problematic as both
    
       ```python
      def transform(self, func: Callable[["DataFrame"], "DataFrame"], foo: Any, *args, **kwargs) -> "DataFrame"
       ```
   
      and
   
       ```python
      def transform(self, func: Callable[["DataFrame"], "DataFrame"], *args,  foo: Any,  **kwargs) -> "DataFrame"
       ```
   
      are limited and / or ambiguous.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org