You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2021/12/30 20:40:08 UTC

[GitHub] [spark] Daniel-Davies opened a new pull request #35071: [SPARK-37788][Python] Update remaining PySpark functions to use ColumnOrName (over Column), where appropriate

Daniel-Davies opened a new pull request #35071:
URL: https://github.com/apache/spark/pull/35071


   
   ### What changes were proposed in this pull request?
   Please see https://issues.apache.org/jira/browse/SPARK-37788
   
   There are a few remaining functions that should but don't yet support ColumnOrName; this PR updates some annotations of functions that do support it, and converts input column string names to columns if not being done already.
   
   ### Why are the changes needed?
   API consistency in PySpark
   
   
   ### Does this PR introduce _any_ user-facing change?
   Yes; namely two array functions:
   
   - array_repeat; can now support `df.select(array_repeat("data", "repeat_n").alias('r'))`
   - slice: can now support `df.select(slice("data", "index", "length").alias('r'))`
   
   ### How was this patch tested?
   Modification to two existing unit tests


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] HyukjinKwon commented on a change in pull request #35071: [SPARK-37788][PYTHON] Update remaining PySpark functions to use ColumnOrName (over Column), where appropriate

Posted by GitBox <gi...@apache.org>.

HyukjinKwon commented on a change in pull request #35071:
URL: https://github.com/apache/spark/pull/35071#discussion_r776898225



##########
File path: python/pyspark/sql/functions.py
##########
@@ -1715,7 +1715,7 @@ def greatest(*cols: "ColumnOrName") -> Column:
     return Column(sc._jvm.functions.greatest(_to_seq(sc, cols, _to_java_column)))
 
 
-def least(*cols: Column) -> Column:
+def least(*cols: "ColumnOrName") -> Column:

Review comment:
       ditto for docstring




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] zero323 commented on pull request #35071: [SPARK-37788][PYTHON] Update remaining PySpark functions to use ColumnOrName (over Column), where appropriate

Posted by GitBox <gi...@apache.org>.

zero323 commented on pull request #35071:
URL: https://github.com/apache/spark/pull/35071#issuecomment-1006004549


   Merged into master.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] itholic commented on a change in pull request #35071: [SPARK-37788][PYTHON] Update remaining PySpark functions to use ColumnOrName (over Column), where appropriate

Posted by GitBox <gi...@apache.org>.

itholic commented on a change in pull request #35071:
URL: https://github.com/apache/spark/pull/35071#discussion_r777776710



##########
File path: python/pyspark/sql/functions.py
##########
@@ -4697,6 +4714,13 @@ def array_repeat(col: "ColumnOrName", count: Union["ColumnOrName", int]) -> Colu
 
     .. versionadded:: 2.4.0
 
+    Parameters
+    ----------
+    col : :class:`~pyspark.sql.Column` or str
+        column name or column that contains the element to be repeated

Review comment:
       nit: `column name or column containing the element to be repeated` to get consistency with other descriptions ?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] HyukjinKwon commented on pull request #35071: [SPARK-37788][PYTHON] Update remaining PySpark functions to use ColumnOrName (over Column), where appropriate

Posted by GitBox <gi...@apache.org>.

HyukjinKwon commented on pull request #35071:
URL: https://github.com/apache/spark/pull/35071#issuecomment-1005517308


   Thanks for fixing this @Daniel-Davies.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] itholic commented on pull request #35071: [SPARK-37788][PYTHON] Update remaining PySpark functions to use ColumnOrName (over Column), where appropriate

Posted by GitBox <gi...@apache.org>.

itholic commented on pull request #35071:
URL: https://github.com/apache/spark/pull/35071#issuecomment-1004556535


   Could you also update the PR description since maybe this PR also address the `least`, `overlay` more than `array_repeat` and `slice` ?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] Daniel-Davies edited a comment on pull request #35071: [SPARK-37788][PYTHON] Update remaining PySpark functions to use ColumnOrName (over Column), where appropriate

Posted by GitBox <gi...@apache.org>.

Daniel-Davies edited a comment on pull request #35071:
URL: https://github.com/apache/spark/pull/35071#issuecomment-1005679065


   > Thanks for fixing this @Daniel-Davies.
   
   No worries! Thanks all for the diligent reviews everyone.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] HyukjinKwon commented on a change in pull request #35071: [SPARK-37788][PYTHON] Update remaining PySpark functions to use ColumnOrName (over Column), where appropriate

Posted by GitBox <gi...@apache.org>.

HyukjinKwon commented on a change in pull request #35071:
URL: https://github.com/apache/spark/pull/35071#discussion_r778664735



##########
File path: python/pyspark/sql/functions.py
##########
@@ -4172,12 +4209,10 @@ def from_json(
     Parameters
     ----------
     col : :class:`~pyspark.sql.Column` or str
-        string column in json format
+        a column or column name in CSV format

Review comment:
       ```suggestion
           a column or column name in JSON format
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] zero323 closed pull request #35071: [SPARK-37788][PYTHON] Update remaining PySpark functions to use ColumnOrName (over Column), where appropriate

Posted by GitBox <gi...@apache.org>.

zero323 closed pull request #35071:
URL: https://github.com/apache/spark/pull/35071


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] itholic commented on a change in pull request #35071: [SPARK-37788][PYTHON] Update remaining PySpark functions to use ColumnOrName (over Column), where appropriate

Posted by GitBox <gi...@apache.org>.

itholic commented on a change in pull request #35071:
URL: https://github.com/apache/spark/pull/35071#discussion_r777776710



##########
File path: python/pyspark/sql/functions.py
##########
@@ -4697,6 +4714,13 @@ def array_repeat(col: "ColumnOrName", count: Union["ColumnOrName", int]) -> Colu
 
     .. versionadded:: 2.4.0
 
+    Parameters
+    ----------
+    col : :class:`~pyspark.sql.Column` or str
+        column name or column that contains the element to be repeated

Review comment:
       nit: `column name or column containing the element to be repeated` to get consistency with other descriptions ?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] itholic commented on a change in pull request #35071: [SPARK-37788][PYTHON] Update remaining PySpark functions to use ColumnOrName (over Column), where appropriate

Posted by GitBox <gi...@apache.org>.

itholic commented on a change in pull request #35071:
URL: https://github.com/apache/spark/pull/35071#discussion_r777851483



##########
File path: python/pyspark/sql/functions.py
##########
@@ -3102,24 +3121,36 @@ def instr(str: "ColumnOrName", substr: str) -> Column:
 def overlay(
     src: "ColumnOrName",
     replace: "ColumnOrName",
-    pos: Union[Column, int],
-    len: Union[Column, int] = -1,
+    pos: Union["ColumnOrName", int],
+    len: Union["ColumnOrName", int] = -1,

Review comment:
       ditto ?

##########
File path: python/pyspark/sql/functions.py
##########
@@ -1715,13 +1715,18 @@ def greatest(*cols: "ColumnOrName") -> Column:
     return Column(sc._jvm.functions.greatest(_to_seq(sc, cols, _to_java_column)))
 
 
-def least(*cols: Column) -> Column:
+def least(*cols: "ColumnOrName") -> Column:

Review comment:
       Maybe we also need to add the tests for this when the `*cols` is string ?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] Daniel-Davies commented on pull request #35071: [SPARK-37788][PYTHON] Update remaining PySpark functions to use ColumnOrName (over Column), where appropriate

Posted by GitBox <gi...@apache.org>.

Daniel-Davies commented on pull request #35071:
URL: https://github.com/apache/spark/pull/35071#issuecomment-1005679065


   > Thanks for fixing this @Daniel-Davies.
   
   No worries! Thanks all for the diligent reviews.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] HyukjinKwon commented on a change in pull request #35071: [SPARK-37788][PYTHON] Update remaining PySpark functions to use ColumnOrName (over Column), where appropriate

Posted by GitBox <gi...@apache.org>.

HyukjinKwon commented on a change in pull request #35071:
URL: https://github.com/apache/spark/pull/35071#discussion_r776898164



##########
File path: python/pyspark/sql/functions.py
##########
@@ -3707,7 +3709,9 @@ def arrays_overlap(a1: "ColumnOrName", a2: "ColumnOrName") -> Column:
     return Column(sc._jvm.functions.arrays_overlap(_to_java_column(a1), _to_java_column(a2)))
 
 
-def slice(x: "ColumnOrName", start: Union[Column, int], length: Union[Column, int]) -> Column:
+def slice(
+    x: "ColumnOrName", start: Union["ColumnOrName", int], length: Union["ColumnOrName", int]
+) -> Column:

Review comment:
       Since we're here, let's also update `Parameters` in docstring. Especially we would have to document `start` and `length` to avoid confusion.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] HyukjinKwon commented on a change in pull request #35071: [SPARK-37788][PYTHON] Update remaining PySpark functions to use ColumnOrName (over Column), where appropriate

Posted by GitBox <gi...@apache.org>.

HyukjinKwon commented on a change in pull request #35071:
URL: https://github.com/apache/spark/pull/35071#discussion_r776898209



##########
File path: python/pyspark/sql/functions.py
##########
@@ -4687,7 +4691,7 @@ def map_from_entries(col: "ColumnOrName") -> Column:
     return Column(sc._jvm.functions.map_from_entries(_to_java_column(col)))
 
 
-def array_repeat(col: "ColumnOrName", count: Union[Column, int]) -> Column:
+def array_repeat(col: "ColumnOrName", count: Union["ColumnOrName", int]) -> Column:

Review comment:
       Let's also add `Parameters` into docstring since we're here.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] HyukjinKwon commented on a change in pull request #35071: [SPARK-37788][PYTHON] Update remaining PySpark functions to use ColumnOrName (over Column), where appropriate

Posted by GitBox <gi...@apache.org>.

HyukjinKwon commented on a change in pull request #35071:
URL: https://github.com/apache/spark/pull/35071#discussion_r778662841



##########
File path: python/pyspark/sql/functions.py
##########
@@ -4172,12 +4209,10 @@ def from_json(
     Parameters
     ----------
     col : :class:`~pyspark.sql.Column` or str
-        string column in json format
+        a column or column name in CSV format

Review comment:
       Seems a typo? we should fix both `from_json` and `from_csv` :-)




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] zero323 commented on a change in pull request #35071: [SPARK-37788][PYTHON] Update remaining PySpark functions to use ColumnOrName (over Column), where appropriate

Posted by GitBox <gi...@apache.org>.

zero323 commented on a change in pull request #35071:
URL: https://github.com/apache/spark/pull/35071#discussion_r777229120



##########
File path: python/pyspark/sql/functions.py
##########
@@ -4703,7 +4741,8 @@ def array_repeat(col: "ColumnOrName", count: Union[Column, int]) -> Column:
     assert sc is not None and sc._jvm is not None
     return Column(
         sc._jvm.functions.array_repeat(
-            _to_java_column(col), _to_java_column(count) if isinstance(count, Column) else count
+            _to_java_column(col),
+            count if isinstance(count, int) else _to_java_column(count),

Review comment:
       I'd prefer to either leave this one for the time being, or use the same approach as for the previous PR (`lit`).




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #35071: [SPARK-37788][PYTHON] Update remaining PySpark functions to use ColumnOrName (over Column), where appropriate

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #35071:
URL: https://github.com/apache/spark/pull/35071#issuecomment-1003347739


   Can one of the admins verify this patch?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] Daniel-Davies commented on pull request #35071: [SPARK-37788][PYTHON] Update remaining PySpark functions to use ColumnOrName (over Column), where appropriate

Posted by GitBox <gi...@apache.org>.

Daniel-Davies commented on pull request #35071:
URL: https://github.com/apache/spark/pull/35071#issuecomment-1003454890


   > Looks pretty good otherwise. Do you mind adding some docs on functions that take both `Column` and string but interpret string as something different from `Column`'s name? We can do this in a separate PR too.
   > 
   > e.g.) `from_csv`'s `schema` takes `str` but that works like actual string value of DDL schema instead of a column name.
   
   Hey @HyukjinKwon - thanks for the review; I've added some more descriptions to such cases of non column-name strings- let me know if you meant another location


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org