Posted to reviews@spark.apache.org by "dtenedor (via GitHub)" <gi...@apache.org> on 2024/03/04 20:58:20 UTC

[PR] [SPARK-44746][Python] Add more Python UDTF documentation for functions that accept input tables [spark]

dtenedor opened a new pull request, #45375:
URL: https://github.com/apache/spark/pull/45375

   ### What changes were proposed in this pull request?
   
   This PR adds more Python UDTF documentation for functions that accept input tables.
   
   ### Why are the changes needed?
   
   This functionality was added recently but is not yet covered in the docs.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No, it's a documentation-only change.
   
   ### How was this patch tested?
   
   N/A
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No



Re: [PR] [SPARK-44746][Python] Add more Python UDTF documentation for functions that accept input tables [spark]

Posted by "dtenedor (via GitHub)" <gi...@apache.org>.
dtenedor commented on code in PR #45375:
URL: https://github.com/apache/spark/pull/45375#discussion_r1511980443


##########
python/docs/source/user_guide/sql/python_udtf.rst:
##########
@@ -63,6 +63,7 @@ To implement a Python UDTF, you first need to define a class implementing the me
             """
             ...
 
+        @staticmethod
         def analyze(self, *args: Any) -> AnalyzeResult:

Review Comment:
   You're right, good catch. Updated this.
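
   For reference, a minimal sketch of the corrected signature discussed here: `analyze` as a
   true static method (no `self` parameter), with the arguments typed as `AnalyzeArgument`.
   This assumes the `AnalyzeArgument` and `AnalyzeResult` classes are importable from
   `pyspark.sql.functions`, as in recent Spark versions:

   ```
   from pyspark.sql.functions import AnalyzeArgument, AnalyzeResult

   @staticmethod
   def analyze(*args: AnalyzeArgument) -> AnalyzeResult:
       # No `self` parameter: each element of `args` describes one argument
       # of the particular UDTF call under analysis.
       ...
   ```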



##########
python/docs/source/user_guide/sql/python_udtf.rst:
##########
@@ -285,10 +327,39 @@ To implement a Python UDTF, you first need to define a class implementing the me
             """
             ...
 
+Emitting output rows
+--------------------
+
+The return type of the UDTF defines the schema of the table it outputs. It must be either a
+``StructType``, for example ``StructType().add("c1", StringType())``, or a DDL string representing a
+struct type, for example ``c1: string``. The `eval` and `terminate` methods then emit zero or more
+output rows conforming to this schema by yielding tuples, lists, or pyspark.sql.Row objects. For
+example:
+
+```
+def eval(self, x, y, z):
+    # Here we return a row by providing a tuple of three elements.

Review Comment:
   Sure, this is done.
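
   As a self-contained sketch of the three row-emitting styles the new section describes
   (the class name, schema, and values below are illustrative, not part of the PR):

   ```
   from pyspark.sql import Row
   from pyspark.sql.functions import lit, udtf

   @udtf(returnType="c1: string, c2: int")
   class EmitExample:
       def eval(self, x: int):
           # Emit a row as a tuple of two elements.
           yield ("tuple", x)
           # Emit a row as a list of two elements.
           yield ["list", x + 1]
           # Emit a row as a pyspark.sql.Row object with named fields.
           yield Row(c1="row", c2=x + 2)

   # Calling the UDTF produces three rows conforming to the declared schema.
   EmitExample(lit(1)).show()
   ```

   The same schema could equivalently be passed as
   `StructType().add("c1", StringType()).add("c2", IntegerType())`.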



##########
python/docs/source/user_guide/sql/python_udtf.rst:
##########
@@ -163,6 +185,28 @@ To implement a Python UDTF, you first need to define a class implementing the me
             ...         num_articles=len((
             ...             word for word in words
             ...             if word == 'a' or word == 'an' or word == 'the')))
+
+            An `analyze` implementation that returns a constant output schema, and also requests
+            to select a subset of columns from the input table, and for the input table to be partitioned
+            across several UDTF calls based on the values of the `date` column:
+
+            >>> @staticmethod
+            ... def analyze(*args) -> AnalyzeResult:

Review Comment:
   I added some more explanation here.
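
   As a sketch of what such an `analyze` implementation could look like (assuming the
   `PartitioningColumn` and `SelectedColumn` helpers available in recent Spark versions;
   the `date` and `amount` column names are illustrative):

   ```
   from pyspark.sql.functions import AnalyzeResult, PartitioningColumn, SelectedColumn
   from pyspark.sql.types import IntegerType, StructType

   @staticmethod
   def analyze(*args) -> AnalyzeResult:
       # The output schema is constant: it does not depend on the input arguments.
       return AnalyzeResult(
           schema=StructType().add("total", IntegerType()),
           # Request that Spark forward only these input table columns to `eval`.
           select=[SelectedColumn("date"), SelectedColumn("amount")],
           # Request that the input table be partitioned across several UDTF
           # calls based on the values of the `date` column.
           partitionBy=[PartitioningColumn("date")],
       )
   ```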



##########
python/docs/source/user_guide/sql/python_udtf.rst:
##########
@@ -75,31 +76,52 @@ To implement a Python UDTF, you first need to define a class implementing the me
 
             This method accepts zero or more parameters mapping 1:1 with the arguments provided to
             the particular UDTF call under consideration. Each parameter is an instance of the
-            `AnalyzeArgument` class, which contains fields including the provided argument's data
-            type and value (in the case of literal scalar arguments only). For table arguments, the
-            `isTable` field is set to true and the `dataType` field is a StructType representing
-            the table's column types:
-
-                dataType: DataType
-                value: Optional[Any]
-                isTable: bool
+            `AnalyzeArgument` class.
+
+            `AnalyzeArgument` fields
+            ------------------------
+            dataType: DataType
+                Indicates the type of the provided input argument to this particular UDTF call.
+                For input table arguments, this is a StructType representing the table's columns.
+            value: Optional[Any]
+                The value of the provided input argument to this particular UDTF call. This is
+                `None` for table arguments, and for scalar arguments that are not constant expressions.
+            isTable: bool
+                This is true if the provided input argument to this particular UDTF call is a
+                table argument.
+            isConstantExpression: bool
+                This is true if the provided input argument to this particular UDTF call is a
+                constant scalar expression.

Review Comment:
   Yes. Updated this to explicitly say "either a literal or other constant-foldable scalar expression."
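
   To make the distinction concrete, a hypothetical `analyze` could branch on these
   fields along these lines (illustrative only):

   ```
   from pyspark.sql.functions import AnalyzeArgument, AnalyzeResult
   from pyspark.sql.types import StructType

   @staticmethod
   def analyze(arg: AnalyzeArgument) -> AnalyzeResult:
       if arg.isTable:
           # Table argument: dataType is a StructType of the table's columns
           # and value is None; echo the columns back as the output schema.
           return AnalyzeResult(schema=arg.dataType)
       elif arg.isConstantExpression:
           # A literal or other constant-foldable scalar expression:
           # arg.value holds its evaluated value.
           return AnalyzeResult(schema=StructType().add("result", arg.dataType))
       else:
           # Non-constant scalar argument: arg.value is None.
           return AnalyzeResult(schema=StructType().add("result", arg.dataType))
   ```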




Re: [PR] [SPARK-44746][PYTHON] Add more Python UDTF documentation for functions that accept input tables [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on PR #45375:
URL: https://github.com/apache/spark/pull/45375#issuecomment-1979844986

   Merged to master.



Re: [PR] [SPARK-44746][Python] Add more Python UDTF documentation for functions that accept input tables [spark]

Posted by "allisonwang-db (via GitHub)" <gi...@apache.org>.
allisonwang-db commented on PR #45375:
URL: https://github.com/apache/spark/pull/45375#issuecomment-1979764609

   Looks good! Also cc @ueshin and @HyukjinKwon 



Re: [PR] [SPARK-44746][Python] Add more Python UDTF documentation for functions that accept input tables [spark]

Posted by "dtenedor (via GitHub)" <gi...@apache.org>.
dtenedor commented on PR #45375:
URL: https://github.com/apache/spark/pull/45375#issuecomment-1977448390

   cc @allisonwang-db @ueshin 



Re: [PR] [SPARK-44746][Python] Add more Python UDTF documentation for functions that accept input tables [spark]

Posted by "dtenedor (via GitHub)" <gi...@apache.org>.
dtenedor commented on code in PR #45375:
URL: https://github.com/apache/spark/pull/45375#discussion_r1513503230


##########
examples/src/main/python/sql/udtf.py:
##########
@@ -210,6 +210,75 @@ def eval(self, row: Row):
     # +---+
 
 
+def python_udtf_table_argument_with_partitioning(spark: SparkSession) -> None:
+
+    from pyspark.sql.functions import udtf
+    from pyspark.sql.types import Row
+
+    # Define and register a UDTF.
+    @udtf(returnType="a: string, b: int")
+    class FilterUDTF:
+        def __init__(self):
+            self.key = ""
+            self.max = 0
+
+        def eval(self, row: Row):
+            self.key = row["a"]
+            self.max = max(self.max, row["b"])
+
+        def terminate(self):
+            yield self.key, self.max
+
+    spark.udtf.register("filter_udtf", FilterUDTF)
+
+    # Create an input table with some example values.
+    spark.sql("DROP TABLE IF EXISTS values_table")
+    spark.sql("CREATE TABLE values_table (a STRING, b INT)")
+    spark.sql("INSERT INTO values_table VALUES ('abc', 2), ('abc', 4), ('def', 6), ('def', 8)")
+    spark.table("values_table").show()
+

Review Comment:
   Sure, this is done.
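
   For reference, with the values inserted above, the `show()` call prints the following
   (row order may vary):

   ```
   # +---+---+
   # |  a|  b|
   # +---+---+
   # |abc|  2|
   # |abc|  4|
   # |def|  6|
   # |def|  8|
   # +---+---+
   ```

   And a sketch of the kind of query this example builds toward (the exact SQL added in
   the PR may differ): invoke the UDTF with the table as an argument, partitioned by
   column `a` so that all rows sharing a key reach the same UDTF instance:

   ```
   spark.sql(
       "SELECT * FROM filter_udtf(TABLE(values_table) PARTITION BY a ORDER BY b)"
   ).show()
   # +---+---+
   # |  a|  b|
   # +---+---+
   # |abc|  4|
   # |def|  8|
   # +---+---+
   ```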




Re: [PR] [SPARK-44746][Python] Add more Python UDTF documentation for functions that accept input tables [spark]

Posted by "allisonwang-db (via GitHub)" <gi...@apache.org>.
allisonwang-db commented on code in PR #45375:
URL: https://github.com/apache/spark/pull/45375#discussion_r1511851865


##########
python/docs/source/user_guide/sql/python_udtf.rst:
##########
@@ -63,6 +63,7 @@ To implement a Python UDTF, you first need to define a class implementing the me
             """
             ...
 
+        @staticmethod
         def analyze(self, *args: Any) -> AnalyzeResult:

Review Comment:
   Should this `Any` be `AnalyzeArgument` class?



##########
python/docs/source/user_guide/sql/python_udtf.rst:
##########
@@ -285,10 +327,39 @@ To implement a Python UDTF, you first need to define a class implementing the me
             """
             ...
 
+Emitting output rows
+--------------------
+
+The return type of the UDTF defines the schema of the table it outputs. It must be either a
+``StructType``, for example ``StructType().add("c1", StringType())``, or a DDL string representing a
+struct type, for example ``c1: string``. The `eval` and `terminate` methods then emit zero or more
+output rows conforming to this schema by yielding tuples, lists, or pyspark.sql.Row objects. For
+example:
+
+```
+def eval(self, x, y, z):
+    # Here we return a row by providing a tuple of three elements.

Review Comment:
   This section is very helpful! Could we move this comment outside of the code block for the following examples? Like
   
   Return a row by providing a tuple of three elements.
   ```
   def eval(...)
   ```
   ...
   



##########
python/docs/source/user_guide/sql/python_udtf.rst:
##########
@@ -75,31 +76,52 @@ To implement a Python UDTF, you first need to define a class implementing the me
 
             This method accepts zero or more parameters mapping 1:1 with the arguments provided to
             the particular UDTF call under consideration. Each parameter is an instance of the
-            `AnalyzeArgument` class, which contains fields including the provided argument's data
-            type and value (in the case of literal scalar arguments only). For table arguments, the
-            `isTable` field is set to true and the `dataType` field is a StructType representing
-            the table's column types:
-
-                dataType: DataType
-                value: Optional[Any]
-                isTable: bool
+            `AnalyzeArgument` class.
+
+            `AnalyzeArgument` fields
+            ------------------------
+            dataType: DataType
+                Indicates the type of the provided input argument to this particular UDTF call.
+                For input table arguments, this is a StructType representing the table's columns.
+            value: Optional[Any]
+                The value of the provided input argument to this particular UDTF call. This is
+                `None` for table arguments, and for scalar arguments that are not constant expressions.
+            isTable: bool
+                This is true if the provided input argument to this particular UDTF call is a
+                table argument.
+            isConstantExpression: bool
+                This is true if the provided input argument to this particular UDTF call is a
+                constant scalar expression.

Review Comment:
   Does it mean it's a literal / foldable expression?



##########
python/docs/source/user_guide/sql/python_udtf.rst:
##########
@@ -163,6 +185,28 @@ To implement a Python UDTF, you first need to define a class implementing the me
             ...         num_articles=len((
             ...             word for word in words
             ...             if word == 'a' or word == 'an' or word == 'the')))
+
+            An `analyze` implementation that returns a constant output schema, and also requests
+            to select a subset of columns from the input table, and for the input table to be partitioned
+            across several UDTF calls based on the values of the `date` column:
+
+            >>> @staticmethod
+            ... def analyze(*args) -> AnalyzeResult:

Review Comment:
   It will be very helpful to have some explanation for this function (what this UDTF is trying to achieve with partitionBy, orderBy, and select). Maybe we can add it in the example section.




Re: [PR] [SPARK-44746][Python] Add more Python UDTF documentation for functions that accept input tables [spark]

Posted by "allisonwang-db (via GitHub)" <gi...@apache.org>.
allisonwang-db commented on code in PR #45375:
URL: https://github.com/apache/spark/pull/45375#discussion_r1513496747


##########
examples/src/main/python/sql/udtf.py:
##########
@@ -210,6 +210,75 @@ def eval(self, row: Row):
     # +---+
 
 
+def python_udtf_table_argument_with_partitioning(spark: SparkSession) -> None:
+
+    from pyspark.sql.functions import udtf
+    from pyspark.sql.types import Row
+
+    # Define and register a UDTF.
+    @udtf(returnType="a: string, b: int")
+    class FilterUDTF:
+        def __init__(self):
+            self.key = ""
+            self.max = 0
+
+        def eval(self, row: Row):
+            self.key = row["a"]
+            self.max = max(self.max, row["b"])
+
+        def terminate(self):
+            yield self.key, self.max
+
+    spark.udtf.register("filter_udtf", FilterUDTF)
+
+    # Create an input table with some example values.
+    spark.sql("DROP TABLE IF EXISTS values_table")
+    spark.sql("CREATE TABLE values_table (a STRING, b INT)")
+    spark.sql("INSERT INTO values_table VALUES ('abc', 2), ('abc', 4), ('def', 6), ('def', 8)")
+    spark.table("values_table").show()
+

Review Comment:
   maybe we can have the output of the show() here?




Re: [PR] [SPARK-44746][PYTHON] Add more Python UDTF documentation for functions that accept input tables [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon closed pull request #45375: [SPARK-44746][PYTHON] Add more Python UDTF documentation for functions that accept input tables
URL: https://github.com/apache/spark/pull/45375

