You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by "dtenedor (via GitHub)" <gi...@apache.org> on 2024/02/03 00:43:06 UTC

[PR] [SPARK-46966][Python] Add UDTF API for 'analyze' method to indicate subset of input table columns to select [spark]

dtenedor opened a new pull request, #45007:
URL: https://github.com/apache/spark/pull/45007

   ### What changes were proposed in this pull request?
   
   This PR adds a UDTF API for the 'analyze' method to indicate subset of input table columns to select.
   
   For example, this UDTF populates this 'select' list to indicate that Spark should only return two input columns from the input table: 'input' and 'partition_col':
   
   ```
   from pyspark.sql.functions import AnalyzeResult, OrderingColumn, PartitioningColumn
   from pyspark.sql.types import IntegerType, Row, StructType
   class Udtf:
       def __init__(self):
           self._partition_col = None
           self._count = 0
           self._sum = 0
           self._last = None
   
       @staticmethod
       def analyze(row: Row):
           return AnalyzeResult(
               schema=StructType()
                   .add("partition_col", IntegerType())
                   .add("count", IntegerType())
                   .add("total", IntegerType())
                   .add("last", IntegerType()),
               partitionBy=[
                   PartitioningColumn("$partitionBy")
               ],
               orderBy=[
                   OrderingColumn("$orderBy")
               ],
               select=[
                   "input", "partition_col"
               ])
   
       def eval(self, row: Row):
           self._partition_col = row["partition_col"]
           self._count += 1
           self._last = row["input"]
           self._sum += row["input"]
   
       def terminate(self):
           yield self._partition_col, self._count, self._sum, self._last
   ```
   
   ### Why are the changes needed?
   
   This can reduce the amount of data sent between the JVM and Python interpreter, improving performance.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   This PR adds test coverage.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-46966][Python] Add UDTF API for 'analyze' method to indicate subset of input table columns to select [spark]

Posted by "ueshin (via GitHub)" <gi...@apache.org>.

ueshin commented on code in PR #45007:
URL: https://github.com/apache/spark/pull/45007#discussion_r1476911322


##########
sql/core/src/test/scala/org/apache/spark/sql/IntegratedUDFTestUtils.scala:
##########
@@ -627,22 +631,44 @@ object IntegratedUDFTestUtils extends SQLHelper {
   object UDTFPartitionByOrderBy
     extends TestPythonUDTFPartitionByOrderByBase(
       partitionBy = "partition_col",
-      orderBy = "input")
+      orderBy = "input",
+      select = "")
 
   object UDTFPartitionByOrderByComplexExpr
     extends TestPythonUDTFPartitionByOrderByBase(
       partitionBy = "partition_col + 1",
-      orderBy = "RANDOM(42)")
+      orderBy = "RANDOM(42)",
+      select = "")
+
+  object UDTFPartitionByOrderBySelectExpr
+    extends TestPythonUDTFPartitionByOrderByBase(
+      partitionBy = "partition_col",
+      orderBy = "input",
+      select = "\"partition_col\", \"input\"")
+
+  object UDTFPartitionByOrderBySelectComplexExpr
+    extends TestPythonUDTFPartitionByOrderByBase(
+      partitionBy = "partition_col + 1",
+      orderBy = "RANDOM(42)",
+      select = "\"input + 1\"")
 
   object UDTFInvalidPartitionByOrderByParseError
     extends TestPythonUDTFPartitionByOrderByBase(
       partitionBy = "unparsable",
-      orderBy = "input")
+      orderBy = "input",
+      select = "")
 
   object UDTFInvalidOrderByAscKeyword
     extends TestPythonUDTFPartitionByOrderByBase(
       partitionBy = "partition_col",
-      orderBy = "partition_col ASC")
+      orderBy = "partition_col ASC",
+      select = "")
+
+  object UDTFInvalidSelectExprParseError
+    extends TestPythonUDTFPartitionByOrderByBase(
+      partitionBy = "partition_col",
+      orderBy = "input",
+      select = "\"unparsable\"")

Review Comment:
   What happens if `select` contains only `partition_col`? That will try to get `row["input"]` in `eval`.
   Could you add the test?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-46966][Python] Add UDTF API for 'analyze' method to indicate subset of input table columns to select [spark]

Posted by "ueshin (via GitHub)" <gi...@apache.org>.

ueshin commented on PR #45007:
URL: https://github.com/apache/spark/pull/45007#issuecomment-1932985032

   The failed tests seem not related to the last commit. The previous build passed except for linter that is fixed in the last commit.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-46966][Python] Add UDTF API for 'analyze' method to indicate subset of input table columns to select [spark]

Posted by "ueshin (via GitHub)" <gi...@apache.org>.

ueshin commented on code in PR #45007:
URL: https://github.com/apache/spark/pull/45007#discussion_r1481956296


##########
python/pyspark/sql/worker/analyze_udtf.py:
##########
@@ -31,7 +31,7 @@
     write_with_length,
     SpecialLengths,
 )
-from pyspark.sql.functions import PartitioningColumn
+from pyspark.sql.functions import OrderingColumn, PartitioningColumn, SelectedColumn

Review Comment:
   Or we may want a similar check for `OrderingColumn`? which can be done in a separate PR.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-46966][Python] Add UDTF API for 'analyze' method to indicate subset of input table columns to select [spark]

Posted by "dtenedor (via GitHub)" <gi...@apache.org>.

dtenedor commented on code in PR #45007:
URL: https://github.com/apache/spark/pull/45007#discussion_r1479121291


##########
sql/core/src/test/scala/org/apache/spark/sql/IntegratedUDFTestUtils.scala:
##########
@@ -627,22 +631,44 @@ object IntegratedUDFTestUtils extends SQLHelper {
   object UDTFPartitionByOrderBy
     extends TestPythonUDTFPartitionByOrderByBase(
       partitionBy = "partition_col",
-      orderBy = "input")
+      orderBy = "input",
+      select = "")
 
   object UDTFPartitionByOrderByComplexExpr
     extends TestPythonUDTFPartitionByOrderByBase(
       partitionBy = "partition_col + 1",
-      orderBy = "RANDOM(42)")
+      orderBy = "RANDOM(42)",
+      select = "")
+
+  object UDTFPartitionByOrderBySelectExpr
+    extends TestPythonUDTFPartitionByOrderByBase(
+      partitionBy = "partition_col",
+      orderBy = "input",
+      select = "\"partition_col\", \"input\"")
+
+  object UDTFPartitionByOrderBySelectComplexExpr
+    extends TestPythonUDTFPartitionByOrderByBase(
+      partitionBy = "partition_col + 1",
+      orderBy = "RANDOM(42)",
+      select = "\"input + 1\"")
 
   object UDTFInvalidPartitionByOrderByParseError
     extends TestPythonUDTFPartitionByOrderByBase(
       partitionBy = "unparsable",
-      orderBy = "input")
+      orderBy = "input",
+      select = "")
 
   object UDTFInvalidOrderByAscKeyword
     extends TestPythonUDTFPartitionByOrderByBase(
       partitionBy = "partition_col",
-      orderBy = "partition_col ASC")
+      orderBy = "partition_col ASC",
+      select = "")
+
+  object UDTFInvalidSelectExprParseError
+    extends TestPythonUDTFPartitionByOrderByBase(
+      partitionBy = "partition_col",
+      orderBy = "input",
+      select = "\"unparsable\"")

Review Comment:
   Good idea, this is done. I added several more tests as well.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-46966][Python] Add UDTF API for 'analyze' method to indicate subset of input table columns to select [spark]

Posted by "ueshin (via GitHub)" <gi...@apache.org>.

ueshin commented on PR #45007:
URL: https://github.com/apache/spark/pull/45007#issuecomment-1932985366

   Thanks! merging to master.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-46966][Python] Add UDTF API for 'analyze' method to indicate subset of input table columns to select [spark]

Posted by "dtenedor (via GitHub)" <gi...@apache.org>.

dtenedor commented on PR #45007:
URL: https://github.com/apache/spark/pull/45007#issuecomment-1924946296

   cc @ueshin


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-46966][Python] Add UDTF API for 'analyze' method to indicate subset of input table columns to select [spark]

Posted by "dtenedor (via GitHub)" <gi...@apache.org>.

dtenedor commented on PR #45007:
URL: https://github.com/apache/spark/pull/45007#issuecomment-1928614063

   @ueshin thanks for your review! I switched the selected column representation to a new class `SelectedColumn` per offline suggestion, including an optional field for specifying an alias.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-46966][Python] Add UDTF API for 'analyze' method to indicate subset of input table columns to select [spark]

Posted by "ueshin (via GitHub)" <gi...@apache.org>.

ueshin commented on PR #45007:
URL: https://github.com/apache/spark/pull/45007#issuecomment-1924996188

   I guess `"$partitionBy"` and `"$orderBy"` in the example of the descritpion should be some columns? Could you fix them?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-46966][Python] Add UDTF API for 'analyze' method to indicate subset of input table columns to select [spark]

Posted by "ueshin (via GitHub)" <gi...@apache.org>.

ueshin commented on code in PR #45007:
URL: https://github.com/apache/spark/pull/45007#discussion_r1481953697


##########
python/pyspark/sql/worker/analyze_udtf.py:
##########
@@ -31,7 +31,7 @@
     write_with_length,
     SpecialLengths,
 )
-from pyspark.sql.functions import PartitioningColumn
+from pyspark.sql.functions import OrderingColumn, PartitioningColumn, SelectedColumn

Review Comment:
   ```suggestion
   from pyspark.sql.functions import PartitioningColumn, SelectedColumn
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-46966][Python] Add UDTF API for 'analyze' method to indicate subset of input table columns to select [spark]

Posted by "dtenedor (via GitHub)" <gi...@apache.org>.

dtenedor commented on code in PR #45007:
URL: https://github.com/apache/spark/pull/45007#discussion_r1481977258


##########
python/pyspark/sql/worker/analyze_udtf.py:
##########
@@ -31,7 +31,7 @@
     write_with_length,
     SpecialLengths,
 )
-from pyspark.sql.functions import PartitioningColumn
+from pyspark.sql.functions import OrderingColumn, PartitioningColumn, SelectedColumn

Review Comment:
   Fixed this for now by removing `OrderingColumn`. I will add a similar check for that in a separate PR to help separate concerns.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-46966][Python] Add UDTF API for 'analyze' method to indicate subset of input table columns to select [spark]

Posted by "ueshin (via GitHub)" <gi...@apache.org>.

ueshin closed pull request #45007: [SPARK-46966][Python] Add UDTF API for 'analyze' method to indicate subset of input table columns to select
URL: https://github.com/apache/spark/pull/45007


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org