You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by "dtenedor (via GitHub)" <gi...@apache.org> on 2024/02/03 00:43:06 UTC
[PR] [SPARK-46966][Python] Add UDTF API for 'analyze' method to indicate subset of input table columns to select [spark]
dtenedor opened a new pull request, #45007:
URL: https://github.com/apache/spark/pull/45007
### What changes were proposed in this pull request?
This PR adds a UDTF API for the 'analyze' method to indicate subset of input table columns to select.
For example, this UDTF populates this 'select' list to indicate that Spark should only return two input columns from the input table: 'input' and 'partition_col':
```
from pyspark.sql.functions import AnalyzeResult, OrderingColumn, PartitioningColumn
from pyspark.sql.types import IntegerType, Row, StructType
class Udtf:
def __init__(self):
self._partition_col = None
self._count = 0
self._sum = 0
self._last = None
@staticmethod
def analyze(row: Row):
return AnalyzeResult(
schema=StructType()
.add("partition_col", IntegerType())
.add("count", IntegerType())
.add("total", IntegerType())
.add("last", IntegerType()),
partitionBy=[
PartitioningColumn("$partitionBy")
],
orderBy=[
OrderingColumn("$orderBy")
],
select=[
"input", "partition_col"
])
def eval(self, row: Row):
self._partition_col = row["partition_col"]
self._count += 1
self._last = row["input"]
self._sum += row["input"]
def terminate(self):
yield self._partition_col, self._count, self._sum, self._last
```
### Why are the changes needed?
This can reduce the amount of data sent between the JVM and Python interpreter, improving performance.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
This PR adds test coverage.
### Was this patch authored or co-authored using generative AI tooling?
No.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
Re: [PR] [SPARK-46966][Python] Add UDTF API for 'analyze' method to indicate subset of input table columns to select [spark]
Posted by "ueshin (via GitHub)" <gi...@apache.org>.
ueshin commented on code in PR #45007:
URL: https://github.com/apache/spark/pull/45007#discussion_r1476911322
##########
sql/core/src/test/scala/org/apache/spark/sql/IntegratedUDFTestUtils.scala:
##########
@@ -627,22 +631,44 @@ object IntegratedUDFTestUtils extends SQLHelper {
object UDTFPartitionByOrderBy
extends TestPythonUDTFPartitionByOrderByBase(
partitionBy = "partition_col",
- orderBy = "input")
+ orderBy = "input",
+ select = "")
object UDTFPartitionByOrderByComplexExpr
extends TestPythonUDTFPartitionByOrderByBase(
partitionBy = "partition_col + 1",
- orderBy = "RANDOM(42)")
+ orderBy = "RANDOM(42)",
+ select = "")
+
+ object UDTFPartitionByOrderBySelectExpr
+ extends TestPythonUDTFPartitionByOrderByBase(
+ partitionBy = "partition_col",
+ orderBy = "input",
+ select = "\"partition_col\", \"input\"")
+
+ object UDTFPartitionByOrderBySelectComplexExpr
+ extends TestPythonUDTFPartitionByOrderByBase(
+ partitionBy = "partition_col + 1",
+ orderBy = "RANDOM(42)",
+ select = "\"input + 1\"")
object UDTFInvalidPartitionByOrderByParseError
extends TestPythonUDTFPartitionByOrderByBase(
partitionBy = "unparsable",
- orderBy = "input")
+ orderBy = "input",
+ select = "")
object UDTFInvalidOrderByAscKeyword
extends TestPythonUDTFPartitionByOrderByBase(
partitionBy = "partition_col",
- orderBy = "partition_col ASC")
+ orderBy = "partition_col ASC",
+ select = "")
+
+ object UDTFInvalidSelectExprParseError
+ extends TestPythonUDTFPartitionByOrderByBase(
+ partitionBy = "partition_col",
+ orderBy = "input",
+ select = "\"unparsable\"")
Review Comment:
What happens if `select` contains only `partition_col`? That will try to get `row["input"]` in `eval`.
Could you add the test?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
Re: [PR] [SPARK-46966][Python] Add UDTF API for 'analyze' method to indicate subset of input table columns to select [spark]
Posted by "ueshin (via GitHub)" <gi...@apache.org>.
ueshin commented on PR #45007:
URL: https://github.com/apache/spark/pull/45007#issuecomment-1932985032
The failed tests seem not related to the last commit. The previous build passed except for linter that is fixed in the last commit.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
Re: [PR] [SPARK-46966][Python] Add UDTF API for 'analyze' method to indicate subset of input table columns to select [spark]
Posted by "ueshin (via GitHub)" <gi...@apache.org>.
ueshin commented on code in PR #45007:
URL: https://github.com/apache/spark/pull/45007#discussion_r1481956296
##########
python/pyspark/sql/worker/analyze_udtf.py:
##########
@@ -31,7 +31,7 @@
write_with_length,
SpecialLengths,
)
-from pyspark.sql.functions import PartitioningColumn
+from pyspark.sql.functions import OrderingColumn, PartitioningColumn, SelectedColumn
Review Comment:
Or we may want a similar check for `OrderingColumn`? which can be done in a separate PR.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
Re: [PR] [SPARK-46966][Python] Add UDTF API for 'analyze' method to indicate subset of input table columns to select [spark]
Posted by "dtenedor (via GitHub)" <gi...@apache.org>.
dtenedor commented on code in PR #45007:
URL: https://github.com/apache/spark/pull/45007#discussion_r1479121291
##########
sql/core/src/test/scala/org/apache/spark/sql/IntegratedUDFTestUtils.scala:
##########
@@ -627,22 +631,44 @@ object IntegratedUDFTestUtils extends SQLHelper {
object UDTFPartitionByOrderBy
extends TestPythonUDTFPartitionByOrderByBase(
partitionBy = "partition_col",
- orderBy = "input")
+ orderBy = "input",
+ select = "")
object UDTFPartitionByOrderByComplexExpr
extends TestPythonUDTFPartitionByOrderByBase(
partitionBy = "partition_col + 1",
- orderBy = "RANDOM(42)")
+ orderBy = "RANDOM(42)",
+ select = "")
+
+ object UDTFPartitionByOrderBySelectExpr
+ extends TestPythonUDTFPartitionByOrderByBase(
+ partitionBy = "partition_col",
+ orderBy = "input",
+ select = "\"partition_col\", \"input\"")
+
+ object UDTFPartitionByOrderBySelectComplexExpr
+ extends TestPythonUDTFPartitionByOrderByBase(
+ partitionBy = "partition_col + 1",
+ orderBy = "RANDOM(42)",
+ select = "\"input + 1\"")
object UDTFInvalidPartitionByOrderByParseError
extends TestPythonUDTFPartitionByOrderByBase(
partitionBy = "unparsable",
- orderBy = "input")
+ orderBy = "input",
+ select = "")
object UDTFInvalidOrderByAscKeyword
extends TestPythonUDTFPartitionByOrderByBase(
partitionBy = "partition_col",
- orderBy = "partition_col ASC")
+ orderBy = "partition_col ASC",
+ select = "")
+
+ object UDTFInvalidSelectExprParseError
+ extends TestPythonUDTFPartitionByOrderByBase(
+ partitionBy = "partition_col",
+ orderBy = "input",
+ select = "\"unparsable\"")
Review Comment:
Good idea, this is done. I added several more tests as well.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
Re: [PR] [SPARK-46966][Python] Add UDTF API for 'analyze' method to indicate subset of input table columns to select [spark]
Posted by "ueshin (via GitHub)" <gi...@apache.org>.
ueshin commented on PR #45007:
URL: https://github.com/apache/spark/pull/45007#issuecomment-1932985366
Thanks! merging to master.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
Re: [PR] [SPARK-46966][Python] Add UDTF API for 'analyze' method to indicate subset of input table columns to select [spark]
Posted by "dtenedor (via GitHub)" <gi...@apache.org>.
dtenedor commented on PR #45007:
URL: https://github.com/apache/spark/pull/45007#issuecomment-1924946296
cc @ueshin
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
Re: [PR] [SPARK-46966][Python] Add UDTF API for 'analyze' method to indicate subset of input table columns to select [spark]
Posted by "dtenedor (via GitHub)" <gi...@apache.org>.
dtenedor commented on PR #45007:
URL: https://github.com/apache/spark/pull/45007#issuecomment-1928614063
@ueshin thanks for your review! I switched the selected column representation to a new class `SelectedColumn` per offline suggestion, including an optional field for specifying an alias.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
Re: [PR] [SPARK-46966][Python] Add UDTF API for 'analyze' method to indicate subset of input table columns to select [spark]
Posted by "ueshin (via GitHub)" <gi...@apache.org>.
ueshin commented on PR #45007:
URL: https://github.com/apache/spark/pull/45007#issuecomment-1924996188
I guess `"$partitionBy"` and `"$orderBy"` in the example of the descritpion should be some columns? Could you fix them?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
Re: [PR] [SPARK-46966][Python] Add UDTF API for 'analyze' method to indicate subset of input table columns to select [spark]
Posted by "ueshin (via GitHub)" <gi...@apache.org>.
ueshin commented on code in PR #45007:
URL: https://github.com/apache/spark/pull/45007#discussion_r1481953697
##########
python/pyspark/sql/worker/analyze_udtf.py:
##########
@@ -31,7 +31,7 @@
write_with_length,
SpecialLengths,
)
-from pyspark.sql.functions import PartitioningColumn
+from pyspark.sql.functions import OrderingColumn, PartitioningColumn, SelectedColumn
Review Comment:
```suggestion
from pyspark.sql.functions import PartitioningColumn, SelectedColumn
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
Re: [PR] [SPARK-46966][Python] Add UDTF API for 'analyze' method to indicate subset of input table columns to select [spark]
Posted by "dtenedor (via GitHub)" <gi...@apache.org>.
dtenedor commented on code in PR #45007:
URL: https://github.com/apache/spark/pull/45007#discussion_r1481977258
##########
python/pyspark/sql/worker/analyze_udtf.py:
##########
@@ -31,7 +31,7 @@
write_with_length,
SpecialLengths,
)
-from pyspark.sql.functions import PartitioningColumn
+from pyspark.sql.functions import OrderingColumn, PartitioningColumn, SelectedColumn
Review Comment:
Fixed this for now by removing `OrderingColumn`. I will add a similar check for that in a separate PR to help separate concerns.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
Re: [PR] [SPARK-46966][Python] Add UDTF API for 'analyze' method to indicate subset of input table columns to select [spark]
Posted by "ueshin (via GitHub)" <gi...@apache.org>.
ueshin closed pull request #45007: [SPARK-46966][Python] Add UDTF API for 'analyze' method to indicate subset of input table columns to select
URL: https://github.com/apache/spark/pull/45007
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org