Posted to reviews@spark.apache.org by "ueshin (via GitHub)" <gi...@apache.org> on 2023/10/12 23:09:12 UTC
[PR] [SPARK-45266][PYTHON][FOLLOWUP] Fix to resolve UnresolvedPolymorphicPythonUDTF when the table argument is specified as a named argument [spark]
ueshin opened a new pull request, #43355:
URL: https://github.com/apache/spark/pull/43355
### What changes were proposed in this pull request?
This is a follow-up of #43042.
Fix to resolve UnresolvedPolymorphicPythonUDTF when the table argument is specified as a named argument.
### Why are the changes needed?
The Python UDTF analysis result is not applied when the table argument is specified as a named argument.
For example, for the following UDTF:
```py
from pyspark.sql import Row
from pyspark.sql.functions import udtf
from pyspark.sql.types import IntegerType, StructType
from pyspark.sql.udtf import AnalyzeResult, OrderingColumn


@udtf
class TestUDTF:
    def __init__(self):
        self._count = 0
        self._sum = 0
        self._last = None

    @staticmethod
    def analyze(*args, **kwargs):
        # Request a single partition, ordered by "input" then "partition_col".
        return AnalyzeResult(
            schema=StructType()
            .add("count", IntegerType())
            .add("total", IntegerType())
            .add("last", IntegerType()),
            with_single_partition=True,
            order_by=[OrderingColumn("input"), OrderingColumn("partition_col")],
        )

    def eval(self, row: Row):
        # Make sure that the rows arrive in the expected order.
        if self._last is not None and self._last > row["input"]:
            raise Exception(
                f"self._last was {self._last} but the row value was {row['input']}"
            )
        self._count += 1
        self._last = row["input"]
        self._sum += row["input"]

    def terminate(self):
        # Emit one summary row per UDTF instance.
        yield self._count, self._sum, self._last


# `spark` is an existing SparkSession.
spark.udtf.register("test_udtf", TestUDTF)
```
The following query returns a wrong result:
```py
>>> spark.sql("""
... WITH t AS (
... SELECT id AS partition_col, 1 AS input FROM range(1, 21)
... UNION ALL
... SELECT id AS partition_col, 2 AS input FROM range(1, 21)
... )
... SELECT count, total, last
... FROM test_udtf(row => TABLE(t))
... ORDER BY 1, 2
... """).show()
+-----+-----+----+
|count|total|last|
+-----+-----+----+
|    1|    1|   1|
|    1|    1|   1|
|    1|    1|   1|
|    1|    1|   1|
|    1|    1|   1|
|    1|    1|   1|
|    1|    1|   1|
|    1|    1|   1|
|    1|    1|   1|
|    1|    1|   1|
|    1|    1|   1|
|    1|    1|   1|
|    1|    2|   2|
|    1|    2|   2|
|    1|    2|   2|
|    1|    2|   2|
|    1|    2|   2|
|    1|    2|   2|
|    1|    2|   2|
|    1|    2|   2|
+-----+-----+----+
only showing top 20 rows
```
It should equal the result without the named argument:
```py
>>> spark.sql("""
... WITH t AS (
... SELECT id AS partition_col, 1 AS input FROM range(1, 21)
... UNION ALL
... SELECT id AS partition_col, 2 AS input FROM range(1, 21)
... )
... SELECT count, total, last
... FROM test_udtf(TABLE(t))
... ORDER BY 1, 2
... """).show()
+-----+-----+----+
|count|total|last|
+-----+-----+----+
|   40|   60|   2|
+-----+-----+----+
```
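To see why the two results differ, here is a minimal plain-Python sketch (no Spark involved) that reuses the `eval`/`terminate` logic of `TestUDTF`. The names `Accumulator`, `run_single_partition` and `run_split` are hypothetical helpers for illustration only, and the round-robin split into 4 partitions is an arbitrary assumption: it does not reproduce how Spark actually partitioned the rows in the wrong result above.

```py
# Plain-Python sketch only; helper names and the 4-way split are assumptions.
class Accumulator:
    """Mirrors the eval/terminate logic of TestUDTF."""

    def __init__(self):
        self._count = 0
        self._sum = 0
        self._last = None

    def eval(self, row):
        # Same ordering check as the UDTF.
        if self._last is not None and self._last > row["input"]:
            raise Exception(
                f"self._last was {self._last} but the row value was {row['input']}"
            )
        self._count += 1
        self._last = row["input"]
        self._sum += row["input"]

    def terminate(self):
        yield self._count, self._sum, self._last


def run_single_partition(rows):
    """Analysis result applied: one instance sees all rows, in sorted order."""
    acc = Accumulator()
    for row in sorted(rows, key=lambda r: (r["input"], r["partition_col"])):
        acc.eval(row)
    return list(acc.terminate())


def run_split(rows, num_partitions):
    """Analysis result ignored: rows are spread over several instances."""
    results = []
    for k in range(num_partitions):
        acc = Accumulator()
        for row in rows[k::num_partitions]:  # round-robin split (assumption)
            acc.eval(row)
        results.extend(acc.terminate())
    return results


# Same data as the CTE `t`: ids 1..20 with input 1, then ids 1..20 with input 2.
rows = [{"partition_col": i, "input": v} for v in (1, 2) for i in range(1, 21)]

single = run_single_partition(rows)
split = run_split(rows, 4)
print(single)  # one summary row covering all 40 input rows
print(split)   # one summary row per partition
```

With a single ordered partition, one instance accumulates all 40 rows and emits one row, matching the expected `40, 60, 2`. Once the rows are spread across several instances, each instance emits its own partial summary, which is the shape of the wrong result.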
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Modified the related tests.
### Was this patch authored or co-authored using generative AI tooling?
No.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
Re: [PR] [SPARK-45266][PYTHON][FOLLOWUP] Fix to resolve UnresolvedPolymorphicPythonUDTF when the table argument is specified as a named argument [spark]
Posted by "ueshin (via GitHub)" <gi...@apache.org>.
ueshin commented on PR #43355:
URL: https://github.com/apache/spark/pull/43355#issuecomment-1760655880
Thanks! merging to master.
Re: [PR] [SPARK-45266][PYTHON][FOLLOWUP] Fix to resolve UnresolvedPolymorphicPythonUDTF when the table argument is specified as a named argument [spark]
Posted by "ueshin (via GitHub)" <gi...@apache.org>.
ueshin commented on PR #43355:
URL: https://github.com/apache/spark/pull/43355#issuecomment-1760653226
The failed tests are not related to this PR.
> KafkaSourceStressSuite.stress test with multiple topics and partitions
Re: [PR] [SPARK-45266][PYTHON][FOLLOWUP] Fix to resolve UnresolvedPolymorphicPythonUDTF when the table argument is specified as a named argument [spark]
Posted by "ueshin (via GitHub)" <gi...@apache.org>.
ueshin commented on PR #43355:
URL: https://github.com/apache/spark/pull/43355#issuecomment-1760496230
cc @dtenedor
Re: [PR] [SPARK-45266][PYTHON][FOLLOWUP] Fix to resolve UnresolvedPolymorphicPythonUDTF when the table argument is specified as a named argument [spark]
Posted by "ueshin (via GitHub)" <gi...@apache.org>.
ueshin closed pull request #43355: [SPARK-45266][PYTHON][FOLLOWUP] Fix to resolve UnresolvedPolymorphicPythonUDTF when the table argument is specified as a named argument
URL: https://github.com/apache/spark/pull/43355