Posted to reviews@spark.apache.org by "ueshin (via GitHub)" <gi...@apache.org> on 2023/10/12 23:09:12 UTC

[PR] [SPARK-45266][PYTHON][FOLLOWUP] Fix to resolve UnresolvedPolymorphicPythonUDTF when the table argument is specified as a named argument [spark]

ueshin opened a new pull request, #43355:
URL: https://github.com/apache/spark/pull/43355

   ### What changes were proposed in this pull request?
   
   This is a follow-up of #43042.
   
   Resolve `UnresolvedPolymorphicPythonUDTF` correctly when the table argument is specified as a named argument.
   
   ### Why are the changes needed?
   
   The Python UDTF analysis result was not applied when the table argument was specified as a named argument.
   
   For example, for the following UDTF:
   
   ```py
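   from pyspark.sql.functions import udtf
   from pyspark.sql.types import IntegerType, Row, StructType
   from pyspark.sql.udtf import AnalyzeResult, OrderingColumn
   
   @udtf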
   class TestUDTF:
       def __init__(self):
           self._count = 0
           self._sum = 0
           self._last = None
   
       @staticmethod
       def analyze(*args, **kwargs):
           return AnalyzeResult(
               schema=StructType()
               .add("count", IntegerType())
               .add("total", IntegerType())
               .add("last", IntegerType()),
               with_single_partition=True,
               order_by=[OrderingColumn("input"), OrderingColumn("partition_col")],
           )
   
       def eval(self, row: Row):
           # Make sure that the rows arrive in the expected order.
           if self._last is not None and self._last > row["input"]:
               raise Exception(
                   f"self._last was {self._last} but the row value was {row['input']}"
               )
           self._count += 1
           self._last = row["input"]
           self._sum += row["input"]
   
       def terminate(self):
           yield self._count, self._sum, self._last
   
   spark.udtf.register("test_udtf", TestUDTF)
   ```
   
   The following query, which passes the table as the named argument `row`, returns a wrong result:
   
   ```py
   >>> spark.sql("""
   ...     WITH t AS (
   ...       SELECT id AS partition_col, 1 AS input FROM range(1, 21)
   ...       UNION ALL
   ...       SELECT id AS partition_col, 2 AS input FROM range(1, 21)
   ...     )
   ...     SELECT count, total, last
   ...     FROM test_udtf(row => TABLE(t))
   ...     ORDER BY 1, 2
   ... """).show()
   +-----+-----+----+
   |count|total|last|
   +-----+-----+----+
   |    1|    1|   1|
   |    1|    1|   1|
   |    1|    1|   1|
   |    1|    1|   1|
   |    1|    1|   1|
   |    1|    1|   1|
   |    1|    1|   1|
   |    1|    1|   1|
   |    1|    1|   1|
   |    1|    1|   1|
   |    1|    1|   1|
   |    1|    1|   1|
   |    1|    2|   2|
   |    1|    2|   2|
   |    1|    2|   2|
   |    1|    2|   2|
   |    1|    2|   2|
   |    1|    2|   2|
   |    1|    2|   2|
   |    1|    2|   2|
   +-----+-----+----+
   only showing top 20 rows
   ```
   
   
   That should be equal to the result without the named argument:
   
   ```py
   >>> spark.sql("""
   ...     WITH t AS (
   ...       SELECT id AS partition_col, 1 AS input FROM range(1, 21)
   ...       UNION ALL
   ...       SELECT id AS partition_col, 2 AS input FROM range(1, 21)
   ...     )
   ...     SELECT count, total, last
   ...     FROM test_udtf(TABLE(t))
   ...     ORDER BY 1, 2
   ... """).show()
   +-----+-----+----+
   |count|total|last|
   +-----+-----+----+
   |   40|   60|   2|
   +-----+-----+----+
   ```
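
To make the difference between the two outputs concrete, here is a minimal plain-Python sketch that does not use Spark at all. It assumes that when the analysis result is not applied, the input rows reach several independent UDTF instances. The class `CountSumLast`, the helper `run_per_partition`, and the chunk size of 8 are hypothetical and chosen only for illustration; Spark's real partitioning in the failing query was different.

```python
from typing import Dict, Iterator, List, Optional, Tuple

Row = Dict[str, int]
Result = Tuple[int, int, Optional[int]]


class CountSumLast:
    """Plain-Python stand-in for TestUDTF (hypothetical, no Spark involved)."""

    def __init__(self) -> None:
        self._count = 0
        self._sum = 0
        self._last: Optional[int] = None

    def eval(self, row: Row) -> None:
        # Same ordering check as in the UDTF above.
        if self._last is not None and self._last > row["input"]:
            raise Exception(
                f"self._last was {self._last} but the row value was {row['input']}"
            )
        self._count += 1
        self._last = row["input"]
        self._sum += row["input"]

    def terminate(self) -> Iterator[Result]:
        yield self._count, self._sum, self._last


def run_per_partition(partitions: List[List[Row]]) -> List[Result]:
    """Hypothetical helper: one fresh instance per partition, results concatenated."""
    results: List[Result] = []
    for rows in partitions:
        instance = CountSumLast()
        for row in rows:
            instance.eval(row)
        results.extend(instance.terminate())
    return results


# Same 40 rows as the CTE `t`: inputs 1 and 2, each with partition_col 1..20.
rows = [{"partition_col": i, "input": v} for v in (1, 2) for i in range(1, 21)]

# Assumption: analysis result NOT applied, rows split into arbitrary chunks.
chunks = [rows[i:i + 8] for i in range(0, len(rows), 8)]
not_applied = run_per_partition(chunks)

# Analysis result applied: a single partition ordered by (input, partition_col).
ordered = sorted(rows, key=lambda r: (r["input"], r["partition_col"]))
applied = run_per_partition([ordered])

print(len(not_applied))  # one output row per chunk
print(applied)           # [(40, 60, 2)]
```

With a single ordered partition the sketch yields the one row `(40, 60, 2)` shown above, while any split into several partitions yields one partial row per partition, which is the shape of the wrong result.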
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   Modified the related tests.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45266][PYTHON][FOLLOWUP] Fix to resolve UnresolvedPolymorphicPythonUDTF when the table argument is specified as a named argument [spark]

Posted by "ueshin (via GitHub)" <gi...@apache.org>.
ueshin commented on PR #43355:
URL: https://github.com/apache/spark/pull/43355#issuecomment-1760655880

   Thanks! merging to master.




Re: [PR] [SPARK-45266][PYTHON][FOLLOWUP] Fix to resolve UnresolvedPolymorphicPythonUDTF when the table argument is specified as a named argument [spark]

Posted by "ueshin (via GitHub)" <gi...@apache.org>.
ueshin commented on PR #43355:
URL: https://github.com/apache/spark/pull/43355#issuecomment-1760653226

   The failed tests are not related to this PR. 
   
   > KafkaSourceStressSuite.stress test with multiple topics and partitions




Re: [PR] [SPARK-45266][PYTHON][FOLLOWUP] Fix to resolve UnresolvedPolymorphicPythonUDTF when the table argument is specified as a named argument [spark]

Posted by "ueshin (via GitHub)" <gi...@apache.org>.
ueshin commented on PR #43355:
URL: https://github.com/apache/spark/pull/43355#issuecomment-1760496230

   cc @dtenedor 




Re: [PR] [SPARK-45266][PYTHON][FOLLOWUP] Fix to resolve UnresolvedPolymorphicPythonUDTF when the table argument is specified as a named argument [spark]

Posted by "ueshin (via GitHub)" <gi...@apache.org>.
ueshin closed pull request #43355: [SPARK-45266][PYTHON][FOLLOWUP] Fix to resolve UnresolvedPolymorphicPythonUDTF when the table argument is specified as a named argument
URL: https://github.com/apache/spark/pull/43355

