Posted to reviews@spark.apache.org by "dtenedor (via GitHub)" <gi...@apache.org> on 2023/10/31 23:10:10 UTC

[PR] [SPARK-45746][Python] Return specific error messages if UDTF 'analyze' method accepts or returns wrong values [spark]

dtenedor opened a new pull request, #43611:
URL: https://github.com/apache/spark/pull/43611

   ### What changes were proposed in this pull request?
   
   This PR adds checks to return specific error messages if any Python UDTF 'analyze' method accepts or returns wrong values.
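
   For illustration, a minimal Python UDTF with a static 'analyze' method of the kind these checks guard. This is a hedged sketch: the class, column names, and registration name are illustrative, not taken from this PR.
   
   ```
   from pyspark.sql.functions import udtf
   from pyspark.sql.types import IntegerType, StringType, StructType
   from pyspark.sql.udtf import AnalyzeArgument, AnalyzeResult

   class RepeatWord:
       @staticmethod
       def analyze(word: AnalyzeArgument, n: AnalyzeArgument) -> AnalyzeResult:
           # With this PR, returning anything other than an AnalyzeResult (or
           # calling the function with arguments that do not match this
           # signature) produces a specific error instead of a generic failure.
           return AnalyzeResult(
               StructType().add("word", StringType()).add("id", IntegerType())
           )

       def eval(self, word: str, n: int):
           for i in range(n):
               yield (word, i)

   # The schema comes from 'analyze', so no returnType is passed here.
   repeat_word = udtf(RepeatWord)
   # spark.udtf.register("repeat_word", repeat_word)
   # SELECT * FROM repeat_word('hi')  -- would now report the missing 'n' argument
   ```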
   
   ### Why are the changes needed?
   
   This helps users understand how to easily fix their user-defined table functions when those functions are malformed.
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes, see above.
   
   ### How was this patch tested?
   
   This PR adds test coverage.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.
   




Re: [PR] [SPARK-45746][Python] Return specific error messages if UDTF 'analyze' or 'eval' method accepts or returns wrong values [spark]

Posted by "ueshin (via GitHub)" <gi...@apache.org>.
ueshin commented on code in PR #43611:
URL: https://github.com/apache/spark/pull/43611#discussion_r1408286324


##########
python/pyspark/sql/worker/analyze_udtf.py:
##########
@@ -116,12 +118,89 @@ def main(infile: IO, outfile: IO) -> None:
         handler = read_udtf(infile)
         args, kwargs = read_arguments(infile)
 
+        error_prefix = f"Failed to evaluate the user-defined table function '{handler.__name__}'"

Review Comment:
   Should this `handler.__name__` also use the registered name?





Re: [PR] [SPARK-45746][Python] Return specific error messages if UDTF 'analyze' method accepts or returns wrong values [spark]

Posted by "dtenedor (via GitHub)" <gi...@apache.org>.
dtenedor commented on PR #43611:
URL: https://github.com/apache/spark/pull/43611#issuecomment-1788161885

   cc @ueshin




Re: [PR] [SPARK-45746][Python] Return specific error messages if UDTF 'analyze' or 'eval' method accepts or returns wrong values [spark]

Posted by "dtenedor (via GitHub)" <gi...@apache.org>.
dtenedor commented on code in PR #43611:
URL: https://github.com/apache/spark/pull/43611#discussion_r1401278217


##########
python/pyspark/sql/worker/analyze_udtf.py:
##########
@@ -116,12 +118,89 @@ def main(infile: IO, outfile: IO) -> None:
         handler = read_udtf(infile)
         args, kwargs = read_arguments(infile)
 
+        error_prefix = f"Failed to evaluate the user-defined table function '{handler.__name__}'"
+
+        def format_error(msg: str) -> str:
+            return dedent(msg).replace("\n", " ")
+
+        # Check that the arguments provided to the UDTF call match the expected parameters defined
+        # in the static 'analyze' method signature.
+        try:
+            inspect.signature(handler.analyze).bind(*args, **kwargs)  # type: ignore[attr-defined]
+        except TypeError as e:
+            # The UDTF call's arguments did not match the expected signature.
+            raise PySparkValueError(
+                format_error(
+                    f"""
+                    {error_prefix} because the function arguments did not match the expected
+                    signature of the static 'analyze' method ({e}). Please update the query so that
+                    this table function call provides arguments matching the expected signature, or
+                    else update the table function so that its static 'analyze' method accepts the
+                    provided arguments, and then try the query again."""
+                )
+            )
+
+        # Invoke the UDTF's 'analyze' method.
         result = handler.analyze(*args, **kwargs)  # type: ignore[attr-defined]
 
+        # Check invariants about the 'analyze' method after running it.
         if not isinstance(result, AnalyzeResult):
             raise PySparkValueError(
-                "Output of `analyze` static method of Python UDTFs expects "
-                f"a pyspark.sql.udtf.AnalyzeResult but got: {type(result)}"
+                format_error(
+                    f"""
+                    {error_prefix} because the static 'analyze' method expects a result of type
+                    pyspark.sql.udtf.AnalyzeResult, but instead this method returned a value of
+                    type: {type(result)}"""
+                )

Review Comment:
   Good question: these use the error class `TABLE_VALUED_FUNCTION_FAILED_TO_ANALYZE_IN_PYTHON`. Each of these strings goes in the `msg` parameter [1].
   
   [1] https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/python/UserDefinedPythonFunction.scala#L229
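
   For context, a hedged sketch of that flow on the worker side: only a plain message string is raised from Python, and the Scala layer wraps it as the `msg` parameter of `TABLE_VALUED_FUNCTION_FAILED_TO_ANALYZE_IN_PYTHON` (the message text below is illustrative).
   
   ```
   from pyspark.errors import PySparkValueError

   # The worker raises a plain string; the JVM side attaches the error class.
   raise PySparkValueError(
       "Failed to evaluate the user-defined table function 'MyUDTF' because ..."
   )
   ```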





Re: [PR] [SPARK-45746][Python] Return specific error messages if UDTF 'analyze' or 'eval' method accepts or returns wrong values [spark]

Posted by "ueshin (via GitHub)" <gi...@apache.org>.
ueshin commented on PR #43611:
URL: https://github.com/apache/spark/pull/43611#issuecomment-1832510620

   Thanks! merging to master.




Re: [PR] [SPARK-45746][Python] Return specific error messages if UDTF 'analyze' or 'eval' method accepts or returns wrong values [spark]

Posted by "ueshin (via GitHub)" <gi...@apache.org>.
ueshin commented on PR #43611:
URL: https://github.com/apache/spark/pull/43611#issuecomment-1817291535

   cc @allisonwang-db 




Re: [PR] [SPARK-45746][Python] Return specific error messages if UDTF 'analyze' or 'eval' method accepts or returns wrong values [spark]

Posted by "allisonwang-db (via GitHub)" <gi...@apache.org>.
allisonwang-db commented on code in PR #43611:
URL: https://github.com/apache/spark/pull/43611#discussion_r1399758278


##########
python/pyspark/worker.py:
##########
@@ -709,24 +708,28 @@ def read_udtf(pickleSer, infile, eval_type):
     # with one argument containing the previous AnalyzeResult. If that fails, then try a constructor
     # with no arguments. In this way each UDTF class instance can decide if it wants to inspect the
     # AnalyzeResult.
+    udtf_init_args = inspect.getfullargspec(handler)
+
+    udtf_name = handler.__name__

Review Comment:
   The tricky thing here is that users can register the UDTF using a different name from the handler name. 
   ```
   class MyUDTF:
       ...
       
   spark.udtf.register("foo", MyUDTF)
   
   spark.sql("SELECT * FROM foo(...)")
   ```
   Then should the error message here use `foo` instead of `MyUDTF`?





Re: [PR] [SPARK-45746][Python] Return specific error messages if UDTF 'analyze' or 'eval' method accepts or returns wrong values [spark]

Posted by "dtenedor (via GitHub)" <gi...@apache.org>.
dtenedor commented on code in PR #43611:
URL: https://github.com/apache/spark/pull/43611#discussion_r1396192258


##########
python/pyspark/sql/worker/analyze_udtf.py:
##########
@@ -116,12 +118,150 @@ def main(infile: IO, outfile: IO) -> None:
         handler = read_udtf(infile)
         args, kwargs = read_arguments(infile)
 
+        error_prefix = f"Failed to evaluate the user-defined table function '{handler.__name__}'"
+
+        def format_error(msg: str) -> str:
+            return dedent(msg).replace("\n", " ")
+
+        # Check invariants about the 'analyze' and 'eval' methods before running them.
+        def check_method_invariants_before_running(
+            expected: inspect.FullArgSpec, method: str, is_static: bool
+        ) -> None:
+            num_expected_args = len(expected.args)
+            num_provided_args = len(args) + len(kwargs)
+            num_provided_non_kw_args = len(args)
+            if not is_static:
+                num_provided_args += 1
+                num_provided_non_kw_args += 1
+            if (
+                expected.varargs is None
+                and expected.varkw is None
+                and expected.defaults is None
+                and num_expected_args != num_provided_args
+            ):
+                # The UDTF call provided the wrong number of positional arguments.
+                def arguments(num: int) -> str:
+                    return f"{num} argument{'' if num == 1 else 's'}"
+
+                raise PySparkValueError(
+                    format_error(
+                        f"""
+                    {error_prefix} because its '{method}' method expects exactly

Review Comment:
   I thought so as well, but the `dev/reformat-python` script keeps changing it back to this :) 





Re: [PR] [SPARK-45746][Python] Return specific error messages if UDTF 'analyze' or 'eval' method accepts or returns wrong values [spark]

Posted by "dtenedor (via GitHub)" <gi...@apache.org>.
dtenedor commented on code in PR #43611:
URL: https://github.com/apache/spark/pull/43611#discussion_r1409526093


##########
python/pyspark/sql/worker/analyze_udtf.py:
##########
@@ -116,12 +118,89 @@ def main(infile: IO, outfile: IO) -> None:
         handler = read_udtf(infile)
         args, kwargs = read_arguments(infile)
 
+        error_prefix = f"Failed to evaluate the user-defined table function '{handler.__name__}'"

Review Comment:
   Good suggestion, done!





Re: [PR] [SPARK-45746][Python] Return specific error messages if UDTF 'analyze' or 'eval' method accepts or returns wrong values [spark]

Posted by "ueshin (via GitHub)" <gi...@apache.org>.
ueshin commented on code in PR #43611:
URL: https://github.com/apache/spark/pull/43611#discussion_r1390014123


##########
python/pyspark/sql/worker/analyze_udtf.py:
##########
@@ -116,12 +118,150 @@ def main(infile: IO, outfile: IO) -> None:
         handler = read_udtf(infile)
         args, kwargs = read_arguments(infile)
 
+        error_prefix = f"Failed to evaluate the user-defined table function '{handler.__name__}'"
+
+        def format_error(msg: str) -> str:
+            return dedent(msg).replace("\n", " ")
+
+        # Check invariants about the 'analyze' and 'eval' methods before running them.
+        def check_method_invariants_before_running(
+            expected: inspect.FullArgSpec, method: str, is_static: bool
+        ) -> None:
+            num_expected_args = len(expected.args)
+            num_provided_args = len(args) + len(kwargs)
+            num_provided_non_kw_args = len(args)
+            if not is_static:
+                num_provided_args += 1
+                num_provided_non_kw_args += 1
+            if (
+                expected.varargs is None
+                and expected.varkw is None
+                and expected.defaults is None
+                and num_expected_args != num_provided_args
+            ):
+                # The UDTF call provided the wrong number of positional arguments.
+                def arguments(num: int) -> str:
+                    return f"{num} argument{'' if num == 1 else 's'}"
+
+                raise PySparkValueError(
+                    format_error(
+                        f"""
+                    {error_prefix} because its '{method}' method expects exactly

Review Comment:
   nit: indentation: shall we use the same indent as the `f"""` line, or one more level?
   Ditto for the following error messages.



##########
python/pyspark/sql/worker/analyze_udtf.py:
##########
@@ -116,12 +118,150 @@ def main(infile: IO, outfile: IO) -> None:
         handler = read_udtf(infile)
         args, kwargs = read_arguments(infile)
 
+        error_prefix = f"Failed to evaluate the user-defined table function '{handler.__name__}'"
+
+        def format_error(msg: str) -> str:
+            return dedent(msg).replace("\n", " ")
+
+        # Check invariants about the 'analyze' and 'eval' methods before running them.
+        def check_method_invariants_before_running(
+            expected: inspect.FullArgSpec, method: str, is_static: bool
+        ) -> None:

Review Comment:
   I'm not sure these checks fully cover Python's errors.
   I guess we should use `inspect.Signature.bind` and its error message to build new error messages.
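
   For reference, a standalone sketch of that approach (the function and arguments are illustrative): `Signature.bind` performs the same argument matching as a real call without invoking the function, and raises a `TypeError` whose message pinpoints the mismatch.
   
   ```
   import inspect

   def analyze(x, y):
       ...

   try:
       inspect.signature(analyze).bind(1, 2, 3)
   except TypeError as e:
       # e.g. "too many positional arguments"; fold this into the new message.
       print(f"arguments did not match the 'analyze' signature ({e})")
   ```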



##########
python/pyspark/sql/worker/analyze_udtf.py:
##########
@@ -116,12 +118,150 @@ def main(infile: IO, outfile: IO) -> None:
         handler = read_udtf(infile)
         args, kwargs = read_arguments(infile)
 
+        error_prefix = f"Failed to evaluate the user-defined table function '{handler.__name__}'"
+
+        def format_error(msg: str) -> str:
+            return dedent(msg).replace("\n", " ")
+
+        # Check invariants about the 'analyze' and 'eval' methods before running them.
+        def check_method_invariants_before_running(
+            expected: inspect.FullArgSpec, method: str, is_static: bool
+        ) -> None:
+            num_expected_args = len(expected.args)
+            num_provided_args = len(args) + len(kwargs)
+            num_provided_non_kw_args = len(args)
+            if not is_static:
+                num_provided_args += 1
+                num_provided_non_kw_args += 1
+            if (
+                expected.varargs is None
+                and expected.varkw is None
+                and expected.defaults is None
+                and num_expected_args != num_provided_args
+            ):
+                # The UDTF call provided the wrong number of positional arguments.
+                def arguments(num: int) -> str:
+                    return f"{num} argument{'' if num == 1 else 's'}"
+
+                raise PySparkValueError(
+                    format_error(
+                        f"""
+                    {error_prefix} because its '{method}' method expects exactly
+                    {arguments(num_expected_args)}, but the function call provided
+                    {arguments(num_provided_args)} instead. Please update the query so that it
+                    provides exactly {arguments(num_expected_args)}, or else update the table
+                    function so that its '{method}' method accepts exactly
+                    {arguments(num_provided_non_kw_args)}, and then try the query again."""
+                    )
+                )
+            expected_arg_names = set(expected.args)
+            provided_positional_arg_names = set(expected.args[: len(args)])
+            for arg_name in kwargs.keys():
+                if expected.varkw is None and arg_name not in expected_arg_names:
+                    # The UDTF call provided a keyword argument whose name was not expected.
+                    raise PySparkValueError(
+                        format_error(
+                            f"""
+                        {error_prefix} because its '{method}' method expects arguments whose names
+                        appear in the set ({', '.join(expected.args)}), but the function call
+                        provided a keyword argument with unexpected name '{arg_name}' instead.
+                        Please update the query so that it provides only keyword arguments whose
+                        names appear in this set, or else update the table function so that its
+                        '{method}' method accepts argument names including '{arg_name}', and then
+                        try the query again."""
+                        )
+                    )
+                elif arg_name in provided_positional_arg_names:
+                    # The UDTF call provided a duplicate keyword argument when a value for that
+                    # argument was already specified positionally.
+                    raise PySparkValueError(
+                        format_error(
+                            f"""
+                        {error_prefix} because the function call provided keyword argument
+                        '{arg_name}' whose corresponding value was already specified positionally.
+                        Please update the query so that it provides this argument's value exactly
+                        once instead, and then try the query again."""
+                        )
+                    )
+
+        check_method_invariants_before_running(
+            inspect.getfullargspec(handler.analyze),  # type: ignore[attr-defined]
+            "static analyze",
+            is_static=True,
+        )
+        if hasattr(handler, "eval"):
+            check_method_invariants_before_running(
+                inspect.getfullargspec(handler.eval),  # type: ignore[attr-defined]
+                "eval",
+                is_static=False,
+            )

Review Comment:
   Is the same or a similar check also necessary in `worker.py` for the case without `analyze`?





Re: [PR] [SPARK-45746][Python] Return specific error messages if UDTF 'analyze' or 'eval' method accepts or returns wrong values [spark]

Posted by "allisonwang-db (via GitHub)" <gi...@apache.org>.
allisonwang-db commented on code in PR #43611:
URL: https://github.com/apache/spark/pull/43611#discussion_r1399758278



##########
python/pyspark/sql/worker/analyze_udtf.py:
##########
@@ -116,12 +118,89 @@ def main(infile: IO, outfile: IO) -> None:
         handler = read_udtf(infile)
         args, kwargs = read_arguments(infile)
 
+        error_prefix = f"Failed to evaluate the user-defined table function '{handler.__name__}'"
+
+        def format_error(msg: str) -> str:
+            return dedent(msg).replace("\n", " ")
+
+        # Check that the arguments provided to the UDTF call match the expected parameters defined
+        # in the static 'analyze' method signature.
+        try:
+            inspect.signature(handler.analyze).bind(*args, **kwargs)  # type: ignore[attr-defined]
+        except TypeError as e:
+            # The UDTF call's arguments did not match the expected signature.
+            raise PySparkValueError(
+                format_error(
+                    f"""
+                    {error_prefix} because the function arguments did not match the expected
+                    signature of the static 'analyze' method ({e}). Please update the query so that
+                    this table function call provides arguments matching the expected signature, or
+                    else update the table function so that its static 'analyze' method accepts the
+                    provided arguments, and then try the query again."""
+                )
+            )
+
+        # Invoke the UDTF's 'analyze' method.
         result = handler.analyze(*args, **kwargs)  # type: ignore[attr-defined]
 
+        # Check invariants about the 'analyze' method after running it.
         if not isinstance(result, AnalyzeResult):
             raise PySparkValueError(
-                "Output of `analyze` static method of Python UDTFs expects "
-                f"a pyspark.sql.udtf.AnalyzeResult but got: {type(result)}"
+                format_error(
+                    f"""
+                    {error_prefix} because the static 'analyze' method expects a result of type
+                    pyspark.sql.udtf.AnalyzeResult, but instead this method returned a value of
+                    type: {type(result)}"""
+                )

Review Comment:
   Shall we consider using error classes for them?





Re: [PR] [SPARK-45746][Python] Return specific error messages if UDTF 'analyze' or 'eval' method accepts or returns wrong values [spark]

Posted by "ueshin (via GitHub)" <gi...@apache.org>.
ueshin commented on PR #43611:
URL: https://github.com/apache/spark/pull/43611#issuecomment-1832510327

   The failed tests seem unrelated to this PR.




Re: [PR] [SPARK-45746][Python] Return specific error messages if UDTF 'analyze' or 'eval' method accepts or returns wrong values [spark]

Posted by "dtenedor (via GitHub)" <gi...@apache.org>.
dtenedor commented on code in PR #43611:
URL: https://github.com/apache/spark/pull/43611#discussion_r1401290111


##########
python/pyspark/worker.py:
##########
@@ -709,24 +708,28 @@ def read_udtf(pickleSer, infile, eval_type):
     # with one argument containing the previous AnalyzeResult. If that fails, then try a constructor
     # with no arguments. In this way each UDTF class instance can decide if it wants to inspect the
     # AnalyzeResult.
+    udtf_init_args = inspect.getfullargspec(handler)
+
+    udtf_name = handler.__name__

Review Comment:
   Sounds good, this is done.





Re: [PR] [SPARK-45746][Python] Return specific error messages if UDTF 'analyze' or 'eval' method accepts or returns wrong values [spark]

Posted by "dtenedor (via GitHub)" <gi...@apache.org>.
dtenedor commented on PR #43611:
URL: https://github.com/apache/spark/pull/43611#issuecomment-1821807603

   Hi @allisonwang-db, responded to your code review comments, please take another look.




Re: [PR] [SPARK-45746][Python] Return specific error messages if UDTF 'analyze' or 'eval' method accepts or returns wrong values [spark]

Posted by "dtenedor (via GitHub)" <gi...@apache.org>.
dtenedor commented on PR #43611:
URL: https://github.com/apache/spark/pull/43611#issuecomment-1804852830

   cc @ueshin, the failing tests look unrelated.




Re: [PR] [SPARK-45746][Python] Return specific error messages if UDTF 'analyze' or 'eval' method accepts or returns wrong values [spark]

Posted by "dtenedor (via GitHub)" <gi...@apache.org>.
dtenedor commented on code in PR #43611:
URL: https://github.com/apache/spark/pull/43611#discussion_r1396514528


##########
python/pyspark/sql/worker/analyze_udtf.py:
##########
@@ -116,12 +118,150 @@ def main(infile: IO, outfile: IO) -> None:
         handler = read_udtf(infile)
         args, kwargs = read_arguments(infile)
 
+        error_prefix = f"Failed to evaluate the user-defined table function '{handler.__name__}'"
+
+        def format_error(msg: str) -> str:
+            return dedent(msg).replace("\n", " ")
+
+        # Check invariants about the 'analyze' and 'eval' methods before running them.
+        def check_method_invariants_before_running(
+            expected: inspect.FullArgSpec, method: str, is_static: bool
+        ) -> None:
+            num_expected_args = len(expected.args)
+            num_provided_args = len(args) + len(kwargs)
+            num_provided_non_kw_args = len(args)
+            if not is_static:
+                num_provided_args += 1
+                num_provided_non_kw_args += 1
+            if (
+                expected.varargs is None
+                and expected.varkw is None
+                and expected.defaults is None
+                and num_expected_args != num_provided_args
+            ):
+                # The UDTF call provided the wrong number of positional arguments.
+                def arguments(num: int) -> str:
+                    return f"{num} argument{'' if num == 1 else 's'}"
+
+                raise PySparkValueError(
+                    format_error(
+                        f"""
+                    {error_prefix} because its '{method}' method expects exactly
+                    {arguments(num_expected_args)}, but the function call provided
+                    {arguments(num_provided_args)} instead. Please update the query so that it
+                    provides exactly {arguments(num_expected_args)}, or else update the table
+                    function so that its '{method}' method accepts exactly
+                    {arguments(num_provided_non_kw_args)}, and then try the query again."""
+                    )
+                )
+            expected_arg_names = set(expected.args)
+            provided_positional_arg_names = set(expected.args[: len(args)])
+            for arg_name in kwargs.keys():
+                if expected.varkw is None and arg_name not in expected_arg_names:
+                    # The UDTF call provided a keyword argument whose name was not expected.
+                    raise PySparkValueError(
+                        format_error(
+                            f"""
+                        {error_prefix} because its '{method}' method expects arguments whose names
+                        appear in the set ({', '.join(expected.args)}), but the function call
+                        provided a keyword argument with unexpected name '{arg_name}' instead.
+                        Please update the query so that it provides only keyword arguments whose
+                        names appear in this set, or else update the table function so that its
+                        '{method}' method accepts argument names including '{arg_name}', and then
+                        try the query again."""
+                        )
+                    )
+                elif arg_name in provided_positional_arg_names:
+                    # The UDTF call provided a duplicate keyword argument when a value for that
+                    # argument was already specified positionally.
+                    raise PySparkValueError(
+                        format_error(
+                            f"""
+                        {error_prefix} because the function call provided keyword argument
+                        '{arg_name}' whose corresponding value was already specified positionally.
+                        Please update the query so that it provides this argument's value exactly
+                        once instead, and then try the query again."""
+                        )
+                    )
+
+        check_method_invariants_before_running(
+            inspect.getfullargspec(handler.analyze),  # type: ignore[attr-defined]
+            "static analyze",
+            is_static=True,
+        )
+        if hasattr(handler, "eval"):
+            check_method_invariants_before_running(
+                inspect.getfullargspec(handler.eval),  # type: ignore[attr-defined]
+                "eval",
+                is_static=False,
+            )

Review Comment:
   Done.



##########
python/pyspark/sql/worker/analyze_udtf.py:
##########
@@ -116,12 +118,150 @@ def main(infile: IO, outfile: IO) -> None:
         handler = read_udtf(infile)
         args, kwargs = read_arguments(infile)
 
+        error_prefix = f"Failed to evaluate the user-defined table function '{handler.__name__}'"
+
+        def format_error(msg: str) -> str:
+            return dedent(msg).replace("\n", " ")
+
+        # Check invariants about the 'analyze' and 'eval' methods before running them.
+        def check_method_invariants_before_running(
+            expected: inspect.FullArgSpec, method: str, is_static: bool
+        ) -> None:

Review Comment:
   Good suggestion, this is done.





Re: [PR] [SPARK-45746][Python] Return specific error messages if UDTF 'analyze' or 'eval' method accepts or returns wrong values [spark]

Posted by "dtenedor (via GitHub)" <gi...@apache.org>.
dtenedor commented on code in PR #43611:
URL: https://github.com/apache/spark/pull/43611#discussion_r1396192258


##########
python/pyspark/sql/worker/analyze_udtf.py:
##########
@@ -116,12 +118,150 @@ def main(infile: IO, outfile: IO) -> None:
         handler = read_udtf(infile)
         args, kwargs = read_arguments(infile)
 
+        error_prefix = f"Failed to evaluate the user-defined table function '{handler.__name__}'"
+
+        def format_error(msg: str) -> str:
+            return dedent(msg).replace("\n", " ")
+
+        # Check invariants about the 'analyze' and 'eval' methods before running them.
+        def check_method_invariants_before_running(
+            expected: inspect.FullArgSpec, method: str, is_static: bool
+        ) -> None:
+            num_expected_args = len(expected.args)
+            num_provided_args = len(args) + len(kwargs)
+            num_provided_non_kw_args = len(args)
+            if not is_static:
+                num_provided_args += 1
+                num_provided_non_kw_args += 1
+            if (
+                expected.varargs is None
+                and expected.varkw is None
+                and expected.defaults is None
+                and num_expected_args != num_provided_args
+            ):
+                # The UDTF call provided the wrong number of positional arguments.
+                def arguments(num: int) -> str:
+                    return f"{num} argument{'' if num == 1 else 's'}"
+
+                raise PySparkValueError(
+                    format_error(
+                        f"""
+                    {error_prefix} because its '{method}' method expects exactly

Review Comment:
   This is done.





Re: [PR] [SPARK-45746][Python] Return specific error messages if UDTF 'analyze' or 'eval' method accepts or returns wrong values [spark]

Posted by "dtenedor (via GitHub)" <gi...@apache.org>.
dtenedor commented on code in PR #43611:
URL: https://github.com/apache/spark/pull/43611#discussion_r1399572799


##########
python/pyspark/sql/worker/analyze_udtf.py:
##########
@@ -116,12 +118,94 @@ def main(infile: IO, outfile: IO) -> None:
         handler = read_udtf(infile)
         args, kwargs = read_arguments(infile)
 
+        error_prefix = f"Failed to evaluate the user-defined table function '{handler.__name__}'"
+
+        def format_error(msg: str) -> str:
+            return dedent(msg).replace("\n", " ")
+
+        # Check that the arguments provided to the UDTF call match the expected parameters defined
+        # in the static 'analyze' method signature.
+        try:
+            inspect.signature(handler.analyze).bind(*args, **kwargs)
+        except TypeError as e:
+            # The UDTF call's arguments did not match the expected signature.
+            raise PySparkValueError(
+                format_error(
+                    f"""
+                    {error_prefix} because the function arguments did not match the expected
+                    signature of the static 'analyze' method ({e}). Please update the query so that
+                    this table function call provides arguments matching the expected signature, or
+                    else update the table function so that its static 'analyze' method accepts the
+                    provided arguments, and then try the query again."""
+                )
+            )
+
+        # Invoke the UDTF's 'analyze' method.
         result = handler.analyze(*args, **kwargs)  # type: ignore[attr-defined]
 
+        # Check invariants about the 'analyze' method after running it.
         if not isinstance(result, AnalyzeResult):
             raise PySparkValueError(
-                "Output of `analyze` static method of Python UDTFs expects "
-                f"a pyspark.sql.udtf.AnalyzeResult but got: {type(result)}"
+                format_error(
+                    f"""
+                    {error_prefix} because the static 'analyze' method expects a result of type
+                    pyspark.sql.udtf.AnalyzeResult, but instead this method returned a value of
+                    type: {type(result)}"""
+                )
+            )
+        elif not isinstance(result.schema, StructType):
+            raise PySparkValueError(
+                format_error(
+                    f"""
+                    {error_prefix} because the static 'analyze' method expects a result of type
+                    pyspark.sql.udtf.AnalyzeResult with a 'schema' field comprising a StructType,
+                    but the 'schema' field had the wrong type: {type(result.schema)}"""
+                )
+            )
+        has_table_arg = (
+            len([arg for arg in args if arg.isTable])
+            + len([arg for arg in kwargs.items() if arg[-1].isTable])
+        ) > 0
+        if not has_table_arg and result.withSinglePartition:
+            raise PySparkValueError(
+                format_error(
+                    f"""
+                    {error_prefix} because the static 'analyze' method returned an
+                    'AnalyzeResult' object with the 'withSinglePartition' field set to 'true', but
+                    the function call did not provide any table argument. Please update the query so
+                    that it provides a table argument, or else update the table function so that its
+                    'analyze' method returns an 'AnalyzeResult' object with the
+                    'withSinglePartition' field set to 'false', and then try the query again."""
+                )
+            )
+        elif not has_table_arg and len(result.partitionBy) > 0:
+            raise PySparkValueError(
+                format_error(
+                    f"""
+                    {error_prefix} because the static 'analyze' method returned an
+                    'AnalyzeResult' object with the 'partitionBy' list set to non-empty, but the
+                    function call did not provide any table argument. Please update the query so
+                    that it provides a table argument, or else update the table function so that its
+                    'analyze' method returns an 'AnalyzeResult' object with the 'partitionBy' list
+                    set to empty, and then try the query again."""
+                )
+            )
+        elif (
+            hasattr(result, "partitionBy")

Review Comment:
   Good point, removed this check.



##########
python/pyspark/sql/worker/analyze_udtf.py:
##########
@@ -116,12 +118,94 @@ def main(infile: IO, outfile: IO) -> None:
         handler = read_udtf(infile)
         args, kwargs = read_arguments(infile)
 
+        error_prefix = f"Failed to evaluate the user-defined table function '{handler.__name__}'"
+
+        def format_error(msg: str) -> str:
+            return dedent(msg).replace("\n", " ")
+
+        # Check that the arguments provided to the UDTF call match the expected parameters defined
+        # in the static 'analyze' method signature.
+        try:
+            inspect.signature(handler.analyze).bind(*args, **kwargs)
+        except TypeError as e:
+            # The UDTF call's arguments did not match the expected signature.
+            raise PySparkValueError(
+                format_error(
+                    f"""
+                    {error_prefix} because the function arguments did not match the expected
+                    signature of the static 'analyze' method ({e}). Please update the query so that
+                    this table function call provides arguments matching the expected signature, or
+                    else update the table function so that its static 'analyze' method accepts the
+                    provided arguments, and then try the query again."""
+                )
+            )
+
+        # Invoke the UDTF's 'analyze' method.
         result = handler.analyze(*args, **kwargs)  # type: ignore[attr-defined]
 
+        # Check invariants about the 'analyze' method after running it.
         if not isinstance(result, AnalyzeResult):
             raise PySparkValueError(
-                "Output of `analyze` static method of Python UDTFs expects "
-                f"a pyspark.sql.udtf.AnalyzeResult but got: {type(result)}"
+                format_error(
+                    f"""
+                    {error_prefix} because the static 'analyze' method expects a result of type
+                    pyspark.sql.udtf.AnalyzeResult, but instead this method returned a value of
+                    type: {type(result)}"""
+                )
+            )
+        elif not isinstance(result.schema, StructType):
+            raise PySparkValueError(
+                format_error(
+                    f"""
+                    {error_prefix} because the static 'analyze' method expects a result of type
+                    pyspark.sql.udtf.AnalyzeResult with a 'schema' field comprising a StructType,
+                    but the 'schema' field had the wrong type: {type(result.schema)}"""
+                )
+            )
+        has_table_arg = (
+            len([arg for arg in args if arg.isTable])
+            + len([arg for arg in kwargs.items() if arg[-1].isTable])
+        ) > 0

Review Comment:
   Sounds good, done (also re-ran `dev/reformat-python` afterwards).





Re: [PR] [SPARK-45746][Python] Return specific error messages if UDTF 'analyze' or 'eval' method accepts or returns wrong values [spark]

Posted by "ueshin (via GitHub)" <gi...@apache.org>.
ueshin commented on code in PR #43611:
URL: https://github.com/apache/spark/pull/43611#discussion_r1398012807


##########
python/pyspark/sql/worker/analyze_udtf.py:
##########
@@ -116,12 +118,94 @@ def main(infile: IO, outfile: IO) -> None:
         handler = read_udtf(infile)
         args, kwargs = read_arguments(infile)
 
+        error_prefix = f"Failed to evaluate the user-defined table function '{handler.__name__}'"
+
+        def format_error(msg: str) -> str:
+            return dedent(msg).replace("\n", " ")
+
+        # Check that the arguments provided to the UDTF call match the expected parameters defined
+        # in the static 'analyze' method signature.
+        try:
+            inspect.signature(handler.analyze).bind(*args, **kwargs)
+        except TypeError as e:
+            # The UDTF call's arguments did not match the expected signature.
+            raise PySparkValueError(
+                format_error(
+                    f"""
+                    {error_prefix} because the function arguments did not match the expected
+                    signature of the static 'analyze' method ({e}). Please update the query so that
+                    this table function call provides arguments matching the expected signature, or
+                    else update the table function so that its static 'analyze' method accepts the
+                    provided arguments, and then try the query again."""
+                )
+            )
+
+        # Invoke the UDTF's 'analyze' method.
         result = handler.analyze(*args, **kwargs)  # type: ignore[attr-defined]
 
+        # Check invariants about the 'analyze' method after running it.
         if not isinstance(result, AnalyzeResult):
             raise PySparkValueError(
-                "Output of `analyze` static method of Python UDTFs expects "
-                f"a pyspark.sql.udtf.AnalyzeResult but got: {type(result)}"
+                format_error(
+                    f"""
+                    {error_prefix} because the static 'analyze' method expects a result of type
+                    pyspark.sql.udtf.AnalyzeResult, but instead this method returned a value of
+                    type: {type(result)}"""
+                )
+            )
+        elif not isinstance(result.schema, StructType):
+            raise PySparkValueError(
+                format_error(
+                    f"""
+                    {error_prefix} because the static 'analyze' method expects a result of type
+                    pyspark.sql.udtf.AnalyzeResult with a 'schema' field comprising a StructType,
+                    but the 'schema' field had the wrong type: {type(result.schema)}"""
+                )
+            )
+        has_table_arg = (
+            len([arg for arg in args if arg.isTable])
+            + len([arg for arg in kwargs.items() if arg[-1].isTable])
+        ) > 0

Review Comment:
   ```suggestion
           has_table_arg = any(arg.isTable for arg in args) or any(arg.isTable for arg in kwargs.values())
   ```
   
   Please reformat it, as I'm not sure the suggestion follows the formatting rules.
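
   (For reference, my guess at how `dev/reformat-python` would wrap the long line, not verified against the formatter:)
   
   ```
   has_table_arg = any(arg.isTable for arg in args) or any(
       arg.isTable for arg in kwargs.values()
   )
   ```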



##########
python/pyspark/sql/worker/analyze_udtf.py:
##########
@@ -116,12 +118,94 @@ def main(infile: IO, outfile: IO) -> None:
         handler = read_udtf(infile)
         args, kwargs = read_arguments(infile)
 
+        error_prefix = f"Failed to evaluate the user-defined table function '{handler.__name__}'"
+
+        def format_error(msg: str) -> str:
+            return dedent(msg).replace("\n", " ")
+
+        # Check that the arguments provided to the UDTF call match the expected parameters defined
+        # in the static 'analyze' method signature.
+        try:
+            inspect.signature(handler.analyze).bind(*args, **kwargs)
+        except TypeError as e:
+            # The UDTF call's arguments did not match the expected signature.
+            raise PySparkValueError(
+                format_error(
+                    f"""
+                    {error_prefix} because the function arguments did not match the expected
+                    signature of the static 'analyze' method ({e}). Please update the query so that
+                    this table function call provides arguments matching the expected signature, or
+                    else update the table function so that its static 'analyze' method accepts the
+                    provided arguments, and then try the query again."""
+                )
+            )
+
+        # Invoke the UDTF's 'analyze' method.
         result = handler.analyze(*args, **kwargs)  # type: ignore[attr-defined]
 
+        # Check invariants about the 'analyze' method after running it.
         if not isinstance(result, AnalyzeResult):
             raise PySparkValueError(
-                "Output of `analyze` static method of Python UDTFs expects "
-                f"a pyspark.sql.udtf.AnalyzeResult but got: {type(result)}"
+                format_error(
+                    f"""
+                    {error_prefix} because the static 'analyze' method expects a result of type
+                    pyspark.sql.udtf.AnalyzeResult, but instead this method returned a value of
+                    type: {type(result)}"""
+                )
+            )
+        elif not isinstance(result.schema, StructType):
+            raise PySparkValueError(
+                format_error(
+                    f"""
+                    {error_prefix} because the static 'analyze' method expects a result of type
+                    pyspark.sql.udtf.AnalyzeResult with a 'schema' field comprising a StructType,
+                    but the 'schema' field had the wrong type: {type(result.schema)}"""
+                )
+            )
+        has_table_arg = (
+            len([arg for arg in args if arg.isTable])
+            + len([arg for arg in kwargs.items() if arg[-1].isTable])
+        ) > 0
+        if not has_table_arg and result.withSinglePartition:
+            raise PySparkValueError(
+                format_error(
+                    f"""
+                    {error_prefix} because the static 'analyze' method returned an
+                    'AnalyzeResult' object with the 'withSinglePartition' field set to 'true', but
+                    the function call did not provide any table argument. Please update the query so
+                    that it provides a table argument, or else update the table function so that its
+                    'analyze' method returns an 'AnalyzeResult' object with the
+                    'withSinglePartition' field set to 'false', and then try the query again."""
+                )
+            )
+        elif not has_table_arg and len(result.partitionBy) > 0:
+            raise PySparkValueError(
+                format_error(
+                    f"""
+                    {error_prefix} because the static 'analyze' method returned an
+                    'AnalyzeResult' object with the 'partitionBy' list set to non-empty, but the
+                    function call did not provide any table argument. Please update the query so
+                    that it provides a table argument, or else update the table function so that its
+                    'analyze' method returns an 'AnalyzeResult' object with the 'partitionBy' list
+                    set to empty, and then try the query again."""
+                )
+            )
+        elif (
+            hasattr(result, "partitionBy")

Review Comment:
   Do we need to check this? Shouldn't `AnalyzeResult` always have this attribute?
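
   (A quick way to confirm, assuming `partitionBy` is a dataclass field of `AnalyzeResult` with a default value:)
   
   ```
   from pyspark.sql.types import StructType
   from pyspark.sql.udtf import AnalyzeResult

   # The field default means every instance has the attribute,
   # so the hasattr() check would be redundant.
   result = AnalyzeResult(schema=StructType())
   assert hasattr(result, "partitionBy")
   ```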





Re: [PR] [SPARK-45746][Python] Return specific error messages if UDTF 'analyze' or 'eval' method accepts or returns wrong values [spark]

Posted by "ueshin (via GitHub)" <gi...@apache.org>.
ueshin closed pull request #43611: [SPARK-45746][Python] Return specific error messages if UDTF 'analyze' or 'eval' method accepts or returns wrong values
URL: https://github.com/apache/spark/pull/43611

