Posted to reviews@spark.apache.org by "itholic (via GitHub)" <gi...@apache.org> on 2024/03/05 00:37:48 UTC

[PR] [WIP][SPARK-47274][PYTHON][CONNECT] Provide more useful context for PySpark DataFrame API errors [spark]

itholic opened a new pull request, #45377:
URL: https://github.com/apache/spark/pull/45377

   ### What changes were proposed in this pull request?
   
   This PR introduces an enhancement to the error messages generated by PySpark's DataFrame API, adding detailed context about the location within the user's PySpark code where the error occurred.
   
   This follows a similar improvement done on the JVM side for the Dataset API as described in https://github.com/apache/spark/pull/43334, aiming to provide PySpark users with the same level of detailed error context for better usability and debugging efficiency.
   
   
   ### Why are the changes needed?
   
   To improve debuggability. Errors originating from PySpark operations can be difficult to debug given the limited context in the error messages. While the JVM side has been improved to offer detailed error contexts, PySpark errors often lack this level of detail.
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   No API changes, but error messages will include a reference to the exact line of user code that triggered the error, in addition to the existing descriptive error message.
   
   For example, consider the following PySpark code snippet that triggers a `DIVIDE_BY_ZERO` error:
   
   ```python
   1  from pyspark.sql import SparkSession
   2  from pyspark.sql.functions import col
   3  
   4  spark = SparkSession.builder.appName("ExampleApp").getOrCreate()
   5  spark.conf.set("spark.sql.ansi.enabled", True)
   6  
   7  df = spark.range(10)
   8  df.select(col("id") / 0).show()
   ```
   
   **Before:**
   ```
   pyspark.errors.exceptions.captured.ArithmeticException: [DIVIDE_BY_ZERO] Division by zero. Use `try_divide` to tolerate divisor being 0 and return NULL instead. If necessary set "spark.sql.ansi.enabled" to "false" to bypass this error. SQLSTATE: 22012
   == DataFrame ==
   "divide" was called from
   java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   ```
   
   **After:**
   ```
   pyspark.errors.exceptions.captured.ArithmeticException: [DIVIDE_BY_ZERO] Division by zero. Use `try_divide` to tolerate divisor being 0 and return NULL instead. If necessary set "spark.sql.ansi.enabled" to "false" to bypass this error. SQLSTATE: 22012
   == DataFrame ==
   "divide" was called from
   java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   
   == Error Location (PySpark Code) ==
   ['df.select(col("id") / 0).show()\n'] was called from /.../spark/python/test_pyspark_error.py:8
   ```
   
   
   ### How was this patch tested?
   
   Added UTs.
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.
   




Re: [PR] [WIP][SPARK-47274][PYTHON][CONNECT] Provide more useful context for PySpark DataFrame API errors [spark]

Posted by "itholic (via GitHub)" <gi...@apache.org>.
itholic commented on PR #45377:
URL: https://github.com/apache/spark/pull/45377#issuecomment-1977735804

   I'm still working on Spark Connect support and unit tests, but the basic structure is ready for review.




Re: [PR] [SPARK-47274][PYTHON][SQL] Provide more useful context for PySpark DataFrame API errors [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on PR #45377:
URL: https://github.com/apache/spark/pull/45377#issuecomment-2033393150

   cc @cloud-fan too




Re: [PR] [SPARK-47274][PYTHON][SQL] Provide more useful context for PySpark DataFrame API errors [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #45377:
URL: https://github.com/apache/spark/pull/45377#discussion_r1551362982


##########
sql/api/src/main/scala/org/apache/spark/sql/catalyst/trees/QueryContexts.scala:
##########
@@ -165,6 +172,20 @@ case class DataFrameQueryContext(stackTrace: Seq[StackTraceElement]) extends Que
     builder ++= " was called from\n"
     builder ++= callSite
     builder += '\n'
+
+    if (pysparkOriginInfo.nonEmpty) {
+      builder ++= "\n== PySpark call site ==\n"
+      builder ++= "\""
+
+      builder ++= pysparkFragment
+      builder ++= "\""
+      builder ++= " was called from\n"
+      builder ++= pysparkCallSite
+      builder += '\n'
+    }
+
+    PySparkCurrentOrigin.clear()

Review Comment:
   it looks tricky to call `clear` inside a `lazy val`...
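
   To spell out the concern (my own analogy, not code from this PR): the body of a Scala `lazy val` runs at most once, so a `clear()` inside it is a one-shot side effect whose timing depends on whether and when the value is first read. In Python terms it behaves like a `functools.cached_property`:

   ```python
   from functools import cached_property


   class Context:
       """Rough analogy for the lazily-built query context summary."""

       def __init__(self, origin: str) -> None:
           self._origin = origin

       @cached_property
       def summary(self) -> str:
           text = f"== PySpark call site ==\n{self._origin}"
           # This "clear" runs only on the *first* access of `summary`,
           # and never runs at all if `summary` is never read.
           self._origin = ""
           return text
   ```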





Re: [PR] [SPARK-47274][PYTHON][SQL] Provide more useful context for PySpark DataFrame API errors [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on PR #45377:
URL: https://github.com/apache/spark/pull/45377#issuecomment-2036729930

   ```
   == DataFrame ==
   "divide" was called from
   java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   
   == PySpark call site ==
   "divide" was called from
   /.../spark/python/test_pyspark_error.py:4
   ```
   
   Shall we just make it
   ```
   == DataFrame ==
   "divide" was called from
   /.../spark/python/test_pyspark_error.py:4
   ```




Re: [PR] [SPARK-47274][PYTHON][SQL] Provide more useful context for PySpark DataFrame API errors [spark]

Posted by "xinrong-meng (via GitHub)" <gi...@apache.org>.
xinrong-meng commented on code in PR #45377:
URL: https://github.com/apache/spark/pull/45377#discussion_r1552181315


##########
python/pyspark/errors/exceptions/captured.py:
##########
@@ -379,5 +379,13 @@ def fragment(self) -> str:
     def callSite(self) -> str:
         return str(self._q.callSite())
 
+    def pysparkFragment(self) -> Optional[str]:  # type: ignore[return]
+        if self.contextType() == QueryContextType.DataFrame:
+            return str(self._q.pysparkFragment())
+
+    def pysparkCallSite(self) -> Optional[str]:  # type: ignore[return]

Review Comment:
   By any chance, could we add a comment explaining what "fragment" and "callSite" are in PySparkCurrentOrigin?





Re: [PR] [SPARK-47274][PYTHON][SQL] Provide more useful context for PySpark DataFrame API errors [spark]

Posted by "itholic (via GitHub)" <gi...@apache.org>.
itholic commented on PR #45377:
URL: https://github.com/apache/spark/pull/45377#issuecomment-2041298301

   Nice catch 👍 Let me address this case as well.




Re: [PR] [SPARK-47274][PYTHON][SQL] Provide more useful context for PySpark DataFrame API errors [spark]

Posted by "itholic (via GitHub)" <gi...@apache.org>.
itholic commented on code in PR #45377:
URL: https://github.com/apache/spark/pull/45377#discussion_r1558831279


##########
python/pyspark/sql/column.py:
##########
@@ -174,16 +175,48 @@ def _bin_op(
     ["Column", Union["Column", "LiteralType", "DecimalLiteral", "DateTimeLiteral"]], "Column"
 ]:
     """Create a method for given binary operator"""
+    binary_operator_map = {
+        "plus": "+",
+        "minus": "-",
+        "divide": "/",
+        "multiply": "*",
+        "mod": "%",
+        "equalTo": "=",
+        "lt": "<",
+        "leq": "<=",
+        "geq": ">=",
+        "gt": ">",
+        "eqNullSafe": "<=>",
+        "bitwiseOR": "|",
+        "bitwiseAND": "&",
+        "bitwiseXOR": "^",
+        # Just following JVM rule even if the names of source and target are the same.
+        "and": "and",
+        "or": "or",
+    }
 
     def _(
         self: "Column",
         other: Union["Column", "LiteralType", "DecimalLiteral", "DateTimeLiteral"],
     ) -> "Column":
+        logging_info = {}
+        if name in binary_operator_map:
+            stack = inspect.stack()
+            frame_info = stack[-1]

Review Comment:
   Now we respect the conf.





Re: [PR] [WIP][SPARK-47274][PYTHON][CONNECT] Provide more useful context for PySpark DataFrame API errors [spark]

Posted by "itholic (via GitHub)" <gi...@apache.org>.
itholic commented on code in PR #45377:
URL: https://github.com/apache/spark/pull/45377#discussion_r1511983720


##########
python/pyspark/errors/utils.py:
##########
@@ -119,3 +127,73 @@ def get_message_template(self, error_class: str) -> str:
             message_template = main_message_template + " " + sub_message_template
 
         return message_template
+
+
+def is_builtin_exception(e: BaseException) -> bool:
+    """
+    Check if the given exception is a builtin exception or not
+    """
+    builtin_exceptions = [
+        exc
+        for name, exc in vars(builtins).items()
+        if isinstance(exc, type) and issubclass(exc, BaseException)
+    ]
+    return isinstance(e, tuple(builtin_exceptions))
+
+
+def add_error_context(func: Callable[..., Any]) -> Callable[..., Any]:
+    """
+    A decorator that captures PySpark exceptions occurring during the function execution,
+    and adds user code location information to the exception message.
+    """
+
+    @functools.wraps(func)
+    def wrapper(*args: Any, **kwargs: Any) -> Any:
+        try:
+            return func(*args, **kwargs)
+        except Exception as e:
+            from pyspark.errors import PySparkException
+            from pyspark.errors.exceptions.captured import CapturedException
+
+            inspect_stack = inspect.stack()
+            # Stack location is different when Python running on IPython (e.g. Jupyter Notebook)
+            user_code_space = inspect_stack[-1] if get_ipython() is None else inspect_stack[1]

Review Comment:
   This PR also covers errors generated from notebooks. For example:
   
   <img width="1204" alt="Screenshot 2024-03-04 at 3 41 28 PM" src="https://github.com/apache/spark/assets/44108233/a713d8a9-097b-455f-b3b4-c5d18185b25f">
   





Re: [PR] [WIP][SPARK-47274][PYTHON][CONNECT] Provide more useful context for PySpark DataFrame API errors [spark]

Posted by "itholic (via GitHub)" <gi...@apache.org>.
itholic commented on code in PR #45377:
URL: https://github.com/apache/spark/pull/45377#discussion_r1519270151


##########
python/pyspark/errors/utils.py:
##########
@@ -15,12 +15,22 @@
 # limitations under the License.
 #
 
+import builtins
 import re
-from typing import Dict, Match
+import functools
+import inspect
+import threading
+from typing import Any, Callable, Dict, Match, TypeVar, Type
+
+from IPython import get_ipython

Review Comment:
   Thanks for pointing that out!
   
   BTW, I'm preparing a new design for the error context that leverages the JVM stack trace. Let me address this comment when applying the new design.





Re: [PR] [SPARK-47274][PYTHON][SQL] Provide more useful context for PySpark DataFrame API errors [spark]

Posted by "itholic (via GitHub)" <gi...@apache.org>.
itholic commented on PR #45377:
URL: https://github.com/apache/spark/pull/45377#issuecomment-2041845268

   > My idea: we add new Column creation methods for PySpark, which take Python call site information.
   
   I'm not 100% sure whether it will work, but it sounds worth trying. Let me give it a shot. Thanks for the idea!




Re: [PR] [SPARK-47274][PYTHON][SQL] Provide more useful context for PySpark DataFrame API errors [spark]

Posted by "itholic (via GitHub)" <gi...@apache.org>.
itholic commented on code in PR #45377:
URL: https://github.com/apache/spark/pull/45377#discussion_r1552966218


##########
python/pyspark/errors/exceptions/captured.py:
##########
@@ -379,5 +379,13 @@ def fragment(self) -> str:
     def callSite(self) -> str:
         return str(self._q.callSite())
 
+    def pysparkFragment(self) -> Optional[str]:  # type: ignore[return]
+        if self.contextType() == QueryContextType.DataFrame:
+            return str(self._q.pysparkFragment())
+
+    def pysparkCallSite(self) -> Optional[str]:  # type: ignore[return]

Review Comment:
   > how are these two different? The given example only shows the call site
   > 
   > ```
   > == PySpark call site ==
   > "divide" was called from
   > /.../spark/python/test_pyspark_error.py:8
   > ```
   
   Yes, `fragment` is the function identifier and `callSite` is the file name with the line number.
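
   For the example above, the two values would look like this (illustrative only):

   ```python
   # fragment: which PySpark operation produced the failing expression
   fragment = "divide"
   # callSite: where in the user's code that operation was invoked
   call_site = "/.../spark/python/test_pyspark_error.py:8"
   ```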





Re: [PR] [SPARK-47274][PYTHON][SQL] Provide more useful context for PySpark DataFrame API errors [spark]

Posted by "itholic (via GitHub)" <gi...@apache.org>.
itholic commented on code in PR #45377:
URL: https://github.com/apache/spark/pull/45377#discussion_r1553150727


##########
python/pyspark/sql/tests/test_dataframe.py:
##########
@@ -825,6 +828,231 @@ def test_duplicate_field_names(self):
         self.assertEqual(df.schema, schema)
         self.assertEqual(df.collect(), data)
 
+    def test_dataframe_error_context(self):
+        # SPARK-47274: Add more useful contexts for PySpark DataFrame API errors.
+        with self.sql_conf({"spark.sql.ansi.enabled": True}):
+            df = self.spark.range(10)
+
+            # DataFrameQueryContext with pysparkCallSite - divide
+            with self.assertRaises(ArithmeticException) as pe:
+                df.withColumn("div_zero", df.id / 0).collect()
+            self.check_error(
+                exception=pe.exception,
+                error_class="DIVIDE_BY_ZERO",
+                message_parameters={"config": '"spark.sql.ansi.enabled"'},
+                query_context_type=QueryContextType.DataFrame,
+                pyspark_fragment="divide",
+            )
+
+            # DataFrameQueryContext with pysparkCallSite - plus
+            with self.assertRaises(NumberFormatException) as pe:
+                df.withColumn("plus_invalid_type", df.id + "string").collect()
+            self.check_error(
+                exception=pe.exception,
+                error_class="CAST_INVALID_INPUT",
+                message_parameters={
+                    "expression": "'string'",
+                    "sourceType": '"STRING"',
+                    "targetType": '"BIGINT"',
+                    "ansiConfig": '"spark.sql.ansi.enabled"',
+                },
+                query_context_type=QueryContextType.DataFrame,
+                pyspark_fragment="plus",
+            )
+
+            # DataFrameQueryContext with pysparkCallSite - minus
+            with self.assertRaises(NumberFormatException) as pe:
+                df.withColumn("minus_invalid_type", df.id - "string").collect()
+            self.check_error(
+                exception=pe.exception,
+                error_class="CAST_INVALID_INPUT",
+                message_parameters={
+                    "expression": "'string'",
+                    "sourceType": '"STRING"',
+                    "targetType": '"BIGINT"',
+                    "ansiConfig": '"spark.sql.ansi.enabled"',
+                },
+                query_context_type=QueryContextType.DataFrame,
+                pyspark_fragment="minus",
+            )
+
+            # DataFrameQueryContext with pysparkCallSite - multiply
+            with self.assertRaises(NumberFormatException) as pe:
+                df.withColumn("multiply_invalid_type", df.id * "string").collect()
+            self.check_error(
+                exception=pe.exception,
+                error_class="CAST_INVALID_INPUT",
+                message_parameters={
+                    "expression": "'string'",
+                    "sourceType": '"STRING"',
+                    "targetType": '"BIGINT"',
+                    "ansiConfig": '"spark.sql.ansi.enabled"',
+                },
+                query_context_type=QueryContextType.DataFrame,
+                pyspark_fragment="multiply",
+            )
+
+            # DataFrameQueryContext with pysparkCallSite - chained (`divide` is problematic)
+            with self.assertRaises(ArithmeticException) as pe:
+                df.withColumn("multiply_ten", df.id * 10).withColumn(
+                    "divide_zero", df.id / 0
+                ).withColumn("plus_ten", df.id + 10).withColumn("minus_ten", df.id - 10).collect()
+            self.check_error(
+                exception=pe.exception,
+                error_class="DIVIDE_BY_ZERO",
+                message_parameters={"config": '"spark.sql.ansi.enabled"'},
+                query_context_type=QueryContextType.DataFrame,
+                pyspark_fragment="divide",
+            )
+
+            # DataFrameQueryContext with pysparkCallSite - chained (`plus` is problematic)
+            with self.assertRaises(NumberFormatException) as pe:
+                df.withColumn("multiply_ten", df.id * 10).withColumn(
+                    "divide_ten", df.id / 10
+                ).withColumn("plus_string", df.id + "string").withColumn(
+                    "minus_ten", df.id - 10
+                ).collect()
+            self.check_error(
+                exception=pe.exception,
+                error_class="CAST_INVALID_INPUT",
+                message_parameters={
+                    "expression": "'string'",
+                    "sourceType": '"STRING"',
+                    "targetType": '"BIGINT"',
+                    "ansiConfig": '"spark.sql.ansi.enabled"',
+                },
+                query_context_type=QueryContextType.DataFrame,
+                pyspark_fragment="plus",
+            )
+
+            # DataFrameQueryContext with pysparkCallSite - chained (`minus` is problematic)
+            with self.assertRaises(NumberFormatException) as pe:
+                df.withColumn("multiply_ten", df.id * 10).withColumn(
+                    "divide_ten", df.id / 10
+                ).withColumn("plus_ten", df.id + 10).withColumn(
+                    "minus_string", df.id - "string"
+                ).collect()
+            self.check_error(
+                exception=pe.exception,
+                error_class="CAST_INVALID_INPUT",
+                message_parameters={
+                    "expression": "'string'",
+                    "sourceType": '"STRING"',
+                    "targetType": '"BIGINT"',
+                    "ansiConfig": '"spark.sql.ansi.enabled"',
+                },
+                query_context_type=QueryContextType.DataFrame,
+                pyspark_fragment="minus",
+            )
+
+            # DataFrameQueryContext with pysparkCallSite - chained (`multiply` is problematic)
+            with self.assertRaises(NumberFormatException) as pe:
+                df.withColumn("multiply_string", df.id * "string").withColumn(
+                    "divide_ten", df.id / 10
+                ).withColumn("plus_ten", df.id + 10).withColumn("minus_ten", df.id - 10).collect()
+            self.check_error(
+                exception=pe.exception,
+                error_class="CAST_INVALID_INPUT",
+                message_parameters={
+                    "expression": "'string'",
+                    "sourceType": '"STRING"',
+                    "targetType": '"BIGINT"',
+                    "ansiConfig": '"spark.sql.ansi.enabled"',
+                },
+                query_context_type=QueryContextType.DataFrame,
+                pyspark_fragment="multiply",
+            )
+
+            # Multiple expressions in df.select (`divide` is problematic)
+            with self.assertRaises(ArithmeticException) as pe:
+                df.select(df.id - 10, df.id + 4, df.id / 0, df.id * 5).collect()
+            self.check_error(
+                exception=pe.exception,
+                error_class="DIVIDE_BY_ZERO",
+                message_parameters={"config": '"spark.sql.ansi.enabled"'},
+                query_context_type=QueryContextType.DataFrame,
+                pyspark_fragment="divide",
+            )
+
+            # Multiple expressions in df.select (`plus` is problematic)
+            with self.assertRaises(NumberFormatException) as pe:
+                df.select(df.id - 10, df.id + "string", df.id / 10, df.id * 5).collect()
+            self.check_error(
+                exception=pe.exception,
+                error_class="CAST_INVALID_INPUT",
+                message_parameters={
+                    "expression": "'string'",
+                    "sourceType": '"STRING"',
+                    "targetType": '"BIGINT"',
+                    "ansiConfig": '"spark.sql.ansi.enabled"',
+                },
+                query_context_type=QueryContextType.DataFrame,
+                pyspark_fragment="plus",
+            )
+
+            # Multiple expressions in df.select (`minus` is problematic)
+            with self.assertRaises(NumberFormatException) as pe:
+                df.select(df.id - "string", df.id + 4, df.id / 10, df.id * 5).collect()
+            self.check_error(
+                exception=pe.exception,
+                error_class="CAST_INVALID_INPUT",
+                message_parameters={
+                    "expression": "'string'",
+                    "sourceType": '"STRING"',
+                    "targetType": '"BIGINT"',
+                    "ansiConfig": '"spark.sql.ansi.enabled"',
+                },
+                query_context_type=QueryContextType.DataFrame,
+                pyspark_fragment="minus",
+            )
+
+            # Multiple expressions in df.select (`multiply` is problematic)
+            with self.assertRaises(NumberFormatException) as pe:
+                df.select(df.id - 10, df.id + 4, df.id / 10, df.id * "string").collect()
+            self.check_error(
+                exception=pe.exception,
+                error_class="CAST_INVALID_INPUT",
+                message_parameters={
+                    "expression": "'string'",
+                    "sourceType": '"STRING"',
+                    "targetType": '"BIGINT"',
+                    "ansiConfig": '"spark.sql.ansi.enabled"',
+                },
+                query_context_type=QueryContextType.DataFrame,
+                pyspark_fragment="multiply",
+            )

Review Comment:
   cc @ueshin, I added some test cases here





Re: [PR] [SPARK-47274][PYTHON][SQL] Provide more useful context for PySpark DataFrame API errors [spark]

Posted by "itholic (via GitHub)" <gi...@apache.org>.
itholic commented on PR #45377:
URL: https://github.com/apache/spark/pull/45377#issuecomment-2044191299

   @cloud-fan @ueshin I believe the previous comments are now all resolved, and I have also added more tests accordingly.
   
   Could you take a look when you find some time?




Re: [PR] [SPARK-47274][PYTHON][SQL] Provide more useful context for PySpark DataFrame API errors [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #45377:
URL: https://github.com/apache/spark/pull/45377#discussion_r1557593182


##########
python/pyspark/sql/column.py:
##########
@@ -174,16 +175,48 @@ def _bin_op(
     ["Column", Union["Column", "LiteralType", "DecimalLiteral", "DateTimeLiteral"]], "Column"
 ]:
     """Create a method for given binary operator"""
+    binary_operator_map = {
+        "plus": "+",
+        "minus": "-",
+        "divide": "/",
+        "multiply": "*",
+        "mod": "%",
+        "equalTo": "=",
+        "lt": "<",
+        "leq": "<=",
+        "geq": ">=",
+        "gt": ">",
+        "eqNullSafe": "<=>",
+        "bitwiseOR": "|",
+        "bitwiseAND": "&",
+        "bitwiseXOR": "^",
+        # Just following JVM rule even if the names of source and target are the same.
+        "and": "and",
+        "or": "or",
+    }
 
     def _(
         self: "Column",
         other: Union["Column", "LiteralType", "DecimalLiteral", "DateTimeLiteral"],
     ) -> "Column":
+        logging_info = {}
+        if name in binary_operator_map:
+            stack = inspect.stack()
+            frame_info = stack[-1]

Review Comment:
   we should respect the conf `SQLConf.get.stackTracesInDataFrameContext` and capture more than one stack frame if users want.
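
   A minimal sketch of what respecting that conf could look like on the Python side, reusing the `inspect.stack()` and `PySparkCurrentOrigin` approach from the diffs in this thread. The user-facing key is assumed here to mirror the SQLConf field name (`spark.sql.stackTracesInDataFrameContext`); this is an illustration, not the merged code:

   ```python
   import inspect

   from pyspark.sql import SparkSession


   def _capture_call_site(fragment: str) -> None:
       spark = SparkSession.getActiveSession()
       if spark is None or spark._jvm is None:
           return
       # Number of user stack frames to record (assumed conf key, default 1).
       depth = int(spark.conf.get("spark.sql.stackTracesInDataFrameContext", "1"))
       # The outermost frames belong to the user's code; keep up to `depth` of them.
       frames = inspect.stack()[-depth:]
       call_site = "\n".join(f"{frame.filename}:{frame.lineno}" for frame in frames)
       pyspark_origin = spark._jvm.org.apache.spark.sql.catalyst.trees.PySparkCurrentOrigin
       pyspark_origin.set(fragment, call_site)
   ```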





Re: [PR] [SPARK-47274][PYTHON][SQL] Provide more useful context for PySpark DataFrame API errors [spark]

Posted by "itholic (via GitHub)" <gi...@apache.org>.
itholic commented on PR #45377:
URL: https://github.com/apache/spark/pull/45377#issuecomment-2056680916

   > Let me give it a try and create a PR to refactoring the current structure, and ping you guys.
   
   Created https://github.com/apache/spark/pull/46063.




Re: [PR] [SPARK-47274][PYTHON][SQL] Provide more useful context for PySpark DataFrame API errors [spark]

Posted by "itholic (via GitHub)" <gi...@apache.org>.
itholic commented on code in PR #45377:
URL: https://github.com/apache/spark/pull/45377#discussion_r1549476644


##########
python/pyspark/errors/utils.py:
##########
@@ -119,3 +124,61 @@ def get_message_template(self, error_class: str) -> str:
             message_template = main_message_template + " " + sub_message_template
 
         return message_template
+
+
+def _capture_call_site(fragment: str) -> None:
+    """
+    Capture the call site information including file name, line number, and function name.
+
+    This function updates the thread-local storage from server side (PySparkCurrentOrigin)
+    with the current call site information when a PySpark API function is called.
+
+    Parameters
+    ----------
+    func_name : str
+        The name of the PySpark API function being captured.
+
+    Notes
+    -----
+    The call site information is used to enhance error messages with the exact location
+    in the user code that led to the error.
+    """
+    from pyspark.sql.session import SparkSession
+
+    spark = SparkSession._getActiveSessionOrCreate()

Review Comment:
   Sounds reasonable to me. Let me update.





Re: [PR] [SPARK-47274][PYTHON][SQL] Provide more useful context for PySpark DataFrame API errors [spark]

Posted by "itholic (via GitHub)" <gi...@apache.org>.
itholic commented on code in PR #45377:
URL: https://github.com/apache/spark/pull/45377#discussion_r1559115572


##########
sql/api/src/main/scala/org/apache/spark/sql/catalyst/trees/QueryContexts.scala:
##########
@@ -134,7 +134,9 @@ case class SQLQueryContext(
   override def callSite: String = throw SparkUnsupportedOperationException()
 }
 
-case class DataFrameQueryContext(stackTrace: Seq[StackTraceElement]) extends QueryContext {
+case class DataFrameQueryContext(
+    stackTrace: Seq[StackTraceElement],
+    pysparkLoggingInfo: Option[(String, String)]) extends QueryContext {

Review Comment:
   Sounds better. Let me update





Re: [PR] [SPARK-47274][PYTHON][SQL] Provide more useful context for PySpark DataFrame API errors [spark]

Posted by "itholic (via GitHub)" <gi...@apache.org>.
itholic commented on code in PR #45377:
URL: https://github.com/apache/spark/pull/45377#discussion_r1557621216


##########
sql/api/src/main/scala/org/apache/spark/sql/catalyst/trees/QueryContexts.scala:
##########
@@ -134,7 +134,9 @@ case class SQLQueryContext(
   override def callSite: String = throw SparkUnsupportedOperationException()
 }
 
-case class DataFrameQueryContext(stackTrace: Seq[StackTraceElement]) extends QueryContext {
+case class DataFrameQueryContext(
+    stackTrace: Seq[StackTraceElement],
+    pysparkLoggingInfo: Option[java.util.Map[String, String]]) extends QueryContext {

Review Comment:
   Because we're currently using a `dict` for the PySpark logging info, and it is internally converted into a `Map`:
   
   ```python
               logging_info = {
                   "fragment": name,
                   "callSite": f"{frame_info.filename}:{frame_info.lineno}",
               }
   ```
   
   Is there any particular reason we should use `Option[(String, String)]` instead?






Re: [PR] [SPARK-47274][PYTHON][SQL] Provide more useful context for PySpark DataFrame API errors [spark]

Posted by "itholic (via GitHub)" <gi...@apache.org>.
itholic commented on code in PR #45377:
URL: https://github.com/apache/spark/pull/45377#discussion_r1557621216


##########
sql/api/src/main/scala/org/apache/spark/sql/catalyst/trees/QueryContexts.scala:
##########
@@ -134,7 +134,9 @@ case class SQLQueryContext(
   override def callSite: String = throw SparkUnsupportedOperationException()
 }
 
-case class DataFrameQueryContext(stackTrace: Seq[StackTraceElement]) extends QueryContext {
+case class DataFrameQueryContext(
+    stackTrace: Seq[StackTraceElement],
+    pysparkLoggingInfo: Option[java.util.Map[String, String]]) extends QueryContext {

Review Comment:
   Because we're currently using a `dict` for the PySpark logging info, and it is internally converted into a `Map` by `Py4J`:
   
   ```python
               logging_info = {
                   "fragment": name,
                   "callSite": f"{frame_info.filename}:{frame_info.lineno}",
               }
   ```
   
   Is there any particular reason we should use `Option[(String, String)]` instead?





Re: [PR] [SPARK-47274][PYTHON][SQL] Provide more useful context for PySpark DataFrame API errors [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #45377:
URL: https://github.com/apache/spark/pull/45377#discussion_r1558921090


##########
sql/core/src/main/scala/org/apache/spark/sql/Column.scala:
##########
@@ -171,6 +171,26 @@ class Column(val expr: Expression) extends Logging {
     Column.fn(name, this, lit(other))
   }
 
+  /**
+   * A version of the `fn` method specifically designed for binary operations in PySpark
+   * that require logging information.
+   * This method is used when the operation involves another Column.
+   *
+   * @param name               The name of the operation to be performed.
+   * @param other              The value to be used in the operation, which will be converted to a
+   *                           Column if not already one.
+   * @param pysparkLoggingInfo A map containing logging information such as the fragment and
+   *                           call site from PySpark.
+   * @return A Column resulting from the operation.
+   */
+  private def fn(
+      name: String, other: Any, pysparkLoggingInfo: java.util.ArrayList[String]): Column = {

Review Comment:
   shall we just take two string parameters instead of taking an ArrayList?





Re: [PR] [SPARK-47274][PYTHON][SQL] Provide more useful context for PySpark DataFrame API errors [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on code in PR #45377:
URL: https://github.com/apache/spark/pull/45377#discussion_r1549242008


##########
python/pyspark/errors/utils.py:
##########
@@ -119,3 +124,61 @@ def get_message_template(self, error_class: str) -> str:
             message_template = main_message_template + " " + sub_message_template
 
         return message_template
+
+
+def _capture_call_site(fragment: str) -> None:
+    """
+    Capture the call site information including file name, line number, and function name.
+
+    This function updates the thread-local storage from server side (PySparkCurrentOrigin)
+    with the current call site information when a PySpark API function is called.
+
+    Parameters
+    ----------
+    func_name : str
+        The name of the PySpark API function being captured.
+
+    Notes
+    -----
+    The call site information is used to enhance error messages with the exact location
+    in the user code that led to the error.
+    """
+    from pyspark.sql.session import SparkSession
+
+    spark = SparkSession._getActiveSessionOrCreate()
+    assert spark._jvm is not None
+
+    stack = inspect.stack()
+    frame_info = stack[-1]
+    filename = frame_info.filename
+    lineno = frame_info.lineno
+    call_site = f"{filename}:{lineno}"
+
+    pyspark_origin = spark._jvm.org.apache.spark.sql.catalyst.trees.PySparkCurrentOrigin
+    pyspark_origin.set(fragment, call_site)
+
+
+def with_origin(func: Callable[..., Any]) -> Callable[..., Any]:

Review Comment:
   Make this private too, with `_with_origin`, if it isn't supposed to be referenced in other modules.





Re: [PR] [SPARK-47274][PYTHON][SQL] Provide more useful context for PySpark DataFrame API errors [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on code in PR #45377:
URL: https://github.com/apache/spark/pull/45377#discussion_r1549240235


##########
python/pyspark/sql/column.py:
##########
@@ -195,6 +197,7 @@ def _(self: "Column", other: Union["LiteralType", "DecimalLiteral"]) -> "Column"
     return _
 
 
+@with_origin_to_class
 class Column:

Review Comment:
   Is covering `Column` alone enough? cc @MaxGekk @cloud-fan 





Re: [PR] [SPARK-47274][PYTHON][SQL] Provide more useful context for PySpark DataFrame API errors [spark]

Posted by "itholic (via GitHub)" <gi...@apache.org>.
itholic commented on code in PR #45377:
URL: https://github.com/apache/spark/pull/45377#discussion_r1549517320


##########
python/pyspark/sql/column.py:
##########
@@ -195,6 +197,7 @@ def _(self: "Column", other: Union["LiteralType", "DecimalLiteral"]) -> "Column"
     return _
 
 
+@with_origin_to_class
 class Column:

Review Comment:
   I believe so.
   
   I know there is some additional handling for a couple of DataFrame functions such as `approxQuantile`, `cov`, `corr`, `crosstab`, etc., but they raise errors immediately from PySpark, unlike the Column API, so I think it should be fine.





Re: [PR] [SPARK-47274][PYTHON][SQL] Provide more useful context for PySpark DataFrame API errors [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on code in PR #45377:
URL: https://github.com/apache/spark/pull/45377#discussion_r1549232872


##########
python/pyspark/errors/utils.py:
##########
@@ -119,3 +124,61 @@ def get_message_template(self, error_class: str) -> str:
             message_template = main_message_template + " " + sub_message_template
 
         return message_template
+
+
+def _capture_call_site(fragment: str) -> None:
+    """
+    Capture the call site information including file name, line number, and function name.
+
+    This function updates the thread-local storage from server side (PySparkCurrentOrigin)
+    with the current call site information when a PySpark API function is called.
+
+    Parameters
+    ----------
+    func_name : str
+        The name of the PySpark API function being captured.
+
+    Notes
+    -----
+    The call site information is used to enhance error messages with the exact location
+    in the user code that led to the error.
+    """
+    from pyspark.sql.session import SparkSession
+
+    spark = SparkSession._getActiveSessionOrCreate()

Review Comment:
   I think we should probably just use `getActiveSession`, and skip if it returns `None` for whatever reason, instead of creating a new session here.
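
   In code, the suggestion could look roughly like this (a sketch, not the final change):

   ```python
   from pyspark.sql import SparkSession


   def _capture_call_site(fragment: str) -> None:
       # Use the active session if there is one; otherwise skip the capture
       # entirely rather than creating a new session as a side effect.
       spark = SparkSession.getActiveSession()
       if spark is None or spark._jvm is None:
           return
       ...  # record the call site via PySparkCurrentOrigin, as in the diff above
   ```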



##########
python/pyspark/errors/utils.py:
##########
@@ -119,3 +124,61 @@ def get_message_template(self, error_class: str) -> str:
             message_template = main_message_template + " " + sub_message_template
 
         return message_template
+
+
+def _capture_call_site(fragment: str) -> None:
+    """
+    Capture the call site information including file name, line number, and function name.
+
+    This function updates the thread-local storage from server side (PySparkCurrentOrigin)
+    with the current call site information when a PySpark API function is called.
+
+    Parameters
+    ----------
+    func_name : str
+        The name of the PySpark API function being captured.
+
+    Notes
+    -----
+    The call site information is used to enhance error messages with the exact location
+    in the user code that led to the error.
+    """
+    from pyspark.sql.session import SparkSession

Review Comment:
   Import at the top, maybe.





Re: [PR] [SPARK-47274][PYTHON][SQL] Provide more useful context for PySpark DataFrame API errors [spark]

Posted by "itholic (via GitHub)" <gi...@apache.org>.
itholic commented on code in PR #45377:
URL: https://github.com/apache/spark/pull/45377#discussion_r1549469336


##########
python/pyspark/errors/utils.py:
##########
@@ -119,3 +124,61 @@ def get_message_template(self, error_class: str) -> str:
             message_template = main_message_template + " " + sub_message_template
 
         return message_template
+
+
+def _capture_call_site(fragment: str) -> None:
+    """
+    Capture the call site information including file name, line number, and function name.
+
+    This function updates the thread-local storage from server side (PySparkCurrentOrigin)
+    with the current call site information when a PySpark API function is called.
+
+    Parameters
+    ----------
+    func_name : str
+        The name of the PySpark API function being captured.
+
+    Notes
+    -----
+    The call site information is used to enhance error messages with the exact location
+    in the user code that led to the error.
+    """
+    from pyspark.sql.session import SparkSession
+
+    spark = SparkSession._getActiveSessionOrCreate()
+    assert spark._jvm is not None
+
+    stack = inspect.stack()
+    frame_info = stack[-1]
+    filename = frame_info.filename
+    lineno = frame_info.lineno
+    call_site = f"{filename}:{lineno}"
+
+    pyspark_origin = spark._jvm.org.apache.spark.sql.catalyst.trees.PySparkCurrentOrigin
+    pyspark_origin.set(fragment, call_site)
+
+
+def with_origin(func: Callable[..., Any]) -> Callable[..., Any]:

Review Comment:
   Actually, this was initially designed to be applied to individual methods, but let me make it private since we don't have any use case for that right now.





Re: [PR] [SPARK-47274][PYTHON][SQL] Provide more useful context for PySpark DataFrame API errors [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on PR #45377:
URL: https://github.com/apache/spark/pull/45377#issuecomment-2041828693

   @itholic what if we don't use a thread local? IIUC, PySpark calls JVM methods to build the Column instances in the end. On the JVM side, we wrap code with `withOrigin` to capture the call site automatically.
   
   My idea: we add new `Column` creation methods for PySpark, which take Python call site information. The implementation should set the call site before calling `withOrigin`, since `withOrigin` respects an already captured call site:
   ```
     private[sql] def withOrigin[T](f: => T): T = {
       if (CurrentOrigin.get.stackTrace.isDefined) {
         f
       } else {
     ...
   ```




Re: [PR] [SPARK-47274][PYTHON][SQL][CONNECT] Provide more useful context for PySpark DataFrame API errors [spark]

Posted by "itholic (via GitHub)" <gi...@apache.org>.
itholic commented on code in PR #45377:
URL: https://github.com/apache/spark/pull/45377#discussion_r1543203400


##########
python/pyspark/errors/utils.py:
##########
@@ -119,3 +124,62 @@ def get_message_template(self, error_class: str) -> str:
             message_template = main_message_template + " " + sub_message_template
 
         return message_template
+
+
+def _capture_call_site(func_name: str) -> None:
+    """
+    Capture the call site information including file name, line number, and function name.
+
+    This function updates the thread-local storage from server side (PySparkCurrentOrigin)
+    with the current call site information when a PySpark API function is called.
+
+    Parameters
+    ----------
+    func_name : str
+        The name of the PySpark API function being captured.
+
+    Notes
+    -----
+    The call site information is used to enhance error messages with the exact location
+    in the user code that led to the error.
+    """
+    from pyspark.sql.session import SparkSession
+
+    spark = SparkSession._getActiveSessionOrCreate()
+    assert spark._jvm is not None
+
+    stack = inspect.stack()
+    frame_info = stack[-1]
+    function = func_name
+    filename = frame_info.filename
+    lineno = frame_info.lineno
+    call_site = f'"{function}" was called from\n{filename}:{lineno}'

Review Comment:
   I'm not sure whether we should use the `fragment` from the JVM instead of the Python function name here?





Re: [PR] [SPARK-47274][PYTHON][SQL] Provide more useful context for PySpark DataFrame API errors [spark]

Posted by "itholic (via GitHub)" <gi...@apache.org>.
itholic commented on code in PR #45377:
URL: https://github.com/apache/spark/pull/45377#discussion_r1558755161


##########
sql/core/src/main/scala/org/apache/spark/sql/package.scala:
##########
@@ -111,6 +111,26 @@ package object sql {
     }
   }
 
+  private[sql] def withOrigin[T](
+      pysparkLoggingInfo: Option[java.util.Map[String, String]] = None)(f: => T): T = {
+    if (CurrentOrigin.get.stackTrace.isDefined) {
+      f
+    } else {
+      val st = Thread.currentThread().getStackTrace

Review Comment:
   Maybe we can keep the Java stack trace as it is?
   
   In PySpark we also provide users an option, `pysparkJVMStacktraceEnabled`, to turn the Java stack trace on or off, since someone may still want to see the Java stack trace for deeper debugging.
   
   WDYT, @HyukjinKwon?
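
   For reference, the user-facing switch for that option should be the `spark.sql.pyspark.jvmStacktrace.enabled` conf (worth double-checking the exact key), e.g. with an active `spark` session:

   ```python
   # Keep the full JVM stack trace in PySpark exceptions for deeper debugging.
   spark.conf.set("spark.sql.pyspark.jvmStacktrace.enabled", True)
   ```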





Re: [PR] [SPARK-47274][PYTHON][SQL] Provide more useful context for PySpark DataFrame API errors [spark]

Posted by "itholic (via GitHub)" <gi...@apache.org>.
itholic commented on PR #45377:
URL: https://github.com/apache/spark/pull/45377#issuecomment-2048796376

   Thanks @cloud-fan @ueshin @HyukjinKwon @xinrong-meng for the review!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47274][PYTHON][SQL] Provide more useful context for PySpark DataFrame API errors [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #45377:
URL: https://github.com/apache/spark/pull/45377#discussion_r1565491650


##########
sql/core/src/main/scala/org/apache/spark/sql/Column.scala:
##########
@@ -171,6 +171,29 @@ class Column(val expr: Expression) extends Logging {
     Column.fn(name, this, lit(other))
   }
 
+  /**
+   * A version of the `fn` method specifically designed for binary operations in PySpark
+   * that require logging information.
+   * This method is used when the operation involves another Column.
+   *
+   * @param name                The name of the operation to be performed.
+   * @param other               The value to be used in the operation, which will be converted to a
+   *                            Column if not already one.
+   * @param pysparkFragment     A string representing the 'fragment' of the PySpark error context,
+   *                            typically indicates the name of PySpark function.
+   * @param pysparkCallSite     A string representing the 'callSite' of the PySpark error context,
+   *                            providing the exact location within the PySpark code where the
+   *                            operation originated.
+   * @return A Column resulting from the operation.
+   */
+  private def fn(

Review Comment:
   @HyukjinKwon This probably can't cover all the cases, and we may need to add more overloads for certain functions that require non-expression parameters, but it shouldn't be many.
   
   I think it's better than using a ThreadLocal, which can be quite fragile for passing values between Python and the JVM.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47274][PYTHON][SQL] Provide more useful context for PySpark DataFrame API errors [spark]

Posted by "itholic (via GitHub)" <gi...@apache.org>.
itholic commented on PR #45377:
URL: https://github.com/apache/spark/pull/45377#issuecomment-2056611480

   The difficulty with the previous method was that it was not easy to keep the data perfectly in sync between two separately operating ThreadLocals, `CurrentOrigin` and `PySparkCurrentOrigin`.
   
   After taking a deeper look at the structure, I think we may be able to make `CurrentOrigin` flexible enough to support the PySpark error context instead of adding a separate ThreadLocal like `PySparkCurrentOrigin`.
   
   If it works, it seems possible to make the structure more flexible while maintaining the existing communication rules between Python and the JVM, without adding helper functions such as a PySpark-specific `fn`.
   
   Let me give it a try, create a PR to refactor the current structure, and ping you guys.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47274][PYTHON][SQL] Provide more useful context for PySpark DataFrame API errors [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on code in PR #45377:
URL: https://github.com/apache/spark/pull/45377#discussion_r1548814091


##########
python/pyspark/testing/utils.py:
##########
@@ -280,7 +282,14 @@ def check_error(
         exception: PySparkException,
         error_class: str,
         message_parameters: Optional[Dict[str, str]] = None,
+        query_context_type: Optional[QueryContextType] = None,
+        pyspark_fragment: Optional[str] = None,
     ):
+        query_context = exception.getQueryContext()
+        assert bool(query_context) == (query_context_type is not None), (
+            f"`query_context_type` is required when QueryContext exists. "

Review Comment:
   ```suggestion
               "`query_context_type` is required when QueryContext exists. "
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [WIP][SPARK-47274][PYTHON][CONNECT] Provide more useful context for PySpark DataFrame API errors [spark]

Posted by "ueshin (via GitHub)" <gi...@apache.org>.
ueshin commented on code in PR #45377:
URL: https://github.com/apache/spark/pull/45377#discussion_r1518206202


##########
python/pyspark/errors/utils.py:
##########
@@ -15,12 +15,22 @@
 # limitations under the License.
 #
 
+import builtins
 import re
-from typing import Dict, Match
+import functools
+import inspect
+import threading
+from typing import Any, Callable, Dict, Match, TypeVar, Type
+
+from IPython import get_ipython

Review Comment:
   I don't think the `IPython` module is always available.
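
   A minimal sketch of one way to guard the optional dependency (names below are illustrative):

   ```python
   # Guard the optional IPython dependency so plain Python sessions keep working.
   try:
       from IPython import get_ipython
   except ImportError:
       get_ipython = None

   def _in_ipython() -> bool:
       return get_ipython is not None and get_ipython() is not None
   ```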



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47274][PYTHON][SQL] Provide more useful context for PySpark DataFrame API errors [spark]

Posted by "itholic (via GitHub)" <gi...@apache.org>.
itholic commented on code in PR #45377:
URL: https://github.com/apache/spark/pull/45377#discussion_r1558840908


##########
sql/core/src/main/scala/org/apache/spark/sql/package.scala:
##########
@@ -111,6 +111,26 @@ package object sql {
     }
   }
 
+  private[sql] def withOrigin[T](
+      pysparkLoggingInfo: Option[java.util.Map[String, String]] = None)(f: => T): T = {
+    if (CurrentOrigin.get.stackTrace.isDefined) {
+      f
+    } else {
+      val st = Thread.currentThread().getStackTrace

Review Comment:
   Remove duplication by adding a method



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47274][PYTHON][SQL] Provide more useful context for PySpark DataFrame API errors [spark]

Posted by "xinrong-meng (via GitHub)" <gi...@apache.org>.
xinrong-meng commented on code in PR #45377:
URL: https://github.com/apache/spark/pull/45377#discussion_r1558128891


##########
python/pyspark/errors/exceptions/captured.py:
##########
@@ -379,5 +379,13 @@ def fragment(self) -> str:
     def callSite(self) -> str:
         return str(self._q.callSite())
 
+    def pysparkFragment(self) -> Optional[str]:  # type: ignore[return]
+        if self.contextType() == QueryContextType.DataFrame:
+            return str(self._q.pysparkFragment())
+
+    def pysparkCallSite(self) -> Optional[str]:  # type: ignore[return]

Review Comment:
   Thank you!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47274][PYTHON][SQL] Provide more useful context for PySpark DataFrame API errors [spark]

Posted by "itholic (via GitHub)" <gi...@apache.org>.
itholic commented on code in PR #45377:
URL: https://github.com/apache/spark/pull/45377#discussion_r1558755161


##########
sql/core/src/main/scala/org/apache/spark/sql/package.scala:
##########
@@ -111,6 +111,26 @@ package object sql {
     }
   }
 
+  private[sql] def withOrigin[T](
+      pysparkLoggingInfo: Option[java.util.Map[String, String]] = None)(f: => T): T = {
+    if (CurrentOrigin.get.stackTrace.isDefined) {
+      f
+    } else {
+      val st = Thread.currentThread().getStackTrace

Review Comment:
   Can we keep the Java stacktrace as it is?
   
   In PySpark we also provide users an option, `pysparkJVMStacktraceEnabled`, to turn the Java stacktrace on or off, since someone may still want to see the Java stacktrace for deeper debugging.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47274][PYTHON][SQL] Provide more useful context for PySpark DataFrame API errors [spark]

Posted by "itholic (via GitHub)" <gi...@apache.org>.
itholic commented on PR #45377:
URL: https://github.com/apache/spark/pull/45377#issuecomment-2039001191

   > ```
   > == DataFrame ==
   > "divide" was called from
   > java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   > 
   > == PySpark call site ==
   > "divide" was called from
   > /.../spark/python/test_pyspark_error.py:4
   > ```
   > 
   > Shall we just make it
   > 
   > ```
   > == DataFrame ==
   > "divide" was called from
   > /.../spark/python/test_pyspark_error.py:4
   > ```
   
   Sounds good. Let me address it


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47274][PYTHON][SQL] Provide more useful context for PySpark DataFrame API errors [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #45377:
URL: https://github.com/apache/spark/pull/45377#discussion_r1551361278


##########
python/pyspark/errors/exceptions/captured.py:
##########
@@ -379,5 +379,13 @@ def fragment(self) -> str:
     def callSite(self) -> str:
         return str(self._q.callSite())
 
+    def pysparkFragment(self) -> Optional[str]:  # type: ignore[return]
+        if self.contextType() == QueryContextType.DataFrame:
+            return str(self._q.pysparkFragment())
+
+    def pysparkCallSite(self) -> Optional[str]:  # type: ignore[return]

Review Comment:
   oh, `divide` is the fragment?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47274][PYTHON][SQL] Provide more useful context for PySpark DataFrame API errors [spark]

Posted by "itholic (via GitHub)" <gi...@apache.org>.
itholic commented on code in PR #45377:
URL: https://github.com/apache/spark/pull/45377#discussion_r1559114967


##########
sql/core/src/main/scala/org/apache/spark/sql/Column.scala:
##########
@@ -171,6 +171,26 @@ class Column(val expr: Expression) extends Logging {
     Column.fn(name, this, lit(other))
   }
 
+  /**
+   * A version of the `fn` method specifically designed for binary operations in PySpark
+   * that require logging information.
+   * This method is used when the operation involves another Column.
+   *
+   * @param name               The name of the operation to be performed.
+   * @param other              The value to be used in the operation, which will be converted to a
+   *                           Column if not already one.
+   * @param pysparkLoggingInfo A map containing logging information such as the fragment and
+   *                           call site from PySpark.
+   * @return A Column resulting from the operation.
+   */
+  private def fn(
+      name: String, other: Any, pysparkLoggingInfo: java.util.ArrayList[String]): Column = {

Review Comment:
   Ah, two string parameters make sense. Good suggestion, thanks!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47274][PYTHON][SQL] Provide more useful context for PySpark DataFrame API errors [spark]

Posted by "itholic (via GitHub)" <gi...@apache.org>.
itholic commented on code in PR #45377:
URL: https://github.com/apache/spark/pull/45377#discussion_r1552966218


##########
python/pyspark/errors/exceptions/captured.py:
##########
@@ -379,5 +379,13 @@ def fragment(self) -> str:
     def callSite(self) -> str:
         return str(self._q.callSite())
 
+    def pysparkFragment(self) -> Optional[str]:  # type: ignore[return]
+        if self.contextType() == QueryContextType.DataFrame:
+            return str(self._q.pysparkFragment())
+
+    def pysparkCallSite(self) -> Optional[str]:  # type: ignore[return]

Review Comment:
   > how are these two different? The given example only shows the call site
   > 
   > ```
   > == PySpark call site ==
   > "divide" was called from
   > /.../spark/python/test_pyspark_error.py:8
   > ```
   
   Yes, the `fragment` is a function identifier and the call site is the file name including line number.
   
   I followed this behavior from the JVM `DataFrameQueryContext`.
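
   For example, after this PR the two pieces could be inspected roughly like this (a sketch; the method names are taken from the quoted `captured.py` diff and the `check_error` helper):

   ```python
   from pyspark.sql import SparkSession
   from pyspark.errors import PySparkException

   spark = SparkSession.builder.getOrCreate()
   spark.conf.set("spark.sql.ansi.enabled", True)
   df = spark.range(10)

   try:
       df.select(df.id / 0).collect()
   except PySparkException as e:
       for ctx in e.getQueryContext():
           print(ctx.fragment())  # function identifier, e.g. "divide"
           print(ctx.callSite())  # file name and line number, e.g. "/.../app.py:9"
   ```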



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47274][PYTHON][SQL] Provide more useful context for PySpark DataFrame API errors [spark]

Posted by "itholic (via GitHub)" <gi...@apache.org>.
itholic commented on PR #45377:
URL: https://github.com/apache/spark/pull/45377#issuecomment-2039000576

   Thanks @ueshin for spotting the negative case! Let me add it to the test case and fix it.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47274][PYTHON][SQL] Provide more useful context for PySpark DataFrame API errors [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #45377:
URL: https://github.com/apache/spark/pull/45377#discussion_r1557589150


##########
sql/api/src/main/scala/org/apache/spark/sql/catalyst/trees/QueryContexts.scala:
##########
@@ -134,7 +134,9 @@ case class SQLQueryContext(
   override def callSite: String = throw SparkUnsupportedOperationException()
 }
 
-case class DataFrameQueryContext(stackTrace: Seq[StackTraceElement]) extends QueryContext {
+case class DataFrameQueryContext(
+    stackTrace: Seq[StackTraceElement],
+    pysparkLoggingInfo: Option[java.util.Map[String, String]]) extends QueryContext {

Review Comment:
   why is it a map instead of `Option[(String, String)]`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47274][PYTHON][SQL] Provide more useful context for PySpark DataFrame API errors [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #45377:
URL: https://github.com/apache/spark/pull/45377#discussion_r1555465767


##########
sql/core/src/main/scala/org/apache/spark/sql/Column.scala:
##########
@@ -699,6 +699,13 @@ class Column(val expr: Expression) extends Logging {
    */
   def plus(other: Any): Column = this + other
 
+  def plusWithPySparkLoggingInfo(
+      other: Any, loggingInfo: java.util.Map[String, String]): Column = {
+    withOrigin(Some(loggingInfo)) {
+      this + other
+    }
+  }

Review Comment:
   This is too much work... Ideally all column creation APIs call `Column.fn` at the end; if we can change the PySpark code to only call the Scala `Column.fn`, then we don't need to change the Scala side too much.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47274][PYTHON][SQL] Provide more useful context for PySpark DataFrame API errors [spark]

Posted by "itholic (via GitHub)" <gi...@apache.org>.
itholic commented on PR #45377:
URL: https://github.com/apache/spark/pull/45377#issuecomment-2041315119

   Hmm... I faced some problem to resolve this case.
   
   PySpark provides the log to the JVM at the time an expression is declared,
   but the actual execution order on the JVM side can differ from the declaration order.
   
   For example, when running the example @ueshin provided:
   
   ```python
     1 spark.conf.set("spark.sql.ansi.enabled", True)
     2 df = spark.range(10)
     3 a = df.id / 10
     4 b = df.id / 0
     5 df.select(
     6   a,
     7   df.id + 4,
     8   b,
     9   df.id * 5
    10 ).show()
   ```
   
   Internally, the logging is processed as below:
   
   ```python
   # Logging the call site from Python to the JVM when defining each expression, in order:
   1: ("divide", "/test.py:3")
   2: ("divide", "/test.py:4")
   3: ("plus", "/test.py:7")
   4: ("multiply", "/test.py:9")
   
   # But the JVM may analyze the expressions in a different order than the order they were defined in Python:
   1: ArraySeq(org.apache.spark.sql.Column.divide(Column.scala:790), java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method))
   2: ArraySeq(org.apache.spark.sql.Column.plus(Column.scala:700), java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method))
   3: ArraySeq(org.apache.spark.sql.Column.divide(Column.scala:790), java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method))
   4: ArraySeq(org.apache.spark.sql.Column.multiply(Column.scala:760), java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method))
   ```
   
   To solve this problem,
   I think Python and the JVM must share a "key" that uniquely identifies each expression, linking the entry in `PySparkCurrentOrigin` to the JVM stackTrace captured at the time the expression is declared.
   However, I can't think of a good way to make this possible at the moment.
   
   @ueshin @HyukjinKwon @cloud-fan could you advise if there happens to be a good way to make this possible?
   
   
   ## Workaround
   
   Alternatively, the workaround that comes to my mind is to provide additional information to the log.
   And at least we can compare fragment values and output "divide" instead of "plus", but the call site may be different:
   
   For example, the suggested workaround would look like:
   
   **In**
   ```python
     1 spark.conf.set("spark.sql.ansi.enabled", True)
     2 df = spark.range(10)
     3 a = df.id / 10
     4 b = df.id / 0
     5 df.select(
     6   a,
     7   df.id + 4,
     8   b,
     9   df.id * 5
    10 ).show()
   ```
   **Out**
   ```
   pyspark.errors.exceptions.captured.ArithmeticException: [DIVIDE_BY_ZERO] Division by zero. Use `try_divide` to tolerate divisor being 0 and return NULL instead. If necessary set "spark.sql.ansi.enabled" to "false" to bypass this error. SQLSTATE: 22012
   == DataFrame ==
   "divide" was called from
   /test.py:3
   
   == Other possible call sites ==
   "divide" was called from
   /test.py:4
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47274][PYTHON][SQL] Provide more useful context for PySpark DataFrame API errors [spark]

Posted by "itholic (via GitHub)" <gi...@apache.org>.
itholic commented on code in PR #45377:
URL: https://github.com/apache/spark/pull/45377#discussion_r1558755161


##########
sql/core/src/main/scala/org/apache/spark/sql/package.scala:
##########
@@ -111,6 +111,26 @@ package object sql {
     }
   }
 
+  private[sql] def withOrigin[T](
+      pysparkLoggingInfo: Option[java.util.Map[String, String]] = None)(f: => T): T = {
+    if (CurrentOrigin.get.stackTrace.isDefined) {
+      f
+    } else {
+      val st = Thread.currentThread().getStackTrace

Review Comment:
   Can we keep the Java stacktrace as it is?
   
   In PySpark we provide users an option, `pysparkJVMStacktraceEnabled`, to turn the Java stacktrace on or off, since someone may still want to see the Java stacktrace for deeper debugging.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47274][PYTHON][SQL] Provide more useful context for PySpark DataFrame API errors [spark]

Posted by "itholic (via GitHub)" <gi...@apache.org>.
itholic commented on code in PR #45377:
URL: https://github.com/apache/spark/pull/45377#discussion_r1553150727


##########
python/pyspark/sql/tests/test_dataframe.py:
##########
@@ -825,6 +828,231 @@ def test_duplicate_field_names(self):
         self.assertEqual(df.schema, schema)
         self.assertEqual(df.collect(), data)
 
+    def test_dataframe_error_context(self):
+        # SPARK-47274: Add more useful contexts for PySpark DataFrame API errors.
+        with self.sql_conf({"spark.sql.ansi.enabled": True}):
+            df = self.spark.range(10)
+
+            # DataFrameQueryContext with pysparkCallSite - divide
+            with self.assertRaises(ArithmeticException) as pe:
+                df.withColumn("div_zero", df.id / 0).collect()
+            self.check_error(
+                exception=pe.exception,
+                error_class="DIVIDE_BY_ZERO",
+                message_parameters={"config": '"spark.sql.ansi.enabled"'},
+                query_context_type=QueryContextType.DataFrame,
+                pyspark_fragment="divide",
+            )
+
+            # DataFrameQueryContext with pysparkCallSite - plus
+            with self.assertRaises(NumberFormatException) as pe:
+                df.withColumn("plus_invalid_type", df.id + "string").collect()
+            self.check_error(
+                exception=pe.exception,
+                error_class="CAST_INVALID_INPUT",
+                message_parameters={
+                    "expression": "'string'",
+                    "sourceType": '"STRING"',
+                    "targetType": '"BIGINT"',
+                    "ansiConfig": '"spark.sql.ansi.enabled"',
+                },
+                query_context_type=QueryContextType.DataFrame,
+                pyspark_fragment="plus",
+            )
+
+            # DataFrameQueryContext with pysparkCallSite - minus
+            with self.assertRaises(NumberFormatException) as pe:
+                df.withColumn("minus_invalid_type", df.id - "string").collect()
+            self.check_error(
+                exception=pe.exception,
+                error_class="CAST_INVALID_INPUT",
+                message_parameters={
+                    "expression": "'string'",
+                    "sourceType": '"STRING"',
+                    "targetType": '"BIGINT"',
+                    "ansiConfig": '"spark.sql.ansi.enabled"',
+                },
+                query_context_type=QueryContextType.DataFrame,
+                pyspark_fragment="minus",
+            )
+
+            # DataFrameQueryContext with pysparkCallSite - multiply
+            with self.assertRaises(NumberFormatException) as pe:
+                df.withColumn("multiply_invalid_type", df.id * "string").collect()
+            self.check_error(
+                exception=pe.exception,
+                error_class="CAST_INVALID_INPUT",
+                message_parameters={
+                    "expression": "'string'",
+                    "sourceType": '"STRING"',
+                    "targetType": '"BIGINT"',
+                    "ansiConfig": '"spark.sql.ansi.enabled"',
+                },
+                query_context_type=QueryContextType.DataFrame,
+                pyspark_fragment="multiply",
+            )
+
+            # DataFrameQueryContext with pysparkCallSite - chained (`divide` is problematic)
+            with self.assertRaises(ArithmeticException) as pe:
+                df.withColumn("multiply_ten", df.id * 10).withColumn(
+                    "divide_zero", df.id / 0
+                ).withColumn("plus_ten", df.id + 10).withColumn("minus_ten", df.id - 10).collect()
+            self.check_error(
+                exception=pe.exception,
+                error_class="DIVIDE_BY_ZERO",
+                message_parameters={"config": '"spark.sql.ansi.enabled"'},
+                query_context_type=QueryContextType.DataFrame,
+                pyspark_fragment="divide",
+            )
+
+            # DataFrameQueryContext with pysparkCallSite - chained (`plus` is problematic)
+            with self.assertRaises(NumberFormatException) as pe:
+                df.withColumn("multiply_ten", df.id * 10).withColumn(
+                    "divide_ten", df.id / 10
+                ).withColumn("plus_string", df.id + "string").withColumn(
+                    "minus_ten", df.id - 10
+                ).collect()
+            self.check_error(
+                exception=pe.exception,
+                error_class="CAST_INVALID_INPUT",
+                message_parameters={
+                    "expression": "'string'",
+                    "sourceType": '"STRING"',
+                    "targetType": '"BIGINT"',
+                    "ansiConfig": '"spark.sql.ansi.enabled"',
+                },
+                query_context_type=QueryContextType.DataFrame,
+                pyspark_fragment="plus",
+            )
+
+            # DataFrameQueryContext with pysparkCallSite - chained (`minus` is problematic)
+            with self.assertRaises(NumberFormatException) as pe:
+                df.withColumn("multiply_ten", df.id * 10).withColumn(
+                    "divide_ten", df.id / 10
+                ).withColumn("plus_ten", df.id + 10).withColumn(
+                    "minus_string", df.id - "string"
+                ).collect()
+            self.check_error(
+                exception=pe.exception,
+                error_class="CAST_INVALID_INPUT",
+                message_parameters={
+                    "expression": "'string'",
+                    "sourceType": '"STRING"',
+                    "targetType": '"BIGINT"',
+                    "ansiConfig": '"spark.sql.ansi.enabled"',
+                },
+                query_context_type=QueryContextType.DataFrame,
+                pyspark_fragment="minus",
+            )
+
+            # DataFrameQueryContext with pysparkCallSite - chained (`multiply` is problematic)
+            with self.assertRaises(NumberFormatException) as pe:
+                df.withColumn("multiply_string", df.id * "string").withColumn(
+                    "divide_ten", df.id / 10
+                ).withColumn("plus_ten", df.id + 10).withColumn("minus_ten", df.id - 10).collect()
+            self.check_error(
+                exception=pe.exception,
+                error_class="CAST_INVALID_INPUT",
+                message_parameters={
+                    "expression": "'string'",
+                    "sourceType": '"STRING"',
+                    "targetType": '"BIGINT"',
+                    "ansiConfig": '"spark.sql.ansi.enabled"',
+                },
+                query_context_type=QueryContextType.DataFrame,
+                pyspark_fragment="multiply",
+            )
+
+            # Multiple expressions in df.select (`divide` is problematic)
+            with self.assertRaises(ArithmeticException) as pe:
+                df.select(df.id - 10, df.id + 4, df.id / 0, df.id * 5).collect()
+            self.check_error(
+                exception=pe.exception,
+                error_class="DIVIDE_BY_ZERO",
+                message_parameters={"config": '"spark.sql.ansi.enabled"'},
+                query_context_type=QueryContextType.DataFrame,
+                pyspark_fragment="divide",
+            )
+
+            # Multiple expressions in df.select (`plus` is problematic)
+            with self.assertRaises(NumberFormatException) as pe:
+                df.select(df.id - 10, df.id + "string", df.id / 10, df.id * 5).collect()
+            self.check_error(
+                exception=pe.exception,
+                error_class="CAST_INVALID_INPUT",
+                message_parameters={
+                    "expression": "'string'",
+                    "sourceType": '"STRING"',
+                    "targetType": '"BIGINT"',
+                    "ansiConfig": '"spark.sql.ansi.enabled"',
+                },
+                query_context_type=QueryContextType.DataFrame,
+                pyspark_fragment="plus",
+            )
+
+            # Multiple expressions in df.select (`minus` is problematic)
+            with self.assertRaises(NumberFormatException) as pe:
+                df.select(df.id - "string", df.id + 4, df.id / 10, df.id * 5).collect()
+            self.check_error(
+                exception=pe.exception,
+                error_class="CAST_INVALID_INPUT",
+                message_parameters={
+                    "expression": "'string'",
+                    "sourceType": '"STRING"',
+                    "targetType": '"BIGINT"',
+                    "ansiConfig": '"spark.sql.ansi.enabled"',
+                },
+                query_context_type=QueryContextType.DataFrame,
+                pyspark_fragment="minus",
+            )
+
+            # Multiple expressions in df.select (`multiply` is problematic)
+            with self.assertRaises(NumberFormatException) as pe:
+                df.select(df.id - 10, df.id + 4, df.id / 10, df.id * "string").collect()
+            self.check_error(
+                exception=pe.exception,
+                error_class="CAST_INVALID_INPUT",
+                message_parameters={
+                    "expression": "'string'",
+                    "sourceType": '"STRING"',
+                    "targetType": '"BIGINT"',
+                    "ansiConfig": '"spark.sql.ansi.enabled"',
+                },
+                query_context_type=QueryContextType.DataFrame,
+                pyspark_fragment="multiply",
+            )

Review Comment:
   cc @ueshin I added some test cases for multiple operations here



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47274][PYTHON][SQL] Provide more useful context for PySpark DataFrame API errors [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on PR #45377:
URL: https://github.com/apache/spark/pull/45377#issuecomment-2058107235

   > perfectly sync the data between two separately operating ThreadLocal, CurrentOrigin and PySparkCurrentOrigin.
   
   Why is that?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47274][PYTHON][SQL][CONNECT] Provide more useful context for PySpark DataFrame API errors [spark]

Posted by "itholic (via GitHub)" <gi...@apache.org>.
itholic commented on PR #45377:
URL: https://github.com/apache/spark/pull/45377#issuecomment-2025536912

   Hi, @HyukjinKwon @MaxGekk could you take a look at the JVM side prototype when you have some time?
   
   Will add more commits for Spark Connect support and UTs.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47274][PYTHON][SQL] Provide more useful context for PySpark DataFrame API errors [spark]

Posted by "itholic (via GitHub)" <gi...@apache.org>.
itholic commented on code in PR #45377:
URL: https://github.com/apache/spark/pull/45377#discussion_r1547836649


##########
python/pyspark/errors/utils.py:
##########
@@ -119,3 +124,62 @@ def get_message_template(self, error_class: str) -> str:
             message_template = main_message_template + " " + sub_message_template
 
         return message_template
+
+
+def _capture_call_site(func_name: str) -> None:
+    """
+    Capture the call site information including file name, line number, and function name.
+
+    This function updates the thread-local storage from server side (PySparkCurrentOrigin)
+    with the current call site information when a PySpark API function is called.
+
+    Parameters
+    ----------
+    func_name : str
+        The name of the PySpark API function being captured.
+
+    Notes
+    -----
+    The call site information is used to enhance error messages with the exact location
+    in the user code that led to the error.
+    """
+    from pyspark.sql.session import SparkSession
+
+    spark = SparkSession._getActiveSessionOrCreate()
+    assert spark._jvm is not None
+
+    stack = inspect.stack()
+    frame_info = stack[-1]
+    function = func_name
+    filename = frame_info.filename
+    lineno = frame_info.lineno
+    call_site = f'"{function}" was called from\n{filename}:{lineno}'

Review Comment:
   Let me extract `fragment` from PySpark function name for consistency.
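
   A minimal sketch of one possible mapping, assuming the goal is to reuse the JVM fragment names seen in the tests (`divide`, `plus`, `minus`, `multiply`); the dict below is hypothetical:

   ```python
   # Hypothetical mapping from PySpark operator dunders to JVM-style fragment names.
   _FRAGMENTS = {
       "__truediv__": "divide",
       "__add__": "plus",
       "__sub__": "minus",
       "__mul__": "multiply",
   }

   def _to_fragment(func_name: str) -> str:
       # Fall back to the raw PySpark function name when no mapping exists.
       return _FRAGMENTS.get(func_name, func_name)
   ```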



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47274][PYTHON][SQL] Provide more useful context for PySpark DataFrame API errors [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #45377:
URL: https://github.com/apache/spark/pull/45377#discussion_r1557591426


##########
sql/core/src/main/scala/org/apache/spark/sql/package.scala:
##########
@@ -111,6 +111,26 @@ package object sql {
     }
   }
 
+  private[sql] def withOrigin[T](
+      pysparkLoggingInfo: Option[java.util.Map[String, String]] = None)(f: => T): T = {
+    if (CurrentOrigin.get.stackTrace.isDefined) {
+      f
+    } else {
+      val st = Thread.currentThread().getStackTrace

Review Comment:
   We don't need the Java stacktrace in this case. Shall we just pass Nil?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47274][PYTHON][SQL] Provide more useful context for PySpark DataFrame API errors [spark]

Posted by "itholic (via GitHub)" <gi...@apache.org>.
itholic commented on code in PR #45377:
URL: https://github.com/apache/spark/pull/45377#discussion_r1552964769


##########
sql/api/src/main/scala/org/apache/spark/sql/catalyst/trees/QueryContexts.scala:
##########
@@ -165,6 +172,20 @@ case class DataFrameQueryContext(stackTrace: Seq[StackTraceElement]) extends Que
     builder ++= " was called from\n"
     builder ++= callSite
     builder += '\n'
+
+    if (pysparkOriginInfo.nonEmpty) {
+      builder ++= "\n== PySpark call site ==\n"
+      builder ++= "\""
+
+      builder ++= pysparkFragment
+      builder ++= "\""
+      builder ++= " was called from\n"
+      builder ++= pysparkCallSite
+      builder += '\n'
+    }
+
+    PySparkCurrentOrigin.clear()

Review Comment:
   Sounds reasonable. Let me update



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47274][PYTHON][SQL] Provide more useful context for PySpark DataFrame API errors [spark]

Posted by "itholic (via GitHub)" <gi...@apache.org>.
itholic commented on PR #45377:
URL: https://github.com/apache/spark/pull/45377#issuecomment-2039207285

   Thanks @cloud-fan @xinrong-meng @ueshin for the additional comments! Just resolved all comments.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47274][PYTHON][SQL] Provide more useful context for PySpark DataFrame API errors [spark]

Posted by "itholic (via GitHub)" <gi...@apache.org>.
itholic commented on code in PR #45377:
URL: https://github.com/apache/spark/pull/45377#discussion_r1559115572


##########
sql/api/src/main/scala/org/apache/spark/sql/catalyst/trees/QueryContexts.scala:
##########
@@ -134,7 +134,9 @@ case class SQLQueryContext(
   override def callSite: String = throw SparkUnsupportedOperationException()
 }
 
-case class DataFrameQueryContext(stackTrace: Seq[StackTraceElement]) extends QueryContext {
+case class DataFrameQueryContext(
+    stackTrace: Seq[StackTraceElement],
+    pysparkLoggingInfo: Option[(String, String)]) extends QueryContext {

Review Comment:
   Sounds better. Let me update



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47274][PYTHON][SQL] Provide more useful context for PySpark DataFrame API errors [spark]

Posted by "itholic (via GitHub)" <gi...@apache.org>.
itholic commented on code in PR #45377:
URL: https://github.com/apache/spark/pull/45377#discussion_r1558766748


##########
sql/api/src/main/scala/org/apache/spark/sql/catalyst/trees/QueryContexts.scala:
##########
@@ -134,7 +134,9 @@ case class SQLQueryContext(
   override def callSite: String = throw SparkUnsupportedOperationException()
 }
 
-case class DataFrameQueryContext(stackTrace: Seq[StackTraceElement]) extends QueryContext {
+case class DataFrameQueryContext(
+    stackTrace: Seq[StackTraceElement],
+    pysparkLoggingInfo: Option[java.util.Map[String, String]]) extends QueryContext {

Review Comment:
   Got it. Let me update it. Thanks for the suggestion!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47274][PYTHON][SQL] Provide more useful context for PySpark DataFrame API errors [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #45377:
URL: https://github.com/apache/spark/pull/45377#discussion_r1558919561


##########
sql/api/src/main/scala/org/apache/spark/sql/catalyst/trees/QueryContexts.scala:
##########
@@ -134,7 +134,9 @@ case class SQLQueryContext(
   override def callSite: String = throw SparkUnsupportedOperationException()
 }
 
-case class DataFrameQueryContext(stackTrace: Seq[StackTraceElement]) extends QueryContext {
+case class DataFrameQueryContext(
+    stackTrace: Seq[StackTraceElement],
+    pysparkLoggingInfo: Option[(String, String)]) extends QueryContext {

Review Comment:
   logging info sounds a bit weird, how about `pysparkErrorContext`?



##########
sql/api/src/main/scala/org/apache/spark/sql/catalyst/trees/origin.scala:
##########
@@ -32,10 +32,11 @@ case class Origin(
     sqlText: Option[String] = None,
     objectType: Option[String] = None,
     objectName: Option[String] = None,
-    stackTrace: Option[Array[StackTraceElement]] = None) {
+    stackTrace: Option[Array[StackTraceElement]] = None,
+    pysparkLoggingInfo: Option[(String, String)] = None) {

Review Comment:
   ditto



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47274][PYTHON][SQL] Provide more useful context for PySpark DataFrame API errors [spark]

Posted by "itholic (via GitHub)" <gi...@apache.org>.
itholic commented on code in PR #45377:
URL: https://github.com/apache/spark/pull/45377#discussion_r1559135258


##########
sql/api/src/main/scala/org/apache/spark/sql/catalyst/trees/QueryContexts.scala:
##########
@@ -134,7 +134,9 @@ case class SQLQueryContext(
   override def callSite: String = throw SparkUnsupportedOperationException()
 }
 
-case class DataFrameQueryContext(stackTrace: Seq[StackTraceElement]) extends QueryContext {
+case class DataFrameQueryContext(
+    stackTrace: Seq[StackTraceElement],
+    pysparkLoggingInfo: Option[(String, String)]) extends QueryContext {

Review Comment:
   Review applied. Thanks!



##########
sql/core/src/main/scala/org/apache/spark/sql/Column.scala:
##########
@@ -171,6 +171,26 @@ class Column(val expr: Expression) extends Logging {
     Column.fn(name, this, lit(other))
   }
 
+  /**
+   * A version of the `fn` method specifically designed for binary operations in PySpark
+   * that require logging information.
+   * This method is used when the operation involves another Column.
+   *
+   * @param name               The name of the operation to be performed.
+   * @param other              The value to be used in the operation, which will be converted to a
+   *                           Column if not already one.
+   * @param pysparkLoggingInfo A map containing logging information such as the fragment and
+   *                           call site from PySpark.
+   * @return A Column resulting from the operation.
+   */
+  private def fn(
+      name: String, other: Any, pysparkLoggingInfo: java.util.ArrayList[String]): Column = {

Review Comment:
   Updated!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47274][PYTHON][SQL] Provide more useful context for PySpark DataFrame API errors [spark]

Posted by "itholic (via GitHub)" <gi...@apache.org>.
itholic commented on code in PR #45377:
URL: https://github.com/apache/spark/pull/45377#discussion_r1548983997


##########
python/pyspark/sql/tests/test_dataframe.py:
##########
@@ -825,6 +828,172 @@ def test_duplicate_field_names(self):
         self.assertEqual(df.schema, schema)
         self.assertEqual(df.collect(), data)
 
+    def test_dataframe_error_context(self):
+        # SPARK-47274: Add more useful contexts for PySpark DataFrame API errors.
+        with self.sql_conf({"spark.sql.ansi.enabled": True}):
+            df = self.spark.range(10)
+
+            # DataFrameQueryContext with pysparkCallSite - divide
+            with self.assertRaises(ArithmeticException) as pe:
+                df.withColumn("div_zero", df.id / 0).collect()
+            self.check_error(
+                exception=pe.exception,
+                error_class="DIVIDE_BY_ZERO",
+                message_parameters={"config": '"spark.sql.ansi.enabled"'},
+                query_context_type=QueryContextType.DataFrame,
+                pyspark_fragment="divide",
+            )
+
+            # DataFrameQueryContext with pysparkCallSite - plus
+            with self.assertRaises(NumberFormatException) as pe:
+                df.withColumn("plus_invalid_type", df.id + "string").collect()
+            self.check_error(
+                exception=pe.exception,
+                error_class="CAST_INVALID_INPUT",
+                message_parameters={
+                    "expression": "'string'",
+                    "sourceType": '"STRING"',
+                    "targetType": '"BIGINT"',
+                    "ansiConfig": '"spark.sql.ansi.enabled"',
+                },
+                query_context_type=QueryContextType.DataFrame,
+                pyspark_fragment="plus",
+            )
+
+            # DataFrameQueryContext with pysparkCallSite - minus
+            with self.assertRaises(NumberFormatException) as pe:
+                df.withColumn("minus_invalid_type", df.id - "string").collect()
+            self.check_error(
+                exception=pe.exception,
+                error_class="CAST_INVALID_INPUT",
+                message_parameters={
+                    "expression": "'string'",
+                    "sourceType": '"STRING"',
+                    "targetType": '"BIGINT"',
+                    "ansiConfig": '"spark.sql.ansi.enabled"',
+                },
+                query_context_type=QueryContextType.DataFrame,
+                pyspark_fragment="minus",
+            )
+
+            # DataFrameQueryContext with pysparkCallSite - multiply
+            with self.assertRaises(NumberFormatException) as pe:
+                df.withColumn("multiply_invalid_type", df.id * "string").collect()
+            self.check_error(
+                exception=pe.exception,
+                error_class="CAST_INVALID_INPUT",
+                message_parameters={
+                    "expression": "'string'",
+                    "sourceType": '"STRING"',
+                    "targetType": '"BIGINT"',
+                    "ansiConfig": '"spark.sql.ansi.enabled"',
+                },
+                query_context_type=QueryContextType.DataFrame,
+                pyspark_fragment="multiply",
+            )
+
+            # DataFrameQueryContext with pysparkCallSite - chained (`divide` is problematic)
+            with self.assertRaises(ArithmeticException) as pe:
+                df.withColumn("multiply_ten", df.id * 10).withColumn(
+                    "divide_zero", df.id / 0
+                ).withColumn("plus_ten", df.id + 10).withColumn("minus_ten", df.id - 10).collect()
+            self.check_error(
+                exception=pe.exception,
+                error_class="DIVIDE_BY_ZERO",
+                message_parameters={"config": '"spark.sql.ansi.enabled"'},
+                query_context_type=QueryContextType.DataFrame,
+                pyspark_fragment="divide",
+            )
+
+            # DataFrameQueryContext with pysparkCallSite - chained (`plus` is problematic)
+            with self.assertRaises(NumberFormatException) as pe:
+                df.withColumn("multiply_ten", df.id * 10).withColumn(
+                    "divide_ten", df.id / 10
+                ).withColumn("plus_string", df.id + "string").withColumn(
+                    "minus_ten", df.id - 10
+                ).collect()
+            self.check_error(
+                exception=pe.exception,
+                error_class="CAST_INVALID_INPUT",
+                message_parameters={
+                    "expression": "'string'",
+                    "sourceType": '"STRING"',
+                    "targetType": '"BIGINT"',
+                    "ansiConfig": '"spark.sql.ansi.enabled"',
+                },
+                query_context_type=QueryContextType.DataFrame,
+                pyspark_fragment="plus",
+            )
+
+            # DataFrameQueryContext with pysparkCallSite - chained (`minus` is problematic)
+            with self.assertRaises(NumberFormatException) as pe:
+                df.withColumn("multiply_ten", df.id * 10).withColumn(
+                    "divide_ten", df.id / 10
+                ).withColumn("plus_ten", df.id + 10).withColumn(
+                    "minus_string", df.id - "string"
+                ).collect()
+            self.check_error(
+                exception=pe.exception,
+                error_class="CAST_INVALID_INPUT",
+                message_parameters={
+                    "expression": "'string'",
+                    "sourceType": '"STRING"',
+                    "targetType": '"BIGINT"',
+                    "ansiConfig": '"spark.sql.ansi.enabled"',
+                },
+                query_context_type=QueryContextType.DataFrame,
+                pyspark_fragment="minus",
+            )
+
+            # DataFrameQueryContext with pysparkCallSite - chained (`multiply` is problematic)
+            with self.assertRaises(NumberFormatException) as pe:
+                df.withColumn("multiply_string", df.id * "string").withColumn(
+                    "divide_ten", df.id / 10
+                ).withColumn("plus_ten", df.id + 10).withColumn("minus_ten", df.id - 10).collect()
+            self.check_error(
+                exception=pe.exception,
+                error_class="CAST_INVALID_INPUT",
+                message_parameters={
+                    "expression": "'string'",
+                    "sourceType": '"STRING"',
+                    "targetType": '"BIGINT"',
+                    "ansiConfig": '"spark.sql.ansi.enabled"',
+                },
+                query_context_type=QueryContextType.DataFrame,
+                pyspark_fragment="multiply",
+            )
+
+            # DataFrameQueryContext without pysparkCallSite
+            with self.assertRaises(AnalysisException) as pe:
+                df.select("non-existing-column")
+            self.check_error(
+                exception=pe.exception,
+                error_class="UNRESOLVED_COLUMN.WITH_SUGGESTION",
+                message_parameters={"objectName": "`non-existing-column`", "proposal": "`id`"},
+                query_context_type=QueryContextType.DataFrame,
+                pyspark_fragment="",
+            )
+
+            # SQLQueryContext
+            with self.assertRaises(ArithmeticException) as pe:
+                self.spark.sql("select 10/0").collect()
+            self.check_error(
+                exception=pe.exception,
+                error_class="DIVIDE_BY_ZERO",
+                message_parameters={"config": '"spark.sql.ansi.enabled"'},
+                query_context_type=QueryContextType.SQL,
+            )
+
+            # No QueryContext
+            with self.assertRaises(AnalysisException) as pe:
+                self.spark.sql("select * from non-existing-table")
+            self.check_error(
+                exception=pe.exception,
+                error_class="INVALID_IDENTIFIER",
+                message_parameters={"ident": "non-existing-table"},
+                query_context_type=None,

Review Comment:
   FYI: `None` is the default, so we don't need to specify it like this, but I made this test explicit as an example.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47274][PYTHON][SQL] Provide more useful context for PySpark DataFrame API errors [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #45377:
URL: https://github.com/apache/spark/pull/45377#discussion_r1551367597


##########
sql/api/src/main/scala/org/apache/spark/sql/catalyst/trees/QueryContexts.scala:
##########
@@ -165,6 +172,20 @@ case class DataFrameQueryContext(stackTrace: Seq[StackTraceElement]) extends Que
     builder ++= " was called from\n"
     builder ++= callSite
     builder += '\n'
+
+    if (pysparkOriginInfo.nonEmpty) {
+      builder ++= "\n== PySpark call site ==\n"
+      builder ++= "\""
+
+      builder ++= pysparkFragment
+      builder ++= "\""
+      builder ++= " was called from\n"
+      builder ++= pysparkCallSite
+      builder += '\n'
+    }
+
+    PySparkCurrentOrigin.clear()

Review Comment:
   Can we clear it right after we get the `fragment` and `callsite` properties?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47274][PYTHON][SQL] Provide more useful context for PySpark DataFrame API errors [spark]

Posted by "itholic (via GitHub)" <gi...@apache.org>.
itholic commented on PR #45377:
URL: https://github.com/apache/spark/pull/45377#issuecomment-2033330087

   Added QueryContext testing for DataFrameQueryContext and UTs. The CI failures seem unrelated. cc @HyukjinKwon FYI


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47274][PYTHON][SQL] Provide more useful context for PySpark DataFrame API errors [spark]

Posted by "itholic (via GitHub)" <gi...@apache.org>.
itholic commented on code in PR #45377:
URL: https://github.com/apache/spark/pull/45377#discussion_r1555359203


##########
sql/core/src/main/scala/org/apache/spark/sql/Column.scala:
##########
@@ -699,6 +699,13 @@ class Column(val expr: Expression) extends Logging {
    */
   def plus(other: Any): Column = this + other
 
+  def plusWithPySparkLoggingInfo(
+      other: Any, loggingInfo: java.util.Map[String, String]): Column = {
+    withOrigin(Some(loggingInfo)) {
+      this + other
+    }
+  }

Review Comment:
   Hi, @cloud-fan I added this kind of new `Column` expression method for taking the Python call site information properly, along with extending the implementation of the existing `Origin` and `WithOrigin`.
   
   Thanks for your suggestion; the call site correctness problem now seems to be resolved, but I'm wondering whether these `{name}PySparkLoggingInfo` methods should be added to every `Column` API.
   
   As this design can change after review, I added them for only four functions - `plus`, `divide`, `minus`, and `multiply` - as a prototype for now.
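   
   For reference, a rough sketch of how the Python side could feed this prototype method through Py4J (the helper below is hypothetical and only for illustration; only `plusWithPySparkLoggingInfo` and the `fragment`/`callSite` keys come from this PR):
   
   ```python
   import inspect

   # Hypothetical helper (not in the PR): `jcol` and `jother` stand for
   # already-converted JVM column objects.
   def _plus_with_logging(jcol, jother):
       # Capture the user's call site on the Python side ...
       frame_info = inspect.stack()[-1]
       logging_info = {
           "fragment": "plus",
           "callSite": f"{frame_info.filename}:{frame_info.lineno}",
       }
       # ... and pass it along; Py4J converts the dict into the
       # java.util.Map[String, String] expected by the Scala method above.
       return jcol.plusWithPySparkLoggingInfo(jother, logging_info)
   ```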



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47274][PYTHON][SQL] Provide more useful context for PySpark DataFrame API errors [spark]

Posted by "itholic (via GitHub)" <gi...@apache.org>.
itholic commented on code in PR #45377:
URL: https://github.com/apache/spark/pull/45377#discussion_r1558758511


##########
sql/core/src/main/scala/org/apache/spark/sql/package.scala:
##########
@@ -111,6 +111,26 @@ package object sql {
     }
   }
 
+  private[sql] def withOrigin[T](
+      pysparkLoggingInfo: Option[java.util.Map[String, String]] = None)(f: => T): T = {
+    if (CurrentOrigin.get.stackTrace.isDefined) {
+      f
+    } else {
+      val st = Thread.currentThread().getStackTrace

Review Comment:
   Can we maybe keep the Java stacktrace as it is?
   
   In PySpark we also provide users an option pysparkJVMStacktraceEnabled to turn the Java stacktrace on or off, since someone may still want to see the Java stacktrace for deeper debugging.
   
   WDYT, @HyukjinKwon ?



##########
sql/core/src/main/scala/org/apache/spark/sql/package.scala:
##########
@@ -111,6 +111,26 @@ package object sql {
     }
   }
 
+  private[sql] def withOrigin[T](
+      pysparkLoggingInfo: Option[java.util.Map[String, String]] = None)(f: => T): T = {
+    if (CurrentOrigin.get.stackTrace.isDefined) {
+      f
+    } else {
+      val st = Thread.currentThread().getStackTrace

Review Comment:
   Can we maybe keep the Java stacktrace as it is?
   
   In PySpark we also provide users an option `pysparkJVMStacktraceEnabled` to turn the Java stacktrace on or off, since someone may still want to see the Java stacktrace for deeper debugging.
   
   WDYT, @HyukjinKwon ?
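   
   As a minimal sketch of the user-facing side (assuming the `spark.sql.pyspark.jvmStacktrace.enabled` config is the one backing this option; worth double-checking the exact key):
   
   ```python
   from pyspark.sql import SparkSession

   spark = SparkSession.builder.getOrCreate()

   # Show the full JVM stacktrace alongside the Python-friendly error message.
   spark.conf.set("spark.sql.pyspark.jvmStacktrace.enabled", True)

   # Hide it again once the PySpark-side context is enough for debugging.
   spark.conf.set("spark.sql.pyspark.jvmStacktrace.enabled", False)
   ```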



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47274][PYTHON][SQL] Provide more useful context for PySpark DataFrame API errors [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #45377:
URL: https://github.com/apache/spark/pull/45377#discussion_r1558760260


##########
sql/core/src/main/scala/org/apache/spark/sql/package.scala:
##########
@@ -111,6 +111,26 @@ package object sql {
     }
   }
 
+  private[sql] def withOrigin[T](
+      pysparkLoggingInfo: Option[java.util.Map[String, String]] = None)(f: => T): T = {
+    if (CurrentOrigin.get.stackTrace.isDefined) {
+      f
+    } else {
+      val st = Thread.currentThread().getStackTrace

Review Comment:
   I see



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47274][PYTHON][SQL] Provide more useful context for PySpark DataFrame API errors [spark]

Posted by "itholic (via GitHub)" <gi...@apache.org>.
itholic commented on code in PR #45377:
URL: https://github.com/apache/spark/pull/45377#discussion_r1557615484


##########
python/pyspark/sql/column.py:
##########
@@ -174,16 +175,48 @@ def _bin_op(
     ["Column", Union["Column", "LiteralType", "DecimalLiteral", "DateTimeLiteral"]], "Column"
 ]:
     """Create a method for given binary operator"""
+    binary_operator_map = {
+        "plus": "+",
+        "minus": "-",
+        "divide": "/",
+        "multiply": "*",
+        "mod": "%",
+        "equalTo": "=",
+        "lt": "<",
+        "leq": "<=",
+        "geq": ">=",
+        "gt": ">",
+        "eqNullSafe": "<=>",
+        "bitwiseOR": "|",
+        "bitwiseAND": "&",
+        "bitwiseXOR": "^",
+        # Just following JVM rule even if the names of source and target are the same.
+        "and": "and",
+        "or": "or",
+    }
 
     def _(
         self: "Column",
         other: Union["Column", "LiteralType", "DecimalLiteral", "DateTimeLiteral"],
     ) -> "Column":
+        logging_info = {}
+        if name in binary_operator_map:
+            stack = inspect.stack()
+            frame_info = stack[-1]

Review Comment:
   Sure, let me address it



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47274][PYTHON][SQL] Provide more useful context for PySpark DataFrame API errors [spark]

Posted by "itholic (via GitHub)" <gi...@apache.org>.
itholic commented on code in PR #45377:
URL: https://github.com/apache/spark/pull/45377#discussion_r1552967079


##########
python/pyspark/errors/exceptions/captured.py:
##########
@@ -379,5 +379,13 @@ def fragment(self) -> str:
     def callSite(self) -> str:
         return str(self._q.callSite())
 
+    def pysparkFragment(self) -> Optional[str]:  # type: ignore[return]
+        if self.contextType() == QueryContextType.DataFrame:
+            return str(self._q.pysparkFragment())
+
+    def pysparkCallSite(self) -> Optional[str]:  # type: ignore[return]

Review Comment:
   > By any chance could we comment what "fragment" and "callSite" are at PySparkCurrentOrigin?
   
   Sounds good. Let me add a comment



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [WIP][SPARK-47274][PYTHON][SQL][CONNECT] Provide more useful context for PySpark DataFrame API errors [spark]

Posted by "itholic (via GitHub)" <gi...@apache.org>.
itholic commented on code in PR #45377:
URL: https://github.com/apache/spark/pull/45377#discussion_r1543112851


##########
python/pyspark/errors/utils.py:
##########
@@ -15,12 +15,22 @@
 # limitations under the License.
 #
 
+import builtins
 import re
-from typing import Dict, Match
+import functools
+import inspect
+import threading
+from typing import Any, Callable, Dict, Match, TypeVar, Type
+
+from IPython import get_ipython

Review Comment:
   Let me do the Notebook support as a follow-up, or after completing the initial prototype reviews.
   
   I think we may need a separate discussion about making `IPython` a project requirement.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47274][PYTHON][SQL] Provide more useful context for PySpark DataFrame API errors [spark]

Posted by "itholic (via GitHub)" <gi...@apache.org>.
itholic commented on code in PR #45377:
URL: https://github.com/apache/spark/pull/45377#discussion_r1549517320


##########
python/pyspark/sql/column.py:
##########
@@ -195,6 +197,7 @@ def _(self: "Column", other: Union["LiteralType", "DecimalLiteral"]) -> "Column"
     return _
 
 
+@with_origin_to_class
 class Column:

Review Comment:
   I believe so.
   
   I know there is some additional handling for a couple of `DataFrameStatFunctions` such as `approxQuantile`, `cov`, `corr`, `crosstab`, etc., but unlike the Column API they raise errors immediately from PySpark, so I think it should be fine.
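   
   As a concrete illustration (hedged; the exact error class may differ by version), these stat functions validate their arguments on the Python side, so nothing reaches the JVM:
   
   ```python
   from pyspark.sql import SparkSession

   spark = SparkSession.builder.getOrCreate()
   df = spark.range(10)

   try:
       # The negative relativeError is rejected on the Python side before any
       # JVM call happens, so no DataFrame query context is involved.
       df.stat.approxQuantile("id", [0.5], -1.0)
   except ValueError as e:
       print(e)
   ```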



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47274][PYTHON][SQL] Provide more useful context for PySpark DataFrame API errors [spark]

Posted by "itholic (via GitHub)" <gi...@apache.org>.
itholic commented on code in PR #45377:
URL: https://github.com/apache/spark/pull/45377#discussion_r1549475904


##########
python/pyspark/errors/utils.py:
##########
@@ -119,3 +124,61 @@ def get_message_template(self, error_class: str) -> str:
             message_template = main_message_template + " " + sub_message_template
 
         return message_template
+
+
+def _capture_call_site(fragment: str) -> None:
+    """
+    Capture the call site information including file name, line number, and function name.
+
+    This function updates the thread-local storage from server side (PySparkCurrentOrigin)
+    with the current call site information when a PySpark API function is called.
+
+    Parameters
+    ----------
+    func_name : str
+        The name of the PySpark API function being captured.
+
+    Notes
+    -----
+    The call site information is used to enhance error messages with the exact location
+    in the user code that led to the error.
+    """
+    from pyspark.sql.session import SparkSession

Review Comment:
   That causes a circular import error, so I placed it here 😢 
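   
   For reference, deferring the import into the function body is the standard way to break such a cycle; a minimal sketch of the pattern (the body below is illustrative, not the exact implementation in this PR):
   
   ```python
   def _capture_call_site(fragment: str) -> None:
       # Importing here instead of at module level breaks the cycle:
       # pyspark.sql.session transitively imports pyspark.errors at import
       # time, so a top-level import in this module would be circular.
       from pyspark.sql.session import SparkSession

       spark = SparkSession.getActiveSession()
       if spark is None:
           return
       # ... capture the user's file name / line number and hand it to the JVM.
   ```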



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47274][PYTHON][SQL] Provide more useful context for PySpark DataFrame API errors [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #45377:
URL: https://github.com/apache/spark/pull/45377#discussion_r1551358514


##########
python/pyspark/errors/exceptions/captured.py:
##########
@@ -379,5 +379,13 @@ def fragment(self) -> str:
     def callSite(self) -> str:
         return str(self._q.callSite())
 
+    def pysparkFragment(self) -> Optional[str]:  # type: ignore[return]
+        if self.contextType() == QueryContextType.DataFrame:
+            return str(self._q.pysparkFragment())
+
+    def pysparkCallSite(self) -> Optional[str]:  # type: ignore[return]

Review Comment:
   how are these two different? The given example only shows the call site
   ```
   == PySpark call site ==
   "divide" was called from
   /.../spark/python/test_pyspark_error.py:8
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47274][PYTHON][SQL] Provide more useful context for PySpark DataFrame API errors [spark]

Posted by "itholic (via GitHub)" <gi...@apache.org>.
itholic commented on code in PR #45377:
URL: https://github.com/apache/spark/pull/45377#discussion_r1548983997


##########
python/pyspark/sql/tests/test_dataframe.py:
##########
@@ -825,6 +828,172 @@ def test_duplicate_field_names(self):
         self.assertEqual(df.schema, schema)
         self.assertEqual(df.collect(), data)
 
+    def test_dataframe_error_context(self):
+        # SPARK-47274: Add more useful contexts for PySpark DataFrame API errors.
+        with self.sql_conf({"spark.sql.ansi.enabled": True}):
+            df = self.spark.range(10)
+
+            # DataFrameQueryContext with pysparkCallSite - divide
+            with self.assertRaises(ArithmeticException) as pe:
+                df.withColumn("div_zero", df.id / 0).collect()
+            self.check_error(
+                exception=pe.exception,
+                error_class="DIVIDE_BY_ZERO",
+                message_parameters={"config": '"spark.sql.ansi.enabled"'},
+                query_context_type=QueryContextType.DataFrame,
+                pyspark_fragment="divide",
+            )
+
+            # DataFrameQueryContext with pysparkCallSite - plus
+            with self.assertRaises(NumberFormatException) as pe:
+                df.withColumn("plus_invalid_type", df.id + "string").collect()
+            self.check_error(
+                exception=pe.exception,
+                error_class="CAST_INVALID_INPUT",
+                message_parameters={
+                    "expression": "'string'",
+                    "sourceType": '"STRING"',
+                    "targetType": '"BIGINT"',
+                    "ansiConfig": '"spark.sql.ansi.enabled"',
+                },
+                query_context_type=QueryContextType.DataFrame,
+                pyspark_fragment="plus",
+            )
+
+            # DataFrameQueryContext with pysparkCallSite - minus
+            with self.assertRaises(NumberFormatException) as pe:
+                df.withColumn("minus_invalid_type", df.id - "string").collect()
+            self.check_error(
+                exception=pe.exception,
+                error_class="CAST_INVALID_INPUT",
+                message_parameters={
+                    "expression": "'string'",
+                    "sourceType": '"STRING"',
+                    "targetType": '"BIGINT"',
+                    "ansiConfig": '"spark.sql.ansi.enabled"',
+                },
+                query_context_type=QueryContextType.DataFrame,
+                pyspark_fragment="minus",
+            )
+
+            # DataFrameQueryContext with pysparkCallSite - multiply
+            with self.assertRaises(NumberFormatException) as pe:
+                df.withColumn("multiply_invalid_type", df.id * "string").collect()
+            self.check_error(
+                exception=pe.exception,
+                error_class="CAST_INVALID_INPUT",
+                message_parameters={
+                    "expression": "'string'",
+                    "sourceType": '"STRING"',
+                    "targetType": '"BIGINT"',
+                    "ansiConfig": '"spark.sql.ansi.enabled"',
+                },
+                query_context_type=QueryContextType.DataFrame,
+                pyspark_fragment="multiply",
+            )
+
+            # DataFrameQueryContext with pysparkCallSite - chained (`divide` is problematic)
+            with self.assertRaises(ArithmeticException) as pe:
+                df.withColumn("multiply_ten", df.id * 10).withColumn(
+                    "divide_zero", df.id / 0
+                ).withColumn("plus_ten", df.id + 10).withColumn("minus_ten", df.id - 10).collect()
+            self.check_error(
+                exception=pe.exception,
+                error_class="DIVIDE_BY_ZERO",
+                message_parameters={"config": '"spark.sql.ansi.enabled"'},
+                query_context_type=QueryContextType.DataFrame,
+                pyspark_fragment="divide",
+            )
+
+            # DataFrameQueryContext with pysparkCallSite - chained (`plus` is problematic)
+            with self.assertRaises(NumberFormatException) as pe:
+                df.withColumn("multiply_ten", df.id * 10).withColumn(
+                    "divide_ten", df.id / 10
+                ).withColumn("plus_string", df.id + "string").withColumn(
+                    "minus_ten", df.id - 10
+                ).collect()
+            self.check_error(
+                exception=pe.exception,
+                error_class="CAST_INVALID_INPUT",
+                message_parameters={
+                    "expression": "'string'",
+                    "sourceType": '"STRING"',
+                    "targetType": '"BIGINT"',
+                    "ansiConfig": '"spark.sql.ansi.enabled"',
+                },
+                query_context_type=QueryContextType.DataFrame,
+                pyspark_fragment="plus",
+            )
+
+            # DataFrameQueryContext with pysparkCallSite - chained (`minus` is problematic)
+            with self.assertRaises(NumberFormatException) as pe:
+                df.withColumn("multiply_ten", df.id * 10).withColumn(
+                    "divide_ten", df.id / 10
+                ).withColumn("plus_ten", df.id + 10).withColumn(
+                    "minus_string", df.id - "string"
+                ).collect()
+            self.check_error(
+                exception=pe.exception,
+                error_class="CAST_INVALID_INPUT",
+                message_parameters={
+                    "expression": "'string'",
+                    "sourceType": '"STRING"',
+                    "targetType": '"BIGINT"',
+                    "ansiConfig": '"spark.sql.ansi.enabled"',
+                },
+                query_context_type=QueryContextType.DataFrame,
+                pyspark_fragment="minus",
+            )
+
+            # DataFrameQueryContext with pysparkCallSite - chained (`multiply` is problematic)
+            with self.assertRaises(NumberFormatException) as pe:
+                df.withColumn("multiply_string", df.id * "string").withColumn(
+                    "divide_ten", df.id / 10
+                ).withColumn("plus_ten", df.id + 10).withColumn("minus_ten", df.id - 10).collect()
+            self.check_error(
+                exception=pe.exception,
+                error_class="CAST_INVALID_INPUT",
+                message_parameters={
+                    "expression": "'string'",
+                    "sourceType": '"STRING"',
+                    "targetType": '"BIGINT"',
+                    "ansiConfig": '"spark.sql.ansi.enabled"',
+                },
+                query_context_type=QueryContextType.DataFrame,
+                pyspark_fragment="multiply",
+            )
+
+            # DataFrameQueryContext without pysparkCallSite
+            with self.assertRaises(AnalysisException) as pe:
+                df.select("non-existing-column")
+            self.check_error(
+                exception=pe.exception,
+                error_class="UNRESOLVED_COLUMN.WITH_SUGGESTION",
+                message_parameters={"objectName": "`non-existing-column`", "proposal": "`id`"},
+                query_context_type=QueryContextType.DataFrame,
+                pyspark_fragment="",
+            )
+
+            # SQLQueryContext
+            with self.assertRaises(ArithmeticException) as pe:
+                self.spark.sql("select 10/0").collect()
+            self.check_error(
+                exception=pe.exception,
+                error_class="DIVIDE_BY_ZERO",
+                message_parameters={"config": '"spark.sql.ansi.enabled"'},
+                query_context_type=QueryContextType.SQL,
+            )
+
+            # No QueryContext
+            with self.assertRaises(AnalysisException) as pe:
+                self.spark.sql("select * from non-existing-table")
+            self.check_error(
+                exception=pe.exception,
+                error_class="INVALID_IDENTIFIER",
+                message_parameters={"ident": "non-existing-table"},
+                query_context_type=None,

Review Comment:
   FYI: `None` is the default, so we don't need to specify it like this when no `QueryContext` exists, but I made this test explicit as an example.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47274][PYTHON][SQL] Provide more useful context for PySpark DataFrame API errors [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #45377:
URL: https://github.com/apache/spark/pull/45377#discussion_r1558761829


##########
sql/core/src/main/scala/org/apache/spark/sql/package.scala:
##########
@@ -111,6 +111,26 @@ package object sql {
     }
   }
 
+  private[sql] def withOrigin[T](
+      pysparkLoggingInfo: Option[java.util.Map[String, String]] = None)(f: => T): T = {
+    if (CurrentOrigin.get.stackTrace.isDefined) {
+      f
+    } else {
+      val st = Thread.currentThread().getStackTrace

Review Comment:
   Can we add a method to avoid duplicated code of capturing java stacktraces?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47274][PYTHON][SQL] Provide more useful context for PySpark DataFrame API errors [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #45377:
URL: https://github.com/apache/spark/pull/45377#discussion_r1565491650


##########
sql/core/src/main/scala/org/apache/spark/sql/Column.scala:
##########
@@ -171,6 +171,29 @@ class Column(val expr: Expression) extends Logging {
     Column.fn(name, this, lit(other))
   }
 
+  /**
+   * A version of the `fn` method specifically designed for binary operations in PySpark
+   * that require logging information.
+   * This method is used when the operation involves another Column.
+   *
+   * @param name                The name of the operation to be performed.
+   * @param other               The value to be used in the operation, which will be converted to a
+   *                            Column if not already one.
+   * @param pysparkFragment     A string representing the 'fragment' of the PySpark error context,
+   *                            typically indicates the name of PySpark function.
+   * @param pysparkCallSite     A string representing the 'callSite' of the PySpark error context,
+   *                            providing the exact location within the PySpark code where the
+   *                            operation originated.
+   * @return A Column resulting from the operation.
+   */
+  private def fn(

Review Comment:
   @HyukjinKwon This probably can't cover all the cases, and we may need to add more overloads for certain functions that require non-expression parameters, but there shouldn't be many.
   
   I think it's better than using a ThreadLocal, which can be quite fragile for passing values between Python and the JVM.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47274][PYTHON][SQL] Provide more useful context for PySpark DataFrame API errors [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on PR #45377:
URL: https://github.com/apache/spark/pull/45377#issuecomment-2056191793

   Let's clarify why https://github.com/apache/spark/pull/45377#issuecomment-2041315119 happens before we move further. That shouldn't happen, from my understanding.
   
   If we go with the current approach, it would need more changes, e.g., to `Column.substr`, because it takes a different set of arguments and types.
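   
   For reference, `Column.substr` accepts either plain ints or Columns for its start/length parameters, so a call-site-aware variant would need extra signatures on top of the single binary-operator overload; e.g.:
   
   ```python
   from pyspark.sql import SparkSession
   from pyspark.sql.functions import lit

   spark = SparkSession.builder.getOrCreate()
   df = spark.createDataFrame([("Spark",)], ["name"])

   # Both forms are valid: the (int, int) and (Column, Column) variants map to
   # different JVM-side signatures, unlike the binary operators handled so far.
   df.select(df.name.substr(1, 3)).show()
   df.select(df.name.substr(lit(1), lit(3))).show()
   ```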


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47274][PYTHON][SQL] Provide more useful context for PySpark DataFrame API errors [spark]

Posted by "itholic (via GitHub)" <gi...@apache.org>.
itholic commented on PR #45377:
URL: https://github.com/apache/spark/pull/45377#issuecomment-2058148979

   Because I called `PySparkCurrentOrigin` directly in `DataFrameQueryContext` without utilizing `withOrigin` in the initial implementation. I realized it from a recent review on the refactoring PR, so I'm currently trying to reintroduce `PySparkCurrentOrigin` there.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47274][PYTHON][SQL] Provide more useful context for PySpark DataFrame API errors [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #45377:
URL: https://github.com/apache/spark/pull/45377#discussion_r1557589150


##########
sql/api/src/main/scala/org/apache/spark/sql/catalyst/trees/QueryContexts.scala:
##########
@@ -134,7 +134,9 @@ case class SQLQueryContext(
   override def callSite: String = throw SparkUnsupportedOperationException()
 }
 
-case class DataFrameQueryContext(stackTrace: Seq[StackTraceElement]) extends QueryContext {
+case class DataFrameQueryContext(
+    stackTrace: Seq[StackTraceElement],
+    pysparkLoggingInfo: Option[java.util.Map[String, String]]) extends QueryContext {

Review Comment:
   why is it a map instead of `(String, String)`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47274][PYTHON][SQL] Provide more useful context for PySpark DataFrame API errors [spark]

Posted by "itholic (via GitHub)" <gi...@apache.org>.
itholic commented on code in PR #45377:
URL: https://github.com/apache/spark/pull/45377#discussion_r1556624078


##########
sql/core/src/main/scala/org/apache/spark/sql/Column.scala:
##########
@@ -699,6 +699,13 @@ class Column(val expr: Expression) extends Logging {
    */
   def plus(other: Any): Column = this + other
 
+  def plusWithPySparkLoggingInfo(
+      other: Any, loggingInfo: java.util.Map[String, String]): Column = {
+    withOrigin(Some(loggingInfo)) {
+      this + other
+    }
+  }

Review Comment:
   Sounds good, let me give it a shot. Thanks for sharing the insight!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [WIP][SPARK-47274][PYTHON][CONNECT] Provide more useful context for PySpark DataFrame API errors [spark]

Posted by "itholic (via GitHub)" <gi...@apache.org>.
itholic commented on PR #45377:
URL: https://github.com/apache/spark/pull/45377#issuecomment-1977771911

   On second thought, we need to keep the previous transformation stacktrace to provide more accurate context.
   
   Will push more commits to update it.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [WIP][SPARK-47274][PYTHON][CONNECT] Provide more useful context for PySpark DataFrame API errors [spark]

Posted by "itholic (via GitHub)" <gi...@apache.org>.
itholic commented on PR #45377:
URL: https://github.com/apache/spark/pull/45377#issuecomment-1977737988

   cc @HyukjinKwon FYI, I'm still working on Spark Connect support and unit tests but the basic structure is ready for review.
   
   FYI, also cc @MaxGekk as you made a similar contribution on the JVM side.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47274][PYTHON][SQL] Provide more useful context for PySpark DataFrame API errors [spark]

Posted by "ueshin (via GitHub)" <gi...@apache.org>.
ueshin commented on PR #45377:
URL: https://github.com/apache/spark/pull/45377#issuecomment-2041273571

   I'm afraid I still see a weird behavior:
   
   ```py
   >>> spark.conf.set("spark.sql.ansi.enabled", True)
   >>> df = spark.range(10)
   >>> a = df.id / 10
   >>> b = df.id / 0
   >>>
   >>> df.select(
   ...   a,
   ...   df.id + 4,
   ...   b,
   ...   df.id * 5
   ... ).show()
   pyspark.errors.exceptions.captured.ArithmeticException: [DIVIDE_BY_ZERO] Division by zero. Use `try_divide` to tolerate divisor being 0 and return NULL instead. If necessary set "spark.sql.ansi.enabled" to "false" to bypass this error. SQLSTATE: 22012
   == DataFrame ==
   "plus" was called from
   <stdin>:3
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47274][PYTHON][SQL] Provide more useful context for PySpark DataFrame API errors [spark]

Posted by "itholic (via GitHub)" <gi...@apache.org>.
itholic commented on PR #45377:
URL: https://github.com/apache/spark/pull/45377#issuecomment-2041313068

   Hmm... I'm facing a problem resolving this case.
   
   PySpark provides the call site logs to the JVM at the time an expression is declared,
   but the actual execution order on the JVM side can differ from the declaration order.
   
   For example, when running the example @ueshin provided:
   
   ```python
     1 spark.conf.set("spark.sql.ansi.enabled", True)
     2 df = spark.range(10)
     3 a = df.id / 10
     4 b = df.id / 0
     5 df.select(
     6   a,
     7   df.id + 4,
     8   b,
     9   df.id * 5
    10 ).show()
   ```
   
   Internally, the logging is processed as below:
   
   ```python
   # Call sites logged from Python to the JVM when the expressions are defined, in order:
   1: ("divide", "/test.py:3")
   2: ("divide", "/test.py:4")
   3: ("plus", "/test.py:7")
   4: ("multiply", "/test.py:9")
   
   # But the JVM can analyze the expressions in a different order from the Python definition order:
   1: ArraySeq(org.apache.spark.sql.Column.divide(Column.scala:790), java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method))
   2: ArraySeq(org.apache.spark.sql.Column.plus(Column.scala:700), java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method))
   3: ArraySeq(org.apache.spark.sql.Column.divide(Column.scala:790), java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method))
   4: ArraySeq(org.apache.spark.sql.Column.multiply(Column.scala:760), java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method))
   ```
   
   To solve this problem,
   I think Python and the JVM must share a "key" that can uniquely identify each expression, both in `PySparkCurrentOrigin` and in the JVM stackTrace captured at the time the expression is declared.
   However, I can't think of a good way to make this possible at the moment.
   
   @ueshin @HyukjinKwon @cloud-fan could you please advise if there happens to be a good way to make this possible?
   
   Alternatively, the workaround that comes to mind is to provide additional information in the message:
   if it cannot indicate the exact call site, it would output the candidate call sites where the actual error may have occurred.
   For example:
   
   ```python
   >>> spark.conf.set("spark.sql.ansi.enabled", True)
   >>> df = spark.range(10)
   >>> a = df.id / 10
   >>> b = df.id / 0
   >>>
   >>> df.select(
   ...   a,
   ...   df.id + 4,
   ...   b,
   ...   df.id * 5
   ... ).show()
   pyspark.errors.exceptions.captured.ArithmeticException: [DIVIDE_BY_ZERO] Division by zero. Use `try_divide` to tolerate divisor being 0 and return NULL instead. If necessary set "spark.sql.ansi.enabled" to "false" to bypass this error. SQLSTATE: 22012
   == DataFrame ==
   "plus" was called from
   <stdin>:3
   
   
   ```
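   
   A purely illustrative sketch of what such a shared "key" could look like on the Python side (hypothetical names, not part of this PR): each expression would get a unique id at declaration time, and the same id would have to be attached to the JVM-side record so the two can be matched later regardless of analysis order.
   
   ```python
   import inspect
   import uuid
   from typing import Dict, Tuple

   # Hypothetical registry keyed by a per-expression id.
   _callsite_registry: Dict[str, Tuple[str, str]] = {}

   def register_call_site(fragment: str) -> str:
       """Record (fragment, "file:line") under a fresh key and return the key.

       The same key would need to travel with the expression to the JVM so
       that the error context can look up the right entry, regardless of the
       order in which the JVM analyzes the expressions.
       """
       frame = inspect.stack()[-1]
       key = uuid.uuid4().hex
       _callsite_registry[key] = (fragment, f"{frame.filename}:{frame.lineno}")
       return key
   ```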


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47274][PYTHON][SQL] Provide more useful context for PySpark DataFrame API errors [spark]

Posted by "ueshin (via GitHub)" <gi...@apache.org>.
ueshin commented on PR #45377:
URL: https://github.com/apache/spark/pull/45377#issuecomment-2038202349

   When calling `df.select` with multiple expressions, both `fragment` and `callSite` seem to be different from Scala's.
   
   ```py
   >>> spark.conf.set("spark.sql.ansi.enabled", True)
   >>> df = spark.range(10)
   >>> df.select(
   ...   df.id / 10,
   ...   df.id + 4,
   ...   df.id / 0,
   ...   df.id * 5
   ... ).show()
   pyspark.errors.exceptions.captured.ArithmeticException: [DIVIDE_BY_ZERO] Division by zero. Use `try_divide` to tolerate divisor being 0 and return NULL instead. If necessary set "spark.sql.ansi.enabled" to "false" to bypass this error. SQLSTATE: 22012
   == DataFrame ==
   "divide" was called from
   java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   
   == PySpark call site ==
   "multiply" was called from
   <stdin>:5
   ```
   
   whereas:
   
   ```scala
   scala> df.select(
        |   df("id") / 10,
        |   df("id") + 4,
        |   df("id") / 0,
        |   df("id") * 5
        | ).show()
   org.apache.spark.SparkArithmeticException: [DIVIDE_BY_ZERO] Division by zero. Use `try_divide` to tolerate divisor being 0 and return NULL instead. If necessary set "spark.sql.ansi.enabled" to "false" to bypass this error. SQLSTATE: 22012
   == DataFrame ==
   "div" was called from
   <init>(<console>:4)
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47274][PYTHON][SQL] Provide more useful context for PySpark DataFrame API errors [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on code in PR #45377:
URL: https://github.com/apache/spark/pull/45377#discussion_r1548812966


##########
python/pyspark/errors/exceptions/captured.py:
##########
@@ -379,5 +379,17 @@ def fragment(self) -> str:
     def callSite(self) -> str:
         return str(self._q.callSite())
 
+    def pysparkFragment(self) -> Optional[str]:
+        if self.contextType() == QueryContextType.DataFrame:
+            return str(self._q.pysparkFragment())
+        else:
+            return None

Review Comment:
   let's remove this.
   ```suggestion
   ```



##########
python/pyspark/errors/exceptions/captured.py:
##########
@@ -379,5 +379,17 @@ def fragment(self) -> str:
     def callSite(self) -> str:
         return str(self._q.callSite())
 
+    def pysparkFragment(self) -> Optional[str]:
+        if self.contextType() == QueryContextType.DataFrame:
+            return str(self._q.pysparkFragment())
+        else:
+            return None
+
+    def pysparkCallSite(self) -> Optional[str]:
+        if self.contextType() == QueryContextType.DataFrame:
+            return str(self._q.pysparkCallSite())
+        else:
+            return None

Review Comment:
   ```suggestion
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47274][PYTHON][SQL] Provide more useful context for PySpark DataFrame API errors [spark]

Posted by "itholic (via GitHub)" <gi...@apache.org>.
itholic commented on code in PR #45377:
URL: https://github.com/apache/spark/pull/45377#discussion_r1557621216


##########
sql/api/src/main/scala/org/apache/spark/sql/catalyst/trees/QueryContexts.scala:
##########
@@ -134,7 +134,9 @@ case class SQLQueryContext(
   override def callSite: String = throw SparkUnsupportedOperationException()
 }
 
-case class DataFrameQueryContext(stackTrace: Seq[StackTraceElement]) extends QueryContext {
+case class DataFrameQueryContext(
+    stackTrace: Seq[StackTraceElement],
+    pysparkLoggingInfo: Option[java.util.Map[String, String]]) extends QueryContext {

Review Comment:
   Because we're currently using a `dict` for the PySpark logging info, and it is internally converted into a `Map` by `Py4J`:
   
   ```python
               logging_info = {
                   "fragment": name,
                   "callSite": f"{frame_info.filename}:{frame_info.lineno}",
               }
   ```
   
   Is there any particular reason we should use `Option[(String, String)]` instead?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47274][PYTHON][SQL] Provide more useful context for PySpark DataFrame API errors [spark]

Posted by "itholic (via GitHub)" <gi...@apache.org>.
itholic commented on code in PR #45377:
URL: https://github.com/apache/spark/pull/45377#discussion_r1558765519


##########
sql/core/src/main/scala/org/apache/spark/sql/package.scala:
##########
@@ -111,6 +111,26 @@ package object sql {
     }
   }
 
+  private[sql] def withOrigin[T](
+      pysparkLoggingInfo: Option[java.util.Map[String, String]] = None)(f: => T): T = {
+    if (CurrentOrigin.get.stackTrace.isDefined) {
+      f
+    } else {
+      val st = Thread.currentThread().getStackTrace

Review Comment:
   > Can we add a method to avoid duplicated code of capturing java stacktraces?
   
   Sure, will address it



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47274][PYTHON][SQL] Provide more useful context for PySpark DataFrame API errors [spark]

Posted by "itholic (via GitHub)" <gi...@apache.org>.
itholic commented on code in PR #45377:
URL: https://github.com/apache/spark/pull/45377#discussion_r1558755161


##########
sql/core/src/main/scala/org/apache/spark/sql/package.scala:
##########
@@ -111,6 +111,26 @@ package object sql {
     }
   }
 
+  private[sql] def withOrigin[T](
+      pysparkLoggingInfo: Option[java.util.Map[String, String]] = None)(f: => T): T = {
+    if (CurrentOrigin.get.stackTrace.isDefined) {
+      f
+    } else {
+      val st = Thread.currentThread().getStackTrace

Review Comment:
   Can we keep the Java stacktrace as it is?
   
   In PySpark we also provide users an option `pysparkJVMStacktraceEnabled` to turn the Java stacktrace on or off, since someone may still want to see the Java stacktrace for deeper debugging.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47274][PYTHON][SQL] Provide more useful context for PySpark DataFrame API errors [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #45377:
URL: https://github.com/apache/spark/pull/45377#discussion_r1558765078


##########
sql/api/src/main/scala/org/apache/spark/sql/catalyst/trees/QueryContexts.scala:
##########
@@ -134,7 +134,9 @@ case class SQLQueryContext(
   override def callSite: String = throw SparkUnsupportedOperationException()
 }
 
-case class DataFrameQueryContext(stackTrace: Seq[StackTraceElement]) extends QueryContext {
+case class DataFrameQueryContext(
+    stackTrace: Seq[StackTraceElement],
+    pysparkLoggingInfo: Option[java.util.Map[String, String]]) extends QueryContext {

Review Comment:
   it's more precise to use two strings instead of a map.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47274][PYTHON][SQL] Provide more useful context for PySpark DataFrame API errors [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan closed pull request #45377: [SPARK-47274][PYTHON][SQL] Provide more useful context for PySpark DataFrame API errors
URL: https://github.com/apache/spark/pull/45377


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47274][PYTHON][SQL] Provide more useful context for PySpark DataFrame API errors [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on PR #45377:
URL: https://github.com/apache/spark/pull/45377#issuecomment-2048784678

   thanks, merging to master!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [WIP][SPARK-47274][PYTHON][SQL][CONNECT] Provide more useful context for PySpark DataFrame API errors [spark]

Posted by "itholic (via GitHub)" <gi...@apache.org>.
itholic commented on code in PR #45377:
URL: https://github.com/apache/spark/pull/45377#discussion_r1511983720


##########
python/pyspark/errors/utils.py:
##########
@@ -119,3 +127,73 @@ def get_message_template(self, error_class: str) -> str:
             message_template = main_message_template + " " + sub_message_template
 
         return message_template
+
+
+def is_builtin_exception(e: BaseException) -> bool:
+    """
+    Check if the given exception is a builtin exception or not
+    """
+    builtin_exceptions = [
+        exc
+        for name, exc in vars(builtins).items()
+        if isinstance(exc, type) and issubclass(exc, BaseException)
+    ]
+    return isinstance(e, tuple(builtin_exceptions))
+
+
+def add_error_context(func: Callable[..., Any]) -> Callable[..., Any]:
+    """
+    A decorator that captures PySpark exceptions occurring during the function execution,
+    and adds user code location information to the exception message.
+    """
+
+    @functools.wraps(func)
+    def wrapper(*args: Any, **kwargs: Any) -> Any:
+        try:
+            return func(*args, **kwargs)
+        except Exception as e:
+            from pyspark.errors import PySparkException
+            from pyspark.errors.exceptions.captured import CapturedException
+
+            inspect_stack = inspect.stack()
+            # Stack location is different when Python running on IPython (e.g. Jupyter Notebook)
+            user_code_space = inspect_stack[-1] if get_ipython() is None else inspect_stack[1]

Review Comment:
   This PR also covers errors generated from a Notebook. For example:
   
   <img width="1204" alt="Screenshot 2024-03-04 at 3 41 28 PM" src="https://github.com/apache/spark/assets/44108233/a713d8a9-097b-455f-b3b4-c5d18185b25f">
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org