Posted to reviews@spark.apache.org by "MaxGekk (via GitHub)" <gi...@apache.org> on 2023/10/11 19:24:36 UTC

[PR] [WIP][SPARK-45022][SQL] Provide context for dataset API errors [spark]

MaxGekk opened a new pull request, #43334:
URL: https://github.com/apache/spark/pull/43334

   ### What changes were proposed in this pull request?
   This PR captures the Dataset APIs used by the user code, together with their call sites in the user code, and uses them to provide better error messages.
   
   E.g. consider the following Spark app `SimpleApp.scala`:
   ```scala
      1  import org.apache.spark.sql.SparkSession
      2  import org.apache.spark.sql.functions._
      3
      4  object SimpleApp {
      5    def main(args: Array[String]) {
      6      val spark = SparkSession.builder.appName("Simple Application").config("spark.sql.ansi.enabled", true).getOrCreate()
      7      import spark.implicits._
      8
      9      val c = col("a") / col("b")
     10
     11      Seq((1, 0)).toDF("a", "b").select(c).show()
     12
     13      spark.stop()
     14    }
     15  }
   ``` 
   
   After this PR, the error message contains the error context (which Spark Dataset API was called, and from where in the user code) in the following form:
   ```
   Exception in thread "main" org.apache.spark.SparkArithmeticException: [DIVIDE_BY_ZERO] Division by zero. Use `try_divide` to tolerate divisor being 0 and return NULL instead. If necessary set "spark.sql.ansi.enabled" to "false" to bypass this error.
   == Dataset ==
   "div" was called from SimpleApp$.main(SimpleApp.scala:9)
   
   	at org.apache.spark.sql.errors.QueryExecutionErrors$.divideByZeroError(QueryExecutionErrors.scala:201)
    	at org.apache.spark.sql.catalyst.expressions.DivModLike.eval(arithmetic.scala:672)
   ...
   ```
   which is similar to the already provided context in case of SQL queries:
   ```
   org.apache.spark.SparkArithmeticException: [DIVIDE_BY_ZERO] Division by zero. Use `try_divide` to tolerate divisor being 0 and return NULL instead. If necessary set "spark.sql.ansi.enabled" to "false" to bypass this error.
   == SQL(line 1, position 1) ==
   a / b
   ^^^^^
   
   	at org.apache.spark.sql.errors.QueryExecutionErrors$.divideByZeroError(QueryExecutionErrors.scala:201)
   	at org.apache.spark.sql.errors.QueryExecutionErrors.divideByZeroError(QueryExecutionErrors.scala)
   ...
   ```
   
   Please note that the stack trace in `spark-shell` doesn't contain meaningful elements:
   ```
   scala> Thread.currentThread().getStackTrace.foreach(println)
   java.base/java.lang.Thread.getStackTrace(Thread.java:1602)
   $line15.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:23)
   $line15.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:27)
   $line15.$read$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:29)
   $line15.$read$$iw$$iw$$iw$$iw$$iw.<init>(<console>:31)
   $line15.$read$$iw$$iw$$iw$$iw.<init>(<console>:33)
   $line15.$read$$iw$$iw$$iw.<init>(<console>:35)
   $line15.$read$$iw$$iw.<init>(<console>:37)
   $line15.$read$$iw.<init>(<console>:39)
   $line15.$read.<init>(<console>:41)
   $line15.$read$.<init>(<console>:45)
   $line15.$read$.<clinit>(<console>)
   $line15.$eval$.$print$lzycompute(<console>:7)
   $line15.$eval$.$print(<console>:6)
   $line15.$eval.$print(<console>)
   java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
   java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   ...
   ```
   so this change doesn't help with that use case.
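
   For illustration, here is a sketch of how the new context could be inspected programmatically from an app like the one above (the accessor names follow the `QueryContext` changes in this PR; the printed values are only indicative):
   ```scala
   import org.apache.spark.SparkThrowable
   import org.apache.spark.sql.functions.col

   // Assumes the same SparkSession and `spark.implicits._` import as in SimpleApp above.
   try {
     Seq((1, 0)).toDF("a", "b").select(col("a") / col("b")).collect()
   } catch {
     case e: SparkThrowable =>
       e.getQueryContext.foreach { ctx =>
         println(ctx.contextType())  // SQL or Dataset, per the new QueryContextType enum
         println(ctx.fragment())     // e.g. "div"
         println(ctx.callSite())     // e.g. "SimpleApp$.main(SimpleApp.scala:9)"
       }
   }
   ```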
   
   ### Why are the changes needed?
   To provide more user-friendly errors.
   
   
   ### Does this PR introduce _any_ user-facing change?
   Yes.
   
   ### How was this patch tested?
   Added new UTs to `QueryExecutionAnsiErrorsSuite`.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No.
   




Re: [PR] [SPARK-45022][SQL] Provide context for dataset API errors [spark]

Posted by "heyihong (via GitHub)" <gi...@apache.org>.
heyihong commented on code in PR #43334:
URL: https://github.com/apache/spark/pull/43334#discussion_r1374510895


##########
connector/connect/common/src/main/protobuf/spark/connect/base.proto:
##########
@@ -841,6 +848,12 @@ message FetchErrorDetailsResponse {
 
     // The corresponding fragment of the query which throws the exception.
     string fragment = 5;
+
+    // The user code (call site of the API) that caused throwing the exception.
+    optional string callSite = 6;
+
+    // Summary of the exception cause.
+    optional string summary = 7;

Review Comment:
   ditto





Re: [PR] [SPARK-45022][SQL] Provide context for dataset API errors [spark]

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk commented on PR #43334:
URL: https://github.com/apache/spark/pull/43334#issuecomment-1788538998

   Merging to master. Thank you, @peter-toth for the original PR, @cloud-fan @heyihong for review.




Re: [PR] [WIP][SPARK-45022][SQL] Provide context for dataset API errors [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #43334:
URL: https://github.com/apache/spark/pull/43334#discussion_r1365516375


##########
sql/core/src/main/scala/org/apache/spark/sql/package.scala:
##########
@@ -73,4 +76,43 @@ package object sql {
    * with rebasing.
    */
   private[sql] val SPARK_LEGACY_INT96_METADATA_KEY = "org.apache.spark.legacyINT96"
+
+  /**
+   * This helper function captures the Spark API and its call site in the user code from the current
+   * stacktrace.
+   *
+   * As adding `withOrigin` explicitly to all Spark API definition would be a huge change,
+   * `withOrigin` is used only at certain places where all API implementation surely pass through
+   * and the current stacktrace is filtered to the point where first Spark API code is invoked from
+   * the user code.
+   *
+   * As there might be multiple nested `withOrigin` calls (e.g. any Spark API implementations can
+   * invoke other APIs) only the first `withOrigin` is captured because that is closer to the user
+   * code.
+   *
+   * @param framesToDrop the number of stack frames we can surely drop before searching for the user
+   *                     code
+   * @param f the function that can use the origin
+   * @return the result of `f`
+   */
+  private[sql] def withOrigin[T](framesToDrop: Int = 0)(f: => T): T = {
+    if (CurrentOrigin.get.stackTrace.isDefined) {
+      f
+    } else {
+      val st = Thread.currentThread().getStackTrace
+      var i = framesToDrop + 3
+      while (sparkCode(st(i))) i += 1

Review Comment:
   since we have this loop here, why do we still need `framesToDrop`?





Re: [PR] [SPARK-45022][SQL] Provide context for dataset API errors [spark]

Posted by "ryan-johnson-databricks (via GitHub)" <gi...@apache.org>.
ryan-johnson-databricks commented on code in PR #43334:
URL: https://github.com/apache/spark/pull/43334#discussion_r1447667682


##########
sql/core/src/main/scala/org/apache/spark/sql/package.scala:
##########
@@ -73,4 +76,41 @@ package object sql {
    * with rebasing.
    */
   private[sql] val SPARK_LEGACY_INT96_METADATA_KEY = "org.apache.spark.legacyINT96"
+
+  /**
+   * This helper function captures the Spark API and its call site in the user code from the current
+   * stacktrace.
+   *
+   * As adding `withOrigin` explicitly to all Spark API definition would be a huge change,
+   * `withOrigin` is used only at certain places where all API implementation surely pass through
+   * and the current stacktrace is filtered to the point where first Spark API code is invoked from
+   * the user code.
+   *
+   * As there might be multiple nested `withOrigin` calls (e.g. any Spark API implementations can
+   * invoke other APIs) only the first `withOrigin` is captured because that is closer to the user
+   * code.
+   *
+   * @param f The function that can use the origin.
+   * @return The result of `f`.
+   */
+  private[sql] def withOrigin[T](f: => T): T = {
+    if (CurrentOrigin.get.stackTrace.isDefined) {
+      f
+    } else {
+      val st = Thread.currentThread().getStackTrace
+      var i = 3
+      while (i < st.length && sparkCode(st(i))) i += 1
+      val origin =
+        Origin(stackTrace = Some(Thread.currentThread().getStackTrace.slice(i - 1, i + 1)))

Review Comment:
   Isn't this super expensive, calling `currentThread().getStackTrace` in a loop?? Can't we grab the stacktrace only once, and filter it as needed?
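
   For reference, a minimal sketch of the single-capture variant this comment seems to ask for (the final `CurrentOrigin.withOrigin(origin)(f)` call is assumed from context, since the diff above is truncated):
   ```scala
   private[sql] def withOrigin[T](f: => T): T = {
     if (CurrentOrigin.get.stackTrace.isDefined) {
       f
     } else {
       // Capture the stack trace once, then scan and slice the same array.
       val st = Thread.currentThread().getStackTrace
       var i = 3
       while (i < st.length && sparkCode(st(i))) i += 1
       val origin = Origin(stackTrace = Some(st.slice(i - 1, i + 1)))
       CurrentOrigin.withOrigin(origin)(f)
     }
   }
   ```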





Re: [PR] [WIP][SPARK-45022][SQL] Provide context for dataset API errors [spark]

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk commented on code in PR #43334:
URL: https://github.com/apache/spark/pull/43334#discussion_r1372154497


##########
sql/core/src/main/scala/org/apache/spark/sql/package.scala:
##########
@@ -73,4 +76,43 @@ package object sql {
    * with rebasing.
    */
   private[sql] val SPARK_LEGACY_INT96_METADATA_KEY = "org.apache.spark.legacyINT96"
+
+  /**
+   * This helper function captures the Spark API and its call site in the user code from the current
+   * stacktrace.
+   *
+   * As adding `withOrigin` explicitly to all Spark API definition would be a huge change,
+   * `withOrigin` is used only at certain places where all API implementation surely pass through
+   * and the current stacktrace is filtered to the point where first Spark API code is invoked from
+   * the user code.
+   *
+   * As there might be multiple nested `withOrigin` calls (e.g. any Spark API implementations can
+   * invoke other APIs) only the first `withOrigin` is captured because that is closer to the user
+   * code.
+   *
+   * @param framesToDrop the number of stack frames we can surely drop before searching for the user
+   *                     code
+   * @param f the function that can use the origin
+   * @return the result of `f`
+   */
+  private[sql] def withOrigin[T](framesToDrop: Int = 0)(f: => T): T = {
+    if (CurrentOrigin.get.stackTrace.isDefined) {
+      f
+    } else {
+      val st = Thread.currentThread().getStackTrace
+      var i = framesToDrop + 3
+      while (sparkCode(st(i))) i += 1

Review Comment:
   We set `framesToDrop = 1` in a few places:
   - `Column.fn`
   - `withExpr`
   - `repartitionByExpression`
   - `repartitionByRange`
   - `withAggregateFunction`
   - `createLambda`
   
   So, there are 2 options: either
   - the function `sparkCode` doesn't work properly, and we forcibly skip 1 frame,
   - or this is a premature optimization.
   
   I will check that after all tests pass.





Re: [PR] [SPARK-45022][SQL] Provide context for dataset API errors [spark]

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk commented on code in PR #43334:
URL: https://github.com/apache/spark/pull/43334#discussion_r1374904453


##########
sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala:
##########
@@ -508,9 +508,11 @@ class Dataset[T] private[sql](
    * @group basic
    * @since 3.4.0
    */
-  def to(schema: StructType): DataFrame = withPlan {
-    val replaced = CharVarcharUtils.failIfHasCharVarchar(schema).asInstanceOf[StructType]
-    Project.matchSchema(logicalPlan, replaced, sparkSession.sessionState.conf)
+  def to(schema: StructType): DataFrame = withOrigin() {

Review Comment:
   When I revert the changes in a method like this (`DataFrame -> DataFrame`), the fragment becomes blurry, like:
   Before: `select`
   After: `anonfun$select$4`





Re: [PR] [SPARK-45022][SQL] Provide context for dataset API errors [spark]

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk commented on code in PR #43334:
URL: https://github.com/apache/spark/pull/43334#discussion_r1374906918


##########
connector/connect/common/src/main/protobuf/spark/connect/base.proto:
##########
@@ -841,6 +848,12 @@ message FetchErrorDetailsResponse {
 
     // The corresponding fragment of the query which throws the exception.
     string fragment = 5;
+
+    // The user code (call site of the API) that caused throwing the exception.
+    optional string callSite = 6;

Review Comment:
   The actual differentiator is the new field `contextType`, so we can put a default value.





Re: [PR] [WIP][SPARK-45022][SQL] Provide context for dataset API errors [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #43334:
URL: https://github.com/apache/spark/pull/43334#discussion_r1365499326


##########
common/utils/src/main/java/org/apache/spark/QueryContext.java:
##########
@@ -45,4 +48,13 @@ public interface QueryContext {
 
     // The corresponding fragment of the query which throws the exception.
     String fragment();
+
+    // The Spark code (API) that caused throwing the exception.
+    String code();

Review Comment:
   shall we reuse the `fragment` function to return the code fragment? 





Re: [PR] [SPARK-45022][SQL] Provide context for dataset API errors [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #43334:
URL: https://github.com/apache/spark/pull/43334#discussion_r1377167899


##########
common/utils/src/main/java/org/apache/spark/QueryContextType.java:
##########
@@ -0,0 +1,31 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark;
+
+import org.apache.spark.annotation.Evolving;
+
+/**
+ * The type of {@link QueryContext}.
+ *
+ * @since 3.5.0
+ */
+@Evolving
+public enum QueryContextType {
+    SQL,
+    Dataset

Review Comment:
   There is no Dataset in PySpark; shall we use the name `DataFrame`? It also exists in Scala as a type alias of `Dataset[Row]`, and `DataFrame` is a more common name in the industry.





Re: [PR] [SPARK-45022][SQL] Provide context for dataset API errors [spark]

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk commented on code in PR #43334:
URL: https://github.com/apache/spark/pull/43334#discussion_r1377416076


##########
sql/api/src/main/scala/org/apache/spark/sql/catalyst/trees/QueryContexts.scala:
##########
@@ -128,6 +129,45 @@ case class SQLQueryContext(
     sqlText.isDefined && originStartIndex.isDefined && originStopIndex.isDefined &&
       originStartIndex.get >= 0 && originStopIndex.get < sqlText.get.length &&
       originStartIndex.get <= originStopIndex.get
+  }
+
+  override def callSite: String = throw new UnsupportedOperationException
+}
+
+case class DatasetQueryContext(
+    override val fragment: String,
+    override val callSite: String) extends QueryContext {
+  override val contextType = QueryContextType.Dataset
+
+  override def objectType: String = throw new UnsupportedOperationException
+  override def objectName: String = throw new UnsupportedOperationException
+  override def startIndex: Int = throw new UnsupportedOperationException
+  override def stopIndex: Int = throw new UnsupportedOperationException
+
+  override lazy val summary: String = {
+    val builder = new StringBuilder
+    builder ++= "== Dataset ==\n"
+    builder ++= "\""
+
+    builder ++= fragment
+    builder ++= "\""
+    builder ++= " was called from "

Review Comment:
   Not sure about this. Now it looks:
   ```
   == Dataset ==
   "col" was called from org.apache.spark.sql.DatasetSuite.$anonfun$new$621(DatasetSuite.scala:2673)
   
   ```
   but what you propose:
   ```
   == Dataset ==
   "col" was called from
   org.apache.spark.sql.DatasetSuite.$anonfun$new$621(DatasetSuite.scala:2673)
   
   ```
   @cloud-fan Are you sure?





Re: [PR] [SPARK-45022][SQL] Provide context for dataset API errors [spark]

Posted by "itholic (via GitHub)" <gi...@apache.org>.
itholic commented on PR #43334:
URL: https://github.com/apache/spark/pull/43334#issuecomment-2029250313

   > > If not, we probably should disable this feature for spark connect for now since customers may get confused if they see contexts for dataset API errors (likely in the spark connect planner) in the error message.
   > 
   > Don't think we should disable it even if it doesn't work. We have enough time to implement it before Spark 4.0.
   
   Do we happen to have any specific plan or timeline for supporting this feature for Spark Connect? It seems like it is not working on either the Spark Connect Scala or Python client for now.




Re: [PR] [WIP][SPARK-45022][SQL] Provide context for dataset API errors [spark]

Posted by "peter-toth (via GitHub)" <gi...@apache.org>.
peter-toth commented on code in PR #43334:
URL: https://github.com/apache/spark/pull/43334#discussion_r1373459674


##########
sql/core/src/main/scala/org/apache/spark/sql/package.scala:
##########
@@ -73,4 +76,43 @@ package object sql {
    * with rebasing.
    */
   private[sql] val SPARK_LEGACY_INT96_METADATA_KEY = "org.apache.spark.legacyINT96"
+
+  /**
+   * This helper function captures the Spark API and its call site in the user code from the current
+   * stacktrace.
+   *
+   * As adding `withOrigin` explicitly to all Spark API definition would be a huge change,
+   * `withOrigin` is used only at certain places where all API implementation surely pass through
+   * and the current stacktrace is filtered to the point where first Spark API code is invoked from
+   * the user code.
+   *
+   * As there might be multiple nested `withOrigin` calls (e.g. any Spark API implementations can
+   * invoke other APIs) only the first `withOrigin` is captured because that is closer to the user
+   * code.
+   *
+   * @param framesToDrop the number of stack frames we can surely drop before searching for the user
+   *                     code
+   * @param f the function that can use the origin
+   * @return the result of `f`
+   */
+  private[sql] def withOrigin[T](framesToDrop: Int = 0)(f: => T): T = {
+    if (CurrentOrigin.get.stackTrace.isDefined) {
+      f
+    } else {
+      val st = Thread.currentThread().getStackTrace
+      var i = framesToDrop + 3
+      while (sparkCode(st(i))) i += 1

Review Comment:
   I wanted to use it only as an optimization: at certain places we simply know at least how many frames deep we are in Spark's code. `sparkCode()` uses a regex so it can be a bit slow...





Re: [PR] [WIP][SPARK-45022][SQL] Provide context for dataset API errors [spark]

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk commented on code in PR #43334:
URL: https://github.com/apache/spark/pull/43334#discussion_r1373513104


##########
sql/core/src/main/scala/org/apache/spark/sql/package.scala:
##########
@@ -73,4 +76,43 @@ package object sql {
    * with rebasing.
    */
   private[sql] val SPARK_LEGACY_INT96_METADATA_KEY = "org.apache.spark.legacyINT96"
+
+  /**
+   * This helper function captures the Spark API and its call site in the user code from the current
+   * stacktrace.
+   *
+   * As adding `withOrigin` explicitly to all Spark API definition would be a huge change,
+   * `withOrigin` is used only at certain places where all API implementation surely pass through
+   * and the current stacktrace is filtered to the point where first Spark API code is invoked from
+   * the user code.
+   *
+   * As there might be multiple nested `withOrigin` calls (e.g. any Spark API implementations can
+   * invoke other APIs) only the first `withOrigin` is captured because that is closer to the user
+   * code.
+   *
+   * @param framesToDrop the number of stack frames we can surely drop before searching for the user
+   *                     code
+   * @param f the function that can use the origin
+   * @return the result of `f`
+   */
+  private[sql] def withOrigin[T](framesToDrop: Int = 0)(f: => T): T = {
+    if (CurrentOrigin.get.stackTrace.isDefined) {
+      f
+    } else {
+      val st = Thread.currentThread().getStackTrace
+      var i = framesToDrop + 3
+      while (sparkCode(st(i))) i += 1

Review Comment:
   > sparkCode() uses regex so it can be a bit slow.
   
   It shouldn't be so slow, I think, especially since it's just one pattern match. I'll remove the optimization for now.
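
   For context, a minimal sketch of a prefix-based frame filter as a non-regex alternative (the real `sparkCode` helper isn't shown in this thread, so the package prefixes below are assumptions):
   ```scala
   // Hypothetical frame filter: treat frames from these packages as Spark-internal code.
   private val sparkCodePrefixes = Seq("org.apache.spark.sql.", "scala.")

   private def sparkCode(ste: StackTraceElement): Boolean =
     sparkCodePrefixes.exists(p => ste.getClassName.startsWith(p))
   ```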





Re: [PR] [SPARK-45022][SQL] Provide context for dataset API errors [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #43334:
URL: https://github.com/apache/spark/pull/43334#discussion_r1377667589


##########
sql/api/src/main/scala/org/apache/spark/sql/catalyst/trees/QueryContexts.scala:
##########
@@ -128,6 +129,45 @@ case class SQLQueryContext(
     sqlText.isDefined && originStartIndex.isDefined && originStopIndex.isDefined &&
       originStartIndex.get >= 0 && originStopIndex.get < sqlText.get.length &&
       originStartIndex.get <= originStopIndex.get
+  }
+
+  override def callSite: String = throw new UnsupportedOperationException
+}
+
+case class DatasetQueryContext(
+    override val fragment: String,
+    override val callSite: String) extends QueryContext {
+  override val contextType = QueryContextType.Dataset
+
+  override def objectType: String = throw new UnsupportedOperationException
+  override def objectName: String = throw new UnsupportedOperationException
+  override def startIndex: Int = throw new UnsupportedOperationException
+  override def stopIndex: Int = throw new UnsupportedOperationException
+
+  override lazy val summary: String = {
+    val builder = new StringBuilder
+    builder ++= "== Dataset ==\n"
+    builder ++= "\""
+
+    builder ++= fragment
+    builder ++= "\""
+    builder ++= " was called from "

Review Comment:
   OK let's leave it as it is





Re: [PR] [SPARK-45022][SQL] Provide context for dataset API errors [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #43334:
URL: https://github.com/apache/spark/pull/43334#discussion_r1377166456


##########
sql/api/src/main/scala/org/apache/spark/sql/catalyst/trees/QueryContexts.scala:
##########
@@ -28,19 +28,20 @@ case class SQLQueryContext(
     sqlText: Option[String],
     originObjectType: Option[String],
     originObjectName: Option[String]) extends QueryContext {
+  override val contextType = QueryContextType.SQL
 
-  override val objectType = originObjectType.getOrElse("")
-  override val objectName = originObjectName.getOrElse("")
-  override val startIndex = originStartIndex.getOrElse(-1)
-  override val stopIndex = originStopIndex.getOrElse(-1)

Review Comment:
   why remove `override`?





Re: [PR] [SPARK-45022][SQL] Provide context for dataset API errors [spark]

Posted by "ryan-johnson-databricks (via GitHub)" <gi...@apache.org>.
ryan-johnson-databricks commented on code in PR #43334:
URL: https://github.com/apache/spark/pull/43334#discussion_r1447541033


##########
sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala:
##########
@@ -1572,7 +1589,9 @@ class Dataset[T] private[sql](
    * @since 2.0.0
    */
   @scala.annotation.varargs
-  def select(col: String, cols: String*): DataFrame = select((col +: cols).map(Column(_)) : _*)
+  def select(col: String, cols: String*): DataFrame = withOrigin {

Review Comment:
   I don't think this is helpful -- the underlying `select` already has a `withOrigin` call, no?





Re: [PR] [SPARK-45022][SQL] Provide context for dataset API errors [spark]

Posted by "heyihong (via GitHub)" <gi...@apache.org>.
heyihong commented on PR #43334:
URL: https://github.com/apache/spark/pull/43334#issuecomment-1785254850

   One more question: does this feature work with spark connect scala client? If not, we probably should disable this feature for spark connect for now since customers may get confused if they see contexts for dataset API errors in the error message. 




Re: [PR] [SPARK-45022][SQL] Provide context for dataset API errors [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #43334:
URL: https://github.com/apache/spark/pull/43334#discussion_r1377169413


##########
sql/api/src/main/scala/org/apache/spark/sql/catalyst/trees/QueryContexts.scala:
##########
@@ -128,6 +129,45 @@ case class SQLQueryContext(
     sqlText.isDefined && originStartIndex.isDefined && originStopIndex.isDefined &&
       originStartIndex.get >= 0 && originStopIndex.get < sqlText.get.length &&
       originStartIndex.get <= originStopIndex.get
+  }
+
+  override def callSite: String = throw new UnsupportedOperationException
+}
+
+case class DatasetQueryContext(
+    override val fragment: String,
+    override val callSite: String) extends QueryContext {
+  override val contextType = QueryContextType.Dataset
+
+  override def objectType: String = throw new UnsupportedOperationException
+  override def objectName: String = throw new UnsupportedOperationException
+  override def startIndex: Int = throw new UnsupportedOperationException
+  override def stopIndex: Int = throw new UnsupportedOperationException
+
+  override lazy val summary: String = {
+    val builder = new StringBuilder
+    builder ++= "== Dataset ==\n"
+    builder ++= "\""
+
+    builder ++= fragment
+    builder ++= "\""
+    builder ++= " was called from "
+    builder ++= callSite
+    builder += '\n'
+    builder.result()
+  }
+}
+
+object DatasetQueryContext {
+  def apply(elements: Array[StackTraceElement]): DatasetQueryContext = {
+    val methodName = elements(0).getMethodName
+    val code = if (methodName.length > 1 && methodName(0) == '$') {
+      methodName.substring(1)
+    } else {
+      methodName
+    }
+    val callSite = elements(1).toString

Review Comment:
   shall we explicitly add an `assert` in this method?
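
   For illustration, a sketch of how such an assert could look in this factory (the final constructor call is assumed from context, since the diff above is truncated):
   ```scala
   object DatasetQueryContext {
     def apply(elements: Array[StackTraceElement]): DatasetQueryContext = {
       // Expect the API frame and its call-site frame captured by withOrigin.
       assert(elements.length >= 2,
         s"expected at least 2 stack trace elements, got ${elements.length}")
       val methodName = elements(0).getMethodName
       // Drop the leading '$' that Scala adds to some synthetic method names.
       val code = if (methodName.length > 1 && methodName(0) == '$') {
         methodName.substring(1)
       } else {
         methodName
       }
       val callSite = elements(1).toString
       DatasetQueryContext(code, callSite)
     }
   }
   ```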





Re: [PR] [SPARK-45022][SQL] Provide context for dataset API errors [spark]

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk commented on code in PR #43334:
URL: https://github.com/apache/spark/pull/43334#discussion_r1375777248


##########
connector/connect/common/src/main/protobuf/spark/connect/base.proto:
##########
@@ -841,6 +848,12 @@ message FetchErrorDetailsResponse {
 
     // The corresponding fragment of the query which throws the exception.
     string fragment = 5;
+
+    // The user code (call site of the API) that caused throwing the exception.
+    optional string callSite = 6;

Review Comment:
   Removed `optional`



##########
connector/connect/common/src/main/protobuf/spark/connect/base.proto:
##########
@@ -841,6 +848,12 @@ message FetchErrorDetailsResponse {
 
     // The corresponding fragment of the query which throws the exception.
     string fragment = 5;
+
+    // The user code (call site of the API) that caused throwing the exception.
+    optional string callSite = 6;
+
+    // Summary of the exception cause.
+    optional string summary = 7;

Review Comment:
   done





Re: [PR] [SPARK-45022][SQL] Provide context for dataset API errors [spark]

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk commented on code in PR #43334:
URL: https://github.com/apache/spark/pull/43334#discussion_r1379169569


##########
sql/core/src/main/scala/org/apache/spark/sql/package.scala:
##########
@@ -73,4 +76,41 @@ package object sql {
    * with rebasing.
    */
   private[sql] val SPARK_LEGACY_INT96_METADATA_KEY = "org.apache.spark.legacyINT96"
+
+  /**
+   * This helper function captures the Spark API and its call site in the user code from the current
+   * stacktrace.
+   *
+   * As adding `withOrigin` explicitly to all Spark API definition would be a huge change,
+   * `withOrigin` is used only at certain places where all API implementation surely pass through
+   * and the current stacktrace is filtered to the point where first Spark API code is invoked from
+   * the user code.
+   *
+   * As there might be multiple nested `withOrigin` calls (e.g. any Spark API implementations can
+   * invoke other APIs) only the first `withOrigin` is captured because that is closer to the user
+   * code.
+   *
+   * @param f The function that can use the origin.
+   * @return The result of `f`.
+   */
+  private[sql] def withOrigin[T](f: => T): T = {
+    if (CurrentOrigin.get.stackTrace.isDefined) {
+      f
+    } else {
+      val st = Thread.currentThread().getStackTrace
+      var i = 3

Review Comment:
   Regarding the magic number: we always have these 3 elements at the beginning of the input array:
   [Screenshot 2023-11-01 at 21 36 41: https://github.com/apache/spark/assets/1580697/bff163eb-8cba-40bd-bdf6-8f7094abe80d]
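
   A quick, standalone way to see what those leading frames look like on a given setup (a sketch; the exact frame contents vary with the Scala and JVM versions):
   ```scala
   object StackPrefixCheck {
     def main(args: Array[String]): Unit = {
       // Mirror withOrigin: capture the trace from inside a helper and print its prefix.
       def capture(): Array[StackTraceElement] = Thread.currentThread().getStackTrace
       capture().take(4).zipWithIndex.foreach { case (frame, i) => println(s"$i: $frame") }
     }
   }
   ```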
   





Re: [PR] [SPARK-45022][SQL] Provide context for dataset API errors [spark]

Posted by "heyihong (via GitHub)" <gi...@apache.org>.
heyihong commented on code in PR #43334:
URL: https://github.com/apache/spark/pull/43334#discussion_r1374509141


##########
connector/connect/common/src/main/protobuf/spark/connect/base.proto:
##########
@@ -841,6 +848,12 @@ message FetchErrorDetailsResponse {
 
     // The corresponding fragment of the query which throws the exception.
     string fragment = 5;
+
+    // The user code (call site of the API) that caused throwing the exception.
+    optional string callSite = 6;

Review Comment:
   Do we need to differentiate between null and "" for callSite? If not,
   ```
   string callSite = 6;
   ```
   should be sufficient





Re: [PR] [WIP][SPARK-45022][SQL] Provide context for dataset API errors [spark]

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk commented on code in PR #43334:
URL: https://github.com/apache/spark/pull/43334#discussion_r1372162159


##########
sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala:
##########
@@ -508,9 +508,11 @@ class Dataset[T] private[sql](
    * @group basic
    * @since 3.4.0
    */
-  def to(schema: StructType): DataFrame = withPlan {
-    val replaced = CharVarcharUtils.failIfHasCharVarchar(schema).asInstanceOf[StructType]
-    Project.matchSchema(logicalPlan, replaced, sparkSession.sessionState.conf)
+  def to(schema: StructType): DataFrame = withOrigin() {

Review Comment:
   Could you elaborate a little bit more on why ANSI mode is important here?





Re: [PR] [SPARK-45022][SQL] Provide context for dataset API errors [spark]

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk commented on code in PR #43334:
URL: https://github.com/apache/spark/pull/43334#discussion_r1377216909


##########
common/utils/src/main/java/org/apache/spark/QueryContextType.java:
##########
@@ -0,0 +1,31 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark;
+
+import org.apache.spark.annotation.Evolving;
+
+/**
+ * The type of {@link QueryContext}.
+ *
+ * @since 3.5.0

Review Comment:
   Updated





Re: [PR] [SPARK-45022][SQL] Provide context for dataset API errors [spark]

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk commented on code in PR #43334:
URL: https://github.com/apache/spark/pull/43334#discussion_r1377214425


##########
sql/api/src/main/scala/org/apache/spark/sql/catalyst/trees/QueryContexts.scala:
##########
@@ -28,19 +28,20 @@ case class SQLQueryContext(
     sqlText: Option[String],
     originObjectType: Option[String],
     originObjectName: Option[String]) extends QueryContext {
+  override val contextType = QueryContextType.SQL
 
-  override val objectType = originObjectType.getOrElse("")
-  override val objectName = originObjectName.getOrElse("")
-  override val startIndex = originStartIndex.getOrElse(-1)
-  override val stopIndex = originStopIndex.getOrElse(-1)

Review Comment:
   Will revert it back.





Re: [PR] [SPARK-45022][SQL] Provide context for dataset API errors [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #43334:
URL: https://github.com/apache/spark/pull/43334#discussion_r1377164175


##########
common/utils/src/main/java/org/apache/spark/QueryContextType.java:
##########
@@ -0,0 +1,31 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark;
+
+import org.apache.spark.annotation.Evolving;
+
+/**
+ * The type of {@link QueryContext}.
+ *
+ * @since 3.5.0

Review Comment:
   4.0.0





Re: [PR] [SPARK-45022][SQL] Provide context for dataset API errors [spark]

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk commented on PR #43334:
URL: https://github.com/apache/spark/pull/43334#issuecomment-1785771481

   >  If not, we probably should disable this feature for spark connect for now since customers may get confused if they see contexts for dataset API errors (likely in the spark connect planner) in the error message.
   
   Don't think we should disable it even if it doesn't work. We have enough time to implement it before Spark 4.0.




Re: [PR] [WIP][SPARK-45022][SQL] Provide context for dataset API errors [spark]

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk commented on code in PR #43334:
URL: https://github.com/apache/spark/pull/43334#discussion_r1372046107


##########
common/utils/src/main/java/org/apache/spark/QueryContext.java:
##########
@@ -45,4 +48,13 @@ public interface QueryContext {
 
     // The corresponding fragment of the query which throws the exception.
     String fragment();
+
+    // The Spark code (API) that caused throwing the exception.
+    String code();

Review Comment:
   done





Re: [PR] [SPARK-45022][SQL] Provide context for dataset API errors [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #43334:
URL: https://github.com/apache/spark/pull/43334#discussion_r1377180284


##########
sql/core/src/main/scala/org/apache/spark/sql/package.scala:
##########
@@ -73,4 +76,41 @@ package object sql {
    * with rebasing.
    */
   private[sql] val SPARK_LEGACY_INT96_METADATA_KEY = "org.apache.spark.legacyINT96"
+
+  /**
+   * This helper function captures the Spark API and its call site in the user code from the current
+   * stacktrace.
+   *
+   * As adding `withOrigin` explicitly to all Spark API definition would be a huge change,
+   * `withOrigin` is used only at certain places where all API implementation surely pass through
+   * and the current stacktrace is filtered to the point where first Spark API code is invoked from
+   * the user code.
+   *
+   * As there might be multiple nested `withOrigin` calls (e.g. any Spark API implementations can
+   * invoke other APIs) only the first `withOrigin` is captured because that is closer to the user
+   * code.
+   *
+   * @param f The function that can use the origin.
+   * @return The result of `f`.
+   */
+  private[sql] def withOrigin[T](f: => T): T = {
+    if (CurrentOrigin.get.stackTrace.isDefined) {
+      f
+    } else {
+      val st = Thread.currentThread().getStackTrace
+      var i = 3

Review Comment:
   can we add a comment to explain this magic number?





Re: [PR] [SPARK-45022][SQL] Provide context for dataset API errors [spark]

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk commented on code in PR #43334:
URL: https://github.com/apache/spark/pull/43334#discussion_r1377414835


##########
common/utils/src/main/java/org/apache/spark/QueryContextType.java:
##########
@@ -0,0 +1,31 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark;
+
+import org.apache.spark.annotation.Evolving;
+
+/**
+ * The type of {@link QueryContext}.
+ *
+ * @since 3.5.0
+ */
+@Evolving
+public enum QueryContextType {
+    SQL,
+    Dataset

Review Comment:
   ok, I will rename it.





Re: [PR] [SPARK-45022][SQL] Provide context for dataset API errors [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #43334:
URL: https://github.com/apache/spark/pull/43334#discussion_r1447601095


##########
sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala:
##########
@@ -1572,7 +1589,9 @@ class Dataset[T] private[sql](
    * @since 2.0.0
    */
   @scala.annotation.varargs
-  def select(col: String, cols: String*): DataFrame = select((col +: cols).map(Column(_)) : _*)
+  def select(col: String, cols: String*): DataFrame = withOrigin {

Review Comment:
   We are reverting it in https://github.com/apache/spark/pull/44501





Re: [PR] [SPARK-45022][SQL] Provide context for dataset API errors [spark]

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk commented on code in PR #43334:
URL: https://github.com/apache/spark/pull/43334#discussion_r1447671645


##########
sql/core/src/main/scala/org/apache/spark/sql/package.scala:
##########
@@ -73,4 +76,41 @@ package object sql {
    * with rebasing.
    */
   private[sql] val SPARK_LEGACY_INT96_METADATA_KEY = "org.apache.spark.legacyINT96"
+
+  /**
+   * This helper function captures the Spark API and its call site in the user code from the current
+   * stacktrace.
+   *
+   * As adding `withOrigin` explicitly to all Spark API definition would be a huge change,
+   * `withOrigin` is used only at certain places where all API implementation surely pass through
+   * and the current stacktrace is filtered to the point where first Spark API code is invoked from
+   * the user code.
+   *
+   * As there might be multiple nested `withOrigin` calls (e.g. any Spark API implementations can
+   * invoke other APIs) only the first `withOrigin` is captured because that is closer to the user
+   * code.
+   *
+   * @param f The function that can use the origin.
+   * @return The result of `f`.
+   */
+  private[sql] def withOrigin[T](f: => T): T = {
+    if (CurrentOrigin.get.stackTrace.isDefined) {
+      f
+    } else {
+      val st = Thread.currentThread().getStackTrace
+      var i = 3
+      while (i < st.length && sparkCode(st(i))) i += 1
+      val origin =
+        Origin(stackTrace = Some(Thread.currentThread().getStackTrace.slice(i - 1, i + 1)))

Review Comment:
   which loop do you mean?





Re: [PR] [SPARK-45022][SQL] Provide context for dataset API errors [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on PR #43334:
URL: https://github.com/apache/spark/pull/43334#issuecomment-2029265533

   I don't think so... We are still waiting for people who are familiar with Spark Connect to pick it up.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [WIP][SPARK-45022][SQL] Provide context for dataset API errors [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #43334:
URL: https://github.com/apache/spark/pull/43334#discussion_r1365512435


##########
sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala:
##########
@@ -508,9 +508,11 @@ class Dataset[T] private[sql](
    * @group basic
    * @since 3.4.0
    */
-  def to(schema: StructType): DataFrame = withPlan {
-    val replaced = CharVarcharUtils.failIfHasCharVarchar(schema).asInstanceOf[StructType]
-    Project.matchSchema(logicalPlan, replaced, sparkSession.sessionState.conf)
+  def to(schema: StructType): DataFrame = withOrigin() {

Review Comment:
   I think it's good enough to attach the expression call site for ANSI mode; we can attach the plan call site later.
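   A minimal sketch of the distinction being discussed, reusing the `Origin(stackTrace = ...)` shape from this PR's diff; the helper name below is hypothetical and only illustrates where the call site would be attached:
   ```scala
   import org.apache.spark.sql.catalyst.trees.{CurrentOrigin, Origin}

   // Expression call site: capture the user frame when the Column/expression is
   // built, so ANSI runtime errors can point back to it.
   // (The real withOrigin also skips capture when an origin is already set.)
   def withExpressionCallSite[T](f: => T): T = {
     val origin = Origin(stackTrace = Some(Thread.currentThread().getStackTrace))
     CurrentOrigin.withOrigin(origin)(f)
   }

   // Plan call site (deferred per this comment): the same capture, but wrapping
   // the Dataset methods that build LogicalPlan nodes (such as `to`) instead of
   // expression construction.
   ```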



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [WIP][SPARK-45022][SQL] Provide context for dataset API errors [spark]

Posted by "peter-toth (via GitHub)" <gi...@apache.org>.
peter-toth commented on code in PR #43334:
URL: https://github.com/apache/spark/pull/43334#discussion_r1373459674


##########
sql/core/src/main/scala/org/apache/spark/sql/package.scala:
##########
@@ -73,4 +76,43 @@ package object sql {
    * with rebasing.
    */
   private[sql] val SPARK_LEGACY_INT96_METADATA_KEY = "org.apache.spark.legacyINT96"
+
+  /**
+   * This helper function captures the Spark API and its call site in the user code from the current
+   * stacktrace.
+   *
+   * As adding `withOrigin` explicitly to all Spark API definition would be a huge change,
+   * `withOrigin` is used only at certain places where all API implementation surely pass through
+   * and the current stacktrace is filtered to the point where first Spark API code is invoked from
+   * the user code.
+   *
+   * As there might be multiple nested `withOrigin` calls (e.g. any Spark API implementations can
+   * invoke other APIs) only the first `withOrigin` is captured because that is closer to the user
+   * code.
+   *
+   * @param framesToDrop the number of stack frames we can surely drop before searching for the user
+   *                     code
+   * @param f the function that can use the origin
+   * @return the result of `f`
+   */
+  private[sql] def withOrigin[T](framesToDrop: Int = 0)(f: => T): T = {
+    if (CurrentOrigin.get.stackTrace.isDefined) {
+      f
+    } else {
+      val st = Thread.currentThread().getStackTrace
+      var i = framesToDrop + 3
+      while (sparkCode(st(i))) i += 1

Review Comment:
   I wanted to use it only as an optimization: at certain places we simply know at least how many frames deep we are in Spark's code. `sparkCode()` uses a regex, so it can be a bit slow...
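   For context, a rough sketch of what a regex-based frame classifier like `sparkCode` could look like; the pattern below is an assumption, not the PR's exact code:
   ```scala
   import java.util.regex.Pattern

   // Assumed pattern. Pre-compiling keeps the per-frame cost down, but every
   // frame tested is still a regex match, which is what `framesToDrop` lets
   // callers skip when they already know how deep inside Spark's code they are.
   private val sparkClassName = Pattern.compile("""org\.apache\.spark\..*""")

   private def sparkCode(ste: StackTraceElement): Boolean =
     sparkClassName.matcher(ste.getClassName).matches()
   ```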



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [WIP][SPARK-45022][SQL] Provide context for dataset API errors [spark]

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk commented on code in PR #43334:
URL: https://github.com/apache/spark/pull/43334#discussion_r1373515871


##########
sql/api/src/main/scala/org/apache/spark/sql/catalyst/trees/QueryContexts.scala:
##########
@@ -128,6 +129,45 @@ case class SQLQueryContext(
     sqlText.isDefined && originStartIndex.isDefined && originStopIndex.isDefined &&
       originStartIndex.get >= 0 && originStopIndex.get < sqlText.get.length &&
       originStartIndex.get <= originStopIndex.get
+  }
+
+  override def callSite: String = throw new UnsupportedOperationException
+}
+
+case class DatasetQueryContext(
+    override val fragment: String,
+    override val callSite: String) extends QueryContext {
+  override val contextType = QueryContextType.Dataset
+
+  override def objectType: String = throw new UnsupportedOperationException
+  override def objectName: String = throw new UnsupportedOperationException
+  override def startIndex: Int = throw new UnsupportedOperationException
+  override def stopIndex: Int = throw new UnsupportedOperationException
+
+  override lazy val summary: String = {
+    val builder = new StringBuilder
+    builder ++= "== Dataset ==\n"
+    builder ++= "\""
+
+    builder ++= fragment
+    builder ++= "\""
+    builder ++= " was called from "
+    builder ++= callSite
+    builder += '\n'
+    builder.result()
+  }
+}
+
+object DatasetQueryContext {
+  def apply(elements: Array[StackTraceElement]): DatasetQueryContext = {
+    val methodName = elements(0).getMethodName
+    val code = if (methodName.length > 1 && methodName(0) == '$') {
+      methodName.substring(1)
+    } else {
+      methodName
+    }
+    val callSite = elements(1).toString

Review Comment:
   We assume that `elements` contains at least two elements. The assumption comes from the code:
   ```scala
     private[sql] def withOrigin[T](f: => T): T = {
   ...
         var i = 3
         while (i < st.length && sparkCode(st(i))) i += 1
   ```
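   To make the assumption concrete, here is a hypothetical two-frame slice and what `DatasetQueryContext.apply` derives from it (the class, method, and line numbers below are made up for illustration):
   ```scala
   // The slice produced by withOrigin: the last Spark frame followed by the
   // first user frame.
   val elements = Array(
     new StackTraceElement("org.apache.spark.sql.Dataset", "select", "Dataset.scala", 1593),
     new StackTraceElement("com.example.MyApp$", "main", "MyApp.scala", 42))

   val ctx = DatasetQueryContext(elements)
   // ctx.fragment == "select"  -- elements(0).getMethodName, with a leading '$' stripped
   // ctx.callSite == "com.example.MyApp$.main(MyApp.scala:42)"  -- elements(1).toString
   ```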



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45022][SQL] Provide context for dataset API errors [spark]

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk closed pull request #43334: [SPARK-45022][SQL] Provide context for dataset API errors
URL: https://github.com/apache/spark/pull/43334


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45022][SQL] Provide context for dataset API errors [spark]

Posted by "peter-toth (via GitHub)" <gi...@apache.org>.
peter-toth commented on code in PR #43334:
URL: https://github.com/apache/spark/pull/43334#discussion_r1379176991


##########
sql/core/src/main/scala/org/apache/spark/sql/package.scala:
##########
@@ -73,4 +76,41 @@ package object sql {
    * with rebasing.
    */
   private[sql] val SPARK_LEGACY_INT96_METADATA_KEY = "org.apache.spark.legacyINT96"
+
+  /**
+   * This helper function captures the Spark API and its call site in the user code from the current
+   * stacktrace.
+   *
+   * As adding `withOrigin` explicitly to all Spark API definition would be a huge change,
+   * `withOrigin` is used only at certain places where all API implementation surely pass through
+   * and the current stacktrace is filtered to the point where first Spark API code is invoked from
+   * the user code.
+   *
+   * As there might be multiple nested `withOrigin` calls (e.g. any Spark API implementations can
+   * invoke other APIs) only the first `withOrigin` is captured because that is closer to the user
+   * code.
+   *
+   * @param f The function that can use the origin.
+   * @return The result of `f`.
+   */
+  private[sql] def withOrigin[T](f: => T): T = {
+    if (CurrentOrigin.get.stackTrace.isDefined) {
+      f
+    } else {
+      val st = Thread.currentThread().getStackTrace
+      var i = 3

Review Comment:
   This has been discussed at https://github.com/apache/spark/pull/42740#discussion_r1337356903.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45022][SQL] Provide context for dataset API errors [spark]

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk commented on code in PR #43334:
URL: https://github.com/apache/spark/pull/43334#discussion_r1379164499


##########
sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala:
##########
@@ -508,9 +508,11 @@ class Dataset[T] private[sql](
    * @group basic
    * @since 3.4.0
    */
-  def to(schema: StructType): DataFrame = withPlan {
-    val replaced = CharVarcharUtils.failIfHasCharVarchar(schema).asInstanceOf[StructType]
-    Project.matchSchema(logicalPlan, replaced, sparkSession.sessionState.conf)
+  def to(schema: StructType): DataFrame = withOrigin() {

Review Comment:
   @cloud-fan For example, **with** `withOrigin` in `select`, we stop at index 3:
   Screenshot 2023-11-01 at 21 25 06: https://github.com/apache/spark/assets/1580697/7080a1bb-7128-4c23-baaf-64aab6f7cd0a
   
   Without `withOrigin` in `select`, the picture is different: we are in the `withOrigin` of `Column`'s constructor and stop at index 5:
   Screenshot 2023-11-01 at 21 29 18: https://github.com/apache/spark/assets/1580697/c6747280-fc30-4500-a26c-e0a063544966
   
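   For readers without the screenshots, a rough sketch of the frame layout behind these indices (class names are illustrative):
   ```
   With withOrigin in Dataset.select, the scan starting at i = 3 hits the user
   frame immediately:
     st(0)  java.lang.Thread.getStackTrace
     st(1)  org.apache.spark.sql.package$.withOrigin
     st(2)  org.apache.spark.sql.Dataset.select
     st(3)  com.example.MyApp$.main            <- first non-Spark frame

   Without withOrigin in select, the first withOrigin reached is the one in
   Column's constructor, so more Spark-internal frames sit above the user frame
   and the scan stops later (index 5 in the second screenshot).
   ```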



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45022][SQL] Provide context for dataset API errors [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #43334:
URL: https://github.com/apache/spark/pull/43334#discussion_r1377168785


##########
sql/api/src/main/scala/org/apache/spark/sql/catalyst/trees/QueryContexts.scala:
##########
@@ -128,6 +129,45 @@ case class SQLQueryContext(
     sqlText.isDefined && originStartIndex.isDefined && originStopIndex.isDefined &&
       originStartIndex.get >= 0 && originStopIndex.get < sqlText.get.length &&
       originStartIndex.get <= originStopIndex.get
+  }
+
+  override def callSite: String = throw new UnsupportedOperationException
+}
+
+case class DatasetQueryContext(
+    override val fragment: String,
+    override val callSite: String) extends QueryContext {
+  override val contextType = QueryContextType.Dataset
+
+  override def objectType: String = throw new UnsupportedOperationException
+  override def objectName: String = throw new UnsupportedOperationException
+  override def startIndex: Int = throw new UnsupportedOperationException
+  override def stopIndex: Int = throw new UnsupportedOperationException
+
+  override lazy val summary: String = {
+    val builder = new StringBuilder
+    builder ++= "== Dataset ==\n"
+    builder ++= "\""
+
+    builder ++= fragment
+    builder ++= "\""
+    builder ++= " was called from "

Review Comment:
   shall we add a `\n` before the call site?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org