
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/03/22 12:48:39 UTC

[GitHub] [spark] gengliangwang opened a new pull request #35936: [SPARK-38616][SQL] Keep track of SQL query text in Catalyst TreeNode

gengliangwang opened a new pull request #35936:
URL: https://github.com/apache/spark/pull/35936

### What changes were proposed in this pull request?

Spark SQL uses the class Origin for tracking the position of each TreeNode in the SQL query text. When there is a parser error, we can show the position info in the error message:
```
> sql("create tabe foo(i int)")
org.apache.spark.sql.catalyst.parser.ParseException:
no viable alternative at input 'create tabe'(line 1, pos 7)

== SQL ==
create tabe foo(i int)
-------^^^
```
Origin contains two fields: line and startPosition. This is enough for the parser, since the parser already has the SQL query text.
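As a rough illustration of why the two fields are sufficient at parse time, the sketch below rebuilds the marker line from the error message above. It is a simplified stand-in, not Catalyst's actual code: the `Origin` here only mirrors the two fields named in this description, and `PositionMarker.render` is a hypothetical helper.

```scala
// Simplified stand-in for Catalyst's Origin; only the two fields named above.
case class Origin(line: Option[Int] = None, startPosition: Option[Int] = None)

// Hypothetical helper, not part of Spark.
object PositionMarker {
  // Renders the query with a marker line under the offending position.
  // The caller must supply sqlText, which the parser has but Origin does not.
  def render(sqlText: String, origin: Origin): String = {
    val lines = sqlText.split("\n", -1).toSeq
    val marked = (origin.line, origin.startPosition) match {
      case (Some(l), Some(p)) if l >= 1 && l <= lines.length && p >= 0 =>
        // Insert the marker right after the 1-based line `l`.
        val (before, after) = lines.splitAt(l)
        (before :+ ("-" * p + "^^^")) ++ after
      case _ => lines // no usable position: print the query unmarked
    }
    ("== SQL ==" +: marked).mkString("\n")
  }
}
```

The point of the sketch is that `render` needs `sqlText` as a separate argument: once that text is no longer at hand, line and startPosition alone cannot reproduce the marker.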

However, the SQL query text is not available in the execution phase, so Spark SQL can't show the problematic SQL clause on ANSI runtime failures.
This PR includes the query text in Origin. With it, Expressions that can throw runtime exceptions when ANSI mode is on can provide details in their error messages.
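A minimal sketch of the idea, under stated assumptions: the class name `OriginWithText`, the fields `sqlText`, `startIndex`, and `stopIndex`, and the `AnsiDivide` helper are all hypothetical names chosen for illustration and are not taken from this PR's diff.

```scala
// Hypothetical extension of Origin; the three extra field names are assumptions.
case class OriginWithText(
    line: Option[Int] = None,
    startPosition: Option[Int] = None,
    sqlText: Option[String] = None,   // full query text
    startIndex: Option[Int] = None,   // inclusive offset into sqlText
    stopIndex: Option[Int] = None) {  // inclusive offset into sqlText

  // The fragment of the query this node came from,
  // or "" if any piece is missing or out of range.
  def fragment: String = (sqlText, startIndex, stopIndex) match {
    case (Some(text), Some(start), Some(stop))
        if start >= 0 && stop >= start && stop < text.length =>
      text.substring(start, stop + 1)
    case _ => ""
  }
}

// Hypothetical helper, not Spark's Divide expression.
object AnsiDivide {
  // Integer division that names the originating SQL fragment when it fails.
  def divide(left: Int, right: Int, origin: OriginWithText): Int = {
    if (right == 0) {
      val where =
        if (origin.fragment.isEmpty) "" else s" in expression: ${origin.fragment}"
      throw new ArithmeticException(s"divide by zero$where")
    }
    left / right
  }
}
```

For example, with `sqlText = Some("SELECT a / b FROM t")`, `startIndex = Some(7)`, and `stopIndex = Some(11)`, `fragment` returns the text `a / b`, which a runtime error can then quote back to the user.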

### Why are the changes needed?

Currently, there is not enough error context for runtime ANSI failures.

In the following example, the query contains several divisions, but the error message only says that there is a "divide by zero" error, without pointing out which part of the SQL statement caused it.
```
> SELECT
ss1.ca_county,
ss1.d_year,
ws2.web_sales / ws1.web_sales web_q1_q2_increase,
ss2.store_sales / ss1.store_sales store_q1_q2_increase,
ws3.web_sales / ws2.web_sales web_q2_q3_increase,
ss3.store_sales / ss2.store_sales store_q2_q3_increase
FROM
ss ss1, ss ss2, ss ss3, ws ws1, ws ws2, ws ws3
WHERE
ss1.d_qoy = 1
AND ss1.d_year = 2000
AND ss1.ca_county = ss2.ca_county
AND ss2.d_qoy = 2
AND ss2.d_year = 2000
AND ss2.ca_county = ss3.ca_county
AND ss3.d_qoy = 3
AND ss3.d_year = 2000
AND ss1.ca_county = ws1.ca_county
AND ws1.d_qoy = 1
AND ws1.d_year = 2000
AND ws1.ca_county = ws2.ca_county
AND ws2.d_qoy = 2
AND ws2.d_year = 2000
AND ws1.ca_county = ws3.ca_county
AND ws3.d_qoy = 3
AND ws3.d_year = 2000
AND CASE WHEN ws1.web_sales > 0
THEN ws2.web_sales / ws1.web_sales
ELSE NULL END
> CASE WHEN ss1.store_sales > 0
THEN ss2.store_sales / ss1.store_sales
ELSE NULL END
AND CASE WHEN ws2.web_sales > 0
THEN ws3.web_sales / ws2.web_sales
ELSE NULL END
> CASE WHEN ss2.store_sales > 0
THEN ss3.store_sales / ss2.store_sales
ELSE NULL END
ORDER BY ss1.ca_county
```

```
org.apache.spark.SparkArithmeticException: divide by zero
  at org.apache.spark.sql.errors.QueryExecutionErrors$.divideByZeroError(QueryExecutionErrors.scala:140)
  at org.apache.spark.sql.catalyst.expressions.DivModLike.eval(arithmetic.scala:437)
  at org.apache.spark.sql.catalyst.expressions.DivModLike.eval$(arithmetic.scala:425)
  at org.apache.spark.sql.catalyst.expressions.Divide.eval(arithmetic.scala:534)
```
This is the first PR for the project https://issues.apache.org/jira/browse/SPARK-38615

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit tests

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org
