Posted to issues@spark.apache.org by "Nattavut Sutyanyong (JIRA)" <ji...@apache.org> on 2016/09/01 16:00:22 UTC

[jira] [Commented] (SPARK-17348) Incorrect results from subquery transformation

    [ https://issues.apache.org/jira/browse/SPARK-17348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15455851#comment-15455851 ] 

Nattavut Sutyanyong commented on SPARK-17348:
---------------------------------------------

Here is a use case:

"Find customers who bought items that are higher than the average price of the same items on the period of last two years from the date the items were bought."

That is, starting from my example above, replace:
- IN with >
- the correlated predicate t1.c2 >= t2.c2 with a BETWEEN..AND predicate covering the two-year period before the item was bought
- T1 with the ITEM table
- T2 with the same ITEM table

and adding:
- perhaps a CUSTOMER table joined with ITEM on the parent side to pick up customer profiles
- another correlated equality predicate on ITEM_ID between the parent ITEM and the subquery ITEM (see the sketch below)
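
To make that concrete, here is a sketch of what such a query could look like, assuming hypothetical ITEM(cust_id, item_id, price, sale_date) and CUSTOMER(cust_id, name) tables; the schema, names, and exact syntax are illustrative only, not taken from the report. Note that this is precisely the class of correlated scalar subquery (a range correlation inside an aggregate) that Spark 2.0 either blocks or risks rewriting incorrectly:

{noformat}
sql("""
  SELECT c.cust_id, i.item_id, i.price
  FROM customer c
  JOIN item i
    ON i.cust_id = c.cust_id
  WHERE i.price > (SELECT AVG(i2.price)
                   FROM item i2
                   WHERE i2.item_id = i.item_id          -- correlated equality on ITEM_ID
                     AND i2.sale_date BETWEEN add_months(i.sale_date, -24)
                                          AND i.sale_date -- 2-year BETWEEN..AND window
                  )
""").show
{noformat}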

The bottom line is that Spark 2.0 supports only a subset of SQL subqueries; a good number of forms are blocked outright. This is, I believe, because of the lack of runtime capability to process subqueries, specifically correlated subqueries. The current implementation aggressively (and, as this case shows, sometimes incorrectly) converts correlated subqueries into some form of join. If we want to support the full class of SQL subqueries, we cannot assume that every form of subquery can be "de-correlated"; we need a fallback plan that executes subqueries at runtime.

This work is not a single PR. It needs a design first. The areas we need to work on are:

1. (Assuming I am right) Extend nested-loop join to support correlation processing, for both shallow and deep correlations; a minimal sketch of the idea follows this list.
2. Rework the representation in the logical plan so that it covers all the supported cases.
3. Rework and extend the "de-correlation" rewrite code.
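
To make item 1 concrete, here is a minimal, self-contained Scala sketch of the fallback strategy: evaluate the correlated subquery once per outer row, nested-loop style. This is not Spark's API or planner code, just the intended semantics applied to the data from this report:

{noformat}
case class Row(c1: Int, c2: Int)

// For each outer row, re-evaluate the subquery
//   SELECT MAX(t2.c1) FROM t2 WHERE outer.c2 >= t2.c2
// and test the IN predicate against its result.
def nestedLoopCorrelated(t1: Seq[Row], t2: Seq[Row]): Seq[Row] =
  t1.filter { outer =>
    val qualifying = t2.filter(inner => outer.c2 >= inner.c2).map(_.c1)
    qualifying.nonEmpty && outer.c1 == qualifying.max
  }

// With the data below, the subquery yields MAX(t2.c1) = 2 for the only
// outer row, and 1 IN (2) is false, so the result is empty -- as expected.
nestedLoopCorrelated(Seq(Row(1, 1)), Seq(Row(1, 1), Row(2, 0)))
{noformat}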

> Incorrect results from subquery transformation
> ----------------------------------------------
>
>                 Key: SPARK-17348
>                 URL: https://issues.apache.org/jira/browse/SPARK-17348
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Nattavut Sutyanyong
>              Labels: correctness
>
> {noformat}
> Seq((1,1)).toDF("c1","c2").createOrReplaceTempView("t1")
> Seq((1,1),(2,0)).toDF("c1","c2").createOrReplaceTempView("t2")
> sql("select c1 from t1 where c1 in (select max(t2.c1) from t2 where t1.c2 >= t2.c2)").show
> +---+
> | c1|
> +---+
> |  1|
> +---+
> {noformat}
> The correct result of the above query should be an empty set. Here is an explanation:
> Both rows from T2 satisfy the correlated predicate T1.C2 >= T2.C2 for the only T1 row (C1 = 1, C2 = 1), so both rows need to be processed in the same group of the subquery's aggregation. The aggregation yields MAX(T2.C1) = 2. The predicate T1.C1 IN (MAX(T2.C1)), i.e., 1 IN (2), therefore evaluates to false, and the query should return an empty set.
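>
> One way to check the expected semantics is to substitute the outer value T1.C2 = 1 into the subquery by hand, making it uncorrelated (this rewritten query is for illustration only):
> {noformat}
> sql("select c1 from t1 where c1 in (select max(t2.c1) from t2 where 1 >= t2.c2)").show
> {noformat}
> With the correlation substituted away, the subquery returns 2, and the result should come back empty.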


