Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2019/05/23 08:01:27 UTC

[GitHub] [spark] francis0407 edited a comment on issue #24344: [SPARK-27440][SQL] Optimize uncorrelated predicate subquery

URL: https://github.com/apache/spark/pull/24344#issuecomment-495108399

Thanks @dilipbiswal @cloud-fan.
I'm fine with trying these; I just want to contribute to the project. But I think we should discuss this in more depth here, to figure out which approach is better.
First, let me summarize what we have discussed in this PR so far.

* At the beginning, I tried to transform every predicate subquery into EXISTS. But in https://github.com/apache/spark/pull/24344#issuecomment-483642974, we found a bug in the current implementation of InSubquery (it does not handle nulls correctly), and opened a new issue, [SPARK-27572](https://issues.apache.org/jira/browse/SPARK-27572?filter=-2). In short, not every InSubquery can be converted to a semi/anti join or to Exists (see the example in https://github.com/apache/spark/pull/24344#issuecomment-483642974). After realizing this, I gave up on converting InSubquery to Exists and instead tried adding a physical plan for them.

* The other discussion is about optimizing non-correlated subqueries. I tried optimizing EXISTS with `project 1, limit 1` to reduce the result set, and optimizing InSubquery with `push down the left value, project the equation and use 'distinct'`. With these optimizations, the size of the result set can only be **1 or 2 (null or true)**, and all of the computation is done on the executor side. But after @cloud-fan's reminder, I realized that this can be generalized to semi/anti joins.
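
To make the two points above concrete, here is a minimal Python sketch of standard SQL three-valued logic, with `None` standing in for NULL. It is not Spark code and not the example from the linked comment. All function names are hypothetical, and dropping FALSE rows before the `distinct` is my assumption, made to match the "null or true" remark.

```python
from typing import Iterable, List, Optional, Set


def sql_eq(a, b) -> Optional[bool]:
    """SQL equality: a NULL (None) on either side yields NULL."""
    if a is None or b is None:
        return None
    return a == b


def sql_in(value, rows: Iterable) -> Optional[bool]:
    """Reference three-valued `value IN (rows)`."""
    results = [sql_eq(value, r) for r in rows]
    if any(r is True for r in results):
        return True
    if any(r is None for r in results):
        return None  # no match, but a NULL comparison makes the answer unknown
    return False


def exists_eq(value, rows: Iterable) -> bool:
    """Two-valued `EXISTS (SELECT 1 FROM rows r WHERE r = value)`."""
    return any(sql_eq(value, r) is True for r in rows)


def reduce_exists(rows: Iterable) -> List[int]:
    """`project 1, limit 1`: at most one row is ever returned."""
    for _ in rows:
        return [1]
    return []


def reduce_in(value, rows: Iterable) -> Set[Optional[bool]]:
    """Project the equality, drop FALSE (assumption), keep distinct values.

    The result is always a subset of {True, None}, so at most two rows.
    """
    return {r for r in (sql_eq(value, x) for x in rows) if r is not False}


def in_from_reduced(reduced: Set[Optional[bool]]) -> Optional[bool]:
    """Recover the three-valued IN result from the reduced set."""
    if True in reduced:
        return True
    if None in reduced:
        return None
    return False


# IN yields NULL here, while the EXISTS rewrite yields FALSE.
print(sql_in(2, [1, None]), exists_eq(2, [1, None]))
```

In a top-level `WHERE` conjunct both NULL and FALSE drop the row, so the difference is invisible there. It shows up once the predicate is negated or combined with other expressions, which is why the rewrite to EXISTS is not always safe.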

Now let's discuss the optimization of non-correlated semi/anti joins.

If I'm not mistaken, I think @cloud-fan's 'turn this join to a filter' means we can use a physical plan for the non-correlated semi join (which is actually the same as EXISTS). That is a great idea, much better than the approach used in this PR! It's more general and extensible, and it will still work once the NULL bug is fixed.
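
As I understand the 'turn this join to a filter' idea, a non-correlated semi/anti join gives the same answer for every outer row, so the right side only needs to be checked once. Below is a minimal Python sketch of that reading, not Spark's physical plan; the function name and the `anti` parameter are hypothetical.

```python
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def semi_join_as_filter(outer_rows: Iterable[T],
                        subquery_rows: Iterable,
                        anti: bool = False) -> List[T]:
    """Non-correlated semi/anti join evaluated as a constant filter.

    The subquery does not reference the outer row, so it is checked once
    for emptiness and the outcome is applied to every outer row.
    """
    non_empty = any(True for _ in subquery_rows)  # stops at the first row
    keep_all = non_empty != anti  # semi keeps rows if non-empty, anti if empty
    return list(outer_rows) if keep_all else []


print(semi_join_as_filter([1, 2, 3], [42]))             # semi join, non-empty right side
print(semi_join_as_filter([1, 2, 3], [42], anti=True))  # anti join, non-empty right side
```

The emptiness check stops at the first row, which plays the same role as the `limit 1` used for EXISTS earlier in this comment.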

I suggest we close this PR and the issue, and open a new one for the optimization of non-correlated semi/anti joins (hmm... not sure about the name). What do you think?
