You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Nattavut Sutyanyong (JIRA)" <ji...@apache.org> on 2016/11/24 20:58:58 UTC

[jira] [Commented] (SPARK-18582) Whitelist LogicalPlan operators allowed in correlated subqueries

    [ https://issues.apache.org/jira/browse/SPARK-18582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15694202#comment-15694202 ] 

Nattavut Sutyanyong commented on SPARK-18582:
---------------------------------------------

[~hvanhovell] has broken down the whitelist further to:

# {{LeafNode}} s should not be a problem. We don't need to explicitly handle them
# We should allow the following {{UnaryNode}}: {{Project}}, {{Filter}}, {{Aggregate}}, {{SubqueryAlias}}, {{Distinct}}, {{Generate}} (only when {{join=true}}), {{BroadcastHint}}, {{Sort}}, {{Repartition}} & {{RedistributeData}} (parent of {{SortPartitions}} and {{RepartitionByExpression}}). We need to find out what other systems allow for {{Window}}.
# The only {{BinaryNode}} we should allow is {{Join}} with special cases for {{Left/Right/Full}}. We should also make sure that the {{LeftAnti}} and {{LeftSemi}} are handled properly.


> Whitelist LogicalPlan operators allowed in correlated subqueries
> ----------------------------------------------------------------
>
>                 Key: SPARK-18582
>                 URL: https://issues.apache.org/jira/browse/SPARK-18582
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Nattavut Sutyanyong
>
> We want to tighten the code that handles correlated subquery to whitelist operators that are allowed in it.
> The current code in {{def pullOutCorrelatedPredicates}} looks like
> {code}
>       // Simplify the predicates before pulling them out.
>       val transformed = BooleanSimplification(sub) transformUp {
>         case f @ Filter(cond, child) => ...
>         case p @ Project(expressions, child) => ...
>         case a @ Aggregate(grouping, expressions, child) => ...
>         case w : Window => ...
>         case j @ Join(left, _, RightOuter, _) => ...
>         case j @ Join(left, right, FullOuter, _) => ...
>         case j @ Join(_, right, jt, _) if !jt.isInstanceOf[InnerLike] => ...
>         case u: Union => ...
>         case s: SetOperation => ...
>         case e: Expand => ...
>         case l : LocalLimit => ...
>         case g : GlobalLimit => ...
>         case s : Sample => ...
>         case p =>
>           failOnOuterReference(p)
>           ...
>       }
> {code}
> The code disallows operators in a sub plan of an operator hosting correlation on a case by case basis. As it is today, it only blocks {{Union}}, {{Intersect}}, {{Except}}, {{Expand}} {{LocalLimit}} {{GlobalLimit}} {{Sample}} {{FullOuter}} and right table of {{LeftOuter}} (and left table of {{RightOuter}}). That means any {{LogicalPlan}} operators that are not in the list above are permitted to be under a correlation point. Is this risky? There are many (30+ at least from browsing the {{LogicalPlan}} type hierarchy) operators derived from {{LogicalPlan}} class.
> For the case of {{ScalarSubquery}}, it explicitly checks that only {{SubqueryAlias}} {{Project}} {{Filter}} {{Aggregate}} are allowed ({{CheckAnalysis.scala}} around line 126-165 in and after {{def cleanQuery}}). We should whitelist which operators are allowed in correlated subqueries. At my first glance, we should allow, in addition to the ones allowed in {{ScalarSubquery}}: {{Join}}, {{Distinct}}, {{Sort}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org