Posted to issues@spark.apache.org by "Asif (Jira)" <ji...@apache.org> on 2020/10/14 21:17:00 UTC

[jira] [Created] (SPARK-33152) Constraint Propagation code causes OOM issues or increases compilation time to hours

Asif created SPARK-33152:
----------------------------

             Summary: Constraint Propagation code causes OOM issues or increases compilation time to hours
                 Key: SPARK-33152
                 URL: https://issues.apache.org/jira/browse/SPARK-33152
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.4.0
            Reporter: Asif


We encountered this issue at Workday. 

The issue is that the current constraint propagation code pessimistically generates all the possible combinations of a base constraint for the aliases in the Project node.

This causes a blow-up in the number of constraints generated, leading to OOM issues at SQL query compile time, or queries taking 18 minutes to 2 hours to compile.

The problematic piece of code is in LogicalPlan.getAliasedConstraints:

projectList.foreach {
  case a @ Alias(l: Literal, _) =>
    allConstraints += EqualNullSafe(a.toAttribute, l)
  case a @ Alias(e, _) =>
    // For every alias in `projectList`, replace the reference in
    // constraints by its attribute.
    allConstraints ++= allConstraints.map(_ transform {
      case expr: Expression if expr.semanticEquals(e) =>
        a.toAttribute
    })
    allConstraints += EqualNullSafe(e, a.toAttribute)
  case _ => // Don't change.
}

So consider a hypothetical plan:

 

Project (a, a as a1, a as a2, a as a3, b, b as b1, b as b2, c, c as c1, c as c2, c as c3)
   |
Filter f(a, b, c)
   |
Base Relation (a, b, c)

and so we have the projection as

a, a1, a2, a3
b, b1, b2
c, c1, c2, c3

Let's say, hypothetically, that f(a, b, c) has a occurring once, b occurring twice, and c occurring three times.

So at the Project node, the number of constraints generated for the single base constraint f(a, b, c) will be

4C1 * 3C2 * 4C3 = 48
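
To make the blow-up concrete, here is a rough, self-contained Scala sketch (plain strings instead of Spark expressions; the object and type names are invented for illustration, this is not Spark code) that mimics the accumulation pattern of getAliasedConstraints above and reproduces the 48 constraints for this plan:

object ConstraintBlowUp {
  // A "constraint" here is just the attribute used for each of the three
  // base columns referenced by f(_, _, _).
  type Constraint = Map[String, String] // base attribute -> attribute used

  def main(args: Array[String]): Unit = {
    val base: Constraint = Map("a" -> "a", "b" -> "b", "c" -> "c")
    // Aliases introduced by the Project node, in projection order.
    val aliases = Seq(
      "a" -> "a1", "a" -> "a2", "a" -> "a3",
      "b" -> "b1", "b" -> "b2",
      "c" -> "c1", "c" -> "c2", "c" -> "c3")

    var allConstraints: Set[Constraint] = Set(base)
    // For every alias, add a copy of every existing constraint with the base
    // attribute replaced by that alias -- the same accumulation pattern as
    // getAliasedConstraints.
    for ((attr, alias) <- aliases) {
      allConstraints ++= allConstraints.collect {
        case c if c(attr) == attr => c.updated(attr, alias)
      }
    }
    println(allConstraints.size) // prints 48
  }
}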

In our case, we have seen the number of constraints go above 30000, as there are complex case statements in the projection.

Spark generates all these constraints pessimistically so that it can prune filters or push down predicates for joins that it may encounter as the optimizer traverses up the tree.

 

This issue is solved at our end by modifying the Spark code to use a different logic.

The idea is simple. 

Instead of pessimistically generating all possible combinations of the base constraint, just store the original base constraints and track the aliases at each level.

The principle followed is this:

1) Store the base constraint and keep track of the aliases for the underlying attributes.

2) If a base attribute composing the constraint is not in the output set, see if the constraint survives by substituting the attribute being removed with the next available alias's attribute (a sketch of this bookkeeping follows below).
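
A minimal sketch of that bookkeeping, assuming a simplified representation where attributes are plain strings and TrackedConstraint / pruneToOutput are names invented for illustration (this is not the actual patch):

// Each base attribute of a constraint has a tracking list: the base
// attribute itself followed by the attributes of its aliases.
final case class TrackedConstraint(
    usedAttrs: Map[String, String],      // base attribute -> attribute currently used
    tracking: Map[String, List[String]]) // base attribute -> base attribute :: aliases

object TrackedConstraint {
  // Principle 2: if an attribute used by the constraint is dropped from the
  // node's output, the constraint survives only if one of its aliases is
  // still in the output; substitute the first such alias.
  def pruneToOutput(c: TrackedConstraint, output: Set[String]): Option[TrackedConstraint] = {
    val remapped = c.usedAttrs.map { case (base, used) =>
      val survivor =
        if (output.contains(used)) Some(used)
        else c.tracking(base).find(output.contains)
      base -> survivor
    }
    if (remapped.values.exists(_.isEmpty)) None // the constraint cannot survive
    else Some(c.copy(usedAttrs = remapped.map { case (b, s) => b -> s.get }))
  }
}

The real implementation would of course operate on Spark Expressions and AttributeSets rather than strings; the sketch only shows the shape of the tracking data and the survival rule.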

 

For checking whether a filter can be pruned, just canonicalize the filter with the attribute at the 0th position of the tracking list and compare it with the underlying base constraint.

To elaborate using the plan above:

At the Project node we have the constraint f(a, b, c), and we keep track of the aliases:

List 1: a, a1.attribute, a2.attribute, a3.attribute
List 2: b, b1.attribute, b2.attribute
List 3: c, c1.attribute, c2.attribute, c3.attribute

Let's say that above the Project node we encounter a filter

f(a1, b2, c3)

So canonicalize the filter using the above list data to convert it to f(a, b, c), and compare it with the stored base constraints.
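
A hypothetical sketch of that check, again with attributes as plain strings and invented names (FilterPruningCheck, canonicalize), just to show the canonicalize-then-compare step:

object FilterPruningCheck {
  // Tracking lists at the Project node; position 0 holds the base attribute.
  val tracking: Map[String, List[String]] = Map(
    "a" -> List("a", "a1", "a2", "a3"),
    "b" -> List("b", "b1", "b2"),
    "c" -> List("c", "c1", "c2", "c3"))

  // Map every attribute back to the 0th element of its tracking list.
  def canonicalize(attrs: Seq[String]): Seq[String] =
    attrs.map { attr =>
      tracking.collectFirst { case (_, list) if list.contains(attr) => list.head }
        .getOrElse(attr)
    }

  def main(args: Array[String]): Unit = {
    val baseConstraint = Seq("a", "b", "c") // stored base constraint f(a, b, c)
    val filter = Seq("a1", "b2", "c3")      // filter encountered above the Project
    // The filter is redundant (prunable) iff its canonical form matches a stored constraint.
    println(canonicalize(filter) == baseConstraint) // prints true
  }
}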

 

For predicate push down, instead of generating all the redundant combinations of constraints, just generate one constraint per alias element.

In any case, in the current Spark code, filter push down happens only for one variable at a time.

So just expanding the filter f(a, b, c) to

f(a, b, c), f(a1, b, c), f(a2, b, c), f(a3, b, c), f(a, b1, c), f(a, b2, c), f(a, b, c1), f(a, b, c2), f(a, b, c3)

would suffice, rather than generating all the redundant combinations.
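
A rough sketch of that expansion (plain strings, invented object name), showing one substitution at a time instead of the full cross product of alias choices:

object PushdownExpansion {
  def main(args: Array[String]): Unit = {
    val base = Seq("a", "b", "c") // base constraint f(a, b, c)
    val aliases = Map(
      "a" -> Seq("a1", "a2", "a3"),
      "b" -> Seq("b1", "b2"),
      "c" -> Seq("c1", "c2", "c3"))

    // One extra constraint per alias: replace a single attribute, keep the rest.
    val expanded = Seq(base) ++ base.zipWithIndex.flatMap { case (attr, i) =>
      aliases.getOrElse(attr, Seq.empty).map(alias => base.updated(i, alias))
    }
    expanded.foreach(c => println(c.mkString("f(", ", ", ")")))
    println(expanded.size) // prints 9, versus 48 for the full cross product
  }
}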

In fact, the code can easily be modified to generate only those constraints which involve the variables forming the join condition, so the number of constraints generated on expansion is further reduced.

We already have code to generate compound filters for push down (join on multiple conditions), which can be used for single-variable-condition push down too.

Just to elaborate the logic further, consider the above hypothetical plan extended with a second Project node on top (assume the CollapseProject rule does not fire):

 

Project (a1, a1 as a4, b, c1, c1 as c4)
   |
Project (a, a as a1, a as a2, a as a3, b, b as b1, b as b2, c, c as c1, c as c2, c as c3)
   |
Filter f(a, b, c)
   |
Base Relation (a, b, c)

 

So at Project node 2, the constraints data will become (we keep track of the aliases):

List 1: a1, a4.attribute
List 2: b
List 3: c1, c4.attribute

And the constraint f(a, b, c) is modified to f(a1, b, c1).
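
A hypothetical sketch of this remapping at Project node 2 (plain strings, invented name), where attributes no longer in the output are dropped from the tracking lists and replaced in the stored constraint by their first surviving alias:

object SecondProjectRemap {
  def main(args: Array[String]): Unit = {
    val tracking = Map(
      "a" -> List("a", "a1", "a2", "a3"),
      "b" -> List("b", "b1", "b2"),
      "c" -> List("c", "c1", "c2", "c3"))
    val output = Set("a1", "a4", "b", "c1", "c4") // output of Project node 2

    // Keep only attributes still in the output and append the new aliases.
    val newTracking = Map(
      "a" -> (tracking("a").filter(output) :+ "a4"),
      "b" -> tracking("b").filter(output),
      "c" -> (tracking("c").filter(output) :+ "c4"))
    println(newTracking) // a -> List(a1, a4), b -> List(b), c -> List(c1, c4)

    // The stored constraint f(a, b, c) is remapped to the heads of the lists.
    val remapped = Seq("a", "b", "c").map(attr => newTracking(attr).head)
    println(remapped.mkString("f(", ", ", ")")) // prints f(a1, b, c1)
  }
}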

 

The above logic is also more effective at filter pruning than the current code, as some of our tests show, besides adding minimal overhead for constraint propagation.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org