Posted to issues@spark.apache.org by "Josh Rosen (Jira)" <ji...@apache.org> on 2021/12/02 23:36:00 UTC

[jira] [Commented] (SPARK-37392) Catalyst optimizer very time-consuming and memory-intensive with some "explode(array)"

    [ https://issues.apache.org/jira/browse/SPARK-37392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17452656#comment-17452656 ] 

Josh Rosen commented on SPARK-37392:
------------------------------------

When I ran this in {{spark-shell}} it triggered an OOM in {{LogicalPlan.constraints()}}. In the heap dump I spotted an {{ExpressionSet}} with over 100,000 expressions.

Based on the constraints that I saw I think that they were introduced by the {{InferFiltersFromGenerate}} rule and that some sort of unexpected rule interaction is resulting in a huge blowup of derived constraints in {{PruneFilters}}. The {{InferFiltersFromGenerate}} rule was introduced in SPARK-32295 / Spark 3.1.0, which could explain why this issue isn't reproducible in Spark 2.4.1.
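
A quick way to test that theory (a sketch, not something I've run for this ticket): exclude the suspected rule via {{spark.sql.optimizer.excludedRules}} and re-run the reproduction, assuming the rule is excludable in this version:

{code:scala}
// Sketch: exclude the suspected rule, then re-run the reproduction query.
// spark.sql.optimizer.excludedRules only takes effect for rules marked excludable.
spark.conf.set(
  "spark.sql.optimizer.excludedRules",
  "org.apache.spark.sql.catalyst.optimizer.InferFiltersFromGenerate")
{code}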

Looking at the constraints in the huge {{ExpressionSet}}, it looks like the vast majority (> 99%) of the constraints are {{GreaterThan}} or {{LessThan}} constraints of the form:
 * {{GreaterThan(Size(CreateArray(...)), Literal(0))}}
 * {{LessThan(Literal(0), Size(CreateArray(...)))}}

I think the {{GreaterThan}} comes from {{InferFiltersFromGenerate}} and suspect that the {{LessThan}} equivalents are introduced via expression canonicalization.
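
For illustration (a hedged sketch I put together, not taken from the heap dump), the two shapes look roughly like this when built directly from Catalyst expressions; {{ExpressionSet}} is supposed to dedupe by canonicalized form, so if both shapes really survive side by side, canonicalization is presumably not unifying them:

{code:scala}
import org.apache.spark.sql.catalyst.expressions._

// The two constraint shapes: both say "the generated array is non-empty".
val arr = CreateArray(Seq(Literal(1), Literal(2)))
val gt  = GreaterThan(Size(arr), Literal(0))  // size(array(1, 2)) > 0
val lt  = LessThan(Literal(0), Size(arr))     // 0 < size(array(1, 2))

// ExpressionSet dedupes on Expression.canonicalized; a size of 1 means the two
// shapes collapse into one constraint, 2 means they are kept as distinct entries.
println(gt.canonicalized == lt.canonicalized)
println(ExpressionSet(Seq(gt, lt)).size)
{code}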

We'll need to dig a bit deeper to figure out what's leading to this buildup of duplicate constraints. Perhaps it's some sort of interaction between this particular shape of constraint, the constraint propagation system, and canonicalization? I'm not sure yet.

> Catalyst optimizer very time-consuming and memory-intensive with some "explode(array)" 
> ---------------------------------------------------------------------------------------
>
>                 Key: SPARK-37392
>                 URL: https://issues.apache.org/jira/browse/SPARK-37392
>             Project: Spark
>          Issue Type: Bug
>          Components: Optimizer
>    Affects Versions: 3.1.2, 3.2.0
>            Reporter: Francois MARTIN
>            Priority: Major
>
> The problem occurs with the simple code below:
> {code:scala}
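> // `session` is an existing SparkSession (in spark-shell, use `spark` instead)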
> import session.implicits._
> Seq(
>   (1, "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x")
> ).toDF()
>   .checkpoint() // or save and reload to truncate lineage
>   .createOrReplaceTempView("sub")
> session.sql("""
>   SELECT
>     *
>   FROM
>   (
>     SELECT
>       EXPLODE( ARRAY( * ) ) result
>     FROM
>     (
>       SELECT
>         _1 a, _2 b, _3 c, _4 d, _5 e, _6 f, _7 g, _8 h, _9 i, _10 j, _11 k, _12 l, _13 m, _14 n, _15 o, _16 p, _17 q, _18 r, _19 s, _20 t, _21 u
>       FROM
>         sub
>     )
>   )
>   WHERE
>     result != ''
>   """).show() {code}
> It takes several minutes and uses a very large amount of Java heap, when it should complete immediately.
> It does not occur when the single integer value (1) is replaced with a string value (_"x"_).
> All the time is spent in the _PruneFilters_ optimization rule.
> Not reproduced in Spark 2.4.1.


