Posted to issues@spark.apache.org by "Wenchen Fan (Jira)" <ji...@apache.org> on 2021/02/10 03:23:00 UTC

[jira] [Resolved] (SPARK-34137) The tree string does not contain statistics for nested scalar sub queries

     [ https://issues.apache.org/jira/browse/SPARK-34137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan resolved SPARK-34137.
---------------------------------
    Fix Version/s: 3.2.0
       Resolution: Fixed

Issue resolved by pull request 31485
[https://github.com/apache/spark/pull/31485]

> The tree string does not contain statistics for nested scalar sub queries
> -------------------------------------------------------------------------
>
>                 Key: SPARK-34137
>                 URL: https://issues.apache.org/jira/browse/SPARK-34137
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.2.0
>            Reporter: Yuming Wang
>            Assignee: Yuming Wang
>            Priority: Major
>             Fix For: 3.2.0
>
>
> How to reproduce:
> {code:scala}
> spark.sql("create table t1 using parquet as select id as a, id as b from range(1000)")
> spark.sql("create table t2 using parquet as select id as c, id as d from range(2000)")
> spark.sql("ANALYZE TABLE t1 COMPUTE STATISTICS FOR ALL COLUMNS")
> spark.sql("ANALYZE TABLE t2 COMPUTE STATISTICS FOR ALL COLUMNS")
> spark.sql("set spark.sql.cbo.enabled=true")
> spark.sql(
>   """
>     |WITH max_store_sales AS
>     |  (SELECT max(csales) tpcds_cmax
>     |  FROM (SELECT
>     |    sum(b) csales
>     |  FROM t1 WHERE a < 100 ) x),
>     |best_ss_customer AS
>     |  (SELECT
>     |    c
>     |  FROM t2
>     |  WHERE d > (SELECT * FROM max_store_sales))
>     |
>     |SELECT c FROM best_ss_customer
>     |""".stripMargin).explain("cost")
> {code}
> Output (the Aggregate, Project, and Filter nodes under the scalar subquery carry no Statistics):
> {noformat}
> == Optimized Logical Plan ==
> Project [c#4263L], Statistics(sizeInBytes=31.3 KiB, rowCount=2.00E+3)
> +- Filter (isnotnull(d#4264L) AND (d#4264L > scalar-subquery#4262 [])), Statistics(sizeInBytes=46.9 KiB, rowCount=2.00E+3)
>    :  +- Aggregate [max(csales#4260L) AS tpcds_cmax#4261L]
>    :     +- Aggregate [sum(b#4266L) AS csales#4260L]
>    :        +- Project [b#4266L]
>    :           +- Filter ((a#4265L < 100) AND isnotnull(a#4265L))
>    :              +- Relation default.t1[a#4265L,b#4266L] parquet, Statistics(sizeInBytes=23.4 KiB, rowCount=1.00E+3)
>    +- Relation default.t2[c#4263L,d#4264L] parquet, Statistics(sizeInBytes=46.9 KiB, rowCount=2.00E+3)
> {noformat}
> TPC-DS q23a shows the same problem.


