Posted to issues@spark.apache.org by "Gerben van der Huizen (Jira)" <ji...@apache.org> on 2022/11/29 15:11:00 UTC

[jira] [Created] (SPARK-41322) Optimized query plan cost/statistics overview

Gerben van der Huizen created SPARK-41322:
---------------------------------------------

             Summary: Optimized query plan cost/statistics overview
                 Key: SPARK-41322
                 URL: https://issues.apache.org/jira/browse/SPARK-41322
             Project: Spark
          Issue Type: Improvement
          Components: GraphX, SQL
    Affects Versions: 3.3.0
            Reporter: Gerben van der Huizen


*Motivation*

Spark SQL supports running the `EXPLAIN COST` statement on a query to show the optimized logical plan and its data costs per stage (i.e. statistics) [https://spark.apache.org/docs/latest/sql-ref-syntax-qry-explain.html]. However, it can currently be difficult to determine the total data read cost of a complex query with many stages. Other query engines such as Trino/Presto attempt to provide a general estimate of a query's resource costs when the `EXPLAIN` statement is run, including CPU, memory, row count, and data size [https://trino.io/docs/current/optimizer/cost-in-explain.html].

*Proposal*

We suggest adding an overview/estimate of the total resources that will be used to the optimized logical plan of a Spark query or, as an alternative, providing this overview/estimate when the `EXPLAIN COST` statement is called on a query. As a first version, it would already be beneficial if this general cost estimate included anything that is available within the statistics of the optimized query plan, such as:
 * The amount of data that will be read, in bytes
 * The total number of rows
 * etc.

Given that the optimized logical plan is divided into stages, it would already be sufficient to show these parameters per stage so that they can be aggregated for the entire job later on if needed.
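
To illustrate the kind of aggregation meant here, below is a minimal sketch in plain Python. It does not use any Spark API: `PlanNode`, `total_read_estimate`, and all numbers are hypothetical and only stand in for the per-node size and row count statistics of an optimized plan. It assumes that the total data read is the sum over the leaf (scan) nodes and that a row count may be missing for some nodes.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class PlanNode:
    """Hypothetical stand-in for one node of an optimized logical plan."""
    name: str
    size_in_bytes: int                 # estimated output size of this node
    row_count: Optional[int] = None    # may be unknown if no row statistics exist
    children: List["PlanNode"] = field(default_factory=list)


def total_read_estimate(node: PlanNode) -> Tuple[int, Optional[int]]:
    """Sum the estimates of the leaf nodes (the scans) below `node`.

    Returns (bytes, rows); rows is None if any leaf lacks a row count.
    """
    if not node.children:
        return node.size_in_bytes, node.row_count
    total_bytes = 0
    total_rows: Optional[int] = 0
    for child in node.children:
        child_bytes, child_rows = total_read_estimate(child)
        total_bytes += child_bytes
        # One unknown row count makes the overall row total unknown.
        if total_rows is None or child_rows is None:
            total_rows = None
        else:
            total_rows += child_rows
    return total_bytes, total_rows


# Made-up numbers for a join of a filtered scan and a plain scan.
plan = PlanNode("Join", 4_000, 50, [
    PlanNode("Filter", 2_000, 100, [PlanNode("Scan orders", 10_000, 500)]),
    PlanNode("Scan customers", 3_000, 200),
])

print(total_read_estimate(plan))
```

Only the leaves are summed because adding up every node would count the same data several times as it flows through the plan. Whether the overview should report leaf totals, per-stage values, or both is open for discussion.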


