Posted to issues@spark.apache.org by "gabrywu (Jira)" <ji...@apache.org> on 2022/03/06 03:34:00 UTC

[jira] [Commented] (SPARK-38258) [proposal] collect & update statistics automatically when spark SQL is running

    [ https://issues.apache.org/jira/browse/SPARK-38258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17501864#comment-17501864 ] 

gabrywu commented on SPARK-38258:
---------------------------------

[~yumwang] what do you think of it?

> [proposal] collect & update statistics automatically when spark SQL is running
> ------------------------------------------------------------------------------
>
>                 Key: SPARK-38258
>                 URL: https://issues.apache.org/jira/browse/SPARK-38258
>             Project: Spark
>          Issue Type: Wish
>          Components: Spark Core, SQL
>    Affects Versions: 2.4.0, 3.0.0, 3.1.0, 3.2.0
>            Reporter: gabrywu
>            Priority: Minor
>
> Table and column statistics are very important to the Spark SQL optimizer. However, we currently have to collect and update them manually using
> {code:sql}
> analyze table tableName compute statistics{code}
> This is inconvenient, so why not {color:#ff0000}collect & update statistics automatically{color} when a Spark stage runs and finishes?
> For example, when an INSERT OVERWRITE TABLE statement finishes, we can update the corresponding table's statistics using SQL metrics. The Spark SQL optimizer can then use these statistics in subsequent queries.
> It is common to run daily batch jobs with Spark SQL, so the same SQL runs every day while the SQL and the data in its tables change slowly. That means we can use the statistics updated yesterday to optimize today's queries, and we can also use them to adjust important configs such as spark.sql.shuffle.partitions.
> So we should add a mechanism that stores every stage's statistics somewhere and uses them in new queries, rather than only collecting statistics after a stage finishes.
> We should also {color:#ff0000}add a version number to the statistics{color} in case they become outdated.
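
The last points of the proposal (storing statistics after each write, versioning them, and reusing them to tune configs) could be sketched roughly as follows. This is a minimal illustration in plain Python, not Spark code. `TableStats`, `StatsStore`, `suggest_shuffle_partitions`, and the 128 MiB target partition size are all hypothetical names and values chosen for the example; the proposal does not say where statistics would be stored or how versions would be assigned.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class TableStats:
    """Statistics recorded after a write finishes (hypothetical structure)."""
    row_count: int
    size_in_bytes: int
    version: int  # bumped on every write so outdated entries can be detected


class StatsStore:
    """In-memory stand-in for wherever the statistics would be persisted."""

    def __init__(self) -> None:
        self._stats: Dict[str, TableStats] = {}

    def record_write(self, table: str, rows_written: int, bytes_written: int) -> TableStats:
        """Replace the table's statistics after an overwrite, bumping the version."""
        previous = self._stats.get(table)
        next_version = previous.version + 1 if previous else 1
        stats = TableStats(rows_written, bytes_written, next_version)
        self._stats[table] = stats
        return stats

    def lookup(self, table: str, min_version: int = 1) -> Optional[TableStats]:
        """Return statistics only if they are at least as new as min_version."""
        stats = self._stats.get(table)
        if stats is None or stats.version < min_version:
            return None
        return stats


def suggest_shuffle_partitions(stats: TableStats,
                               target_partition_bytes: int = 128 * 1024 * 1024) -> int:
    """Derive a partition count from the last recorded size.

    The 128 MiB target per partition is an assumption made for this sketch.
    """
    # Ceiling division, with a floor of one partition.
    return max(1, -(-stats.size_in_bytes // target_partition_bytes))
```

In a daily batch, yesterday's run would call `record_write` when its insert overwrite finishes, and today's run would call `lookup` before planning, falling back to defaults when no sufficiently recent version exists.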



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
