Posted to issues@spark.apache.org by "Abhinav Kumar (Jira)" <ji...@apache.org> on 2023/10/19 02:28:00 UTC

[jira] [Commented] (SPARK-44817) SPIP: Incremental Stats Collection

    [ https://issues.apache.org/jira/browse/SPARK-44817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17776976#comment-17776976 ] 

Abhinav Kumar commented on SPARK-44817:
---------------------------------------

[~rakson] [~gurwls223] [~cloud_fan] - We find this issue quite common. Currently, the incremental stats collection is done mostly outside the Spark application as an end-of-day process (to avoid SLA breaches), and sometimes within the current application if the DML materially changes the stats. This proposal seems like a good idea, as long as users can control it via a Spark parameter.

Views?

> SPIP: Incremental Stats Collection
> ----------------------------------
>
>                 Key: SPARK-44817
>                 URL: https://issues.apache.org/jira/browse/SPARK-44817
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.5.0, 4.0.0
>            Reporter: Rakesh Raushan
>            Priority: Major
>
> Spark's cost-based optimizer (CBO) depends on table and column statistics.
> After every execution of a DML query, table and column stats are invalidated if automatic stats update is not turned on. To keep stats up to date, we need to run the `ANALYZE TABLE COMPUTE STATISTICS` command, which is very expensive. It is not feasible to run this command after every DML query.
> Instead, we can incrementally update the stats during each DML query run itself. This way, our table and column stats would stay fresh at all times and the CBO's benefits would apply. Initially, we can update only table-level stats and gradually start updating column-level stats as well.
> *Pros:*
> 1. Optimizes queries over tables that are updated frequently.
> 2. Saves compute cycles by removing the dependency on `ANALYZE TABLE COMPUTE STATISTICS` for updating stats.
> [SPIP Document |https://docs.google.com/document/d/1CNPWg_L1fxfB4d2m6xfizRyYRoWS2uPCwTKzhL2fwaQ/edit?usp=sharing]
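
To make the idea in the quoted description concrete, here is a minimal sketch of incrementally maintaining table-level stats from per-DML deltas, instead of invalidating them and rescanning the table. It is plain Python, not Spark code, and it is not taken from the SPIP document: `TableStats`, `DmlMetrics`, `apply_dml`, and the `incremental` flag (standing in for a user-controlled Spark parameter) are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TableStats:
    """Table-level stats; None means unknown or invalidated."""
    row_count: Optional[int]
    size_in_bytes: Optional[int]


@dataclass(frozen=True)
class DmlMetrics:
    """Hypothetical per-query deltas that a DML command could report."""
    rows_added: int = 0
    rows_removed: int = 0
    bytes_added: int = 0
    bytes_removed: int = 0


def apply_dml(stats: TableStats, m: DmlMetrics, incremental: bool = True) -> TableStats:
    """Return the stats to store after a DML query.

    With incremental=False this mimics the behaviour described above:
    the stats are invalidated and a full ANALYZE is needed to restore them.
    """
    if not incremental:
        return TableStats(None, None)
    if stats.row_count is None or stats.size_in_bytes is None:
        # Deltas cannot be applied to unknown stats; stay invalidated.
        return TableStats(None, None)
    return TableStats(
        row_count=max(0, stats.row_count + m.rows_added - m.rows_removed),
        size_in_bytes=max(0, stats.size_in_bytes + m.bytes_added - m.bytes_removed),
    )


# Assumed starting point: stats produced by an earlier full ANALYZE.
stats = TableStats(row_count=1000, size_in_bytes=64000)
after_insert = apply_dml(stats, DmlMetrics(rows_added=250, bytes_added=16000))
after_delete = apply_dml(after_insert, DmlMetrics(rows_removed=50, bytes_removed=3200))
invalidated = apply_dml(stats, DmlMetrics(rows_added=1), incremental=False)
```

Row count and size are additive, which is why table-level stats are the easy first step. Column-level stats such as distinct counts or min/max after deletes are not simple sums, so they would need a different approach than this sketch.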



--
This message was sent by Atlassian Jira
(v8.20.10#820010)