
Posted to issues@spark.apache.org by "Apache Spark (Jira)" <ji...@apache.org> on 2022/01/30 02:56:00 UTC

[jira] [Commented] (SPARK-23445) ColumnStat refactoring

    [ https://issues.apache.org/jira/browse/SPARK-23445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17484292#comment-17484292 ] 

Apache Spark commented on SPARK-23445:
--------------------------------------

User 'Stove-hust' has created a pull request for this issue:
https://github.com/apache/spark/pull/35363

> ColumnStat refactoring
> ----------------------
>
>                 Key: SPARK-23445
>                 URL: https://issues.apache.org/jira/browse/SPARK-23445
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: Juliusz Sompolski
>            Assignee: Juliusz Sompolski
>            Priority: Major
>             Fix For: 2.4.0
>
>
> Refactor ColumnStat to be more flexible.
>  * Split {{ColumnStat}} and {{CatalogColumnStat}}, just as {{CatalogStatistics}} is split from {{Statistics}}. This detaches how the statistics are stored from how they are processed in the query plan. {{CatalogColumnStat}} keeps {{min}} and {{max}} as {{String}}, so it does not depend on dataType information.
>  * For {{CatalogColumnStat}}, parse column names from property names in the metastore (the {{KEY_VERSION}} property), not from the metastore schema. This allows the catalog to read stats into {{CatalogColumnStat}}s even if the schema itself is not in the metastore.
>  * Make all fields optional. {{min}}, {{max}} and {{histogram}} for columns were optional already. Having them all optional is more consistent, and makes it possible, for example, to drop some of the fields through transformations if they are difficult or impossible to calculate.
> The added flexibility will make it possible to have alternative implementations for stats, and will separate stats collection from stats processing and estimation in plans.
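
A rough illustration of the three points in the description above, written as a small Python sketch rather than Spark's actual Scala code. Everything here is an assumption made for illustration: the snake_case field names, the "<column>.<field>" property key layout, and the helpers column_names, read_catalog_stat and to_plan_stat are hypothetical. Only the names ColumnStat, CatalogColumnStat and KEY_VERSION, and the idea of keeping min/max as strings with all fields optional, come from the issue text.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

# Suffix of the per-column version property. The "<column>.<field>" key
# layout used below is an assumption made for this sketch.
KEY_VERSION = "version"


@dataclass(frozen=True)
class CatalogColumnStat:
    """Storage-side stats: every field optional, min/max kept as raw strings."""
    distinct_count: Optional[int] = None
    min: Optional[str] = None
    max: Optional[str] = None
    null_count: Optional[int] = None


@dataclass(frozen=True)
class ColumnStat:
    """Plan-side stats: min/max converted to typed values."""
    distinct_count: Optional[int] = None
    min: Optional[object] = None
    max: Optional[object] = None
    null_count: Optional[int] = None


def column_names(props: Dict[str, str]) -> List[str]:
    """Recover column names from property keys alone, with no schema lookup."""
    suffix = "." + KEY_VERSION
    return sorted(k[: -len(suffix)] for k in props if k.endswith(suffix))


def read_catalog_stat(col: str, props: Dict[str, str]) -> CatalogColumnStat:
    """Build storage-side stats; a missing property simply stays None."""
    def opt_int(field: str) -> Optional[int]:
        raw = props.get(f"{col}.{field}")
        return int(raw) if raw is not None else None

    return CatalogColumnStat(
        distinct_count=opt_int("distinctCount"),
        min=props.get(f"{col}.min"),
        max=props.get(f"{col}.max"),
        null_count=opt_int("nullCount"),
    )


def to_plan_stat(stat: CatalogColumnStat, parse: Callable[[str], object]) -> ColumnStat:
    """Convert to plan-side stats once the column's data type (parse) is known."""
    return ColumnStat(
        distinct_count=stat.distinct_count,
        min=parse(stat.min) if stat.min is not None else None,
        max=parse(stat.max) if stat.max is not None else None,
        null_count=stat.null_count,
    )


# Example properties as they might be stored for two columns (made-up values).
example_props = {
    "id.version": "1", "id.distinctCount": "3", "id.min": "1", "id.max": "9",
    "name.version": "1", "name.nullCount": "0",
}
id_catalog_stat = read_catalog_stat("id", example_props)
id_plan_stat = to_plan_stat(id_catalog_stat, int)
```

In this sketch, column_names(example_props) finds both columns from the keys alone, id_catalog_stat holds min and max as the strings "1" and "9", and the conversion to typed values happens only in to_plan_stat, where the data type is supplied by the caller.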



