Posted to issues@spark.apache.org by "Zhan Zhang (JIRA)" <ji...@apache.org> on 2017/03/09 23:27:37 UTC

[jira] [Updated] (SPARK-19890) Make MetastoreRelation statistics estimation more accurately

     [ https://issues.apache.org/jira/browse/SPARK-19890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zhan Zhang updated SPARK-19890:
-------------------------------
    Description: 
Currently, MetastoreRelation statistics are retrieved during the analysis phase, and the size is computed at table scope. For a partitioned table these statistics are not useful, because the table size may be 100x or more larger than the size of the partitions actually read. As a result, join optimizations that depend on the size estimate are not applied.

It would be great if we could postpone statistics retrieval to the optimization phase, where partition information is available, but before physical plan generation, so that JoinSelection can choose a better join method (broadcast, shuffled hash join, or sort-merge join).
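
To make the effect concrete, here is a toy model of how a size estimate can drive the join choice. It is only a sketch, not Spark's JoinSelection code: the function name choose_join and the table and partition sizes are hypothetical, and the 10 MB threshold mirrors the default value of spark.sql.autoBroadcastJoinThreshold.

```python
# Toy model of size-based join selection; not Spark's actual JoinSelection code.

# Default value of spark.sql.autoBroadcastJoinThreshold (10 MB).
BROADCAST_THRESHOLD_BYTES = 10 * 1024 * 1024


def choose_join(estimated_size_bytes: int,
                threshold_bytes: int = BROADCAST_THRESHOLD_BYTES) -> str:
    """Pick a join strategy from the estimated size of the smaller side."""
    if estimated_size_bytes <= threshold_bytes:
        return "broadcast"
    return "sort-merge"


# Hypothetical numbers: a 500 MB table whose query reads one 5 MB partition.
table_size = 500 * 1024 * 1024
partition_size = 5 * 1024 * 1024

# Same relation, different scope for the estimate, different join choice.
table_level_choice = choose_join(table_size)
partition_level_choice = choose_join(partition_size)
print(table_level_choice, partition_level_choice)
```

With the table-scope estimate the relation looks too large to broadcast, while the partition-scope estimate falls under the threshold.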

Although the MetastoreRelation is not associated with partitions, we can get the partition info for the table through PhysicalOperation. Multiple plans can use the same MetastoreRelation, but the estimate is still much better than the table size. This way, retrieving statistics is straightforward.
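
A partition-scoped estimate could look roughly like the following sketch. It is a plain-Python illustration under stated assumptions, not the MetastoreRelation or PhysicalOperation API: the Partition class, the estimate_size helper, the "day" partition column, and all sizes are hypothetical.

```python
# Toy model of partition-scoped size estimation; all names and data are hypothetical.
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Partition:
    spec: Dict[str, str]  # partition column -> value, e.g. {"day": "42"}
    size_bytes: int


def estimate_size(partitions: List[Partition],
                  predicate: Callable[[Dict[str, str]], bool]) -> int:
    """Sum the sizes of only those partitions that satisfy the partition filter."""
    return sum(p.size_bytes for p in partitions if predicate(p.spec))


# Hypothetical table: 100 daily partitions of 5 MB each.
partitions = [Partition({"day": str(i)}, 5 * 1024 * 1024) for i in range(100)]

# Table-scope estimate: every partition counts.
table_estimate = estimate_size(partitions, lambda spec: True)

# Partition-scope estimate for a query filtering on day = 42.
pruned_estimate = estimate_size(partitions, lambda spec: spec["day"] == "42")

print(table_estimate // pruned_estimate)  # ratio between the two estimates
```

In this made-up case the table-scope estimate is 100 times the partition-scope one, which is the kind of gap the description refers to.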

Another possible approach is to have another data structure that associates the MetastoreRelation and its partitions with the plan, to get the most accurate estimate.



> Make MetastoreRelation statistics estimation more accurately
> ------------------------------------------------------------
>
>                 Key: SPARK-19890
>                 URL: https://issues.apache.org/jira/browse/SPARK-19890
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.1.0
>            Reporter: Zhan Zhang
>            Priority: Minor


