Posted to dev@hive.apache.org by "Rajesh Balamohan (Jira)" <ji...@apache.org> on 2020/10/27 09:50:00 UTC
[jira] [Created] (HIVE-24313) Optimise stats collection for file sizes on cloud storage
Rajesh Balamohan created HIVE-24313:
---------------------------------------
Summary: Optimise stats collection for file sizes on cloud storage
Key: HIVE-24313
URL: https://issues.apache.org/jira/browse/HIVE-24313
Project: Hive
Issue Type: Improvement
Components: HiveServer2
Reporter: Rajesh Balamohan
When stats information is not present (e.g., for an external table), RelOptHiveTable computes basic stats at runtime.
The code path is as follows.
[https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/RelOptHiveTable.java#L598]
{code:java}
Statistics stats = StatsUtils.collectStatistics(hiveConf, partitionList,
    hiveTblMetadata, hiveNonPartitionCols, nonPartColNamesThatRqrStats, colStatsCached,
    nonPartColNamesThatRqrStats, true);
{code}
[https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsUtils.java#L322]
{code:java}
for (Partition p : partList.getNotDeniedPartns()) {
  BasicStats basicStats = basicStatsFactory.build(Partish.buildFor(table, p));
  partStats.add(basicStats);
}
{code}
[https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/stats/BasicStats.java#L205]
{code:java}
try {
  ds = getFileSizeForPath(path);
} catch (IOException e) {
  ds = 0L;
}
{code}
For a table and query with a large number of partitions, computing statistics this way takes a long time and increases compilation time, because the file size of each partition is looked up one at a time. It would be good to fix it with "ForkJoinPool" ( partList.getNotDeniedPartns().parallelStream().forEach((p) )
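A minimal, standalone sketch of that idea follows. It is not Hive code: the class name ParallelFileSizeSketch, the SizeLookup interface (a stand-in for a slow per-path call such as getFileSizeForPath), the collectSizes helper and the parallelism value are all hypothetical, and plain strings stand in for partition paths. It keeps the behaviour of the snippet above, where an IOException yields a size of 0. Running a parallel stream inside a dedicated ForkJoinPool is a commonly used idiom, but it relies on stream implementation behaviour rather than a documented guarantee, so treat this as an illustration only.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ForkJoinPool;
import java.util.stream.Collectors;

final class ParallelFileSizeSketch {

    /** Hypothetical stand-in for a slow per-path size lookup. */
    interface SizeLookup {
        long sizeOf(String path) throws IOException;
    }

    /**
     * Looks up the size of every path in parallel on a dedicated pool.
     * A failed lookup is recorded as 0, as in the sequential code path.
     * The result list is in the same order as the input list.
     */
    static List<Long> collectSizes(List<String> paths, SizeLookup lookup, int parallelism)
            throws InterruptedException, ExecutionException {
        ForkJoinPool pool = new ForkJoinPool(parallelism);
        try {
            Callable<List<Long>> task = () ->
                paths.parallelStream()
                    .map(p -> {
                        try {
                            return lookup.sizeOf(p);
                        } catch (IOException e) {
                            return 0L; // same fallback as the original code
                        }
                    })
                    .collect(Collectors.toList());
            // Submitting the stream pipeline to the pool makes its tasks
            // run on this pool's worker threads instead of the common pool.
            return pool.submit(task).get();
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        // Assumed example input: the "size" is just the length of the path string.
        List<String> paths = Arrays.asList("p=1", "p=22", "p=333");
        List<Long> sizes = collectSizes(paths, p -> (long) p.length(), 4);
        System.out.println(sizes);
    }
}
```

A dedicated pool, rather than the common pool, keeps blocking storage calls from starving other parallel work in the same JVM, and lets the degree of parallelism be tuned separately.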