Posted to issues@hive.apache.org by "ASF GitHub Bot (Jira)" <ji...@apache.org> on 2023/02/01 07:54:00 UTC
[jira] [Work logged] (HIVE-27000) Improve the modularity of the *ColumnStatsMerger classes

     [ https://issues.apache.org/jira/browse/HIVE-27000?focusedWorklogId=842789&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-842789 ]

ASF GitHub Bot logged work on HIVE-27000:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 01/Feb/23 07:53
            Start Date: 01/Feb/23 07:53
    Worklog Time Spent: 10m 
      Work Description: akshat0395 commented on code in PR #3997:
URL: https://github.com/apache/hive/pull/3997#discussion_r1092866326


##########
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/columnstats/merge/DateColumnStatsMerger.java:
##########
@@ -43,64 +46,57 @@ public void merge(ColumnStatisticsObj aggregateColStats, ColumnStatisticsObj new
     DateColumnStatsDataInspector aggregateData = dateInspectorFromStats(aggregateColStats);
     DateColumnStatsDataInspector newData = dateInspectorFromStats(newColStats);
 
-    setLowValue(aggregateData, newData);
-    setHighValue(aggregateData, newData);
-
-    aggregateData.setNumNulls(aggregateData.getNumNulls() + newData.getNumNulls());
-    if (aggregateData.getNdvEstimator() == null || newData.getNdvEstimator() == null) {
-      aggregateData.setNumDVs(Math.max(aggregateData.getNumDVs(), newData.getNumDVs()));
-    } else {
-      NumDistinctValueEstimator oldEst = aggregateData.getNdvEstimator();
-      NumDistinctValueEstimator newEst = newData.getNdvEstimator();
-      final long ndv;
-      if (oldEst.canMerge(newEst)) {
-        oldEst.mergeEstimators(newEst);
-        ndv = oldEst.estimateNumDistinctValues();
-        aggregateData.setNdvEstimator(oldEst);
-      } else {
-        ndv = Math.max(aggregateData.getNumDVs(), newData.getNumDVs());
-      }
-      LOG.debug("Use bitvector to merge column {}'s ndvs of {} and {} to be {}", aggregateColStats.getColName(),
-          aggregateData.getNumDVs(), newData.getNumDVs(), ndv);
-      aggregateData.setNumDVs(ndv);
+    Date lowValue = mergeLowValue(getLowValue(aggregateData), getLowValue(newData));
+    if (lowValue != null) {
+      aggregateData.setLowValue(lowValue);
+    }
+    Date highValue = mergeHighValue(getHighValue(aggregateData), getHighValue(newData));
+    if (highValue != null) {

Review Comment:
   Thanks @asolimando for the explanation. If there is a plan to improve the class hierarchy, then it makes sense to tackle this at that point as well.
   





Issue Time Tracking
-------------------

    Worklog Id:     (was: 842789)
    Time Spent: 1h 20m  (was: 1h 10m)

> Improve the modularity of the *ColumnStatsMerger classes
> --------------------------------------------------------
>
>                 Key: HIVE-27000
>                 URL: https://issues.apache.org/jira/browse/HIVE-27000
>             Project: Hive
>          Issue Type: Improvement
>          Components: Statistics
>    Affects Versions: 4.0.0-alpha-2
>            Reporter: Alessandro Solimando
>            Assignee: Alessandro Solimando
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> *ColumnStatsMerger classes contain a lot of duplicated code that is not specific to the data type and could therefore be lifted to a common parent class.
> This duplication is bound to get worse as we keep extending the set of supported statistics, as we did in HIVE-26221.
> This ticket aims to improve the modularity and code reuse of the *ColumnStatsMerger classes, and to extend unit-test coverage to all classes and more use cases.
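
To make the idea of lifting type-independent logic concrete, here is a minimal sketch of what such shared helpers might look like. It is not the code from PR #3997: the class name ColumnStatsMergerSketch and the methods mergeNumNulls and mergeNumDVsFallback are hypothetical, and the generic signatures of mergeLowValue and mergeHighValue are assumptions (the diff above only shows them being called with Date values). The null-count sum and the max-based NDV fallback mirror the lines removed in the diff.

```java
/**
 * Hypothetical sketch of type-independent merge helpers.
 * Names and signatures are illustrative only, not the actual Hive API.
 */
final class ColumnStatsMergerSketch {

  private ColumnStatsMergerSketch() {
  }

  /** Returns the smaller of two possibly-null values, or null if both are null. */
  static <T extends Comparable<T>> T mergeLowValue(T oldValue, T newValue) {
    if (oldValue == null) {
      return newValue;
    }
    if (newValue == null) {
      return oldValue;
    }
    return oldValue.compareTo(newValue) <= 0 ? oldValue : newValue;
  }

  /** Returns the larger of two possibly-null values, or null if both are null. */
  static <T extends Comparable<T>> T mergeHighValue(T oldValue, T newValue) {
    if (oldValue == null) {
      return newValue;
    }
    if (newValue == null) {
      return oldValue;
    }
    return oldValue.compareTo(newValue) >= 0 ? oldValue : newValue;
  }

  /** Null counts from two partitions simply add up. */
  static long mergeNumNulls(long oldNumNulls, long newNumNulls) {
    return oldNumNulls + newNumNulls;
  }

  /**
   * Fallback used when no mergeable NDV estimator is available:
   * the larger of the two counts is a lower bound on the merged NDV.
   */
  static long mergeNumDVsFallback(long oldNumDVs, long newNumDVs) {
    return Math.max(oldNumDVs, newNumDVs);
  }

  public static void main(String[] args) {
    // Long stands in for Date here; any Comparable type works the same way.
    System.out.println(mergeLowValue(3L, 7L));
    System.out.println(mergeHighValue(3L, null));
    System.out.println(mergeNumNulls(2L, 5L));
    System.out.println(mergeNumDVsFallback(10L, 4L));
  }
}
```

Because the helpers return null only when both inputs are null, a caller can keep the "if (lowValue != null)" guard shown in the diff before writing the merged value back to the aggregate.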



--
This message was sent by Atlassian Jira
(v8.20.10#820010)