You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/03/31 23:40:50 UTC

[GitHub] [hudi] yihua commented on a change in pull request #5181: [HUDI-3664] Fixing Column Stats Index composition

yihua commented on a change in pull request #5181:
URL: https://github.com/apache/hudi/pull/5181#discussion_r840042665



##########
File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java
##########
@@ -1508,7 +1508,7 @@ public boolean isMetadataBloomFilterIndexEnabled() {
   }
 
   public boolean isMetadataColumnStatsIndexEnabled() {
-    return isMetadataTableEnabled() && getMetadataConfig().isColumnStatsIndexEnabled();
+    return isMetadataTableEnabled() && getMetadataConfig().isColumnStatsIndexEnabled() && getMetadataConfig().isColumnStatsIndexEnabled();

Review comment:
       nit: is this redundant?

##########
File path: hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/index/columnstats/ColumnStatsIndexHelper.java
##########
@@ -63,9 +63,10 @@
 import java.util.stream.Collectors;
 import java.util.stream.StreamSupport;
 
+// TODO merge w/ ColumnStatsIndexSupport
 public class ColumnStatsIndexHelper {
 
-  private static final String COLUMN_STATS_INDEX_FILE_COLUMN_NAME = "file";
+  private static final String COLUMN_STATS_INDEX_FILE_COLUMN_NAME = "fileName";

Review comment:
       I assume this is only used on the read side in intermediate dataset representation, not to be coupled with the metadata payload schema.

##########
File path: hudi-common/src/main/avro/HoodieMetadata.avsc
##########
@@ -121,16 +121,192 @@
                             "doc": "Minimum value in the range. Based on user data table schema, we can convert this to appropriate type",
                             "name": "minValue",
                             "type": [
+                                // Those types should be aligned with Parquet `Statistics` impl
+                                // making sure that we implement semantic consistent across file formats
+                                //
+                                // NOTE: Other logical types (decimal, date, timestamp, etc) will be converted
+                                //       into one of the following types, making sure that their corresponding
+                                //       ordering is preserved
                                 "null",
-                                "string"
+                                {

Review comment:
       should we keep `"string"` for compatibility across SNAPSHOTS releases?

##########
File path: hudi-common/src/main/avro/HoodieMetadata.avsc
##########
@@ -121,16 +121,192 @@
                             "doc": "Minimum value in the range. Based on user data table schema, we can convert this to appropriate type",
                             "name": "minValue",
                             "type": [
+                                // Those types should be aligned with Parquet `Statistics` impl
+                                // making sure that we implement semantic consistent across file formats
+                                //
+                                // NOTE: Other logical types (decimal, date, timestamp, etc) will be converted
+                                //       into one of the following types, making sure that their corresponding
+                                //       ordering is preserved
                                 "null",
-                                "string"
+                                {
+                                    "namespace": "org.apache.hudi.avro.model",
+                                    "type": "record",
+                                    "name": "BooleanWrapper",
+                                    "doc": "A record wrapping boolean type to be able to be used it w/in Avro's Union",
+                                    "fields": [
+                                        {
+                                            "type": "boolean",
+                                            "name": "value"
+                                        }
+                                    ]
+                                },
+                                {
+                                    "namespace": "org.apache.hudi.avro.model",
+                                    "type": "record",
+                                    "name": "IntWrapper",
+                                    "doc": "A record wrapping int type to be able to be used it w/in Avro's Union",
+                                    "fields": [
+                                        {
+                                            "type": "int",
+                                            "name": "value"
+                                        }
+                                    ]
+                                },
+                                {
+                                    "namespace": "org.apache.hudi.avro.model",
+                                    "type": "record",
+                                    "name": "LongWrapper",
+                                    "doc": "A record wrapping long type to be able to be used it w/in Avro's Union",
+                                    "fields": [
+                                        {
+                                            "type": "long",
+                                            "name": "value"
+                                        }
+                                    ]
+                                },
+                                {
+                                    "namespace": "org.apache.hudi.avro.model",
+                                    "type": "record",
+                                    "name": "FloatWrapper",
+                                    "doc": "A record wrapping float type to be able to be used it w/in Avro's Union",
+                                    "fields": [
+                                        {
+                                            "type": "float",
+                                            "name": "value"
+                                        }
+                                    ]
+                                },
+                                {
+                                    "namespace": "org.apache.hudi.avro.model",
+                                    "type": "record",
+                                    "name": "DoubleWrapper",
+                                    "doc": "A record wrapping double type to be able to be used it w/in Avro's Union",
+                                    "fields": [
+                                        {
+                                            "type": "double",
+                                            "name": "value"
+                                        }
+                                    ]
+                                },
+                                {
+                                    "namespace": "org.apache.hudi.avro.model",
+                                    "type": "record",
+                                    "name": "BytesWrapper",
+                                    "doc": "A record wrapping bytes type to be able to be used it w/in Avro's Union",
+                                    "fields": [
+                                        {
+                                            "type": "bytes",
+                                            "name": "value"
+                                        }
+                                    ]
+                                },
+                                {
+                                    "namespace": "org.apache.hudi.avro.model",
+                                    "type": "record",
+                                    "name": "StringWrapper",
+                                    "doc": "A record wrapping string type to be able to be used it w/in Avro's Union",
+                                    "fields": [
+                                        {
+                                            "type": "string",
+                                            "name": "value"
+                                        }
+                                    ]
+                                },
+                                {
+                                    "namespace": "org.apache.hudi.avro.model",
+                                    "type": "record",
+                                    "name": "DateWrapper",
+                                    "doc": "A record wrapping Date logical type to be able to be used it w/in Avro's Union",
+                                    "fields": [
+                                        {
+                                            "type": {
+                                                "type": "int"
+                                                // NOTE: Due to breaking changes in code-gen b/w Avro 1.8.2 and 1.10, we can't
+                                                //       rely on logical types to do proper encoding of the native Java types,
+                                                //       and hereby have to encode statistic manually
+                                                //"logicalType": "date"
+                                            },
+                                            "name": "value"
+                                        }
+                                    ]
+                                },
+                                {
+                                    "namespace": "org.apache.hudi.avro.model",
+                                    "type": "record",
+                                    "name": "DecimalWrapper",
+                                    "doc": "A record wrapping Decimal logical type to be able to be used it w/in Avro's Union",
+                                    "fields": [
+                                        {
+                                            "type": {
+                                                "type": "bytes",
+                                                "logicalType": "decimal",
+                                                // NOTE: This is equivalent to Spark's [[DoubleDecimal]] and should
+                                                //       be enough for almost any possible use-cases
+                                                "precision": 30,
+                                                "scale": 15
+                                            },
+                                            "name": "value"
+                                        }
+                                    ]
+                                },
+                                {
+                                    "namespace": "org.apache.hudi.avro.model",
+                                    "type": "record",
+                                    "name": "TimeMicrosWrapper",
+                                    "doc": "A record wrapping Time-micros logical type to be able to be used it w/in Avro's Union",
+                                    "fields": [
+                                        {
+                                            "type": {
+                                                "type": "long",
+                                                "logicalType": "time-micros"
+                                            },
+                                            "name": "value"
+
+                                        }
+                                    ]
+                                },
+                                {
+                                    "namespace": "org.apache.hudi.avro.model",
+                                    "type": "record",
+                                    "name": "TimestampMicrosWrapper",
+                                    "doc": "A record wrapping Timestamp-micros logical type to be able to be used it w/in Avro's Union",
+                                    "fields": [
+                                        {
+                                            "type": {
+                                                "type": "long"
+                                                // NOTE: Due to breaking changes in code-gen b/w Avro 1.8.2 and 1.10, we can't
+                                                //       rely on logical types to do proper encoding of the native Java types,
+                                                //       and hereby have to encode statistic manually
+                                                //"logicalType": "timestamp-micros"
+                                            },
+                                            "name": "value"
+                                        }
+                                    ]
+                                }
                             ]
                         },
                         {
                             "doc": "Maximum value in the range. Based on user data table schema, we can convert it to appropriate type",
                             "name": "maxValue",
                             "type": [
+                                // Those types should be aligned with Parquet `Statistics` impl
+                                // making sure that we implement semantic consistent across file formats
+                                //
+                                // NOTE: Other logical types (decimal, date, timestamp, etc) will be converted
+                                //       into one of the following types, making sure that their corresponding
+                                //       ordering is preserved
                                 "null",
-                                "string"
+                                "org.apache.hudi.avro.model.BooleanWrapper",

Review comment:
       similar here.

##########
File path: hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java
##########
@@ -1126,23 +1127,29 @@ public static void aggregateColumnStats(IndexedRecord record, List<Schema.Field>
     }
 
     fields.forEach(field -> {
-      Map<String, Object> columnStats = columnToStats.getOrDefault(field.name(), new HashMap<>());
-      final String fieldVal = getNestedFieldValAsString((GenericRecord) record, field.name(), true, consistentLogicalTimestampEnabled);
+      Map<String, Object> columnStats = columnToStats.get(field.name());
+      GenericRecord genericRecord = (GenericRecord) record;
+      final Object fieldVal = convertValueForSpecificDataTypes(field.schema(), genericRecord.get(field.name()), consistentLogicalTimestampEnabled);
+      final Schema fieldSchema = getNestedFieldSchemaFromWriteSchema(genericRecord.getSchema(), field.name());
       // update stats
-      final int fieldSize = fieldVal == null ? 0 : fieldVal.length();
-      columnStats.put(TOTAL_SIZE, Long.parseLong(columnStats.getOrDefault(TOTAL_SIZE, 0).toString()) + fieldSize);
-      columnStats.put(TOTAL_UNCOMPRESSED_SIZE, Long.parseLong(columnStats.getOrDefault(TOTAL_UNCOMPRESSED_SIZE, 0).toString()) + fieldSize);
+      // NOTE: Unlike Parquet, Avro does not give the field size.
+      columnStats.put(TOTAL_SIZE, Long.parseLong(columnStats.getOrDefault(TOTAL_SIZE, 0).toString()));
+      columnStats.put(TOTAL_UNCOMPRESSED_SIZE, Long.parseLong(columnStats.getOrDefault(TOTAL_UNCOMPRESSED_SIZE, 0).toString()));

Review comment:
       I assume the size here is not treated as accurate so it's okay not to add the field size here.

##########
File path: hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java
##########
@@ -1126,23 +1127,29 @@ public static void aggregateColumnStats(IndexedRecord record, List<Schema.Field>
     }
 
     fields.forEach(field -> {
-      Map<String, Object> columnStats = columnToStats.getOrDefault(field.name(), new HashMap<>());
-      final String fieldVal = getNestedFieldValAsString((GenericRecord) record, field.name(), true, consistentLogicalTimestampEnabled);
+      Map<String, Object> columnStats = columnToStats.get(field.name());
+      GenericRecord genericRecord = (GenericRecord) record;
+      final Object fieldVal = convertValueForSpecificDataTypes(field.schema(), genericRecord.get(field.name()), consistentLogicalTimestampEnabled);
+      final Schema fieldSchema = getNestedFieldSchemaFromWriteSchema(genericRecord.getSchema(), field.name());
       // update stats
-      final int fieldSize = fieldVal == null ? 0 : fieldVal.length();
-      columnStats.put(TOTAL_SIZE, Long.parseLong(columnStats.getOrDefault(TOTAL_SIZE, 0).toString()) + fieldSize);
-      columnStats.put(TOTAL_UNCOMPRESSED_SIZE, Long.parseLong(columnStats.getOrDefault(TOTAL_UNCOMPRESSED_SIZE, 0).toString()) + fieldSize);
+      // NOTE: Unlike Parquet, Avro does not give the field size.
+      columnStats.put(TOTAL_SIZE, Long.parseLong(columnStats.getOrDefault(TOTAL_SIZE, 0).toString()));
+      columnStats.put(TOTAL_UNCOMPRESSED_SIZE, Long.parseLong(columnStats.getOrDefault(TOTAL_UNCOMPRESSED_SIZE, 0).toString()));
 
-      if (!isNullOrEmpty(fieldVal)) {
+      if (fieldVal != null) {
         // set the min value of the field
         if (!columnStats.containsKey(MIN)) {
           columnStats.put(MIN, fieldVal);
         }
-        if (fieldVal.compareTo(String.valueOf(columnStats.get(MIN))) < 0) {
+        if (compare(fieldVal, columnStats.get(MIN), fieldSchema) < 0) {
           columnStats.put(MIN, fieldVal);
         }
         // set the max value of the field
-        if (fieldVal.compareTo(String.valueOf(columnStats.getOrDefault(MAX, ""))) > 0) {
+        if (!columnStats.containsKey(MAX)) {
+          columnStats.put(MAX, fieldVal);
+        }
+        // set the max value of the field
+        if (compare(fieldVal, columnStats.get(MAX), fieldSchema) > 0) {

Review comment:
       So basically, for struct and nested fields, e.g., "a.b.c" and "a.b.d", the logic is always going to construct the column stats for individual nested field of "a.b.c" and "a.b.d" which has defined order based on the data type, not struct field "a" and field "a.b" which does not have pre-defined order.  Is that the case?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org