You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2020/04/14 19:31:37 UTC

[GitHub] [incubator-hudi] afilipchik opened a new pull request #1513: [HUDI-793] Adding proper default to hudi metadata fields and proper handling to rewrite routine

afilipchik opened a new pull request #1513: [HUDI-793] Adding proper default to hudi metadata fields and proper handling to rewrite routine
URL: https://github.com/apache/incubator-hudi/pull/1513
 
 
   
   
   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contributing.html before opening a pull request.*
   
   ## What is the purpose of the pull request
   Fixing issue with avro and default. Current implementation doesn't set correct default fields when adding hudi metadata and also fails if record's schema has them set to:
   "dafault": null
   
   ## Brief change log
   
   *(for example:)*
     - *Modified HoodieAvroUtils
   
   ## Verify this pull request
   
   I added default to the already existing test.
   
   ## Committer checklist
   
    - [ ] Has a corresponding JIRA in PR title & commit
    
    - [ ] Commit message is descriptive of the change
    
    - [ ] CI is green
   
    - [ ] Necessary doc changes done or have another open PR
          
    - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] vinothchandar commented on issue #1513: [HUDI-793] Adding proper default to hudi metadata fields and proper handling to rewrite routine

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on issue #1513: [HUDI-793] Adding proper default to hudi metadata fields and proper handling to rewrite routine
URL: https://github.com/apache/incubator-hudi/pull/1513#issuecomment-615383799
 
 
   @pratyakshsharma can you please open a PR with these additional test cases :) 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] afilipchik commented on a change in pull request #1513: [HUDI-793] Adding proper default to hudi metadata fields and proper handling to rewrite routine

Posted by GitBox <gi...@apache.org>.
afilipchik commented on a change in pull request #1513: [HUDI-793] Adding proper default to hudi metadata fields and proper handling to rewrite routine
URL: https://github.com/apache/incubator-hudi/pull/1513#discussion_r408443108
 
 

 ##########
 File path: hudi-common/src/main/java/org/apache/hudi/avro/HoodieAvroUtils.java
 ##########
 @@ -206,7 +207,11 @@ private static GenericRecord rewrite(GenericRecord record, LinkedHashSet<Field>
     GenericRecord newRecord = new GenericData.Record(newSchema);
     for (Schema.Field f : fieldsToWrite) {
       if (record.get(f.name()) == null) {
-        newRecord.put(f.name(), f.defaultVal());
+        if (f.defaultVal() instanceof Null) {
+          newRecord.put(f.name(), null);
 
 Review comment:
   the one that was modified in this PR: TestHoodieAvroUtils.java. I updated test avro schema. 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] afilipchik commented on a change in pull request #1513: [HUDI-793] Adding proper default to hudi metadata fields and proper handling to rewrite routine

Posted by GitBox <gi...@apache.org>.
afilipchik commented on a change in pull request #1513: [HUDI-793] Adding proper default to hudi metadata fields and proper handling to rewrite routine
URL: https://github.com/apache/incubator-hudi/pull/1513#discussion_r408431569
 
 

 ##########
 File path: hudi-common/src/main/java/org/apache/hudi/avro/HoodieAvroUtils.java
 ##########
 @@ -104,15 +105,15 @@ public static Schema addMetadataFields(Schema schema) {
     List<Schema.Field> parentFields = new ArrayList<>();
 
     Schema.Field commitTimeField =
-        new Schema.Field(HoodieRecord.COMMIT_TIME_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", (Object) null);
+        new Schema.Field(HoodieRecord.COMMIT_TIME_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", NullNode.getInstance());
     Schema.Field commitSeqnoField =
-        new Schema.Field(HoodieRecord.COMMIT_SEQNO_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", (Object) null);
+        new Schema.Field(HoodieRecord.COMMIT_SEQNO_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", NullNode.getInstance());
     Schema.Field recordKeyField =
-        new Schema.Field(HoodieRecord.RECORD_KEY_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", (Object) null);
+        new Schema.Field(HoodieRecord.RECORD_KEY_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", NullNode.getInstance());
     Schema.Field partitionPathField =
-        new Schema.Field(HoodieRecord.PARTITION_PATH_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", (Object) null);
+        new Schema.Field(HoodieRecord.PARTITION_PATH_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", NullNode.getInstance());
     Schema.Field fileNameField =
-        new Schema.Field(HoodieRecord.FILENAME_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", (Object) null);
+        new Schema.Field(HoodieRecord.FILENAME_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", NullNode.getInstance());
 
 Review comment:
   it doesn't generate correct avro schema. This might not be the correct fix, but passing (Object) null results is the schema without default value.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] pratyakshsharma commented on a change in pull request #1513: [HUDI-793] Adding proper default to hudi metadata fields and proper handling to rewrite routine

Posted by GitBox <gi...@apache.org>.
pratyakshsharma commented on a change in pull request #1513: [HUDI-793] Adding proper default to hudi metadata fields and proper handling to rewrite routine
URL: https://github.com/apache/incubator-hudi/pull/1513#discussion_r408472216
 
 

 ##########
 File path: hudi-common/src/main/java/org/apache/hudi/avro/HoodieAvroUtils.java
 ##########
 @@ -104,15 +105,15 @@ public static Schema addMetadataFields(Schema schema) {
     List<Schema.Field> parentFields = new ArrayList<>();
 
     Schema.Field commitTimeField =
-        new Schema.Field(HoodieRecord.COMMIT_TIME_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", (Object) null);
+        new Schema.Field(HoodieRecord.COMMIT_TIME_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", NullNode.getInstance());
     Schema.Field commitSeqnoField =
-        new Schema.Field(HoodieRecord.COMMIT_SEQNO_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", (Object) null);
+        new Schema.Field(HoodieRecord.COMMIT_SEQNO_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", NullNode.getInstance());
     Schema.Field recordKeyField =
-        new Schema.Field(HoodieRecord.RECORD_KEY_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", (Object) null);
+        new Schema.Field(HoodieRecord.RECORD_KEY_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", NullNode.getInstance());
     Schema.Field partitionPathField =
-        new Schema.Field(HoodieRecord.PARTITION_PATH_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", (Object) null);
+        new Schema.Field(HoodieRecord.PARTITION_PATH_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", NullNode.getInstance());
     Schema.Field fileNameField =
-        new Schema.Field(HoodieRecord.FILENAME_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", (Object) null);
+        new Schema.Field(HoodieRecord.FILENAME_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", NullNode.getInstance());
 
 Review comment:
   Ok.. IMO hoodie metadata fields will never be written as null. If you strongly feel about doing this change, can we think of some alternate solution rather than going back to deprecated constructors. Also wanted to check with you how are you generating schema like you mentioned before? 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] pratyakshsharma commented on issue #1513: [HUDI-793] Adding proper default to hudi metadata fields and proper handling to rewrite routine

Posted by GitBox <gi...@apache.org>.
pratyakshsharma commented on issue #1513: [HUDI-793] Adding proper default to hudi metadata fields and proper handling to rewrite routine
URL: https://github.com/apache/incubator-hudi/pull/1513#issuecomment-615272042
 
 
   @vinothchandar I tried few more combinations, looks like only the case when default value is null was not covered with avro-1.8.2. Rest looks good to me. 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] afilipchik commented on a change in pull request #1513: [HUDI-793] Adding proper default to hudi metadata fields and proper handling to rewrite routine

Posted by GitBox <gi...@apache.org>.
afilipchik commented on a change in pull request #1513: [HUDI-793] Adding proper default to hudi metadata fields and proper handling to rewrite routine
URL: https://github.com/apache/incubator-hudi/pull/1513#discussion_r408480553
 
 

 ##########
 File path: hudi-common/src/main/java/org/apache/hudi/avro/HoodieAvroUtils.java
 ##########
 @@ -104,15 +105,15 @@ public static Schema addMetadataFields(Schema schema) {
     List<Schema.Field> parentFields = new ArrayList<>();
 
     Schema.Field commitTimeField =
-        new Schema.Field(HoodieRecord.COMMIT_TIME_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", (Object) null);
+        new Schema.Field(HoodieRecord.COMMIT_TIME_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", NullNode.getInstance());
     Schema.Field commitSeqnoField =
-        new Schema.Field(HoodieRecord.COMMIT_SEQNO_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", (Object) null);
+        new Schema.Field(HoodieRecord.COMMIT_SEQNO_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", NullNode.getInstance());
     Schema.Field recordKeyField =
-        new Schema.Field(HoodieRecord.RECORD_KEY_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", (Object) null);
+        new Schema.Field(HoodieRecord.RECORD_KEY_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", NullNode.getInstance());
     Schema.Field partitionPathField =
-        new Schema.Field(HoodieRecord.PARTITION_PATH_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", (Object) null);
+        new Schema.Field(HoodieRecord.PARTITION_PATH_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", NullNode.getInstance());
     Schema.Field fileNameField =
-        new Schema.Field(HoodieRecord.FILENAME_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", (Object) null);
+        new Schema.Field(HoodieRecord.FILENAME_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", NullNode.getInstance());
 
 Review comment:
   On schema generation, we had to do some stuff... Our service emits protobufs to kafka, which we transform to avro using ProtoToAvro converter. Because those are service to service messages, they are not perfect and we had to do some bending to make sure schema can evolve. Some things: 
   1) We made everything optional
   2) Avro and Protobuf allows records to have no fields. But parquet doesn't, so, we inject marker field called exists pretty much everywhere. Originally added it only to records with no fields, but found out that when fields are added and exists is autoremoved, parquet-avro reader breaks as it expects every field in parquet schema to be present in avro. I have a fix to parquet-avro to skip those fields, will send a PR to show so we can discuss whether it is a good idea or not. 
   3) Then we sometimes run sql transformation (same kafka stream produces multiple outputs). In this case, we infer schema from Spark (using NullTargetSchemaProvider). This schema must be correct as well which is not the case as spark omits defaults, so compaction breaks. Have a PR to adress.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] pratyakshsharma commented on issue #1513: [HUDI-793] Adding proper default to hudi metadata fields and proper handling to rewrite routine

Posted by GitBox <gi...@apache.org>.
pratyakshsharma commented on issue #1513: [HUDI-793] Adding proper default to hudi metadata fields and proper handling to rewrite routine
URL: https://github.com/apache/incubator-hudi/pull/1513#issuecomment-615434579
 
 
   > @pratyakshsharma can you please open a PR with these additional test cases :) 
   
   Sure. https://issues.apache.org/jira/browse/HUDI-803 tracks this. 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] afilipchik commented on a change in pull request #1513: [HUDI-793] Adding proper default to hudi metadata fields and proper handling to rewrite routine

Posted by GitBox <gi...@apache.org>.
afilipchik commented on a change in pull request #1513: [HUDI-793] Adding proper default to hudi metadata fields and proper handling to rewrite routine
URL: https://github.com/apache/incubator-hudi/pull/1513#discussion_r408480553
 
 

 ##########
 File path: hudi-common/src/main/java/org/apache/hudi/avro/HoodieAvroUtils.java
 ##########
 @@ -104,15 +105,15 @@ public static Schema addMetadataFields(Schema schema) {
     List<Schema.Field> parentFields = new ArrayList<>();
 
     Schema.Field commitTimeField =
-        new Schema.Field(HoodieRecord.COMMIT_TIME_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", (Object) null);
+        new Schema.Field(HoodieRecord.COMMIT_TIME_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", NullNode.getInstance());
     Schema.Field commitSeqnoField =
-        new Schema.Field(HoodieRecord.COMMIT_SEQNO_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", (Object) null);
+        new Schema.Field(HoodieRecord.COMMIT_SEQNO_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", NullNode.getInstance());
     Schema.Field recordKeyField =
-        new Schema.Field(HoodieRecord.RECORD_KEY_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", (Object) null);
+        new Schema.Field(HoodieRecord.RECORD_KEY_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", NullNode.getInstance());
     Schema.Field partitionPathField =
-        new Schema.Field(HoodieRecord.PARTITION_PATH_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", (Object) null);
+        new Schema.Field(HoodieRecord.PARTITION_PATH_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", NullNode.getInstance());
     Schema.Field fileNameField =
-        new Schema.Field(HoodieRecord.FILENAME_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", (Object) null);
+        new Schema.Field(HoodieRecord.FILENAME_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", NullNode.getInstance());
 
 Review comment:
   On schema generation, we had to do some stuff... Our service emits protobufs to kafka, which we transform to avro using ProtoToAvro converter. Because those are service to service messages, they are not perfect and we had to do some bending to make sure schema can evolve. Some things: 
   1) We made everything optional
   2) Avro and Protobuf allows records to have no fields. But parquet doesn't, so, we inject marker field called exists pretty much everywhere. We do it inside HUDI to avoid clutter in kafka. Originally added it only to records with no fields, but found out that when fields are added and exists is autoremoved, parquet-avro reader breaks as it expects every field in parquet schema to be present in avro. I have a fix to parquet-avro to skip those fields, will send a PR to show so we can discuss whether it is a good idea or not. 
   3) Then we sometimes run sql transformation (same kafka stream produces multiple outputs). In this case, we infer schema from Spark (using NullTargetSchemaProvider). This schema must be correct as well which is not the case as spark omits defaults, so compaction breaks. Have a PR to address.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] vinothchandar commented on issue #1513: [HUDI-793] Adding proper default to hudi metadata fields and proper handling to rewrite routine

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on issue #1513: [HUDI-793] Adding proper default to hudi metadata fields and proper handling to rewrite routine
URL: https://github.com/apache/incubator-hudi/pull/1513#issuecomment-614885624
 
 
   @pratyakshsharma I will take another pass once you and @afilipchik are on the same pass.. 
   
   Discussions point towards having more tests around default values.. Might be worth it for you to try a bunch of different combinations and see what the gaps are.. (outside of this PR?) 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] pratyakshsharma commented on a change in pull request #1513: [HUDI-793] Adding proper default to hudi metadata fields and proper handling to rewrite routine

Posted by GitBox <gi...@apache.org>.
pratyakshsharma commented on a change in pull request #1513: [HUDI-793] Adding proper default to hudi metadata fields and proper handling to rewrite routine
URL: https://github.com/apache/incubator-hudi/pull/1513#discussion_r408469635
 
 

 ##########
 File path: hudi-common/src/main/java/org/apache/hudi/avro/HoodieAvroUtils.java
 ##########
 @@ -206,7 +207,11 @@ private static GenericRecord rewrite(GenericRecord record, LinkedHashSet<Field>
     GenericRecord newRecord = new GenericData.Record(newSchema);
     for (Schema.Field f : fieldsToWrite) {
       if (record.get(f.name()) == null) {
-        newRecord.put(f.name(), f.defaultVal());
+        if (f.defaultVal() instanceof Null) {
+          newRecord.put(f.name(), null);
 
 Review comment:
   Ok so when using f.defaultVal(), NullNode corresponds to JsonProperties.Null class and hence the test case fails. Good catch :) 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] afilipchik commented on a change in pull request #1513: [HUDI-793] Adding proper default to hudi metadata fields and proper handling to rewrite routine

Posted by GitBox <gi...@apache.org>.
afilipchik commented on a change in pull request #1513: [HUDI-793] Adding proper default to hudi metadata fields and proper handling to rewrite routine
URL: https://github.com/apache/incubator-hudi/pull/1513#discussion_r408447606
 
 

 ##########
 File path: hudi-common/src/main/java/org/apache/hudi/avro/HoodieAvroUtils.java
 ##########
 @@ -104,15 +105,15 @@ public static Schema addMetadataFields(Schema schema) {
     List<Schema.Field> parentFields = new ArrayList<>();
 
     Schema.Field commitTimeField =
-        new Schema.Field(HoodieRecord.COMMIT_TIME_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", (Object) null);
+        new Schema.Field(HoodieRecord.COMMIT_TIME_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", NullNode.getInstance());
     Schema.Field commitSeqnoField =
-        new Schema.Field(HoodieRecord.COMMIT_SEQNO_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", (Object) null);
+        new Schema.Field(HoodieRecord.COMMIT_SEQNO_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", NullNode.getInstance());
     Schema.Field recordKeyField =
-        new Schema.Field(HoodieRecord.RECORD_KEY_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", (Object) null);
+        new Schema.Field(HoodieRecord.RECORD_KEY_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", NullNode.getInstance());
     Schema.Field partitionPathField =
-        new Schema.Field(HoodieRecord.PARTITION_PATH_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", (Object) null);
+        new Schema.Field(HoodieRecord.PARTITION_PATH_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", NullNode.getInstance());
     Schema.Field fileNameField =
-        new Schema.Field(HoodieRecord.FILENAME_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", (Object) null);
+        new Schema.Field(HoodieRecord.FILENAME_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", NullNode.getInstance());
 
 Review comment:
   Incorrect avro schema is a disaster waiting to happen. Current one says: it takes "null" or String, but no defaults. Given that it is serialized in both log and parquet files it has to be correct, as if we add new field it will not be readable. Here is example exception when message was written without a field and then attempted to be read with field that doesn't have default:
   `Caused by: org.apache.avro.AvroTypeException: Found Event, expecting Event, missing required field exist
   	at org.apache.avro.io.ResolvingDecoder.doAction(ResolvingDecoder.java:292)
   	at org.apache.avro.io.parsing.Parser.advance(Parser.java:88)
   	at org.apache.avro.io.ResolvingDecoder.readFieldOrder(ResolvingDecoder.java:130)
   	at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:215)
   	at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:175)
   	at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:153)
   	at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:145)`
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] pratyakshsharma commented on a change in pull request #1513: [HUDI-793] Adding proper default to hudi metadata fields and proper handling to rewrite routine

Posted by GitBox <gi...@apache.org>.
pratyakshsharma commented on a change in pull request #1513: [HUDI-793] Adding proper default to hudi metadata fields and proper handling to rewrite routine
URL: https://github.com/apache/incubator-hudi/pull/1513#discussion_r408423175
 
 

 ##########
 File path: hudi-common/src/main/java/org/apache/hudi/avro/HoodieAvroUtils.java
 ##########
 @@ -104,15 +105,15 @@ public static Schema addMetadataFields(Schema schema) {
     List<Schema.Field> parentFields = new ArrayList<>();
 
     Schema.Field commitTimeField =
-        new Schema.Field(HoodieRecord.COMMIT_TIME_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", (Object) null);
+        new Schema.Field(HoodieRecord.COMMIT_TIME_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", NullNode.getInstance());
     Schema.Field commitSeqnoField =
-        new Schema.Field(HoodieRecord.COMMIT_SEQNO_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", (Object) null);
+        new Schema.Field(HoodieRecord.COMMIT_SEQNO_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", NullNode.getInstance());
     Schema.Field recordKeyField =
-        new Schema.Field(HoodieRecord.RECORD_KEY_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", (Object) null);
+        new Schema.Field(HoodieRecord.RECORD_KEY_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", NullNode.getInstance());
     Schema.Field partitionPathField =
-        new Schema.Field(HoodieRecord.PARTITION_PATH_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", (Object) null);
+        new Schema.Field(HoodieRecord.PARTITION_PATH_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", NullNode.getInstance());
     Schema.Field fileNameField =
-        new Schema.Field(HoodieRecord.FILENAME_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", (Object) null);
+        new Schema.Field(HoodieRecord.FILENAME_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", NullNode.getInstance());
 
 Review comment:
   By doing this change, we are effectively going back to deprecated constructors again. Can you please explain the purpose of doing this? 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] pratyakshsharma commented on a change in pull request #1513: [HUDI-793] Adding proper default to hudi metadata fields and proper handling to rewrite routine

Posted by GitBox <gi...@apache.org>.
pratyakshsharma commented on a change in pull request #1513: [HUDI-793] Adding proper default to hudi metadata fields and proper handling to rewrite routine
URL: https://github.com/apache/incubator-hudi/pull/1513#discussion_r408598302
 
 

 ##########
 File path: hudi-common/src/main/java/org/apache/hudi/avro/HoodieAvroUtils.java
 ##########
 @@ -104,15 +105,15 @@ public static Schema addMetadataFields(Schema schema) {
     List<Schema.Field> parentFields = new ArrayList<>();
 
     Schema.Field commitTimeField =
-        new Schema.Field(HoodieRecord.COMMIT_TIME_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", (Object) null);
+        new Schema.Field(HoodieRecord.COMMIT_TIME_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", NullNode.getInstance());
     Schema.Field commitSeqnoField =
-        new Schema.Field(HoodieRecord.COMMIT_SEQNO_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", (Object) null);
+        new Schema.Field(HoodieRecord.COMMIT_SEQNO_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", NullNode.getInstance());
     Schema.Field recordKeyField =
-        new Schema.Field(HoodieRecord.RECORD_KEY_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", (Object) null);
+        new Schema.Field(HoodieRecord.RECORD_KEY_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", NullNode.getInstance());
     Schema.Field partitionPathField =
-        new Schema.Field(HoodieRecord.PARTITION_PATH_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", (Object) null);
+        new Schema.Field(HoodieRecord.PARTITION_PATH_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", NullNode.getInstance());
     Schema.Field fileNameField =
-        new Schema.Field(HoodieRecord.FILENAME_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", (Object) null);
+        new Schema.Field(HoodieRecord.FILENAME_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", NullNode.getInstance());
 
 Review comment:
   > parquet-avro reader breaks as it expects every field in parquet schema to be present in avro. I have a fix to parquet-avro to skip those fields, will send a PR to show so we can discuss whether it is a good idea or not
   
   Yes, please. Would love to see how you are approaching that fix. 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] pratyakshsharma commented on a change in pull request #1513: [HUDI-793] Adding proper default to hudi metadata fields and proper handling to rewrite routine

Posted by GitBox <gi...@apache.org>.
pratyakshsharma commented on a change in pull request #1513: [HUDI-793] Adding proper default to hudi metadata fields and proper handling to rewrite routine
URL: https://github.com/apache/incubator-hudi/pull/1513#discussion_r408425373
 
 

 ##########
 File path: hudi-common/src/main/java/org/apache/hudi/avro/HoodieAvroUtils.java
 ##########
 @@ -206,7 +207,11 @@ private static GenericRecord rewrite(GenericRecord record, LinkedHashSet<Field>
     GenericRecord newRecord = new GenericData.Record(newSchema);
     for (Schema.Field f : fieldsToWrite) {
       if (record.get(f.name()) == null) {
-        newRecord.put(f.name(), f.defaultVal());
+        if (f.defaultVal() instanceof Null) {
+          newRecord.put(f.name(), null);
 
 Review comment:
   Why do you need this? If f.defaultVal() is null, it will be automatically populated like that. 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] afilipchik commented on a change in pull request #1513: [HUDI-793] Adding proper default to hudi metadata fields and proper handling to rewrite routine

Posted by GitBox <gi...@apache.org>.
afilipchik commented on a change in pull request #1513: [HUDI-793] Adding proper default to hudi metadata fields and proper handling to rewrite routine
URL: https://github.com/apache/incubator-hudi/pull/1513#discussion_r408430839
 
 

 ##########
 File path: hudi-common/src/main/java/org/apache/hudi/avro/HoodieAvroUtils.java
 ##########
 @@ -206,7 +207,11 @@ private static GenericRecord rewrite(GenericRecord record, LinkedHashSet<Field>
     GenericRecord newRecord = new GenericData.Record(newSchema);
     for (Schema.Field f : fieldsToWrite) {
       if (record.get(f.name()) == null) {
-        newRecord.put(f.name(), f.defaultVal());
+        if (f.defaultVal() instanceof Null) {
+          newRecord.put(f.name(), null);
 
 Review comment:
   just run the test without it. You will get an exception. 
   if schema has "default": null it breaks, which should not be the case. 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] pratyakshsharma commented on a change in pull request #1513: [HUDI-793] Adding proper default to hudi metadata fields and proper handling to rewrite routine

Posted by GitBox <gi...@apache.org>.
pratyakshsharma commented on a change in pull request #1513: [HUDI-793] Adding proper default to hudi metadata fields and proper handling to rewrite routine
URL: https://github.com/apache/incubator-hudi/pull/1513#discussion_r408435191
 
 

 ##########
 File path: hudi-common/src/main/java/org/apache/hudi/avro/HoodieAvroUtils.java
 ##########
 @@ -206,7 +207,11 @@ private static GenericRecord rewrite(GenericRecord record, LinkedHashSet<Field>
     GenericRecord newRecord = new GenericData.Record(newSchema);
     for (Schema.Field f : fieldsToWrite) {
       if (record.get(f.name()) == null) {
-        newRecord.put(f.name(), f.defaultVal());
+        if (f.defaultVal() instanceof Null) {
+          newRecord.put(f.name(), null);
 
 Review comment:
   Which test you are talking about? 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] pratyakshsharma commented on a change in pull request #1513: [HUDI-793] Adding proper default to hudi metadata fields and proper handling to rewrite routine

Posted by GitBox <gi...@apache.org>.
pratyakshsharma commented on a change in pull request #1513: [HUDI-793] Adding proper default to hudi metadata fields and proper handling to rewrite routine
URL: https://github.com/apache/incubator-hudi/pull/1513#discussion_r408445302
 
 

 ##########
 File path: hudi-common/src/main/java/org/apache/hudi/avro/HoodieAvroUtils.java
 ##########
 @@ -206,7 +207,11 @@ private static GenericRecord rewrite(GenericRecord record, LinkedHashSet<Field>
     GenericRecord newRecord = new GenericData.Record(newSchema);
     for (Schema.Field f : fieldsToWrite) {
       if (record.get(f.name()) == null) {
-        newRecord.put(f.name(), f.defaultVal());
+        if (f.defaultVal() instanceof Null) {
+          newRecord.put(f.name(), null);
 
 Review comment:
   But these tests are passing on the master branch without the patch that you did. :) 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] pratyakshsharma commented on a change in pull request #1513: [HUDI-793] Adding proper default to hudi metadata fields and proper handling to rewrite routine

Posted by GitBox <gi...@apache.org>.
pratyakshsharma commented on a change in pull request #1513: [HUDI-793] Adding proper default to hudi metadata fields and proper handling to rewrite routine
URL: https://github.com/apache/incubator-hudi/pull/1513#discussion_r408445302
 
 

 ##########
 File path: hudi-common/src/main/java/org/apache/hudi/avro/HoodieAvroUtils.java
 ##########
 @@ -206,7 +207,11 @@ private static GenericRecord rewrite(GenericRecord record, LinkedHashSet<Field>
     GenericRecord newRecord = new GenericData.Record(newSchema);
     for (Schema.Field f : fieldsToWrite) {
       if (record.get(f.name()) == null) {
-        newRecord.put(f.name(), f.defaultVal());
+        if (f.defaultVal() instanceof Null) {
+          newRecord.put(f.name(), null);
 
 Review comment:
   But these tests are passing on the master branch without the patch that you did. :) 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] pratyakshsharma commented on a change in pull request #1513: [HUDI-793] Adding proper default to hudi metadata fields and proper handling to rewrite routine

Posted by GitBox <gi...@apache.org>.
pratyakshsharma commented on a change in pull request #1513: [HUDI-793] Adding proper default to hudi metadata fields and proper handling to rewrite routine
URL: https://github.com/apache/incubator-hudi/pull/1513#discussion_r408436224
 
 

 ##########
 File path: hudi-common/src/main/java/org/apache/hudi/avro/HoodieAvroUtils.java
 ##########
 @@ -104,15 +105,15 @@ public static Schema addMetadataFields(Schema schema) {
     List<Schema.Field> parentFields = new ArrayList<>();
 
     Schema.Field commitTimeField =
-        new Schema.Field(HoodieRecord.COMMIT_TIME_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", (Object) null);
+        new Schema.Field(HoodieRecord.COMMIT_TIME_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", NullNode.getInstance());
     Schema.Field commitSeqnoField =
-        new Schema.Field(HoodieRecord.COMMIT_SEQNO_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", (Object) null);
+        new Schema.Field(HoodieRecord.COMMIT_SEQNO_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", NullNode.getInstance());
     Schema.Field recordKeyField =
-        new Schema.Field(HoodieRecord.RECORD_KEY_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", (Object) null);
+        new Schema.Field(HoodieRecord.RECORD_KEY_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", NullNode.getInstance());
     Schema.Field partitionPathField =
-        new Schema.Field(HoodieRecord.PARTITION_PATH_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", (Object) null);
+        new Schema.Field(HoodieRecord.PARTITION_PATH_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", NullNode.getInstance());
     Schema.Field fileNameField =
-        new Schema.Field(HoodieRecord.FILENAME_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", (Object) null);
+        new Schema.Field(HoodieRecord.FILENAME_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", NullNode.getInstance());
 
 Review comment:
   Ok.. I am still trying to understand what difference does it make if the schema is generated without defaultValue or if it is generated with defaultValue as null. Can you please provide more points to support this? 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] afilipchik commented on a change in pull request #1513: [HUDI-793] Adding proper default to hudi metadata fields and proper handling to rewrite routine

Posted by GitBox <gi...@apache.org>.
afilipchik commented on a change in pull request #1513: [HUDI-793] Adding proper default to hudi metadata fields and proper handling to rewrite routine
URL: https://github.com/apache/incubator-hudi/pull/1513#discussion_r408477566
 
 

 ##########
 File path: hudi-common/src/main/java/org/apache/hudi/avro/HoodieAvroUtils.java
 ##########
 @@ -104,15 +105,15 @@ public static Schema addMetadataFields(Schema schema) {
     List<Schema.Field> parentFields = new ArrayList<>();
 
     Schema.Field commitTimeField =
-        new Schema.Field(HoodieRecord.COMMIT_TIME_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", (Object) null);
+        new Schema.Field(HoodieRecord.COMMIT_TIME_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", NullNode.getInstance());
     Schema.Field commitSeqnoField =
-        new Schema.Field(HoodieRecord.COMMIT_SEQNO_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", (Object) null);
+        new Schema.Field(HoodieRecord.COMMIT_SEQNO_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", NullNode.getInstance());
     Schema.Field recordKeyField =
-        new Schema.Field(HoodieRecord.RECORD_KEY_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", (Object) null);
+        new Schema.Field(HoodieRecord.RECORD_KEY_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", NullNode.getInstance());
     Schema.Field partitionPathField =
-        new Schema.Field(HoodieRecord.PARTITION_PATH_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", (Object) null);
+        new Schema.Field(HoodieRecord.PARTITION_PATH_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", NullNode.getInstance());
     Schema.Field fileNameField =
-        new Schema.Field(HoodieRecord.FILENAME_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", (Object) null);
+        new Schema.Field(HoodieRecord.FILENAME_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", NullNode.getInstance());
 
 Review comment:
   Sure, there might be another way. I think there are changes with nulls in 1.9.0, but upgrading is not easy. Just to clarify, it is not for the write support, but for the read one. Consider situation:
   We have avro with 4 metada fields written, and defaults are not set. Not an issue as it is written.
   We add another metadata field (sourceOffset, or something), which also doesn't have default field and redeploy. When we try to read old record with new schema it will fail because old record doesn't have new field and there is no default for it. I hit this issue with spark-> avro transformation, have another PR to address

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] afilipchik commented on a change in pull request #1513: [HUDI-793] Adding proper default to hudi metadata fields and proper handling to rewrite routine

Posted by GitBox <gi...@apache.org>.
afilipchik commented on a change in pull request #1513: [HUDI-793] Adding proper default to hudi metadata fields and proper handling to rewrite routine
URL: https://github.com/apache/incubator-hudi/pull/1513#discussion_r408447606
 
 

 ##########
 File path: hudi-common/src/main/java/org/apache/hudi/avro/HoodieAvroUtils.java
 ##########
 @@ -104,15 +105,15 @@ public static Schema addMetadataFields(Schema schema) {
     List<Schema.Field> parentFields = new ArrayList<>();
 
     Schema.Field commitTimeField =
-        new Schema.Field(HoodieRecord.COMMIT_TIME_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", (Object) null);
+        new Schema.Field(HoodieRecord.COMMIT_TIME_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", NullNode.getInstance());
     Schema.Field commitSeqnoField =
-        new Schema.Field(HoodieRecord.COMMIT_SEQNO_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", (Object) null);
+        new Schema.Field(HoodieRecord.COMMIT_SEQNO_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", NullNode.getInstance());
     Schema.Field recordKeyField =
-        new Schema.Field(HoodieRecord.RECORD_KEY_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", (Object) null);
+        new Schema.Field(HoodieRecord.RECORD_KEY_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", NullNode.getInstance());
     Schema.Field partitionPathField =
-        new Schema.Field(HoodieRecord.PARTITION_PATH_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", (Object) null);
+        new Schema.Field(HoodieRecord.PARTITION_PATH_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", NullNode.getInstance());
     Schema.Field fileNameField =
-        new Schema.Field(HoodieRecord.FILENAME_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", (Object) null);
+        new Schema.Field(HoodieRecord.FILENAME_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", NullNode.getInstance());
 
 Review comment:
   Incorrect avro schema is a disaster waiting to happen. Current one says: it takes "null" or String, but if you try to read the message with reader schema that doesn't have those fields it will break as default will not be provided. Given that it is serialized in both log and parquet files it has to be correct, as if we add new field it will not be readable. Here is example exception when message was written without a field and  attempted to be read with field that doesn't have default:
   `Caused by: org.apache.avro.AvroTypeException: Found Event, expecting Event, missing required field exist
   	at org.apache.avro.io.ResolvingDecoder.doAction(ResolvingDecoder.java:292)
   	at org.apache.avro.io.parsing.Parser.advance(Parser.java:88)
   	at org.apache.avro.io.ResolvingDecoder.readFieldOrder(ResolvingDecoder.java:130)
   	at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:215)
   	at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:175)
   	at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:153)
   	at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:145)`
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] afilipchik commented on a change in pull request #1513: [HUDI-793] Adding proper default to hudi metadata fields and proper handling to rewrite routine

Posted by GitBox <gi...@apache.org>.
afilipchik commented on a change in pull request #1513: [HUDI-793] Adding proper default to hudi metadata fields and proper handling to rewrite routine
URL: https://github.com/apache/incubator-hudi/pull/1513#discussion_r408480553
 
 

 ##########
 File path: hudi-common/src/main/java/org/apache/hudi/avro/HoodieAvroUtils.java
 ##########
 @@ -104,15 +105,15 @@ public static Schema addMetadataFields(Schema schema) {
     List<Schema.Field> parentFields = new ArrayList<>();
 
     Schema.Field commitTimeField =
-        new Schema.Field(HoodieRecord.COMMIT_TIME_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", (Object) null);
+        new Schema.Field(HoodieRecord.COMMIT_TIME_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", NullNode.getInstance());
     Schema.Field commitSeqnoField =
-        new Schema.Field(HoodieRecord.COMMIT_SEQNO_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", (Object) null);
+        new Schema.Field(HoodieRecord.COMMIT_SEQNO_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", NullNode.getInstance());
     Schema.Field recordKeyField =
-        new Schema.Field(HoodieRecord.RECORD_KEY_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", (Object) null);
+        new Schema.Field(HoodieRecord.RECORD_KEY_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", NullNode.getInstance());
     Schema.Field partitionPathField =
-        new Schema.Field(HoodieRecord.PARTITION_PATH_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", (Object) null);
+        new Schema.Field(HoodieRecord.PARTITION_PATH_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", NullNode.getInstance());
     Schema.Field fileNameField =
-        new Schema.Field(HoodieRecord.FILENAME_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", (Object) null);
+        new Schema.Field(HoodieRecord.FILENAME_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", NullNode.getInstance());
 
 Review comment:
   On schema generation, we had to do some stuff... Our service emits protobufs to kafka, which we transform to avro using ProtoToAvro converter. Because those are service to service messages, they are not perfect and we had to do some bending to make sure schema can evolve. Some things: 
   1) We made everything optional
   2) Avro and Protobuf allows records to have no fields. But parquet doesn't, so, we inject marker field called exists pretty much everywhere. We do it inside HUDI to avoid clutter in kafka. Originally added it only to records with no fields, but found out that when fields are added and exists is autoremoved, parquet-avro reader breaks as it expects every field in parquet schema to be present in avro. I have a fix to parquet-avro to skip those fields, will send a PR to show so we can discuss whether it is a good idea or not. 
   3) Then we sometimes run sql transformation (same kafka stream produces multiple outputs). In this case, we infer schema from Spark (using NullTargetSchemaProvider). This schema must be correct as well which is not the case as spark omits defaults, so compaction breaks. Have a PR to adress.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] pratyakshsharma commented on issue #1513: [HUDI-793] Adding proper default to hudi metadata fields and proper handling to rewrite routine

Posted by GitBox <gi...@apache.org>.
pratyakshsharma commented on issue #1513:
URL: https://github.com/apache/incubator-hudi/pull/1513#issuecomment-616657717


   > Sure. https://issues.apache.org/jira/browse/HUDI-803 tracks this.
   
   https://github.com/apache/incubator-hudi/pull/1538 is raised for this.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-hudi] vinothchandar merged pull request #1513: [HUDI-793] Adding proper default to hudi metadata fields and proper handling to rewrite routine

Posted by GitBox <gi...@apache.org>.
vinothchandar merged pull request #1513:
URL: https://github.com/apache/incubator-hudi/pull/1513


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1513: [HUDI-793] Adding proper default to hudi metadata fields and proper handling to rewrite routine

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on a change in pull request #1513:
URL: https://github.com/apache/incubator-hudi/pull/1513#discussion_r424814160



##########
File path: hudi-common/src/main/java/org/apache/hudi/avro/HoodieAvroUtils.java
##########
@@ -104,15 +105,15 @@ public static Schema addMetadataFields(Schema schema) {
     List<Schema.Field> parentFields = new ArrayList<>();
 
     Schema.Field commitTimeField =
-        new Schema.Field(HoodieRecord.COMMIT_TIME_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", (Object) null);
+        new Schema.Field(HoodieRecord.COMMIT_TIME_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", NullNode.getInstance());
     Schema.Field commitSeqnoField =
-        new Schema.Field(HoodieRecord.COMMIT_SEQNO_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", (Object) null);
+        new Schema.Field(HoodieRecord.COMMIT_SEQNO_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", NullNode.getInstance());
     Schema.Field recordKeyField =
-        new Schema.Field(HoodieRecord.RECORD_KEY_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", (Object) null);
+        new Schema.Field(HoodieRecord.RECORD_KEY_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", NullNode.getInstance());
     Schema.Field partitionPathField =
-        new Schema.Field(HoodieRecord.PARTITION_PATH_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", (Object) null);
+        new Schema.Field(HoodieRecord.PARTITION_PATH_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", NullNode.getInstance());
     Schema.Field fileNameField =
-        new Schema.Field(HoodieRecord.FILENAME_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", (Object) null);
+        new Schema.Field(HoodieRecord.FILENAME_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", NullNode.getInstance());

Review comment:
       Overloading of `null` to denote absence of a default value in avro, is confusing to say the least.. (atleast I learnt that now) But I am trying to grok the actual change in behavior.. 
   
   This change seems to be orthogonal to the fix below.. Effectively, we are making the metadata fields nullable (as opposed to having no default values), with this change... While I agree, metadata fields won't have nulls.. Having this safety is better.. 
   
   
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org