You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2020/04/15 21:47:42 UTC

[GitHub] [incubator-hudi] prashantwason opened a new pull request #1520: [HUDI-797] Small performance improvement for rewriting records.

prashantwason opened a new pull request #1520: [HUDI-797] Small performance improvement for rewriting records.
URL: https://github.com/apache/incubator-hudi/pull/1520
 
 
   When adding HUDI metadata field to schema, the position of incoming schema fields is retained. This allows us to lookup fields in exiting record and assign to the new record using field positions (array index lookup). This is faster than looking up fields using name (HashMap based lookup).
   
   ## What is the purpose of the pull request
   
   Small performance improvement for rewriting records during the ingestion phase. [HUDI-797](https://issues.apache.org/jira/browse/HUDI-797) as the details on the usecase and improvements. 
   
   ## Brief change log
   
   1. Modified HoodieAvroUtils.addMetadataFields to retain the field positions.
   2. Added a new rewriting function HoodieAvroUtils.rewriteHoodieRecord which rewrites using field positions rather than field names.
   
   ## Verify this pull request
   
   This change is verified automatically by all the HUDI client tests which ingest data into HUDI.
   
   ## Committer checklist
   
    - [ ] Has a corresponding JIRA in PR title & commit
    
    - [ ] Commit message is descriptive of the change
    
    - [ ] CI is green
   
    - [ ] Necessary doc changes done or have another open PR
          
    - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [hudi] prashantwason closed pull request #1520: [WIP][HUDI-797] Small performance improvement for rewriting records.

Posted by GitBox <gi...@apache.org>.
prashantwason closed pull request #1520:
URL: https://github.com/apache/hudi/pull/1520


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] prashantwason commented on pull request #1520: [WIP][HUDI-797] Small performance improvement for rewriting records.

Posted by GitBox <gi...@apache.org>.
prashantwason commented on pull request #1520:
URL: https://github.com/apache/hudi/pull/1520#issuecomment-636214691


   Since the Hoodie Metadata fields have to be added at fixed positions, the other fields of the schema will get an "offset" position. This offset cannot be guaranteed to be fixed (= num_hoodie_metadata_fields) as the existing schema may have a field named with hoodie metadata or for any other reason.
   
   Hence, for sake of data integrity, we cannot refer to fields by position in the current design.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-hudi] pratyakshsharma commented on a change in pull request #1520: [HUDI-797] Small performance improvement for rewriting records.

Posted by GitBox <gi...@apache.org>.
pratyakshsharma commented on a change in pull request #1520: [HUDI-797] Small performance improvement for rewriting records.
URL: https://github.com/apache/incubator-hudi/pull/1520#discussion_r410171475
 
 

 ##########
 File path: hudi-common/src/main/java/org/apache/hudi/avro/HoodieAvroUtils.java
 ##########
 @@ -231,6 +254,36 @@ private static GenericRecord rewrite(GenericRecord record, LinkedHashSet<Field>
     return allFields;
   }
 
+  /*
+   * Given a avro record with a given schema, rewrites it into the new schema while setting fields only from the old
+   * schema.
+   *
+   * NOTE: This function is only suitable if newSchema has fields with the same position as record's schema.
+   */
+  public static GenericRecord rewriteHoodieRecord(GenericRecord record, Schema newSchema) {
+    return rewriteHoodieRecord(record, record.getSchema(), newSchema);
 
 Review comment:
   Please use getCombinedFieldsToWrite() function here, that will help in handling schema evolution cases also. 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] pratyakshsharma commented on a change in pull request #1520: [HUDI-797] Small performance improvement for rewriting records.

Posted by GitBox <gi...@apache.org>.
pratyakshsharma commented on a change in pull request #1520: [HUDI-797] Small performance improvement for rewriting records.
URL: https://github.com/apache/incubator-hudi/pull/1520#discussion_r410170533
 
 

 ##########
 File path: hudi-common/src/main/java/org/apache/hudi/avro/HoodieAvroUtils.java
 ##########
 @@ -231,6 +254,36 @@ private static GenericRecord rewrite(GenericRecord record, LinkedHashSet<Field>
     return allFields;
   }
 
+  /*
+   * Given a avro record with a given schema, rewrites it into the new schema while setting fields only from the old
+   * schema.
+   *
+   * NOTE: This function is only suitable if newSchema has fields with the same position as record's schema.
+   */
+  public static GenericRecord rewriteHoodieRecord(GenericRecord record, Schema newSchema) {
+    return rewriteHoodieRecord(record, record.getSchema(), newSchema);
+  }
+
+  /**
+   * Given a avro record with a given schema, rewrites it into the new schema while setting fields only from the old
+   * schema.
+   *
+   * This function has better performance than rewrite() even though it provides the same functionality.
+   *
+   * NOTE: This function is only suitable if newSchema has fields with the same position as schemaWithFields.
+   */
+  public static GenericRecord rewriteHoodieRecord(GenericRecord record, Schema schemaWithFields, Schema newSchema) {
+    GenericRecord newRecord = new GenericData.Record(newSchema);
+    for (Schema.Field f : schemaWithFields.getFields()) {
+      newRecord.put(f.pos(), record.get(f.pos()));
 
 Review comment:
   Please take care of handling default values here. 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] vinothchandar commented on issue #1520: [HUDI-797] Small performance improvement for rewriting records.

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on issue #1520: [HUDI-797] Small performance improvement for rewriting records.
URL: https://github.com/apache/incubator-hudi/pull/1520#issuecomment-614301101
 
 
   @lamber-ken could you please review this 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] lamber-ken commented on issue #1520: [HUDI-797] Small performance improvement for rewriting records.

Posted by GitBox <gi...@apache.org>.
lamber-ken commented on issue #1520: [HUDI-797] Small performance improvement for rewriting records.
URL: https://github.com/apache/incubator-hudi/pull/1520#issuecomment-614365314
 
 
   > @lamber-ken could you please review this
   
   OK : )

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] prashantwason commented on pull request #1520: [HUDI-797] Small performance improvement for rewriting records.

Posted by GitBox <gi...@apache.org>.
prashantwason commented on pull request #1520:
URL: https://github.com/apache/incubator-hudi/pull/1520#issuecomment-620241928


   HoodieParquetRealtimeInputFormat has hardcoded the positions of the HUDI specific metadata fields.
   
     // These positions have to be deterministic across all tables
     public static final int HOODIE_COMMIT_TIME_COL_POS = 0;
     public static final int HOODIE_RECORD_KEY_COL_POS = 2;
     public static final int HOODIE_PARTITION_PATH_COL_POS = 3;
     public static final String HOODIE_READ_COLUMNS_PROP = "hoodie.read.columns.set";
   
   This is incompatible with the changes required in this PR. Am looking into the reason for the above and whether I can work around it.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-hudi] vinothchandar commented on pull request #1520: [HUDI-797] Small performance improvement for rewriting records.

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on pull request #1520:
URL: https://github.com/apache/incubator-hudi/pull/1520#issuecomment-620386911


   Meta fields have to come first.. and I think its too late to change that in hudi , as lots of data have been written already :) 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-hudi] vinothchandar commented on pull request #1520: [HUDI-797] Small performance improvement for rewriting records.

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on pull request #1520:
URL: https://github.com/apache/incubator-hudi/pull/1520#issuecomment-620386762


   @prashantwason the reason for these is to additionally project out the _hoodie_record_key field needed to merge base and log files.. Hive will only hand the column numbers to be read for a given query.. Hope that helps 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-hudi] bvaradar commented on pull request #1520: [HUDI-797] Small performance improvement for rewriting records.

Posted by GitBox <gi...@apache.org>.
bvaradar commented on pull request #1520:
URL: https://github.com/apache/incubator-hudi/pull/1520#issuecomment-629040082


   @prashantwason : Looking at the comments, it looks like this PR is going to be abandoned. If so, can you please close this PR.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org