Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2021/11/25 09:27:33 UTC

[GitHub] [hudi] manojpec commented on pull request #4067: [HUDI-2763] Metadata table records key deduplication

manojpec commented on pull request #4067:
URL: https://github.com/apache/hudi/pull/4067#issuecomment-979015349


   > Concept looks good. But why introduce a new block type and not do it for the HoodieHFileDataBlock itself?
   > 
   > When the HFile format is used, whether for Metadata Table or elsewhere in HUDI, there will always be a key for the HFile and it will be derived from some field of the record. Hence, this HFile key will always be redundant. Therefore this optimization needs to be performed for HoodieHFileDataBlock itself.
   > 
   > HoodieHFileDataBlock already accepts a "keyField". We can simplify this change by:
   > 
   > 1. If keyField is not None:
   >    
   >    * set keyField to "null" and do not save it
   >    * materialize the keyField from HFile key
   > 2. If keyField is None - no need to do the above
   
   I initially proposed a new on-disk block type, such as HFILE_METADATA_BLOCK, to distinguish these blocks from other HFILE_DATA_BLOCKS. But that changes the on-disk block format, so it is not backward compatible and downgrades would not work. After further discussion, we chose the current layering so that the metadata-record-specific key deduplication logic does not spill over into the lowest HFile block layer. This code structure also has the benefit of restricting the functionality to specific users of HFile (here, the metadata table) rather than applying it to all of them.
   
   Previously, the HFile block layer hard-coded keyField = "key", assuming the record payload would always have this field. But that is true only for the metadata payload. It also didn't sit well with the "populate meta fields" config, where the key could differ depending on the table user. So, to support virtual keys for the metadata table, we had to pull the abstraction out to higher layers. Similarly, if we make the HFile block layer assume the deduplication needs, it might restrict future uses of the HFile type by other users.
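
   To make the layering concrete, here is a minimal sketch of the idea, not the code in this PR. All class and method names below (KeyDedupSketch, SortedBlock, DedupingWriter) are hypothetical stand-ins and are not Hudi classes. The block layer stores key -> payload and knows nothing about record fields; a higher layer that opts in drops the redundant key field before writing and restores it from the block key on read. For simplicity the sketch removes the field from a map instead of nulling it out.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.TreeMap;

// Hypothetical sketch: none of these types exist in Hudi.
final class KeyDedupSketch {

    /** Stand-in for the lowest block layer: a sorted key -> payload store
     *  that makes no assumption about which payload field holds the key. */
    static final class SortedBlock {
        private final TreeMap<String, Map<String, String>> entries = new TreeMap<>();

        void put(String key, Map<String, String> payload) {
            entries.put(key, payload);
        }

        Map<String, String> get(String key) {
            return entries.get(key);
        }
    }

    /** Stand-in for a higher layer (for example, the metadata table) that
     *  opts in to key deduplication by naming its key field. */
    static final class DedupingWriter {
        private final SortedBlock block;
        private final Optional<String> keyField;

        DedupingWriter(SortedBlock block, Optional<String> keyField) {
            this.block = block;
            this.keyField = keyField;
        }

        /** Drops the redundant key field (if one is configured) before storing. */
        void write(String key, Map<String, String> record) {
            Map<String, String> stored = new HashMap<>(record);
            keyField.ifPresent(field -> stored.remove(field));
            block.put(key, stored);
        }

        /** Rebuilds the full record by restoring the key field from the block key. */
        Map<String, String> read(String key) {
            Map<String, String> stored = block.get(key);
            if (stored == null) {
                return null;
            }
            Map<String, String> record = new HashMap<>(stored);
            keyField.ifPresent(field -> record.put(field, key));
            return record;
        }
    }
}
```

   A caller that passes Optional.empty() as the key field gets its records stored and returned unchanged, which is the point of keeping the deduplication decision out of the block layer.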
   
   

