Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2021/11/01 22:27:07 UTC

[GitHub] [hudi] nsivabalan commented on issue #3892: Insert produces 44764 files with ~50MB each

nsivabalan commented on issue #3892:
URL: https://github.com/apache/hudi/issues/3892#issuecomment-956756527


   Let me try to explain. @bhasudha: Can you document this somewhere? It might be useful for everyone in the community.
   
   Bulk_insert: 
   This does not do any small file handling. 
   So file sizing relies solely on HoodieCompactionConfig.COPY_ON_WRITE_RECORD_SIZE_ESTIMATE.key() and the parallelism set for bulk_insert. 
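   
   To make that concrete, here is a minimal sketch (mine, not from the docs) of a bulk_insert write via the Spark datasource. The table name and path are hypothetical; hoodie.copyonwrite.record.size.estimate is the key behind COPY_ON_WRITE_RECORD_SIZE_ESTIMATE, but the values here are only illustrative:
   
       import org.apache.spark.sql.SaveMode
       
       df.write.format("hudi").
         option("hoodie.table.name", "my_table").                   // hypothetical table name
         option("hoodie.datasource.write.operation", "bulk_insert").
         // bulk_insert does no small file handling, so sizing comes from these two:
         option("hoodie.copyonwrite.record.size.estimate", "128").  // estimated avg record size in bytes
         option("hoodie.bulkinsert.shuffle.parallelism", "200").    // roughly controls the number of output files
         mode(SaveMode.Append).
         save("s3://bucket/path/my_table")                          // hypothetical base path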
   
   Insert: 
   This does small file handling and can bin-pack incoming records into existing files. 
   For the first commit to a hudi table, hudi has no idea of the record size, so it relies on HoodieCompactionConfig.COPY_ON_WRITE_RECORD_SIZE_ESTIMATE.key() to determine how many records might go into one data file. In subsequent commits, hudi infers the record size from previous commits and uses that for small file handling. 
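   
   A similar sketch (again illustrative, not from the docs) for an insert write, adding the small file handling knobs. hoodie.parquet.small.file.limit and hoodie.parquet.max.file.size are the standard Hudi configs for what counts as a small file and the target data file size; the values here are just examples, not recommendations:
   
       df.write.format("hudi").
         option("hoodie.table.name", "my_table").
         option("hoodie.datasource.write.operation", "insert").
         option("hoodie.copyonwrite.record.size.estimate", "128"). // used only until hudi can infer real sizes
         option("hoodie.parquet.small.file.limit", "104857600").   // files under ~100MB are bin-packing candidates
         option("hoodie.parquet.max.file.size", "125829120").      // target max size for a data file (~120MB)
         option("hoodie.insert.shuffle.parallelism", "200").
         mode(SaveMode.Append).
         save("s3://bucket/path/my_table")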
   
   btw, each operation has a different config for parallelism, just in case you weren't aware of them (see the sketch after this list): 
   
   hoodie.upsert.shuffle.parallelism
   hoodie.insert.shuffle.parallelism
   hoodie.delete.shuffle.parallelism
   hoodie.bulkinsert.shuffle.parallelism
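   
   For example, you could set them once in a shared options map (a sketch; the values are illustrative) and reuse it across writes, with each operation picking up its own key:
   
       val parallelismOpts = Map(
         "hoodie.upsert.shuffle.parallelism"     -> "200",
         "hoodie.insert.shuffle.parallelism"     -> "200",
         "hoodie.delete.shuffle.parallelism"     -> "200",
         "hoodie.bulkinsert.shuffle.parallelism" -> "200")
       
       // e.g. an upsert write; only hoodie.upsert.shuffle.parallelism applies here
       df.write.format("hudi").
         options(parallelismOpts).
         option("hoodie.table.name", "my_table").
         option("hoodie.datasource.write.operation", "upsert").
         mode(SaveMode.Append).
         save("s3://bucket/path/my_table")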
   

