Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2020/10/14 16:39:27 UTC

[GitHub] [hudi] KarthickAN commented on issue #2066: [SUPPORT] Hudi is increasing the storage size big time

KarthickAN commented on issue #2066:
URL: https://github.com/apache/hudi/issues/2066#issuecomment-708521865


   @bvaradar We use Hudi for two types of data. The two are very similar, with some differences in the schema. Development is almost complete. For the data type with the lower volume, the size difference is huge.
   
   Say I have the following number of objects in JSON Lines format:
   4326 Objects - 580.4 MB
   
   When transformed to Hudi with snappy compression enabled, it comes to:
   4599 Objects - 7.2 GB
   
   which is really huge. For the other type, where we have a higher volume of data, I don't see this issue. It looks like this:
   
   json lines
   6895 Objects - 100.0 GB
   
   parquet snappy with hudi
   10597 Objects - 42.4 GB
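
   To make the two cases easier to compare, here is a minimal sketch that computes the size ratio and the average object size for each data type from the figures above. It assumes the listed sizes are binary units (1 GB = 1024 MB); the labels "small" and "large" and the helper name are only for illustration.

```python
# Object counts and sizes copied from the figures above.
# Assumption: MB/GB are binary units (1 GB = 1024 MB).
MB_PER_GB = 1024

# Each entry is (object_count, total_size_in_mb).
datasets = {
    "small": {"json": (4326, 580.4), "hudi": (4599, 7.2 * MB_PER_GB)},
    "large": {"json": (6895, 100.0 * MB_PER_GB), "hudi": (10597, 42.4 * MB_PER_GB)},
}


def summarize(entry):
    """Return the Hudi/JSON size ratio and the average object size in MB."""
    json_objects, json_mb = entry["json"]
    hudi_objects, hudi_mb = entry["hudi"]
    return {
        "size_ratio": hudi_mb / json_mb,
        "json_avg_mb": json_mb / json_objects,
        "hudi_avg_mb": hudi_mb / hudi_objects,
    }


summary = {name: summarize(entry) for name, entry in datasets.items()}
for name, stats in summary.items():
    print(name, {key: round(value, 2) for key, value in stats.items()})
```

   Under that unit assumption, the low-volume data grows by roughly 12.7x with an average Hudi object of about 1.6 MB, while the high-volume data shrinks to about 0.42x of its JSON size.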
   
   These are the configs I am using right now:
   
   SmallFileSize = 104857600
   MaxFileSize = 125829120
   RecordSize = 35
   CompressionRatio = 5
   InsertSplitSize = 3500000
   IndexBloomNumEntries = 1500000
   KeyGenClass = org.apache.hudi.keygen.ComplexKeyGenerator
   RecordKeyFields = sourceid,sourceassetid,sourceeventid,value,timestamp
   TableType = COPY_ON_WRITE
   PartitionPathFields = date,sourceid
   HiveStylePartitioning = True
   WriteOperation = insert
   CompressionCodec = snappy
   CommitsRetained = 1
   CombineBeforeInsert = True
   PrecombineField = timestamp
   InsertDropDuplicates = True
   InsertShuffleParallelism = 100
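
   For reference, this is a rough sketch of how the list above might look as Hudi write options. The names in the list are my own labels, so the `hoodie.*` keys here are an assumed mapping and should be checked against the configuration reference for the Hudi version in use. The arithmetic at the end only restates what the size-related values imply.

```python
# Values copied from the list above.
# Assumption: the hoodie.* key for each label is my best guess at the
# corresponding Hudi option; verify against the docs before relying on it.
hudi_options = {
    "hoodie.parquet.small.file.limit": "104857600",
    "hoodie.parquet.max.file.size": "125829120",
    "hoodie.copyonwrite.record.size.estimate": "35",
    "hoodie.parquet.compression.ratio": "5",
    "hoodie.copyonwrite.insert.split.size": "3500000",
    "hoodie.index.bloom.num_entries": "1500000",
    "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.ComplexKeyGenerator",
    "hoodie.datasource.write.recordkey.field": "sourceid,sourceassetid,sourceeventid,value,timestamp",
    "hoodie.datasource.write.table.type": "COPY_ON_WRITE",
    "hoodie.datasource.write.partitionpath.field": "date,sourceid",
    "hoodie.datasource.write.hive_style_partitioning": "true",
    "hoodie.datasource.write.operation": "insert",
    "hoodie.parquet.compression.codec": "snappy",
    "hoodie.cleaner.commits.retained": "1",
    "hoodie.combine.before.insert": "true",
    "hoodie.datasource.write.precombine.field": "timestamp",
    "hoodie.datasource.write.insert.drop.duplicates": "true",
    "hoodie.insert.shuffle.parallelism": "100",
}

BYTES_PER_MB = 2 ** 20

small_file_mb = int(hudi_options["hoodie.parquet.small.file.limit"]) / BYTES_PER_MB
max_file_mb = int(hudi_options["hoodie.parquet.max.file.size"]) / BYTES_PER_MB
record_size = int(hudi_options["hoodie.copyonwrite.record.size.estimate"])
split_size = int(hudi_options["hoodie.copyonwrite.insert.split.size"])

# Records that fit in one max-size file at the estimated record size.
records_per_max_file = int(hudi_options["hoodie.parquet.max.file.size"]) // record_size
# Bytes implied by one insert split at the estimated record size.
split_bytes = split_size * record_size

print(f"small file limit: {small_file_mb:.0f} MB, max file size: {max_file_mb:.0f} MB")
print(f"records per max-size file at {record_size} B/record: {records_per_max_file}")
print(f"one insert split: {split_bytes / BYTES_PER_MB:.1f} MB")
```

   So the small-file limit is 100 MB, the max file size is 120 MB, and one insert split of 3,500,000 records at 35 bytes each works out to about 116.8 MB.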
   
   
   Is there anything I should look at to improve this right now?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org