Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2020/10/14 16:39:27 UTC
[GitHub] [hudi] KarthickAN commented on issue #2066: [SUPPORT] Hudi is increasing the storage size big time
KarthickAN commented on issue #2066:
URL: https://github.com/apache/hudi/issues/2066#issuecomment-708521865
@bvaradar We have two types of data for which we use Hudi. Both are quite similar, with some differences in the schema. I have almost completed development. For the data type with the lower volume of data, the size difference is huge.
Say I have the following number of objects in JSON Lines format:
4326 Objects - 580.4 MB
When transformed to Hudi with snappy compression enabled, it comes to
4599 Objects - 7.2 GB
which is really huge. For the other type, where we have a higher volume of data, I don't see this issue. It looks like this:
JSON Lines
6895 Objects - 100.0 GB
Parquet with snappy compression via Hudi
10597 Objects - 42.4 GB
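To make the contrast concrete, here is the simple arithmetic on the numbers above (sizes taken straight from this comment; units assumed to be binary MB/GB, which barely affects the ratios):

```python
# Size inflation/deflation ratios for the two datasets described above.
MB = 1.0
GB = 1024.0  # 1 GB expressed in MB

small_json = 580.4 * MB   # 4326 objects, JSON Lines
small_hudi = 7.2 * GB     # 4599 objects, Hudi + snappy Parquet
large_json = 100.0 * GB
large_hudi = 42.4 * GB

small_ratio = small_hudi / small_json  # > 1 means the Hudi output grew
large_ratio = large_hudi / large_json  # < 1 means it shrank, as expected

print(round(small_ratio, 1))  # roughly a 12.7x blow-up for the small dataset
print(round(large_ratio, 2))  # roughly 0.42x (a healthy shrink) for the large one
```

So the small dataset inflates by an order of magnitude while the large one compresses normally, which is what makes the behaviour look configuration-dependent rather than inherent to the data.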
The following are the configs I am using right now:
SmallFileSize = 104857600
MaxFileSize = 125829120
RecordSize = 35
CompressionRatio = 5
InsertSplitSize = 3500000
IndexBloomNumEntries = 1500000
KeyGenClass = org.apache.hudi.keygen.ComplexKeyGenerator
RecordKeyFields = sourceid,sourceassetid,sourceeventid,value,timestamp
TableType = COPY_ON_WRITE
PartitionPathFields = date,sourceid
HiveStylePartitioning = True
WriteOperation = insert
CompressionCodec = snappy
CommitsRetained = 1
CombineBeforeInsert = True
PrecombineField = timestamp
InsertDropDuplicates = True
InsertShuffleParallelism = 100
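For reference, the shorthand names above are my own labels; a sketch of how they map onto Hudi's Spark datasource option keys (key names taken from Hudi's configuration docs; values copied from the list above) would look like this:

```python
# Assumed mapping from the shorthand config names above to Hudi option keys.
hudi_options = {
    "hoodie.parquet.small.file.limit": "104857600",     # SmallFileSize
    "hoodie.parquet.max.file.size": "125829120",        # MaxFileSize
    "hoodie.copyonwrite.record.size.estimate": "35",    # RecordSize
    "hoodie.parquet.compression.ratio": "5",            # CompressionRatio
    "hoodie.copyonwrite.insert.split.size": "3500000",  # InsertSplitSize
    "hoodie.index.bloom.num_entries": "1500000",        # IndexBloomNumEntries
    "hoodie.datasource.write.keygenerator.class":
        "org.apache.hudi.keygen.ComplexKeyGenerator",
    "hoodie.datasource.write.recordkey.field":
        "sourceid,sourceassetid,sourceeventid,value,timestamp",
    "hoodie.datasource.write.table.type": "COPY_ON_WRITE",
    "hoodie.datasource.write.partitionpath.field": "date,sourceid",
    "hoodie.datasource.write.hive_style_partitioning": "true",
    "hoodie.datasource.write.operation": "insert",
    "hoodie.parquet.compression.codec": "snappy",
    "hoodie.cleaner.commits.retained": "1",
    "hoodie.combine.before.insert": "true",
    "hoodie.datasource.write.precombine.field": "timestamp",
    "hoodie.datasource.write.insert.drop.duplicates": "true",
    "hoodie.insert.shuffle.parallelism": "100",
}

# Typical usage (a Spark session and a DataFrame `df` are assumed to exist):
# df.write.format("hudi").options(**hudi_options).mode("append").save(base_path)
```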
Is there anything I should look at to improve this right now?