Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2020/05/17 12:43:54 UTC

[GitHub] [incubator-hudi] nsivabalan commented on issue #1625: [SUPPORT] MOR upsert table grows in size when ingesting same records

nsivabalan commented on issue #1625:
URL: https://github.com/apache/incubator-hudi/issues/1625#issuecomment-629790932


   @bvaradar : I tried to reproduce this locally and couldn't. Is there any chance of data skew on your end?
   @rolandjohann : I couldn't reproduce the ever-growing Hudi table either; maybe I am missing something. Can you try the code below and let us know what you see?
   
   My initial insert (100k records) took 14 MB in Hudi.
   A single batch of updates (2k records) written directly as Parquet takes 165 KB on disk.
   
   Here are the total disk sizes after applying the same update batch repeatedly:
   
   | Round No | Total disk size (`du -s -h basePath`) |
   |----------|---------------------------------------|
   | 1        | 23 MB |
   | 2        | 24 MB |
   | 3        | 34 MB |
   | 4        | 35 MB |
   | 5        | 46 MB |
   | 6        | 46 MB |
   | 7        | 43 MB |
   | 8        | 44 MB |
   | 9        | 43 MB |
   | 10       | 44 MB |
   | 11       | 45 MB |
   | 12       | 46 MB |
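
   The size plateaus around 44-46 MB from round 5 onward instead of growing without bound, which is what I would expect once the cleaner (`hoodie.cleaner.commits.retained=3`) and inline compaction (`hoodie.compact.inline.max.delta.commits=2`) from the config below start reclaiming old file slices. One way to double-check that cleans and compactions actually ran is to list the timeline under `.hoodie`; here is a small sketch (my addition, assuming the basePath used below):

   ```
   // List the Hudi timeline: .clean and .commit (compaction) instants
   // alongside the .deltacommit files show that cleaning and compaction
   // kicked in between the upsert rounds.
   import java.io.File
   new File("/tmp/hudi_trips_mor/.hoodie").listFiles().
     map(_.getName).sorted.foreach(println)
   ```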
   
   
   Code to reproduce: 
   
   ```
   # Launch spark-shell with the Hudi bundle and the matching spark-avro package.
   spark-2.4.4-bin-hadoop2.7/bin/spark-shell \
     --packages org.apache.hudi:hudi-spark-bundle_2.11:0.5.1-incubating,org.apache.spark:spark-avro_2.11:2.4.4 \
     --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
   ```
   ```
   // Paste into the spark-shell launched above.
   import org.apache.hudi.QuickstartUtils._
   import scala.collection.JavaConversions._
   import org.apache.spark.sql.SaveMode._
   import org.apache.hudi.DataSourceReadOptions._
   import org.apache.hudi.DataSourceWriteOptions._
   import org.apache.hudi.config.HoodieWriteConfig._

   val tableName = "hudi_trips_mor"
   val basePath = "file:///tmp/hudi_trips_mor"
   val basePathParquet = "file:///tmp/parquet"
   val dataGen = new DataGenerator

   // Initial insert: 100k generated trip records into a MOR table (~14 MB on disk).
   val inserts = convertToStringList(dataGen.generateInserts(100000))
   val dfInsert = spark.read.json(spark.sparkContext.parallelize(inserts, 10))
   dfInsert.write.format("hudi").
     options(getQuickstartWriteConfigs).
     option(PRECOMBINE_FIELD_OPT_KEY, "ts").
     option(RECORDKEY_FIELD_OPT_KEY, "uuid").
     option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
     option(STORAGE_TYPE_OPT_KEY, "MERGE_ON_READ").
     option(TABLE_NAME, tableName).
     mode(Append).save(basePath)
   
   // One batch of 2k updates against the inserted records.
   val updates = convertToStringList(dataGen.generateUpdates(2000))
   val dfUpdates = spark.read.json(spark.sparkContext.parallelize(updates, 2))

   // Baseline: the same batch written directly as Parquet (~165 KB).
   dfUpdates.coalesce(1).write.format("parquet").mode(Append).save(basePathParquet)
   
   // Upsert the same batch into the MOR table; rerun this block once per round
   // from the table above.
   dfUpdates.coalesce(1).write.format("org.apache.hudi").
     option("hoodie.insert.shuffle.parallelism", "2").
     option("hoodie.upsert.shuffle.parallelism", "2").
     option("hoodie.cleaner.commits.retained", "3").
     option("hoodie.cleaner.fileversions.retained", "2").
     option("hoodie.compact.inline", "true").
     option("hoodie.compact.inline.max.delta.commits", "2").
     option(OPERATION_OPT_KEY, UPSERT_OPERATION_OPT_VAL).
     option(TABLE_TYPE_OPT_KEY, MOR_TABLE_TYPE_OPT_VAL).
     option(RECORDKEY_FIELD_OPT_KEY, "uuid").
     option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
     option(PRECOMBINE_FIELD_OPT_KEY, "ts").
     option(TABLE_NAME, tableName).
     mode(Append).save(basePath)
   ```
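
   If you want to automate the rounds instead of rerunning that last block by hand, something like the sketch below should work. It assumes the `dfUpdates`, imports, and option values from the block above are already in scope; the loop and the shell-out to `du` are an addition to illustrate how the sizes in the table were measured.

   ```
   // Sketch: repeat the same upsert and print the table size after every round.
   import scala.sys.process._

   (1 to 12).foreach { round =>
     dfUpdates.coalesce(1).write.format("org.apache.hudi").
       option("hoodie.insert.shuffle.parallelism", "2").
       option("hoodie.upsert.shuffle.parallelism", "2").
       option("hoodie.cleaner.commits.retained", "3").
       option("hoodie.cleaner.fileversions.retained", "2").
       option("hoodie.compact.inline", "true").
       option("hoodie.compact.inline.max.delta.commits", "2").
       option(OPERATION_OPT_KEY, UPSERT_OPERATION_OPT_VAL).
       option(TABLE_TYPE_OPT_KEY, MOR_TABLE_TYPE_OPT_VAL).
       option(RECORDKEY_FIELD_OPT_KEY, "uuid").
       option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
       option(PRECOMBINE_FIELD_OPT_KEY, "ts").
       option(TABLE_NAME, tableName).
       mode(Append).save(basePath)
     // Measure the local table directory (basePath without the file:// scheme).
     println(s"Round $round: " + Seq("du", "-s", "-h", "/tmp/hudi_trips_mor").!!.trim)
   }
   ```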
   
   
   
   

