Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2021/08/25 05:41:26 UTC

[GitHub] [hudi] aresa7796 opened a new issue #3533: [SUPPORT]How to use MOR Table to Merge small file?

aresa7796 opened a new issue #3533:
URL: https://github.com/apache/hudi/issues/3533


   Hi all,
   How can I use a MOR table to merge small files?
   
   
   
   **Environment Description**
   
   * Hudi version : 0.8.0
   
   * Spark version : 3.1.2
   
   * Hive version : 3.1.2
   
   * Hadoop version : 3.1.2
   
   * Storage (HDFS/S3/GCS..) : FS
   
   * Running on Docker? (yes/no) : no
   
   
   **My code**
   ```
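   import com.typesafe.config.ConfigFactory
   import org.apache.hudi.DataSourceWriteOptions
   import org.apache.hudi.QuickstartUtils.getQuickstartWriteConfigs
   import org.apache.hudi.config.{HoodieCompactionConfig, HoodieIndexConfig}
   import org.apache.hudi.index.HoodieIndex
   import org.apache.spark.SparkConf
   import org.apache.spark.sql.{SaveMode, SparkSession}

   object HudiExample {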
     def main(args: Array[String]): Unit = {
       val config = ConfigFactory.load()
   
       val conf = new SparkConf()
         .setMaster("local[*]")
         .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
         .set("spark.kryoserializer.buffer.max","512m")
         .setAppName("hudi-batch")
   
       val spark = SparkSession.builder().enableHiveSupport().config(conf).getOrCreate()
       spark.sparkContext.setLogLevel("error")
       val data = Seq("""{"timestamp":1628752653,"column1":"123","column2":"234"}""","""{"timestamp":1628752654,"column1":"2","column2":"2","type": 1}""")
       import spark.implicits._
       val ds = spark.createDataset(data)
       var df = spark.read.json(ds)
   
       df = df.selectExpr("uuid() as _track_id","*")
   
       df.write
         .format("org.apache.hudi")
         .options(getQuickstartWriteConfigs)
         .option(DataSourceWriteOptions.TABLE_TYPE_OPT_KEY,DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL)
         .option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY,DataSourceWriteOptions.MOR_STORAGE_TYPE_OPT_VAL)
         .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "timestamp")
         .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "_track_id")
         .option(HoodieIndexConfig.BLOOM_INDEX_UPDATE_PARTITION_PATH, "true")
         .option(HoodieCompactionConfig.PARQUET_SMALL_FILE_LIMIT_BYTES, 104857600)
         .option("hoodie.parquet.max.file.size", 134217728)
         .option(HoodieIndexConfig.INDEX_TYPE_PROP, HoodieIndex.IndexType.GLOBAL_BLOOM.name())
         .option("hoodie.table.name", "hudi_example")
         .mode(SaveMode.Append)
         .save("/opt/hudi_example")
   
     }
   }
   ```
   
   I executed this code many times, and multiple small files were generated.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] nsivabalan commented on issue #3533: [SUPPORT]How to use MOR Table to Merge small file?

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #3533:
URL: https://github.com/apache/hudi/issues/3533#issuecomment-993928489


   @aresa7796: Can you give us more info to help us debug this further?





[GitHub] [hudi] vinothchandar commented on issue #3533: [SUPPORT]How to use MOR Table to Merge small file?

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on issue #3533:
URL: https://github.com/apache/hudi/issues/3533#issuecomment-926278499


   @aresa7796 any more updates for us?





[GitHub] [hudi] liujinhui1994 commented on issue #3533: [SUPPORT]How to use MOR Table to Merge small file?

Posted by GitBox <gi...@apache.org>.
liujinhui1994 commented on issue #3533:
URL: https://github.com/apache/hudi/issues/3533#issuecomment-908831225


   I recommend that you try with 0.9. After you execute the write several times, clustering will be triggered.





[GitHub] [hudi] aresa7796 commented on issue #3533: [SUPPORT]How to use MOR Table to Merge small file?

Posted by GitBox <gi...@apache.org>.
aresa7796 commented on issue #3533:
URL: https://github.com/apache/hudi/issues/3533#issuecomment-906216326


   Hi @liujinhui1994, I used clustering and got the same result.
   
   ```
   df.write
         .format("org.apache.hudi")
         .options(getQuickstartWriteConfigs)
         .option(DataSourceWriteOptions.TABLE_TYPE_OPT_KEY,DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL)
         .option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY,DataSourceWriteOptions.MOR_STORAGE_TYPE_OPT_VAL)
         .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "timestamp")
         .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "_track_id")
         .option(HoodieIndexConfig.BLOOM_INDEX_UPDATE_PARTITION_PATH, "true")
         .option("hoodie.parquet.small.file.limit", "0")
         .option("hoodie.clustering.inline", "true")
         .option("hoodie.clustering.inline.max.commits", "4")
         .option("hoodie.clustering.plan.strategy.target.file.max.bytes", "1073741824")
         .option("hoodie.clustering.plan.strategy.small.file.limit", "629145600")
         .option(HoodieIndexConfig.INDEX_TYPE_PROP, HoodieIndex.IndexType.GLOBAL_BLOOM.name())
         .option("hoodie.table.name", "hudi_example")
         .mode(SaveMode.Append)
         .save("/opt/hudi_example")
   ```
   I executed this code 6 times.
   ```
   ├── .hoodie
   │   ├── .20210826162656.deltacommit.crc
   │   ├── .20210826162656.deltacommit.inflight.crc
   │   ├── .20210826162656.deltacommit.requested.crc
   │   ├── .20210826162754.deltacommit.crc
   │   ├── .20210826162754.deltacommit.inflight.crc
   │   ├── .20210826162754.deltacommit.requested.crc
   │   ├── .20210826162830.deltacommit.crc
   │   ├── .20210826162830.deltacommit.inflight.crc
   │   ├── .20210826162830.deltacommit.requested.crc
   │   ├── .20210826162859.deltacommit.crc
   │   ├── .20210826162859.deltacommit.inflight.crc
   │   ├── .20210826162859.deltacommit.requested.crc
   │   ├── .20210826162904.replacecommit.crc
   │   ├── .20210826162904.replacecommit.inflight.crc
   │   ├── .20210826162904.replacecommit.requested.crc
   │   ├── .20210826162935.deltacommit.crc
   │   ├── .20210826162935.deltacommit.inflight.crc
   │   ├── .20210826162935.deltacommit.requested.crc
   │   ├── .aux
   │   │   └── .bootstrap
   │   │       ├── .fileids
   │   │       └── .partitions
   │   ├── .hoodie.properties.crc
   │   ├── .temp
   │   │   └── 20210826162904
   │   │       └── default
   │   │           ├── .bb37db55-ebb9-4a07-977b-5fb0d4340193-0_0-44-48_20210826162904.parquet.marker.CREATE.crc
   │   │           └── bb37db55-ebb9-4a07-977b-5fb0d4340193-0_0-44-48_20210826162904.parquet.marker.CREATE
   │   ├── 20210826162656.deltacommit
   │   ├── 20210826162656.deltacommit.inflight
   │   ├── 20210826162656.deltacommit.requested
   │   ├── 20210826162754.deltacommit
   │   ├── 20210826162754.deltacommit.inflight
   │   ├── 20210826162754.deltacommit.requested
   │   ├── 20210826162830.deltacommit
   │   ├── 20210826162830.deltacommit.inflight
   │   ├── 20210826162830.deltacommit.requested
   │   ├── 20210826162859.deltacommit
   │   ├── 20210826162859.deltacommit.inflight
   │   ├── 20210826162859.deltacommit.requested
   │   ├── 20210826162904.replacecommit
   │   ├── 20210826162904.replacecommit.inflight
   │   ├── 20210826162904.replacecommit.requested
   │   ├── 20210826162935.deltacommit
   │   ├── 20210826162935.deltacommit.inflight
   │   ├── 20210826162935.deltacommit.requested
   │   ├── archived
   │   └── hoodie.properties
   └── default
       ├── ..hoodie_partition_metadata.crc
       ├── .3e5a8289-01bc-4769-9c6f-f2ae6c355420-0_0-30-36_20210826162859.parquet.crc
       ├── .5f4d5381-ea37-4eb1-b8c7-2717facd0a50-0_0-30-33_20210826162754.parquet.crc
       ├── .bb37db55-ebb9-4a07-977b-5fb0d4340193-0_0-44-48_20210826162904.parquet.crc
       ├── .c1e69535-55b2-4a3a-ad58-e9be0a999304-0_0-29-29_20210826162656.parquet.crc
       ├── .cd2bf248-92bc-426c-a568-245ee89a0a17-0_0-30-34_20210826162830.parquet.crc
       ├── .d8942dd4-9097-4bf4-b2ea-59eaab48af77-0_0-30-33_20210826162935.parquet.crc
       ├── .hoodie_partition_metadata
       ├── 3e5a8289-01bc-4769-9c6f-f2ae6c355420-0_0-30-36_20210826162859.parquet
       ├── 5f4d5381-ea37-4eb1-b8c7-2717facd0a50-0_0-30-33_20210826162754.parquet
       ├── bb37db55-ebb9-4a07-977b-5fb0d4340193-0_0-44-48_20210826162904.parquet
       ├── c1e69535-55b2-4a3a-ad58-e9be0a999304-0_0-29-29_20210826162656.parquet
       ├── cd2bf248-92bc-426c-a568-245ee89a0a17-0_0-30-34_20210826162830.parquet
       └── d8942dd4-9097-4bf4-b2ea-59eaab48af77-0_0-30-33_20210826162935.parquet
   
   ```
   





[GitHub] [hudi] liujinhui1994 commented on issue #3533: [SUPPORT]How to use MOR Table to Merge small file?

Posted by GitBox <gi...@apache.org>.
liujinhui1994 commented on issue #3533:
URL: https://github.com/apache/hudi/issues/3533#issuecomment-906951740


   Can you check whether the file sizes have improved?





[GitHub] [hudi] nsivabalan commented on issue #3533: [SUPPORT]How to use MOR Table to Merge small file?

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #3533:
URL: https://github.com/apache/hudi/issues/3533#issuecomment-928269532


   @aresa7796: we can dig in more if you can provide us with more info, such as file sizes. As of now, we can't debug much without that info. I'd appreciate it if you could respond with details.
   Recently I helped explain how small file handling works with MOR tables. There is a slight difference between COW and MOR in that respect. It's covered [here](https://github.com/apache/hudi/issues/3676). Please check it out; it might help your case too.





[GitHub] [hudi] aresa7796 edited a comment on issue #3533: [SUPPORT]How to use MOR Table to Merge small file?

Posted by GitBox <gi...@apache.org>.
aresa7796 edited a comment on issue #3533:
URL: https://github.com/apache/hudi/issues/3533#issuecomment-907033255


   Hi, @liujinhui1994 
   First execution.
   ```
   -rw-r--r--  1  staff   426K Aug 27 16:34 b295ace2-0de5-43ca-94fc-b398b75d552c-0_0-29-27_20210827163357.parquet
   ```
   Second execution
   
   ```
   -rw-r--r--  1  staff   426K Aug 27 16:34 b295ace2-0de5-43ca-94fc-b398b75d552c-0_0-29-27_20210827163357.parquet
   -rw-r--r--  1  staff   426K Aug 27 16:36 d2768e3b-3744-442b-be1a-07f3281ee91a-0_0-30-31_20210827163557.parquet
   ```
   
   Each run only appends a new parquet file.





[GitHub] [hudi] nsivabalan commented on issue #3533: [SUPPORT]How to use MOR Table to Merge small file?

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #3533:
URL: https://github.com/apache/hudi/issues/3533#issuecomment-1008552108


   @aresa7796: I will go ahead and close this due to inactivity. Feel free to reopen if need be. We will be happy to help.





[GitHub] [hudi] liujinhui1994 commented on issue #3533: [SUPPORT]How to use MOR Table to Merge small file?

Posted by GitBox <gi...@apache.org>.
liujinhui1994 commented on issue #3533:
URL: https://github.com/apache/hudi/issues/3533#issuecomment-906028752


   > PARQUET_SMALL_FILE_LIMIT_BYTES
   
   If it is an insert operation, you can try clustering.





[GitHub] [hudi] nsivabalan commented on issue #3533: [SUPPORT]How to use MOR Table to Merge small file?

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #3533:
URL: https://github.com/apache/hudi/issues/3533#issuecomment-1002182564


   @aresa7796: Let us know if you have any more updates for us in this regard. If you got it resolved, feel free to close out the GitHub issue.





[GitHub] [hudi] vinothchandar commented on issue #3533: [SUPPORT]How to use MOR Table to Merge small file?

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on issue #3533:
URL: https://github.com/apache/hudi/issues/3533#issuecomment-910401425


   @aresa7796 I do see that clustering is indeed triggered, but you are not seeing a bigger file? I think we have to look into the clustering plan. In your example above, you are comparing files written in the same commit `20210827163357`; I'm not sure if they actually belong to different executions.
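
   One way to confirm from the `.hoodie` listing posted earlier that a clustering run completed is to look for a `replacecommit` instant without an `.inflight` or `.requested` suffix. The following is a minimal sketch, assuming timeline file names follow the `<instant>.<action>[.<state>]` pattern seen in that listing. `TimelineSketch`, `TimelineEntry`, `parseTimelineFile` and `completedReplaceCommits` are hypothetical helper names for illustration, not Hudi APIs.

   ```scala
   // Sketch only: these helpers are hypothetical and not part of Hudi.
   object TimelineSketch {
     final case class TimelineEntry(instant: String, action: String, state: String)

     // Assumes names of the form <instant>.<action>[.<state>], as in the listing.
     // A name with no state suffix is treated as a completed instant.
     // Hidden files such as ".<name>.crc" and "hoodie.properties" are skipped.
     def parseTimelineFile(name: String): Option[TimelineEntry] =
       name.split('.') match {
         case Array(instant, action)
             if instant.nonEmpty && instant.forall(_.isDigit) =>
           Some(TimelineEntry(instant, action, "completed"))
         case Array(instant, action, state)
             if instant.nonEmpty && instant.forall(_.isDigit) =>
           Some(TimelineEntry(instant, action, state))
         case _ => None
       }

     // Instants of completed replacecommits. Clustering writes a replacecommit;
     // other replace-style operations may produce one too.
     def completedReplaceCommits(names: Seq[String]): Seq[String] =
       names
         .flatMap(n => parseTimelineFile(n))
         .collect { case TimelineEntry(i, "replacecommit", "completed") => i }
         .sorted
   }
   ```

   Applied to the file names in the listing, this would report `20210826162904` as the only completed `replacecommit`, which is the instant to inspect when looking at the clustering output.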





[GitHub] [hudi] nsivabalan commented on issue #3533: [SUPPORT]How to use MOR Table to Merge small file?

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #3533:
URL: https://github.com/apache/hudi/issues/3533#issuecomment-917432581


   Yes, can you list the files along with their sizes? Based on the logs you have provided, we are most likely interested in the files with commit times
   20210826162904 and 20210826162935.
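
   A listing like this can be produced with a few lines of Scala. This is a minimal sketch, assuming the table lives on a local file system and that base file names follow the `<fileId>_<writeToken>_<instant>.parquet` pattern visible in the listings in this thread. `BaseFileSketch`, `BaseFile`, `parseBaseFile` and `listByInstant` are hypothetical names, not Hudi APIs, and the default path is taken from the `save("/opt/hudi_example")` call and the `default` partition shown earlier.

   ```scala
   import java.io.File

   // Sketch only: these helpers are hypothetical and not part of Hudi.
   object BaseFileSketch {
     final case class BaseFile(fileId: String, writeToken: String, instant: String, sizeBytes: Long)

     // Assumes <fileId>_<writeToken>_<instant>.parquet; hidden files
     // (.crc checksums, .hoodie_partition_metadata) are skipped.
     def parseBaseFile(name: String, sizeBytes: Long): Option[BaseFile] =
       if (name.startsWith(".") || !name.endsWith(".parquet")) None
       else
         name.stripSuffix(".parquet").split('_') match {
           case Array(fileId, token, instant) => Some(BaseFile(fileId, token, instant, sizeBytes))
           case _                             => None
         }

     // Groups the base files of one partition directory by commit instant.
     def listByInstant(partitionDir: File): Map[String, Seq[BaseFile]] = {
       val files = Option(partitionDir.listFiles()).map(_.toSeq).getOrElse(Seq.empty)
       files
         .filter(_.isFile)
         .flatMap(f => parseBaseFile(f.getName, f.length()))
         .groupBy(_.instant)
     }

     def main(args: Array[String]): Unit = {
       val dir = new File(args.headOption.getOrElse("/opt/hudi_example/default"))
       listByInstant(dir).toSeq.sortBy(_._1).foreach { case (instant, fs) =>
         fs.foreach(f => println(s"$instant  ${f.sizeBytes} bytes  ${f.fileId}"))
       }
     }
   }
   ```

   Sorting by instant puts the file written by the `20210826162904` replacecommit next to the one from the following deltacommit, so their sizes can be compared directly.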





[GitHub] [hudi] nsivabalan closed issue #3533: [SUPPORT]How to use MOR Table to Merge small file?

Posted by GitBox <gi...@apache.org>.
nsivabalan closed issue #3533:
URL: https://github.com/apache/hudi/issues/3533


   





[GitHub] [hudi] vinothchandar commented on issue #3533: [SUPPORT]How to use MOR Table to Merge small file?

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on issue #3533:
URL: https://github.com/apache/hudi/issues/3533#issuecomment-910402168


   Can we look at the files produced by clustering? Hudi deduces this and ignores the smaller files that were previously written, for e.g.,




