Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/04/20 07:20:18 UTC

[GitHub] [hudi] stackls opened a new issue, #5371: [SUPPORT] Hudi Compaction 0.9

stackls opened a new issue, #5371:
URL: https://github.com/apache/hudi/issues/5371

   I have configured Hudi inline compaction so that compaction happens after each write. Help me understand: if I run compaction after every n delta commits instead, how would that benefit the runs? Would compaction costs be reduced if it were not inline?
   
   I am not clear on the advantages and disadvantages of each approach. Please suggest correct and efficient Hudi compaction configs for frequent updates on smaller/larger files.
   
   hudi configs:
           "hoodie.compact.inline": inline_compact,
           "hoodie.cleaner.commits.retained": 4,
           "hoodie.cleaner.fileversions.retained": 4,
           "hoodie.bulkinsert.shuffle.parallelism": 200
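   For comparison, here is a minimal sketch of the two inline-compaction modes being asked about (PySpark-style options dict; the table name and the commented writer call are hypothetical, option keys per the Hudi configuration docs):

   ```python
   # Sketch of Hudi writer options for a MERGE_ON_READ table.
   # Assumption: Hudi 0.9-era option keys; "my_table" is a placeholder.
   hudi_options = {
       "hoodie.table.name": "my_table",
       "hoodie.datasource.write.table.type": "MERGE_ON_READ",
       "hoodie.compact.inline": "true",
       # Run compaction only once every 5 delta commits rather than after
       # every write, amortizing the compaction cost across ingest runs:
       "hoodie.compact.inline.max.delta.commits": "5",
       "hoodie.cleaner.commits.retained": "4",
   }

   # Typical usage (requires a SparkSession and the hudi-spark bundle):
   # df.write.format("hudi").options(**hudi_options).mode("append").save(base_path)
   ```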


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] codope commented on issue #5371: [SUPPORT] Hudi Compaction

Posted by GitBox <gi...@apache.org>.
codope commented on issue #5371:
URL: https://github.com/apache/hudi/issues/5371#issuecomment-1103889198

   The high-level recommendation is to go for [async compaction](https://hudi.apache.org/docs/compaction#async-compaction) instead of inline compaction: if your workload is update-heavy, compacting inline adds to the ingestion latency.
   
   There are 3 ways in which async compaction can be triggered (details for each of them are in the link I shared):
   1. Using spark structured streaming
   2. Using deltastreamer continuous mode
   3. Using offline compactor utility (separate spark job) 
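   For option 3, the offline compactor runs as a standalone Spark job. A hedged sketch (jar location, paths, and parallelism are placeholders; the exact flags are described in the async-compaction doc linked above, so verify them against your Hudi version):

   ```shell
   # Run compaction for a MOR table with the HoodieCompactor utility
   # from the hudi-utilities bundle. All paths below are placeholders.
   spark-submit \
     --class org.apache.hudi.utilities.HoodieCompactor \
     /path/to/hudi-utilities-bundle.jar \
     --base-path s3://bucket/hudi/my_table \
     --table-name my_table \
     --schema-file /path/to/table_schema.avsc \
     --parallelism 100
   ```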
   
   Now, to set the right configs, we need to learn more about the workload. Essentially, we want to pick the right compaction strategy depending on whether your updates touch recent partitions or whether they are spread randomly across all partitions. Inline compaction is more useful in cases where you have a small amount of late-arriving data trickling into older partitions. Also check out this [FAQ](https://hudi.apache.org/learn/faq/#how-do-i-run-compaction-for-a-mor-dataset).
   
   Additionally, you could avoid creating lots of small files. See here for more details on small file handling: https://hudi.apache.org/learn/faq/#how-do-i-to-avoid-creating-tons-of-small-files
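   The small-file handling described in that link is driven by a pair of writer configs; a minimal sketch (the byte values are illustrative, keys per the Hudi configuration docs):

   ```python
   # During upserts, Hudi routes new inserts into existing base files that
   # fall below the small-file limit, growing them toward the max file size.
   # The values below are illustrative, in bytes.
   small_file_options = {
       # Treat base files under 100 MB as "small" and pack inserts into them:
       "hoodie.parquet.small.file.limit": str(100 * 1024 * 1024),
       # Target maximum size for base files:
       "hoodie.parquet.max.file.size": str(120 * 1024 * 1024),
   }
   ```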
   




[GitHub] [hudi] danny0405 commented on issue #5371: [SUPPORT] Hudi Compaction

Posted by GitBox <gi...@apache.org>.
danny0405 commented on issue #5371:
URL: https://github.com/apache/hudi/issues/5371#issuecomment-1110614113

   > I also have a question relating to async compaction. I found that the `org.apache.hudi.sink.compact.HoodieFlinkCompactor` job is a flink batch job, does this mean I have to run this compaction job periodically, at when and in what frequency?
   
   We now support service mode; you can give it a try.




[GitHub] [hudi] yihua commented on issue #5371: [SUPPORT] Hudi Compaction

Posted by GitBox <gi...@apache.org>.
yihua commented on issue #5371:
URL: https://github.com/apache/hudi/issues/5371#issuecomment-1113645116

   @stackls @zhqu1148980644 Do you guys have more questions?  Feel free to close the issue if all good.




[GitHub] [hudi] yihua commented on issue #5371: [SUPPORT] Hudi Compaction

Posted by GitBox <gi...@apache.org>.
yihua commented on issue #5371:
URL: https://github.com/apache/hudi/issues/5371#issuecomment-1110369841

   > Is there any specific hudi configs to achieve this or MOR table does take care by default ?
   
   Spark Structured Streaming and Deltastreamer continuous mode have async compaction enabled on MOR tables by default.  In other cases, you can schedule and execute inline compaction with `hoodie.compact.inline=true`. You may also run an independent compaction job as suggested by this [doc](https://hudi.apache.org/docs/compaction/#async-compaction).
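   As a concrete illustration of the streaming case, a hedged sketch of the writer options (the table name and the commented stream/checkpoint path are hypothetical; the async-compaction flag is shown explicitly even though it defaults on for MOR streaming writes):

   ```python
   # Options for a Spark Structured Streaming sink writing to a Hudi MOR table.
   streaming_options = {
       "hoodie.table.name": "my_table",                        # placeholder
       "hoodie.datasource.write.table.type": "MERGE_ON_READ",
       # Default for MOR streaming writes; made explicit for clarity:
       "hoodie.datasource.compaction.async.enable": "true",
   }

   # Typical usage (requires a SparkSession and a streaming DataFrame `df`):
   # (df.writeStream.format("hudi")
   #    .options(**streaming_options)
   #    .option("checkpointLocation", "/tmp/ckpt")   # placeholder path
   #    .outputMode("append")
   #    .start(base_path))
   ```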
   
   > I found that the org.apache.hudi.sink.compact.HoodieFlinkCompactor job is a flink batch job, does this mean I have to run this compaction job periodically, at when and in what frequency?

   @danny0405 should have a better idea on this.
   
   




[GitHub] [hudi] zhqu1148980644 commented on issue #5371: [SUPPORT] Hudi Compaction

Posted by GitBox <gi...@apache.org>.
zhqu1148980644 commented on issue #5371:
URL: https://github.com/apache/hudi/issues/5371#issuecomment-1109179429

   I also have a question relating to async compaction. I found that the `org.apache.hudi.sink.compact.HoodieFlinkCompactor` job is a flink batch job, does this mean I have to run this compaction job periodically, at when and in what frequency?




[GitHub] [hudi] nsivabalan closed issue #5371: [SUPPORT] Hudi Compaction

Posted by GitBox <gi...@apache.org>.
nsivabalan closed issue #5371: [SUPPORT] Hudi Compaction
URL: https://github.com/apache/hudi/issues/5371




[GitHub] [hudi] stackls commented on issue #5371: [SUPPORT] Hudi Compaction

Posted by GitBox <gi...@apache.org>.
stackls commented on issue #5371:
URL: https://github.com/apache/hudi/issues/5371#issuecomment-1103993086

   Thanks Codope for the reply. 
   
   https://hudi.apache.org/docs/next/performance/
   
   As per above link,
   For workloads with heavy updates, the [merge-on-read table](https://hudi.apache.org/docs/concepts#merge-on-read-table) provides a nice mechanism for ingesting quickly into smaller files and then later merging them into larger base files via compaction.
   
   Are there any specific Hudi configs to achieve this, or does a MOR table take care of it by default?
   

