Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/04/12 16:36:58 UTC

[GitHub] [hudi] bhasudha opened a new pull request, #5304: [DOCS] Add faq for async compaction options

bhasudha opened a new pull request, #5304:
URL: https://github.com/apache/hudi/pull/5304

   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contribute/how-to-contribute before opening a pull request.*
   
   ## What is the purpose of the pull request
   This pull request adds a FAQ entry describing the asynchronous compaction options available for MOR tables.
   
   ## Brief change log
   
     - *Added a FAQ entry on asynchronous compaction options to the website `faq.md`*
     - *Clarified the `--disable-compaction` flag description in `website/docs/compaction.md`*
   
   ## Verify this pull request
   
   This pull request is a documentation-only change without any test coverage.
   
   ## Committer checklist
   
 - [ ] Has a corresponding JIRA in PR title & commit
 - [x] Commit message is descriptive of the change
 - [ ] CI is green
 - [ ] Necessary doc changes done or have another open PR
 - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] bhasudha commented on a diff in pull request #5304: [DOCS] Add faq for async compaction options

Posted by GitBox <gi...@apache.org>.
bhasudha commented on code in PR #5304:
URL: https://github.com/apache/hudi/pull/5304#discussion_r848645865


##########
website/docs/faq.md:
##########
@@ -253,6 +253,24 @@ Simplest way to run compaction on MOR dataset is to run the [compaction inline](
 
 That said, for obvious reasons of not blocking ingesting for compaction, you may want to run it asynchronously as well. This can be done either via a separate [compaction job](https://github.com/apache/hudi/blob/master/hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieCompactor.java) that is scheduled by your workflow scheduler/notebook independently. If you are using delta streamer, then you can run in [continuous mode](https://github.com/apache/hudi/blob/d3edac4612bde2fa9deca9536801dbc48961fb95/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/HoodieDeltaStreamer.java#L241) where the ingestion and compaction are both managed concurrently in a single spark run time.
 
+### What options do I have for asynchronous compactions on MOR dataset?
+
+There are a couple of options depending on how you write to Hudi. But first let us understand briefly what is involved. There are two parts to compaction
+- Scheduling: In this step, Hudi scans the partitions and selects file slices to be compacted. A compaction plan is finally written to Hudi timeline. Scheduling needs tighter coordination with other writers (regular ingestion is considered one of the writers). If scheduling is done inline with the ingestion job, this coordination is automatically taken care of. Else when scheduling happens asynchronously a lock provider needs to be configured for this coordination among multiple writers.
+- Execution: A separate process reads the compaction plan and performs compaction of file slices. Execution doesnt need the same level of coordination with other writers as Scheduling step and can be decoupled from ingestion job easily.
+
+Depending on how you write to Hudi these are the possible options currently.
+- DeltaStreamer:
+  - In Continuous mode asynchronous compaction is achieved by default. Here scheduling is done by the ingestion job inline and compaction execution is achieved asynchronously by a separate parallel thread.
+- Spark datasource:
+  - Async scheduling and async execution can be achieved by periodically running Hudi Compactor Utility or Hudi CLI. However this needs a lock provider to be configured.
+  - Alternately to avoid dependency on lock providers, scheduling alone can be done inline by regular writer using the config `hoodie.compact.schedule.inline` . And compaction execution can be done asynchronously by periodically triggering the Hudi Compactor Utility or Hudi CLI.
+- Spark structured streaming:
+  - Compactions are scheduled and executed asynchronously inside the streaming job. Async Compactions are enabled by default for structured streaming jobs on Merge-On-Read table.
+- Flink:
+  - TODO

Review Comment:
   @danny0405 Can you please help fill in what options are possible for async compaction in Flink today? Based on that I can add the description here.





[GitHub] [hudi] nsivabalan merged pull request #5304: [DOCS] Add faq for async compaction options

Posted by GitBox <gi...@apache.org>.
nsivabalan merged PR #5304:
URL: https://github.com/apache/hudi/pull/5304




[GitHub] [hudi] danny0405 commented on a diff in pull request #5304: [DOCS] Add faq for async compaction options

Posted by GitBox <gi...@apache.org>.
danny0405 commented on code in PR #5304:
URL: https://github.com/apache/hudi/pull/5304#discussion_r849028469


##########
website/docs/compaction.md:
##########
@@ -74,7 +74,7 @@ To improve ingestion latency, Async Compaction is the default configuration.
 If immediate read performance of a new commit is important for you, or you want simplicity of not managing separate compaction jobs,
 you may want Synchronous compaction, which means that as a commit is written it is also compacted by the same job.
 
-Compaction is run synchronously by passing the flag "--disable-compaction" (Meaning to disable async compaction scheduling).
+Compaction is run synchronously by passing the flag "--disable-compaction" (Meaning to disable async compaction - disable both scheduling & execution).
 When both ingestion and compaction is running in the same spark context, you can use resource allocation configuration 

Review Comment:
   can we just name the param `--disable-async-compaction` to avoid confusion?





[GitHub] [hudi] bhasudha commented on a diff in pull request #5304: [DOCS] Add faq for async compaction options

Posted by GitBox <gi...@apache.org>.
bhasudha commented on code in PR #5304:
URL: https://github.com/apache/hudi/pull/5304#discussion_r859184704


##########
website/learn/faq.md:
##########
@@ -253,6 +253,25 @@ Simplest way to run compaction on MOR dataset is to run the [compaction inline](
 
 That said, for obvious reasons of not blocking ingesting for compaction, you may want to run it asynchronously as well. This can be done either via a separate [compaction job](https://github.com/apache/hudi/blob/master/hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieCompactor.java) that is scheduled by your workflow scheduler/notebook independently. If you are using delta streamer, then you can run in [continuous mode](https://github.com/apache/hudi/blob/d3edac4612bde2fa9deca9536801dbc48961fb95/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/HoodieDeltaStreamer.java#L241) where the ingestion and compaction are both managed concurrently in a single spark run time.
 
+### What options do I have for asynchronous/offline compactions on MOR dataset?
+
+There are a couple of options depending on how you write to Hudi. But first let us understand briefly what is involved. There are two parts to compaction
+- Scheduling: In this step, Hudi scans the partitions and selects file slices to be compacted. A compaction plan is finally written to Hudi timeline. Scheduling needs tighter coordination with other writers (regular ingestion is considered one of the writers). If scheduling is done inline with the ingestion job, this coordination is automatically taken care of. Else when scheduling happens asynchronously a lock provider needs to be configured for this coordination among multiple writers.
+- Execution: A separate process reads the compaction plan and performs compaction of file slices. Execution doesnt need the same level of coordination with other writers as Scheduling step and can be decoupled from ingestion job easily.
+
+Depending on how you write to Hudi these are the possible options currently.
+- DeltaStreamer:
+   - In Continuous mode asynchronous compaction is achieved by default. Here scheduling is done by the ingestion job inline and compaction execution is achieved asynchronously by a separate parallel thread.

Review Comment:
   You mean disabling async compaction can be done via `--disable-compaction` correct ? 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] danny0405 commented on a diff in pull request #5304: [DOCS] Add faq for async compaction options

Posted by GitBox <gi...@apache.org>.
danny0405 commented on code in PR #5304:
URL: https://github.com/apache/hudi/pull/5304#discussion_r849030000


##########
website/docs/faq.md:
##########
@@ -253,6 +253,24 @@ Simplest way to run compaction on MOR dataset is to run the [compaction inline](
 
 That said, for obvious reasons of not blocking ingesting for compaction, you may want to run it asynchronously as well. This can be done either via a separate [compaction job](https://github.com/apache/hudi/blob/master/hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieCompactor.java) that is scheduled by your workflow scheduler/notebook independently. If you are using delta streamer, then you can run in [continuous mode](https://github.com/apache/hudi/blob/d3edac4612bde2fa9deca9536801dbc48961fb95/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/HoodieDeltaStreamer.java#L241) where the ingestion and compaction are both managed concurrently in a single spark run time.
 
+### What options do I have for asynchronous compactions on MOR dataset?
+
+There are a couple of options depending on how you write to Hudi. But first let us understand briefly what is involved. There are two parts to compaction
+- Scheduling: In this step, Hudi scans the partitions and selects file slices to be compacted. A compaction plan is finally written to Hudi timeline. Scheduling needs tighter coordination with other writers (regular ingestion is considered one of the writers). If scheduling is done inline with the ingestion job, this coordination is automatically taken care of. Else when scheduling happens asynchronously a lock provider needs to be configured for this coordination among multiple writers.
+- Execution: A separate process reads the compaction plan and performs compaction of file slices. Execution doesnt need the same level of coordination with other writers as Scheduling step and can be decoupled from ingestion job easily.
+
+Depending on how you write to Hudi these are the possible options currently.
+- DeltaStreamer:
+  - In Continuous mode asynchronous compaction is achieved by default. Here scheduling is done by the ingestion job inline and compaction execution is achieved asynchronously by a separate parallel thread.
+- Spark datasource:
+  - Async scheduling and async execution can be achieved by periodically running Hudi Compactor Utility or Hudi CLI. However this needs a lock provider to be configured.
+  - Alternately to avoid dependency on lock providers, scheduling alone can be done inline by regular writer using the config `hoodie.compact.schedule.inline` . And compaction execution can be done asynchronously by periodically triggering the Hudi Compactor Utility or Hudi CLI.
+- Spark structured streaming:
+  - Compactions are scheduled and executed asynchronously inside the streaming job. Async Compactions are enabled by default for structured streaming jobs on Merge-On-Read table.
+- Flink:
+  - TODO

Review Comment:
   `compaction.schedule.enabled`: Schedule the compaction plan, enabled by default for MOR
   `compaction.async.enabled`: Async Compaction, enabled by default for MOR
   `compaction.tasks`: Parallelism of tasks that do actual compaction, default is 4
   `compaction.trigger.strategy`: Strategy to trigger compaction, options are 'num_commits': trigger compaction when reach N delta commits; 'time_elapsed': trigger compaction when time elapsed > N seconds since last compaction; 'num_and_time': trigger compaction when both NUM_COMMITS and TIME_ELAPSED are satisfied; 'num_or_time': trigger compaction when NUM_COMMITS or TIME_ELAPSED is satisfied. Default is 'num_commits'
   `compaction.delta_commits`: Max delta commits needed to trigger compaction, default 5 commits
   `compaction.delta_seconds`: Max delta seconds time needed to trigger compaction, default 1 hour
   `compaction.timeout.seconds`: Max timeout time in seconds for online compaction to rollback, default 20 minutes
   `compaction.max_memory`: Max memory in MB for compaction spillable map, default 100MB
   `compaction.target_io`: Target IO per compaction (both read and write), default 500 GB
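
   For illustration, a minimal PyFlink sketch of where these options go when declaring a Hudi MOR table (the table name, path, and schema are placeholders, not from this PR):

   ```python
   # Minimal sketch: a Hudi MOR table with the compaction options listed above.
   # Assumes PyFlink with the Hudi Flink bundle on the classpath; the table
   # name, path, and schema are placeholders.
   from pyflink.table import EnvironmentSettings, TableEnvironment

   t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

   # compaction.schedule.enabled and compaction.async.enabled already default
   # to true for MOR; they are spelled out here only to show where the knobs live.
   t_env.execute_sql("""
       CREATE TABLE hudi_mor_demo (
           uuid STRING PRIMARY KEY NOT ENFORCED,
           name STRING,
           ts TIMESTAMP(3)
       ) WITH (
           'connector' = 'hudi',
           'path' = 'file:///tmp/hudi_mor_demo',
           'table.type' = 'MERGE_ON_READ',
           'compaction.schedule.enabled' = 'true',
           'compaction.async.enabled' = 'true',
           'compaction.trigger.strategy' = 'num_commits',
           'compaction.delta_commits' = '5'
       )
   """)
   ```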





[GitHub] [hudi] nsivabalan commented on a diff in pull request #5304: [DOCS] Add faq for async compaction options

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on code in PR #5304:
URL: https://github.com/apache/hudi/pull/5304#discussion_r858192316


##########
website/learn/faq.md:
##########
@@ -253,6 +253,25 @@ Simplest way to run compaction on MOR dataset is to run the [compaction inline](
 
 That said, for obvious reasons of not blocking ingesting for compaction, you may want to run it asynchronously as well. This can be done either via a separate [compaction job](https://github.com/apache/hudi/blob/master/hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieCompactor.java) that is scheduled by your workflow scheduler/notebook independently. If you are using delta streamer, then you can run in [continuous mode](https://github.com/apache/hudi/blob/d3edac4612bde2fa9deca9536801dbc48961fb95/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/HoodieDeltaStreamer.java#L241) where the ingestion and compaction are both managed concurrently in a single spark run time.
 
+### What options do I have for asynchronous/offline compactions on MOR dataset?
+
+There are a couple of options depending on how you write to Hudi. But first let us understand briefly what is involved. There are two parts to compaction
+- Scheduling: In this step, Hudi scans the partitions and selects file slices to be compacted. A compaction plan is finally written to Hudi timeline. Scheduling needs tighter coordination with other writers (regular ingestion is considered one of the writers). If scheduling is done inline with the ingestion job, this coordination is automatically taken care of. Else when scheduling happens asynchronously a lock provider needs to be configured for this coordination among multiple writers.
+- Execution: A separate process reads the compaction plan and performs compaction of file slices. Execution doesnt need the same level of coordination with other writers as Scheduling step and can be decoupled from ingestion job easily.
+
+Depending on how you write to Hudi these are the possible options currently.
+- DeltaStreamer:
+   - In Continuous mode asynchronous compaction is achieved by default. Here scheduling is done by the ingestion job inline and compaction execution is achieved asynchronously by a separate parallel thread.
+- Spark datasource:
+   - Async scheduling and async execution can be achieved by periodically running an offline Hudi Compactor Utility or Hudi CLI. However this needs a lock provider to be configured.
+   - Alternately to avoid dependency on lock providers, scheduling alone can be done inline by regular writer using the config `hoodie.compact.schedule.inline` . And compaction execution can be done offline by periodically triggering the Hudi Compactor Utility or Hudi CLI.

Review Comment:
   can we add a line that this is available from 0.11?
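
   For illustration, a minimal PySpark sketch of this split, where the regular writer only schedules compaction inline and execution happens in a separately triggered job; the table name, fields, and path are placeholders, and `df` is assumed to be an existing DataFrame:

   ```python
   # Minimal sketch (assumes Spark with the Hudi bundle on the classpath;
   # names and paths are placeholders). The writer schedules compaction plans
   # inline via hoodie.compact.schedule.inline but does not execute them, so
   # no lock provider is needed; execution is left to a periodically triggered
   # Hudi Compactor utility or Hudi CLI run.
   (df.write.format("hudi")
       .option("hoodie.table.name", "my_mor_table")
       .option("hoodie.datasource.write.table.type", "MERGE_ON_READ")
       .option("hoodie.datasource.write.recordkey.field", "uuid")
       .option("hoodie.datasource.write.precombine.field", "ts")
       .option("hoodie.compact.inline", "false")          # do not execute inline
       .option("hoodie.compact.schedule.inline", "true")  # only write the plan
       .mode("append")
       .save("/tmp/my_mor_table"))
   ```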



##########
website/learn/faq.md:
##########
@@ -253,6 +253,25 @@ Simplest way to run compaction on MOR dataset is to run the [compaction inline](
 
 That said, for obvious reasons of not blocking ingesting for compaction, you may want to run it asynchronously as well. This can be done either via a separate [compaction job](https://github.com/apache/hudi/blob/master/hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieCompactor.java) that is scheduled by your workflow scheduler/notebook independently. If you are using delta streamer, then you can run in [continuous mode](https://github.com/apache/hudi/blob/d3edac4612bde2fa9deca9536801dbc48961fb95/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/HoodieDeltaStreamer.java#L241) where the ingestion and compaction are both managed concurrently in a single spark run time.
 
+### What options do I have for asynchronous/offline compactions on MOR dataset?
+
+There are a couple of options depending on how you write to Hudi. But first let us understand briefly what is involved. There are two parts to compaction
+- Scheduling: In this step, Hudi scans the partitions and selects file slices to be compacted. A compaction plan is finally written to Hudi timeline. Scheduling needs tighter coordination with other writers (regular ingestion is considered one of the writers). If scheduling is done inline with the ingestion job, this coordination is automatically taken care of. Else when scheduling happens asynchronously a lock provider needs to be configured for this coordination among multiple writers.
+- Execution: A separate process reads the compaction plan and performs compaction of file slices. Execution doesnt need the same level of coordination with other writers as Scheduling step and can be decoupled from ingestion job easily.
+
+Depending on how you write to Hudi these are the possible options currently.
+- DeltaStreamer:
+   - In Continuous mode asynchronous compaction is achieved by default. Here scheduling is done by the ingestion job inline and compaction execution is achieved asynchronously by a separate parallel thread.
+- Spark datasource:
+   - Async scheduling and async execution can be achieved by periodically running an offline Hudi Compactor Utility or Hudi CLI. However this needs a lock provider to be configured.

Review Comment:
   can we add that this was the case until 0.10?



##########
website/learn/faq.md:
##########
@@ -253,6 +253,25 @@ Simplest way to run compaction on MOR dataset is to run the [compaction inline](
 
 That said, for obvious reasons of not blocking ingesting for compaction, you may want to run it asynchronously as well. This can be done either via a separate [compaction job](https://github.com/apache/hudi/blob/master/hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieCompactor.java) that is scheduled by your workflow scheduler/notebook independently. If you are using delta streamer, then you can run in [continuous mode](https://github.com/apache/hudi/blob/d3edac4612bde2fa9deca9536801dbc48961fb95/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/HoodieDeltaStreamer.java#L241) where the ingestion and compaction are both managed concurrently in a single spark run time.
 
+### What options do I have for asynchronous/offline compactions on MOR dataset?
+
+There are a couple of options depending on how you write to Hudi. But first let us understand briefly what is involved. There are two parts to compaction
+- Scheduling: In this step, Hudi scans the partitions and selects file slices to be compacted. A compaction plan is finally written to Hudi timeline. Scheduling needs tighter coordination with other writers (regular ingestion is considered one of the writers). If scheduling is done inline with the ingestion job, this coordination is automatically taken care of. Else when scheduling happens asynchronously a lock provider needs to be configured for this coordination among multiple writers.
+- Execution: A separate process reads the compaction plan and performs compaction of file slices. Execution doesnt need the same level of coordination with other writers as Scheduling step and can be decoupled from ingestion job easily.

Review Comment:
   "A seperate process" -> don't really need to a separate process. can we re-word this a bit



##########
website/learn/faq.md:
##########
@@ -253,6 +253,25 @@ Simplest way to run compaction on MOR dataset is to run the [compaction inline](
 
 That said, for obvious reasons of not blocking ingesting for compaction, you may want to run it asynchronously as well. This can be done either via a separate [compaction job](https://github.com/apache/hudi/blob/master/hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieCompactor.java) that is scheduled by your workflow scheduler/notebook independently. If you are using delta streamer, then you can run in [continuous mode](https://github.com/apache/hudi/blob/d3edac4612bde2fa9deca9536801dbc48961fb95/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/HoodieDeltaStreamer.java#L241) where the ingestion and compaction are both managed concurrently in a single spark run time.
 
+### What options do I have for asynchronous/offline compactions on MOR dataset?
+
+There are a couple of options depending on how you write to Hudi. But first let us understand briefly what is involved. There are two parts to compaction
+- Scheduling: In this step, Hudi scans the partitions and selects file slices to be compacted. A compaction plan is finally written to Hudi timeline. Scheduling needs tighter coordination with other writers (regular ingestion is considered one of the writers). If scheduling is done inline with the ingestion job, this coordination is automatically taken care of. Else when scheduling happens asynchronously a lock provider needs to be configured for this coordination among multiple writers.
+- Execution: A separate process reads the compaction plan and performs compaction of file slices. Execution doesnt need the same level of coordination with other writers as Scheduling step and can be decoupled from ingestion job easily.
+
+Depending on how you write to Hudi these are the possible options currently.
+- DeltaStreamer:
+   - In Continuous mode asynchronous compaction is achieved by default. Here scheduling is done by the ingestion job inline and compaction execution is achieved asynchronously by a separate parallel thread.
+- Spark datasource:
+   - Async scheduling and async execution can be achieved by periodically running an offline Hudi Compactor Utility or Hudi CLI. However this needs a lock provider to be configured.
+   - Alternately to avoid dependency on lock providers, scheduling alone can be done inline by regular writer using the config `hoodie.compact.schedule.inline` . And compaction execution can be done offline by periodically triggering the Hudi Compactor Utility or Hudi CLI.
+- Spark structured streaming:
+   - Compactions are scheduled and executed asynchronously inside the streaming job. Async Compactions are enabled by default for structured streaming jobs on Merge-On-Read table.

Review Comment:
   please add that it's not possible to disable async compaction for MOR w/ spark structured streaming
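
   To make the default behavior concrete, a minimal PySpark structured streaming sketch (the stream source, fields, and paths are placeholders; `stream_df` is assumed to be an existing streaming DataFrame):

   ```python
   # Minimal sketch (assumes Spark structured streaming with the Hudi bundle;
   # names and paths are placeholders). For MOR tables, compaction is scheduled
   # and executed asynchronously inside this streaming job by default.
   (stream_df.writeStream
       .format("hudi")
       .option("hoodie.table.name", "my_mor_stream_table")
       .option("hoodie.datasource.write.table.type", "MERGE_ON_READ")
       .option("hoodie.datasource.write.recordkey.field", "uuid")
       .option("hoodie.datasource.write.precombine.field", "ts")
       .option("checkpointLocation", "/tmp/checkpoints/my_mor_stream_table")
       .outputMode("append")
       .start("/tmp/my_mor_stream_table"))
   ```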



##########
website/learn/faq.md:
##########
@@ -253,6 +253,25 @@ Simplest way to run compaction on MOR dataset is to run the [compaction inline](
 
 That said, for obvious reasons of not blocking ingesting for compaction, you may want to run it asynchronously as well. This can be done either via a separate [compaction job](https://github.com/apache/hudi/blob/master/hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieCompactor.java) that is scheduled by your workflow scheduler/notebook independently. If you are using delta streamer, then you can run in [continuous mode](https://github.com/apache/hudi/blob/d3edac4612bde2fa9deca9536801dbc48961fb95/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/HoodieDeltaStreamer.java#L241) where the ingestion and compaction are both managed concurrently in a single spark run time.
 
+### What options do I have for asynchronous/offline compactions on MOR dataset?
+
+There are a couple of options depending on how you write to Hudi. But first let us understand briefly what is involved. There are two parts to compaction
+- Scheduling: In this step, Hudi scans the partitions and selects file slices to be compacted. A compaction plan is finally written to Hudi timeline. Scheduling needs tighter coordination with other writers (regular ingestion is considered one of the writers). If scheduling is done inline with the ingestion job, this coordination is automatically taken care of. Else when scheduling happens asynchronously a lock provider needs to be configured for this coordination among multiple writers.
+- Execution: A separate process reads the compaction plan and performs compaction of file slices. Execution doesnt need the same level of coordination with other writers as Scheduling step and can be decoupled from ingestion job easily.
+
+Depending on how you write to Hudi these are the possible options currently.
+- DeltaStreamer:
+   - In Continuous mode asynchronous compaction is achieved by default. Here scheduling is done by the ingestion job inline and compaction execution is achieved asynchronously by a separate parallel thread.

Review Comment:
   "," after Continuous mode. 
   i.e. 
   ```
   In Continuous mode, asynchronous
   ```



##########
website/learn/faq.md:
##########
@@ -253,6 +253,25 @@ Simplest way to run compaction on MOR dataset is to run the [compaction inline](
 
 That said, for obvious reasons of not blocking ingesting for compaction, you may want to run it asynchronously as well. This can be done either via a separate [compaction job](https://github.com/apache/hudi/blob/master/hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieCompactor.java) that is scheduled by your workflow scheduler/notebook independently. If you are using delta streamer, then you can run in [continuous mode](https://github.com/apache/hudi/blob/d3edac4612bde2fa9deca9536801dbc48961fb95/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/HoodieDeltaStreamer.java#L241) where the ingestion and compaction are both managed concurrently in a single spark run time.
 
+### What options do I have for asynchronous/offline compactions on MOR dataset?
+
+There are a couple of options depending on how you write to Hudi. But first let us understand briefly what is involved. There are two parts to compaction
+- Scheduling: In this step, Hudi scans the partitions and selects file slices to be compacted. A compaction plan is finally written to Hudi timeline. Scheduling needs tighter coordination with other writers (regular ingestion is considered one of the writers). If scheduling is done inline with the ingestion job, this coordination is automatically taken care of. Else when scheduling happens asynchronously a lock provider needs to be configured for this coordination among multiple writers.
+- Execution: A separate process reads the compaction plan and performs compaction of file slices. Execution doesnt need the same level of coordination with other writers as Scheduling step and can be decoupled from ingestion job easily.
+
+Depending on how you write to Hudi these are the possible options currently.
+- DeltaStreamer:
+   - In Continuous mode asynchronous compaction is achieved by default. Here scheduling is done by the ingestion job inline and compaction execution is achieved asynchronously by a separate parallel thread.

Review Comment:
   we can also add a line that if users wish to disable compaction, they can do so with the `--disable-compaction` config in delta streamer.
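
   For illustration, a hedged sketch of the launch command in question, wrapped in Python; the bundle jar, paths, and props file are placeholders, while the class name and flags come from the DeltaStreamer utility referenced in this thread:

   ```python
   # Hedged sketch: launching DeltaStreamer in continuous mode via spark-submit.
   # Placeholders: the bundle jar, base path, table name, and props file.
   import subprocess

   subprocess.run([
       "spark-submit",
       "--class", "org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer",
       "hudi-utilities-bundle.jar",                  # placeholder jar
       "--table-type", "MERGE_ON_READ",
       "--target-base-path", "/tmp/my_mor_table",    # placeholder path
       "--target-table", "my_mor_table",
       "--props", "dfs-source.properties",           # placeholder props file
       "--continuous",  # ingestion and async compaction managed in one job
       # add "--disable-compaction" here to turn async compaction off entirely
   ], check=True)
   ```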





[GitHub] [hudi] bhasudha commented on a diff in pull request #5304: [DOCS] Add faq for async compaction options

Posted by GitBox <gi...@apache.org>.
bhasudha commented on code in PR #5304:
URL: https://github.com/apache/hudi/pull/5304#discussion_r849755286


##########
website/docs/compaction.md:
##########
@@ -74,7 +74,7 @@ To improve ingestion latency, Async Compaction is the default configuration.
 If immediate read performance of a new commit is important for you, or you want simplicity of not managing separate compaction jobs,
 you may want Synchronous compaction, which means that as a commit is written it is also compacted by the same job.
 
-Compaction is run synchronously by passing the flag "--disable-compaction" (Meaning to disable async compaction scheduling).
+Compaction is run synchronously by passing the flag "--disable-compaction" (Meaning to disable async compaction - disable both scheduling & execution).
 When both ingestion and compaction is running in the same spark context, you can use resource allocation configuration 

Review Comment:
   +1 on that. I'll file a separate JIRA for that and keep this PR decoupled for now.





[GitHub] [hudi] bhasudha commented on a diff in pull request #5304: [DOCS] Add faq for async compaction options

Posted by GitBox <gi...@apache.org>.
bhasudha commented on code in PR #5304:
URL: https://github.com/apache/hudi/pull/5304#discussion_r853454521


##########
website/docs/faq.md:
##########
@@ -253,6 +253,24 @@ Simplest way to run compaction on MOR dataset is to run the [compaction inline](
 
 That said, for obvious reasons of not blocking ingesting for compaction, you may want to run it asynchronously as well. This can be done either via a separate [compaction job](https://github.com/apache/hudi/blob/master/hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieCompactor.java) that is scheduled by your workflow scheduler/notebook independently. If you are using delta streamer, then you can run in [continuous mode](https://github.com/apache/hudi/blob/d3edac4612bde2fa9deca9536801dbc48961fb95/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/HoodieDeltaStreamer.java#L241) where the ingestion and compaction are both managed concurrently in a single spark run time.
 
+### What options do I have for asynchronous compactions on MOR dataset?
+
+There are a couple of options depending on how you write to Hudi. But first let us understand briefly what is involved. There are two parts to compaction
+- Scheduling: In this step, Hudi scans the partitions and selects file slices to be compacted. A compaction plan is finally written to Hudi timeline. Scheduling needs tighter coordination with other writers (regular ingestion is considered one of the writers). If scheduling is done inline with the ingestion job, this coordination is automatically taken care of. Else when scheduling happens asynchronously a lock provider needs to be configured for this coordination among multiple writers.
+- Execution: A separate process reads the compaction plan and performs compaction of file slices. Execution doesnt need the same level of coordination with other writers as Scheduling step and can be decoupled from ingestion job easily.
+
+Depending on how you write to Hudi these are the possible options currently.
+- DeltaStreamer:
+  - In Continuous mode asynchronous compaction is achieved by default. Here scheduling is done by the ingestion job inline and compaction execution is achieved asynchronously by a separate parallel thread.
+- Spark datasource:
+  - Async scheduling and async execution can be achieved by periodically running Hudi Compactor Utility or Hudi CLI. However this needs a lock provider to be configured.
+  - Alternately to avoid dependency on lock providers, scheduling alone can be done inline by regular writer using the config `hoodie.compact.schedule.inline` . And compaction execution can be done asynchronously by periodically triggering the Hudi Compactor Utility or Hudi CLI.
+- Spark structured streaming:
+  - Compactions are scheduled and executed asynchronously inside the streaming job. Async Compactions are enabled by default for structured streaming jobs on Merge-On-Read table.
+- Flink:
+  - TODO

Review Comment:
   Thanks, I think these are described here - https://hudi.apache.org/docs/next/flink_configuration#compaction. Let me rephrase and point to this link as needed.





[GitHub] [hudi] bhasudha commented on a diff in pull request #5304: [DOCS] Add faq for async compaction options

Posted by GitBox <gi...@apache.org>.
bhasudha commented on code in PR #5304:
URL: https://github.com/apache/hudi/pull/5304#discussion_r859184704


##########
website/learn/faq.md:
##########
@@ -253,6 +253,25 @@ Simplest way to run compaction on MOR dataset is to run the [compaction inline](
 
 That said, for obvious reasons of not blocking ingesting for compaction, you may want to run it asynchronously as well. This can be done either via a separate [compaction job](https://github.com/apache/hudi/blob/master/hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieCompactor.java) that is scheduled by your workflow scheduler/notebook independently. If you are using delta streamer, then you can run in [continuous mode](https://github.com/apache/hudi/blob/d3edac4612bde2fa9deca9536801dbc48961fb95/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/HoodieDeltaStreamer.java#L241) where the ingestion and compaction are both managed concurrently in a single spark run time.
 
+### What options do I have for asynchronous/offline compactions on MOR dataset?
+
+There are a couple of options depending on how you write to Hudi. But first let us understand briefly what is involved. There are two parts to compaction
+- Scheduling: In this step, Hudi scans the partitions and selects file slices to be compacted. A compaction plan is finally written to Hudi timeline. Scheduling needs tighter coordination with other writers (regular ingestion is considered one of the writers). If scheduling is done inline with the ingestion job, this coordination is automatically taken care of. Else when scheduling happens asynchronously a lock provider needs to be configured for this coordination among multiple writers.
+- Execution: A separate process reads the compaction plan and performs compaction of file slices. Execution doesnt need the same level of coordination with other writers as Scheduling step and can be decoupled from ingestion job easily.
+
+Depending on how you write to Hudi these are the possible options currently.
+- DeltaStreamer:
+   - In Continuous mode asynchronous compaction is achieved by default. Here scheduling is done by the ingestion job inline and compaction execution is achieved asynchronously by a separate parallel thread.

Review Comment:
   You mean disabling async compaction can be done via `--disable-compaction`, correct?



##########
website/learn/faq.md:
##########
@@ -253,6 +253,25 @@ Simplest way to run compaction on MOR dataset is to run the [compaction inline](
 
 That said, for obvious reasons of not blocking ingesting for compaction, you may want to run it asynchronously as well. This can be done either via a separate [compaction job](https://github.com/apache/hudi/blob/master/hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieCompactor.java) that is scheduled by your workflow scheduler/notebook independently. If you are using delta streamer, then you can run in [continuous mode](https://github.com/apache/hudi/blob/d3edac4612bde2fa9deca9536801dbc48961fb95/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/HoodieDeltaStreamer.java#L241) where the ingestion and compaction are both managed concurrently in a single spark run time.
 
+### What options do I have for asynchronous/offline compactions on MOR dataset?
+
+There are a couple of options depending on how you write to Hudi. But first let us understand briefly what is involved. There are two parts to compaction
+- Scheduling: In this step, Hudi scans the partitions and selects file slices to be compacted. A compaction plan is finally written to Hudi timeline. Scheduling needs tighter coordination with other writers (regular ingestion is considered one of the writers). If scheduling is done inline with the ingestion job, this coordination is automatically taken care of. Else when scheduling happens asynchronously a lock provider needs to be configured for this coordination among multiple writers.
+- Execution: A separate process reads the compaction plan and performs compaction of file slices. Execution doesnt need the same level of coordination with other writers as Scheduling step and can be decoupled from ingestion job easily.

Review Comment:
   Will fix!





[GitHub] [hudi] bhasudha commented on a diff in pull request #5304: [DOCS] Add faq for async compaction options

Posted by GitBox <gi...@apache.org>.
bhasudha commented on code in PR #5304:
URL: https://github.com/apache/hudi/pull/5304#discussion_r848645116


##########
website/docs/faq.md:
##########
@@ -253,6 +253,24 @@ Simplest way to run compaction on MOR dataset is to run the [compaction inline](
 
 That said, for obvious reasons of not blocking ingesting for compaction, you may want to run it asynchronously as well. This can be done either via a separate [compaction job](https://github.com/apache/hudi/blob/master/hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieCompactor.java) that is scheduled by your workflow scheduler/notebook independently. If you are using delta streamer, then you can run in [continuous mode](https://github.com/apache/hudi/blob/d3edac4612bde2fa9deca9536801dbc48961fb95/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/HoodieDeltaStreamer.java#L241) where the ingestion and compaction are both managed concurrently in a single spark run time.
 
+### What options do I have for asynchronous compactions on MOR dataset?

Review Comment:
   @nsivabalan  please verify the description once.


