You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/05/12 10:57:01 UTC

[GitHub] [hudi] zhangyue19921010 opened a new pull request, #5567: [RFC-53] Use Lock-Free Message Queue Disruptor Improving Hoodie Writing Efficiency

zhangyue19921010 opened a new pull request, #5567:
URL: https://github.com/apache/hudi/pull/5567

   https://issues.apache.org/jira/browse/HUDI-3963
   
   ## What is the purpose of the pull request
   
   *(For example: This pull request adds quick-start document.)*
   
   ## Brief change log
   
   *(for example:)*
     - *Modify AnnotationLocation checkstyle rule in checkstyle.xml*
   
   ## Verify this pull request
   
   *(Please pick either of the following options)*
   
   This pull request is a trivial rework / code cleanup without any test coverage.
   
   *(or)*
   
   This pull request is already covered by existing tests, such as *(please describe tests)*.
   
   (or)
   
   This change added tests and can be verified as follows:
   
   *(example:)*
   
     - *Added integration tests for end-to-end.*
     - *Added HoodieClientWriteTest to verify the change.*
     - *Manually verified the change by running a job locally.*
   
   ## Committer checklist
   
    - [ ] Has a corresponding JIRA in PR title & commit
    
    - [ ] Commit message is descriptive of the change
    
    - [ ] CI is green
   
    - [ ] Necessary doc changes done or have another open PR
          
    - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] zhangyue19921010 commented on pull request #5567: [RFC-53] Use Lock-Free Message Queue Disruptor Improving Hoodie Writing Efficiency

Posted by GitBox <gi...@apache.org>.
zhangyue19921010 commented on PR #5567:
URL: https://github.com/apache/hudi/pull/5567#issuecomment-1128525197

   @hudi-bot run azure


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] danny0405 commented on a diff in pull request #5567: [RFC-53] Use Lock-Free Message Queue Disruptor Improving Hoodie Writing Efficiency

Posted by GitBox <gi...@apache.org>.
danny0405 commented on code in PR #5567:
URL: https://github.com/apache/hudi/pull/5567#discussion_r884392617


##########
rfc/rfc-53/rfc-53.md:
##########
@@ -0,0 +1,160 @@
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+# RFC-53: Use Lock-Free Message Queue Improving Hoodie Writing Efficiency
+
+
+## Proposers
+@zhangyue19921010
+
+## Approvers
+@leesf
+
+## Status
+
+JIRA: https://issues.apache.org/jira/browse/HUDI-3963
+
+
+## Abstract
+
+New option which use Lock-Free Message Queue called Disruptor as inner message queue to improve hoodie writing performance and optimize writing efficiency.
+
+Disruptor linked: https://lmax-exchange.github.io/disruptor/user-guide/index.html#_introduction
+
+
+## Background
+
+Based on master branch, hoodie consumes upstream data (Kafka or S3 files) into the lake is a standard production-consumption model. 
+Currently, hoodie uses `LinkedBlockingQueue` as a inner message queue between Producer and Consumer.
+
+However, this lock model may become the bottleneck of application throughput when data volume is much larger. 
+What's worse is that even if we increase the number of the executors, it is still difficult to improve the throughput.
+
+In other words, users may encounter throughput bottlenecks when writing data into hudi in some scenarios, 
+for example the schema is relatively simple, but the volume of data is pretty large or users observed insufficient data throughput and low cpu usage, etc.
+
+This RFC is to solve the performance bottleneck problem caused by locking in some large data volume scenarios
+
+This RFC provides a new option which use Lock-Free Message Queue called Disruptor as inner message queue to 
+The advantages are that:
+ - Fully use all the cpu resources without lock blocking.
+ - Improving writing performance and efficiency
+ - Solve the potential performance bottlenecks causing by locking.
+
+
+## Implementation
+
+![](DisruptorExecutor.png)
+
+This RFC mainly does two things: One is to do the code abstraction about hoodie consuming upstream data and writing into hudi format.
+The other thing is to implement disruptor based producer, inner message queue executor and message handler based on this new abstraction.
+
+Firstly, briefly introduce code abstraction(take `[based-master]` as current logic/option, and `[rfc-new]` for new option provided by this rfc)
+- [abstract] `HoodieMessageQueue`: Hold the inner message queue, control the initialization of the inner message queue, 
+control its life cycle, and provide a unified insert api, speed limit, memory control and other enrich functions. The current implementations are as follows:
+    - [based-master] `BoundedInMemoryQueue` which hold a `LinkedBlockingQueue` as inner message queue.
+    - [rfc-new] `DisruptorMessageQueue` which hold a lock free ringbuffer called disruptor as inner message queue.
+- [interface] `HoodieProducer`: Controls the producer behaviors and life cycle of hoodie reading upstream data and writing it into the inner message queue.
+The current implementations are as follows:
+    - [based-master][abstract] `BoundedInMemoryQueueProducer` Producer for `BoundedInMemoryQueue`
+        - [based-master] `IteratorBasedQueueProducer` Iterator based producer which pulls entry from iterator and produces items into the `LinkedBlockingQueue`
+        - [based-master] `FunctionBasedQueueProducer` Buffer producer which allows custom functions to insert entries to the `LinkedBlockingQueue`
+    - [rfc-new][abstract] `DisruptorBasedProducer`Producer for `DisruptorMessageQueue`
+        - [rfc-new] `IteratorBasedDisruptorProducer` Iterator based producer which pulls entry from iterator and produces items into the `DisruptorMessageQueue`
+        - [rfc-new] `FunctionBasedDisruptorQueueProducer` Buffer producer which allows custom functions to insert entries to the `DisruptorMessageQueue`
+ - [interface] `HoodieConsumer` Control hoodie to read the data from inner message queue and write them as hudi data files, and execute callback function. 
+  The current implementations are as follows:
+    - [based-master] `BoundedInMemoryQueueConsumer` Consume entries directly from `LinkedBlockingQueue` and execute callback function.
+    - [rfc-new] `DisruptorMessageHandler` which hold the same `BoundedInMemoryQueueConsumer` instant mentioned before. Use `DisruptorMessageHandler` extracts each record in disruptor then
+    using `BoundedInMemoryQueueConsumer` writing hudi data file.
+- [abstract] `HoodieExecutor`: Executor which orchestrates concurrent producers and consumers communicating through a inner message queue.
+The current implementations are as follows:
+    - [based-master] `BoundedInMemoryExecutor` takes as input the size limit, queue producer(s), consumer and transformer and exposes API to orchestrate concurrent execution of these actors communicating through a central LinkedBlockingQueue.
+    - [rfc-new] `DisruptorExecutor` Control the initialization, life cycle of the disruptor, and coordinate the work of the producer, consumer, and ringbuffer related to the disruptor, etc.
+    
+Secondly, This rfc implements disruptor related producers, message handlers and executor which use this lock-free message queue based on the above abstraction. Some compents are introduced in the first part. In this phase, we discuss how to use disruptor in hoodie writing stages.
+
+The Disruptor is a library that provides a concurrent ring buffer data structure. It is designed to provide a low-latency, high-throughput work queue in asynchronous event processing architectures.
+
+We use the Disruptor multi-producer single-consumer working model:
+- Define `DisruptorPublisher` to register producers into Disruptor and control the produce behaviors including life cycle.
+- Define `DisruptorMessageHandler` to register consumers into Disruptor and write consumption data from disruptor to hudi data file. 
+For example we will clear clear out the event after processing it to avoid to avoid unnecessary memory and GC pressure
+- Define `HoodieDisruptorEvent` as the carrier of the hoodie message

Review Comment:
   clear clear out
   to avoid to avoid
   
   syntax error



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] zhangyue19921010 commented on pull request #5567: [RFC-53] Use Lock-Free Message Queue Disruptor Improving Hoodie Writing Efficiency

Posted by GitBox <gi...@apache.org>.
zhangyue19921010 commented on PR #5567:
URL: https://github.com/apache/hudi/pull/5567#issuecomment-1128310618

   Hi @leesf Thanks a lot for your review. Really appreciate it !
   All comments are addressed. PTAL


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] hudi-bot commented on pull request #5567: [RFC-53] Use Lock-Free Message Queue Disruptor Improving Hoodie Writing Efficiency

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5567:
URL: https://github.com/apache/hudi/pull/5567#issuecomment-1134242594

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "5a99503343f3f536b9c153ad4422e00eb0b3ff40",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8607",
       "triggerID" : "5a99503343f3f536b9c153ad4422e00eb0b3ff40",
       "triggerType" : "PUSH"
     }, {
       "hash" : "765c7c9d031deb7bdebead97032a61b4b9e5ad4a",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8694",
       "triggerID" : "765c7c9d031deb7bdebead97032a61b4b9e5ad4a",
       "triggerType" : "PUSH"
     }, {
       "hash" : "765c7c9d031deb7bdebead97032a61b4b9e5ad4a",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8694",
       "triggerID" : "1128525197",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "765c7c9d031deb7bdebead97032a61b4b9e5ad4a",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8694",
       "triggerID" : "1128983139",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "6edd9582998823a2c3dfd181c55f2f380c248e87",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8843",
       "triggerID" : "6edd9582998823a2c3dfd181c55f2f380c248e87",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 6edd9582998823a2c3dfd181c55f2f380c248e87 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8843) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] zhangyue19921010 commented on pull request #5567: [RFC-53] Use Lock-Free Message Queue Disruptor Improving Hoodie Writing Efficiency

Posted by GitBox <gi...@apache.org>.
zhangyue19921010 commented on PR #5567:
URL: https://github.com/apache/hudi/pull/5567#issuecomment-1140622371

   Thanks a lot for your review @leesf 
   Really appreciate it if you can review the related pr :)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] leesf commented on a diff in pull request #5567: [RFC-53] Use Lock-Free Message Queue Disruptor Improving Hoodie Writing Efficiency

Posted by GitBox <gi...@apache.org>.
leesf commented on code in PR #5567:
URL: https://github.com/apache/hudi/pull/5567#discussion_r878200093


##########
rfc/rfc-53/rfc-53.md:
##########
@@ -0,0 +1,154 @@
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+# RFC-53: Use Lock-Free Message Queue Improving Hoodie Writing Efficiency
+
+
+## Proposers
+@zhangyue19921010
+
+## Approvers

Review Comment:
   you can add me(leesf) as the approvers



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] zhangyue19921010 commented on a diff in pull request #5567: [RFC-53] Use Lock-Free Message Queue Disruptor Improving Hoodie Writing Efficiency

Posted by GitBox <gi...@apache.org>.
zhangyue19921010 commented on code in PR #5567:
URL: https://github.com/apache/hudi/pull/5567#discussion_r878253812


##########
rfc/rfc-53/rfc-53.md:
##########
@@ -0,0 +1,154 @@
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+# RFC-53: Use Lock-Free Message Queue Improving Hoodie Writing Efficiency
+
+
+## Proposers
+@zhangyue19921010
+
+## Approvers
+
+## Status
+
+JIRA: https://issues.apache.org/jira/browse/HUDI-3963
+
+
+## Abstract
+
+New option which use Lock-Free Message Queue called Disruptor as inner message queue to improve hoodie writing performance and optimize writing efficiency.
+
+Disruptor linked: https://lmax-exchange.github.io/disruptor/user-guide/index.html#_introduction
+
+
+## Background
+
+Based on master branch, hoodie consumes upstream data (Kafka or S3 files) into the lake is a standard production-consumption model. 
+Currently, hoodie uses `LinkedBlockingQueue` as a inner message queue between Producer and Consumer.
+
+However, this lock model may become the bottleneck of application throughput when data volume is much larger. 
+What's worse is that even if we increase the number of the executors, it is still difficult to improve the throughput.
+
+In other words, users may encounter throughput bottlenecks when writing data into hudi in some scenarios, 
+for example the schema is relatively simple, but the volume of data is pretty large or users observed insufficient data throughput and low cpu usage, etc.
+
+This RFC is to solve the performance bottleneck problem caused by locking in some large data volume scenarios
+
+This RFC provides a new option which use Lock-Free Message Queue called Disruptor as inner message queue to 
+The advantages are that:
+ - Fully use all the cpu resources without lock blocking.
+ - Improving writing performance and efficiency
+ - Solve the potential performance bottlenecks causing by locking.
+
+
+## Implementation
+
+![](DisruptorExecutor.png)
+
+This RFC mainly does two things: One is to do the code abstraction about hoodie consuming upstream data and writing into hudi format.
+The other thing is to implement disruptor based producer, inner message queue executor and message handler based on this new abstraction.
+
+Firstly, briefly introduce code abstraction(take `[based-master]` as current logic/option, and `[rfc-new]` for new option provided by this rfc)
+- [abstract] `HoodieMessageQueue`: Hold the inner message queue, control the initialization of the inner message queue, 
+control its life cycle, and provide a unified insert api, speed limit, memory control and other enrich functions. The current implementations are as follows:
+    - [based-master] `BoundedInMemoryQueue` which hold a `LinkedBlockingQueue` as inner message queue.
+    - [rfc-new] `DisruptorMessageQueue` which hold a lock free ringbuffer called disruptor as inner message queue.
+- [interface] `HoodieProducer`: Controls the producer behaviors and life cycle of hoodie reading upstream data and writing it into the inner message queue.
+The current implementations are as follows:
+    - [based-master][abstract] `BoundedInMemoryQueueProducer` Producer for `BoundedInMemoryQueue`
+        - [based-master] `IteratorBasedQueueProducer` Iterator based producer which pulls entry from iterator and produces items into the `LinkedBlockingQueue`
+        - [based-master] `FunctionBasedQueueProducer` Buffer producer which allows custom functions to insert entries to the `LinkedBlockingQueue`
+    - [rfc-new][abstract] `DisruptorBasedProducer`Producer for `DisruptorMessageQueue`
+        - [rfc-new] `IteratorBasedDisruptorProducer` Iterator based producer which pulls entry from iterator and produces items into the `DisruptorMessageQueue`
+        - [rfc-new] `FunctionBasedDisruptorQueueProducer` Buffer producer which allows custom functions to insert entries to the `DisruptorMessageQueue`
+ - [interface] `HoodieConsumer` Control hoodie to read the data from inner message queue and write them as hudi data files, and execute callback function. 
+  The current implementations are as follows:
+    - [based-master] `BoundedInMemoryQueueConsumer` Consume entries directly from `LinkedBlockingQueue` and execute callback function.
+    - [rfc-new] `DisruptorMessageHandler` which hold the same `BoundedInMemoryQueueConsumer` instant mentioned before. Use `DisruptorMessageHandler` extracts each record in disruptor then
+    using `BoundedInMemoryQueueConsumer` writing hudi data file.
+- [abstract] `HoodieExecutor`: Executor which orchestrates concurrent producers and consumers communicating through a inner message queue.
+The current implementations are as follows:
+    - [based-master] `BoundedInMemoryExecutor` takes as input the size limit, queue producer(s), consumer and transformer and exposes API to orchestrate concurrent execution of these actors communicating through a central LinkedBlockingQueue.
+    - [rfc-new] `DisruptorExecutor` Control the initialization, life cycle of the disruptor, and coordinate the work of the producer, consumer, and ringbuffer related to the disruptor, etc.
+    
+Secondly, This rfc implements disruptor related producers, message handlers and executor which use this lock-free message queue based on the above abstraction. Some compents are introduced in the first part. In this phase, we discuss how to use disruptor in hoodie writing stages.
+
+The Disruptor is a library that provides a concurrent ring buffer data structure. It is designed to provide a low-latency, high-throughput work queue in asynchronous event processing architectures.
+
+We use the Disruptor multi-producer single-consumer working model:
+- Define `DisruptorPublisher` to register producers into Disruptor and control the produce behaviors including life cycle.
+- Define `DisruptorMessageHandler` to register consumers into Disruptor and write consumption data from disruptor to hudi data file. 
+For example we will clear clear out the event after processing it to avoid to avoid unnecessary memory and GC pressure
+- Define `HoodieDisruptorEvent` as the carrier of the hoodie message
+- Define `HoodieDisruptorEventFactory`: Pre-populate all the hoodie events to fill the RingBuffer. 
+We can use `HoodieDisruptorEventFactory` to create `HoodieDisruptorEvent` storing the data for sharing during exchange or parallel coordination of an event.
+- Expose some necessary parameters for the users with a proper default to tune in different scenarios.
+
+Finally, let me introduce the new parameters:
+  - `hoodie.write.executor.type`: Choose the type of executor to use, which orchestrates concurrent producers and consumers communicating through a inner message queue. 
+  Default value is `BOUNDED_IN_MEMORY_EXECUTOR` which used a bounded in-memory queue `LinkedBlockingQueue`. 
+  Also users could use `DISRUPTOR_EXECUTOR`, which use disruptor as a lock free message queue to gain better writing performance. 
+  Although `DISRUPTOR_EXECUTOR` is still an experimental feature.
+  - `hoodie.write.buffer.size`: The size of the Disruptor Executor ring buffer, must be power of 2. Also the default/recommended value is 1024.
+  - `hoodie.write.wait.strategy`: Used for disruptor wait strategy. The Wait Strategy determines how a consumer will wait for events to be placed into the Disruptor by a producer. 

Review Comment:
   I believe these configs are enough :)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] hudi-bot commented on pull request #5567: [RFC-53] Use Lock-Free Message Queue Disruptor Improving Hoodie Writing Efficiency

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5567:
URL: https://github.com/apache/hudi/pull/5567#issuecomment-1134139744

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "5a99503343f3f536b9c153ad4422e00eb0b3ff40",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8607",
       "triggerID" : "5a99503343f3f536b9c153ad4422e00eb0b3ff40",
       "triggerType" : "PUSH"
     }, {
       "hash" : "765c7c9d031deb7bdebead97032a61b4b9e5ad4a",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8694",
       "triggerID" : "765c7c9d031deb7bdebead97032a61b4b9e5ad4a",
       "triggerType" : "PUSH"
     }, {
       "hash" : "765c7c9d031deb7bdebead97032a61b4b9e5ad4a",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8694",
       "triggerID" : "1128525197",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "765c7c9d031deb7bdebead97032a61b4b9e5ad4a",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8694",
       "triggerID" : "1128983139",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "6edd9582998823a2c3dfd181c55f2f380c248e87",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "6edd9582998823a2c3dfd181c55f2f380c248e87",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 765c7c9d031deb7bdebead97032a61b4b9e5ad4a Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8694) 
   * 6edd9582998823a2c3dfd181c55f2f380c248e87 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] zhangyue19921010 commented on a diff in pull request #5567: [RFC-53] Use Lock-Free Message Queue Disruptor Improving Hoodie Writing Efficiency

Posted by GitBox <gi...@apache.org>.
zhangyue19921010 commented on code in PR #5567:
URL: https://github.com/apache/hudi/pull/5567#discussion_r874281338


##########
rfc/rfc-53/rfc-53.md:
##########
@@ -0,0 +1,120 @@
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+# RFC-53: Use Lock-Free Message Queue Improving Hoodie Writing Efficiency
+
+
+## Proposers
+@zhangyue19921010
+
+## Approvers
+
+## Status
+
+JIRA: https://issues.apache.org/jira/browse/HUDI-3963
+
+
+## Abstract
+
+New option which use Lock-Free Message Queue called Disruptor as inner message queue to improve hoodie writing performance and optimize writing efficiency.
+
+Disruptor linked: https://lmax-exchange.github.io/disruptor/user-guide/index.html#_introduction
+
+
+## Background
+
+Based on master branch, hoodie consumes upstream data (Kafka or S3 files) into the lake is a standard production-consumption model. 
+Currently, hoodie uses `LinkedBlockingQueue` as a inner message queue between Producer and Consumer.
+
+However, this lock model may become the bottleneck of application throughput when data volume is much larger. 
+What's worse is that even if we increase the number of the executors, it is still difficult to improve the throughput.
+
+In other words, users may encounter throughput bottlenecks when writing data into hudi in some scenarios, and increasing the physical hardware scale and parameter tuning cannot solve this problem.

Review Comment:
   Yeap, add more details in this rfc. Thanks a lot for your review :)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] hudi-bot commented on pull request #5567: [RFC-53] Use Lock-Free Message Queue Disruptor Improving Hoodie Writing Efficiency

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5567:
URL: https://github.com/apache/hudi/pull/5567#issuecomment-1128559897

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "5a99503343f3f536b9c153ad4422e00eb0b3ff40",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8607",
       "triggerID" : "5a99503343f3f536b9c153ad4422e00eb0b3ff40",
       "triggerType" : "PUSH"
     }, {
       "hash" : "765c7c9d031deb7bdebead97032a61b4b9e5ad4a",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8694",
       "triggerID" : "765c7c9d031deb7bdebead97032a61b4b9e5ad4a",
       "triggerType" : "PUSH"
     }, {
       "hash" : "765c7c9d031deb7bdebead97032a61b4b9e5ad4a",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "1128525197",
       "triggerType" : "MANUAL"
     } ]
   }-->
   ## CI report:
   
   * 765c7c9d031deb7bdebead97032a61b4b9e5ad4a Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8694) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] zhangyue19921010 commented on pull request #5567: [RFC-53] Use Lock-Free Message Queue Disruptor Improving Hoodie Writing Efficiency

Posted by GitBox <gi...@apache.org>.
zhangyue19921010 commented on PR #5567:
URL: https://github.com/apache/hudi/pull/5567#issuecomment-1134135189

   Hi @leesf. Comments are addressed. PTAL. Thanks :)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] zhangyue19921010 commented on a diff in pull request #5567: [RFC-53] Use Lock-Free Message Queue Disruptor Improving Hoodie Writing Efficiency

Posted by GitBox <gi...@apache.org>.
zhangyue19921010 commented on code in PR #5567:
URL: https://github.com/apache/hudi/pull/5567#discussion_r878253130


##########
rfc/rfc-53/rfc-53.md:
##########
@@ -0,0 +1,154 @@
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+# RFC-53: Use Lock-Free Message Queue Improving Hoodie Writing Efficiency
+
+
+## Proposers
+@zhangyue19921010
+
+## Approvers

Review Comment:
   Sure. Really appreciate it. Thanks @leesf 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] leesf commented on a diff in pull request #5567: [RFC-53] Use Lock-Free Message Queue Disruptor Improving Hoodie Writing Efficiency

Posted by GitBox <gi...@apache.org>.
leesf commented on code in PR #5567:
URL: https://github.com/apache/hudi/pull/5567#discussion_r878200906


##########
rfc/rfc-53/rfc-53.md:
##########
@@ -0,0 +1,154 @@
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+# RFC-53: Use Lock-Free Message Queue Improving Hoodie Writing Efficiency
+
+
+## Proposers
+@zhangyue19921010
+
+## Approvers
+
+## Status
+
+JIRA: https://issues.apache.org/jira/browse/HUDI-3963
+
+
+## Abstract
+
+New option which use Lock-Free Message Queue called Disruptor as inner message queue to improve hoodie writing performance and optimize writing efficiency.
+
+Disruptor linked: https://lmax-exchange.github.io/disruptor/user-guide/index.html#_introduction
+
+
+## Background
+
+Based on master branch, hoodie consumes upstream data (Kafka or S3 files) into the lake is a standard production-consumption model. 
+Currently, hoodie uses `LinkedBlockingQueue` as a inner message queue between Producer and Consumer.
+
+However, this lock model may become the bottleneck of application throughput when data volume is much larger. 
+What's worse is that even if we increase the number of the executors, it is still difficult to improve the throughput.
+
+In other words, users may encounter throughput bottlenecks when writing data into hudi in some scenarios, 
+for example the schema is relatively simple, but the volume of data is pretty large or users observed insufficient data throughput and low cpu usage, etc.
+
+This RFC is to solve the performance bottleneck problem caused by locking in some large data volume scenarios
+
+This RFC provides a new option which use Lock-Free Message Queue called Disruptor as inner message queue to 
+The advantages are that:
+ - Fully use all the cpu resources without lock blocking.
+ - Improving writing performance and efficiency
+ - Solve the potential performance bottlenecks causing by locking.
+
+
+## Implementation
+
+![](DisruptorExecutor.png)
+
+This RFC mainly does two things: One is to do the code abstraction about hoodie consuming upstream data and writing into hudi format.
+The other thing is to implement disruptor based producer, inner message queue executor and message handler based on this new abstraction.
+
+Firstly, briefly introduce code abstraction(take `[based-master]` as current logic/option, and `[rfc-new]` for new option provided by this rfc)
+- [abstract] `HoodieMessageQueue`: Hold the inner message queue, control the initialization of the inner message queue, 
+control its life cycle, and provide a unified insert api, speed limit, memory control and other enrich functions. The current implementations are as follows:
+    - [based-master] `BoundedInMemoryQueue` which hold a `LinkedBlockingQueue` as inner message queue.
+    - [rfc-new] `DisruptorMessageQueue` which hold a lock free ringbuffer called disruptor as inner message queue.
+- [interface] `HoodieProducer`: Controls the producer behaviors and life cycle of hoodie reading upstream data and writing it into the inner message queue.
+The current implementations are as follows:
+    - [based-master][abstract] `BoundedInMemoryQueueProducer` Producer for `BoundedInMemoryQueue`
+        - [based-master] `IteratorBasedQueueProducer` Iterator based producer which pulls entry from iterator and produces items into the `LinkedBlockingQueue`
+        - [based-master] `FunctionBasedQueueProducer` Buffer producer which allows custom functions to insert entries to the `LinkedBlockingQueue`
+    - [rfc-new][abstract] `DisruptorBasedProducer`Producer for `DisruptorMessageQueue`
+        - [rfc-new] `IteratorBasedDisruptorProducer` Iterator based producer which pulls entry from iterator and produces items into the `DisruptorMessageQueue`
+        - [rfc-new] `FunctionBasedDisruptorQueueProducer` Buffer producer which allows custom functions to insert entries to the `DisruptorMessageQueue`
+ - [interface] `HoodieConsumer` Control hoodie to read the data from inner message queue and write them as hudi data files, and execute callback function. 
+  The current implementations are as follows:
+    - [based-master] `BoundedInMemoryQueueConsumer` Consume entries directly from `LinkedBlockingQueue` and execute callback function.
+    - [rfc-new] `DisruptorMessageHandler` which hold the same `BoundedInMemoryQueueConsumer` instant mentioned before. Use `DisruptorMessageHandler` extracts each record in disruptor then
+    using `BoundedInMemoryQueueConsumer` writing hudi data file.
+- [abstract] `HoodieExecutor`: Executor which orchestrates concurrent producers and consumers communicating through a inner message queue.
+The current implementations are as follows:
+    - [based-master] `BoundedInMemoryExecutor` takes as input the size limit, queue producer(s), consumer and transformer and exposes API to orchestrate concurrent execution of these actors communicating through a central LinkedBlockingQueue.
+    - [rfc-new] `DisruptorExecutor` Control the initialization, life cycle of the disruptor, and coordinate the work of the producer, consumer, and ringbuffer related to the disruptor, etc.
+    
+Secondly, This rfc implements disruptor related producers, message handlers and executor which use this lock-free message queue based on the above abstraction. Some compents are introduced in the first part. In this phase, we discuss how to use disruptor in hoodie writing stages.
+
+The Disruptor is a library that provides a concurrent ring buffer data structure. It is designed to provide a low-latency, high-throughput work queue in asynchronous event processing architectures.
+
+We use the Disruptor multi-producer single-consumer working model:
+- Define `DisruptorPublisher` to register producers into Disruptor and control the produce behaviors including life cycle.
+- Define `DisruptorMessageHandler` to register consumers into Disruptor and write consumption data from disruptor to hudi data file. 
+For example we will clear clear out the event after processing it to avoid to avoid unnecessary memory and GC pressure
+- Define `HoodieDisruptorEvent` as the carrier of the hoodie message
+- Define `HoodieDisruptorEventFactory`: Pre-populate all the hoodie events to fill the RingBuffer. 
+We can use `HoodieDisruptorEventFactory` to create `HoodieDisruptorEvent` storing the data for sharing during exchange or parallel coordination of an event.
+- Expose some necessary parameters for the users with a proper default to tune in different scenarios.
+
+Finally, let me introduce the new parameters:
+  - `hoodie.write.executor.type`: Choose the type of executor to use, which orchestrates concurrent producers and consumers communicating through a inner message queue. 
+  Default value is `BOUNDED_IN_MEMORY_EXECUTOR` which used a bounded in-memory queue `LinkedBlockingQueue`. 
+  Also users could use `DISRUPTOR_EXECUTOR`, which use disruptor as a lock free message queue to gain better writing performance. 
+  Although `DISRUPTOR_EXECUTOR` is still an experimental feature.
+  - `hoodie.write.buffer.size`: The size of the Disruptor Executor ring buffer, must be power of 2. Also the default/recommended value is 1024.
+  - `hoodie.write.wait.strategy`: Used for disruptor wait strategy. The Wait Strategy determines how a consumer will wait for events to be placed into the Disruptor by a producer. 

Review Comment:
   any other config needed to expose to end users as well?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] zhangyue19921010 commented on a diff in pull request #5567: [RFC-53] Use Lock-Free Message Queue Disruptor Improving Hoodie Writing Efficiency

Posted by GitBox <gi...@apache.org>.
zhangyue19921010 commented on code in PR #5567:
URL: https://github.com/apache/hudi/pull/5567#discussion_r877689202


##########
rfc/rfc-53/rfc-53.md:
##########
@@ -0,0 +1,120 @@
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+# RFC-53: Use Lock-Free Message Queue Improving Hoodie Writing Efficiency
+
+
+## Proposers
+@zhangyue19921010
+
+## Approvers
+
+## Status
+
+JIRA: https://issues.apache.org/jira/browse/HUDI-3963
+
+
+## Abstract
+
+New option which use Lock-Free Message Queue called Disruptor as inner message queue to improve hoodie writing performance and optimize writing efficiency.
+
+Disruptor linked: https://lmax-exchange.github.io/disruptor/user-guide/index.html#_introduction
+
+
+## Background
+
+Based on master branch, hoodie consumes upstream data (Kafka or S3 files) into the lake is a standard production-consumption model. 
+Currently, hoodie uses `LinkedBlockingQueue` as a inner message queue between Producer and Consumer.
+
+However, this lock model may become the bottleneck of application throughput when data volume is much larger. 
+What's worse is that even if we increase the number of the executors, it is still difficult to improve the throughput.
+
+In other words, users may encounter throughput bottlenecks when writing data into hudi in some scenarios, and increasing the physical hardware scale and parameter tuning cannot solve this problem.
+
+This RFC is to solve the performance bottleneck problem caused by locking in some large data volume scenarios
+
+This RFC provides a new option which use Lock-Free Message Queue called Disruptor as inner message queue to 
+The advantages are that:
+ - Fully use all the cpu resources without lock blocking.
+ - Improving writing performance and efficiency
+ - Solve the potential performance bottlenecks causing by locking.
+
+
+## Implementation
+
+![](DisruptorExecutor.png)
+
+This RFC mainly does two things: One is to do the code abstraction about hoodie consuming upstream data and writing into hudi format.
+The other thing is to implement disruptor based producer, inner message queue executor and message handler based on this new abstraction.
+
+Firstly, briefly introduce code abstraction(take `[based-master]` as current logic/option, and `[rfc-new]` for new option provided by this rfc)
+- [abstract] `HoodieMessageQueue`: Hold the inner message queue, control the initialization of the inner message queue, 
+control its life cycle, and provide a unified insert api, speed limit, memory control and other enrich functions. The current implementations are as follows:
+    - [based-master] `BoundedInMemoryQueue` which hold a `LinkedBlockingQueue` as inner message queue.
+    - [rfc-new] `DisruptorMessageQueue` which hold a lock free ringbuffer called disruptor as inner message queue.
+- [interface] `HoodieProducer`: Controls the producer behaviors and life cycle of hoodie reading upstream data and writing it into the inner message queue.
+The current implementations are as follows:
+    - [based-master][abstract] `BoundedInMemoryQueueProducer` Producer for `BoundedInMemoryQueue`
+        - [based-master] `IteratorBasedQueueProducer` Iterator based producer which pulls entry from iterator and produces items into the `LinkedBlockingQueue`
+        - [based-master] `FunctionBasedQueueProducer` Buffer producer which allows custom functions to insert entries to the `LinkedBlockingQueue`
+    - [rfc-new][abstract] `DisruptorBasedProducer`Producer for `DisruptorMessageQueue`
+        - [rfc-new] `IteratorBasedDisruptorProducer` Iterator based producer which pulls entry from iterator and produces items into the `DisruptorMessageQueue`
+        - [rfc-new] `FunctionBasedDisruptorQueueProducer` Buffer producer which allows custom functions to insert entries to the `DisruptorMessageQueue`
+ - [interface] `HoodieConsumer` Control hoodie to read the data from inner message queue and write them as hudi data files, and execute callback function. 
+  The current implementations are as follows:
+    - [based-master] `BoundedInMemoryQueueConsumer` Consume entries directly from `LinkedBlockingQueue` and execute callback function.
+    - [rfc-new] `DisruptorMessageHandler` which hold the same `BoundedInMemoryQueueConsumer` instant mentioned before. Use `DisruptorMessageHandler` extracts each record in disruptor then
+    using `BoundedInMemoryQueueConsumer` writing hudi data file.
+- [abstract] `HoodieExecutor`: Executor which orchestrates concurrent producers and consumers communicating through a inner message queue.
+The current implementations are as follows:
+    - [based-master] `BoundedInMemoryExecutor` takes as input the size limit, queue producer(s), consumer and transformer and exposes API to orchestrate concurrent execution of these actors communicating through a central LinkedBlockingQueue.
+    - [rfc-new] `DisruptorExecutor` Control the initialization, life cycle of the disruptor, and coordinate the work of the producer, consumer, and ringbuffer related to the disruptor, etc.
+    
+Secondly, This rfc implements disruptor related producers, message handlers and executor which use this lock-free message queue based on the above abstraction. Some compents are introduced in the first part. In this phase, we discuss how to use disruptor in hoodie writing stages.
+
+The Disruptor is a library that provides a concurrent ring buffer data structure. It is designed to provide a low-latency, high-throughput work queue in asynchronous event processing architectures.
+
+We use the Disruptor multi-producer single-consumer working model:
+- Define `DisruptorPublisher` to register producers into Disruptor and control the produce behaviors including life cycle.
+- Define `DisruptorMessageHandler` to register consumers into Disruptor and write consumption data from disruptor to hudi data file
+- Define `HoodieDisruptorEvent` as the carrier of the hoodie message
+- Define `HoodieDisruptorEventFactory`: Pre-populate all the hoodie events to fill the RingBuffer. 
+We can use `HoodieDisruptorEventFactory` to create `HoodieDisruptorEvent` storing the data for sharing during exchange or parallel coordination of an event.
+- Expose some necessary parameters for the users with a proper default to tune in different scenarios.
+
+Finally, let me introduce the new parameters:
+  - `hoodie.write.executor.type`: Choose the type of executor to use, which orchestrates concurrent producers and consumers communicating through a inner message queue. 
+  Default value is `BOUNDED_IN_MEMORY_EXECUTOR` which used a bounded in-memory queue `LinkedBlockingQueue`. 
+  Also users could use `DISRUPTOR_EXECUTOR`, which use disruptor as a lock free message queue to gain better writing performance. 
+  Although `DISRUPTOR_EXECUTOR` is still an experimental feature.
+  - `hoodie.write.buffer.size`: The size of the Disruptor Executor ring buffer, must be power of 2.
+  - `hoodie.write.wait.strategy`: Strategy employed for making DisruptorExecutor wait for a cursor.
+
+4. limitation
+  For now, this disruptor executor is only supported for spark insert and spark bulk insert operation. Other operations like spark upsert is still on going.

Review Comment:
   Hi @YuweiXiao Thanks a lot for your attention.
   It seems that flink use something like `FlinkLazyInsertIterable` which hold `BoundedInMemoryExecutor` to do ingestion works same as spark.
   So that maybe we can provide this new `disruptorExecutor` compared with `BoundedInMemoryExecutor`.
   
   <img width="1667" alt="截屏2022-05-20 上午10 52 03" src="https://user-images.githubusercontent.com/69956021/169439622-a2143e44-b609-4daf-a4a2-f64741db258c.png">
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] hudi-bot commented on pull request #5567: [RFC-53] Use Lock-Free Message Queue Disruptor Improving Hoodie Writing Efficiency

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5567:
URL: https://github.com/apache/hudi/pull/5567#issuecomment-1124852530

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "5a99503343f3f536b9c153ad4422e00eb0b3ff40",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "5a99503343f3f536b9c153ad4422e00eb0b3ff40",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 5a99503343f3f536b9c153ad4422e00eb0b3ff40 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] zhangyue19921010 commented on pull request #5567: [RFC-53] Use Lock-Free Message Queue Disruptor Improving Hoodie Writing Efficiency

Posted by GitBox <gi...@apache.org>.
zhangyue19921010 commented on PR #5567:
URL: https://github.com/apache/hudi/pull/5567#issuecomment-1128983139

   @hudi-bot run azure


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] leesf merged pull request #5567: [RFC-53] Use Lock-Free Message Queue Disruptor Improving Hoodie Writing Efficiency

Posted by GitBox <gi...@apache.org>.
leesf merged PR #5567:
URL: https://github.com/apache/hudi/pull/5567


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] leesf commented on a diff in pull request #5567: [RFC-53] Use Lock-Free Message Queue Disruptor Improving Hoodie Writing Efficiency

Posted by GitBox <gi...@apache.org>.
leesf commented on code in PR #5567:
URL: https://github.com/apache/hudi/pull/5567#discussion_r873741734


##########
rfc/rfc-53/rfc-53.md:
##########
@@ -0,0 +1,120 @@
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+# RFC-53: Use Lock-Free Message Queue Improving Hoodie Writing Efficiency
+
+
+## Proposers
+@zhangyue19921010
+
+## Approvers
+
+## Status
+
+JIRA: https://issues.apache.org/jira/browse/HUDI-3963
+
+
+## Abstract
+
+New option which use Lock-Free Message Queue called Disruptor as inner message queue to improve hoodie writing performance and optimize writing efficiency.
+
+Disruptor linked: https://lmax-exchange.github.io/disruptor/user-guide/index.html#_introduction
+
+
+## Background
+
+Based on master branch, hoodie consumes upstream data (Kafka or S3 files) into the lake is a standard production-consumption model. 
+Currently, hoodie uses `LinkedBlockingQueue` as a inner message queue between Producer and Consumer.
+
+However, this lock model may become the bottleneck of application throughput when data volume is much larger. 
+What's worse is that even if we increase the number of the executors, it is still difficult to improve the throughput.
+
+In other words, users may encounter throughput bottlenecks when writing data into hudi in some scenarios, and increasing the physical hardware scale and parameter tuning cannot solve this problem.

Review Comment:
   `in some scenerios`, can we describe the scenerios in detail, such as the scenarios your company met?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] hudi-bot commented on pull request #5567: [RFC-53] Use Lock-Free Message Queue Disruptor Improving Hoodie Writing Efficiency

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5567:
URL: https://github.com/apache/hudi/pull/5567#issuecomment-1128361158

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "5a99503343f3f536b9c153ad4422e00eb0b3ff40",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8607",
       "triggerID" : "5a99503343f3f536b9c153ad4422e00eb0b3ff40",
       "triggerType" : "PUSH"
     }, {
       "hash" : "765c7c9d031deb7bdebead97032a61b4b9e5ad4a",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8694",
       "triggerID" : "765c7c9d031deb7bdebead97032a61b4b9e5ad4a",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 765c7c9d031deb7bdebead97032a61b4b9e5ad4a Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8694) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] YuweiXiao commented on a diff in pull request #5567: [RFC-53] Use Lock-Free Message Queue Disruptor Improving Hoodie Writing Efficiency

Posted by GitBox <gi...@apache.org>.
YuweiXiao commented on code in PR #5567:
URL: https://github.com/apache/hudi/pull/5567#discussion_r877708821


##########
rfc/rfc-53/rfc-53.md:
##########
@@ -0,0 +1,120 @@
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+# RFC-53: Use Lock-Free Message Queue Improving Hoodie Writing Efficiency
+
+
+## Proposers
+@zhangyue19921010
+
+## Approvers
+
+## Status
+
+JIRA: https://issues.apache.org/jira/browse/HUDI-3963
+
+
+## Abstract
+
+New option which use Lock-Free Message Queue called Disruptor as inner message queue to improve hoodie writing performance and optimize writing efficiency.
+
+Disruptor linked: https://lmax-exchange.github.io/disruptor/user-guide/index.html#_introduction
+
+
+## Background
+
+Based on master branch, hoodie consumes upstream data (Kafka or S3 files) into the lake is a standard production-consumption model. 
+Currently, hoodie uses `LinkedBlockingQueue` as a inner message queue between Producer and Consumer.
+
+However, this lock model may become the bottleneck of application throughput when data volume is much larger. 
+What's worse is that even if we increase the number of the executors, it is still difficult to improve the throughput.
+
+In other words, users may encounter throughput bottlenecks when writing data into hudi in some scenarios, and increasing the physical hardware scale and parameter tuning cannot solve this problem.
+
+This RFC is to solve the performance bottleneck problem caused by locking in some large data volume scenarios
+
+This RFC provides a new option which use Lock-Free Message Queue called Disruptor as inner message queue to 
+The advantages are that:
+ - Fully use all the cpu resources without lock blocking.
+ - Improving writing performance and efficiency
+ - Solve the potential performance bottlenecks causing by locking.
+
+
+## Implementation
+
+![](DisruptorExecutor.png)
+
+This RFC mainly does two things: One is to do the code abstraction about hoodie consuming upstream data and writing into hudi format.
+The other thing is to implement disruptor based producer, inner message queue executor and message handler based on this new abstraction.
+
+Firstly, briefly introduce code abstraction(take `[based-master]` as current logic/option, and `[rfc-new]` for new option provided by this rfc)
+- [abstract] `HoodieMessageQueue`: Hold the inner message queue, control the initialization of the inner message queue, 
+control its life cycle, and provide a unified insert api, speed limit, memory control and other enrich functions. The current implementations are as follows:
+    - [based-master] `BoundedInMemoryQueue` which hold a `LinkedBlockingQueue` as inner message queue.
+    - [rfc-new] `DisruptorMessageQueue` which hold a lock free ringbuffer called disruptor as inner message queue.
+- [interface] `HoodieProducer`: Controls the producer behaviors and life cycle of hoodie reading upstream data and writing it into the inner message queue.
+The current implementations are as follows:
+    - [based-master][abstract] `BoundedInMemoryQueueProducer` Producer for `BoundedInMemoryQueue`
+        - [based-master] `IteratorBasedQueueProducer` Iterator based producer which pulls entry from iterator and produces items into the `LinkedBlockingQueue`
+        - [based-master] `FunctionBasedQueueProducer` Buffer producer which allows custom functions to insert entries to the `LinkedBlockingQueue`
+    - [rfc-new][abstract] `DisruptorBasedProducer`Producer for `DisruptorMessageQueue`
+        - [rfc-new] `IteratorBasedDisruptorProducer` Iterator based producer which pulls entry from iterator and produces items into the `DisruptorMessageQueue`
+        - [rfc-new] `FunctionBasedDisruptorQueueProducer` Buffer producer which allows custom functions to insert entries to the `DisruptorMessageQueue`
+ - [interface] `HoodieConsumer` Control hoodie to read the data from inner message queue and write them as hudi data files, and execute callback function. 
+  The current implementations are as follows:
+    - [based-master] `BoundedInMemoryQueueConsumer` Consume entries directly from `LinkedBlockingQueue` and execute callback function.
+    - [rfc-new] `DisruptorMessageHandler` which hold the same `BoundedInMemoryQueueConsumer` instant mentioned before. Use `DisruptorMessageHandler` extracts each record in disruptor then
+    using `BoundedInMemoryQueueConsumer` writing hudi data file.
+- [abstract] `HoodieExecutor`: Executor which orchestrates concurrent producers and consumers communicating through a inner message queue.
+The current implementations are as follows:
+    - [based-master] `BoundedInMemoryExecutor` takes as input the size limit, queue producer(s), consumer and transformer and exposes API to orchestrate concurrent execution of these actors communicating through a central LinkedBlockingQueue.
+    - [rfc-new] `DisruptorExecutor` Control the initialization, life cycle of the disruptor, and coordinate the work of the producer, consumer, and ringbuffer related to the disruptor, etc.
+    
+Secondly, This rfc implements disruptor related producers, message handlers and executor which use this lock-free message queue based on the above abstraction. Some compents are introduced in the first part. In this phase, we discuss how to use disruptor in hoodie writing stages.
+
+The Disruptor is a library that provides a concurrent ring buffer data structure. It is designed to provide a low-latency, high-throughput work queue in asynchronous event processing architectures.
+
+We use the Disruptor multi-producer single-consumer working model:
+- Define `DisruptorPublisher` to register producers into Disruptor and control the produce behaviors including life cycle.
+- Define `DisruptorMessageHandler` to register consumers into Disruptor and write consumption data from disruptor to hudi data file
+- Define `HoodieDisruptorEvent` as the carrier of the hoodie message
+- Define `HoodieDisruptorEventFactory`: Pre-populate all the hoodie events to fill the RingBuffer. 
+We can use `HoodieDisruptorEventFactory` to create `HoodieDisruptorEvent` storing the data for sharing during exchange or parallel coordination of an event.
+- Expose some necessary parameters for the users with a proper default to tune in different scenarios.
+
+Finally, let me introduce the new parameters:
+  - `hoodie.write.executor.type`: Choose the type of executor to use, which orchestrates concurrent producers and consumers communicating through a inner message queue. 
+  Default value is `BOUNDED_IN_MEMORY_EXECUTOR` which used a bounded in-memory queue `LinkedBlockingQueue`. 
+  Also users could use `DISRUPTOR_EXECUTOR`, which use disruptor as a lock free message queue to gain better writing performance. 
+  Although `DISRUPTOR_EXECUTOR` is still an experimental feature.
+  - `hoodie.write.buffer.size`: The size of the Disruptor Executor ring buffer, must be power of 2.
+  - `hoodie.write.wait.strategy`: Strategy employed for making DisruptorExecutor wait for a cursor.
+
+4. limitation
+  For now, this disruptor executor is only supported for spark insert and spark bulk insert operation. Other operations like spark upsert is still on going.

Review Comment:
   Oh ok, didn't realize `FlinkLazyInsertIterable`. But looking at its usage, I found the removal of producer/consumer may be suitable for Flink engine. The producer is simply a iterator of `List<HoodieRecord>`, and the consumption is single-thread too.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] hudi-bot commented on pull request #5567: [RFC-53] Use Lock-Free Message Queue Disruptor Improving Hoodie Writing Efficiency

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5567:
URL: https://github.com/apache/hudi/pull/5567#issuecomment-1134141242

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "5a99503343f3f536b9c153ad4422e00eb0b3ff40",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8607",
       "triggerID" : "5a99503343f3f536b9c153ad4422e00eb0b3ff40",
       "triggerType" : "PUSH"
     }, {
       "hash" : "765c7c9d031deb7bdebead97032a61b4b9e5ad4a",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8694",
       "triggerID" : "765c7c9d031deb7bdebead97032a61b4b9e5ad4a",
       "triggerType" : "PUSH"
     }, {
       "hash" : "765c7c9d031deb7bdebead97032a61b4b9e5ad4a",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8694",
       "triggerID" : "1128525197",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "765c7c9d031deb7bdebead97032a61b4b9e5ad4a",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8694",
       "triggerID" : "1128983139",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "6edd9582998823a2c3dfd181c55f2f380c248e87",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8843",
       "triggerID" : "6edd9582998823a2c3dfd181c55f2f380c248e87",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 765c7c9d031deb7bdebead97032a61b4b9e5ad4a Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8694) 
   * 6edd9582998823a2c3dfd181c55f2f380c248e87 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8843) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] leesf commented on a diff in pull request #5567: [RFC-53] Use Lock-Free Message Queue Disruptor Improving Hoodie Writing Efficiency

Posted by GitBox <gi...@apache.org>.
leesf commented on code in PR #5567:
URL: https://github.com/apache/hudi/pull/5567#discussion_r873747502


##########
rfc/rfc-53/rfc-53.md:
##########
@@ -0,0 +1,120 @@
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+# RFC-53: Use Lock-Free Message Queue Improving Hoodie Writing Efficiency
+
+
+## Proposers
+@zhangyue19921010
+
+## Approvers
+
+## Status
+
+JIRA: https://issues.apache.org/jira/browse/HUDI-3963
+
+
+## Abstract
+
+New option which use Lock-Free Message Queue called Disruptor as inner message queue to improve hoodie writing performance and optimize writing efficiency.
+
+Disruptor linked: https://lmax-exchange.github.io/disruptor/user-guide/index.html#_introduction
+
+
+## Background
+
+Based on master branch, hoodie consumes upstream data (Kafka or S3 files) into the lake is a standard production-consumption model. 
+Currently, hoodie uses `LinkedBlockingQueue` as a inner message queue between Producer and Consumer.
+
+However, this lock model may become the bottleneck of application throughput when data volume is much larger. 
+What's worse is that even if we increase the number of the executors, it is still difficult to improve the throughput.
+
+In other words, users may encounter throughput bottlenecks when writing data into hudi in some scenarios, and increasing the physical hardware scale and parameter tuning cannot solve this problem.
+
+This RFC is to solve the performance bottleneck problem caused by locking in some large data volume scenarios
+
+This RFC provides a new option which use Lock-Free Message Queue called Disruptor as inner message queue to 
+The advantages are that:
+ - Fully use all the cpu resources without lock blocking.
+ - Improving writing performance and efficiency
+ - Solve the potential performance bottlenecks causing by locking.
+
+
+## Implementation
+
+![](DisruptorExecutor.png)
+
+This RFC mainly does two things: One is to do the code abstraction about hoodie consuming upstream data and writing into hudi format.
+The other thing is to implement disruptor based producer, inner message queue executor and message handler based on this new abstraction.
+
+Firstly, briefly introduce code abstraction(take `[based-master]` as current logic/option, and `[rfc-new]` for new option provided by this rfc)
+- [abstract] `HoodieMessageQueue`: Hold the inner message queue, control the initialization of the inner message queue, 
+control its life cycle, and provide a unified insert api, speed limit, memory control and other enrich functions. The current implementations are as follows:
+    - [based-master] `BoundedInMemoryQueue` which hold a `LinkedBlockingQueue` as inner message queue.
+    - [rfc-new] `DisruptorMessageQueue` which hold a lock free ringbuffer called disruptor as inner message queue.
+- [interface] `HoodieProducer`: Controls the producer behaviors and life cycle of hoodie reading upstream data and writing it into the inner message queue.
+The current implementations are as follows:
+    - [based-master][abstract] `BoundedInMemoryQueueProducer` Producer for `BoundedInMemoryQueue`
+        - [based-master] `IteratorBasedQueueProducer` Iterator based producer which pulls entry from iterator and produces items into the `LinkedBlockingQueue`
+        - [based-master] `FunctionBasedQueueProducer` Buffer producer which allows custom functions to insert entries to the `LinkedBlockingQueue`
+    - [rfc-new][abstract] `DisruptorBasedProducer`Producer for `DisruptorMessageQueue`
+        - [rfc-new] `IteratorBasedDisruptorProducer` Iterator based producer which pulls entry from iterator and produces items into the `DisruptorMessageQueue`
+        - [rfc-new] `FunctionBasedDisruptorQueueProducer` Buffer producer which allows custom functions to insert entries to the `DisruptorMessageQueue`
+ - [interface] `HoodieConsumer` Control hoodie to read the data from inner message queue and write them as hudi data files, and execute callback function. 
+  The current implementations are as follows:
+    - [based-master] `BoundedInMemoryQueueConsumer` Consume entries directly from `LinkedBlockingQueue` and execute callback function.
+    - [rfc-new] `DisruptorMessageHandler` which hold the same `BoundedInMemoryQueueConsumer` instant mentioned before. Use `DisruptorMessageHandler` extracts each record in disruptor then
+    using `BoundedInMemoryQueueConsumer` writing hudi data file.
+- [abstract] `HoodieExecutor`: Executor which orchestrates concurrent producers and consumers communicating through a inner message queue.
+The current implementations are as follows:
+    - [based-master] `BoundedInMemoryExecutor` takes as input the size limit, queue producer(s), consumer and transformer and exposes API to orchestrate concurrent execution of these actors communicating through a central LinkedBlockingQueue.
+    - [rfc-new] `DisruptorExecutor` Control the initialization, life cycle of the disruptor, and coordinate the work of the producer, consumer, and ringbuffer related to the disruptor, etc.
+    
+Secondly, This rfc implements disruptor related producers, message handlers and executor which use this lock-free message queue based on the above abstraction. Some compents are introduced in the first part. In this phase, we discuss how to use disruptor in hoodie writing stages.
+
+The Disruptor is a library that provides a concurrent ring buffer data structure. It is designed to provide a low-latency, high-throughput work queue in asynchronous event processing architectures.
+
+We use the Disruptor multi-producer single-consumer working model:
+- Define `DisruptorPublisher` to register producers into Disruptor and control the produce behaviors including life cycle.
+- Define `DisruptorMessageHandler` to register consumers into Disruptor and write consumption data from disruptor to hudi data file
+- Define `HoodieDisruptorEvent` as the carrier of the hoodie message
+- Define `HoodieDisruptorEventFactory`: Pre-populate all the hoodie events to fill the RingBuffer. 
+We can use `HoodieDisruptorEventFactory` to create `HoodieDisruptorEvent` storing the data for sharing during exchange or parallel coordination of an event.
+- Expose some necessary parameters for the users with a proper default to tune in different scenarios.
+
+Finally, let me introduce the new parameters:
+  - `hoodie.write.executor.type`: Choose the type of executor to use, which orchestrates concurrent producers and consumers communicating through a inner message queue. 
+  Default value is `BOUNDED_IN_MEMORY_EXECUTOR` which used a bounded in-memory queue `LinkedBlockingQueue`. 
+  Also users could use `DISRUPTOR_EXECUTOR`, which use disruptor as a lock free message queue to gain better writing performance. 
+  Although `DISRUPTOR_EXECUTOR` is still an experimental feature.
+  - `hoodie.write.buffer.size`: The size of the Disruptor Executor ring buffer, must be power of 2.

Review Comment:
   is there a recommended value for the buffer size?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] hudi-bot commented on pull request #5567: [RFC-53] Use Lock-Free Message Queue Disruptor Improving Hoodie Writing Efficiency

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5567:
URL: https://github.com/apache/hudi/pull/5567#issuecomment-1124855721

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "5a99503343f3f536b9c153ad4422e00eb0b3ff40",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8607",
       "triggerID" : "5a99503343f3f536b9c153ad4422e00eb0b3ff40",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 5a99503343f3f536b9c153ad4422e00eb0b3ff40 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8607) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] leesf commented on a diff in pull request #5567: [RFC-53] Use Lock-Free Message Queue Disruptor Improving Hoodie Writing Efficiency

Posted by GitBox <gi...@apache.org>.
leesf commented on code in PR #5567:
URL: https://github.com/apache/hudi/pull/5567#discussion_r873749107


##########
rfc/rfc-53/rfc-53.md:
##########
@@ -0,0 +1,120 @@
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+# RFC-53: Use Lock-Free Message Queue Improving Hoodie Writing Efficiency
+
+
+## Proposers
+@zhangyue19921010
+
+## Approvers
+
+## Status
+
+JIRA: https://issues.apache.org/jira/browse/HUDI-3963
+
+
+## Abstract
+
+New option which use Lock-Free Message Queue called Disruptor as inner message queue to improve hoodie writing performance and optimize writing efficiency.
+
+Disruptor linked: https://lmax-exchange.github.io/disruptor/user-guide/index.html#_introduction
+
+
+## Background
+
+Based on master branch, hoodie consumes upstream data (Kafka or S3 files) into the lake is a standard production-consumption model. 
+Currently, hoodie uses `LinkedBlockingQueue` as a inner message queue between Producer and Consumer.
+
+However, this lock model may become the bottleneck of application throughput when data volume is much larger. 
+What's worse is that even if we increase the number of the executors, it is still difficult to improve the throughput.
+
+In other words, users may encounter throughput bottlenecks when writing data into hudi in some scenarios, and increasing the physical hardware scale and parameter tuning cannot solve this problem.
+
+This RFC is to solve the performance bottleneck problem caused by locking in some large data volume scenarios
+
+This RFC provides a new option which use Lock-Free Message Queue called Disruptor as inner message queue to 
+The advantages are that:
+ - Fully use all the cpu resources without lock blocking.
+ - Improving writing performance and efficiency
+ - Solve the potential performance bottlenecks causing by locking.
+
+
+## Implementation
+
+![](DisruptorExecutor.png)
+
+This RFC mainly does two things: One is to do the code abstraction about hoodie consuming upstream data and writing into hudi format.
+The other thing is to implement disruptor based producer, inner message queue executor and message handler based on this new abstraction.
+
+Firstly, briefly introduce code abstraction(take `[based-master]` as current logic/option, and `[rfc-new]` for new option provided by this rfc)
+- [abstract] `HoodieMessageQueue`: Hold the inner message queue, control the initialization of the inner message queue, 
+control its life cycle, and provide a unified insert api, speed limit, memory control and other enrich functions. The current implementations are as follows:
+    - [based-master] `BoundedInMemoryQueue` which hold a `LinkedBlockingQueue` as inner message queue.
+    - [rfc-new] `DisruptorMessageQueue` which hold a lock free ringbuffer called disruptor as inner message queue.
+- [interface] `HoodieProducer`: Controls the producer behaviors and life cycle of hoodie reading upstream data and writing it into the inner message queue.
+The current implementations are as follows:
+    - [based-master][abstract] `BoundedInMemoryQueueProducer` Producer for `BoundedInMemoryQueue`
+        - [based-master] `IteratorBasedQueueProducer` Iterator based producer which pulls entry from iterator and produces items into the `LinkedBlockingQueue`
+        - [based-master] `FunctionBasedQueueProducer` Buffer producer which allows custom functions to insert entries to the `LinkedBlockingQueue`
+    - [rfc-new][abstract] `DisruptorBasedProducer`Producer for `DisruptorMessageQueue`
+        - [rfc-new] `IteratorBasedDisruptorProducer` Iterator based producer which pulls entry from iterator and produces items into the `DisruptorMessageQueue`
+        - [rfc-new] `FunctionBasedDisruptorQueueProducer` Buffer producer which allows custom functions to insert entries to the `DisruptorMessageQueue`
+ - [interface] `HoodieConsumer` Control hoodie to read the data from inner message queue and write them as hudi data files, and execute callback function. 
+  The current implementations are as follows:
+    - [based-master] `BoundedInMemoryQueueConsumer` Consume entries directly from `LinkedBlockingQueue` and execute callback function.
+    - [rfc-new] `DisruptorMessageHandler` which hold the same `BoundedInMemoryQueueConsumer` instant mentioned before. Use `DisruptorMessageHandler` extracts each record in disruptor then
+    using `BoundedInMemoryQueueConsumer` writing hudi data file.
+- [abstract] `HoodieExecutor`: Executor which orchestrates concurrent producers and consumers communicating through a inner message queue.
+The current implementations are as follows:
+    - [based-master] `BoundedInMemoryExecutor` takes as input the size limit, queue producer(s), consumer and transformer and exposes API to orchestrate concurrent execution of these actors communicating through a central LinkedBlockingQueue.
+    - [rfc-new] `DisruptorExecutor` Control the initialization, life cycle of the disruptor, and coordinate the work of the producer, consumer, and ringbuffer related to the disruptor, etc.
+    
+Secondly, This rfc implements disruptor related producers, message handlers and executor which use this lock-free message queue based on the above abstraction. Some compents are introduced in the first part. In this phase, we discuss how to use disruptor in hoodie writing stages.
+
+The Disruptor is a library that provides a concurrent ring buffer data structure. It is designed to provide a low-latency, high-throughput work queue in asynchronous event processing architectures.
+
+We use the Disruptor multi-producer single-consumer working model:
+- Define `DisruptorPublisher` to register producers into Disruptor and control the produce behaviors including life cycle.
+- Define `DisruptorMessageHandler` to register consumers into Disruptor and write consumption data from disruptor to hudi data file
+- Define `HoodieDisruptorEvent` as the carrier of the hoodie message
+- Define `HoodieDisruptorEventFactory`: Pre-populate all the hoodie events to fill the RingBuffer. 
+We can use `HoodieDisruptorEventFactory` to create `HoodieDisruptorEvent` storing the data for sharing during exchange or parallel coordination of an event.
+- Expose some necessary parameters for the users with a proper default to tune in different scenarios.
+
+Finally, let me introduce the new parameters:
+  - `hoodie.write.executor.type`: Choose the type of executor to use, which orchestrates concurrent producers and consumers communicating through a inner message queue. 
+  Default value is `BOUNDED_IN_MEMORY_EXECUTOR` which used a bounded in-memory queue `LinkedBlockingQueue`. 
+  Also users could use `DISRUPTOR_EXECUTOR`, which use disruptor as a lock free message queue to gain better writing performance. 
+  Although `DISRUPTOR_EXECUTOR` is still an experimental feature.
+  - `hoodie.write.buffer.size`: The size of the Disruptor Executor ring buffer, must be power of 2.
+  - `hoodie.write.wait.strategy`: Strategy employed for making DisruptorExecutor wait for a cursor.
+
+4. limitation
+  For now, this disruptor executor is only supported for spark insert and spark bulk insert operation. Other operations like spark upsert is still on going.

Review Comment:
   other engine such as flink is still on going?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] hudi-bot commented on pull request #5567: [RFC-53] Use Lock-Free Message Queue Disruptor Improving Hoodie Writing Efficiency

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5567:
URL: https://github.com/apache/hudi/pull/5567#issuecomment-1128306460

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "5a99503343f3f536b9c153ad4422e00eb0b3ff40",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8607",
       "triggerID" : "5a99503343f3f536b9c153ad4422e00eb0b3ff40",
       "triggerType" : "PUSH"
     }, {
       "hash" : "765c7c9d031deb7bdebead97032a61b4b9e5ad4a",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "765c7c9d031deb7bdebead97032a61b4b9e5ad4a",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 5a99503343f3f536b9c153ad4422e00eb0b3ff40 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8607) 
   * 765c7c9d031deb7bdebead97032a61b4b9e5ad4a UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] zhangyue19921010 commented on a diff in pull request #5567: [RFC-53] Use Lock-Free Message Queue Disruptor Improving Hoodie Writing Efficiency

Posted by GitBox <gi...@apache.org>.
zhangyue19921010 commented on code in PR #5567:
URL: https://github.com/apache/hudi/pull/5567#discussion_r878254732


##########
rfc/rfc-53/rfc-53.md:
##########
@@ -0,0 +1,120 @@
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+# RFC-53: Use Lock-Free Message Queue Improving Hoodie Writing Efficiency
+
+
+## Proposers
+@zhangyue19921010
+
+## Approvers
+
+## Status
+
+JIRA: https://issues.apache.org/jira/browse/HUDI-3963
+
+
+## Abstract
+
+New option which use Lock-Free Message Queue called Disruptor as inner message queue to improve hoodie writing performance and optimize writing efficiency.
+
+Disruptor linked: https://lmax-exchange.github.io/disruptor/user-guide/index.html#_introduction
+
+
+## Background
+
+Based on master branch, hoodie consumes upstream data (Kafka or S3 files) into the lake is a standard production-consumption model. 
+Currently, hoodie uses `LinkedBlockingQueue` as a inner message queue between Producer and Consumer.
+
+However, this lock model may become the bottleneck of application throughput when data volume is much larger. 
+What's worse is that even if we increase the number of the executors, it is still difficult to improve the throughput.
+
+In other words, users may encounter throughput bottlenecks when writing data into hudi in some scenarios, and increasing the physical hardware scale and parameter tuning cannot solve this problem.
+
+This RFC is to solve the performance bottleneck problem caused by locking in some large data volume scenarios
+
+This RFC provides a new option which use Lock-Free Message Queue called Disruptor as inner message queue to 
+The advantages are that:
+ - Fully use all the cpu resources without lock blocking.
+ - Improving writing performance and efficiency
+ - Solve the potential performance bottlenecks causing by locking.
+
+
+## Implementation
+
+![](DisruptorExecutor.png)
+
+This RFC mainly does two things: One is to do the code abstraction about hoodie consuming upstream data and writing into hudi format.
+The other thing is to implement disruptor based producer, inner message queue executor and message handler based on this new abstraction.
+
+Firstly, briefly introduce code abstraction(take `[based-master]` as current logic/option, and `[rfc-new]` for new option provided by this rfc)
+- [abstract] `HoodieMessageQueue`: Hold the inner message queue, control the initialization of the inner message queue, 
+control its life cycle, and provide a unified insert api, speed limit, memory control and other enrich functions. The current implementations are as follows:
+    - [based-master] `BoundedInMemoryQueue` which hold a `LinkedBlockingQueue` as inner message queue.
+    - [rfc-new] `DisruptorMessageQueue` which hold a lock free ringbuffer called disruptor as inner message queue.
+- [interface] `HoodieProducer`: Controls the producer behaviors and life cycle of hoodie reading upstream data and writing it into the inner message queue.
+The current implementations are as follows:
+    - [based-master][abstract] `BoundedInMemoryQueueProducer` Producer for `BoundedInMemoryQueue`
+        - [based-master] `IteratorBasedQueueProducer` Iterator based producer which pulls entry from iterator and produces items into the `LinkedBlockingQueue`
+        - [based-master] `FunctionBasedQueueProducer` Buffer producer which allows custom functions to insert entries to the `LinkedBlockingQueue`
+    - [rfc-new][abstract] `DisruptorBasedProducer`Producer for `DisruptorMessageQueue`
+        - [rfc-new] `IteratorBasedDisruptorProducer` Iterator based producer which pulls entry from iterator and produces items into the `DisruptorMessageQueue`
+        - [rfc-new] `FunctionBasedDisruptorQueueProducer` Buffer producer which allows custom functions to insert entries to the `DisruptorMessageQueue`
+ - [interface] `HoodieConsumer` Control hoodie to read the data from inner message queue and write them as hudi data files, and execute callback function. 
+  The current implementations are as follows:
+    - [based-master] `BoundedInMemoryQueueConsumer` Consume entries directly from `LinkedBlockingQueue` and execute callback function.
+    - [rfc-new] `DisruptorMessageHandler` which hold the same `BoundedInMemoryQueueConsumer` instant mentioned before. Use `DisruptorMessageHandler` extracts each record in disruptor then
+    using `BoundedInMemoryQueueConsumer` writing hudi data file.
+- [abstract] `HoodieExecutor`: Executor which orchestrates concurrent producers and consumers communicating through a inner message queue.
+The current implementations are as follows:
+    - [based-master] `BoundedInMemoryExecutor` takes as input the size limit, queue producer(s), consumer and transformer and exposes API to orchestrate concurrent execution of these actors communicating through a central LinkedBlockingQueue.
+    - [rfc-new] `DisruptorExecutor` Control the initialization, life cycle of the disruptor, and coordinate the work of the producer, consumer, and ringbuffer related to the disruptor, etc.
+    
+Secondly, This rfc implements disruptor related producers, message handlers and executor which use this lock-free message queue based on the above abstraction. Some compents are introduced in the first part. In this phase, we discuss how to use disruptor in hoodie writing stages.
+
+The Disruptor is a library that provides a concurrent ring buffer data structure. It is designed to provide a low-latency, high-throughput work queue in asynchronous event processing architectures.
+
+We use the Disruptor multi-producer single-consumer working model:
+- Define `DisruptorPublisher` to register producers into Disruptor and control the produce behaviors including life cycle.
+- Define `DisruptorMessageHandler` to register consumers into Disruptor and write consumption data from disruptor to hudi data file
+- Define `HoodieDisruptorEvent` as the carrier of the hoodie message
+- Define `HoodieDisruptorEventFactory`: Pre-populate all the hoodie events to fill the RingBuffer. 
+We can use `HoodieDisruptorEventFactory` to create `HoodieDisruptorEvent` storing the data for sharing during exchange or parallel coordination of an event.
+- Expose some necessary parameters for the users with a proper default to tune in different scenarios.
+
+Finally, let me introduce the new parameters:
+  - `hoodie.write.executor.type`: Choose the type of executor to use, which orchestrates concurrent producers and consumers communicating through a inner message queue. 
+  Default value is `BOUNDED_IN_MEMORY_EXECUTOR` which used a bounded in-memory queue `LinkedBlockingQueue`. 
+  Also users could use `DISRUPTOR_EXECUTOR`, which use disruptor as a lock free message queue to gain better writing performance. 
+  Although `DISRUPTOR_EXECUTOR` is still an experimental feature.
+  - `hoodie.write.buffer.size`: The size of the Disruptor Executor ring buffer, must be power of 2.
+  - `hoodie.write.wait.strategy`: Strategy employed for making DisruptorExecutor wait for a cursor.
+
+4. limitation
+  For now, this disruptor executor is only supported for spark insert and spark bulk insert operation. Other operations like spark upsert is still on going.

Review Comment:
   Sure will reflect these idea in this RFC ASAP.
   Thanks



##########
rfc/rfc-53/rfc-53.md:
##########
@@ -0,0 +1,120 @@
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+# RFC-53: Use Lock-Free Message Queue Improving Hoodie Writing Efficiency
+
+
+## Proposers
+@zhangyue19921010
+
+## Approvers
+
+## Status
+
+JIRA: https://issues.apache.org/jira/browse/HUDI-3963
+
+
+## Abstract
+
+New option which use Lock-Free Message Queue called Disruptor as inner message queue to improve hoodie writing performance and optimize writing efficiency.
+
+Disruptor linked: https://lmax-exchange.github.io/disruptor/user-guide/index.html#_introduction
+
+
+## Background
+
+Based on master branch, hoodie consumes upstream data (Kafka or S3 files) into the lake is a standard production-consumption model. 
+Currently, hoodie uses `LinkedBlockingQueue` as a inner message queue between Producer and Consumer.
+
+However, this lock model may become the bottleneck of application throughput when data volume is much larger. 
+What's worse is that even if we increase the number of the executors, it is still difficult to improve the throughput.
+
+In other words, users may encounter throughput bottlenecks when writing data into hudi in some scenarios, and increasing the physical hardware scale and parameter tuning cannot solve this problem.
+
+This RFC is to solve the performance bottleneck problem caused by locking in some large data volume scenarios
+
+This RFC provides a new option which use Lock-Free Message Queue called Disruptor as inner message queue to 
+The advantages are that:
+ - Fully use all the cpu resources without lock blocking.
+ - Improving writing performance and efficiency
+ - Solve the potential performance bottlenecks causing by locking.
+
+
+## Implementation
+
+![](DisruptorExecutor.png)
+
+This RFC mainly does two things: One is to do the code abstraction about hoodie consuming upstream data and writing into hudi format.
+The other thing is to implement disruptor based producer, inner message queue executor and message handler based on this new abstraction.
+
+Firstly, briefly introduce code abstraction(take `[based-master]` as current logic/option, and `[rfc-new]` for new option provided by this rfc)
+- [abstract] `HoodieMessageQueue`: Hold the inner message queue, control the initialization of the inner message queue, 
+control its life cycle, and provide a unified insert api, speed limit, memory control and other enrich functions. The current implementations are as follows:
+    - [based-master] `BoundedInMemoryQueue` which hold a `LinkedBlockingQueue` as inner message queue.
+    - [rfc-new] `DisruptorMessageQueue` which hold a lock free ringbuffer called disruptor as inner message queue.
+- [interface] `HoodieProducer`: Controls the producer behaviors and life cycle of hoodie reading upstream data and writing it into the inner message queue.
+The current implementations are as follows:
+    - [based-master][abstract] `BoundedInMemoryQueueProducer` Producer for `BoundedInMemoryQueue`
+        - [based-master] `IteratorBasedQueueProducer` Iterator based producer which pulls entry from iterator and produces items into the `LinkedBlockingQueue`
+        - [based-master] `FunctionBasedQueueProducer` Buffer producer which allows custom functions to insert entries to the `LinkedBlockingQueue`
+    - [rfc-new][abstract] `DisruptorBasedProducer`Producer for `DisruptorMessageQueue`
+        - [rfc-new] `IteratorBasedDisruptorProducer` Iterator based producer which pulls entry from iterator and produces items into the `DisruptorMessageQueue`
+        - [rfc-new] `FunctionBasedDisruptorQueueProducer` Buffer producer which allows custom functions to insert entries to the `DisruptorMessageQueue`
+ - [interface] `HoodieConsumer` Control hoodie to read the data from inner message queue and write them as hudi data files, and execute callback function. 
+  The current implementations are as follows:
+    - [based-master] `BoundedInMemoryQueueConsumer` Consume entries directly from `LinkedBlockingQueue` and execute callback function.
+    - [rfc-new] `DisruptorMessageHandler` which hold the same `BoundedInMemoryQueueConsumer` instant mentioned before. Use `DisruptorMessageHandler` extracts each record in disruptor then
+    using `BoundedInMemoryQueueConsumer` writing hudi data file.
+- [abstract] `HoodieExecutor`: Executor which orchestrates concurrent producers and consumers communicating through a inner message queue.
+The current implementations are as follows:
+    - [based-master] `BoundedInMemoryExecutor` takes as input the size limit, queue producer(s), consumer and transformer and exposes API to orchestrate concurrent execution of these actors communicating through a central LinkedBlockingQueue.
+    - [rfc-new] `DisruptorExecutor` Control the initialization, life cycle of the disruptor, and coordinate the work of the producer, consumer, and ringbuffer related to the disruptor, etc.
+    
+Secondly, This rfc implements disruptor related producers, message handlers and executor which use this lock-free message queue based on the above abstraction. Some compents are introduced in the first part. In this phase, we discuss how to use disruptor in hoodie writing stages.
+
+The Disruptor is a library that provides a concurrent ring buffer data structure. It is designed to provide a low-latency, high-throughput work queue in asynchronous event processing architectures.
+
+We use the Disruptor multi-producer single-consumer working model:
+- Define `DisruptorPublisher` to register producers into Disruptor and control the produce behaviors including life cycle.
+- Define `DisruptorMessageHandler` to register consumers into Disruptor and write consumption data from disruptor to hudi data file
+- Define `HoodieDisruptorEvent` as the carrier of the hoodie message
+- Define `HoodieDisruptorEventFactory`: Pre-populate all the hoodie events to fill the RingBuffer. 
+We can use `HoodieDisruptorEventFactory` to create `HoodieDisruptorEvent` storing the data for sharing during exchange or parallel coordination of an event.
+- Expose some necessary parameters for the users with a proper default to tune in different scenarios.
+
+Finally, let me introduce the new parameters:
+  - `hoodie.write.executor.type`: Choose the type of executor to use, which orchestrates concurrent producers and consumers communicating through a inner message queue. 
+  Default value is `BOUNDED_IN_MEMORY_EXECUTOR` which used a bounded in-memory queue `LinkedBlockingQueue`. 
+  Also users could use `DISRUPTOR_EXECUTOR`, which use disruptor as a lock free message queue to gain better writing performance. 
+  Although `DISRUPTOR_EXECUTOR` is still an experimental feature.
+  - `hoodie.write.buffer.size`: The size of the Disruptor Executor ring buffer, must be power of 2.
+  - `hoodie.write.wait.strategy`: Strategy employed for making DisruptorExecutor wait for a cursor.
+
+4. limitation
+  For now, this disruptor executor is only supported for spark insert and spark bulk insert operation. Other operations like spark upsert is still on going.

Review Comment:
   Sure will reflect it in this RFC ASAP.
   Thanks



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] zhangyue19921010 commented on a diff in pull request #5567: [RFC-53] Use Lock-Free Message Queue Disruptor Improving Hoodie Writing Efficiency

Posted by GitBox <gi...@apache.org>.
zhangyue19921010 commented on code in PR #5567:
URL: https://github.com/apache/hudi/pull/5567#discussion_r877702288


##########
rfc/rfc-53/rfc-53.md:
##########
@@ -0,0 +1,120 @@
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+# RFC-53: Use Lock-Free Message Queue Improving Hoodie Writing Efficiency
+
+
+## Proposers
+@zhangyue19921010
+
+## Approvers
+
+## Status
+
+JIRA: https://issues.apache.org/jira/browse/HUDI-3963
+
+
+## Abstract
+
+New option which use Lock-Free Message Queue called Disruptor as inner message queue to improve hoodie writing performance and optimize writing efficiency.
+
+Disruptor linked: https://lmax-exchange.github.io/disruptor/user-guide/index.html#_introduction
+
+
+## Background
+
+Based on master branch, hoodie consumes upstream data (Kafka or S3 files) into the lake is a standard production-consumption model. 
+Currently, hoodie uses `LinkedBlockingQueue` as a inner message queue between Producer and Consumer.
+
+However, this lock model may become the bottleneck of application throughput when data volume is much larger. 
+What's worse is that even if we increase the number of the executors, it is still difficult to improve the throughput.
+
+In other words, users may encounter throughput bottlenecks when writing data into hudi in some scenarios, and increasing the physical hardware scale and parameter tuning cannot solve this problem.
+
+This RFC is to solve the performance bottleneck problem caused by locking in some large data volume scenarios
+
+This RFC provides a new option which use Lock-Free Message Queue called Disruptor as inner message queue to 
+The advantages are that:
+ - Fully use all the cpu resources without lock blocking.
+ - Improving writing performance and efficiency
+ - Solve the potential performance bottlenecks causing by locking.
+
+
+## Implementation
+
+![](DisruptorExecutor.png)
+
+This RFC mainly does two things: One is to do the code abstraction about hoodie consuming upstream data and writing into hudi format.
+The other thing is to implement disruptor based producer, inner message queue executor and message handler based on this new abstraction.
+
+Firstly, briefly introduce code abstraction(take `[based-master]` as current logic/option, and `[rfc-new]` for new option provided by this rfc)
+- [abstract] `HoodieMessageQueue`: Hold the inner message queue, control the initialization of the inner message queue, 
+control its life cycle, and provide a unified insert api, speed limit, memory control and other enrich functions. The current implementations are as follows:
+    - [based-master] `BoundedInMemoryQueue` which hold a `LinkedBlockingQueue` as inner message queue.
+    - [rfc-new] `DisruptorMessageQueue` which hold a lock free ringbuffer called disruptor as inner message queue.
+- [interface] `HoodieProducer`: Controls the producer behaviors and life cycle of hoodie reading upstream data and writing it into the inner message queue.
+The current implementations are as follows:
+    - [based-master][abstract] `BoundedInMemoryQueueProducer` Producer for `BoundedInMemoryQueue`
+        - [based-master] `IteratorBasedQueueProducer` Iterator based producer which pulls entry from iterator and produces items into the `LinkedBlockingQueue`
+        - [based-master] `FunctionBasedQueueProducer` Buffer producer which allows custom functions to insert entries to the `LinkedBlockingQueue`
+    - [rfc-new][abstract] `DisruptorBasedProducer`Producer for `DisruptorMessageQueue`
+        - [rfc-new] `IteratorBasedDisruptorProducer` Iterator based producer which pulls entry from iterator and produces items into the `DisruptorMessageQueue`
+        - [rfc-new] `FunctionBasedDisruptorQueueProducer` Buffer producer which allows custom functions to insert entries to the `DisruptorMessageQueue`
+ - [interface] `HoodieConsumer` Control hoodie to read the data from inner message queue and write them as hudi data files, and execute callback function. 
+  The current implementations are as follows:
+    - [based-master] `BoundedInMemoryQueueConsumer` Consume entries directly from `LinkedBlockingQueue` and execute callback function.
+    - [rfc-new] `DisruptorMessageHandler` which hold the same `BoundedInMemoryQueueConsumer` instant mentioned before. Use `DisruptorMessageHandler` extracts each record in disruptor then
+    using `BoundedInMemoryQueueConsumer` writing hudi data file.
+- [abstract] `HoodieExecutor`: Executor which orchestrates concurrent producers and consumers communicating through a inner message queue.
+The current implementations are as follows:
+    - [based-master] `BoundedInMemoryExecutor` takes as input the size limit, queue producer(s), consumer and transformer and exposes API to orchestrate concurrent execution of these actors communicating through a central LinkedBlockingQueue.
+    - [rfc-new] `DisruptorExecutor` Control the initialization, life cycle of the disruptor, and coordinate the work of the producer, consumer, and ringbuffer related to the disruptor, etc.
+    
+Secondly, This rfc implements disruptor related producers, message handlers and executor which use this lock-free message queue based on the above abstraction. Some compents are introduced in the first part. In this phase, we discuss how to use disruptor in hoodie writing stages.
+
+The Disruptor is a library that provides a concurrent ring buffer data structure. It is designed to provide a low-latency, high-throughput work queue in asynchronous event processing architectures.
+
+We use the Disruptor multi-producer single-consumer working model:
+- Define `DisruptorPublisher` to register producers into Disruptor and control the produce behaviors including life cycle.
+- Define `DisruptorMessageHandler` to register consumers into Disruptor and write consumption data from disruptor to hudi data file
+- Define `HoodieDisruptorEvent` as the carrier of the hoodie message
+- Define `HoodieDisruptorEventFactory`: Pre-populate all the hoodie events to fill the RingBuffer. 
+We can use `HoodieDisruptorEventFactory` to create `HoodieDisruptorEvent` storing the data for sharing during exchange or parallel coordination of an event.
+- Expose some necessary parameters for the users with a proper default to tune in different scenarios.
+
+Finally, let me introduce the new parameters:
+  - `hoodie.write.executor.type`: Choose the type of executor to use, which orchestrates concurrent producers and consumers communicating through a inner message queue. 
+  Default value is `BOUNDED_IN_MEMORY_EXECUTOR` which used a bounded in-memory queue `LinkedBlockingQueue`. 
+  Also users could use `DISRUPTOR_EXECUTOR`, which use disruptor as a lock free message queue to gain better writing performance. 
+  Although `DISRUPTOR_EXECUTOR` is still an experimental feature.
+  - `hoodie.write.buffer.size`: The size of the Disruptor Executor ring buffer, must be power of 2.
+  - `hoodie.write.wait.strategy`: Strategy employed for making DisruptorExecutor wait for a cursor.
+
+4. limitation
+  For now, this disruptor executor is only supported for spark insert and spark bulk insert operation. Other operations like spark upsert is still on going.

Review Comment:
   > By the way, have you tried to remove the producer/consumer at all? The writing actually is a blocking single producer - single consumer cases. I guess it could perform better in some cases, as we save all the overhead of the message queue.
   
   Nice Catch! It may happen when using bulk_insert as single producer and single consumer.
   Yes, I can have a try as another option
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] zhangyue19921010 commented on pull request #5567: [RFC-53] Use Lock-Free Message Queue Disruptor Improving Hoodie Writing Efficiency

Posted by GitBox <gi...@apache.org>.
zhangyue19921010 commented on PR #5567:
URL: https://github.com/apache/hudi/pull/5567#issuecomment-1124849881

   Related PR is https://github.com/apache/hudi/pull/5416 here.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] zhangyue19921010 commented on a diff in pull request #5567: [RFC-53] Use Lock-Free Message Queue Disruptor Improving Hoodie Writing Efficiency

Posted by GitBox <gi...@apache.org>.
zhangyue19921010 commented on code in PR #5567:
URL: https://github.com/apache/hudi/pull/5567#discussion_r874282784


##########
rfc/rfc-53/rfc-53.md:
##########
@@ -0,0 +1,120 @@
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+# RFC-53: Use Lock-Free Message Queue Improving Hoodie Writing Efficiency
+
+
+## Proposers
+@zhangyue19921010
+
+## Approvers
+
+## Status
+
+JIRA: https://issues.apache.org/jira/browse/HUDI-3963
+
+
+## Abstract
+
+New option which use Lock-Free Message Queue called Disruptor as inner message queue to improve hoodie writing performance and optimize writing efficiency.
+
+Disruptor linked: https://lmax-exchange.github.io/disruptor/user-guide/index.html#_introduction
+
+
+## Background
+
+Based on master branch, hoodie consumes upstream data (Kafka or S3 files) into the lake is a standard production-consumption model. 
+Currently, hoodie uses `LinkedBlockingQueue` as a inner message queue between Producer and Consumer.
+
+However, this lock model may become the bottleneck of application throughput when data volume is much larger. 
+What's worse is that even if we increase the number of the executors, it is still difficult to improve the throughput.
+
+In other words, users may encounter throughput bottlenecks when writing data into hudi in some scenarios, and increasing the physical hardware scale and parameter tuning cannot solve this problem.
+
+This RFC is to solve the performance bottleneck problem caused by locking in some large data volume scenarios
+
+This RFC provides a new option which use Lock-Free Message Queue called Disruptor as inner message queue to 
+The advantages are that:
+ - Fully use all the cpu resources without lock blocking.
+ - Improving writing performance and efficiency
+ - Solve the potential performance bottlenecks causing by locking.
+
+
+## Implementation
+
+![](DisruptorExecutor.png)
+
+This RFC mainly does two things: One is to do the code abstraction about hoodie consuming upstream data and writing into hudi format.
+The other thing is to implement disruptor based producer, inner message queue executor and message handler based on this new abstraction.
+
+Firstly, briefly introduce code abstraction(take `[based-master]` as current logic/option, and `[rfc-new]` for new option provided by this rfc)
+- [abstract] `HoodieMessageQueue`: Hold the inner message queue, control the initialization of the inner message queue, 
+control its life cycle, and provide a unified insert api, speed limit, memory control and other enrich functions. The current implementations are as follows:
+    - [based-master] `BoundedInMemoryQueue` which hold a `LinkedBlockingQueue` as inner message queue.
+    - [rfc-new] `DisruptorMessageQueue` which hold a lock free ringbuffer called disruptor as inner message queue.
+- [interface] `HoodieProducer`: Controls the producer behaviors and life cycle of hoodie reading upstream data and writing it into the inner message queue.
+The current implementations are as follows:
+    - [based-master][abstract] `BoundedInMemoryQueueProducer` Producer for `BoundedInMemoryQueue`
+        - [based-master] `IteratorBasedQueueProducer` Iterator based producer which pulls entry from iterator and produces items into the `LinkedBlockingQueue`
+        - [based-master] `FunctionBasedQueueProducer` Buffer producer which allows custom functions to insert entries to the `LinkedBlockingQueue`
+    - [rfc-new][abstract] `DisruptorBasedProducer`Producer for `DisruptorMessageQueue`
+        - [rfc-new] `IteratorBasedDisruptorProducer` Iterator based producer which pulls entry from iterator and produces items into the `DisruptorMessageQueue`
+        - [rfc-new] `FunctionBasedDisruptorQueueProducer` Buffer producer which allows custom functions to insert entries to the `DisruptorMessageQueue`
+ - [interface] `HoodieConsumer` Control hoodie to read the data from inner message queue and write them as hudi data files, and execute callback function. 
+  The current implementations are as follows:
+    - [based-master] `BoundedInMemoryQueueConsumer` Consume entries directly from `LinkedBlockingQueue` and execute callback function.
+    - [rfc-new] `DisruptorMessageHandler` which hold the same `BoundedInMemoryQueueConsumer` instant mentioned before. Use `DisruptorMessageHandler` extracts each record in disruptor then
+    using `BoundedInMemoryQueueConsumer` writing hudi data file.
+- [abstract] `HoodieExecutor`: Executor which orchestrates concurrent producers and consumers communicating through a inner message queue.
+The current implementations are as follows:
+    - [based-master] `BoundedInMemoryExecutor` takes as input the size limit, queue producer(s), consumer and transformer and exposes API to orchestrate concurrent execution of these actors communicating through a central LinkedBlockingQueue.
+    - [rfc-new] `DisruptorExecutor` Control the initialization, life cycle of the disruptor, and coordinate the work of the producer, consumer, and ringbuffer related to the disruptor, etc.
+    
+Secondly, This rfc implements disruptor related producers, message handlers and executor which use this lock-free message queue based on the above abstraction. Some compents are introduced in the first part. In this phase, we discuss how to use disruptor in hoodie writing stages.
+
+The Disruptor is a library that provides a concurrent ring buffer data structure. It is designed to provide a low-latency, high-throughput work queue in asynchronous event processing architectures.
+
+We use the Disruptor multi-producer single-consumer working model:
+- Define `DisruptorPublisher` to register producers into Disruptor and control the produce behaviors including life cycle.
+- Define `DisruptorMessageHandler` to register consumers into Disruptor and write consumption data from disruptor to hudi data file
+- Define `HoodieDisruptorEvent` as the carrier of the hoodie message
+- Define `HoodieDisruptorEventFactory`: Pre-populate all the hoodie events to fill the RingBuffer. 
+We can use `HoodieDisruptorEventFactory` to create `HoodieDisruptorEvent` storing the data for sharing during exchange or parallel coordination of an event.
+- Expose some necessary parameters for the users with a proper default to tune in different scenarios.
+
+Finally, let me introduce the new parameters:
+  - `hoodie.write.executor.type`: Choose the type of executor to use, which orchestrates concurrent producers and consumers communicating through a inner message queue. 
+  Default value is `BOUNDED_IN_MEMORY_EXECUTOR` which used a bounded in-memory queue `LinkedBlockingQueue`. 
+  Also users could use `DISRUPTOR_EXECUTOR`, which use disruptor as a lock free message queue to gain better writing performance. 
+  Although `DISRUPTOR_EXECUTOR` is still an experimental feature.
+  - `hoodie.write.buffer.size`: The size of the Disruptor Executor ring buffer, must be power of 2.
+  - `hoodie.write.wait.strategy`: Strategy employed for making DisruptorExecutor wait for a cursor.

Review Comment:
   Yeap, there are four kinds of strategies here:
   1. BlockingWaitStrategy
   2. SleepingWaitStrategy
   3. YieldingWaitStrategy
   4. BusySpinWaitStrategy
   
   Also added the implementation details and suitable use cases.
   Actually these strategies are built in Disruptor and we want to expose this para to users through `hoodie.write.wait.strategy`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] hudi-bot commented on pull request #5567: [RFC-53] Use Lock-Free Message Queue Disruptor Improving Hoodie Writing Efficiency

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5567:
URL: https://github.com/apache/hudi/pull/5567#issuecomment-1128563984

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "5a99503343f3f536b9c153ad4422e00eb0b3ff40",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8607",
       "triggerID" : "5a99503343f3f536b9c153ad4422e00eb0b3ff40",
       "triggerType" : "PUSH"
     }, {
       "hash" : "765c7c9d031deb7bdebead97032a61b4b9e5ad4a",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8694",
       "triggerID" : "765c7c9d031deb7bdebead97032a61b4b9e5ad4a",
       "triggerType" : "PUSH"
     }, {
       "hash" : "765c7c9d031deb7bdebead97032a61b4b9e5ad4a",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8694",
       "triggerID" : "1128525197",
       "triggerType" : "MANUAL"
     } ]
   }-->
   ## CI report:
   
   * 765c7c9d031deb7bdebead97032a61b4b9e5ad4a Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8694) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] zhangyue19921010 commented on a diff in pull request #5567: [RFC-53] Use Lock-Free Message Queue Disruptor Improving Hoodie Writing Efficiency

Posted by GitBox <gi...@apache.org>.
zhangyue19921010 commented on code in PR #5567:
URL: https://github.com/apache/hudi/pull/5567#discussion_r874281672


##########
rfc/rfc-53/rfc-53.md:
##########
@@ -0,0 +1,120 @@
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+# RFC-53: Use Lock-Free Message Queue Improving Hoodie Writing Efficiency
+
+
+## Proposers
+@zhangyue19921010
+
+## Approvers
+
+## Status
+
+JIRA: https://issues.apache.org/jira/browse/HUDI-3963
+
+
+## Abstract
+
+New option which use Lock-Free Message Queue called Disruptor as inner message queue to improve hoodie writing performance and optimize writing efficiency.
+
+Disruptor linked: https://lmax-exchange.github.io/disruptor/user-guide/index.html#_introduction
+
+
+## Background
+
+Based on master branch, hoodie consumes upstream data (Kafka or S3 files) into the lake is a standard production-consumption model. 
+Currently, hoodie uses `LinkedBlockingQueue` as a inner message queue between Producer and Consumer.
+
+However, this lock model may become the bottleneck of application throughput when data volume is much larger. 
+What's worse is that even if we increase the number of the executors, it is still difficult to improve the throughput.
+
+In other words, users may encounter throughput bottlenecks when writing data into hudi in some scenarios, and increasing the physical hardware scale and parameter tuning cannot solve this problem.
+
+This RFC is to solve the performance bottleneck problem caused by locking in some large data volume scenarios
+
+This RFC provides a new option which use Lock-Free Message Queue called Disruptor as inner message queue to 
+The advantages are that:
+ - Fully use all the cpu resources without lock blocking.
+ - Improving writing performance and efficiency
+ - Solve the potential performance bottlenecks causing by locking.
+
+
+## Implementation
+
+![](DisruptorExecutor.png)
+
+This RFC mainly does two things: One is to do the code abstraction about hoodie consuming upstream data and writing into hudi format.
+The other thing is to implement disruptor based producer, inner message queue executor and message handler based on this new abstraction.
+
+Firstly, briefly introduce code abstraction(take `[based-master]` as current logic/option, and `[rfc-new]` for new option provided by this rfc)
+- [abstract] `HoodieMessageQueue`: Hold the inner message queue, control the initialization of the inner message queue, 
+control its life cycle, and provide a unified insert api, speed limit, memory control and other enrich functions. The current implementations are as follows:
+    - [based-master] `BoundedInMemoryQueue` which hold a `LinkedBlockingQueue` as inner message queue.
+    - [rfc-new] `DisruptorMessageQueue` which hold a lock free ringbuffer called disruptor as inner message queue.
+- [interface] `HoodieProducer`: Controls the producer behaviors and life cycle of hoodie reading upstream data and writing it into the inner message queue.
+The current implementations are as follows:
+    - [based-master][abstract] `BoundedInMemoryQueueProducer` Producer for `BoundedInMemoryQueue`
+        - [based-master] `IteratorBasedQueueProducer` Iterator based producer which pulls entry from iterator and produces items into the `LinkedBlockingQueue`
+        - [based-master] `FunctionBasedQueueProducer` Buffer producer which allows custom functions to insert entries to the `LinkedBlockingQueue`
+    - [rfc-new][abstract] `DisruptorBasedProducer`Producer for `DisruptorMessageQueue`
+        - [rfc-new] `IteratorBasedDisruptorProducer` Iterator based producer which pulls entry from iterator and produces items into the `DisruptorMessageQueue`
+        - [rfc-new] `FunctionBasedDisruptorQueueProducer` Buffer producer which allows custom functions to insert entries to the `DisruptorMessageQueue`
+ - [interface] `HoodieConsumer` Control hoodie to read the data from inner message queue and write them as hudi data files, and execute callback function. 
+  The current implementations are as follows:
+    - [based-master] `BoundedInMemoryQueueConsumer` Consume entries directly from `LinkedBlockingQueue` and execute callback function.
+    - [rfc-new] `DisruptorMessageHandler` which hold the same `BoundedInMemoryQueueConsumer` instant mentioned before. Use `DisruptorMessageHandler` extracts each record in disruptor then
+    using `BoundedInMemoryQueueConsumer` writing hudi data file.
+- [abstract] `HoodieExecutor`: Executor which orchestrates concurrent producers and consumers communicating through a inner message queue.
+The current implementations are as follows:
+    - [based-master] `BoundedInMemoryExecutor` takes as input the size limit, queue producer(s), consumer and transformer and exposes API to orchestrate concurrent execution of these actors communicating through a central LinkedBlockingQueue.
+    - [rfc-new] `DisruptorExecutor` Control the initialization, life cycle of the disruptor, and coordinate the work of the producer, consumer, and ringbuffer related to the disruptor, etc.
+    
+Secondly, This rfc implements disruptor related producers, message handlers and executor which use this lock-free message queue based on the above abstraction. Some compents are introduced in the first part. In this phase, we discuss how to use disruptor in hoodie writing stages.
+
+The Disruptor is a library that provides a concurrent ring buffer data structure. It is designed to provide a low-latency, high-throughput work queue in asynchronous event processing architectures.
+
+We use the Disruptor multi-producer single-consumer working model:
+- Define `DisruptorPublisher` to register producers into Disruptor and control the produce behaviors including life cycle.
+- Define `DisruptorMessageHandler` to register consumers into Disruptor and write consumption data from disruptor to hudi data file
+- Define `HoodieDisruptorEvent` as the carrier of the hoodie message
+- Define `HoodieDisruptorEventFactory`: Pre-populate all the hoodie events to fill the RingBuffer. 
+We can use `HoodieDisruptorEventFactory` to create `HoodieDisruptorEvent` storing the data for sharing during exchange or parallel coordination of an event.
+- Expose some necessary parameters for the users with a proper default to tune in different scenarios.
+
+Finally, let me introduce the new parameters:
+  - `hoodie.write.executor.type`: Choose the type of executor to use, which orchestrates concurrent producers and consumers communicating through a inner message queue. 
+  Default value is `BOUNDED_IN_MEMORY_EXECUTOR` which used a bounded in-memory queue `LinkedBlockingQueue`. 
+  Also users could use `DISRUPTOR_EXECUTOR`, which use disruptor as a lock free message queue to gain better writing performance. 
+  Although `DISRUPTOR_EXECUTOR` is still an experimental feature.
+  - `hoodie.write.buffer.size`: The size of the Disruptor Executor ring buffer, must be power of 2.

Review Comment:
   Actually, the default/recommended value is 1024. And added in RFC.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] hudi-bot commented on pull request #5567: [RFC-53] Use Lock-Free Message Queue Disruptor Improving Hoodie Writing Efficiency

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5567:
URL: https://github.com/apache/hudi/pull/5567#issuecomment-1128997640

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "5a99503343f3f536b9c153ad4422e00eb0b3ff40",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8607",
       "triggerID" : "5a99503343f3f536b9c153ad4422e00eb0b3ff40",
       "triggerType" : "PUSH"
     }, {
       "hash" : "765c7c9d031deb7bdebead97032a61b4b9e5ad4a",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8694",
       "triggerID" : "765c7c9d031deb7bdebead97032a61b4b9e5ad4a",
       "triggerType" : "PUSH"
     }, {
       "hash" : "765c7c9d031deb7bdebead97032a61b4b9e5ad4a",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8694",
       "triggerID" : "1128525197",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "765c7c9d031deb7bdebead97032a61b4b9e5ad4a",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "1128983139",
       "triggerType" : "MANUAL"
     } ]
   }-->
   ## CI report:
   
   * 765c7c9d031deb7bdebead97032a61b4b9e5ad4a Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8694) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] YuweiXiao commented on a diff in pull request #5567: [RFC-53] Use Lock-Free Message Queue Disruptor Improving Hoodie Writing Efficiency

Posted by GitBox <gi...@apache.org>.
YuweiXiao commented on code in PR #5567:
URL: https://github.com/apache/hudi/pull/5567#discussion_r877672462


##########
rfc/rfc-53/rfc-53.md:
##########
@@ -0,0 +1,120 @@
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+# RFC-53: Use Lock-Free Message Queue Improving Hoodie Writing Efficiency
+
+
+## Proposers
+@zhangyue19921010
+
+## Approvers
+
+## Status
+
+JIRA: https://issues.apache.org/jira/browse/HUDI-3963
+
+
+## Abstract
+
+New option which use Lock-Free Message Queue called Disruptor as inner message queue to improve hoodie writing performance and optimize writing efficiency.
+
+Disruptor linked: https://lmax-exchange.github.io/disruptor/user-guide/index.html#_introduction
+
+
+## Background
+
+Based on master branch, hoodie consumes upstream data (Kafka or S3 files) into the lake is a standard production-consumption model. 
+Currently, hoodie uses `LinkedBlockingQueue` as a inner message queue between Producer and Consumer.
+
+However, this lock model may become the bottleneck of application throughput when data volume is much larger. 
+What's worse is that even if we increase the number of the executors, it is still difficult to improve the throughput.
+
+In other words, users may encounter throughput bottlenecks when writing data into hudi in some scenarios, and increasing the physical hardware scale and parameter tuning cannot solve this problem.
+
+This RFC is to solve the performance bottleneck problem caused by locking in some large data volume scenarios
+
+This RFC provides a new option which use Lock-Free Message Queue called Disruptor as inner message queue to 
+The advantages are that:
+ - Fully use all the cpu resources without lock blocking.
+ - Improving writing performance and efficiency
+ - Solve the potential performance bottlenecks causing by locking.
+
+
+## Implementation
+
+![](DisruptorExecutor.png)
+
+This RFC mainly does two things: One is to do the code abstraction about hoodie consuming upstream data and writing into hudi format.
+The other thing is to implement disruptor based producer, inner message queue executor and message handler based on this new abstraction.
+
+Firstly, briefly introduce code abstraction(take `[based-master]` as current logic/option, and `[rfc-new]` for new option provided by this rfc)
+- [abstract] `HoodieMessageQueue`: Hold the inner message queue, control the initialization of the inner message queue, 
+control its life cycle, and provide a unified insert api, speed limit, memory control and other enrich functions. The current implementations are as follows:
+    - [based-master] `BoundedInMemoryQueue` which hold a `LinkedBlockingQueue` as inner message queue.
+    - [rfc-new] `DisruptorMessageQueue` which hold a lock free ringbuffer called disruptor as inner message queue.
+- [interface] `HoodieProducer`: Controls the producer behaviors and life cycle of hoodie reading upstream data and writing it into the inner message queue.
+The current implementations are as follows:
+    - [based-master][abstract] `BoundedInMemoryQueueProducer` Producer for `BoundedInMemoryQueue`
+        - [based-master] `IteratorBasedQueueProducer` Iterator based producer which pulls entry from iterator and produces items into the `LinkedBlockingQueue`
+        - [based-master] `FunctionBasedQueueProducer` Buffer producer which allows custom functions to insert entries to the `LinkedBlockingQueue`
+    - [rfc-new][abstract] `DisruptorBasedProducer`Producer for `DisruptorMessageQueue`
+        - [rfc-new] `IteratorBasedDisruptorProducer` Iterator based producer which pulls entry from iterator and produces items into the `DisruptorMessageQueue`
+        - [rfc-new] `FunctionBasedDisruptorQueueProducer` Buffer producer which allows custom functions to insert entries to the `DisruptorMessageQueue`
+ - [interface] `HoodieConsumer` Control hoodie to read the data from inner message queue and write them as hudi data files, and execute callback function. 
+  The current implementations are as follows:
+    - [based-master] `BoundedInMemoryQueueConsumer` Consume entries directly from `LinkedBlockingQueue` and execute callback function.
+    - [rfc-new] `DisruptorMessageHandler` which hold the same `BoundedInMemoryQueueConsumer` instant mentioned before. Use `DisruptorMessageHandler` extracts each record in disruptor then
+    using `BoundedInMemoryQueueConsumer` writing hudi data file.
+- [abstract] `HoodieExecutor`: Executor which orchestrates concurrent producers and consumers communicating through a inner message queue.
+The current implementations are as follows:
+    - [based-master] `BoundedInMemoryExecutor` takes as input the size limit, queue producer(s), consumer and transformer and exposes API to orchestrate concurrent execution of these actors communicating through a central LinkedBlockingQueue.
+    - [rfc-new] `DisruptorExecutor` Control the initialization, life cycle of the disruptor, and coordinate the work of the producer, consumer, and ringbuffer related to the disruptor, etc.
+    
+Secondly, This rfc implements disruptor related producers, message handlers and executor which use this lock-free message queue based on the above abstraction. Some compents are introduced in the first part. In this phase, we discuss how to use disruptor in hoodie writing stages.
+
+The Disruptor is a library that provides a concurrent ring buffer data structure. It is designed to provide a low-latency, high-throughput work queue in asynchronous event processing architectures.
+
+We use the Disruptor multi-producer single-consumer working model:
+- Define `DisruptorPublisher` to register producers into Disruptor and control the produce behaviors including life cycle.
+- Define `DisruptorMessageHandler` to register consumers into Disruptor and write consumption data from disruptor to hudi data file
+- Define `HoodieDisruptorEvent` as the carrier of the hoodie message
+- Define `HoodieDisruptorEventFactory`: Pre-populate all the hoodie events to fill the RingBuffer. 
+We can use `HoodieDisruptorEventFactory` to create `HoodieDisruptorEvent` storing the data for sharing during exchange or parallel coordination of an event.
+- Expose some necessary parameters for the users with a proper default to tune in different scenarios.
+
+Finally, let me introduce the new parameters:
+  - `hoodie.write.executor.type`: Choose the type of executor to use, which orchestrates concurrent producers and consumers communicating through a inner message queue. 
+  Default value is `BOUNDED_IN_MEMORY_EXECUTOR` which used a bounded in-memory queue `LinkedBlockingQueue`. 
+  Also users could use `DISRUPTOR_EXECUTOR`, which use disruptor as a lock free message queue to gain better writing performance. 
+  Although `DISRUPTOR_EXECUTOR` is still an experimental feature.
+  - `hoodie.write.buffer.size`: The size of the Disruptor Executor ring buffer, must be power of 2.
+  - `hoodie.write.wait.strategy`: Strategy employed for making DisruptorExecutor wait for a cursor.
+
+4. limitation
+  For now, this disruptor executor is only supported for spark insert and spark bulk insert operation. Other operations like spark upsert is still on going.

Review Comment:
   AFAIK, flink does not use spark's producer/consumer model to write the data. So I am wondering how the plan rollout for Flink engine. 
   
   By the way, have you tried to remove the producer/consumer at all? The writing actually is a blocking single producer - single consumer cases. I guess it could perform better in some cases, as we save all the overhead of the message queue. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] zhangyue19921010 commented on a diff in pull request #5567: [RFC-53] Use Lock-Free Message Queue Disruptor Improving Hoodie Writing Efficiency

Posted by GitBox <gi...@apache.org>.
zhangyue19921010 commented on code in PR #5567:
URL: https://github.com/apache/hudi/pull/5567#discussion_r877689202


##########
rfc/rfc-53/rfc-53.md:
##########
@@ -0,0 +1,120 @@
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+# RFC-53: Use Lock-Free Message Queue Improving Hoodie Writing Efficiency
+
+
+## Proposers
+@zhangyue19921010
+
+## Approvers
+
+## Status
+
+JIRA: https://issues.apache.org/jira/browse/HUDI-3963
+
+
+## Abstract
+
+New option which use Lock-Free Message Queue called Disruptor as inner message queue to improve hoodie writing performance and optimize writing efficiency.
+
+Disruptor linked: https://lmax-exchange.github.io/disruptor/user-guide/index.html#_introduction
+
+
+## Background
+
+Based on master branch, hoodie consumes upstream data (Kafka or S3 files) into the lake is a standard production-consumption model. 
+Currently, hoodie uses `LinkedBlockingQueue` as a inner message queue between Producer and Consumer.
+
+However, this lock model may become the bottleneck of application throughput when data volume is much larger. 
+What's worse is that even if we increase the number of the executors, it is still difficult to improve the throughput.
+
+In other words, users may encounter throughput bottlenecks when writing data into hudi in some scenarios, and increasing the physical hardware scale and parameter tuning cannot solve this problem.
+
+This RFC is to solve the performance bottleneck problem caused by locking in some large data volume scenarios
+
+This RFC provides a new option which use Lock-Free Message Queue called Disruptor as inner message queue to 
+The advantages are that:
+ - Fully use all the cpu resources without lock blocking.
+ - Improving writing performance and efficiency
+ - Solve the potential performance bottlenecks causing by locking.
+
+
+## Implementation
+
+![](DisruptorExecutor.png)
+
+This RFC mainly does two things: One is to do the code abstraction about hoodie consuming upstream data and writing into hudi format.
+The other thing is to implement disruptor based producer, inner message queue executor and message handler based on this new abstraction.
+
+Firstly, briefly introduce code abstraction(take `[based-master]` as current logic/option, and `[rfc-new]` for new option provided by this rfc)
+- [abstract] `HoodieMessageQueue`: Hold the inner message queue, control the initialization of the inner message queue, 
+control its life cycle, and provide a unified insert api, speed limit, memory control and other enrich functions. The current implementations are as follows:
+    - [based-master] `BoundedInMemoryQueue` which hold a `LinkedBlockingQueue` as inner message queue.
+    - [rfc-new] `DisruptorMessageQueue` which hold a lock free ringbuffer called disruptor as inner message queue.
+- [interface] `HoodieProducer`: Controls the producer behaviors and life cycle of hoodie reading upstream data and writing it into the inner message queue.
+The current implementations are as follows:
+    - [based-master][abstract] `BoundedInMemoryQueueProducer` Producer for `BoundedInMemoryQueue`
+        - [based-master] `IteratorBasedQueueProducer` Iterator based producer which pulls entry from iterator and produces items into the `LinkedBlockingQueue`
+        - [based-master] `FunctionBasedQueueProducer` Buffer producer which allows custom functions to insert entries to the `LinkedBlockingQueue`
+    - [rfc-new][abstract] `DisruptorBasedProducer`Producer for `DisruptorMessageQueue`
+        - [rfc-new] `IteratorBasedDisruptorProducer` Iterator based producer which pulls entry from iterator and produces items into the `DisruptorMessageQueue`
+        - [rfc-new] `FunctionBasedDisruptorQueueProducer` Buffer producer which allows custom functions to insert entries to the `DisruptorMessageQueue`
+ - [interface] `HoodieConsumer` Control hoodie to read the data from inner message queue and write them as hudi data files, and execute callback function. 
+  The current implementations are as follows:
+    - [based-master] `BoundedInMemoryQueueConsumer` Consume entries directly from `LinkedBlockingQueue` and execute callback function.
+    - [rfc-new] `DisruptorMessageHandler` which hold the same `BoundedInMemoryQueueConsumer` instant mentioned before. Use `DisruptorMessageHandler` extracts each record in disruptor then
+    using `BoundedInMemoryQueueConsumer` writing hudi data file.
+- [abstract] `HoodieExecutor`: Executor which orchestrates concurrent producers and consumers communicating through a inner message queue.
+The current implementations are as follows:
+    - [based-master] `BoundedInMemoryExecutor` takes as input the size limit, queue producer(s), consumer and transformer and exposes API to orchestrate concurrent execution of these actors communicating through a central LinkedBlockingQueue.
+    - [rfc-new] `DisruptorExecutor` Control the initialization, life cycle of the disruptor, and coordinate the work of the producer, consumer, and ringbuffer related to the disruptor, etc.
+    
+Secondly, This rfc implements disruptor related producers, message handlers and executor which use this lock-free message queue based on the above abstraction. Some compents are introduced in the first part. In this phase, we discuss how to use disruptor in hoodie writing stages.
+
+The Disruptor is a library that provides a concurrent ring buffer data structure. It is designed to provide a low-latency, high-throughput work queue in asynchronous event processing architectures.
+
+We use the Disruptor multi-producer single-consumer working model:
+- Define `DisruptorPublisher` to register producers into Disruptor and control the produce behaviors including life cycle.
+- Define `DisruptorMessageHandler` to register consumers into Disruptor and write consumption data from disruptor to hudi data file
+- Define `HoodieDisruptorEvent` as the carrier of the hoodie message
+- Define `HoodieDisruptorEventFactory`: Pre-populate all the hoodie events to fill the RingBuffer. 
+We can use `HoodieDisruptorEventFactory` to create `HoodieDisruptorEvent` storing the data for sharing during exchange or parallel coordination of an event.
+- Expose some necessary parameters for the users with a proper default to tune in different scenarios.
+
+Finally, let me introduce the new parameters:
+  - `hoodie.write.executor.type`: Choose the type of executor to use, which orchestrates concurrent producers and consumers communicating through a inner message queue. 
+  Default value is `BOUNDED_IN_MEMORY_EXECUTOR` which used a bounded in-memory queue `LinkedBlockingQueue`. 
+  Also users could use `DISRUPTOR_EXECUTOR`, which use disruptor as a lock free message queue to gain better writing performance. 
+  Although `DISRUPTOR_EXECUTOR` is still an experimental feature.
+  - `hoodie.write.buffer.size`: The size of the Disruptor Executor ring buffer, must be power of 2.
+  - `hoodie.write.wait.strategy`: Strategy employed for making DisruptorExecutor wait for a cursor.
+
+4. limitation
+  For now, this disruptor executor is only supported for spark insert and spark bulk insert operation. Other operations like spark upsert is still on going.

Review Comment:
   Hi @YuweiXiao Thanks a lot for your attention.
   It seems that flink use something like `FlinkLazyInsertIterable` which hold `BoundedInMemoryExecutor` to do ingestion works same as spark.
   So that maybe we can provide this new disruptorExecutor as a new option.
   
   <img width="1667" alt="截屏2022-05-20 上午10 52 03" src="https://user-images.githubusercontent.com/69956021/169439622-a2143e44-b609-4daf-a4a2-f64741db258c.png">
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] hudi-bot commented on pull request #5567: [RFC-53] Use Lock-Free Message Queue Disruptor Improving Hoodie Writing Efficiency

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5567:
URL: https://github.com/apache/hudi/pull/5567#issuecomment-1125164945

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "5a99503343f3f536b9c153ad4422e00eb0b3ff40",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8607",
       "triggerID" : "5a99503343f3f536b9c153ad4422e00eb0b3ff40",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 5a99503343f3f536b9c153ad4422e00eb0b3ff40 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8607) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] hudi-bot commented on pull request #5567: [RFC-53] Use Lock-Free Message Queue Disruptor Improving Hoodie Writing Efficiency

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5567:
URL: https://github.com/apache/hudi/pull/5567#issuecomment-1128326804

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "5a99503343f3f536b9c153ad4422e00eb0b3ff40",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8607",
       "triggerID" : "5a99503343f3f536b9c153ad4422e00eb0b3ff40",
       "triggerType" : "PUSH"
     }, {
       "hash" : "765c7c9d031deb7bdebead97032a61b4b9e5ad4a",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8694",
       "triggerID" : "765c7c9d031deb7bdebead97032a61b4b9e5ad4a",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 5a99503343f3f536b9c153ad4422e00eb0b3ff40 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8607) 
   * 765c7c9d031deb7bdebead97032a61b4b9e5ad4a Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8694) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] leesf commented on a diff in pull request #5567: [RFC-53] Use Lock-Free Message Queue Disruptor Improving Hoodie Writing Efficiency

Posted by GitBox <gi...@apache.org>.
leesf commented on code in PR #5567:
URL: https://github.com/apache/hudi/pull/5567#discussion_r873748690


##########
rfc/rfc-53/rfc-53.md:
##########
@@ -0,0 +1,120 @@
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+# RFC-53: Use Lock-Free Message Queue Improving Hoodie Writing Efficiency
+
+
+## Proposers
+@zhangyue19921010
+
+## Approvers
+
+## Status
+
+JIRA: https://issues.apache.org/jira/browse/HUDI-3963
+
+
+## Abstract
+
+New option which use Lock-Free Message Queue called Disruptor as inner message queue to improve hoodie writing performance and optimize writing efficiency.
+
+Disruptor linked: https://lmax-exchange.github.io/disruptor/user-guide/index.html#_introduction
+
+
+## Background
+
+Based on master branch, hoodie consumes upstream data (Kafka or S3 files) into the lake is a standard production-consumption model. 
+Currently, hoodie uses `LinkedBlockingQueue` as a inner message queue between Producer and Consumer.
+
+However, this lock model may become the bottleneck of application throughput when data volume is much larger. 
+What's worse is that even if we increase the number of the executors, it is still difficult to improve the throughput.
+
+In other words, users may encounter throughput bottlenecks when writing data into hudi in some scenarios, and increasing the physical hardware scale and parameter tuning cannot solve this problem.
+
+This RFC is to solve the performance bottleneck problem caused by locking in some large data volume scenarios
+
+This RFC provides a new option which use Lock-Free Message Queue called Disruptor as inner message queue to 
+The advantages are that:
+ - Fully use all the cpu resources without lock blocking.
+ - Improving writing performance and efficiency
+ - Solve the potential performance bottlenecks causing by locking.
+
+
+## Implementation
+
+![](DisruptorExecutor.png)
+
+This RFC mainly does two things: One is to do the code abstraction about hoodie consuming upstream data and writing into hudi format.
+The other thing is to implement disruptor based producer, inner message queue executor and message handler based on this new abstraction.
+
+Firstly, briefly introduce code abstraction(take `[based-master]` as current logic/option, and `[rfc-new]` for new option provided by this rfc)
+- [abstract] `HoodieMessageQueue`: Hold the inner message queue, control the initialization of the inner message queue, 
+control its life cycle, and provide a unified insert api, speed limit, memory control and other enrich functions. The current implementations are as follows:
+    - [based-master] `BoundedInMemoryQueue` which hold a `LinkedBlockingQueue` as inner message queue.
+    - [rfc-new] `DisruptorMessageQueue` which hold a lock free ringbuffer called disruptor as inner message queue.
+- [interface] `HoodieProducer`: Controls the producer behaviors and life cycle of hoodie reading upstream data and writing it into the inner message queue.
+The current implementations are as follows:
+    - [based-master][abstract] `BoundedInMemoryQueueProducer` Producer for `BoundedInMemoryQueue`
+        - [based-master] `IteratorBasedQueueProducer` Iterator based producer which pulls entry from iterator and produces items into the `LinkedBlockingQueue`
+        - [based-master] `FunctionBasedQueueProducer` Buffer producer which allows custom functions to insert entries to the `LinkedBlockingQueue`
+    - [rfc-new][abstract] `DisruptorBasedProducer`Producer for `DisruptorMessageQueue`
+        - [rfc-new] `IteratorBasedDisruptorProducer` Iterator based producer which pulls entry from iterator and produces items into the `DisruptorMessageQueue`
+        - [rfc-new] `FunctionBasedDisruptorQueueProducer` Buffer producer which allows custom functions to insert entries to the `DisruptorMessageQueue`
+ - [interface] `HoodieConsumer` Control hoodie to read the data from inner message queue and write them as hudi data files, and execute callback function. 
+  The current implementations are as follows:
+    - [based-master] `BoundedInMemoryQueueConsumer` Consume entries directly from `LinkedBlockingQueue` and execute callback function.
+    - [rfc-new] `DisruptorMessageHandler` which hold the same `BoundedInMemoryQueueConsumer` instant mentioned before. Use `DisruptorMessageHandler` extracts each record in disruptor then
+    using `BoundedInMemoryQueueConsumer` writing hudi data file.
+- [abstract] `HoodieExecutor`: Executor which orchestrates concurrent producers and consumers communicating through a inner message queue.
+The current implementations are as follows:
+    - [based-master] `BoundedInMemoryExecutor` takes as input the size limit, queue producer(s), consumer and transformer and exposes API to orchestrate concurrent execution of these actors communicating through a central LinkedBlockingQueue.
+    - [rfc-new] `DisruptorExecutor` Control the initialization, life cycle of the disruptor, and coordinate the work of the producer, consumer, and ringbuffer related to the disruptor, etc.
+    
+Secondly, This rfc implements disruptor related producers, message handlers and executor which use this lock-free message queue based on the above abstraction. Some compents are introduced in the first part. In this phase, we discuss how to use disruptor in hoodie writing stages.
+
+The Disruptor is a library that provides a concurrent ring buffer data structure. It is designed to provide a low-latency, high-throughput work queue in asynchronous event processing architectures.
+
+We use the Disruptor multi-producer single-consumer working model:
+- Define `DisruptorPublisher` to register producers into Disruptor and control the produce behaviors including life cycle.
+- Define `DisruptorMessageHandler` to register consumers into Disruptor and write consumption data from disruptor to hudi data file
+- Define `HoodieDisruptorEvent` as the carrier of the hoodie message
+- Define `HoodieDisruptorEventFactory`: Pre-populate all the hoodie events to fill the RingBuffer. 
+We can use `HoodieDisruptorEventFactory` to create `HoodieDisruptorEvent` storing the data for sharing during exchange or parallel coordination of an event.
+- Expose some necessary parameters for the users with a proper default to tune in different scenarios.
+
+Finally, let me introduce the new parameters:
+  - `hoodie.write.executor.type`: Choose the type of executor to use, which orchestrates concurrent producers and consumers communicating through a inner message queue. 
+  Default value is `BOUNDED_IN_MEMORY_EXECUTOR` which used a bounded in-memory queue `LinkedBlockingQueue`. 
+  Also users could use `DISRUPTOR_EXECUTOR`, which use disruptor as a lock free message queue to gain better writing performance. 
+  Although `DISRUPTOR_EXECUTOR` is still an experimental feature.
+  - `hoodie.write.buffer.size`: The size of the Disruptor Executor ring buffer, must be power of 2.
+  - `hoodie.write.wait.strategy`: Strategy employed for making DisruptorExecutor wait for a cursor.

Review Comment:
   would you please describe the strategy in more detail and how it works?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] zhangyue19921010 commented on a diff in pull request #5567: [RFC-53] Use Lock-Free Message Queue Disruptor Improving Hoodie Writing Efficiency

Posted by GitBox <gi...@apache.org>.
zhangyue19921010 commented on code in PR #5567:
URL: https://github.com/apache/hudi/pull/5567#discussion_r874283970


##########
rfc/rfc-53/rfc-53.md:
##########
@@ -0,0 +1,120 @@
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+# RFC-53: Use Lock-Free Message Queue Improving Hoodie Writing Efficiency
+
+
+## Proposers
+@zhangyue19921010
+
+## Approvers
+
+## Status
+
+JIRA: https://issues.apache.org/jira/browse/HUDI-3963
+
+
+## Abstract
+
+New option which use Lock-Free Message Queue called Disruptor as inner message queue to improve hoodie writing performance and optimize writing efficiency.
+
+Disruptor linked: https://lmax-exchange.github.io/disruptor/user-guide/index.html#_introduction
+
+
+## Background
+
+Based on master branch, hoodie consumes upstream data (Kafka or S3 files) into the lake is a standard production-consumption model. 
+Currently, hoodie uses `LinkedBlockingQueue` as a inner message queue between Producer and Consumer.
+
+However, this lock model may become the bottleneck of application throughput when data volume is much larger. 
+What's worse is that even if we increase the number of the executors, it is still difficult to improve the throughput.
+
+In other words, users may encounter throughput bottlenecks when writing data into hudi in some scenarios, and increasing the physical hardware scale and parameter tuning cannot solve this problem.
+
+This RFC is to solve the performance bottleneck problem caused by locking in some large data volume scenarios
+
+This RFC provides a new option which use Lock-Free Message Queue called Disruptor as inner message queue to 
+The advantages are that:
+ - Fully use all the cpu resources without lock blocking.
+ - Improving writing performance and efficiency
+ - Solve the potential performance bottlenecks causing by locking.
+
+
+## Implementation
+
+![](DisruptorExecutor.png)
+
+This RFC mainly does two things: One is to do the code abstraction about hoodie consuming upstream data and writing into hudi format.
+The other thing is to implement disruptor based producer, inner message queue executor and message handler based on this new abstraction.
+
+Firstly, briefly introduce code abstraction(take `[based-master]` as current logic/option, and `[rfc-new]` for new option provided by this rfc)
+- [abstract] `HoodieMessageQueue`: Hold the inner message queue, control the initialization of the inner message queue, 
+control its life cycle, and provide a unified insert api, speed limit, memory control and other enrich functions. The current implementations are as follows:
+    - [based-master] `BoundedInMemoryQueue` which hold a `LinkedBlockingQueue` as inner message queue.
+    - [rfc-new] `DisruptorMessageQueue` which hold a lock free ringbuffer called disruptor as inner message queue.
+- [interface] `HoodieProducer`: Controls the producer behaviors and life cycle of hoodie reading upstream data and writing it into the inner message queue.
+The current implementations are as follows:
+    - [based-master][abstract] `BoundedInMemoryQueueProducer` Producer for `BoundedInMemoryQueue`
+        - [based-master] `IteratorBasedQueueProducer` Iterator based producer which pulls entry from iterator and produces items into the `LinkedBlockingQueue`
+        - [based-master] `FunctionBasedQueueProducer` Buffer producer which allows custom functions to insert entries to the `LinkedBlockingQueue`
+    - [rfc-new][abstract] `DisruptorBasedProducer`Producer for `DisruptorMessageQueue`
+        - [rfc-new] `IteratorBasedDisruptorProducer` Iterator based producer which pulls entry from iterator and produces items into the `DisruptorMessageQueue`
+        - [rfc-new] `FunctionBasedDisruptorQueueProducer` Buffer producer which allows custom functions to insert entries to the `DisruptorMessageQueue`
+ - [interface] `HoodieConsumer` Control hoodie to read the data from inner message queue and write them as hudi data files, and execute callback function. 
+  The current implementations are as follows:
+    - [based-master] `BoundedInMemoryQueueConsumer` Consume entries directly from `LinkedBlockingQueue` and execute callback function.
+    - [rfc-new] `DisruptorMessageHandler` which hold the same `BoundedInMemoryQueueConsumer` instant mentioned before. Use `DisruptorMessageHandler` extracts each record in disruptor then
+    using `BoundedInMemoryQueueConsumer` writing hudi data file.
+- [abstract] `HoodieExecutor`: Executor which orchestrates concurrent producers and consumers communicating through a inner message queue.
+The current implementations are as follows:
+    - [based-master] `BoundedInMemoryExecutor` takes as input the size limit, queue producer(s), consumer and transformer and exposes API to orchestrate concurrent execution of these actors communicating through a central LinkedBlockingQueue.
+    - [rfc-new] `DisruptorExecutor` Control the initialization, life cycle of the disruptor, and coordinate the work of the producer, consumer, and ringbuffer related to the disruptor, etc.
+    
+Secondly, This rfc implements disruptor related producers, message handlers and executor which use this lock-free message queue based on the above abstraction. Some compents are introduced in the first part. In this phase, we discuss how to use disruptor in hoodie writing stages.
+
+The Disruptor is a library that provides a concurrent ring buffer data structure. It is designed to provide a low-latency, high-throughput work queue in asynchronous event processing architectures.
+
+We use the Disruptor multi-producer single-consumer working model:
+- Define `DisruptorPublisher` to register producers into Disruptor and control the produce behaviors including life cycle.
+- Define `DisruptorMessageHandler` to register consumers into Disruptor and write consumption data from disruptor to hudi data file
+- Define `HoodieDisruptorEvent` as the carrier of the hoodie message
+- Define `HoodieDisruptorEventFactory`: Pre-populate all the hoodie events to fill the RingBuffer. 
+We can use `HoodieDisruptorEventFactory` to create `HoodieDisruptorEvent` storing the data for sharing during exchange or parallel coordination of an event.
+- Expose some necessary parameters for the users with a proper default to tune in different scenarios.
+
+Finally, let me introduce the new parameters:
+  - `hoodie.write.executor.type`: Choose the type of executor to use, which orchestrates concurrent producers and consumers communicating through a inner message queue. 
+  Default value is `BOUNDED_IN_MEMORY_EXECUTOR` which used a bounded in-memory queue `LinkedBlockingQueue`. 
+  Also users could use `DISRUPTOR_EXECUTOR`, which use disruptor as a lock free message queue to gain better writing performance. 
+  Although `DISRUPTOR_EXECUTOR` is still an experimental feature.
+  - `hoodie.write.buffer.size`: The size of the Disruptor Executor ring buffer, must be power of 2.
+  - `hoodie.write.wait.strategy`: Strategy employed for making DisruptorExecutor wait for a cursor.
+
+4. limitation
+  For now, this disruptor executor is only supported for spark insert and spark bulk insert operation. Other operations like spark upsert is still on going.

Review Comment:
   Yes. Flink related writing is still on going as this feature is still in experimental.
   But it will not take much effort to migrate when spark ingestion is ready :)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] hudi-bot commented on pull request #5567: [RFC-53] Use Lock-Free Message Queue Disruptor Improving Hoodie Writing Efficiency

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5567:
URL: https://github.com/apache/hudi/pull/5567#issuecomment-1129001942

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "5a99503343f3f536b9c153ad4422e00eb0b3ff40",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8607",
       "triggerID" : "5a99503343f3f536b9c153ad4422e00eb0b3ff40",
       "triggerType" : "PUSH"
     }, {
       "hash" : "765c7c9d031deb7bdebead97032a61b4b9e5ad4a",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8694",
       "triggerID" : "765c7c9d031deb7bdebead97032a61b4b9e5ad4a",
       "triggerType" : "PUSH"
     }, {
       "hash" : "765c7c9d031deb7bdebead97032a61b4b9e5ad4a",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8694",
       "triggerID" : "1128525197",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "765c7c9d031deb7bdebead97032a61b4b9e5ad4a",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8694",
       "triggerID" : "1128983139",
       "triggerType" : "MANUAL"
     } ]
   }-->
   ## CI report:
   
   * 765c7c9d031deb7bdebead97032a61b4b9e5ad4a Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8694) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] leesf commented on a diff in pull request #5567: [RFC-53] Use Lock-Free Message Queue Disruptor Improving Hoodie Writing Efficiency

Posted by GitBox <gi...@apache.org>.
leesf commented on code in PR #5567:
URL: https://github.com/apache/hudi/pull/5567#discussion_r878201800


##########
rfc/rfc-53/rfc-53.md:
##########
@@ -0,0 +1,120 @@
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+# RFC-53: Use Lock-Free Message Queue Improving Hoodie Writing Efficiency
+
+
+## Proposers
+@zhangyue19921010
+
+## Approvers
+
+## Status
+
+JIRA: https://issues.apache.org/jira/browse/HUDI-3963
+
+
+## Abstract
+
+New option which use Lock-Free Message Queue called Disruptor as inner message queue to improve hoodie writing performance and optimize writing efficiency.
+
+Disruptor linked: https://lmax-exchange.github.io/disruptor/user-guide/index.html#_introduction
+
+
+## Background
+
+Based on master branch, hoodie consumes upstream data (Kafka or S3 files) into the lake is a standard production-consumption model. 
+Currently, hoodie uses `LinkedBlockingQueue` as a inner message queue between Producer and Consumer.
+
+However, this lock model may become the bottleneck of application throughput when data volume is much larger. 
+What's worse is that even if we increase the number of the executors, it is still difficult to improve the throughput.
+
+In other words, users may encounter throughput bottlenecks when writing data into hudi in some scenarios, and increasing the physical hardware scale and parameter tuning cannot solve this problem.
+
+This RFC is to solve the performance bottleneck problem caused by locking in some large data volume scenarios
+
+This RFC provides a new option which use Lock-Free Message Queue called Disruptor as inner message queue to 
+The advantages are that:
+ - Fully use all the cpu resources without lock blocking.
+ - Improving writing performance and efficiency
+ - Solve the potential performance bottlenecks causing by locking.
+
+
+## Implementation
+
+![](DisruptorExecutor.png)
+
+This RFC mainly does two things: One is to do the code abstraction about hoodie consuming upstream data and writing into hudi format.
+The other thing is to implement disruptor based producer, inner message queue executor and message handler based on this new abstraction.
+
+Firstly, briefly introduce code abstraction(take `[based-master]` as current logic/option, and `[rfc-new]` for new option provided by this rfc)
+- [abstract] `HoodieMessageQueue`: Hold the inner message queue, control the initialization of the inner message queue, 
+control its life cycle, and provide a unified insert api, speed limit, memory control and other enrich functions. The current implementations are as follows:
+    - [based-master] `BoundedInMemoryQueue` which hold a `LinkedBlockingQueue` as inner message queue.
+    - [rfc-new] `DisruptorMessageQueue` which hold a lock free ringbuffer called disruptor as inner message queue.
+- [interface] `HoodieProducer`: Controls the producer behaviors and life cycle of hoodie reading upstream data and writing it into the inner message queue.
+The current implementations are as follows:
+    - [based-master][abstract] `BoundedInMemoryQueueProducer` Producer for `BoundedInMemoryQueue`
+        - [based-master] `IteratorBasedQueueProducer` Iterator based producer which pulls entry from iterator and produces items into the `LinkedBlockingQueue`
+        - [based-master] `FunctionBasedQueueProducer` Buffer producer which allows custom functions to insert entries to the `LinkedBlockingQueue`
+    - [rfc-new][abstract] `DisruptorBasedProducer`Producer for `DisruptorMessageQueue`
+        - [rfc-new] `IteratorBasedDisruptorProducer` Iterator based producer which pulls entry from iterator and produces items into the `DisruptorMessageQueue`
+        - [rfc-new] `FunctionBasedDisruptorQueueProducer` Buffer producer which allows custom functions to insert entries to the `DisruptorMessageQueue`
+ - [interface] `HoodieConsumer` Control hoodie to read the data from inner message queue and write them as hudi data files, and execute callback function. 
+  The current implementations are as follows:
+    - [based-master] `BoundedInMemoryQueueConsumer` Consume entries directly from `LinkedBlockingQueue` and execute callback function.
+    - [rfc-new] `DisruptorMessageHandler` which hold the same `BoundedInMemoryQueueConsumer` instant mentioned before. Use `DisruptorMessageHandler` extracts each record in disruptor then
+    using `BoundedInMemoryQueueConsumer` writing hudi data file.
+- [abstract] `HoodieExecutor`: Executor which orchestrates concurrent producers and consumers communicating through a inner message queue.
+The current implementations are as follows:
+    - [based-master] `BoundedInMemoryExecutor` takes as input the size limit, queue producer(s), consumer and transformer and exposes API to orchestrate concurrent execution of these actors communicating through a central LinkedBlockingQueue.
+    - [rfc-new] `DisruptorExecutor` Control the initialization, life cycle of the disruptor, and coordinate the work of the producer, consumer, and ringbuffer related to the disruptor, etc.
+    
+Secondly, This rfc implements disruptor related producers, message handlers and executor which use this lock-free message queue based on the above abstraction. Some compents are introduced in the first part. In this phase, we discuss how to use disruptor in hoodie writing stages.
+
+The Disruptor is a library that provides a concurrent ring buffer data structure. It is designed to provide a low-latency, high-throughput work queue in asynchronous event processing architectures.
+
+We use the Disruptor multi-producer single-consumer working model:
+- Define `DisruptorPublisher` to register producers into Disruptor and control the produce behaviors including life cycle.
+- Define `DisruptorMessageHandler` to register consumers into Disruptor and write consumption data from disruptor to hudi data file
+- Define `HoodieDisruptorEvent` as the carrier of the hoodie message
+- Define `HoodieDisruptorEventFactory`: Pre-populate all the hoodie events to fill the RingBuffer. 
+We can use `HoodieDisruptorEventFactory` to create `HoodieDisruptorEvent` storing the data for sharing during exchange or parallel coordination of an event.
+- Expose some necessary parameters for the users with a proper default to tune in different scenarios.
+
+Finally, let me introduce the new parameters:
+  - `hoodie.write.executor.type`: Choose the type of executor to use, which orchestrates concurrent producers and consumers communicating through a inner message queue. 
+  Default value is `BOUNDED_IN_MEMORY_EXECUTOR` which used a bounded in-memory queue `LinkedBlockingQueue`. 
+  Also users could use `DISRUPTOR_EXECUTOR`, which use disruptor as a lock free message queue to gain better writing performance. 
+  Although `DISRUPTOR_EXECUTOR` is still an experimental feature.
+  - `hoodie.write.buffer.size`: The size of the Disruptor Executor ring buffer, must be power of 2.
+  - `hoodie.write.wait.strategy`: Strategy employed for making DisruptorExecutor wait for a cursor.
+
+4. limitation
+  For now, this disruptor executor is only supported for spark insert and spark bulk insert operation. Other operations like spark upsert is still on going.

Review Comment:
   > > By the way, have you tried to remove the producer/consumer at all? The writing actually is a blocking single producer - single consumer cases. I guess it could perform better in some cases, as we save all the overhead of the message queue.
   > 
   > Nice Catch! It may happen when using bulk_insert as single producer and single consumer. Yes, I can have a try as another option
   
   Can we also reflect it on the RFC ?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] zhangyue19921010 commented on a diff in pull request #5567: [RFC-53] Use Lock-Free Message Queue Disruptor Improving Hoodie Writing Efficiency

Posted by GitBox <gi...@apache.org>.
zhangyue19921010 commented on code in PR #5567:
URL: https://github.com/apache/hudi/pull/5567#discussion_r877702032


##########
rfc/rfc-53/rfc-53.md:
##########
@@ -0,0 +1,120 @@
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+# RFC-53: Use Lock-Free Message Queue Improving Hoodie Writing Efficiency
+
+
+## Proposers
+@zhangyue19921010
+
+## Approvers
+
+## Status
+
+JIRA: https://issues.apache.org/jira/browse/HUDI-3963
+
+
+## Abstract
+
+New option which use Lock-Free Message Queue called Disruptor as inner message queue to improve hoodie writing performance and optimize writing efficiency.
+
+Disruptor linked: https://lmax-exchange.github.io/disruptor/user-guide/index.html#_introduction
+
+
+## Background
+
+Based on master branch, hoodie consumes upstream data (Kafka or S3 files) into the lake is a standard production-consumption model. 
+Currently, hoodie uses `LinkedBlockingQueue` as a inner message queue between Producer and Consumer.
+
+However, this lock model may become the bottleneck of application throughput when data volume is much larger. 
+What's worse is that even if we increase the number of the executors, it is still difficult to improve the throughput.
+
+In other words, users may encounter throughput bottlenecks when writing data into hudi in some scenarios, and increasing the physical hardware scale and parameter tuning cannot solve this problem.
+
+This RFC is to solve the performance bottleneck problem caused by locking in some large data volume scenarios
+
+This RFC provides a new option which use Lock-Free Message Queue called Disruptor as inner message queue to 
+The advantages are that:
+ - Fully use all the cpu resources without lock blocking.
+ - Improving writing performance and efficiency
+ - Solve the potential performance bottlenecks causing by locking.
+
+
+## Implementation
+
+![](DisruptorExecutor.png)
+
+This RFC mainly does two things: One is to do the code abstraction about hoodie consuming upstream data and writing into hudi format.
+The other thing is to implement disruptor based producer, inner message queue executor and message handler based on this new abstraction.
+
+Firstly, briefly introduce code abstraction(take `[based-master]` as current logic/option, and `[rfc-new]` for new option provided by this rfc)
+- [abstract] `HoodieMessageQueue`: Hold the inner message queue, control the initialization of the inner message queue, 
+control its life cycle, and provide a unified insert api, speed limit, memory control and other enrich functions. The current implementations are as follows:
+    - [based-master] `BoundedInMemoryQueue` which hold a `LinkedBlockingQueue` as inner message queue.
+    - [rfc-new] `DisruptorMessageQueue` which hold a lock free ringbuffer called disruptor as inner message queue.
+- [interface] `HoodieProducer`: Controls the producer behaviors and life cycle of hoodie reading upstream data and writing it into the inner message queue.
+The current implementations are as follows:
+    - [based-master][abstract] `BoundedInMemoryQueueProducer` Producer for `BoundedInMemoryQueue`
+        - [based-master] `IteratorBasedQueueProducer` Iterator based producer which pulls entry from iterator and produces items into the `LinkedBlockingQueue`
+        - [based-master] `FunctionBasedQueueProducer` Buffer producer which allows custom functions to insert entries to the `LinkedBlockingQueue`
+    - [rfc-new][abstract] `DisruptorBasedProducer`Producer for `DisruptorMessageQueue`
+        - [rfc-new] `IteratorBasedDisruptorProducer` Iterator based producer which pulls entry from iterator and produces items into the `DisruptorMessageQueue`
+        - [rfc-new] `FunctionBasedDisruptorQueueProducer` Buffer producer which allows custom functions to insert entries to the `DisruptorMessageQueue`
+ - [interface] `HoodieConsumer` Control hoodie to read the data from inner message queue and write them as hudi data files, and execute callback function. 
+  The current implementations are as follows:
+    - [based-master] `BoundedInMemoryQueueConsumer` Consume entries directly from `LinkedBlockingQueue` and execute callback function.
+    - [rfc-new] `DisruptorMessageHandler` which hold the same `BoundedInMemoryQueueConsumer` instant mentioned before. Use `DisruptorMessageHandler` extracts each record in disruptor then
+    using `BoundedInMemoryQueueConsumer` writing hudi data file.
+- [abstract] `HoodieExecutor`: Executor which orchestrates concurrent producers and consumers communicating through a inner message queue.
+The current implementations are as follows:
+    - [based-master] `BoundedInMemoryExecutor` takes as input the size limit, queue producer(s), consumer and transformer and exposes API to orchestrate concurrent execution of these actors communicating through a central LinkedBlockingQueue.
+    - [rfc-new] `DisruptorExecutor` Control the initialization, life cycle of the disruptor, and coordinate the work of the producer, consumer, and ringbuffer related to the disruptor, etc.
+    
+Secondly, This rfc implements disruptor related producers, message handlers and executor which use this lock-free message queue based on the above abstraction. Some compents are introduced in the first part. In this phase, we discuss how to use disruptor in hoodie writing stages.
+
+The Disruptor is a library that provides a concurrent ring buffer data structure. It is designed to provide a low-latency, high-throughput work queue in asynchronous event processing architectures.
+
+We use the Disruptor multi-producer single-consumer working model:
+- Define `DisruptorPublisher` to register producers into Disruptor and control the produce behaviors including life cycle.
+- Define `DisruptorMessageHandler` to register consumers into Disruptor and write consumption data from disruptor to hudi data file
+- Define `HoodieDisruptorEvent` as the carrier of the hoodie message
+- Define `HoodieDisruptorEventFactory`: Pre-populate all the hoodie events to fill the RingBuffer. 
+We can use `HoodieDisruptorEventFactory` to create `HoodieDisruptorEvent` storing the data for sharing during exchange or parallel coordination of an event.
+- Expose some necessary parameters for the users with a proper default to tune in different scenarios.
+
+Finally, let me introduce the new parameters:
+  - `hoodie.write.executor.type`: Choose the type of executor to use, which orchestrates concurrent producers and consumers communicating through a inner message queue. 
+  Default value is `BOUNDED_IN_MEMORY_EXECUTOR` which used a bounded in-memory queue `LinkedBlockingQueue`. 
+  Also users could use `DISRUPTOR_EXECUTOR`, which use disruptor as a lock free message queue to gain better writing performance. 
+  Although `DISRUPTOR_EXECUTOR` is still an experimental feature.
+  - `hoodie.write.buffer.size`: The size of the Disruptor Executor ring buffer, must be power of 2.
+  - `hoodie.write.wait.strategy`: Strategy employed for making DisruptorExecutor wait for a cursor.
+
+4. limitation
+  For now, this disruptor executor is only supported for spark insert and spark bulk insert operation. Other operations like spark upsert is still on going.

Review Comment:
   ```
   By the way, have you tried to remove the producer/consumer at all? The writing actually is a blocking single producer - single consumer cases. I guess it could perform better in some cases, as we save all the overhead of the message queue.
   ```
   Nice Catch! It may happen when using bulk_insert as single producer and single consumer.
   Yes, I can have a try as another option



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org