You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@community.apache.org by "Rongtong Jin (Jira)" <ji...@apache.org> on 2023/03/18 14:30:00 UTC
[jira] [Updated] (COMDEV-522) RocketMQ DLedger Controller Performance Optimization

     [ https://issues.apache.org/jira/browse/COMDEV-522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rongtong Jin updated COMDEV-522:
--------------------------------
    Description: 
*Apache RocketMQ*

Apache RocketMQ is a distributed messaging and streaming platform with low latency, high performance and reliability, trillion-level capacity, and flexible scalability.
Page: [https://rocketmq.apache.org|https://rocketmq.apache.org/]
Repo: [https://github.com/apache/rocketmq]

*Background*

RocketMQ 5.0 introduced a new component, the controller, which controls the high availability master-slave switch in multi-replica scenarios. It uses the DLedger Raft library as a consensus replication state machine for metadata. As a completely independent component, it can run normally in some scenarios, but in large-scale clusters, it is necessary to maintain a large number of broker groups, which is a great challenge for operational capabilities and resource waste. When dealing with a large number of Broker groups, we need to optimize performance in large-scale scenarios, leveraging the high-performance writing of DLedger itself and performing some optimization for the current Controller architecture.

*Task*

1. Polish the usage of DLedger

Currently, on the Controller side, a task queue single thread is used for requesting reads and writes to DLedger, that is, only one read/write request can be processed at a time. However, DLedger itself implements many optimizations for multi-client reads and writes and can ensure linear consistency reading. However, now all read and write processing is performed using a single logical DLedger client, which will become a serious performance bottleneck in large-scale scenarios.

2. Optimization of DLedger features usage

DLedger itself can implement many optimizations, such as ReadIndex read and FollowerRead read. After implementation, we can fully leverage the performance of reads. Currently, all Broker nodes communicate with the Leader node of the Controller. In large-scale scenarios, this will cause the requests of each Controller group to be concentrated on the Leader node, and the other Follower nodes will not share the request processing of the Leader, which will cause single-point performance bottlenecks for the Leader.

3. Full asynchronous + parallel processing

Currently, DLedger itself is fully asynchronous, but on the Controller side, all requests for the DLedger side are synchronized, and many Controller-side operations are performed synchronously in a single thread, such as heartbeat checks and other timed tasks. In large-scale scenarios, the logic of these single-threaded synchronous operations will block a large number of requests from Broker-side, so asynchronous + parallel processing can be used for optimization.

4. Correctness testing and performance testing

After completing the above optimizations, it is necessary to conduct correctness testing on the new version and use distributed chaos testing frameworks such as OpenChaos to verify correct operation under fault scenarios such as network partition and random crashes.
After completing the correctness testing, a detailed performance testing report can be produced by comparing the new and old versions.

*Skills Required*
 - Strong interest in message middleware and distributed storage systems
 - Proficient in Java development
 - In-depth understanding of distributed consensus algorithms
 - In-depth understanding of the high-availability module of RockeetMQ and the DLedger library
 - Understanding of distributed chaos testing and performance testing.

  was:
*Apache RocketMQ*

Apache RocketMQ is a distributed messaging and streaming platform with low latency, high performance and reliability, trillion-level capacity, and flexible scalability.
Page: [https://rocketmq.apache.org|https://rocketmq.apache.org/]([https://rocketmq.apache.org/])
Repo: [https://github.com/apache/rocketmq]([https://github.com/apache/rocketmq])

*Background*

RocketMQ 5.0 introduced a new component, the controller, which controls the high availability master-slave switch in multi-replica scenarios. It uses the DLedger Raft library as a consensus replication state machine for metadata. As a completely independent component, it can run normally in some scenarios, but in large-scale clusters, it is necessary to maintain a large number of broker groups, which is a great challenge for operational capabilities and resource waste. When dealing with a large number of Broker groups, we need to optimize performance in large-scale scenarios, leveraging the high-performance writing of DLedger itself and performing some optimization for the current Controller architecture.

*Task*

1. Polish the usage of DLedger

Currently, on the Controller side, a task queue single thread is used for requesting reads and writes to DLedger, that is, only one read/write request can be processed at a time. However, DLedger itself implements many optimizations for multi-client reads and writes and can ensure linear consistency reading. However, now all read and write processing is performed using a single logical DLedger client, which will become a serious performance bottleneck in large-scale scenarios.

2. Optimization of DLedger features usage

DLedger itself can implement many optimizations, such as ReadIndex read and FollowerRead read. After implementation, we can fully leverage the performance of reads. Currently, all Broker nodes communicate with the Leader node of the Controller. In large-scale scenarios, this will cause the requests of each Controller group to be concentrated on the Leader node, and the other Follower nodes will not share the request processing of the Leader, which will cause single-point performance bottlenecks for the Leader.

3. Full asynchronous + parallel processing

Currently, DLedger itself is fully asynchronous, but on the Controller side, all requests for the DLedger side are synchronized, and many Controller-side operations are performed synchronously in a single thread, such as heartbeat checks and other timed tasks. In large-scale scenarios, the logic of these single-threaded synchronous operations will block a large number of requests from Broker-side, so asynchronous + parallel processing can be used for optimization.

4. Correctness testing and performance testing

After completing the above optimizations, it is necessary to conduct correctness testing on the new version and use distributed chaos testing frameworks such as OpenChaos to verify correct operation under fault scenarios such as network partition and random crashes.
After completing the correctness testing, a detailed performance testing report can be produced by comparing the new and old versions.

*Skills Required*
 - Strong interest in message middleware and distributed storage systems
 - Proficient in Java development
 - In-depth understanding of distributed consensus algorithms
 - In-depth understanding of the high-availability module of RockeetMQ and the DLedger library
 - Understanding of distributed chaos testing and performance testing.


> RocketMQ DLedger Controller Performance Optimization
> ----------------------------------------------------
>
>                 Key: COMDEV-522
>                 URL: https://issues.apache.org/jira/browse/COMDEV-522
>             Project: Community Development
>          Issue Type: Task
>          Components: Comdev, GSoC/Mentoring ideas
>            Reporter: Rongtong Jin
>            Priority: Major
>              Labels: RocketMQ, full-time, gsoc2023, mentor
>   Original Estimate: 350h
>  Remaining Estimate: 350h
>
> *Apache RocketMQ*
> Apache RocketMQ is a distributed messaging and streaming platform with low latency, high performance and reliability, trillion-level capacity, and flexible scalability.
> Page: [https://rocketmq.apache.org|https://rocketmq.apache.org/]
> Repo: [https://github.com/apache/rocketmq]
> *Background*
> RocketMQ 5.0 introduced a new component, the controller, which controls the high availability master-slave switch in multi-replica scenarios. It uses the DLedger Raft library as a consensus replication state machine for metadata. As a completely independent component, it can run normally in some scenarios, but in large-scale clusters, it is necessary to maintain a large number of broker groups, which is a great challenge for operational capabilities and resource waste. When dealing with a large number of Broker groups, we need to optimize performance in large-scale scenarios, leveraging the high-performance writing of DLedger itself and performing some optimization for the current Controller architecture.
> *Task*
> 1. Polish the usage of DLedger
> Currently, on the Controller side, a task queue single thread is used for requesting reads and writes to DLedger, that is, only one read/write request can be processed at a time. However, DLedger itself implements many optimizations for multi-client reads and writes and can ensure linear consistency reading. However, now all read and write processing is performed using a single logical DLedger client, which will become a serious performance bottleneck in large-scale scenarios.
> 2. Optimization of DLedger features usage
> DLedger itself can implement many optimizations, such as ReadIndex read and FollowerRead read. After implementation, we can fully leverage the performance of reads. Currently, all Broker nodes communicate with the Leader node of the Controller. In large-scale scenarios, this will cause the requests of each Controller group to be concentrated on the Leader node, and the other Follower nodes will not share the request processing of the Leader, which will cause single-point performance bottlenecks for the Leader.
> 3. Full asynchronous + parallel processing
> Currently, DLedger itself is fully asynchronous, but on the Controller side, all requests for the DLedger side are synchronized, and many Controller-side operations are performed synchronously in a single thread, such as heartbeat checks and other timed tasks. In large-scale scenarios, the logic of these single-threaded synchronous operations will block a large number of requests from Broker-side, so asynchronous + parallel processing can be used for optimization.
> 4. Correctness testing and performance testing
> After completing the above optimizations, it is necessary to conduct correctness testing on the new version and use distributed chaos testing frameworks such as OpenChaos to verify correct operation under fault scenarios such as network partition and random crashes.
> After completing the correctness testing, a detailed performance testing report can be produced by comparing the new and old versions.
> *Skills Required*
>  - Strong interest in message middleware and distributed storage systems
>  - Proficient in Java development
>  - In-depth understanding of distributed consensus algorithms
>  - In-depth understanding of the high-availability module of RockeetMQ and the DLedger library
>  - Understanding of distributed chaos testing and performance testing.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@community.apache.org
For additional commands, e-mail: dev-help@community.apache.org