Posted to commits@pulsar.apache.org by GitBox <gi...@apache.org> on 2022/10/24 08:20:59 UTC

[GitHub] [pulsar] thetumbled opened a new issue, #18173: PIP-217: LoadShedding Strategy Improvement

thetumbled opened a new issue, #18173:
URL: https://github.com/apache/pulsar/issues/18173

### Motivation

We usually need to run load shedding many times until the load is balanced.

### Incorrect shedding due to the `historical scoring algorithm`
ThresholdShedder implements a historical scoring algorithm when calculating the score of each broker, in order to smooth out performance fluctuations:
```java
historyScore = historyScore == null ? currentScore : historyScore * historyPercentage + (1 - historyPercentage) * currentScore;
```
But this algorithm causes bundles to be shed incorrectly, so shedding runs many times before the cluster reaches a stable state.
For example, suppose the cluster has only one broker, `broker1`, with a CPU usage of 90%. To relieve this broker, we add a new broker, `broker2`, whose initial CPU usage is 10%. So the score of `broker1` is 90, and the score of `broker2` is 10.
Assume that the shedding algorithm sheds bundles so as to make the load of the two brokers equal.
- In the first round of load balancing, `broker1` sheds bundles accounting for 40% CPU usage to `broker2`, and the CPU usage of both brokers becomes 50%, which is good enough.
- But in the second round of shedding checks, the score of `broker1` is `0.9*90+0.1*50=86` (the default historyPercentage is 0.9), and the score of `broker2` is `10*0.9+50*0.1=14`. Since `avg=(86+14)/2=50` and `86>avg+10` (the default value of loadBalancerBrokerThresholdShedderPercentage is 10), **the algorithm concludes that the load of the two brokers is unequal**. Then `broker1` sheds bundles accounting for `(86-14)/2=36` CPU usage to `broker2`: the CPU usage of `broker1` becomes `50-36=14`, and the CPU usage of `broker2` becomes `50+36=86`.
- In the third round of shedding checks, the score of `broker1` is `86*0.9+14*0.1=78.8`, the score of `broker2` is `14*0.9+86*0.1=21.2`, `avg=(78.8+21.2)/2=50`, and `78.8>avg+10`, **so the algorithm concludes that broker1 needs to shed bundles to broker2 again**. `broker1` then tries to shed bundles accounting for `(78.8-21.2)/2=28.8` CPU usage to `broker2`, which is more than the 14% it still carries, so no bundles are left on `broker1` at all!
- ......
- After many rounds of load balancing, we finally reach a stable state.
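The rounds above can be reproduced with a small Java simulation. It is a sketch under the same simplifying assumptions as the example, not Pulsar code: the class `HistoryScoreSimulation` and its methods are hypothetical, each shedding check moves `(maxScore - minScore) / 2` of CPU usage from the higher-scored broker to the other one, and a broker never sheds more than it currently carries.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical two-broker simulation of the historical scoring example; not Pulsar code. */
public class HistoryScoreSimulation {
    static final double HISTORY_PERCENTAGE = 0.9; // default historyPercentage from the example
    static final double THRESHOLD = 10.0;         // default loadBalancerBrokerThresholdShedderPercentage

    // The historical scoring formula quoted above.
    static double updateScore(Double historyScore, double currentScore) {
        return historyScore == null
                ? currentScore
                : historyScore * HISTORY_PERCENTAGE + (1 - HISTORY_PERCENTAGE) * currentScore;
    }

    /** Each entry is {score1, score2, usage1AfterShedding, usage2AfterShedding}. */
    static List<double[]> simulate(double usage1, double usage2, int rounds) {
        List<double[]> history = new ArrayList<>();
        Double score1 = null;
        Double score2 = null;
        for (int i = 0; i < rounds; i++) {
            score1 = updateScore(score1, usage1);
            score2 = updateScore(score2, usage2);
            double avg = (score1 + score2) / 2;
            double high = Math.max(score1, score2);
            if (high > avg + THRESHOLD) {
                // Assumed rule: move half of the score gap, capped at what the source broker has.
                double amount = (high - Math.min(score1, score2)) / 2;
                if (score1 >= score2) {
                    amount = Math.min(amount, usage1);
                    usage1 -= amount;
                    usage2 += amount;
                } else {
                    amount = Math.min(amount, usage2);
                    usage2 -= amount;
                    usage1 += amount;
                }
            }
            history.add(new double[] {score1, score2, usage1, usage2});
        }
        return history;
    }

    public static void main(String[] args) {
        for (double[] r : simulate(90, 10, 3)) {
            System.out.printf(Locale.ROOT, "scores %.1f / %.1f -> usage %.1f / %.1f%n",
                    r[0], r[1], r[2], r[3]);
        }
    }
}
```

Running `main` prints the scores and resulting CPU usage for each round. In the third round, the requested 28.8 exceeds the 14 left on `broker1`, so the sketch moves all of its load to `broker2`.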

In fact, only one round of shedding is needed. But because of the `historical scoring algorithm`, the algorithm incorrectly concludes that the load between the two brokers is uneven.
The downside of the `historical scoring algorithm` is so severe that we propose a new way to handle performance fluctuations and disable the `historical scoring algorithm`. It is introduced in the next section, `Multi Hit Algorithm`.

### Goal

Disable the `historical scoring algorithm` and introduce a new algorithm to handle performance fluctuations.

### API Changes

_No response_

### Implementation

#### Multi Hit Algorithm
Performance fluctuations are common. For example, CPU usage may suddenly rise by 20% and then quickly fall back.
So instead of shedding bundles as soon as a broker is judged to be overloaded, we count the number of **consecutive hits** for each broker. When the hit count of a broker reaches the configured `HitCountThreshold`, we shed bundles and reset its hit count to 0.
By default, the shedding check runs once per minute. If we set `HitCountThreshold` to 5, we can ride out performance fluctuations of up to about 5 minutes.
But this also delays load balancing when new brokers are added: we have to wait 5 minutes before shedding is triggered.
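The consecutive-hit counting described above can be sketched in Java as follows. This is a minimal illustration, not Pulsar code: the class `MultiHitDetector`, its `check` method, and the overload test `usage > avg + percentage` (borrowed from the ThresholdShedder comparison in the example) are hypothetical, and the sketch sheds once a broker's streak reaches `HitCountThreshold`.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of consecutive-hit counting; not the Pulsar implementation. */
public class MultiHitDetector {
    private final double percentage;     // assumed overload margin above the average CPU usage
    private final int hitCountThreshold; // HitCountThreshold from the proposal
    private final Map<String, Integer> hitCounts = new HashMap<>();

    public MultiHitDetector(double percentage, int hitCountThreshold) {
        this.percentage = percentage;
        this.hitCountThreshold = hitCountThreshold;
    }

    /** Runs one shedding check and returns the brokers whose consecutive hits reached the threshold. */
    public List<String> check(Map<String, Double> cpuUsage) {
        double avg = cpuUsage.values().stream().mapToDouble(Double::doubleValue).average().orElse(0.0);
        List<String> toShed = new ArrayList<>();
        for (Map.Entry<String, Double> entry : cpuUsage.entrySet()) {
            String broker = entry.getKey();
            if (entry.getValue() > avg + percentage) {
                // Overloaded in this check: extend the broker's streak.
                int hits = hitCounts.merge(broker, 1, Integer::sum);
                if (hits >= hitCountThreshold) {
                    toShed.add(broker);
                    hitCounts.put(broker, 0); // reset after deciding to shed
                }
            } else {
                // Not overloaded: the streak is broken, so a short spike never triggers shedding.
                hitCounts.remove(broker);
            }
        }
        return toShed;
    }

    public static void main(String[] args) {
        MultiHitDetector detector = new MultiHitDetector(10, 5);
        Map<String, Double> usage = new HashMap<>();
        usage.put("broker1", 90.0);
        usage.put("broker2", 10.0);
        for (int minute = 1; minute <= 5; minute++) {
            System.out.println("check " + minute + ": shed " + detector.check(usage));
        }
    }
}
```

With one check per minute, `broker1` is only selected on the fifth consecutive overloaded check; any calm check in between clears its streak, which is how short fluctuations are filtered out.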
The solution is to configure two hit-count thresholds, `HitCountThresholdForHigh` and `HitCountThresholdForLow`, and two values of `loadBalancerBrokerThresholdShedderPercentage`, `PercentageForHigh` and `PercentageForLow`.
If the CPU usage of a broker exceeds the average CPU usage by more than `PercentageForHigh` for `HitCountThresholdForHigh` consecutive checks, we shed bundles; similarly, if it exceeds the average by more than `PercentageForLow` for `HitCountThresholdForLow` consecutive checks, we also shed bundles.
For example, we can set `HitCountThresholdForHigh` to 1, `PercentageForHigh` to 40, `HitCountThresholdForLow` to 5, and `PercentageForLow` to 10. Then load balancing is triggered in the first round of shedding after a new broker is added, because the CPU usage of a new broker is usually much lower than the average.
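The two-tier rule can be sketched by keeping a separate consecutive-hit counter per (percentage, threshold) pair. This is a hypothetical illustration, not Pulsar code: `TwoTierShedding`, `Tier`, and `check` are invented names, the overload test `usage > avg + percentage` is assumed, and resetting both tiers' streaks after a broker is shed is a design assumption the proposal does not spell out.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical two-tier variant of consecutive-hit counting; not the Pulsar implementation. */
public class TwoTierShedding {
    /** One (percentage, hit-count threshold) pair with its own per-broker streak counters. */
    static final class Tier {
        private final double percentage;
        private final int hitCountThreshold;
        private final Map<String, Integer> hits = new HashMap<>();

        Tier(double percentage, int hitCountThreshold) {
            this.percentage = percentage;
            this.hitCountThreshold = hitCountThreshold;
        }

        /** Updates the broker's streak and reports whether it reached this tier's threshold. */
        boolean record(String broker, double usage, double avg) {
            if (usage > avg + percentage) {
                return hits.merge(broker, 1, Integer::sum) >= hitCountThreshold;
            }
            hits.remove(broker); // streak broken
            return false;
        }

        void reset(String broker) {
            hits.remove(broker);
        }
    }

    private final Tier high;
    private final Tier low;

    public TwoTierShedding(double percentageForHigh, int hitCountThresholdForHigh,
                           double percentageForLow, int hitCountThresholdForLow) {
        this.high = new Tier(percentageForHigh, hitCountThresholdForHigh);
        this.low = new Tier(percentageForLow, hitCountThresholdForLow);
    }

    /** Runs one shedding check; a broker is shed if either tier's streak reaches its threshold. */
    public Set<String> check(Map<String, Double> cpuUsage) {
        double avg = cpuUsage.values().stream().mapToDouble(Double::doubleValue).average().orElse(0.0);
        Set<String> toShed = new LinkedHashSet<>();
        for (Map.Entry<String, Double> entry : cpuUsage.entrySet()) {
            String broker = entry.getKey();
            double usage = entry.getValue();
            // Evaluate both tiers so that each keeps its own streak up to date.
            boolean highHit = high.record(broker, usage, avg);
            boolean lowHit = low.record(broker, usage, avg);
            if (highHit || lowHit) {
                toShed.add(broker);
                // Assumption: shedding resets the broker's streaks in both tiers.
                high.reset(broker);
                low.reset(broker);
            }
        }
        return toShed;
    }

    public static void main(String[] args) {
        // Example values from the proposal: ForHigh = (40, 1), ForLow = (10, 5).
        TwoTierShedding shedder = new TwoTierShedding(40, 1, 10, 5);
        Map<String, Double> usage = new HashMap<>();
        usage.put("broker1", 95.0);
        usage.put("broker2", 5.0);
        System.out.println("large gap, first check: " + shedder.check(usage));
    }
}
```

A large gap (here 95% vs. 5%, average 50%) trips the high tier on the first check, while a moderate gap such as 70% vs. 30% only trips the low tier after five consecutive checks. Both tiers are evaluated on every check, rather than short-circuiting, so that the low-tier streak stays accurate while a broker sits between the two margins.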

### Alternatives

_No response_

### Anything else?

_No response_

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [pulsar] github-actions[bot] commented on issue #18173: PIP-217: LoadShedding Strategy Improvement

Posted by GitBox <gi...@apache.org>.

github-actions[bot] commented on issue #18173:
URL: https://github.com/apache/pulsar/issues/18173#issuecomment-1326948687

   The issue has had no activity for 30 days; marking it with the Stale label.
