You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2021/05/14 08:13:12 UTC

[GitHub] [druid] yuanlihan opened a new issue #11256: Reduce method invocation of reservoir sampling

yuanlihan opened a new issue #11256:
URL: https://github.com/apache/druid/issues/11256


   ### Motivation
   
   The periodic task of the Coordinator service could be slow in a large cluster. It takes about 5 minutes to finish in a cycle. The periodic task consists of several serial subtasks. According to the profiling result, the segment balance task has some performance issue. I found that the root cause is that the current implementation invokes the sampling method too many times. We can reduce the number of method invocations by increasing the sample size in each invocation.
   
   <img width="1080" alt="image" src="https://user-images.githubusercontent.com/44718283/118240481-a5ecff80-b4cd-11eb-91b2-e310e0fa91ac.png">
   
   
   ### Proposed changes
   
   Adding a new Reservoir Sample method to sample K elements each time instead of only one element each time.
   A default method `pickSegmentsToMove` will be added to interface BalancerStrategy to pick K segments to move in a single method invocation.
   
   ### Rationale
   
   The current implementation picks up only one segment each time iterating all segments. When there are a lot segments need to be rebalanced or need to be decommissioned, the load balance calculation will be really slow. By picking up K segments each time will significantly reduce the number of iteration and thus speed up the process.
   
   ### Operational impact
   
   There will be no impact in operation
   
   ### Test plan (optional)
   
   Ensure test coverage and test it in test cluster
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] yuanlihan edited a comment on issue #11256: Reduce method invocation of reservoir sampling

Posted by GitBox <gi...@apache.org>.
yuanlihan edited a comment on issue #11256:
URL: https://github.com/apache/druid/issues/11256#issuecomment-841148193


   @jihoonson 
   > But I want to think about it more because we have no test coverage on the long term effect of this change.
   
   As this change will be enabled by default, I understand your concern. I tried to confirm the correctness based on the following points:
   
   - the set of segments in each server holder is immutable in each cycle
   - the output of getting k segments in a single pass is roughly equal to the result of getting k segments in k passes(1 segment per pass)
   
   > It would be great if you can provide some long term test results.
   
   Actually I have shipped this change into our on-premise cluster(which contains millions of segments) in Jan 2021 and it works well. In the past several months, the default cost-based balancer manages segment loading among 5 different historical tiers well. During this period, we have expanded the cluster by adding new historical nodes and decommissioned historical nodes for server maintenance. It seems hard to provide visual long term function testing result about this change.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] yuanlihan edited a comment on issue #11256: Reduce method invocation of reservoir sampling

Posted by GitBox <gi...@apache.org>.
yuanlihan edited a comment on issue #11256:
URL: https://github.com/apache/druid/issues/11256#issuecomment-841148193


   @jihoonson 
   > But I want to think about it more because we have no test coverage on the long term effect of this change.
   
   As this change will be enabled by default, I understand your concern. I tried to confirm the correctness based on the following points:
   
   - the set of segments in each server holder is immutable in each cycle
   - the sampling method ensures(without mathematical proof) that getting k segments in a single pass is equal to getting k segments in k passes(1 segment per pass)
   
   > It would be great if you can provide some long term test results.
   
   Actually I have shipped this change into our on-premise cluster(which contains millions of segments) in Jan 2021 and it works well. In the past several months, the default cost-based balancer manages segment loading among 5 different historical tiers well. During this period, we expanded the cluster by adding new historical nodes and decommissioned historical nodes for server maintenance. It seems hard to provide visual long term function testing result about this change.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] yuanlihan commented on issue #11256: Reduce method invocation of reservoir sampling

Posted by GitBox <gi...@apache.org>.
yuanlihan commented on issue #11256:
URL: https://github.com/apache/druid/issues/11256#issuecomment-841148193


   @jihoonson 
   > But I want to think about it more because we have no test coverage on the long term effect of this change.
   
   As this change will be enabled by default, I understand your concern. I tried to confirm the correctness based on the following facts:
   
   - the set of segments in each server holder is immutable in each cycle
   - the sampling method ensures(without mathematical proof) that getting k segments in a single pass is equal to getting k segments in k passes(1 segment per pass)
   
   > It would be great if you can provide some long term test results.
   
   Actually I have shipped this change into our on-premise cluster(which contains millions of segments) in Jan 2021 and it works well. In the past several months, the default cost-based balancer manages segment loading among 5 different historical tiers well. During this period, we expanded the cluster by adding new historical nodes and decommissioned historical nodes for server maintenance. It seems hard to provide visual long term function testing result about this change.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] jihoonson edited a comment on issue #11256: Reduce method invocation of reservoir sampling

Posted by GitBox <gi...@apache.org>.
jihoonson edited a comment on issue #11256:
URL: https://github.com/apache/druid/issues/11256#issuecomment-841118335


   @yuanlihan thanks. I think I was confused about what is expensive. I think what is expensive is not the sample method invocation, but is the algorithm that calls the sample method over again. I looked over your PR and seems reasonable to me. But I want to think about it more because we have no test coverage on the long term effect of this change.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] asdf2014 closed issue #11256: Reduce method invocation of reservoir sampling

Posted by GitBox <gi...@apache.org>.
asdf2014 closed issue #11256:
URL: https://github.com/apache/druid/issues/11256


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] jihoonson commented on issue #11256: Reduce method invocation of reservoir sampling

Posted by GitBox <gi...@apache.org>.
jihoonson commented on issue #11256:
URL: https://github.com/apache/druid/issues/11256#issuecomment-841613674


   @yuanlihan thank you for adding more details. I think the basic idea makes sense since every segment has the same probability of being picked up. I will review your PR soon. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] yuanlihan edited a comment on issue #11256: Reduce method invocation of reservoir sampling

Posted by GitBox <gi...@apache.org>.
yuanlihan edited a comment on issue #11256:
URL: https://github.com/apache/druid/issues/11256#issuecomment-841148193


   @jihoonson 
   > But I want to think about it more because we have no test coverage on the long term effect of this change.
   
   As this change will be enabled by default, I understand your concern. I tried to confirm the correctness based on the following points:
   
   - the set of segments in each server holder is immutable in each cycle
   - the output of getting k segments in a single pass is roughly equal to the result of getting k segments in k passes(1 segment per pass)
   
   > It would be great if you can provide some long term test results.
   
   Actually I have shipped this change into our on-premise cluster(which contains millions of segments) in Jan 2021 and it works well. In the past several months, the default cost-based balancer manages segment loading among 5 different historical tiers well. During this period, we expanded the cluster by adding new historical nodes and decommissioned historical nodes for server maintenance. It seems hard to provide visual long term function testing result about this change.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] jihoonson commented on issue #11256: Reduce method invocation of reservoir sampling

Posted by GitBox <gi...@apache.org>.
jihoonson commented on issue #11256:
URL: https://github.com/apache/druid/issues/11256#issuecomment-841118335


   @yuanlihan thanks. I think I was confused about what is expensive. I think what is expensive is not the sample method invocation, but is the algorithm that calls the sample method over again. I looked over your PR and seems reasonable to me.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] yuanlihan commented on issue #11256: Reduce method invocation of reservoir sampling

Posted by GitBox <gi...@apache.org>.
yuanlihan commented on issue #11256:
URL: https://github.com/apache/druid/issues/11256#issuecomment-841112760


   > > Adding a new Reservoir Sampling method to sample K elements each time instead of only one element each time.
   > 
   > I'm not sure how this can improve the performance. Does the new sampling method need to loop to sample K segments anyway? I guess I'm probably missing something. Would you please add more details on the proposed changes?
   
   Thanks for having a look at this. The default implementation samples only one segment in an iteration of all segments. Let's assume that:
   
   - the list of server holders contain 1 million segments
   - 1000 segments need to be picked up from these server holders
   
   Then the current implementation needs to call the sampling method 1000 times and each time needs to iterate 1 million segments. 
   I found that the Reservoir Sampling actually can sample K elements a single pass over the items. So in this case, the new method can sample 1000 segments in a method invocation, which means it needs to iterate 1 million segments only once.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] jihoonson commented on issue #11256: Reduce method invocation of reservoir sampling

Posted by GitBox <gi...@apache.org>.
jihoonson commented on issue #11256:
URL: https://github.com/apache/druid/issues/11256#issuecomment-841097547


   > Adding a new Reservoir Sampling method to sample K elements each time instead of only one element each time.
   
   I'm not sure how this can improve the performance. Does the new sampling method need to loop to sample K segments anyway? I guess I'm probably missing something. Would you please add more details on the proposed changes?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] yuanlihan edited a comment on issue #11256: Reduce method invocation of reservoir sampling

Posted by GitBox <gi...@apache.org>.
yuanlihan edited a comment on issue #11256:
URL: https://github.com/apache/druid/issues/11256#issuecomment-841112760


   > > Adding a new Reservoir Sampling method to sample K elements each time instead of only one element each time.
   > 
   > I'm not sure how this can improve the performance. Does the new sampling method need to loop to sample K segments anyway? I guess I'm probably missing something. Would you please add more details on the proposed changes?
   
   Thanks for having a look at this. The default implementation samples only one segment in an iteration of all segments. Let's assume that:
   
   - the list of server holders contains 1 million segments
   - 1000 segments need to be picked up from these server holders
   
   Then the current implementation needs to call the sampling method 1000 times and each time needs to iterate 1 million segments. 
   I found that the Reservoir Sampling actually can sample K elements a single pass over the items. So in this case, the new method can sample 1000 segments in a method invocation, which means it'll iterate 1 million segments only once.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] jihoonson edited a comment on issue #11256: Reduce method invocation of reservoir sampling

Posted by GitBox <gi...@apache.org>.
jihoonson edited a comment on issue #11256:
URL: https://github.com/apache/druid/issues/11256#issuecomment-841118335


   @yuanlihan thanks. I think I was confused about what is expensive. I think what is expensive is not the sample method invocation, but is the algorithm that calls the sample method over again. I looked over your PR and understand what you want now. But I want to think about it more because we have no test coverage on the long term effect of this change. It would be great if you can provide some long term test results.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] jihoonson edited a comment on issue #11256: Reduce method invocation of reservoir sampling

Posted by GitBox <gi...@apache.org>.
jihoonson edited a comment on issue #11256:
URL: https://github.com/apache/druid/issues/11256#issuecomment-841118335


   @yuanlihan thanks. I think I was confused about what is expensive. I think what is expensive is not the sample method invocation, but is the algorithm that calls the sample method over again. I looked over your PR and seems reasonable to me. But I want to think about it more because we have no test coverage on the long term effect of this change. It would be great if you can provide some long term test results.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org