You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@pegasus.apache.org by GitBox <gi...@apache.org> on 2022/01/20 08:13:10 UTC

[GitHub] [incubator-pegasus] empiredan opened a new issue #889: Feature: implement long adder to optimize the counter of new metrics system

empiredan opened a new issue #889:
URL: https://github.com/apache/incubator-pegasus/issues/889


   # 1. Motivation
   
   In the old metrics system, to be precise, the counter is not defined independently as a metric type, though it could be implemented by `dsn::perf_counter_number_atomic` or `dsn::perf_counter_volatile_number_atomic`.
   
   Using thread ID as the hash code,  `dsn::perf_counter_number_atomic` has tried to distribute the atomic increment evenly over the `DIVIDE_CONTAINER`. Each element in `DIVIDE_CONTAINER` is a 8-byte `std::atomic<int64_t>`. The final value of the counter can be got by summing all the elements of `DIVIDE_CONTAINER`.
   
   However, threads run on different cpus are likely to update the data within the same cache line  (typically 64 bytes for x86). This will lead to [false sharing](https://en.wikipedia.org/wiki/False_sharing) problem which will degrade performance. In comparison with a single `std::atomic<int64_t>` as the counter, it consumes more memory while sometimes getting even worse performance.
   
   Therefore, in [new metrics system](https://github.com/apache/incubator-pegasus/issues/756), we need new implementations that eliminate false sharing, to improve the performance and consume less memory, if possible.
   
   # 2. Implementation
   
   ## 2.1 concurrent_long_adder
   
   To eliminate false sharing, first we can introduce a new struct that aligns a `std::atomic<int64_t>` with a cache line. Then, allocate an array of the new structs in an aligned region of memory. The size of this array is equal to the number of cpus.
   
   Since threads run on different cpus are more likely to be hashed to the different cache line of this array, the probability of false sharing can be significantly reduced. The implementation based on this idea is called `concurrent_long_adder`.
   
   `concurrent_long_adder` achieves high performance at the cost of more memory. The more cpu cores the machines has, the more memory it will consumed. The relationship between the cpu cores and memory usage is as below: 
   |The number of cpus|Memory usage|
   | :----: | :----: |
   |2|128 bytes|
   |4|256 bytes|
   |8|512 bytes|
   |16|1024 bytes|
   |32|2048 bytes|
   
   Since `dsn::perf_counter_number_atomic` will consume 856 bytes, if the machine has less than 16 cores, `concurrent_long_adder` will consume less memory than `dsn::perf_counter_number_atomic`.
   
   Nevertheless, due to the memory consumption, `concurrent_long_adder` is recommended to be used under the scenario that a counter is updated very frequently.
   
   ## 2.2 striped_long_adder
   
   Inspired by https://github.com/apache/kudu/commit/01233222e594c9e460e541a29875998e1e3b873c, `striped_long_adder` is implemented to trade off between the performance and memory usage.
   
   The idea of this long adder is that it has a *base* 8-byte `std::atomic<int64_t>`. If the long adder is not updated frequently, it will not consume extra memory; Otherwise, once collision is detected, it will allocate the same memory bytes that `concurrent_long_adder` consumes in case the performance is degraded.
   
   Since more instructions are used to detect the collision, `striped_long_adder` is slower than `concurrent_long_adder`. However, it can balance between the performance and memory usage. Therefore, `striped_long_adder` is used as the default long adder and is appropriate for the scenario that a counter is not updated frequently.
   
   # 3. Usage
   
   For the reason that virtual function will consume extra memory and slow down the execution, template is used to wrap the long adder to provide the unified API. For details please see `long_adder_wrapper` in `include/dsn/utility/long_adder.h`.
   
   # 4. Performance
   
   ## 4.1 Hardware configurations
   ```
   cpu: Intel Xeon Processor (Cascadelake)
   cores: 4
   memory: 32GB
   ```
   
   ## 4.2 Test results
   
   The performance test is quite simple: a number of threads is started, and each thread execute a number of atomic increment executions. Besides `striped_long_adder` and `concurrent_long_adder`, another 2 kinds of long adders are introduced to compare the execution performance.
   
   `simple_long_adder` is just a wrapper of a single `std::atomic<int64_t>`. `divided_long_adder` is nearly the same with `dsn::perf_counter_number_atomic` except that virtual functions are removed.
   
   ## 4.2.1 4 threads with each 1,000,000,000 operations
   
   |The type of long adder||Duration(seconds)|
   | :----: | :----: |
   |simple_long_adder|94.9675|
   |divided_long_adder|101.167|
   |striped_long_adder|36.1651|
   |concurrent_long_adder|12.3893|
   
   ## 4.2.2 8 threads with each 1,000,000,000 operations
   |The type of long adder||Duration(seconds)|
   | :----: | :----: |
   |simple_long_adder|183.897|
   |divided_long_adder|185.945|
   |striped_long_adder|70.5208|
   |concurrent_long_adder|47.1702|


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@pegasus.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@pegasus.apache.org
For additional commands, e-mail: dev-help@pegasus.apache.org

[GitHub] [incubator-pegasus] Smityz commented on issue #889: Feature: implement long adder to optimize the counter of new metrics system

Posted by GitBox <gi...@apache.org>.

Smityz commented on issue #889:
URL: https://github.com/apache/incubator-pegasus/issues/889#issuecomment-1019682548


   Through `perf_counter` may not cost lots of time in our flame graph(capture in the read Scenes).
   ![cp](https://user-images.githubusercontent.com/22953824/150717253-cf425278-6d1d-4d32-975f-f45b93c2cbc6.svg)
   
   I think it's still worth optimizing it.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@pegasus.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@pegasus.apache.org
For additional commands, e-mail: dev-help@pegasus.apache.org

[GitHub] [incubator-pegasus] acelyc111 closed issue #889: Feature: implement long adder to optimize the counter of new metrics system

Posted by GitBox <gi...@apache.org>.

acelyc111 closed issue #889:
URL: https://github.com/apache/incubator-pegasus/issues/889


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@pegasus.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@pegasus.apache.org
For additional commands, e-mail: dev-help@pegasus.apache.org

[GitHub] [incubator-pegasus] empiredan commented on issue #889: Feature: implement long adder to optimize the counter of new metrics system

Posted by GitBox <gi...@apache.org>.

empiredan commented on issue #889:
URL: https://github.com/apache/incubator-pegasus/issues/889#issuecomment-1019698403


   > Through `perf_counter` may not cost lots of time in our flame graph(capture in the read Scenes). ![cp](https://user-images.githubusercontent.com/22953824/150717253-cf425278-6d1d-4d32-975f-f45b93c2cbc6.svg)
   > 
   > I think it's still worth optimizing it.
   
   Thanks for providing such a cool flame graph !
   
   Yeah I think currently most of our tasks that are sampled by the counter will be executed during several milliseconds or more: it may consume lots of memory, and involve network transmission or disk operations. If L1 cache is not large enough, it's likely that the counter has been evicted from L1 cache and must be loaded from memory again. This will not lead to false sharing.
   
   On the other hand, since we will adopt striped_long_adder based on striped64 as the default implementation for the counter, it will reduce the memory usage: a long duration will greatly reduce the possibility of collision, thus a single 8-byte `atomic<int64_t>` is usually enough; by contrast, `dsn::perf_counter_number_atomic` will consume 856 bytes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@pegasus.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@pegasus.apache.org
For additional commands, e-mail: dev-help@pegasus.apache.org