You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@pegasus.apache.org by GitBox <gi...@apache.org> on 2023/01/05 17:36:48 UTC

[GitHub] [incubator-pegasus] empiredan opened a new pull request, #1304: feat(new_metrics): retire the old metrics and entities that are not in use

empiredan opened a new pull request, #1304:
URL: https://github.com/apache/incubator-pegasus/pull/1304

   This PR is to resolve https://github.com/apache/incubator-pegasus/issues/1303.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@pegasus.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@pegasus.apache.org
For additional commands, e-mail: dev-help@pegasus.apache.org


[GitHub] [incubator-pegasus] empiredan commented on a diff in pull request #1304: feat(new_metrics): retire the old metrics and entities that are not in use

Posted by GitBox <gi...@apache.org>.
empiredan commented on code in PR #1304:
URL: https://github.com/apache/incubator-pegasus/pull/1304#discussion_r1066557588


##########
src/utils/metrics.h:
##########
@@ -153,6 +153,36 @@ class metric_entity : public ref_counter
 public:
     using attr_map = std::unordered_map<std::string, std::string>;
     using metric_map = std::unordered_map<const metric_prototype *, metric_ptr>;
+    using old_metric_list = std::unordered_set<const metric_prototype *>;
+
+    struct collected_old_metrics_info
+    {
+        old_metric_list old_metrics;

Review Comment:
   OK.



##########
src/utils/metrics.h:
##########
@@ -354,10 +392,96 @@ class metrics_http_service : public http_server_base
     DISALLOW_COPY_AND_ASSIGN(metrics_http_service);
 };
 
+// `metric_timer` is a timer class that encapsulates the details how each percentile is
+// computed periodically.
+//
+// To be instantiated, it requires `interval_ms` at which a percentile is computed and `exec`
+// which is used to compute percentile.
+//
+// In case that all percentiles are computed at the same time and lead to very high load,
+// first computation for percentile will be delayed at a random interval.
+class metric_timer
+{
+public:
+    enum class state : int
+    {
+        kRunning,
+        kClosing,
+        kClosed,
+    };
+
+    using on_exec_fn = std::function<void()>;
+    using on_close_fn = std::function<void()>;
+
+    metric_timer(uint64_t interval_ms, on_exec_fn on_exec, on_close_fn on_close);
+    ~metric_timer() = default;
+
+    void close();
+    void wait();
+
+    // Get the initial delay that is randomly generated by `generate_initial_delay_ms()`.
+    uint64_t get_initial_delay_ms() const { return _initial_delay_ms; }
+
+private:
+    // Generate an initial delay randomly in case that all percentiles are computed at the
+    // same time.
+    static uint64_t generate_initial_delay_ms(uint64_t interval_ms);
+
+    void on_close();
+
+    void on_timer(const boost::system::error_code &ec);
+
+    const uint64_t _initial_delay_ms;
+    const uint64_t _interval_ms;
+    const on_exec_fn _on_exec;
+    const on_close_fn _on_close;
+    std::atomic<state> _state;
+    utils::notify_event _completed;
+    std::unique_ptr<boost::asio::deadline_timer> _timer;
+
+    DISALLOW_COPY_AND_ASSIGN(metric_timer);
+};
+
 class metric_registry : public utils::singleton<metric_registry>
 {
 public:
     using entity_map = std::unordered_map<std::string, metric_entity_ptr>;
+    using old_entity_map = std::unordered_map<std::string, metric_entity::old_metric_list>;
+
+    struct collected_old_entities_info
+    {
+        old_entity_map old_entities;

Review Comment:
   OK.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@pegasus.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@pegasus.apache.org
For additional commands, e-mail: dev-help@pegasus.apache.org


[GitHub] [incubator-pegasus] empiredan commented on a diff in pull request #1304: feat(new_metrics): retire stale metric entities that are not used by any other object

Posted by "empiredan (via GitHub)" <gi...@apache.org>.
empiredan commented on code in PR #1304:
URL: https://github.com/apache/incubator-pegasus/pull/1304#discussion_r1098143983


##########
src/utils/metrics.cpp:
##########
@@ -41,17 +48,15 @@ metric_entity::~metric_entity()
     close(close_option::kWait);
 }
 
-void metric_entity::close(close_option option)
+void metric_entity::close(close_option option) const
 {
     utils::auto_write_lock l(_lock);
 
-    // The reason why each metric is closed in the entity rather than in the destructor of each
-    // metric is that close() for the metric will return immediately without waiting for any close
-    // operation to be finished.
-    //
-    // Thus, to close all metrics owned by an entity, it's more efficient to firstly issue a close
-    // request for all metrics; then, just wait for all of the close operations to be finished.
-    // It's inefficient to wait for each metric to be closed one by one.
+    // To close all metrics owned by an entity, it's more efficient to firstly issue an asynchronous
+    // close request to each metric; then, just wait for all of the close operations to be finished.
+    // It's inefficient to wait for each metric to be closed one by one. Therefore, the metric is
+    // not
+    // closed in its destructor.
     for (auto &m : _metrics) {

Review Comment:
   Compilation can be passed while this function is qualified with `const` since actually it does not modify the direct member (add, remove or update), such as `_metrics`; however, it will change the state of the member of `_metrics`. Therefore I think it's better to declare it without `const`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@pegasus.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@pegasus.apache.org
For additional commands, e-mail: dev-help@pegasus.apache.org


[GitHub] [incubator-pegasus] acelyc111 commented on a diff in pull request #1304: feat(new_metrics): retire the stale metrics and entities that are no longer used

Posted by GitBox <gi...@apache.org>.
acelyc111 commented on code in PR #1304:
URL: https://github.com/apache/incubator-pegasus/pull/1304#discussion_r1067769180


##########
src/utils/metrics.cpp:
##########
@@ -177,6 +203,109 @@ void metric_entity::take_snapshot(metric_json_writer &writer, const metric_filte
     writer.EndObject();
 }
 
+bool metric_entity::is_stale() const
+{
+    // Since this entity itself is still being accessed, its reference count should be 1
+    // at least.
+    CHECK_GE(get_count(), 1);
+
+    // Once this entity did not have any metric, and had only one reference kept in the
+    // registry, this entity would be considered useless.
+    return _metrics.empty() && get_count() == 1;
+}
+
+metric_entity::collected_stale_metrics_info metric_entity::collect_stale_metrics() const
+{
+    collected_stale_metrics_info collected_info;
+
+    auto now = dsn_now_ms();
+
+    utils::auto_read_lock l(_lock);
+
+    for (const auto &m : _metrics) {
+        if (!m.second->is_stale()) {

Review Comment:
   I'm a bit of confused why only partial metrics of the entity are stale, aren't all of  them become stale at the same time when the entity they belong to is destroyed?



##########
src/utils/metrics.cpp:
##########
@@ -177,6 +203,109 @@ void metric_entity::take_snapshot(metric_json_writer &writer, const metric_filte
     writer.EndObject();
 }
 
+bool metric_entity::is_stale() const
+{
+    // Since this entity itself is still being accessed, its reference count should be 1
+    // at least.
+    CHECK_GE(get_count(), 1);
+
+    // Once this entity did not have any metric, and had only one reference kept in the
+    // registry, this entity would be considered useless.

Review Comment:
   How about use `stale` instead of `useless`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@pegasus.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@pegasus.apache.org
For additional commands, e-mail: dev-help@pegasus.apache.org


[GitHub] [incubator-pegasus] acelyc111 commented on a diff in pull request #1304: feat(new_metrics): retire the old metrics and entities that are not in use

Posted by GitBox <gi...@apache.org>.
acelyc111 commented on code in PR #1304:
URL: https://github.com/apache/incubator-pegasus/pull/1304#discussion_r1064266714


##########
src/utils/metrics.cpp:
##########
@@ -177,6 +203,109 @@ void metric_entity::take_snapshot(metric_json_writer &writer, const metric_filte
     writer.EndObject();
 }
 
+bool metric_entity::is_stale() const
+{
+    // Since this entity itself is still being accessed, its reference count should be 1
+    // at least.
+    CHECK_GE(get_count(), 1);
+
+    // Once this entity did not have any metric, and had only one reference kept in the
+    // registry, this entity would be considered useless.
+    return _metrics.empty() && get_count() == 1;
+}
+
+metric_entity::collected_old_metrics_info metric_entity::collect_old_metrics() const

Review Comment:
   what dose 'old' mean?



##########
src/utils/metrics.h:
##########
@@ -153,6 +153,36 @@ class metric_entity : public ref_counter
 public:
     using attr_map = std::unordered_map<std::string, std::string>;
     using metric_map = std::unordered_map<const metric_prototype *, metric_ptr>;
+    using old_metric_list = std::unordered_set<const metric_prototype *>;
+
+    struct collected_old_metrics_info
+    {
+        old_metric_list old_metrics;

Review Comment:
   Could you add some comments about `old_metrics`?



##########
src/utils/metrics.h:
##########
@@ -354,10 +392,96 @@ class metrics_http_service : public http_server_base
     DISALLOW_COPY_AND_ASSIGN(metrics_http_service);
 };
 
+// `metric_timer` is a timer class that encapsulates the details how each percentile is
+// computed periodically.
+//
+// To be instantiated, it requires `interval_ms` at which a percentile is computed and `exec`
+// which is used to compute percentile.
+//
+// In case that all percentiles are computed at the same time and lead to very high load,
+// first computation for percentile will be delayed at a random interval.
+class metric_timer
+{
+public:
+    enum class state : int
+    {
+        kRunning,
+        kClosing,
+        kClosed,
+    };
+
+    using on_exec_fn = std::function<void()>;
+    using on_close_fn = std::function<void()>;
+
+    metric_timer(uint64_t interval_ms, on_exec_fn on_exec, on_close_fn on_close);
+    ~metric_timer() = default;
+
+    void close();
+    void wait();
+
+    // Get the initial delay that is randomly generated by `generate_initial_delay_ms()`.
+    uint64_t get_initial_delay_ms() const { return _initial_delay_ms; }
+
+private:
+    // Generate an initial delay randomly in case that all percentiles are computed at the
+    // same time.
+    static uint64_t generate_initial_delay_ms(uint64_t interval_ms);
+
+    void on_close();
+
+    void on_timer(const boost::system::error_code &ec);
+
+    const uint64_t _initial_delay_ms;
+    const uint64_t _interval_ms;
+    const on_exec_fn _on_exec;
+    const on_close_fn _on_close;
+    std::atomic<state> _state;
+    utils::notify_event _completed;
+    std::unique_ptr<boost::asio::deadline_timer> _timer;
+
+    DISALLOW_COPY_AND_ASSIGN(metric_timer);
+};
+
 class metric_registry : public utils::singleton<metric_registry>
 {
 public:
     using entity_map = std::unordered_map<std::string, metric_entity_ptr>;
+    using old_entity_map = std::unordered_map<std::string, metric_entity::old_metric_list>;
+
+    struct collected_old_entities_info
+    {
+        old_entity_map old_entities;

Review Comment:
   Could you add some comments about it?



##########
src/utils/metrics.cpp:
##########
@@ -383,15 +547,134 @@ metric_entity_ptr metric_registry::find_or_create_entity(const metric_entity_pro
     return entity;
 }
 
+metric_registry::collected_old_entities_info metric_registry::collect_old_metrics() const
+{
+    collected_old_entities_info entities_info;
+
+    utils::auto_read_lock l(_lock);
+
+    for (const auto &entity : _entities) {
+        const auto &metrics_info = entity.second->collect_old_metrics();
+
+        entities_info.num_all_metrics += metrics_info.num_all_metrics;
+        entities_info.num_scheduled_metrics += metrics_info.num_scheduled_metrics;
+        if (!metrics_info.old_metrics.empty()) {
+            // Those entities which have metrics that should be retired will be collected.
+            entities_info.old_entities.emplace(entity.first, std::move(metrics_info.old_metrics));
+            continue;
+        }
+
+        if (!entity.second->is_stale()) {
+            continue;
+        }
+
+        // Since this entity itself should be retired, it will be collected without any
+        // metric. Actually it has already not had any metric itself.
+        (void)entities_info.old_entities[entity.first];
+    }
+
+    entities_info.num_all_entities = _entities.size();
+    return entities_info;
+}
+
+metric_registry::retired_entities_stat
+metric_registry::retire_old_metrics(const old_entity_map &old_entities)
+{
+    if (old_entities.empty()) {

Review Comment:
   If some other thread is writing `old_entities`, will it cause data race?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@pegasus.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@pegasus.apache.org
For additional commands, e-mail: dev-help@pegasus.apache.org


[GitHub] [incubator-pegasus] empiredan commented on a diff in pull request #1304: feat(new_metrics): retire the old metrics and entities that are not in use

Posted by GitBox <gi...@apache.org>.
empiredan commented on code in PR #1304:
URL: https://github.com/apache/incubator-pegasus/pull/1304#discussion_r1066560021


##########
src/utils/metrics.cpp:
##########
@@ -383,15 +547,134 @@ metric_entity_ptr metric_registry::find_or_create_entity(const metric_entity_pro
     return entity;
 }
 
+metric_registry::collected_old_entities_info metric_registry::collect_old_metrics() const
+{
+    collected_old_entities_info entities_info;
+
+    utils::auto_read_lock l(_lock);
+
+    for (const auto &entity : _entities) {
+        const auto &metrics_info = entity.second->collect_old_metrics();
+
+        entities_info.num_all_metrics += metrics_info.num_all_metrics;
+        entities_info.num_scheduled_metrics += metrics_info.num_scheduled_metrics;
+        if (!metrics_info.old_metrics.empty()) {
+            // Those entities which have metrics that should be retired will be collected.
+            entities_info.old_entities.emplace(entity.first, std::move(metrics_info.old_metrics));
+            continue;
+        }
+
+        if (!entity.second->is_stale()) {
+            continue;
+        }
+
+        // Since this entity itself should be retired, it will be collected without any
+        // metric. Actually it has already not had any metric itself.
+        (void)entities_info.old_entities[entity.first];
+    }
+
+    entities_info.num_all_entities = _entities.size();
+    return entities_info;
+}
+
+metric_registry::retired_entities_stat
+metric_registry::retire_old_metrics(const old_entity_map &old_entities)
+{
+    if (old_entities.empty()) {

Review Comment:
   Actually the variable that `old_entities` references is a local constant in `process_old_metrics()`, which cannot be mutated and is also not shared by other threads. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@pegasus.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@pegasus.apache.org
For additional commands, e-mail: dev-help@pegasus.apache.org


[GitHub] [incubator-pegasus] empiredan commented on a diff in pull request #1304: feat(new_metrics): retire the old metrics and entities that are not in use

Posted by GitBox <gi...@apache.org>.
empiredan commented on code in PR #1304:
URL: https://github.com/apache/incubator-pegasus/pull/1304#discussion_r1066581193


##########
src/utils/metrics.cpp:
##########
@@ -177,6 +203,109 @@ void metric_entity::take_snapshot(metric_json_writer &writer, const metric_filte
     writer.EndObject();
 }
 
+bool metric_entity::is_stale() const
+{
+    // Since this entity itself is still being accessed, its reference count should be 1
+    // at least.
+    CHECK_GE(get_count(), 1);
+
+    // Once this entity did not have any metric, and had only one reference kept in the
+    // registry, this entity would be considered useless.
+    return _metrics.empty() && get_count() == 1;
+}
+
+metric_entity::collected_old_metrics_info metric_entity::collect_old_metrics() const

Review Comment:
   `old` means a metric or entity is no longer used. Typically a metric or entity is referenced by a class. For example, the class `replica` may use a metric as the measurement of QPS for a replica.
   
   However, once the replica is removed, this metric will not be used. The framework of metrics then will find this metric and mark it `old`. After a configurable retention interval, this metric will finally be removed from the repository of metrics.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@pegasus.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@pegasus.apache.org
For additional commands, e-mail: dev-help@pegasus.apache.org


[GitHub] [incubator-pegasus] acelyc111 commented on a diff in pull request #1304: feat(new_metrics): retire the old metrics and entities that are not in use

Posted by GitBox <gi...@apache.org>.
acelyc111 commented on code in PR #1304:
URL: https://github.com/apache/incubator-pegasus/pull/1304#discussion_r1064266714


##########
src/utils/metrics.cpp:
##########
@@ -177,6 +203,109 @@ void metric_entity::take_snapshot(metric_json_writer &writer, const metric_filte
     writer.EndObject();
 }
 
+bool metric_entity::is_stale() const
+{
+    // Since this entity itself is still being accessed, its reference count should be 1
+    // at least.
+    CHECK_GE(get_count(), 1);
+
+    // Once this entity did not have any metric, and had only one reference kept in the
+    // registry, this entity would be considered useless.
+    return _metrics.empty() && get_count() == 1;
+}
+
+metric_entity::collected_old_metrics_info metric_entity::collect_old_metrics() const

Review Comment:
   what does 'old' mean?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@pegasus.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@pegasus.apache.org
For additional commands, e-mail: dev-help@pegasus.apache.org


[GitHub] [incubator-pegasus] empiredan merged pull request #1304: feat(new_metrics): retire stale metric entities that are not used by any other object

Posted by "empiredan (via GitHub)" <gi...@apache.org>.
empiredan merged PR #1304:
URL: https://github.com/apache/incubator-pegasus/pull/1304


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@pegasus.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@pegasus.apache.org
For additional commands, e-mail: dev-help@pegasus.apache.org


[GitHub] [incubator-pegasus] acelyc111 commented on a diff in pull request #1304: feat(new_metrics): retire stale metric entities that are not used by any other object

Posted by "acelyc111 (via GitHub)" <gi...@apache.org>.
acelyc111 commented on code in PR #1304:
URL: https://github.com/apache/incubator-pegasus/pull/1304#discussion_r1097111121


##########
src/utils/metrics.cpp:
##########
@@ -383,6 +434,111 @@ metric_entity_ptr metric_registry::find_or_create_entity(const metric_entity_pro
     return entity;
 }
 
+metric_registry::collected_stale_entities_info metric_registry::collect_stale_entities() const
+{
+    collected_stale_entities_info collected_info;
+
+    auto now = dsn_now_ms();
+
+    utils::auto_read_lock l(_lock);
+
+    for (const auto &entity : _entities) {
+        if (!entity.second->is_stale()) {
+            if (entity.second->_retire_time_ms > 0) {
+                // This entity had been scheduled to be retired. However, it was reemployed
+                // after that. It has been in use since then, therefore its scheduled time
+                // for retirement should be reset to 0.
+                collected_info.stale_entities.insert(entity.first);

Review Comment:
   How about renaming `stale_entities` because this kind of entity is not "stale" ?



##########
src/utils/metrics.cpp:
##########
@@ -41,17 +48,15 @@ metric_entity::~metric_entity()
     close(close_option::kWait);
 }
 
-void metric_entity::close(close_option option)
+void metric_entity::close(close_option option) const
 {
     utils::auto_write_lock l(_lock);
 
-    // The reason why each metric is closed in the entity rather than in the destructor of each
-    // metric is that close() for the metric will return immediately without waiting for any close
-    // operation to be finished.
-    //
-    // Thus, to close all metrics owned by an entity, it's more efficient to firstly issue a close
-    // request for all metrics; then, just wait for all of the close operations to be finished.
-    // It's inefficient to wait for each metric to be closed one by one.
+    // To close all metrics owned by an entity, it's more efficient to firstly issue an asynchronous
+    // close request to each metric; then, just wait for all of the close operations to be finished.
+    // It's inefficient to wait for each metric to be closed one by one. Therefore, the metric is
+    // not
+    // closed in its destructor.

Review Comment:
   ```suggestion
       // not closed in its destructor.
   ```



##########
src/utils/metrics.cpp:
##########
@@ -41,17 +48,15 @@ metric_entity::~metric_entity()
     close(close_option::kWait);
 }
 
-void metric_entity::close(close_option option)
+void metric_entity::close(close_option option) const
 {
     utils::auto_write_lock l(_lock);
 
-    // The reason why each metric is closed in the entity rather than in the destructor of each
-    // metric is that close() for the metric will return immediately without waiting for any close
-    // operation to be finished.
-    //
-    // Thus, to close all metrics owned by an entity, it's more efficient to firstly issue a close
-    // request for all metrics; then, just wait for all of the close operations to be finished.
-    // It's inefficient to wait for each metric to be closed one by one.
+    // To close all metrics owned by an entity, it's more efficient to firstly issue an asynchronous
+    // close request to each metric; then, just wait for all of the close operations to be finished.
+    // It's inefficient to wait for each metric to be closed one by one. Therefore, the metric is
+    // not
+    // closed in its destructor.
     for (auto &m : _metrics) {

Review Comment:
   Now that this is a `const` function, can we loop with const reference ? But we will call  `p->close()`, is it really constant?



##########
src/utils/metrics.cpp:
##########
@@ -383,6 +434,111 @@ metric_entity_ptr metric_registry::find_or_create_entity(const metric_entity_pro
     return entity;
 }
 
+metric_registry::collected_stale_entities_info metric_registry::collect_stale_entities() const
+{
+    collected_stale_entities_info collected_info;
+
+    auto now = dsn_now_ms();
+
+    utils::auto_read_lock l(_lock);
+
+    for (const auto &entity : _entities) {
+        if (!entity.second->is_stale()) {
+            if (entity.second->_retire_time_ms > 0) {
+                // This entity had been scheduled to be retired. However, it was reemployed
+                // after that. It has been in use since then, therefore its scheduled time
+                // for retirement should be reset to 0.
+                collected_info.stale_entities.insert(entity.first);
+            }
+            continue;
+        }
+
+        if (entity.second->_retire_time_ms > now) {
+            // This entity has been scheduled to be retired, however it is still within
+            // the retention interval. Thus do not collect it.
+            ++collected_info.num_scheduled_entities;
+            continue;
+        }
+
+        collected_info.stale_entities.insert(entity.first);
+    }
+
+    collected_info.num_all_entities = _entities.size();
+    return collected_info;
+}
+
+metric_registry::retired_entities_stat
+metric_registry::retire_stale_entities(const stale_entity_list &stale_entities)
+{
+    if (stale_entities.empty()) {
+        // Do not lock for empty list.
+        return retired_entities_stat();
+    }
+
+    retired_entities_stat retired_stat;
+
+    auto now = dsn_now_ms();
+
+    utils::auto_write_lock l(_lock);
+
+    for (const auto &stale_entity : stale_entities) {
+        auto iter = _entities.find(stale_entity);
+        if (dsn_unlikely(iter == _entities.end())) {
+            // The entity has been removed from the registry for some unusual reason.
+            continue;
+        }
+
+        if (!iter->second->is_stale()) {
+            if (iter->second->_retire_time_ms > 0) {
+                // For those entities which are reemployed, their scheduled time for retirement
+                // should be reset to 0 though previously they could have been scheduled to be
+                // retired.
+                iter->second->_retire_time_ms = 0;
+                ++retired_stat.num_reemployed_entities;
+            }
+            continue;
+        }
+
+        if (dsn_unlikely(iter->second->_retire_time_ms > now)) {
+            // Since in collect_stale_entities() we've filtered the metrics which have been
+            // outside the retention interval, this is unlikely to happen. However, we still
+            // check here.
+            continue;
+        }
+
+        if (iter->second->_retire_time_ms == 0) {
+            // The entity should be marked with a scheduled time for retirement, since it has
+            // already been considered stale.
+            iter->second->_retire_time_ms = now + FLAGS_entity_retirement_delay_ms;
+            ++retired_stat.num_recently_scheduled_entities;
+            continue;
+        }
+
+        // Once the entity is outside the retention interval, retire it from the registry.
+        _entities.erase(iter);
+        ++retired_stat.num_retired_entities;
+    }
+
+    return retired_stat;
+}
+
+void metric_registry::process_stale_entities()
+{
+    LOG_INFO("begin to process stale metric entities");
+
+    const auto &collected_info = collect_stale_entities();
+    const auto &retired_stat = retire_stale_entities(collected_info.stale_entities);
+
+    LOG_INFO("stat for metric entities: total={}, collected={}, retired={}, scheduled={}, "
+             "recently_scheduled={}, reemployed={}",
+             collected_info.num_all_entities,
+             collected_info.stale_entities.size(),

Review Comment:
   So, how about renaming `stale_entities` to `collected_entities` ?



##########
src/utils/test/metrics_test.cpp:
##########
@@ -2828,4 +2831,216 @@ TEST(metrics_test, http_get_metrics)
     }
 }
 
+using surviving_metrics_case = std::tuple<std::string, bool, bool, bool, bool>;
+
+class MetricsRetirementTest : public testing::TestWithParam<surviving_metrics_case>

Review Comment:
   👍 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@pegasus.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@pegasus.apache.org
For additional commands, e-mail: dev-help@pegasus.apache.org


[GitHub] [incubator-pegasus] empiredan commented on a diff in pull request #1304: feat(new_metrics): retire stale metric entities that are not used by any other object

Posted by "empiredan (via GitHub)" <gi...@apache.org>.
empiredan commented on code in PR #1304:
URL: https://github.com/apache/incubator-pegasus/pull/1304#discussion_r1098140697


##########
src/utils/metrics.cpp:
##########
@@ -383,6 +434,111 @@ metric_entity_ptr metric_registry::find_or_create_entity(const metric_entity_pro
     return entity;
 }
 
+metric_registry::collected_stale_entities_info metric_registry::collect_stale_entities() const
+{
+    collected_stale_entities_info collected_info;
+
+    auto now = dsn_now_ms();
+
+    utils::auto_read_lock l(_lock);
+
+    for (const auto &entity : _entities) {
+        if (!entity.second->is_stale()) {
+            if (entity.second->_retire_time_ms > 0) {
+                // This entity had been scheduled to be retired. However, it was reemployed
+                // after that. It has been in use since then, therefore its scheduled time
+                // for retirement should be reset to 0.
+                collected_info.stale_entities.insert(entity.first);
+            }
+            continue;
+        }
+
+        if (entity.second->_retire_time_ms > now) {
+            // This entity has been scheduled to be retired, however it is still within
+            // the retention interval. Thus do not collect it.
+            ++collected_info.num_scheduled_entities;
+            continue;
+        }
+
+        collected_info.stale_entities.insert(entity.first);
+    }
+
+    collected_info.num_all_entities = _entities.size();
+    return collected_info;
+}
+
+metric_registry::retired_entities_stat
+metric_registry::retire_stale_entities(const stale_entity_list &stale_entities)
+{
+    if (stale_entities.empty()) {
+        // Do not lock for empty list.
+        return retired_entities_stat();
+    }
+
+    retired_entities_stat retired_stat;
+
+    auto now = dsn_now_ms();
+
+    utils::auto_write_lock l(_lock);
+
+    for (const auto &stale_entity : stale_entities) {
+        auto iter = _entities.find(stale_entity);
+        if (dsn_unlikely(iter == _entities.end())) {
+            // The entity has been removed from the registry for some unusual reason.
+            continue;
+        }
+
+        if (!iter->second->is_stale()) {
+            if (iter->second->_retire_time_ms > 0) {
+                // For those entities which are reemployed, their scheduled time for retirement
+                // should be reset to 0 though previously they could have been scheduled to be
+                // retired.
+                iter->second->_retire_time_ms = 0;
+                ++retired_stat.num_reemployed_entities;
+            }
+            continue;
+        }
+
+        if (dsn_unlikely(iter->second->_retire_time_ms > now)) {
+            // Since in collect_stale_entities() we've filtered the metrics which have been
+            // outside the retention interval, this is unlikely to happen. However, we still
+            // check here.
+            continue;
+        }
+
+        if (iter->second->_retire_time_ms == 0) {
+            // The entity should be marked with a scheduled time for retirement, since it has
+            // already been considered stale.
+            iter->second->_retire_time_ms = now + FLAGS_entity_retirement_delay_ms;
+            ++retired_stat.num_recently_scheduled_entities;
+            continue;
+        }
+
+        // Once the entity is outside the retention interval, retire it from the registry.
+        _entities.erase(iter);
+        ++retired_stat.num_retired_entities;
+    }
+
+    return retired_stat;
+}
+
+void metric_registry::process_stale_entities()
+{
+    LOG_INFO("begin to process stale metric entities");
+
+    const auto &collected_info = collect_stale_entities();
+    const auto &retired_stat = retire_stale_entities(collected_info.stale_entities);
+
+    LOG_INFO("stat for metric entities: total={}, collected={}, retired={}, scheduled={}, "
+             "recently_scheduled={}, reemployed={}",
+             collected_info.num_all_entities,
+             collected_info.stale_entities.size(),

Review Comment:
   OK, in comparison to `stale_entities`, `collected_entities` is more accurate.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@pegasus.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@pegasus.apache.org
For additional commands, e-mail: dev-help@pegasus.apache.org