You are viewing a plain text version of this content. The canonical link for it is here.
Posted to notifications@skywalking.apache.org by wu...@apache.org on 2021/06/29 02:03:48 UTC

[skywalking] 01/01: Optimize IDs reading in the persistent worker.

This is an automated email from the ASF dual-hosted git repository.

wusheng pushed a commit to branch id-read-optimization
in repository https://gitbox.apache.org/repos/asf/skywalking.git

commit 6fff7f3cff88032eef5648b3a77c578b47286678
Author: Wu Sheng <wu...@foxmail.com>
AuthorDate: Tue Jun 29 10:03:30 2021 +0800

    Optimize IDs reading in the persistent worker.
---
 CHANGES.md                                         | 38 ++++++++++++++--------
 .../analysis/worker/MetricsPersistentWorker.java   | 28 ++++++++++++++--
 2 files changed, 49 insertions(+), 17 deletions(-)

diff --git a/CHANGES.md b/CHANGES.md
index 1fbb79b..c4a5cdd 100644
--- a/CHANGES.md
+++ b/CHANGES.md
@@ -4,39 +4,45 @@ Release Notes.
 
 8.7.0
 ------------------
+
 #### Project
+
 * Extract dependency management to a bom.
 * Add JDK 16 to test matrix.
 
 #### Java Agent
+
 * Supports modifying span attributes in async mode.
 * Agent supports the collection of JVM arguments and jar dependency information.
-* [Temporary] Support authentication for log report channel. This feature and grpc channel is going to be removed after Satellite 0.2.0 release.
-* Remove deprecated gRPC method, `io.grpc.ManagedChannelBuilder#nameResolverFactory`. See [gRPC-java 7133](https://github.com/grpc/grpc-java/issues/7133) for more details.
+* [Temporary] Support authentication for log report channel. This feature and grpc channel is going to be removed after
+  Satellite 0.2.0 release.
+* Remove deprecated gRPC method, `io.grpc.ManagedChannelBuilder#nameResolverFactory`.
+  See [gRPC-java 7133](https://github.com/grpc/grpc-java/issues/7133) for more details.
 * Add `Neo4j-4.x` plugin.
 * Correct `profile.duration` to `profile.max_duration` in the default `agent.config` file.
-* Fix the reponse time of gRPC.
+* Fix the response time of gRPC.
 
 #### OAP-Backend
+
 * Disable Spring sleuth meter analyzer by default.
 * Only count 5xx as error in Envoy ALS receiver.
 * Upgrade apollo core caused by CVE-2020-15170.
 * Upgrade kubernetes client caused by CVE-2020-28052.
 * Upgrade Elasticsearch 7 client caused by CVE-2020-7014.
-* Upgrade jackson related libs caused by CVE-2018-11307, CVE-2018-14718 ~ CVE-2018-14721, CVE-2018-19360 ~ CVE-2018-19362,
-   CVE-2019-14379, CVE-2019-14540, CVE-2019-14892, CVE-2019-14893, CVE-2019-16335, CVE-2019-16942, CVE-2019-16943,
-   CVE-2019-17267, CVE-2019-17531, CVE-2019-20330, CVE-2020-8840, CVE-2020-9546, CVE-2020-9547, CVE-2020-9548,
-   CVE-2018-12022, CVE-2018-12023, CVE-2019-12086, CVE-2019-14439, CVE-2020-10672, CVE-2020-10673, CVE-2020-10968,
-   CVE-2020-10969, CVE-2020-11111, CVE-2020-11112, CVE-2020-11113, CVE-2020-11619, CVE-2020-11620, CVE-2020-14060,
-   CVE-2020-14061, CVE-2020-14062, CVE-2020-14195, CVE-2020-24616, CVE-2020-24750, CVE-2020-25649, CVE-2020-35490,
-   CVE-2020-35491, CVE-2020-35728 and CVE-2020-36179 ~ CVE-2020-36190.
+* Upgrade jackson related libs caused by CVE-2018-11307, CVE-2018-14718 ~ CVE-2018-14721, CVE-2018-19360 ~
+  CVE-2018-19362, CVE-2019-14379, CVE-2019-14540, CVE-2019-14892, CVE-2019-14893, CVE-2019-16335, CVE-2019-16942,
+  CVE-2019-16943, CVE-2019-17267, CVE-2019-17531, CVE-2019-20330, CVE-2020-8840, CVE-2020-9546, CVE-2020-9547,
+  CVE-2020-9548, CVE-2018-12022, CVE-2018-12023, CVE-2019-12086, CVE-2019-14439, CVE-2020-10672, CVE-2020-10673,
+  CVE-2020-10968, CVE-2020-10969, CVE-2020-11111, CVE-2020-11112, CVE-2020-11113, CVE-2020-11619, CVE-2020-11620,
+  CVE-2020-14060, CVE-2020-14061, CVE-2020-14062, CVE-2020-14195, CVE-2020-24616, CVE-2020-24750, CVE-2020-25649,
+  CVE-2020-35490, CVE-2020-35491, CVE-2020-35728 and CVE-2020-36179 ~ CVE-2020-36190.
 * Exclude log4j 1.x caused by CVE-2019-17571.
 * Upgrade log4j 2.x caused by CVE-2020-9488.
 * Upgrade nacos libs caused by CVE-2021-29441 and CVE-2021-29442.
-* Upgrade netty caused by CVE-2019-20444, CVE-2019-20445, CVE-2019-16869, CVE-2020-11612, CVE-2021-21290, CVE-2021-21295 
-   and CVE-2021-21409.
+* Upgrade netty caused by CVE-2019-20444, CVE-2019-20445, CVE-2019-16869, CVE-2020-11612, CVE-2021-21290, CVE-2021-21295
+  and CVE-2021-21409.
 * Upgrade consul client caused by CVE-2018-1000844, CVE-2018-1000850.
-* Upgrade zookeeper caused by CVE-2019-0201. 
+* Upgrade zookeeper caused by CVE-2019-0201.
 * Upgrade snake yaml caused by CVE-2017-18640.
 * Upgrade embed tomcat caused by CVE-2020-13935.
 * Upgrade commons-lang3 to avoid potential NPE in some JDK versions.
@@ -45,8 +51,13 @@ Release Notes.
 * Fix CounterWindow increase computing issue.
 * Performance: optimize Envoy ALS analyzer performance in high traffic load scenario (reduce ~1cpu in ~10k RPS).
 * Performance: trim useless metadata fields in Envoy ALS metadata to improve performance.
+* Performance: enhance persistent session mechanism, by removing cache reloading for minute-level metrics. Reduce 30%
+  ElasticSearch ID-read traffic, tradeoff by tolerating metrics inaccurate when the cluster scales out and down.
+* Performance: enhance persistent session mechanism, about differentiating cache timeout for different dimensionality
+  metrics. The timeout of the cache for minute and hour level metrics has been prolonged to ~5 min. 
 
 #### UI
+
 * Fix the date component for log conditions.
 * Fix selector keys for duplicate options.
 * Add Python celery plugin.
@@ -55,7 +66,6 @@ Release Notes.
 
 #### Documentation
 
-
 All issues and pull requests are [here](https://github.com/apache/skywalking/milestone/90?closed=1)
 
 ------------------
diff --git a/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/analysis/worker/MetricsPersistentWorker.java b/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/analysis/worker/MetricsPersistentWorker.java
index 5195ac6..2ccdf12 100644
--- a/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/analysis/worker/MetricsPersistentWorker.java
+++ b/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/analysis/worker/MetricsPersistentWorker.java
@@ -51,6 +51,11 @@ import org.apache.skywalking.oap.server.telemetry.api.MetricsTag;
  */
 @Slf4j
 public class MetricsPersistentWorker extends PersistenceWorker<Metrics> {
+    /**
+     * The counter of MetricsPersistentWorker instance, to calculate session timeout offset.
+     */
+    private static long sessionTimeoutOffsetCounter = 0;
+
     private final Model model;
     private final Map<Metrics, Metrics> context;
     private final IMetricsDAO metricsDAO;
@@ -60,7 +65,9 @@ public class MetricsPersistentWorker extends PersistenceWorker<Metrics> {
     private final Optional<MetricsTransWorker> transWorker;
     private final boolean enableDatabaseSession;
     private final boolean supportUpdate;
+    private boolean isDownSampling;
     private CounterMetrics aggregationCounter;
+    private long sessionTimeout = 70_000; // Unit, ms. 70,000ms means more than one minute.
 
     MetricsPersistentWorker(ModuleDefineHolder moduleDefineHolder, Model model, IMetricsDAO metricsDAO,
                             AbstractWorker<Metrics> nextAlarmWorker, AbstractWorker<ExportEvent> nextExportWorker,
@@ -74,6 +81,7 @@ public class MetricsPersistentWorker extends PersistenceWorker<Metrics> {
         this.nextExportWorker = Optional.ofNullable(nextExportWorker);
         this.transWorker = Optional.ofNullable(transWorker);
         this.supportUpdate = supportUpdate;
+        this.isDownSampling = false;
 
         String name = "METRICS_L2_AGGREGATION";
         int size = BulkConsumePool.Creator.recommendMaxSize() / 8;
@@ -98,10 +106,11 @@ public class MetricsPersistentWorker extends PersistenceWorker<Metrics> {
             new MetricsTag.Keys("metricName", "level", "dimensionality"),
             new MetricsTag.Values(model.getName(), "2", model.getDownsampling().getName())
         );
+        sessionTimeoutOffsetCounter++;
     }
 
     /**
-     * Create the leaf MetricsPersistentWorker, no next step.
+     * Create the leaf and down-sampling MetricsPersistentWorker, no next step.
      */
     MetricsPersistentWorker(ModuleDefineHolder moduleDefineHolder, Model model, IMetricsDAO metricsDAO,
                             boolean enableDatabaseSession, boolean supportUpdate) {
@@ -109,6 +118,11 @@ public class MetricsPersistentWorker extends PersistenceWorker<Metrics> {
              null, null, null,
              enableDatabaseSession, supportUpdate
         );
+        this.isDownSampling = true;
+        // For a down-sampling metrics, we prolong the session timeout for 4 times, nearly 5 minutes.
+        // And add offset according to worker creation sequence, to avoid context clear overlap,
+        // eventually optimize load of IDs reading.
+        this.sessionTimeout = sessionTimeout * 4 + sessionTimeoutOffsetCounter * 200;
     }
 
     /**
@@ -217,6 +231,14 @@ public class MetricsPersistentWorker extends PersistenceWorker<Metrics> {
                 return;
             }
 
+            // If session is activated and this worker is about minute level metrics,
+            // we could skip `#multiGet` and trust cache,
+            // because in worst case, we override one time bucket metrics due to dirty-write in the cluster re-balancing case.
+            // In down-sampling cases(hour/day), the cache would be clear periodically to keep memory safe,
+            // then have to reload(multiGet) metrics from database.
+            if (enableDatabaseSession && !isDownSampling) {
+                return;
+            }
             final List<Metrics> dbMetrics = metricsDAO.multiGet(model, noInCacheMetrics);
             if (!enableDatabaseSession) {
                 // Clear the cache only after results from DB are returned successfully.
@@ -235,8 +257,8 @@ public class MetricsPersistentWorker extends PersistenceWorker<Metrics> {
             while (iterator.hasNext()) {
                 Metrics metrics = iterator.next();
                 metrics.extendSurvivalTime(tookTime);
-                // 70,000ms means more than one minute.
-                if (metrics.getSurvivalTime() > 70000) {
+
+                if (metrics.getSurvivalTime() > sessionTimeout) {
                     iterator.remove();
                 }
             }