Posted to notifications@skywalking.apache.org by wu...@apache.org on 2021/06/29 02:03:48 UTC
[skywalking] 01/01: Optimize IDs reading in the persistent worker.
This is an automated email from the ASF dual-hosted git repository.
wusheng pushed a commit to branch id-read-optimization
in repository https://gitbox.apache.org/repos/asf/skywalking.git
commit 6fff7f3cff88032eef5648b3a77c578b47286678
Author: Wu Sheng <wu...@foxmail.com>
AuthorDate: Tue Jun 29 10:03:30 2021 +0800
Optimize IDs reading in the persistent worker.
---
CHANGES.md | 38 ++++++++++++++--------
.../analysis/worker/MetricsPersistentWorker.java | 28 ++++++++++++++--
2 files changed, 49 insertions(+), 17 deletions(-)
diff --git a/CHANGES.md b/CHANGES.md
index 1fbb79b..c4a5cdd 100644
--- a/CHANGES.md
+++ b/CHANGES.md
@@ -4,39 +4,45 @@ Release Notes.
8.7.0
------------------
+
#### Project
+
* Extract dependency management to a bom.
* Add JDK 16 to test matrix.
#### Java Agent
+
* Supports modifying span attributes in async mode.
* Agent supports the collection of JVM arguments and jar dependency information.
-* [Temporary] Support authentication for log report channel. This feature and grpc channel is going to be removed after Satellite 0.2.0 release.
-* Remove deprecated gRPC method, `io.grpc.ManagedChannelBuilder#nameResolverFactory`. See [gRPC-java 7133](https://github.com/grpc/grpc-java/issues/7133) for more details.
+* [Temporary] Support authentication for log report channel. This feature and grpc channel is going to be removed after
+ Satellite 0.2.0 release.
+* Remove deprecated gRPC method, `io.grpc.ManagedChannelBuilder#nameResolverFactory`.
+ See [gRPC-java 7133](https://github.com/grpc/grpc-java/issues/7133) for more details.
* Add `Neo4j-4.x` plugin.
* Correct `profile.duration` to `profile.max_duration` in the default `agent.config` file.
-* Fix the reponse time of gRPC.
+* Fix the response time of gRPC.
#### OAP-Backend
+
* Disable Spring sleuth meter analyzer by default.
* Only count 5xx as error in Envoy ALS receiver.
* Upgrade apollo core caused by CVE-2020-15170.
* Upgrade kubernetes client caused by CVE-2020-28052.
* Upgrade Elasticsearch 7 client caused by CVE-2020-7014.
-* Upgrade jackson related libs caused by CVE-2018-11307, CVE-2018-14718 ~ CVE-2018-14721, CVE-2018-19360 ~ CVE-2018-19362,
- CVE-2019-14379, CVE-2019-14540, CVE-2019-14892, CVE-2019-14893, CVE-2019-16335, CVE-2019-16942, CVE-2019-16943,
- CVE-2019-17267, CVE-2019-17531, CVE-2019-20330, CVE-2020-8840, CVE-2020-9546, CVE-2020-9547, CVE-2020-9548,
- CVE-2018-12022, CVE-2018-12023, CVE-2019-12086, CVE-2019-14439, CVE-2020-10672, CVE-2020-10673, CVE-2020-10968,
- CVE-2020-10969, CVE-2020-11111, CVE-2020-11112, CVE-2020-11113, CVE-2020-11619, CVE-2020-11620, CVE-2020-14060,
- CVE-2020-14061, CVE-2020-14062, CVE-2020-14195, CVE-2020-24616, CVE-2020-24750, CVE-2020-25649, CVE-2020-35490,
- CVE-2020-35491, CVE-2020-35728 and CVE-2020-36179 ~ CVE-2020-36190.
+* Upgrade jackson related libs caused by CVE-2018-11307, CVE-2018-14718 ~ CVE-2018-14721, CVE-2018-19360 ~
+ CVE-2018-19362, CVE-2019-14379, CVE-2019-14540, CVE-2019-14892, CVE-2019-14893, CVE-2019-16335, CVE-2019-16942,
+ CVE-2019-16943, CVE-2019-17267, CVE-2019-17531, CVE-2019-20330, CVE-2020-8840, CVE-2020-9546, CVE-2020-9547,
+ CVE-2020-9548, CVE-2018-12022, CVE-2018-12023, CVE-2019-12086, CVE-2019-14439, CVE-2020-10672, CVE-2020-10673,
+ CVE-2020-10968, CVE-2020-10969, CVE-2020-11111, CVE-2020-11112, CVE-2020-11113, CVE-2020-11619, CVE-2020-11620,
+ CVE-2020-14060, CVE-2020-14061, CVE-2020-14062, CVE-2020-14195, CVE-2020-24616, CVE-2020-24750, CVE-2020-25649,
+ CVE-2020-35490, CVE-2020-35491, CVE-2020-35728 and CVE-2020-36179 ~ CVE-2020-36190.
* Exclude log4j 1.x caused by CVE-2019-17571.
* Upgrade log4j 2.x caused by CVE-2020-9488.
* Upgrade nacos libs caused by CVE-2021-29441 and CVE-2021-29442.
-* Upgrade netty caused by CVE-2019-20444, CVE-2019-20445, CVE-2019-16869, CVE-2020-11612, CVE-2021-21290, CVE-2021-21295
- and CVE-2021-21409.
+* Upgrade netty caused by CVE-2019-20444, CVE-2019-20445, CVE-2019-16869, CVE-2020-11612, CVE-2021-21290, CVE-2021-21295
+ and CVE-2021-21409.
* Upgrade consul client caused by CVE-2018-1000844, CVE-2018-1000850.
-* Upgrade zookeeper caused by CVE-2019-0201.
+* Upgrade zookeeper caused by CVE-2019-0201.
* Upgrade snake yaml caused by CVE-2017-18640.
* Upgrade embed tomcat caused by CVE-2020-13935.
* Upgrade commons-lang3 to avoid potential NPE in some JDK versions.
@@ -45,8 +51,13 @@ Release Notes.
* Fix CounterWindow increase computing issue.
* Performance: optimize Envoy ALS analyzer performance in high traffic load scenario (reduce ~1cpu in ~10k RPS).
* Performance: trim useless metadata fields in Envoy ALS metadata to improve performance.
+* Performance: enhance the persistent session mechanism by removing cache reloading for minute-level metrics. This reduces
+ ElasticSearch ID-read traffic by 30%, at the cost of tolerating inaccurate metrics when the cluster scales out or in.
+* Performance: enhance the persistent session mechanism by differentiating the cache timeout for metrics of different
+ dimensionality. The cache timeout for down-sampling (hour and day level) metrics has been prolonged to ~5 min.
#### UI
+
* Fix the date component for log conditions.
* Fix selector keys for duplicate options.
* Add Python celery plugin.
@@ -55,7 +66,6 @@ Release Notes.
#### Documentation
-
All issues and pull requests are [here](https://github.com/apache/skywalking/milestone/90?closed=1)
------------------
diff --git a/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/analysis/worker/MetricsPersistentWorker.java b/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/analysis/worker/MetricsPersistentWorker.java
index 5195ac6..2ccdf12 100644
--- a/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/analysis/worker/MetricsPersistentWorker.java
+++ b/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/analysis/worker/MetricsPersistentWorker.java
@@ -51,6 +51,11 @@ import org.apache.skywalking.oap.server.telemetry.api.MetricsTag;
*/
@Slf4j
public class MetricsPersistentWorker extends PersistenceWorker<Metrics> {
+ /**
+ * Counts MetricsPersistentWorker instances; used to calculate the session timeout offset.
+ */
+ private static long sessionTimeoutOffsetCounter = 0;
+
private final Model model;
private final Map<Metrics, Metrics> context;
private final IMetricsDAO metricsDAO;
@@ -60,7 +65,9 @@ public class MetricsPersistentWorker extends PersistenceWorker<Metrics> {
private final Optional<MetricsTransWorker> transWorker;
private final boolean enableDatabaseSession;
private final boolean supportUpdate;
+ private boolean isDownSampling;
private CounterMetrics aggregationCounter;
+ private long sessionTimeout = 70_000; // Unit, ms. 70,000ms means more than one minute.
MetricsPersistentWorker(ModuleDefineHolder moduleDefineHolder, Model model, IMetricsDAO metricsDAO,
AbstractWorker<Metrics> nextAlarmWorker, AbstractWorker<ExportEvent> nextExportWorker,
@@ -74,6 +81,7 @@ public class MetricsPersistentWorker extends PersistenceWorker<Metrics> {
this.nextExportWorker = Optional.ofNullable(nextExportWorker);
this.transWorker = Optional.ofNullable(transWorker);
this.supportUpdate = supportUpdate;
+ this.isDownSampling = false;
String name = "METRICS_L2_AGGREGATION";
int size = BulkConsumePool.Creator.recommendMaxSize() / 8;
@@ -98,10 +106,11 @@ public class MetricsPersistentWorker extends PersistenceWorker<Metrics> {
new MetricsTag.Keys("metricName", "level", "dimensionality"),
new MetricsTag.Values(model.getName(), "2", model.getDownsampling().getName())
);
+ sessionTimeoutOffsetCounter++;
}
/**
- * Create the leaf MetricsPersistentWorker, no next step.
+ * Create the leaf and down-sampling MetricsPersistentWorker, no next step.
*/
MetricsPersistentWorker(ModuleDefineHolder moduleDefineHolder, Model model, IMetricsDAO metricsDAO,
boolean enableDatabaseSession, boolean supportUpdate) {
@@ -109,6 +118,11 @@ public class MetricsPersistentWorker extends PersistenceWorker<Metrics> {
null, null, null,
enableDatabaseSession, supportUpdate
);
+ this.isDownSampling = true;
+ // For down-sampling metrics, we prolong the session timeout to 4 times the default, nearly 5 minutes.
+ // We also add an offset according to the worker creation sequence, to avoid overlapping context clears
+ // and eventually reduce the load of IDs reading.
+ this.sessionTimeout = sessionTimeout * 4 + sessionTimeoutOffsetCounter * 200;
}
/**
@@ -217,6 +231,14 @@ public class MetricsPersistentWorker extends PersistenceWorker<Metrics> {
return;
}
+ // If the session is activated and this worker handles minute-level metrics,
+ // we can skip `#multiGet` and trust the cache,
+ // because in the worst case, we override the metrics of one time bucket due to a dirty write during cluster re-balancing.
+ // In down-sampling cases (hour/day), the cache is cleared periodically to keep memory safe,
+ // and the metrics then have to be reloaded (multiGet) from the database.
+ if (enableDatabaseSession && !isDownSampling) {
+ return;
+ }
final List<Metrics> dbMetrics = metricsDAO.multiGet(model, noInCacheMetrics);
if (!enableDatabaseSession) {
// Clear the cache only after results from DB are returned successfully.
@@ -235,8 +257,8 @@ public class MetricsPersistentWorker extends PersistenceWorker<Metrics> {
while (iterator.hasNext()) {
Metrics metrics = iterator.next();
metrics.extendSurvivalTime(tookTime);
- // 70,000ms means more than one minute.
- if (metrics.getSurvivalTime() > 70000) {
+
+ if (metrics.getSurvivalTime() > sessionTimeout) {
iterator.remove();
}
}
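
For readers following the hunks above, the three behaviours this commit introduces can be sketched in isolation: the staggered session timeout for down-sampling workers, the decision to skip the database read, and the survival-time eviction loop. This is a minimal standalone sketch, not the SkyWalking implementation. The class `SessionTimeoutSketch`, the `CachedEntry` type, and all method names are hypothetical stand-ins; only the constants (70,000 ms base, factor 4, 200 ms per worker) and the conditions are taken from the diff.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class SessionTimeoutSketch {
    // Values taken from the diff; the constant names are assumptions.
    static final long BASE_TIMEOUT_MS = 70_000;
    static final long OFFSET_STEP_MS = 200;

    /** Timeout for a down-sampling worker, given the worker counter value at creation time. */
    static long downSamplingTimeout(long workerCounter) {
        return BASE_TIMEOUT_MS * 4 + workerCounter * OFFSET_STEP_MS;
    }

    /** Mirrors the early return in the diff: with a session enabled, only down-sampling workers reload. */
    static boolean shouldReloadFromDatabase(boolean enableDatabaseSession, boolean isDownSampling) {
        return !(enableDatabaseSession && !isDownSampling);
    }

    /** Hypothetical stand-in for a cached metrics entry. */
    static final class CachedEntry {
        final String id;
        long survivalTimeMs;

        CachedEntry(String id, long survivalTimeMs) {
            this.id = id;
            this.survivalTimeMs = survivalTimeMs;
        }
    }

    /** Ages every entry by tookTimeMs and removes those older than timeoutMs; returns the removal count. */
    static int evictExpired(List<CachedEntry> cache, long tookTimeMs, long timeoutMs) {
        int removed = 0;
        Iterator<CachedEntry> iterator = cache.iterator();
        while (iterator.hasNext()) {
            CachedEntry entry = iterator.next();
            entry.survivalTimeMs += tookTimeMs;
            if (entry.survivalTimeMs > timeoutMs) {
                iterator.remove();
                removed++;
            }
        }
        return removed;
    }

    public static void main(String[] args) {
        // Consecutive workers get timeouts 200 ms apart, so their caches do not all expire together.
        System.out.println(downSamplingTimeout(1)); // 280200
        System.out.println(downSamplingTimeout(2)); // 280400

        List<CachedEntry> cache = new ArrayList<>();
        cache.add(new CachedEntry("fresh", 0));
        cache.add(new CachedEntry("stale", 69_000));
        System.out.println(evictExpired(cache, 2_000, BASE_TIMEOUT_MS)); // 1
    }
}
```

Note that in the diff the counter is incremented for every worker created, not only down-sampling ones, so the offsets between two down-sampling workers may be larger than 200 ms.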