Posted to issues@flink.apache.org by "StefanRRichter (via GitHub)" <gi...@apache.org> on 2024/01/03 15:45:06 UTC

Re: [PR] [FLINK-33775] Report JobInitialization traces [flink]

StefanRRichter commented on code in PR #23908:
URL: https://github.com/apache/flink/pull/23908#discussion_r1440576635


##########
flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointStatsTracker.java:
##########
@@ -155,7 +167,12 @@ public CheckpointStatsSnapshot createSnapshot() {
                                 counts.createSnapshot(),
                                 summary.createSnapshot(),
                                 history.createSnapshot(),
-                                latestRestoredCheckpoint);
+                                jobInitializationMetricsBuilder
+                                        .map(
+                                                JobInitializationMetricsBuilder
+                                                        ::buildRestoredCheckpointStats)
+                                        .orElse(Optional.empty())
+                                        .orElse(null));

Review Comment:
   Why call orElse twice?
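
   A minimal sketch of one possible simplification, assuming buildRestoredCheckpointStats returns an Optional (which the chained orElse calls suggest). The Builder class and its String payload are hypothetical stand-ins, not the Flink classes. flatMap collapses the nested Optional, so a single orElse(null) would be enough:

```java
import java.util.Optional;

final class FlatMapSketch {

    /** Hypothetical stand-in for JobInitializationMetricsBuilder; not the Flink class. */
    static final class Builder {
        private final String restoredStats; // null when nothing was restored

        Builder(String restoredStats) {
            this.restoredStats = restoredStats;
        }

        // Assumed to return an Optional, as the chained orElse calls in the diff imply.
        Optional<String> buildRestoredCheckpointStats() {
            return Optional.ofNullable(restoredStats);
        }
    }

    /** Shape used in the diff: map yields Optional<Optional<String>>, unwrapped twice. */
    static String nested(Optional<Builder> builder) {
        return builder.map(Builder::buildRestoredCheckpointStats)
                .orElse(Optional.empty())
                .orElse(null);
    }

    /** Equivalent shape: flatMap collapses the nesting, so one orElse is enough. */
    static String flat(Optional<Builder> builder) {
        return builder.flatMap(Builder::buildRestoredCheckpointStats).orElse(null);
    }

    public static void main(String[] args) {
        Optional<Builder> restored = Optional.of(new Builder("checkpoint-7"));
        Optional<Builder> missing = Optional.empty();
        System.out.println(nested(restored) + " / " + flat(restored));
        System.out.println(nested(missing) + " / " + flat(missing));
    }
}
```

   Both variants return the same value for a present and for an absent builder; only the unwrapping differs.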



##########
flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointStatsTracker.java:
##########
@@ -86,9 +88,10 @@ public class CheckpointStatsTracker {
 
     private final JobID jobID;
     private final MetricGroup metricGroup;
+    private int totalNumberOfSubTasks;
 
-    /** The latest restored checkpoint. */
-    @Nullable private RestoredCheckpointStats latestRestoredCheckpoint;
+    private Optional<JobInitializationMetricsBuilder> jobInitializationMetricsBuilder =

Review Comment:
   Why is this an Optional? I don't think Optional is even intended to be used for fields.
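
   A minimal sketch of the alternative, assuming the field simply stays nullable (as the removed latestRestoredCheckpoint field was) and Optional is used only as a return type. The class name, the accessor, and the String stand-in for the builder type are hypothetical:

```java
import java.util.Optional;

final class NullableFieldSketch {

    // Plain nullable field; null until initialization has started.
    // In Flink code this would typically carry a @Nullable annotation.
    private String metricsBuilder; // hypothetical stand-in for the builder type

    void reportInitializationStart(String builder) {
        this.metricsBuilder = builder;
    }

    // Optional appears only at the API boundary, as a return type.
    Optional<String> getMetricsBuilder() {
        return Optional.ofNullable(metricsBuilder);
    }

    public static void main(String[] args) {
        NullableFieldSketch tracker = new NullableFieldSketch();
        System.out.println(tracker.getMetricsBuilder().isPresent());
        tracker.reportInitializationStart("builder");
        System.out.println(tracker.getMetricsBuilder().isPresent());
    }
}
```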



##########
flink-streaming-java/src/main/java/org/apache/flink/streaming/runtime/tasks/StreamTask.java:
##########
@@ -730,57 +739,87 @@ void restoreInternal() throws Exception {
         getEnvironment().getMetricGroup().getIOMetricGroup().markTaskInitializationStarted();
         LOG.debug("Initializing {}.", getName());
 
-        operatorChain =
-                getEnvironment().getTaskStateManager().isTaskDeployedAsFinished()
-                        ? new FinishedOperatorChain<>(this, recordWriter)
-                        : new RegularOperatorChain<>(this, recordWriter);
-        mainOperator = operatorChain.getMainOperator();
+        SubTaskInitializationMetricsBuilder initializationMetrics =
+                new SubTaskInitializationMetricsBuilder(
+                        SystemClock.getInstance().absoluteTimeMillis());
+        try {
+            operatorChain =
+                    getEnvironment().getTaskStateManager().isTaskDeployedAsFinished()
+                            ? new FinishedOperatorChain<>(this, recordWriter)
+                            : new RegularOperatorChain<>(this, recordWriter);
+            mainOperator = operatorChain.getMainOperator();
 
-        getEnvironment()
-                .getTaskStateManager()
-                .getRestoreCheckpointId()
-                .ifPresent(restoreId -> latestReportCheckpointId = restoreId);
+            getEnvironment()

Review Comment:
   Revert formatting changes in this file?



##########
flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/JobInitializationMetricsBuilder.java:
##########
@@ -0,0 +1,142 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.runtime.checkpoint;
+
+import org.apache.flink.runtime.checkpoint.JobInitializationMetrics.SumMaxDuration;
+
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.util.ArrayList;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.Optional;
+
+import static org.apache.flink.runtime.checkpoint.JobInitializationMetrics.UNSET;
+import static org.apache.flink.util.Preconditions.checkArgument;
+import static org.apache.flink.util.Preconditions.checkState;
+
+class JobInitializationMetricsBuilder {
+    private static final Logger LOG =
+            LoggerFactory.getLogger(JobInitializationMetricsBuilder.class);
+
+    private final List<SubTaskInitializationMetrics> reportedMetrics = new ArrayList<>();
+    private final int totalNumberOfSubTasks;
+    private final long startTs;
+    private Optional<Long> stateSize = Optional.empty();

Review Comment:
   nit: Optional is being misused as a field type.



##########
flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointStatsTracker.java:
##########
@@ -343,6 +376,62 @@ public void reportIncompleteStats(
         }
     }
 
+    public void reportInitializationStartTs(long initializationStartTs) {
+        jobInitializationMetricsBuilder =
+                Optional.of(
+                        new JobInitializationMetricsBuilder(
+                                totalNumberOfSubTasks, initializationStartTs));
+    }
+
+    public void reportInitializationMetrics(SubTaskInitializationMetrics initializationMetrics) {
+        statsReadWriteLock.lock();
+        try {
+            if (!jobInitializationMetricsBuilder.isPresent()) {
+                LOG.warn(
+                        "Attempted to report SubTaskInitializationMetrics [{}] without jobInitializationMetricsBuilder present",
+                        initializationMetrics);
+                return;
+            }
+            jobInitializationMetricsBuilder
+                    .get()
+                    .reportInitializationMetrics(initializationMetrics);
+            if (jobInitializationMetricsBuilder.get().isComplete()) {
+                traceInitializationMetrics(jobInitializationMetricsBuilder.get().build());
+            }
+        } catch (Exception ex) {
+            LOG.warn("Fail to log SubTaskInitializationMetrics[{}]", ex, initializationMetrics);
+        } finally {
+            statsReadWriteLock.unlock();
+        }
+    }
+
+    private void traceInitializationMetrics(JobInitializationMetrics jobInitializationMetrics) {
+        SpanBuilder span =
+                Span.builder(CheckpointStatsTracker.class, "JobInitialization")
+                        .setStartTsMillis(jobInitializationMetrics.getStartTs())
+                        .setEndTsMillis(jobInitializationMetrics.getEndTs())
+                        .setAttribute(
+                                "initializationStatus",
+                                jobInitializationMetrics.getStatus().name());
+        for (JobInitializationMetrics.SumMaxDuration duration :
+                jobInitializationMetrics.getDurationMetrics().values()) {
+            setDurationSpanAttribute(span, duration);
+        }
+        if (jobInitializationMetrics.getCheckpointId() != JobInitializationMetrics.UNSET) {
+            span.setAttribute("checkpointId", jobInitializationMetrics.getCheckpointId());
+        }
+        if (jobInitializationMetrics.getStateSize() != JobInitializationMetrics.UNSET) {
+            span.setAttribute("fullSize", jobInitializationMetrics.getStateSize());
+        }
+        metricGroup.addSpan(span);
+    }
+
+    private void setDurationSpanAttribute(
+            SpanBuilder span, JobInitializationMetrics.SumMaxDuration duration) {
+        span.setAttribute("max" + duration.getName(), duration.getMax());
+        span.setAttribute("sum" + duration.getName(), duration.getSum());

Review Comment:
   Why max and sum? Arguably, max is the most important one; is sum even meaningful to us? If it is, why not also report avg or the unaggregated values?
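
   For illustration only, a sketch of how max, sum, and avg relate, using the JDK's LongSummaryStatistics over made-up per-subtask durations. The class name and the sample values are hypothetical and nothing here is taken from the PR:

```java
import java.util.LongSummaryStatistics;
import java.util.stream.LongStream;

final class DurationAggregatesSketch {

    /** Aggregates per-subtask durations (in milliseconds) in a single pass. */
    static LongSummaryStatistics aggregate(long... durationsMillis) {
        return LongStream.of(durationsMillis).summaryStatistics();
    }

    public static void main(String[] args) {
        // Hypothetical durations reported by three subtasks.
        LongSummaryStatistics stats = aggregate(100L, 250L, 400L);
        System.out.println("max=" + stats.getMax());
        System.out.println("sum=" + stats.getSum());
        // avg is sum / count, so it can be derived once the subtask count is known.
        System.out.println("avg=" + stats.getAverage());
    }
}
```

   Since avg follows from sum and the number of subtasks, reporting sum alongside max would let a consumer derive avg, provided the subtask count is also available.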



##########
flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointStatsTracker.java:
##########
@@ -343,6 +376,62 @@ public void reportIncompleteStats(
         }
     }
 
+    public void reportInitializationStartTs(long initializationStartTs) {
+        jobInitializationMetricsBuilder =
+                Optional.of(
+                        new JobInitializationMetricsBuilder(
+                                totalNumberOfSubTasks, initializationStartTs));
+    }
+
+    public void reportInitializationMetrics(SubTaskInitializationMetrics initializationMetrics) {
+        statsReadWriteLock.lock();
+        try {
+            if (!jobInitializationMetricsBuilder.isPresent()) {
+                LOG.warn(
+                        "Attempted to report SubTaskInitializationMetrics [{}] without jobInitializationMetricsBuilder present",
+                        initializationMetrics);
+                return;
+            }
+            jobInitializationMetricsBuilder
+                    .get()

Review Comment:
   nit: call get() once and keep the result in a local variable for further use?
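
   A minimal sketch of the suggestion. The Builder below is a hypothetical stand-in that only counts reports, and the Optional field mirrors the diff rather than recommending it:

```java
import java.util.Optional;

final class LocalVariableSketch {

    /** Hypothetical stand-in for JobInitializationMetricsBuilder. */
    static final class Builder {
        private final int expectedReports;
        private int reported;

        Builder(int expectedReports) {
            this.expectedReports = expectedReports;
        }

        void reportInitializationMetrics() {
            reported++;
        }

        boolean isComplete() {
            return reported >= expectedReports;
        }
    }

    private Optional<Builder> builderField = Optional.empty();

    void start(int expectedReports) {
        builderField = Optional.of(new Builder(expectedReports));
    }

    /** Returns true once all expected reports have arrived. */
    boolean report() {
        if (!builderField.isPresent()) {
            return false;
        }
        // Unwrap once; every later use goes through the local variable.
        Builder builder = builderField.get();
        builder.reportInitializationMetrics();
        return builder.isComplete();
    }

    public static void main(String[] args) {
        LocalVariableSketch tracker = new LocalVariableSketch();
        tracker.start(2);
        System.out.println(tracker.report());
        System.out.println(tracker.report());
    }
}
```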



##########
flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointStatsTracker.java:
##########
@@ -343,6 +376,62 @@ public void reportIncompleteStats(
         }
     }
 
+    public void reportInitializationStartTs(long initializationStartTs) {
+        jobInitializationMetricsBuilder =
+                Optional.of(
+                        new JobInitializationMetricsBuilder(
+                                totalNumberOfSubTasks, initializationStartTs));
+    }
+
+    public void reportInitializationMetrics(SubTaskInitializationMetrics initializationMetrics) {
+        statsReadWriteLock.lock();
+        try {
+            if (!jobInitializationMetricsBuilder.isPresent()) {
+                LOG.warn(
+                        "Attempted to report SubTaskInitializationMetrics [{}] without jobInitializationMetricsBuilder present",
+                        initializationMetrics);
+                return;
+            }
+            jobInitializationMetricsBuilder
+                    .get()
+                    .reportInitializationMetrics(initializationMetrics);
+            if (jobInitializationMetricsBuilder.get().isComplete()) {
+                traceInitializationMetrics(jobInitializationMetricsBuilder.get().build());
+            }
+        } catch (Exception ex) {
+            LOG.warn("Fail to log SubTaskInitializationMetrics[{}]", ex, initializationMetrics);

Review Comment:
   When that happens, would it make sense to clear the optional to avoid further reporting work? We will never get to completion for this checkpoint anymore.
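
   A minimal sketch of what clearing on failure could look like. The Builder interface, the method names, and the String metrics are hypothetical, and locking and logging are left out:

```java
import java.util.Optional;

final class ClearOnFailureSketch {

    /** Hypothetical stand-in for JobInitializationMetricsBuilder. */
    interface Builder {
        void reportInitializationMetrics(String metrics);
    }

    private Optional<Builder> builderField = Optional.empty();

    void start(Builder builder) {
        builderField = Optional.of(builder);
    }

    boolean isActive() {
        return builderField.isPresent();
    }

    /** Returns true if the metrics were handed to the builder. */
    boolean report(String metrics) {
        if (!builderField.isPresent()) {
            return false; // nothing to do; earlier failure or never started
        }
        try {
            builderField.get().reportInitializationMetrics(metrics);
            return true;
        } catch (RuntimeException ex) {
            // Completion can no longer be reached, so drop the builder
            // and let all later reports short-circuit above.
            builderField = Optional.empty();
            return false;
        }
    }

    public static void main(String[] args) {
        ClearOnFailureSketch tracker = new ClearOnFailureSketch();
        tracker.start(metrics -> {
            throw new IllegalStateException("simulated failure");
        });
        System.out.println(tracker.report("subtask-0"));
        System.out.println(tracker.isActive());
    }
}
```

   After the first failure the builder is gone, so later calls return early without touching it again.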



##########
flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/SubTaskInitializationMetricsBuilder.java:
##########
@@ -0,0 +1,72 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.runtime.checkpoint;
+
+import org.apache.flink.annotation.VisibleForTesting;
+
+import javax.annotation.concurrent.NotThreadSafe;
+
+import java.util.HashMap;
+import java.util.Map;
+
+/**
+ * A builder for {@link SubTaskInitializationMetrics}.
+ *
+ * <p>This class is not thread safe, but parts of it can actually be used from different threads.

Review Comment:
   Consider adding a usage example.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
