You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@flink.apache.org by GitBox <gi...@apache.org> on 2020/08/10 21:16:00 UTC

[GitHub] [flink] XComp opened a new pull request #13111: [FLINK-18220][runtime] Extended OutOfMemoryError message.

XComp opened a new pull request #13111:
URL: https://github.com/apache/flink/pull/13111


   ## What is the purpose of the change
   
   * Adding meaningful information to the error message of an OutOfMemoryError that happens while initializing or finalizing the an OutputFormat on the master.
   
   ## Brief change log
   
   * Added OOM-enrichment to InputOutputFormatVertex.initializeOnMaster and InputOutputFormatVertex.finalizeOnMaster
   * Adapted signature of JobVertex.finalizeOnMaster and JobVertex.initializeOnMaster to indicate that not only Exceptions might occur.
   * ClusterEntrypointExceptionUtils and its method was made public
   * JavaDoc was fixed.
   
   
   ## Verifying this change
   
   This change is a trivial rework / code cleanup without any test coverage.
   
   ## Does this pull request potentially affect one of the following parts:
   
     - Dependencies (does it add or upgrade a dependency): no
     - The public API, i.e., is any changed class annotated with `@Public(Evolving)`: no
     - The serializers: no
     - The runtime per-record code paths (performance sensitive): no
     - Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn/Mesos, ZooKeeper: no
     - The S3 file system connector: no
   
   ## Documentation
   
     - Does this pull request introduce a new feature? no
     - If yes, how is the feature documented? not applicable
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] flinkbot edited a comment on pull request #13111: [FLINK-18220][runtime] Extended OutOfMemoryError message.

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #13111:
URL: https://github.com/apache/flink/pull/13111#issuecomment-671602964


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "8401a49756640fa070b5d7a15fb29bb6837f5a99",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=5374",
       "triggerID" : "8401a49756640fa070b5d7a15fb29bb6837f5a99",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 8401a49756640fa070b5d7a15fb29bb6837f5a99 Azure: [SUCCESS](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=5374) 
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run travis` re-run the last Travis build
    - `@flinkbot run azure` re-run the last Azure build
   </details>


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] flinkbot edited a comment on pull request #13111: [FLINK-18220][runtime] Extended OutOfMemoryError message.

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #13111:
URL: https://github.com/apache/flink/pull/13111#issuecomment-671602964


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "8401a49756640fa070b5d7a15fb29bb6837f5a99",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=5374",
       "triggerID" : "8401a49756640fa070b5d7a15fb29bb6837f5a99",
       "triggerType" : "PUSH"
     }, {
       "hash" : "2c3cd34c2804e5366146c6fc8de541f996d7972f",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=5459",
       "triggerID" : "2c3cd34c2804e5366146c6fc8de541f996d7972f",
       "triggerType" : "PUSH"
     }, {
       "hash" : "012e50a23dbecc06bc06c537cbabf154eb935236",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=5463",
       "triggerID" : "012e50a23dbecc06bc06c537cbabf154eb935236",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 2c3cd34c2804e5366146c6fc8de541f996d7972f Azure: [FAILURE](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=5459) 
   * 012e50a23dbecc06bc06c537cbabf154eb935236 Azure: [PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=5463) 
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run travis` re-run the last Travis build
    - `@flinkbot run azure` re-run the last Azure build
   </details>


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] flinkbot edited a comment on pull request #13111: [FLINK-18220][runtime] Extended OutOfMemoryError message.

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #13111:
URL: https://github.com/apache/flink/pull/13111#issuecomment-671602964


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "8401a49756640fa070b5d7a15fb29bb6837f5a99",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=5374",
       "triggerID" : "8401a49756640fa070b5d7a15fb29bb6837f5a99",
       "triggerType" : "PUSH"
     }, {
       "hash" : "2c3cd34c2804e5366146c6fc8de541f996d7972f",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=5459",
       "triggerID" : "2c3cd34c2804e5366146c6fc8de541f996d7972f",
       "triggerType" : "PUSH"
     }, {
       "hash" : "012e50a23dbecc06bc06c537cbabf154eb935236",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=5463",
       "triggerID" : "012e50a23dbecc06bc06c537cbabf154eb935236",
       "triggerType" : "PUSH"
     }, {
       "hash" : "e575fa045556a0420c79b58f6c0c5fc37b7b937e",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=5513",
       "triggerID" : "e575fa045556a0420c79b58f6c0c5fc37b7b937e",
       "triggerType" : "PUSH"
     }, {
       "hash" : "ed988e8063763917fcb66436f509415eb2cbf174",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=5536",
       "triggerID" : "ed988e8063763917fcb66436f509415eb2cbf174",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * e575fa045556a0420c79b58f6c0c5fc37b7b937e Azure: [FAILURE](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=5513) 
   * ed988e8063763917fcb66436f509415eb2cbf174 Azure: [PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=5536) 
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run travis` re-run the last Travis build
    - `@flinkbot run azure` re-run the last Azure build
   </details>


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] flinkbot edited a comment on pull request #13111: [FLINK-18220][runtime] Extended OutOfMemoryError message.

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #13111:
URL: https://github.com/apache/flink/pull/13111#issuecomment-671602964


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "8401a49756640fa070b5d7a15fb29bb6837f5a99",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=5374",
       "triggerID" : "8401a49756640fa070b5d7a15fb29bb6837f5a99",
       "triggerType" : "PUSH"
     }, {
       "hash" : "2c3cd34c2804e5366146c6fc8de541f996d7972f",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=5459",
       "triggerID" : "2c3cd34c2804e5366146c6fc8de541f996d7972f",
       "triggerType" : "PUSH"
     }, {
       "hash" : "012e50a23dbecc06bc06c537cbabf154eb935236",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=5463",
       "triggerID" : "012e50a23dbecc06bc06c537cbabf154eb935236",
       "triggerType" : "PUSH"
     }, {
       "hash" : "e575fa045556a0420c79b58f6c0c5fc37b7b937e",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=5513",
       "triggerID" : "e575fa045556a0420c79b58f6c0c5fc37b7b937e",
       "triggerType" : "PUSH"
     }, {
       "hash" : "ed988e8063763917fcb66436f509415eb2cbf174",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=5536",
       "triggerID" : "ed988e8063763917fcb66436f509415eb2cbf174",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * ed988e8063763917fcb66436f509415eb2cbf174 Azure: [SUCCESS](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=5536) 
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run travis` re-run the last Travis build
    - `@flinkbot run azure` re-run the last Azure build
   </details>


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] tillrohrmann commented on a change in pull request #13111: [FLINK-18220][runtime] Extended OutOfMemoryError message.

Posted by GitBox <gi...@apache.org>.
tillrohrmann commented on a change in pull request #13111:
URL: https://github.com/apache/flink/pull/13111#discussion_r468392375



##########
File path: flink-runtime/src/main/java/org/apache/flink/runtime/jobgraph/InputOutputFormatVertex.java
##########
@@ -24,6 +24,7 @@
 import org.apache.flink.api.common.io.OutputFormat;
 import org.apache.flink.api.common.operators.util.UserCodeWrapper;
 import org.apache.flink.runtime.OperatorIDPair;
+import org.apache.flink.runtime.entrypoint.ClusterEntryPointExceptionUtils;

Review comment:
       It feels wrong that a `Vertex` knows anything from the `o.a.f.runtime.entrypoint` package.

##########
File path: flink-runtime/src/main/java/org/apache/flink/runtime/entrypoint/ClusterEntryPointExceptionUtils.java
##########
@@ -61,7 +61,7 @@ private ClusterEntryPointExceptionUtils() {
 	 * {@code null} if the argument was {@code null}
 	 */
 	@Nullable
-	static Throwable tryEnrichClusterEntryPointError(@Nullable Throwable exception) {
+	public static Throwable tryEnrichClusterEntryPointError(@Nullable Throwable exception) {

Review comment:
       I think that this method won't add a proper error message in case of an out of heap space error.

##########
File path: flink-runtime/src/main/java/org/apache/flink/runtime/entrypoint/ClusterEntryPointExceptionUtils.java
##########
@@ -25,9 +25,9 @@
 import static org.apache.flink.util.ExceptionUtils.tryEnrichOutOfMemoryError;
 
 /**
- * Exception utils to handle and enrich exceptions occurring in TaskManager.
+ * Exception utils to handle and enrich exceptions occurring in the JobManager.

Review comment:
       ```suggestion
    * Exception utils to handle and enrich exceptions occurring in the ClusterEntrypoint.
   ```

##########
File path: flink-runtime/src/main/java/org/apache/flink/runtime/jobgraph/InputOutputFormatVertex.java
##########
@@ -99,15 +100,16 @@ public void initializeOnMaster(ClassLoader loader) throws Exception {
 					((InitializeOnMaster) outputFormat).initializeGlobal(getParallelism());
 				}
 			}
-
+		} catch (Throwable t) {
+			throw ClusterEntryPointExceptionUtils.tryEnrichClusterEntryPointError(t);

Review comment:
       I think catching the exception here and enriching it is too specific. That way we will leave all other places where this can occur out. I believe that a better approach is to do it where most of the code paths originate from. One idea could be the generation of the `JobManagerRunnerImpl` in the `Dispatcher` and its `resultFuture`. This is the point which will be passed by job submissions as well as job recoveries.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] XComp commented on a change in pull request #13111: [FLINK-18220][runtime] Extended OutOfMemoryError message.

Posted by GitBox <gi...@apache.org>.
XComp commented on a change in pull request #13111:
URL: https://github.com/apache/flink/pull/13111#discussion_r469872339



##########
File path: flink-core/src/main/java/org/apache/flink/util/ExceptionUtils.java
##########
@@ -110,51 +112,76 @@ public static boolean isJvmFatalOrOutOfMemoryError(Throwable t) {
 	}
 
 	/**
-	 * Tries to enrich the passed exception with additional information.
-	 *
-	 * <p>This method improves error message for direct and metaspace {@link OutOfMemoryError}.
-	 * It adds description of possible causes and ways of resolution.
-	 *
-	 * @param exception exception to enrich if not {@code null}
-	 * @return the enriched exception or the original if no additional information could be added;
-	 * {@code null} if the argument was {@code null}
+	 * Tries to enrich OutOfMemoryErrors being part of the passed root Throwable's cause tree.
+	 *
+	 * <p>This method improves error messages for direct and metaspace {@link OutOfMemoryError}.
+	 * It adds description about the possible causes and ways of resolution.
+	 *
+	 * @param root The Throwable of which the cause tree shall be traversed.
+	 * @param jvmMetaspaceOomNewErrorMessage The message being used for JVM metaspace-related OutOfMemoryErrors. Passing
+	 *                                       <code>null</code> will disable handling this class of error.
+	 * @param jvmDirectOomNewErrorMessage The message being used for direct memory-related OutOfMemoryErrors. Passing
+	 *                                    <code>null</code> will disable handling this class of error.
+	 * @param jvmHeapSpaceOomNewErrorMessage The message being used for Heap space-related OutOfMemoryErrors. Passing
+	 *                                       <code>null</code> will disable handling this class of error.
 	 */
-	@Nullable
-	public static Throwable tryEnrichOutOfMemoryError(
-			@Nullable Throwable exception,
-			String jvmMetaspaceOomNewErrorMessage,
-			String jvmDirectOomNewErrorMessage) {
-		boolean isOom = exception instanceof OutOfMemoryError;
-		if (!isOom) {
-			return exception;
-		}
-
-		OutOfMemoryError oom = (OutOfMemoryError) exception;
-		if (isMetaspaceOutOfMemoryError(oom)) {
-			return changeOutOfMemoryErrorMessage(oom, jvmMetaspaceOomNewErrorMessage);
-		} else if (isDirectOutOfMemoryError(oom)) {
-			return changeOutOfMemoryErrorMessage(oom, jvmDirectOomNewErrorMessage);
-		}
+	public static void tryEnrichOutOfMemoryError(
+			@Nullable Throwable root,
+			@Nullable String jvmMetaspaceOomNewErrorMessage,
+			@Nullable String jvmDirectOomNewErrorMessage,
+			@Nullable String jvmHeapSpaceOomNewErrorMessage) {
+		updateDetailMessage(root, t -> {
+			if (isMetaspaceOutOfMemoryError(t)) {
+				return jvmMetaspaceOomNewErrorMessage;
+			} else if (isDirectOutOfMemoryError(t)) {
+				return jvmDirectOomNewErrorMessage;
+			} else if (isHeapSpaceOutOfMemoryError(t)) {
+				return jvmHeapSpaceOomNewErrorMessage;
+			}
 
-		return oom;
+			return null;
+		});
 	}
 
 	/**
-	 * Rewrites the error message of a {@link OutOfMemoryError}.
+	 * Updates the error message of the first Throwable appearing in the cause tree of the passed root Throwable and
+	 * matching the passed Predicate by traversing the cause tree top-to-bottom.
 	 *
-	 * @param oom original {@link OutOfMemoryError}
-	 * @param newMessage new error message
-	 * @return the origianl {@link OutOfMemoryError} if it already has the new error message or
-	 * a new {@link OutOfMemoryError} with the new error message
+	 * @param root The Throwable whose cause tree shall be traversed.
+	 * @param throwableToMessage The Function based on which the new messages are generated. The function implementation
+	 *                           should return the new message. Returning <code>null</code>, in contrast, will result in
+	 *                           not updating the message for the corresponding Throwable.
 	 */
-	private static OutOfMemoryError changeOutOfMemoryErrorMessage(OutOfMemoryError oom, String newMessage) {
-		if (oom.getMessage().equals(newMessage)) {
-			return oom;
+	public static void updateDetailMessage(Throwable root, Function<Throwable, String> throwableToMessage) {
+		if (throwableToMessage == null) {
+			return;
+		}
+
+		Throwable it = root;
+		while (it != null) {
+			String newMessage = throwableToMessage.apply(it);
+			if (newMessage != null) {
+				updateDetailMessageOfThrowable(it, newMessage);
+			}
+
+			it = it.getCause();
+		}

Review comment:
       Good catch - I updated the parameter description but forgot about the method description. 👍 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] tillrohrmann commented on a change in pull request #13111: [FLINK-18220][runtime] Extended OutOfMemoryError message.

Posted by GitBox <gi...@apache.org>.
tillrohrmann commented on a change in pull request #13111:
URL: https://github.com/apache/flink/pull/13111#discussion_r470588260



##########
File path: flink-runtime/src/test/java/org/apache/flink/runtime/entrypoint/ClusterEntryPointExceptionUtilsTest.java
##########
@@ -0,0 +1,48 @@
+package org.apache.flink.runtime.entrypoint;

Review comment:
       This file misses a license header:
   
   ```
   /*
    * Licensed to the Apache Software Foundation (ASF) under one
    * or more contributor license agreements.  See the NOTICE file
    * distributed with this work for additional information
    * regarding copyright ownership.  The ASF licenses this file
    * to you under the Apache License, Version 2.0 (the
    * "License"); you may not use this file except in compliance
    * with the License.  You may obtain a copy of the License at
    *
    *     http://www.apache.org/licenses/LICENSE-2.0
    *
    * Unless required by applicable law or agreed to in writing, software
    * distributed under the License is distributed on an "AS IS" BASIS,
    * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    * See the License for the specific language governing permissions and
    * limitations under the License.
    */
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] flinkbot edited a comment on pull request #13111: [FLINK-18220][runtime] Extended OutOfMemoryError message.

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #13111:
URL: https://github.com/apache/flink/pull/13111#issuecomment-671602964


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "8401a49756640fa070b5d7a15fb29bb6837f5a99",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=5374",
       "triggerID" : "8401a49756640fa070b5d7a15fb29bb6837f5a99",
       "triggerType" : "PUSH"
     }, {
       "hash" : "2c3cd34c2804e5366146c6fc8de541f996d7972f",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=5459",
       "triggerID" : "2c3cd34c2804e5366146c6fc8de541f996d7972f",
       "triggerType" : "PUSH"
     }, {
       "hash" : "012e50a23dbecc06bc06c537cbabf154eb935236",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=5463",
       "triggerID" : "012e50a23dbecc06bc06c537cbabf154eb935236",
       "triggerType" : "PUSH"
     }, {
       "hash" : "e575fa045556a0420c79b58f6c0c5fc37b7b937e",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=5513",
       "triggerID" : "e575fa045556a0420c79b58f6c0c5fc37b7b937e",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * e575fa045556a0420c79b58f6c0c5fc37b7b937e Azure: [FAILURE](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=5513) 
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run travis` re-run the last Travis build
    - `@flinkbot run azure` re-run the last Azure build
   </details>


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] flinkbot edited a comment on pull request #13111: [FLINK-18220][runtime] Extended OutOfMemoryError message.

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #13111:
URL: https://github.com/apache/flink/pull/13111#issuecomment-671602964


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "8401a49756640fa070b5d7a15fb29bb6837f5a99",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=5374",
       "triggerID" : "8401a49756640fa070b5d7a15fb29bb6837f5a99",
       "triggerType" : "PUSH"
     }, {
       "hash" : "2c3cd34c2804e5366146c6fc8de541f996d7972f",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=5459",
       "triggerID" : "2c3cd34c2804e5366146c6fc8de541f996d7972f",
       "triggerType" : "PUSH"
     }, {
       "hash" : "012e50a23dbecc06bc06c537cbabf154eb935236",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "012e50a23dbecc06bc06c537cbabf154eb935236",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 2c3cd34c2804e5366146c6fc8de541f996d7972f Azure: [FAILURE](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=5459) 
   * 012e50a23dbecc06bc06c537cbabf154eb935236 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run travis` re-run the last Travis build
    - `@flinkbot run azure` re-run the last Azure build
   </details>


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] flinkbot edited a comment on pull request #13111: [FLINK-18220][runtime] Extended OutOfMemoryError message.

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #13111:
URL: https://github.com/apache/flink/pull/13111#issuecomment-671602964


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "8401a49756640fa070b5d7a15fb29bb6837f5a99",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=5374",
       "triggerID" : "8401a49756640fa070b5d7a15fb29bb6837f5a99",
       "triggerType" : "PUSH"
     }, {
       "hash" : "2c3cd34c2804e5366146c6fc8de541f996d7972f",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=5459",
       "triggerID" : "2c3cd34c2804e5366146c6fc8de541f996d7972f",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 2c3cd34c2804e5366146c6fc8de541f996d7972f Azure: [FAILURE](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=5459) 
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run travis` re-run the last Travis build
    - `@flinkbot run azure` re-run the last Azure build
   </details>


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] flinkbot edited a comment on pull request #13111: [FLINK-18220][runtime] Extended OutOfMemoryError message.

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #13111:
URL: https://github.com/apache/flink/pull/13111#issuecomment-671602964


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "8401a49756640fa070b5d7a15fb29bb6837f5a99",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=5374",
       "triggerID" : "8401a49756640fa070b5d7a15fb29bb6837f5a99",
       "triggerType" : "PUSH"
     }, {
       "hash" : "2c3cd34c2804e5366146c6fc8de541f996d7972f",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=5459",
       "triggerID" : "2c3cd34c2804e5366146c6fc8de541f996d7972f",
       "triggerType" : "PUSH"
     }, {
       "hash" : "012e50a23dbecc06bc06c537cbabf154eb935236",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=5463",
       "triggerID" : "012e50a23dbecc06bc06c537cbabf154eb935236",
       "triggerType" : "PUSH"
     }, {
       "hash" : "e575fa045556a0420c79b58f6c0c5fc37b7b937e",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=5513",
       "triggerID" : "e575fa045556a0420c79b58f6c0c5fc37b7b937e",
       "triggerType" : "PUSH"
     }, {
       "hash" : "ed988e8063763917fcb66436f509415eb2cbf174",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "ed988e8063763917fcb66436f509415eb2cbf174",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * e575fa045556a0420c79b58f6c0c5fc37b7b937e Azure: [FAILURE](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=5513) 
   * ed988e8063763917fcb66436f509415eb2cbf174 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run travis` re-run the last Travis build
    - `@flinkbot run azure` re-run the last Azure build
   </details>


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] XComp commented on pull request #13111: [FLINK-18220][runtime] Extended OutOfMemoryError message.

Posted by GitBox <gi...@apache.org>.
XComp commented on pull request #13111:
URL: https://github.com/apache/flink/pull/13111#issuecomment-671594512


   @tillrohrmann The PR is ready to be reviewed.
   I did a bit more than what was requested in FLINK-18220: Besides finalizing, I also added the enrichment method when initializing the OutputFormat. Is this ok or too much?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] tillrohrmann closed pull request #13111: [FLINK-18220][runtime] Extended OutOfMemoryError message.

Posted by GitBox <gi...@apache.org>.
tillrohrmann closed pull request #13111:
URL: https://github.com/apache/flink/pull/13111


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] tillrohrmann commented on a change in pull request #13111: [FLINK-18220][runtime] Extended OutOfMemoryError message.

Posted by GitBox <gi...@apache.org>.
tillrohrmann commented on a change in pull request #13111:
URL: https://github.com/apache/flink/pull/13111#discussion_r469917288



##########
File path: flink-runtime/src/main/java/org/apache/flink/runtime/jobmaster/JobManagerRunnerImpl.java
##########
@@ -137,7 +138,12 @@ public JobManagerRunnerImpl(
 		this.leaderGatewayFuture = new CompletableFuture<>();
 
 		// now start the JobManager
-		this.jobMasterService = jobMasterFactory.createJobMasterService(jobGraph, this, userCodeLoader);
+		try {
+			this.jobMasterService = jobMasterFactory.createJobMasterService(jobGraph, this, userCodeLoader);
+		} catch (Throwable t) {
+			ClusterEntryPointExceptionUtils.tryEnrichClusterEntryPointError(t);
+			throw t;
+		}

Review comment:
       `Dispatcher.runRecoveredJob` will call `handleRecoveredJobStartError` if there is an error. This method will call the `ClusterEntrypoint's` fatal error handler which should already do the enrichment.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] flinkbot edited a comment on pull request #13111: [FLINK-18220][runtime] Extended OutOfMemoryError message.

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #13111:
URL: https://github.com/apache/flink/pull/13111#issuecomment-671602964


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "8401a49756640fa070b5d7a15fb29bb6837f5a99",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=5374",
       "triggerID" : "8401a49756640fa070b5d7a15fb29bb6837f5a99",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 8401a49756640fa070b5d7a15fb29bb6837f5a99 Azure: [PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=5374) 
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run travis` re-run the last Travis build
    - `@flinkbot run azure` re-run the last Azure build
   </details>


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] XComp commented on a change in pull request #13111: [FLINK-18220][runtime] Extended OutOfMemoryError message.

Posted by GitBox <gi...@apache.org>.
XComp commented on a change in pull request #13111:
URL: https://github.com/apache/flink/pull/13111#discussion_r470082171



##########
File path: flink-runtime/src/main/java/org/apache/flink/runtime/jobmaster/JobManagerRunnerImpl.java
##########
@@ -137,7 +138,12 @@ public JobManagerRunnerImpl(
 		this.leaderGatewayFuture = new CompletableFuture<>();
 
 		// now start the JobManager
-		this.jobMasterService = jobMasterFactory.createJobMasterService(jobGraph, this, userCodeLoader);
+		try {
+			this.jobMasterService = jobMasterFactory.createJobMasterService(jobGraph, this, userCodeLoader);
+		} catch (Throwable t) {
+			ClusterEntryPointExceptionUtils.tryEnrichClusterEntryPointError(t);
+			throw t;
+		}

Review comment:
       👍  I moved the code into `Dispatcher.internalSubmitJob(Graph)`.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] XComp commented on a change in pull request #13111: [FLINK-18220][runtime] Extended OutOfMemoryError message.

Posted by GitBox <gi...@apache.org>.
XComp commented on a change in pull request #13111:
URL: https://github.com/apache/flink/pull/13111#discussion_r469874590



##########
File path: flink-runtime/src/main/java/org/apache/flink/runtime/taskmanager/Task.java
##########
@@ -134,120 +134,190 @@
  */
 public class Task implements Runnable, TaskSlotPayload, TaskActions, PartitionProducerStateProvider, CheckpointListener, BackPressureSampleableTask {
 
-	/** The class logger. */
+	/**
+	 * The class logger.
+	 */

Review comment:
       It was accidentally added. But there is actually a change that is relevant:
   ```
   @@ -755,7 +836,7 @@ public class Task implements Runnable, TaskSlotPayload, TaskActions, PartitionPr
                           // an exception was thrown as a side effect of cancelling
                           // ----------------------------------------------------------------
   
   -                       t = TaskManagerExceptionUtils.tryEnrichTaskManagerError(t);
   +                       TaskManagerExceptionUtils.tryEnrichTaskManagerError(t);
   ```
   I reverted everything except for the change above.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] XComp commented on a change in pull request #13111: [FLINK-18220][runtime] Extended OutOfMemoryError message.

Posted by GitBox <gi...@apache.org>.
XComp commented on a change in pull request #13111:
URL: https://github.com/apache/flink/pull/13111#discussion_r469889173



##########
File path: flink-runtime/src/main/java/org/apache/flink/runtime/jobmaster/JobManagerRunnerImpl.java
##########
@@ -137,7 +138,12 @@ public JobManagerRunnerImpl(
 		this.leaderGatewayFuture = new CompletableFuture<>();
 
 		// now start the JobManager
-		this.jobMasterService = jobMasterFactory.createJobMasterService(jobGraph, this, userCodeLoader);
+		try {
+			this.jobMasterService = jobMasterFactory.createJobMasterService(jobGraph, this, userCodeLoader);
+		} catch (Throwable t) {
+			ClusterEntryPointExceptionUtils.tryEnrichClusterEntryPointError(t);
+			throw t;
+		}

Review comment:
       I see. But wouldn't we miss doing the error handling for a Job rerun. The JobManager instantiation happens in `Dispatcher.persistAndRun(JobGraph)` (which calls `Dispatcher.internalSubmitJob(JobGraph)` but also in `Dispatcher.runRecoveredJob(JobGraph)`. The latter call graph does not touch `Dispatcher.internalSubmitJob(JobGraph)` which means that we wouldn't be able to handle the OOM for Job reruns if moving it into `Dispatcher.internalSubmitJob(JobGraph)` in that case.
   
   Wouldn't `Dispatcher.runJob(JobGraph)` or even `Dispatcher.createJobManagerRunner` the better location for the OOM handling?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] XComp commented on a change in pull request #13111: [FLINK-18220][runtime] Extended OutOfMemoryError message.

Posted by GitBox <gi...@apache.org>.
XComp commented on a change in pull request #13111:
URL: https://github.com/apache/flink/pull/13111#discussion_r469245589



##########
File path: flink-runtime/src/main/java/org/apache/flink/runtime/jobgraph/InputOutputFormatVertex.java
##########
@@ -24,6 +24,7 @@
 import org.apache.flink.api.common.io.OutputFormat;
 import org.apache.flink.api.common.operators.util.UserCodeWrapper;
 import org.apache.flink.runtime.OperatorIDPair;
+import org.apache.flink.runtime.entrypoint.ClusterEntryPointExceptionUtils;

Review comment:
       This change was reverted. The OOM enrichment was moved into `ExecutorGraph.failGlobal(Throwable)`




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] flinkbot edited a comment on pull request #13111: [FLINK-18220][runtime] Extended OutOfMemoryError message.

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #13111:
URL: https://github.com/apache/flink/pull/13111#issuecomment-671602964


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "8401a49756640fa070b5d7a15fb29bb6837f5a99",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=5374",
       "triggerID" : "8401a49756640fa070b5d7a15fb29bb6837f5a99",
       "triggerType" : "PUSH"
     }, {
       "hash" : "2c3cd34c2804e5366146c6fc8de541f996d7972f",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=5459",
       "triggerID" : "2c3cd34c2804e5366146c6fc8de541f996d7972f",
       "triggerType" : "PUSH"
     }, {
       "hash" : "012e50a23dbecc06bc06c537cbabf154eb935236",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=5463",
       "triggerID" : "012e50a23dbecc06bc06c537cbabf154eb935236",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 012e50a23dbecc06bc06c537cbabf154eb935236 Azure: [FAILURE](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=5463) 
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run travis` re-run the last Travis build
    - `@flinkbot run azure` re-run the last Azure build
   </details>


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] flinkbot edited a comment on pull request #13111: [FLINK-18220][runtime] Extended OutOfMemoryError message.

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #13111:
URL: https://github.com/apache/flink/pull/13111#issuecomment-671602964


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "8401a49756640fa070b5d7a15fb29bb6837f5a99",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=5374",
       "triggerID" : "8401a49756640fa070b5d7a15fb29bb6837f5a99",
       "triggerType" : "PUSH"
     }, {
       "hash" : "2c3cd34c2804e5366146c6fc8de541f996d7972f",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=5459",
       "triggerID" : "2c3cd34c2804e5366146c6fc8de541f996d7972f",
       "triggerType" : "PUSH"
     }, {
       "hash" : "012e50a23dbecc06bc06c537cbabf154eb935236",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=5463",
       "triggerID" : "012e50a23dbecc06bc06c537cbabf154eb935236",
       "triggerType" : "PUSH"
     }, {
       "hash" : "e575fa045556a0420c79b58f6c0c5fc37b7b937e",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "e575fa045556a0420c79b58f6c0c5fc37b7b937e",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 012e50a23dbecc06bc06c537cbabf154eb935236 Azure: [FAILURE](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=5463) 
   * e575fa045556a0420c79b58f6c0c5fc37b7b937e UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run travis` re-run the last Travis build
    - `@flinkbot run azure` re-run the last Azure build
   </details>


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] XComp commented on a change in pull request #13111: [FLINK-18220][runtime] Extended OutOfMemoryError message.

Posted by GitBox <gi...@apache.org>.
XComp commented on a change in pull request #13111:
URL: https://github.com/apache/flink/pull/13111#discussion_r470083019



##########
File path: flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/ExecutionGraph.java
##########
@@ -1046,6 +1047,8 @@ void failGlobalIfExecutionIsStillRunning(Throwable cause, ExecutionAttemptID fai
 	 * @param t The exception that caused the failure.
 	 */
 	public void failGlobal(Throwable t) {
+		ClusterEntryPointExceptionUtils.tryEnrichClusterEntryPointError(t);

Review comment:
       The OOM handling was moved into `ExecutionGraph.vertexFinished()`




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] flinkbot edited a comment on pull request #13111: [FLINK-18220][runtime] Extended OutOfMemoryError message.

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #13111:
URL: https://github.com/apache/flink/pull/13111#issuecomment-671602964


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "8401a49756640fa070b5d7a15fb29bb6837f5a99",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=5374",
       "triggerID" : "8401a49756640fa070b5d7a15fb29bb6837f5a99",
       "triggerType" : "PUSH"
     }, {
       "hash" : "2c3cd34c2804e5366146c6fc8de541f996d7972f",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "2c3cd34c2804e5366146c6fc8de541f996d7972f",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 8401a49756640fa070b5d7a15fb29bb6837f5a99 Azure: [SUCCESS](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=5374) 
   * 2c3cd34c2804e5366146c6fc8de541f996d7972f UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run travis` re-run the last Travis build
    - `@flinkbot run azure` re-run the last Azure build
   </details>


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] tillrohrmann commented on a change in pull request #13111: [FLINK-18220][runtime] Extended OutOfMemoryError message.

Posted by GitBox <gi...@apache.org>.
tillrohrmann commented on a change in pull request #13111:
URL: https://github.com/apache/flink/pull/13111#discussion_r469758699



##########
File path: flink-core/src/main/java/org/apache/flink/util/ExceptionUtils.java
##########
@@ -110,51 +112,76 @@ public static boolean isJvmFatalOrOutOfMemoryError(Throwable t) {
 	}
 
 	/**
-	 * Tries to enrich the passed exception with additional information.
-	 *
-	 * <p>This method improves error message for direct and metaspace {@link OutOfMemoryError}.
-	 * It adds description of possible causes and ways of resolution.
-	 *
-	 * @param exception exception to enrich if not {@code null}
-	 * @return the enriched exception or the original if no additional information could be added;
-	 * {@code null} if the argument was {@code null}
+	 * Tries to enrich OutOfMemoryErrors being part of the passed root Throwable's cause tree.
+	 *
+	 * <p>This method improves error messages for direct and metaspace {@link OutOfMemoryError}.
+	 * It adds description about the possible causes and ways of resolution.
+	 *
+	 * @param root The Throwable of which the cause tree shall be traversed.
+	 * @param jvmMetaspaceOomNewErrorMessage The message being used for JVM metaspace-related OutOfMemoryErrors. Passing
+	 *                                       <code>null</code> will disable handling this class of error.
+	 * @param jvmDirectOomNewErrorMessage The message being used for direct memory-related OutOfMemoryErrors. Passing
+	 *                                    <code>null</code> will disable handling this class of error.
+	 * @param jvmHeapSpaceOomNewErrorMessage The message being used for Heap space-related OutOfMemoryErrors. Passing
+	 *                                       <code>null</code> will disable handling this class of error.
 	 */
-	@Nullable
-	public static Throwable tryEnrichOutOfMemoryError(
-			@Nullable Throwable exception,
-			String jvmMetaspaceOomNewErrorMessage,
-			String jvmDirectOomNewErrorMessage) {
-		boolean isOom = exception instanceof OutOfMemoryError;
-		if (!isOom) {
-			return exception;
-		}
-
-		OutOfMemoryError oom = (OutOfMemoryError) exception;
-		if (isMetaspaceOutOfMemoryError(oom)) {
-			return changeOutOfMemoryErrorMessage(oom, jvmMetaspaceOomNewErrorMessage);
-		} else if (isDirectOutOfMemoryError(oom)) {
-			return changeOutOfMemoryErrorMessage(oom, jvmDirectOomNewErrorMessage);
-		}
+	public static void tryEnrichOutOfMemoryError(
+			@Nullable Throwable root,
+			@Nullable String jvmMetaspaceOomNewErrorMessage,
+			@Nullable String jvmDirectOomNewErrorMessage,
+			@Nullable String jvmHeapSpaceOomNewErrorMessage) {
+		updateDetailMessage(root, t -> {
+			if (isMetaspaceOutOfMemoryError(t)) {
+				return jvmMetaspaceOomNewErrorMessage;
+			} else if (isDirectOutOfMemoryError(t)) {
+				return jvmDirectOomNewErrorMessage;
+			} else if (isHeapSpaceOutOfMemoryError(t)) {
+				return jvmHeapSpaceOomNewErrorMessage;
+			}
 
-		return oom;
+			return null;
+		});
 	}
 
 	/**
-	 * Rewrites the error message of a {@link OutOfMemoryError}.
+	 * Updates the error message of the first Throwable appearing in the cause tree of the passed root Throwable and
+	 * matching the passed Predicate by traversing the cause tree top-to-bottom.
 	 *
-	 * @param oom original {@link OutOfMemoryError}
-	 * @param newMessage new error message
-	 * @return the origianl {@link OutOfMemoryError} if it already has the new error message or
-	 * a new {@link OutOfMemoryError} with the new error message
+	 * @param root The Throwable whose cause tree shall be traversed.
+	 * @param throwableToMessage The Function based on which the new messages are generated. The function implementation
+	 *                           should return the new message. Returning <code>null</code>, in contrast, will result in
+	 *                           not updating the message for the corresponding Throwable.
 	 */
-	private static OutOfMemoryError changeOutOfMemoryErrorMessage(OutOfMemoryError oom, String newMessage) {
-		if (oom.getMessage().equals(newMessage)) {
-			return oom;
+	public static void updateDetailMessage(Throwable root, Function<Throwable, String> throwableToMessage) {

Review comment:
       if `root` can be `null`, then let's add `@Nullable`. The same for `throwableToMessage`. However, it might not be necessary to support `null` values here.

##########
File path: flink-core/src/main/java/org/apache/flink/util/ExceptionUtils.java
##########
@@ -110,51 +112,76 @@ public static boolean isJvmFatalOrOutOfMemoryError(Throwable t) {
 	}
 
 	/**
-	 * Tries to enrich the passed exception with additional information.
-	 *
-	 * <p>This method improves error message for direct and metaspace {@link OutOfMemoryError}.
-	 * It adds description of possible causes and ways of resolution.
-	 *
-	 * @param exception exception to enrich if not {@code null}
-	 * @return the enriched exception or the original if no additional information could be added;
-	 * {@code null} if the argument was {@code null}
+	 * Tries to enrich OutOfMemoryErrors being part of the passed root Throwable's cause tree.
+	 *
+	 * <p>This method improves error messages for direct and metaspace {@link OutOfMemoryError}.
+	 * It adds description about the possible causes and ways of resolution.
+	 *
+	 * @param root The Throwable of which the cause tree shall be traversed.
+	 * @param jvmMetaspaceOomNewErrorMessage The message being used for JVM metaspace-related OutOfMemoryErrors. Passing
+	 *                                       <code>null</code> will disable handling this class of error.
+	 * @param jvmDirectOomNewErrorMessage The message being used for direct memory-related OutOfMemoryErrors. Passing
+	 *                                    <code>null</code> will disable handling this class of error.
+	 * @param jvmHeapSpaceOomNewErrorMessage The message being used for Heap space-related OutOfMemoryErrors. Passing
+	 *                                       <code>null</code> will disable handling this class of error.
 	 */
-	@Nullable
-	public static Throwable tryEnrichOutOfMemoryError(
-			@Nullable Throwable exception,
-			String jvmMetaspaceOomNewErrorMessage,
-			String jvmDirectOomNewErrorMessage) {
-		boolean isOom = exception instanceof OutOfMemoryError;
-		if (!isOom) {
-			return exception;
-		}
-
-		OutOfMemoryError oom = (OutOfMemoryError) exception;
-		if (isMetaspaceOutOfMemoryError(oom)) {
-			return changeOutOfMemoryErrorMessage(oom, jvmMetaspaceOomNewErrorMessage);
-		} else if (isDirectOutOfMemoryError(oom)) {
-			return changeOutOfMemoryErrorMessage(oom, jvmDirectOomNewErrorMessage);
-		}
+	public static void tryEnrichOutOfMemoryError(
+			@Nullable Throwable root,
+			@Nullable String jvmMetaspaceOomNewErrorMessage,
+			@Nullable String jvmDirectOomNewErrorMessage,
+			@Nullable String jvmHeapSpaceOomNewErrorMessage) {
+		updateDetailMessage(root, t -> {
+			if (isMetaspaceOutOfMemoryError(t)) {
+				return jvmMetaspaceOomNewErrorMessage;
+			} else if (isDirectOutOfMemoryError(t)) {
+				return jvmDirectOomNewErrorMessage;
+			} else if (isHeapSpaceOutOfMemoryError(t)) {
+				return jvmHeapSpaceOomNewErrorMessage;
+			}
 
-		return oom;
+			return null;
+		});
 	}
 
 	/**
-	 * Rewrites the error message of a {@link OutOfMemoryError}.
+	 * Updates the error message of the first Throwable appearing in the cause tree of the passed root Throwable and
+	 * matching the passed Predicate by traversing the cause tree top-to-bottom.
 	 *
-	 * @param oom original {@link OutOfMemoryError}
-	 * @param newMessage new error message
-	 * @return the origianl {@link OutOfMemoryError} if it already has the new error message or
-	 * a new {@link OutOfMemoryError} with the new error message
+	 * @param root The Throwable whose cause tree shall be traversed.
+	 * @param throwableToMessage The Function based on which the new messages are generated. The function implementation
+	 *                           should return the new message. Returning <code>null</code>, in contrast, will result in
+	 *                           not updating the message for the corresponding Throwable.
 	 */
-	private static OutOfMemoryError changeOutOfMemoryErrorMessage(OutOfMemoryError oom, String newMessage) {
-		if (oom.getMessage().equals(newMessage)) {
-			return oom;
+	public static void updateDetailMessage(Throwable root, Function<Throwable, String> throwableToMessage) {
+		if (throwableToMessage == null) {
+			return;
+		}
+
+		Throwable it = root;
+		while (it != null) {
+			String newMessage = throwableToMessage.apply(it);
+			if (newMessage != null) {
+				updateDetailMessageOfThrowable(it, newMessage);
+			}
+
+			it = it.getCause();
+		}
+	}
+
+	private static void updateDetailMessageOfThrowable(Throwable throwable, String newDetailMessage) {
+		Field field;
+		try {
+			field = Throwable.class.getDeclaredField("detailMessage");
+		} catch (NoSuchFieldException e) {
+			throw new IllegalStateException("The JDK Throwable does provide a detailMessage.", e);

Review comment:
       ```suggestion
   			throw new IllegalStateException("The JDK Throwable does not provide a detailMessage.", e);
   ```

##########
File path: flink-core/src/main/java/org/apache/flink/util/ExceptionUtils.java
##########
@@ -110,51 +112,76 @@ public static boolean isJvmFatalOrOutOfMemoryError(Throwable t) {
 	}
 
 	/**
-	 * Tries to enrich the passed exception with additional information.
-	 *
-	 * <p>This method improves error message for direct and metaspace {@link OutOfMemoryError}.
-	 * It adds description of possible causes and ways of resolution.
-	 *
-	 * @param exception exception to enrich if not {@code null}
-	 * @return the enriched exception or the original if no additional information could be added;
-	 * {@code null} if the argument was {@code null}
+	 * Tries to enrich OutOfMemoryErrors being part of the passed root Throwable's cause tree.
+	 *
+	 * <p>This method improves error messages for direct and metaspace {@link OutOfMemoryError}.
+	 * It adds description about the possible causes and ways of resolution.
+	 *
+	 * @param root The Throwable of which the cause tree shall be traversed.
+	 * @param jvmMetaspaceOomNewErrorMessage The message being used for JVM metaspace-related OutOfMemoryErrors. Passing
+	 *                                       <code>null</code> will disable handling this class of error.
+	 * @param jvmDirectOomNewErrorMessage The message being used for direct memory-related OutOfMemoryErrors. Passing
+	 *                                    <code>null</code> will disable handling this class of error.
+	 * @param jvmHeapSpaceOomNewErrorMessage The message being used for Heap space-related OutOfMemoryErrors. Passing
+	 *                                       <code>null</code> will disable handling this class of error.
 	 */
-	@Nullable
-	public static Throwable tryEnrichOutOfMemoryError(
-			@Nullable Throwable exception,
-			String jvmMetaspaceOomNewErrorMessage,
-			String jvmDirectOomNewErrorMessage) {
-		boolean isOom = exception instanceof OutOfMemoryError;
-		if (!isOom) {
-			return exception;
-		}
-
-		OutOfMemoryError oom = (OutOfMemoryError) exception;
-		if (isMetaspaceOutOfMemoryError(oom)) {
-			return changeOutOfMemoryErrorMessage(oom, jvmMetaspaceOomNewErrorMessage);
-		} else if (isDirectOutOfMemoryError(oom)) {
-			return changeOutOfMemoryErrorMessage(oom, jvmDirectOomNewErrorMessage);
-		}
+	public static void tryEnrichOutOfMemoryError(
+			@Nullable Throwable root,
+			@Nullable String jvmMetaspaceOomNewErrorMessage,
+			@Nullable String jvmDirectOomNewErrorMessage,
+			@Nullable String jvmHeapSpaceOomNewErrorMessage) {
+		updateDetailMessage(root, t -> {
+			if (isMetaspaceOutOfMemoryError(t)) {
+				return jvmMetaspaceOomNewErrorMessage;
+			} else if (isDirectOutOfMemoryError(t)) {
+				return jvmDirectOomNewErrorMessage;
+			} else if (isHeapSpaceOutOfMemoryError(t)) {
+				return jvmHeapSpaceOomNewErrorMessage;
+			}
 
-		return oom;
+			return null;
+		});
 	}
 
 	/**
-	 * Rewrites the error message of a {@link OutOfMemoryError}.
+	 * Updates the error message of the first Throwable appearing in the cause tree of the passed root Throwable and
+	 * matching the passed Predicate by traversing the cause tree top-to-bottom.
 	 *
-	 * @param oom original {@link OutOfMemoryError}
-	 * @param newMessage new error message
-	 * @return the origianl {@link OutOfMemoryError} if it already has the new error message or
-	 * a new {@link OutOfMemoryError} with the new error message
+	 * @param root The Throwable whose cause tree shall be traversed.
+	 * @param throwableToMessage The Function based on which the new messages are generated. The function implementation
+	 *                           should return the new message. Returning <code>null</code>, in contrast, will result in
+	 *                           not updating the message for the corresponding Throwable.
 	 */
-	private static OutOfMemoryError changeOutOfMemoryErrorMessage(OutOfMemoryError oom, String newMessage) {
-		if (oom.getMessage().equals(newMessage)) {
-			return oom;
+	public static void updateDetailMessage(Throwable root, Function<Throwable, String> throwableToMessage) {
+		if (throwableToMessage == null) {
+			return;
+		}
+
+		Throwable it = root;
+		while (it != null) {
+			String newMessage = throwableToMessage.apply(it);
+			if (newMessage != null) {
+				updateDetailMessageOfThrowable(it, newMessage);
+			}
+
+			it = it.getCause();
+		}

Review comment:
       The JavaDocs and this loop are not aligned. It looks as if we continue with updating the messages even after we have found the first match.

##########
File path: flink-runtime/src/main/java/org/apache/flink/runtime/jobmaster/JobManagerRunnerImpl.java
##########
@@ -137,7 +138,12 @@ public JobManagerRunnerImpl(
 		this.leaderGatewayFuture = new CompletableFuture<>();
 
 		// now start the JobManager
-		this.jobMasterService = jobMasterFactory.createJobMasterService(jobGraph, this, userCodeLoader);
+		try {
+			this.jobMasterService = jobMasterFactory.createJobMasterService(jobGraph, this, userCodeLoader);
+		} catch (Throwable t) {
+			ClusterEntryPointExceptionUtils.tryEnrichClusterEntryPointError(t);
+			throw t;
+		}

Review comment:
       I would move this specific exception handling out because now we would only enrich OOMs which occur here. But what if an OOM happens in a line above (e.g. `classLoaderLease.getOrResolveClassLoader(...)`)? I think it would be better to put it into `Dispatcher.internalSubmitJob`.

##########
File path: flink-runtime/src/main/java/org/apache/flink/runtime/entrypoint/ClusterEntryPointExceptionUtils.java
##########
@@ -47,22 +47,28 @@
 			"which has to be investigated and fixed. The Flink Master has to be shutdown...",
 		JobManagerOptions.JVM_METASPACE.key());
 
+	private static final String JM_HEAP_SPACE_OOM_ERROR_MESSAGE = String.format(
+		"Java heap space. A heap space-related out-of-memory error has occurred. This can mean two things: either Flink Master " +
+			"requires a larger size of JVM heap space or there is a memory leak. In the first case, '%s' can be used to increase " +
+			"the amount of available heap memory. If the problem is not resolved by increasing the heap size, it indicate a " +

Review comment:
       ```suggestion
   			"the amount of available heap memory. If the problem is not resolved by increasing the heap size, it indicates a " +
   ```

##########
File path: flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/ExecutionGraph.java
##########
@@ -1046,6 +1047,8 @@ void failGlobalIfExecutionIsStillRunning(Throwable cause, ExecutionAttemptID fai
 	 * @param t The exception that caused the failure.
 	 */
 	public void failGlobal(Throwable t) {
+		ClusterEntryPointExceptionUtils.tryEnrichClusterEntryPointError(t);

Review comment:
       I am not sure whether we can enrich `t` here because `t` can also come from the `TaskExecutor` if a `Task` has failed. In this case, the ClusterEntryPoint specific messages don't make sense.

##########
File path: flink-runtime/src/main/java/org/apache/flink/runtime/taskmanager/Task.java
##########
@@ -134,120 +134,190 @@
  */
 public class Task implements Runnable, TaskSlotPayload, TaskActions, PartitionProducerStateProvider, CheckpointListener, BackPressureSampleableTask {
 
-	/** The class logger. */
+	/**
+	 * The class logger.
+	 */

Review comment:
       All changes in this file seem to be quite unrelated. Please revert.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] XComp commented on a change in pull request #13111: [FLINK-18220][runtime] Extended OutOfMemoryError message.

Posted by GitBox <gi...@apache.org>.
XComp commented on a change in pull request #13111:
URL: https://github.com/apache/flink/pull/13111#discussion_r469244408



##########
File path: flink-runtime/src/main/java/org/apache/flink/runtime/jobgraph/InputOutputFormatVertex.java
##########
@@ -99,15 +100,16 @@ public void initializeOnMaster(ClassLoader loader) throws Exception {
 					((InitializeOnMaster) outputFormat).initializeGlobal(getParallelism());
 				}
 			}
-
+		} catch (Throwable t) {
+			throw ClusterEntryPointExceptionUtils.tryEnrichClusterEntryPointError(t);

Review comment:
       I moved the OOM handling into `JobManagerRunnerImpl`'s constructor as discussed in one of our recent discussions. But, I want to point out that it's actually out-of-scope for FLINK-18220.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] flinkbot commented on pull request #13111: [FLINK-18220][runtime] Extended OutOfMemoryError message.

Posted by GitBox <gi...@apache.org>.
flinkbot commented on pull request #13111:
URL: https://github.com/apache/flink/pull/13111#issuecomment-671594849


   Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community
   to review your pull request. We will use this comment to track the progress of the review.
   
   
   ## Automated Checks
   Last check on commit 8401a49756640fa070b5d7a15fb29bb6837f5a99 (Mon Aug 10 21:18:21 UTC 2020)
   
   **Warnings:**
    * No documentation files were touched! Remember to keep the Flink docs up to date!
    * **This pull request references an unassigned [Jira ticket](https://issues.apache.org/jira/browse/FLINK-18220).** According to the [code contribution guide](https://flink.apache.org/contributing/contribute-code.html), tickets need to be assigned before starting with the implementation work.
   
   
   <sub>Mention the bot in a comment to re-run the automated checks.</sub>
   ## Review Progress
   
   * ❓ 1. The [description] looks good.
   * ❓ 2. There is [consensus] that the contribution should go into to Flink.
   * ❓ 3. Needs [attention] from.
   * ❓ 4. The change fits into the overall [architecture].
   * ❓ 5. Overall code [quality] is good.
   
   Please see the [Pull Request Review Guide](https://flink.apache.org/contributing/reviewing-prs.html) for a full explanation of the review process.<details>
    The Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer of PMC member is required <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot approve description` to approve one or more aspects (aspects: `description`, `consensus`, `architecture` and `quality`)
    - `@flinkbot approve all` to approve all aspects
    - `@flinkbot approve-until architecture` to approve everything until `architecture`
    - `@flinkbot attention @username1 [@username2 ..]` to require somebody's attention
    - `@flinkbot disapprove architecture` to remove an approval you gave earlier
   </details>


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] flinkbot commented on pull request #13111: [FLINK-18220][runtime] Extended OutOfMemoryError message.

Posted by GitBox <gi...@apache.org>.
flinkbot commented on pull request #13111:
URL: https://github.com/apache/flink/pull/13111#issuecomment-671602964


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "8401a49756640fa070b5d7a15fb29bb6837f5a99",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "8401a49756640fa070b5d7a15fb29bb6837f5a99",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 8401a49756640fa070b5d7a15fb29bb6837f5a99 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run travis` re-run the last Travis build
    - `@flinkbot run azure` re-run the last Azure build
   </details>


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] XComp commented on a change in pull request #13111: [FLINK-18220][runtime] Extended OutOfMemoryError message.

Posted by GitBox <gi...@apache.org>.
XComp commented on a change in pull request #13111:
URL: https://github.com/apache/flink/pull/13111#discussion_r469865932



##########
File path: flink-core/src/main/java/org/apache/flink/util/ExceptionUtils.java
##########
@@ -110,51 +112,76 @@ public static boolean isJvmFatalOrOutOfMemoryError(Throwable t) {
 	}
 
 	/**
-	 * Tries to enrich the passed exception with additional information.
-	 *
-	 * <p>This method improves error message for direct and metaspace {@link OutOfMemoryError}.
-	 * It adds description of possible causes and ways of resolution.
-	 *
-	 * @param exception exception to enrich if not {@code null}
-	 * @return the enriched exception or the original if no additional information could be added;
-	 * {@code null} if the argument was {@code null}
+	 * Tries to enrich OutOfMemoryErrors being part of the passed root Throwable's cause tree.
+	 *
+	 * <p>This method improves error messages for direct and metaspace {@link OutOfMemoryError}.
+	 * It adds description about the possible causes and ways of resolution.
+	 *
+	 * @param root The Throwable of which the cause tree shall be traversed.
+	 * @param jvmMetaspaceOomNewErrorMessage The message being used for JVM metaspace-related OutOfMemoryErrors. Passing
+	 *                                       <code>null</code> will disable handling this class of error.
+	 * @param jvmDirectOomNewErrorMessage The message being used for direct memory-related OutOfMemoryErrors. Passing
+	 *                                    <code>null</code> will disable handling this class of error.
+	 * @param jvmHeapSpaceOomNewErrorMessage The message being used for Heap space-related OutOfMemoryErrors. Passing
+	 *                                       <code>null</code> will disable handling this class of error.
 	 */
-	@Nullable
-	public static Throwable tryEnrichOutOfMemoryError(
-			@Nullable Throwable exception,
-			String jvmMetaspaceOomNewErrorMessage,
-			String jvmDirectOomNewErrorMessage) {
-		boolean isOom = exception instanceof OutOfMemoryError;
-		if (!isOom) {
-			return exception;
-		}
-
-		OutOfMemoryError oom = (OutOfMemoryError) exception;
-		if (isMetaspaceOutOfMemoryError(oom)) {
-			return changeOutOfMemoryErrorMessage(oom, jvmMetaspaceOomNewErrorMessage);
-		} else if (isDirectOutOfMemoryError(oom)) {
-			return changeOutOfMemoryErrorMessage(oom, jvmDirectOomNewErrorMessage);
-		}
+	public static void tryEnrichOutOfMemoryError(
+			@Nullable Throwable root,
+			@Nullable String jvmMetaspaceOomNewErrorMessage,
+			@Nullable String jvmDirectOomNewErrorMessage,
+			@Nullable String jvmHeapSpaceOomNewErrorMessage) {
+		updateDetailMessage(root, t -> {
+			if (isMetaspaceOutOfMemoryError(t)) {
+				return jvmMetaspaceOomNewErrorMessage;
+			} else if (isDirectOutOfMemoryError(t)) {
+				return jvmDirectOomNewErrorMessage;
+			} else if (isHeapSpaceOutOfMemoryError(t)) {
+				return jvmHeapSpaceOomNewErrorMessage;
+			}
 
-		return oom;
+			return null;
+		});
 	}
 
 	/**
-	 * Rewrites the error message of a {@link OutOfMemoryError}.
+	 * Updates the error message of the first Throwable appearing in the cause tree of the passed root Throwable and
+	 * matching the passed Predicate by traversing the cause tree top-to-bottom.
 	 *
-	 * @param oom original {@link OutOfMemoryError}
-	 * @param newMessage new error message
-	 * @return the origianl {@link OutOfMemoryError} if it already has the new error message or
-	 * a new {@link OutOfMemoryError} with the new error message
+	 * @param root The Throwable whose cause tree shall be traversed.
+	 * @param throwableToMessage The Function based on which the new messages are generated. The function implementation
+	 *                           should return the new message. Returning <code>null</code>, in contrast, will result in
+	 *                           not updating the message for the corresponding Throwable.
 	 */
-	private static OutOfMemoryError changeOutOfMemoryErrorMessage(OutOfMemoryError oom, String newMessage) {
-		if (oom.getMessage().equals(newMessage)) {
-			return oom;
+	public static void updateDetailMessage(Throwable root, Function<Throwable, String> throwableToMessage) {
+		if (throwableToMessage == null) {
+			return;
+		}
+
+		Throwable it = root;
+		while (it != null) {
+			String newMessage = throwableToMessage.apply(it);
+			if (newMessage != null) {
+				updateDetailMessageOfThrowable(it, newMessage);
+			}
+
+			it = it.getCause();
+		}
+	}
+
+	private static void updateDetailMessageOfThrowable(Throwable throwable, String newDetailMessage) {
+		Field field;
+		try {
+			field = Throwable.class.getDeclaredField("detailMessage");
+		} catch (NoSuchFieldException e) {
+			throw new IllegalStateException("The JDK Throwable does provide a detailMessage.", e);

Review comment:
       I updated both exception messages to be more descriptive.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] XComp commented on a change in pull request #13111: [FLINK-18220][runtime] Extended OutOfMemoryError message.

Posted by GitBox <gi...@apache.org>.
XComp commented on a change in pull request #13111:
URL: https://github.com/apache/flink/pull/13111#discussion_r469246546



##########
File path: flink-runtime/src/main/java/org/apache/flink/runtime/entrypoint/ClusterEntryPointExceptionUtils.java
##########
@@ -61,7 +61,7 @@ private ClusterEntryPointExceptionUtils() {
 	 * {@code null} if the argument was {@code null}
 	 */
 	@Nullable
-	static Throwable tryEnrichClusterEntryPointError(@Nullable Throwable exception) {
+	public static Throwable tryEnrichClusterEntryPointError(@Nullable Throwable exception) {

Review comment:
       First version of the heap space-related error message was added. Don't we have to add this as well to the `TaskManagerExceptionUtils`?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] XComp commented on a change in pull request #13111: [FLINK-18220][runtime] Extended OutOfMemoryError message.

Posted by GitBox <gi...@apache.org>.
XComp commented on a change in pull request #13111:
URL: https://github.com/apache/flink/pull/13111#discussion_r469889173



##########
File path: flink-runtime/src/main/java/org/apache/flink/runtime/jobmaster/JobManagerRunnerImpl.java
##########
@@ -137,7 +138,12 @@ public JobManagerRunnerImpl(
 		this.leaderGatewayFuture = new CompletableFuture<>();
 
 		// now start the JobManager
-		this.jobMasterService = jobMasterFactory.createJobMasterService(jobGraph, this, userCodeLoader);
+		try {
+			this.jobMasterService = jobMasterFactory.createJobMasterService(jobGraph, this, userCodeLoader);
+		} catch (Throwable t) {
+			ClusterEntryPointExceptionUtils.tryEnrichClusterEntryPointError(t);
+			throw t;
+		}

Review comment:
       I see. But wouldn't we miss doing the error handling for a Job rerun. The JobManager instantiation happens in `Dispatcher.persistAndRun(JobGraph)` (which calls `Dispatcher.internalSubmitJob(JobGraph)` but also in `Dispatcher.runRecoveredJob(JobGraph)`. The latter call graph does not touch `Dispatcher.internalSubmitJob(JobGraph)` which means that we wouldn't be able to handle the OOM for Job reruns if moving it into `Dispatcher.internalSubmitJob(JobGraph)` in that case.
   
   Wouldn't `Dispatcher.runJob(JobGraph)` the better location for the OOM handling?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org