You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@flink.apache.org by "Niklas Semmler (Jira)" <ji...@apache.org> on 2022/05/30 22:14:00 UTC
[jira] [Comment Edited] (FLINK-24960) YARNSessionCapacitySchedulerITCase.testVCoresAreSetCorrectlyAndJobManagerHostnameAreShownInWebInterfaceAndDynamicPropertiesAndYarnApplicationNameAndTaskManagerSlots hangs on azure
[ https://issues.apache.org/jira/browse/FLINK-24960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17544070#comment-17544070 ]
Niklas Semmler edited comment on FLINK-24960 at 5/30/22 10:13 PM:
------------------------------------------------------------------
I took another look at this instability and tried to retrace my earlier steps. There are two aspects about this instability:
1. Why is the {{localhost:8081}} address used in this Yarn context? (Instead of the Yarn container & port)
2. Why does the exception not halt the program? I.e., why is the program continuing to run after the exception is thrown?
Previously I thought that {{localhost}} was set via the setup of the YarnClusterDescriptor. In the newer logs, the following [log line|https://github.com/apache/flink/blob/c16e4b4ce20704a0ad4387591894f13105d5e530/flink-yarn/src/main/java/org/apache/flink/yarn/YarnClusterDescriptor.java#L1799] shows that the address is still correctly set in the YarnClusterDescriptor:
{{7:09:15,731 [Frontend (CLI/YARN Client) runner thread (startWithArgs()).] INFO org.apache.flink.yarn.YarnClusterDescriptor [] - Found Web Interface 6dad280b2159:40493 of application}}
Apparently the {{localhost}} address is added somewhere else. But where? Maybe it is done when the job is submitted? It may make sense to change the log level of this [log line|https://github.com/apache/flink/blob/f9438dd54fa6896563b152d50b7a4b3c47ad9ebf/flink-clients/src/main/java/org/apache/flink/client/cli/CliFrontend.java#L244] by changing it [here|https://github.com/apache/flink/blob/7d85b273ccdbd5a2242e05e5d645ea82280f5eea/flink-yarn-tests/src/test/java/org/apache/flink/yarn/YARNSessionCapacitySchedulerITCase.java#L116].
This would however just explain point 1 not point 2. My intuition is that there is some timing problem at play that we are so far missing. Let's compare a successful to an erroneous execution:
*Successful execution*
{code}
05:26:42,745 [Frontend (CLI/YARN Client) runner thread (startWithArgs()).] INFO org.apache.flink.yarn.YarnClusterDescriptor [] - YARN application has been deployed successfully.
05:26:42,745 [Frontend (CLI/YARN Client) runner thread (startWithArgs()).] INFO org.apache.flink.yarn.YarnClusterDescriptor [] - Found Web Interface 6d4ce71bbaec:45932 of application 'application_1650173077369_0008'. 05:26:42,931 [ main] INFO org.apache.flink.yarn.YarnTestBase [] - Found expected output in redirected streams
05:26:42,933 [ main] INFO org.apache.flink.yarn.YARNSessionCapacitySchedulerITCase [] - Extracted hostname:port: 6d4ce71bbaec:45932
05:26:42,934 [ main] INFO org.apache.flink.yarn.YarnTestBase [] - Running with args [run, --detached, /__w/3/s/flink-yarn-tests/target/programs/WindowJoin.jar]
05:26:42,940 [Frontend (CLI/YARN Client) runner thread (startWithArgs()).] INFO org.apache.flink.yarn.cli.FlinkYarnSessionCli [] - Found Yarn properties file under /tmp/.yarn-properties-agent03_azpcontainer.
05:26:42,940 [Frontend (CLI/YARN Client) runner thread (startWithArgs()).] WARN org.apache.flink.core.plugin.PluginConfig [] - The plugins directory [plugins] does not exist.
05:26:42,941 [Frontend (CLI/YARN Client) runner thread (startWithArgs()).] INFO org.apache.flink.client.cli.CliFrontend [] - Running 'run' command.
05:26:42,942 [Frontend (CLI/YARN Client) runner thread (startWithArgs()).] INFO org.apache.flink.client.cli.CliFrontend [] - Building program from JAR file
05:26:42,945 [Frontend (CLI/YARN Client) runner thread (startWithArgs()).] INFO org.apache.flink.client.ClientUtils [] - Starting program (detached: true)
05:26:42,975 [Frontend (CLI/YARN Client) runner thread (startWithArgs()).] INFO org.apache.flink.api.java.typeutils.TypeExtractor [] - class org.apache.flink.streaming.api.datastream.CoGroupedStreams$TaggedUnion does not contain a setter for field one 05:26:42,975 [Frontend (CLI/YARN Client) runner thread (startWithArgs()).] INFO org.apache.flink.api.java.typeutils.TypeExtractor [] - Class class org.apache.flink.streaming.api.datastream.CoGroupedStreams$TaggedUnion cannot be used as a POJO type because not all fields are valid POJO fields, and must be processed as GenericType. Please read the Flink documentation on "Data Types & Serialization" for details of the effect on performance. 05:26:42,978 [Frontend (CLI/YARN Client) runner thread (startWithArgs()).] INFO org.apache.flink.api.java.typeutils.TypeExtractor [] - class org.apache.flink.streaming.api.datastream.CoGroupedStreams$TaggedUnion does not contain a setter for field one 05:26:42,978 [Frontend (CLI/YARN Client) runner thread (startWithArgs()).] INFO org.apache.flink.api.java.typeutils.TypeExtractor [] - Class class org.apache.flink.streaming.api.datastream.CoGroupedStreams$TaggedUnion cannot be used as a POJO type because not all fields are valid POJO fields, and must be processed as GenericType. Please read the Flink documentation on "Data Types & Serialization" for details of the effect on performance. 05:26:43,033 [Frontend (CLI/YARN Client) runner thread (startWithArgs()).] WARN org.apache.flink.yarn.configuration.YarnLogConfigUtil [] - The configuration directory ('/tmp/junit8048407617893879030/conf') already contains a LOG4J config file.If you want to use logback, then please delete or rename the log configuration file. 05:26:43,033 [Frontend (CLI/YARN Client) runner thread (startWithArgs()).] WARN org.apache.flink.runtime.util.HadoopUtils [] - Could not find Hadoop configuration via any of the supported methods (Flink configuration, environment variables). 05:26:43,082 [Frontend (CLI/YARN Client) runner thread (startWithArgs()).] INFO org.apache.hadoop.yarn.client.RMProxy [] - Connecting to ResourceManager at 6d4ce71bbaec/192.168.192.2:42104
05:26:43,083 [Frontend (CLI/YARN Client) runner thread (startWithArgs()).] INFO org.apache.flink.yarn.YarnClusterDescriptor [] - No path for the flink jar passed. Using the location of class org.apache.flink.yarn. YarnClusterDescriptor to locate the jar
05:26:43,086 [Frontend (CLI/YARN Client) runner thread (startWithArgs()).] INFO org.apache.flink.yarn.YarnClusterDescriptor [] - Found Web Interface 6d4ce71bbaec:45932 of application 'application_1650173077369_0008'.
05:26:43,101 [Flink-RestClusterClient-IO-thread-1] INFO org.apache.flink.client.program.rest.RestClusterClient [] - Submitting job 'Windowed Join Example' (8e2a9cb1e08a4b6642ac342bd4d96dcb). 05:26:43,620 [Flink-RestClusterClient-IO-thread-4] INFO org.apache.flink.client.program.rest.RestClusterClient [] - Successfully submitted job 'Windowed Join Example' (8e2a9cb1e08a4b6642ac342bd4d96dcb) to 'http://6d4ce71bbaec: 45932'.
{code}
*Erroneous execution*
{code}
07:09:15,731 [Frontend (CLI/YARN Client) runner thread (startWithArgs()).] INFO org.apache.flink.yarn.YarnClusterDescriptor [] - YARN application has been deployed successfully.
07:09:15,731 [Frontend (CLI/YARN Client) runner thread (startWithArgs()).] INFO org.apache.flink.yarn.YarnClusterDescriptor [] - Found Web Interface 6dad280b2159:40493 of application 'application_1650179225006_0008'. 07:09:15,750 [ main] INFO org.apache.flink.yarn.YarnTestBase [] - Found expected output in redirected streams
07:09:15,752 [ main] INFO org.apache.flink.yarn.YARNSessionCapacitySchedulerITCase [] - Extracted hostname:port: 6dad280b2159:40493
07:09:15,752 [ main] INFO org.apache.flink.yarn.YarnTestBase [] - Running with args [run, --detached, /__w/2/s/flink-yarn-tests/target/programs/WindowJoin.jar]
07:09:15,757 [Frontend (CLI/YARN Client) runner thread (startWithArgs()).] WARN org.apache.flink.core.plugin.PluginConfig [] - The plugins directory [plugins] does not exist.
07:09:15,758 [Frontend (CLI/YARN Client) runner thread (startWithArgs()).] INFO org.apache.flink.client.cli.CliFrontend [] - Running 'run' command.
07:09:15,759 [Frontend (CLI/YARN Client) runner thread (startWithArgs()).] INFO org.apache.flink.client.cli.CliFrontend [] - Building program from JAR file
07:09:15,762 [Frontend (CLI/YARN Client) runner thread (startWithArgs()).] INFO org.apache.flink.client.ClientUtils [] - Starting program (detached: true)
07:09:15,867 [Frontend (CLI/YARN Client) runner thread (startWithArgs()).] INFO org.apache.flink.api.java.typeutils.TypeExtractor [] - class org.apache.flink.streaming.api.datastream.CoGroupedStreams$TaggedUnion does not contain a setter for field one 07:09:15,867 [Frontend (CLI/YARN Client) runner thread (startWithArgs()).] INFO org.apache.flink.api.java.typeutils.TypeExtractor [] - Class class org.apache.flink.streaming.api.datastream.CoGroupedStreams$TaggedUnion cannot be used as a POJO type because not all fields are valid POJO fields, and must be processed as GenericType. Please read the Flink documentation on "Data Types & Serialization" for details of the effect on performance. 07:09:15,871 [Frontend (CLI/YARN Client) runner thread (startWithArgs()).] INFO org.apache.flink.api.java.typeutils.TypeExtractor [] - class org.apache.flink.streaming.api.datastream.CoGroupedStreams$TaggedUnion does not contain a setter for field one 07:09:15,871 [Frontend (CLI/YARN Client) runner thread (startWithArgs()).] INFO org.apache.flink.api.java.typeutils.TypeExtractor [] - Class class org.apache.flink.streaming.api.datastream.CoGroupedStreams$TaggedUnion cannot be used as a POJO type because not all fields are valid POJO fields, and must be processed as GenericType. Please read the Flink documentation on "Data Types & Serialization" for details of the effect on performance. 07:09:16,194 [Flink-RestClusterClient-IO-thread-1] INFO org.apache.flink.client.program.rest.RestClusterClient [] - Submitting job 'Windowed Join Example' (2788aebd84c123f092c0992e5a7321ca).
07:09:16,236 [flink-rest-client-netty-thread-1] WARN org.apache.flink.client.program.rest.RestClusterClient [] - Attempt to submit job 'Windowed Join Example' (2788aebd84c123f092c0992e5a7321ca) to 'http://localhost:8081' has failed. java.util.concurrent.CompletionException: org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: localhost/127.0.0.1:8081
{code}
It is noteworthy that the line {{The configuration directory ('/tmp/junit8048407617893879030/conf') already contains a LOG4J config file}} only appears in the successful execution. This may indicates that the config does not exist due to a timing issue. Then again, this may be a completely unrelated artifact.
was (Author: JIRAUSER281719):
I took another look at this instability and tried to retrace my earlier steps. There are two aspects about this instability:
1. Why is the {{localhost:8081}} address used in this Yarn context? (Instead of the Yarn container & port)
2. Why does the exception not halt the program? I.e., why is the program continuing to run after the exception is thrown?
Previously I thought that {{localhost}} was set via the setup of the YarnClusterDescriptor. In the newer logs, the following [log line|https://github.com/apache/flink/blob/c16e4b4ce20704a0ad4387591894f13105d5e530/flink-yarn/src/main/java/org/apache/flink/yarn/YarnClusterDescriptor.java#L1799] shows that the address is still correctly set in the YarnClusterDescriptor:
{{7:09:15,731 [Frontend (CLI/YARN Client) runner thread (startWithArgs()).] INFO org.apache.flink.yarn.YarnClusterDescriptor [] - Found Web Interface 6dad280b2159:40493 of application}}
Apparently the {{localhost}} address is added somewhere else. But where? Maybe it is done when the job is submitted? It may make sense to change the log level of this [log line|https://github.com/apache/flink/blob/f9438dd54fa6896563b152d50b7a4b3c47ad9ebf/flink-clients/src/main/java/org/apache/flink/client/cli/CliFrontend.java#L244] to double check it. This would however just explain point 1 not point 2.`
For completeness sake, let's compare a successful to an erroneous execution:
*Successful execution*
{code}
05:26:42,745 [Frontend (CLI/YARN Client) runner thread (startWithArgs()).] INFO org.apache.flink.yarn.YarnClusterDescriptor [] - YARN application has been deployed successfully.
05:26:42,745 [Frontend (CLI/YARN Client) runner thread (startWithArgs()).] INFO org.apache.flink.yarn.YarnClusterDescriptor [] - Found Web Interface 6d4ce71bbaec:45932 of application 'application_1650173077369_0008'. 05:26:42,931 [ main] INFO org.apache.flink.yarn.YarnTestBase [] - Found expected output in redirected streams
05:26:42,933 [ main] INFO org.apache.flink.yarn.YARNSessionCapacitySchedulerITCase [] - Extracted hostname:port: 6d4ce71bbaec:45932
05:26:42,934 [ main] INFO org.apache.flink.yarn.YarnTestBase [] - Running with args [run, --detached, /__w/3/s/flink-yarn-tests/target/programs/WindowJoin.jar]
05:26:42,940 [Frontend (CLI/YARN Client) runner thread (startWithArgs()).] INFO org.apache.flink.yarn.cli.FlinkYarnSessionCli [] - Found Yarn properties file under /tmp/.yarn-properties-agent03_azpcontainer.
05:26:42,940 [Frontend (CLI/YARN Client) runner thread (startWithArgs()).] WARN org.apache.flink.core.plugin.PluginConfig [] - The plugins directory [plugins] does not exist.
05:26:42,941 [Frontend (CLI/YARN Client) runner thread (startWithArgs()).] INFO org.apache.flink.client.cli.CliFrontend [] - Running 'run' command.
05:26:42,942 [Frontend (CLI/YARN Client) runner thread (startWithArgs()).] INFO org.apache.flink.client.cli.CliFrontend [] - Building program from JAR file
05:26:42,945 [Frontend (CLI/YARN Client) runner thread (startWithArgs()).] INFO org.apache.flink.client.ClientUtils [] - Starting program (detached: true)
05:26:42,975 [Frontend (CLI/YARN Client) runner thread (startWithArgs()).] INFO org.apache.flink.api.java.typeutils.TypeExtractor [] - class org.apache.flink.streaming.api.datastream.CoGroupedStreams$TaggedUnion does not contain a setter for field one 05:26:42,975 [Frontend (CLI/YARN Client) runner thread (startWithArgs()).] INFO org.apache.flink.api.java.typeutils.TypeExtractor [] - Class class org.apache.flink.streaming.api.datastream.CoGroupedStreams$TaggedUnion cannot be used as a POJO type because not all fields are valid POJO fields, and must be processed as GenericType. Please read the Flink documentation on "Data Types & Serialization" for details of the effect on performance. 05:26:42,978 [Frontend (CLI/YARN Client) runner thread (startWithArgs()).] INFO org.apache.flink.api.java.typeutils.TypeExtractor [] - class org.apache.flink.streaming.api.datastream.CoGroupedStreams$TaggedUnion does not contain a setter for field one 05:26:42,978 [Frontend (CLI/YARN Client) runner thread (startWithArgs()).] INFO org.apache.flink.api.java.typeutils.TypeExtractor [] - Class class org.apache.flink.streaming.api.datastream.CoGroupedStreams$TaggedUnion cannot be used as a POJO type because not all fields are valid POJO fields, and must be processed as GenericType. Please read the Flink documentation on "Data Types & Serialization" for details of the effect on performance. 05:26:43,033 [Frontend (CLI/YARN Client) runner thread (startWithArgs()).] WARN org.apache.flink.yarn.configuration.YarnLogConfigUtil [] - The configuration directory ('/tmp/junit8048407617893879030/conf') already contains a LOG4J config file.If you want to use logback, then please delete or rename the log configuration file. 05:26:43,033 [Frontend (CLI/YARN Client) runner thread (startWithArgs()).] WARN org.apache.flink.runtime.util.HadoopUtils [] - Could not find Hadoop configuration via any of the supported methods (Flink configuration, environment variables). 05:26:43,082 [Frontend (CLI/YARN Client) runner thread (startWithArgs()).] INFO org.apache.hadoop.yarn.client.RMProxy [] - Connecting to ResourceManager at 6d4ce71bbaec/192.168.192.2:42104
05:26:43,083 [Frontend (CLI/YARN Client) runner thread (startWithArgs()).] INFO org.apache.flink.yarn.YarnClusterDescriptor [] - No path for the flink jar passed. Using the location of class org.apache.flink.yarn. YarnClusterDescriptor to locate the jar
05:26:43,086 [Frontend (CLI/YARN Client) runner thread (startWithArgs()).] INFO org.apache.flink.yarn.YarnClusterDescriptor [] - Found Web Interface 6d4ce71bbaec:45932 of application 'application_1650173077369_0008'.
05:26:43,101 [Flink-RestClusterClient-IO-thread-1] INFO org.apache.flink.client.program.rest.RestClusterClient [] - Submitting job 'Windowed Join Example' (8e2a9cb1e08a4b6642ac342bd4d96dcb). 05:26:43,620 [Flink-RestClusterClient-IO-thread-4] INFO org.apache.flink.client.program.rest.RestClusterClient [] - Successfully submitted job 'Windowed Join Example' (8e2a9cb1e08a4b6642ac342bd4d96dcb) to 'http://6d4ce71bbaec: 45932'.
{code}
*Erroneous execution*
{code}
07:09:15,731 [Frontend (CLI/YARN Client) runner thread (startWithArgs()).] INFO org.apache.flink.yarn.YarnClusterDescriptor [] - YARN application has been deployed successfully.
07:09:15,731 [Frontend (CLI/YARN Client) runner thread (startWithArgs()).] INFO org.apache.flink.yarn.YarnClusterDescriptor [] - Found Web Interface 6dad280b2159:40493 of application 'application_1650179225006_0008'. 07:09:15,750 [ main] INFO org.apache.flink.yarn.YarnTestBase [] - Found expected output in redirected streams
07:09:15,752 [ main] INFO org.apache.flink.yarn.YARNSessionCapacitySchedulerITCase [] - Extracted hostname:port: 6dad280b2159:40493
07:09:15,752 [ main] INFO org.apache.flink.yarn.YarnTestBase [] - Running with args [run, --detached, /__w/2/s/flink-yarn-tests/target/programs/WindowJoin.jar]
07:09:15,757 [Frontend (CLI/YARN Client) runner thread (startWithArgs()).] WARN org.apache.flink.core.plugin.PluginConfig [] - The plugins directory [plugins] does not exist.
07:09:15,758 [Frontend (CLI/YARN Client) runner thread (startWithArgs()).] INFO org.apache.flink.client.cli.CliFrontend [] - Running 'run' command.
07:09:15,759 [Frontend (CLI/YARN Client) runner thread (startWithArgs()).] INFO org.apache.flink.client.cli.CliFrontend [] - Building program from JAR file
07:09:15,762 [Frontend (CLI/YARN Client) runner thread (startWithArgs()).] INFO org.apache.flink.client.ClientUtils [] - Starting program (detached: true)
07:09:15,867 [Frontend (CLI/YARN Client) runner thread (startWithArgs()).] INFO org.apache.flink.api.java.typeutils.TypeExtractor [] - class org.apache.flink.streaming.api.datastream.CoGroupedStreams$TaggedUnion does not contain a setter for field one 07:09:15,867 [Frontend (CLI/YARN Client) runner thread (startWithArgs()).] INFO org.apache.flink.api.java.typeutils.TypeExtractor [] - Class class org.apache.flink.streaming.api.datastream.CoGroupedStreams$TaggedUnion cannot be used as a POJO type because not all fields are valid POJO fields, and must be processed as GenericType. Please read the Flink documentation on "Data Types & Serialization" for details of the effect on performance. 07:09:15,871 [Frontend (CLI/YARN Client) runner thread (startWithArgs()).] INFO org.apache.flink.api.java.typeutils.TypeExtractor [] - class org.apache.flink.streaming.api.datastream.CoGroupedStreams$TaggedUnion does not contain a setter for field one 07:09:15,871 [Frontend (CLI/YARN Client) runner thread (startWithArgs()).] INFO org.apache.flink.api.java.typeutils.TypeExtractor [] - Class class org.apache.flink.streaming.api.datastream.CoGroupedStreams$TaggedUnion cannot be used as a POJO type because not all fields are valid POJO fields, and must be processed as GenericType. Please read the Flink documentation on "Data Types & Serialization" for details of the effect on performance. 07:09:16,194 [Flink-RestClusterClient-IO-thread-1] INFO org.apache.flink.client.program.rest.RestClusterClient [] - Submitting job 'Windowed Join Example' (2788aebd84c123f092c0992e5a7321ca).
07:09:16,236 [flink-rest-client-netty-thread-1] WARN org.apache.flink.client.program.rest.RestClusterClient [] - Attempt to submit job 'Windowed Join Example' (2788aebd84c123f092c0992e5a7321ca) to 'http://localhost:8081' has failed. java.util.concurrent.CompletionException: org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: localhost/127.0.0.1:8081
{code}
It is noteworthy that the line {{The configuration directory ('/tmp/junit8048407617893879030/conf') already contains a LOG4J config file}} only appears in the succesful execution. This may indicates that the config does not exist due to a timing issue and is replaced by a different config (possibly with the localhost address). Or this is a completely unrelated artifact.
> YARNSessionCapacitySchedulerITCase.testVCoresAreSetCorrectlyAndJobManagerHostnameAreShownInWebInterfaceAndDynamicPropertiesAndYarnApplicationNameAndTaskManagerSlots hangs on azure
> -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: FLINK-24960
> URL: https://issues.apache.org/jira/browse/FLINK-24960
> Project: Flink
> Issue Type: Bug
> Components: Deployment / YARN
> Affects Versions: 1.15.0, 1.14.3
> Reporter: Yun Gao
> Assignee: Niklas Semmler
> Priority: Critical
> Labels: pull-request-available, test-stability
> Fix For: 1.16.0
>
>
> {code:java}
> Nov 18 22:37:08 ================================================================================
> Nov 18 22:37:08 Test testVCoresAreSetCorrectlyAndJobManagerHostnameAreShownInWebInterfaceAndDynamicPropertiesAndYarnApplicationNameAndTaskManagerSlots(org.apache.flink.yarn.YARNSessionCapacitySchedulerITCase) is running.
> Nov 18 22:37:08 --------------------------------------------------------------------------------
> Nov 18 22:37:25 22:37:25,470 [ main] INFO org.apache.flink.yarn.YARNSessionCapacitySchedulerITCase [] - Extracted hostname:port: 5718b812c7ab:38622
> Nov 18 22:52:36 ==============================================================================
> Nov 18 22:52:36 Process produced no output for 900 seconds.
> Nov 18 22:52:36 ==============================================================================
> {code}
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=26722&view=logs&j=f450c1a5-64b1-5955-e215-49cb1ad5ec88&t=cc452273-9efa-565d-9db8-ef62a38a0c10&l=36395
--
This message was sent by Atlassian Jira
(v8.20.7#820007)