Posted to user@flink.apache.org by Pawel Bartoszek <pa...@gmail.com> on 2018/10/08 10:32:16 UTC

Job manager logs for previous YARN attempts

Hi,

I am looking into why YARN starts a new application attempt on Flink
1.5.2. The challenge is getting the logs for the first attempt. After
checking YARN I discovered that in both the first and the second attempt
the application master (job manager) gets assigned the same container id
(is this expected?). Does that mean the logs from the first attempt are
overwritten? I found that *setKeepContainersAcrossApplicationAttempts* is
enabled here
<https://github.com/apache/flink/blob/2ec72123e347e684ac40a1e1111a79a11211aadb/flink-yarn/src/main/java/org/apache/flink/yarn/AbstractYarnClusterDescriptor.java#L1340>
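
For context, here is a minimal sketch of how that flag is set through the
standard YARN client API when submitting an application. The Hadoop classes
and the two setters are the real YARN API; the surrounding setup is
illustrative only and not Flink's actual submission code:

    import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
    import org.apache.hadoop.yarn.client.api.YarnClient;
    import org.apache.hadoop.yarn.client.api.YarnClientApplication;
    import org.apache.hadoop.yarn.conf.YarnConfiguration;

    public class SubmitWithKeptContainers {
        public static void main(String[] args) throws Exception {
            YarnClient yarnClient = YarnClient.createYarnClient();
            yarnClient.init(new YarnConfiguration());
            yarnClient.start();

            YarnClientApplication app = yarnClient.createApplication();
            ApplicationSubmissionContext context =
                app.getApplicationSubmissionContext();

            // Keep already-running containers (e.g. TaskManagers) alive
            // when the AM fails over to a new application attempt ...
            context.setKeepContainersAcrossApplicationAttempts(true);
            // ... and allow more than one attempt in the first place.
            context.setMaxAppAttempts(2);

            // Set up the AM container launch context and resources here,
            // then submit: yarnClient.submitApplication(context);
        }
    }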

The second challenge is understanding whether the job will be restored in
the new application attempt, or whether the new attempt will just have
Flink running without any job.


Regards,
Pawel

*First attempt:*

[pawel_bartoszek@ip-10-4-X-X ~]$ yarn container -list appattempt_1538570922803_0020_000001
18/10/08 10:16:16 INFO client.RMProxy: Connecting to ResourceManager at ip-10-4-X-X.eu-west-1.compute.internal/10.4.108.26:8032
Total number of containers :1
Container-Id:      container_1538570922803_0020_02_000001
Start Time:        Mon Oct 08 09:47:17 +0000 2018
Finish Time:       N/A
State:             RUNNING
Host:              ip-10-4-X-X.eu-west-1.compute.internal:8041
Node Http Address: http://ip-10-4-X-X.eu-west-1.compute.internal:8042
LOG-URL:           http://ip-10-4-X-X.eu-west-1.compute.internal:8042/node/containerlogs/container_1538570922803_0020_02_000001/pawel_bartoszek

*Second attempt:*
[pawel_bartoszek@ip-10-4-X-X ~]$ yarn container -list appattempt_1538570922803_0020_000002
18/10/08 10:16:37 INFO client.RMProxy: Connecting to ResourceManager at ip-10-4-X-X.eu-west-1.compute.internal/10.4.X.X:8032
Total number of containers :1
Container-Id:      container_1538570922803_0020_02_000001
Start Time:        Mon Oct 08 09:47:17 +0000 2018
Finish Time:       N/A
State:             RUNNING
Host:              ip-10-4-X-X.eu-west-1.compute.internal:8041
Node Http Address: http://ip-10-4-X-X.eu-west-1.compute.internal:8042
LOG-URL:           http://ip-10-4-X-X.eu-west-1.compute.internal:8042/node/containerlogs/container_1538570922803_0020_02_000001/pawel_bartoszek

Re: Job manager logs for previous YARN attempts

Posted by Gary Yao <ga...@data-artisans.com>.
Hi Pawel,

As far as I know, the application attempt is incremented when the
application master fails and a new one is brought up. Therefore, what you
are seeing should not happen. I have just deployed on AWS EMR 5.17.0
(Hadoop 2.8.4) and killed the container running the application master;
the container id was not reused. Can you describe how to reproduce this
behavior? Do you have a sample application? Can you observe this behavior
consistently? Can you share the complete output of the following command?

    yarn logs -applicationId <YOUR_APPLICATION_ID>
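
Depending on your Hadoop version, you should also be able to fetch the
logs of a single container, which may help if per-attempt logs are really
being overwritten. With the ids from your listing (older Hadoop versions
may additionally require -nodeAddress):

    yarn logs -applicationId application_1538570922803_0020 -containerId container_1538570922803_0020_02_000001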

The call to setKeepContainersAcrossApplicationAttempts is needed to enable
recovery of previously allocated TaskManager containers [1]. I currently do
not see how it could cause the AM container to be kept across application
attempts.

> The second challenge is understanding whether the job will be restored in
> the new application attempt, or whether the new attempt will just have
> Flink running without any job.

The job will be restored if you have HA enabled [2][3].
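
For completeness, a minimal sketch of what an HA setup on YARN looks like
in flink-conf.yaml; the hostnames and paths below are placeholders, and
[2][3] have the full set of options:

    high-availability: zookeeper
    high-availability.zookeeper.quorum: zk-1:2181,zk-2:2181,zk-3:2181
    high-availability.storageDir: hdfs:///flink/ha/
    high-availability.zookeeper.path.root: /flink
    # How many application attempts YARN should allow for this cluster
    # (capped by YARN's yarn.resourcemanager.am.max-attempts):
    yarn.application-attempts: 10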

Best,
Gary

[1] https://hortonworks.com/blog/apache-hadoop-yarn-hdp-2-2-fault-tolerance-features-long-running-services/
[2] https://ci.apache.org/projects/flink/flink-docs-release-1.5/ops/jobmanager_high_availability.html#yarn-cluster-high-availability
[3] https://ci.apache.org/projects/flink/flink-docs-release-1.5/ops/deployment/yarn_setup.html#recovery-behavior-of-flink-on-yarn
