Posted to issues@flink.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2018/04/30 16:46:00 UTC
[jira] [Commented] (FLINK-8900) YARN FinalStatus always shows as KILLED with Flip-6

    [ https://issues.apache.org/jira/browse/FLINK-8900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16458752#comment-16458752 ] 

ASF GitHub Bot commented on FLINK-8900:
---------------------------------------

GitHub user StephanEwen opened a pull request:

    https://github.com/apache/flink/pull/5944

    [FLINK-8900] [yarn] Set correct application status when job is finished

    ## What is the purpose of the change
    
    When finite Flink applications (batch jobs) are submitted to YARN in detached mode, the final application status is currently always `KILLED`, regardless of how the job ended, because the job's result is not passed to the logic that initiates the application shutdown.
    
    This PR forwards the final job status via a future that is used to register the shutdown handlers.
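    
    A minimal sketch of this idea, not the code in the PR: a dispatcher that runs a single job exposes a future that completes with the job's final status, and the shutdown handler is registered on that future so that it receives the status. The names `TerminationFutureSketch`, `SingleJobDispatcher`, `FinalStatus`, and `registerShutdownHandler` are hypothetical and exist only for illustration; the actual Flink classes and signatures may differ.
    
    ```java
    import java.util.concurrent.CompletableFuture;
    import java.util.concurrent.atomic.AtomicReference;

    public class TerminationFutureSketch {

        /** Hypothetical stand-in for the final status reported to YARN. */
        enum FinalStatus { SUCCEEDED, FAILED, KILLED }

        /** Hypothetical stand-in for a dispatcher that runs exactly one job. */
        static class SingleJobDispatcher {
            private final CompletableFuture<FinalStatus> jobTerminationFuture =
                    new CompletableFuture<>();

            CompletableFuture<FinalStatus> getJobTerminationFuture() {
                return jobTerminationFuture;
            }

            /** Called when the job reaches a terminal state. */
            void jobFinished(boolean success) {
                jobTerminationFuture.complete(
                        success ? FinalStatus.SUCCEEDED : FinalStatus.FAILED);
            }
        }

        /**
         * Registers a shutdown handler on the termination future. The handler
         * here only records the status it would report; a real entrypoint
         * would deregister the application with that status instead.
         */
        static AtomicReference<FinalStatus> registerShutdownHandler(
                SingleJobDispatcher dispatcher) {
            AtomicReference<FinalStatus> reported = new AtomicReference<>();
            dispatcher.getJobTerminationFuture().thenAccept(reported::set);
            return reported;
        }

        public static void main(String[] args) {
            SingleJobDispatcher dispatcher = new SingleJobDispatcher();
            AtomicReference<FinalStatus> reported = registerShutdownHandler(dispatcher);

            dispatcher.jobFinished(true);       // the job ends successfully
            System.out.println(reported.get()); // prints SUCCEEDED
        }
    }
    ```
    
    The point of the sketch is that the shutdown logic no longer has to guess the outcome: whatever status completes the future is the status the handler sees.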
    
    ## Brief change log
    
      - Introduce the `JobTerminationFuture` in the `MiniDispatcher`
    
    ## Verifying this change
    
    ```
    bin/flink run -m yarn-cluster -yjm 2048 -ytm 2048 ./examples/streaming/WordCount.jar
    ```
    
      - Run the batch job on YARN as described above and let it succeed; check that the final application status is successful.
    
      - Run the batch job on YARN with a parameter pointing to a nonexistent input file; check that the final application status is failed.
    
    ## Does this pull request potentially affect one of the following parts:
    
      - Dependencies (does it add or upgrade a dependency): (yes / **no**)
      - The public API, i.e., is any changed class annotated with `@Public(Evolving)`: (yes / **no**)
      - The serializers: (yes / **no** / don't know)
      - The runtime per-record code paths (performance sensitive): (yes / **no** / don't know)
      - Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: (**yes** / no / don't know)
      - The S3 file system connector: (yes / **no** / don't know)
    
    ## Documentation
    
      - Does this pull request introduce a new feature? (yes / **no**)
      - If yes, how is the feature documented? (**not applicable** / docs / JavaDocs / not documented)


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/StephanEwen/incubator-flink yarn_fix

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/5944.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #5944
    
----
commit f4130c64420e2ad2acb680869c9b84aa5dbcc7c7
Author: Stephan Ewen <se...@...>
Date:   2018-04-30T07:55:50Z

    [hotfix] [tests] Update log4j-test.properties
    
    Brings the logging definition in sync with other projects.
    Updates the classname for the suppressed logger in Netty to account for the new
    shading model introduced in Flink 1.4.

commit 5fcc9aca392cbcd5dfa474b0a286868b44836f23
Author: Stephan Ewen <se...@...>
Date:   2018-04-27T16:57:27Z

    [FLINK-8900] [yarn] Set correct application status when job is finished

----


> YARN FinalStatus always shows as KILLED with Flip-6
> ---------------------------------------------------
>
>                 Key: FLINK-8900
>                 URL: https://issues.apache.org/jira/browse/FLINK-8900
>             Project: Flink
>          Issue Type: Bug
>          Components: YARN
>    Affects Versions: 1.5.0, 1.6.0
>            Reporter: Nico Kruber
>            Assignee: Gary Yao
>            Priority: Blocker
>              Labels: flip-6
>             Fix For: 1.5.0
>
>
> Whenever I run a simple word count like this one on YARN with Flip-6 enabled,
> {code}
> ./bin/flink run -m yarn-cluster -yjm 768 -ytm 3072 -ys 2 -p 20 -c org.apache.flink.streaming.examples.wordcount.WordCount ./examples/streaming/WordCount.jar --input /usr/share/doc/rsync-3.0.6/COPYING
> {code}
> it will show up as {{KILLED}} in the {{State}} and {{FinalStatus}} columns even though the program ran successfully like this one (irrespective of FLINK-8899 occurring or not):
> {code}
> 2018-03-08 16:48:39,049 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Job Streaming WordCount (11a794d2f5dc2955d8015625ec300c20) switched from state RUNNING to FINISHED.
> 2018-03-08 16:48:39,050 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Stopping checkpoint coordinator for job 11a794d2f5dc2955d8015625ec300c20
> 2018-03-08 16:48:39,050 INFO  org.apache.flink.runtime.checkpoint.StandaloneCompletedCheckpointStore  - Shutting down
> 2018-03-08 16:48:39,078 INFO  org.apache.flink.runtime.dispatcher.StandaloneDispatcher      - Job 11a794d2f5dc2955d8015625ec300c20 reached globally terminal state FINISHED.
> 2018-03-08 16:48:39,151 INFO  org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager  - Register TaskManager e58efd886429e8f080815ea74ddfa734 at the SlotManager.
> 2018-03-08 16:48:39,221 INFO  org.apache.flink.runtime.jobmaster.JobMaster                  - Stopping the JobMaster for job Streaming WordCount(11a794d2f5dc2955d8015625ec300c20).
> 2018-03-08 16:48:39,270 INFO  org.apache.flink.runtime.jobmaster.JobMaster                  - Close ResourceManager connection 43f725adaee14987d3ff99380701f52f: JobManager is shutting down..
> 2018-03-08 16:48:39,270 INFO  org.apache.flink.yarn.YarnResourceManager                     - Disconnect job manager 00000000000000000000000000000000@akka.tcp://flink@ip-172-31-7-0.eu-west-1.compute.internal:34281/user/jobmanager_0 for job 11a794d2f5dc2955d8015625ec300c20 from the resource manager.
> 2018-03-08 16:48:39,349 INFO  org.apache.flink.runtime.jobmaster.slotpool.SlotPool          - Suspending SlotPool.
> 2018-03-08 16:48:39,349 INFO  org.apache.flink.runtime.jobmaster.slotpool.SlotPool          - Stopping SlotPool.
> 2018-03-08 16:48:39,349 INFO  org.apache.flink.runtime.jobmaster.JobManagerRunner           - JobManagerRunner already shutdown.
> 2018-03-08 16:48:39,775 INFO  org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager  - Register TaskManager 4e1fb6c8f95685e24b6a4cb4b71ffb92 at the SlotManager.
> 2018-03-08 16:48:39,846 INFO  org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager  - Register TaskManager b5bce0bdfa7fbb0f4a0905cc3ee1c233 at the SlotManager.
> 2018-03-08 16:48:39,876 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - RECEIVED SIGNAL 15: SIGTERM. Shutting down as requested.
> 2018-03-08 16:48:39,910 INFO  org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager  - Register TaskManager a35b0690fdc6ec38bbcbe18a965000fd at the SlotManager.
> 2018-03-08 16:48:39,942 INFO  org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager  - Register TaskManager 5175cabe428bea19230ac056ff2a17bb at the SlotManager.
> 2018-03-08 16:48:39,974 INFO  org.apache.flink.runtime.blob.BlobServer                      - Stopped BLOB server at 0.0.0.0:46511
> 2018-03-08 16:48:39,975 INFO  org.apache.flink.runtime.blob.TransientBlobCache              - Shutting down BLOB cache
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)