Posted to yarn-issues@hadoop.apache.org by "Xun Liu (JIRA)" <ji...@apache.org> on 2018/10/15 09:33:00 UTC

[jira] [Updated] (YARN-8876) [Submarine] Job monitor of {submarine}

     [ https://issues.apache.org/jira/browse/YARN-8876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xun Liu updated YARN-8876:
--------------------------
    Description: 
h1. Job monitor of \{submarine}

After training completes, the monitoring program needs to shut down the PS (parameter server) service automatically.

Submarine needs to provide a long-running resident service that monitors each job.

This monitoring service can handle training tasks differently depending on the deep learning framework in use.

For example, when TensorFlow performs distributed training, the PS service does not stop automatically once training completes. The monitoring service must then stop the PS explicitly.

  was:
h1. Job monitor of {submarine}

Submarine needs to provide a long-running resident service that monitors each job.

This monitoring service can handle training tasks differently depending on the deep learning framework in use.

For example, when TensorFlow performs distributed training, the PS service does not stop automatically once training completes. The monitoring service must then stop the PS explicitly.


> [Submarine] Job monitor of {submarine}
> --------------------------------------
>
>                 Key: YARN-8876
>                 URL: https://issues.apache.org/jira/browse/YARN-8876
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Xun Liu
>            Assignee: Xun Liu
>            Priority: Major
>
> h1. Job monitor of \{submarine}
> After training completes, the monitoring program needs to shut down the PS (parameter server) service automatically.
> Submarine needs to provide a long-running resident service that monitors each job.
> This monitoring service can handle training tasks differently depending on the deep learning framework in use.
> For example, when TensorFlow performs distributed training, the PS service does not stop automatically once training completes. The monitoring service must then stop the PS explicitly.
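
The behaviour described above can be illustrated with a minimal sketch. It is not Submarine's implementation: the `Job` model, the `State` enum, the role names "worker" and "ps", the `POLICIES` table, and the `stop_role` callback are all hypothetical names introduced for illustration. It also assumes that "training is completed" means every worker task has reached a terminal state.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Dict, List


class State(Enum):
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    FAILED = "failed"
    STOPPED = "stopped"


# Assumption: a task that succeeded or failed is no longer training.
TERMINAL = (State.SUCCEEDED, State.FAILED)


@dataclass
class Job:
    """Hypothetical job model: role name -> states of that role's tasks."""
    framework: str
    tasks: Dict[str, List[State]] = field(default_factory=dict)


def tensorflow_policy(job: Job) -> List[str]:
    """Return the roles to stop: the PS, once every worker has finished."""
    workers = job.tasks.get("worker", [])
    ps = job.tasks.get("ps", [])
    workers_done = bool(workers) and all(s in TERMINAL for s in workers)
    ps_running = any(s is State.RUNNING for s in ps)
    return ["ps"] if workers_done and ps_running else []


# One policy per framework type, so each framework can be handled differently.
POLICIES: Dict[str, Callable[[Job], List[str]]] = {
    "tensorflow": tensorflow_policy,
}


def monitor_once(job: Job, stop_role: Callable[[Job, str], None]) -> List[str]:
    """Run one monitoring pass over a job and stop whatever its policy names."""
    policy = POLICIES.get(job.framework)
    if policy is None:
        return []  # unknown framework: take no action
    roles = policy(job)
    for role in roles:
        stop_role(job, role)
    return roles


def mark_stopped(job: Job, role: str) -> None:
    """Stand-in for a real stop request; it only updates the in-memory model."""
    job.tasks[role] = [State.STOPPED for _ in job.tasks[role]]
```

A resident service would call `monitor_once` periodically for each job it tracks, passing a `stop_role` callback that issues the real stop request. Keeping the per-framework decision in a policy table keeps the polling loop itself framework-agnostic.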



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
