Posted to dev@oozie.apache.org by "Purshotam Shah (JIRA)" <ji...@apache.org> on 2014/08/28 00:50:58 UTC

[jira] [Updated] (OOZIE-1982) Workflow never resumes after the Oozie server suspends it (because of a Hadoop error)

     [ https://issues.apache.org/jira/browse/OOZIE-1982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Purshotam Shah updated OOZIE-1982:
----------------------------------

    Description: 
Oozie submits a job and then keeps checking its status.
If the RM (ResourceManager) goes down while Oozie is checking the status and the job is still running, Oozie tries 3 times and then suspends the job.

The job never recovers, even if Oozie receives the job end notification.

Suspending the job is reasonable because we don't want to keep retrying.

But Oozie should resume the job once the RM is back up.

This can be done in 2 ways:
1. The job end notification should resume the suspended job.
2. The recovery service can also recover those suspended jobs.

We may also need to introduce a new job status (like platform_suspend) to differentiate it from a user suspend.
If a user has suspended the job, the server should not resume it.
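
To make the proposal concrete, here is a rough sketch of how the two kinds of suspend could be kept apart. It is not Oozie code: the class, the enum values (including PLATFORM_SUSPENDED, which stands in for the proposed platform_suspend status), the event names, and the retry constant are all hypothetical and only illustrate the behavior described above.

```java
// Illustrative only: none of these types exist in Oozie.
final class SuspendSketch {

    // PLATFORM_SUSPENDED models the proposed "platform_suspend" status.
    enum Status { RUNNING, USER_SUSPENDED, PLATFORM_SUSPENDED }

    // Hypothetical events that can change the status of a job.
    enum Event {
        STATUS_CHECK_FAILED,   // status check failed, e.g. RM is down
        JOB_END_NOTIFICATION,  // option 1: job end notification arrives
        RECOVERY_SCAN_RM_UP,   // option 2: recovery service sees the RM is up
        USER_SUSPEND,
        USER_RESUME
    }

    // Assumed from the description: 3 failed checks, then suspend.
    static final int MAX_STATUS_CHECKS = 3;

    static final class Job {
        Status status = Status.RUNNING;
        int failedChecks = 0;
    }

    static void handle(Job job, Event event) {
        switch (event) {
            case STATUS_CHECK_FAILED:
                // Only a running job is polled; suspend after the last failed try.
                if (job.status == Status.RUNNING
                        && ++job.failedChecks >= MAX_STATUS_CHECKS) {
                    job.status = Status.PLATFORM_SUSPENDED;
                }
                break;
            case JOB_END_NOTIFICATION:
            case RECOVERY_SCAN_RM_UP:
                // The server resumes only what the server itself suspended.
                if (job.status == Status.PLATFORM_SUSPENDED) {
                    job.status = Status.RUNNING;
                    job.failedChecks = 0;
                }
                break;
            case USER_SUSPEND:
                if (job.status == Status.RUNNING) {
                    job.status = Status.USER_SUSPENDED;
                }
                break;
            case USER_RESUME:
                // Assumption: an explicit user resume works for either suspend kind.
                if (job.status != Status.RUNNING) {
                    job.status = Status.RUNNING;
                    job.failedChecks = 0;
                }
                break;
        }
    }

    public static void main(String[] args) {
        Job job = new Job();
        for (int i = 0; i < MAX_STATUS_CHECKS; i++) {
            handle(job, Event.STATUS_CHECK_FAILED);
        }
        System.out.println("after failed checks: " + job.status);
        handle(job, Event.JOB_END_NOTIFICATION);
        System.out.println("after job end notification: " + job.status);
    }
}
```

The point of the separate status is the guard in the JOB_END_NOTIFICATION / RECOVERY_SCAN_RM_UP branch: a job that a user suspended is left alone, while a job suspended by the server can be picked up again by either of the two options.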

> Workflow never resumes after the Oozie server suspends it (because of a Hadoop error)
> -------------------------------------------------------------------------------------
>
>                 Key: OOZIE-1982
>                 URL: https://issues.apache.org/jira/browse/OOZIE-1982
>             Project: Oozie
>          Issue Type: Bug
>            Reporter: Purshotam Shah


