Posted to issues@ambari.apache.org by "Jonathan Hurley (JIRA)" <ji...@apache.org> on 2018/06/13 16:18:00 UTC

[jira] [Resolved] (AMBARI-24090) Resuming a Paused Upgrade Attempts To Retry Tasks Which Were Already Passed

     [ https://issues.apache.org/jira/browse/AMBARI-24090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Hurley resolved AMBARI-24090.
--------------------------------------
    Resolution: Fixed

> Resuming a Paused Upgrade Attempts To Retry Tasks Which Were Already Passed
> ---------------------------------------------------------------------------
>
>                 Key: AMBARI-24090
>                 URL: https://issues.apache.org/jira/browse/AMBARI-24090
>             Project: Ambari
>          Issue Type: Bug
>    Affects Versions: 2.6.0
>            Reporter: Jonathan Hurley
>            Assignee: Jonathan Hurley
>            Priority: Blocker
>              Labels: pull-request-available
>             Fix For: 2.7.0
>
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> STR:
> - Perform a stack upgrade during which there's a slave failure due to a timed-out task.
> - Ignore and Proceed past this task and continue to the Component Version Check. It should fail (since a task on a slave timed out).
> - Pause the upgrade, then fix the version by restarting the component.
> - Resume the upgrade.
> Result: the upgrade attempts to place the timed-out task back into a {{PENDING}} state.
> I think I have an idea about what's going on here. Consider the following upgrade tasks:
> - 1: COMPLETED
> - 2: HOLDING_TIMEDOUT
> - 3: PENDING
> - 4: PENDING
> - 5: PENDING
> When you "Ignore and Proceed", it sets the {{HOLDING_TIMEDOUT}} task to {{TIMEDOUT}}, which is technically a completed state. The upgrade then moves on until a later task fails:
> - 1: COMPLETED
> - 2: TIMEDOUT
> - 3: COMPLETED
> - 4: HOLDING_FAILED
> - 5: PENDING
> Now, when you pause the upgrade, we set every "scheduled" state to {{ABORTED}}, while existing completed states are preserved:
> - 1: COMPLETED
> - 2: TIMEDOUT
> - 3: COMPLETED
> - 4: ABORTED
> - 5: ABORTED
> When you resume the upgrade, it searches for all {{ABORTED}} _AND_ {{TIMEDOUT}} tasks and resets them to {{PENDING}}:
> - 1: COMPLETED
> - 2: PENDING
> - 3: COMPLETED
> - 4: PENDING
> - 5: PENDING
> So, because an earlier task (2) is back in {{PENDING}} even though a later task (3) has already completed, the scheduler barfs.
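The pause/resume sequence described in the issue can be reproduced with a small simulation. This is a minimal sketch under stated assumptions, not Ambari's scheduler code: tasks are modeled as a dict from a hypothetical integer id (standing in for execution order) to a status string; the set of statuses left untouched on pause is an assumption based on the issue's description; and `pause`, `resume` and `first_out_of_order` are hypothetical helpers, not Ambari APIs. It applies the two transitions from the issue and flags the condition that trips the scheduler: an earlier task back in PENDING while a later task is already COMPLETED.

```python
# Hypothetical model of the task states described in AMBARI-24090.
# Task ids stand in for execution order; statuses are plain strings.
# Illustrative simulation only, not Ambari's actual scheduler logic.

# Assumption: statuses treated as "completed" and left alone on pause
# (the issue notes TIMEDOUT is technically a completed state).
COMPLETED_STATES = {"COMPLETED", "TIMEDOUT", "ABORTED"}
# Statuses reset to PENDING on resume, per the behavior described in the issue.
RESET_ON_RESUME = {"ABORTED", "TIMEDOUT"}


def pause(tasks):
    """Set every not-yet-completed ("scheduled") task to ABORTED."""
    return {tid: (s if s in COMPLETED_STATES else "ABORTED") for tid, s in tasks.items()}


def resume(tasks):
    """Reset every ABORTED or TIMEDOUT task to PENDING."""
    return {tid: ("PENDING" if s in RESET_ON_RESUME else s) for tid, s in tasks.items()}


def first_out_of_order(tasks):
    """Return (pending_id, completed_id) for the first PENDING task that
    precedes a COMPLETED task, or None if the ordering is consistent."""
    ordered = sorted(tasks.items())
    for i, (tid, status) in enumerate(ordered):
        if status != "PENDING":
            continue
        for later_id, later_status in ordered[i + 1:]:
            if later_status == "COMPLETED":
                return tid, later_id
    return None


# State after "Ignore and Proceed", as listed in the issue.
after_ignore = {1: "COMPLETED", 2: "TIMEDOUT", 3: "COMPLETED",
                4: "HOLDING_FAILED", 5: "PENDING"}
paused = pause(after_ignore)
resumed = resume(paused)

print(paused)
print(resumed)
print(first_out_of_order(after_ignore))  # None
print(first_out_of_order(resumed))       # (2, 3)
```

Starting from the post-"Ignore and Proceed" state, the sketch yields the same paused and resumed lists as the issue, and the check reports task 2 as PENDING ahead of the already completed task 3.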



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)