Posted to dev@ambari.apache.org by Jonathan Hurley <jh...@hortonworks.com> on 2016/03/01 18:39:40 UTC

Re: Review Request 43967: Express Upgrade Stuck At Manual Prompt Due To HRC Status Calculation Cache Problem

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/43967/
-----------------------------------------------------------

(Updated March 1, 2016, 12:39 p.m.)


Review request for Ambari, Alejandro Fernandez, Nate Cole, Sumit Mohanty, Sebastian Toader, and Sid Wagle.


Changes
-------

I replaced the LoadingCache with a plain Cache, and I can no longer use breakpoints to force a situation where the cache is stale. I have a high degree of confidence in the latest patch. Without Guava's LoadingCache, we can now update the cache entry directly within the context of our lock. This was the missing piece: with a LoadingCache, Guava used Future instances to update the cache at some later point, which was outside the context of our lock.
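
For reviewers who want the idea in isolation, here is a minimal sketch of "load and store the entry while holding our own lock". It is not the code in the patch: the class name, the String-valued summary, and the loader function are all assumptions for illustration, and a ConcurrentHashMap stands in for the Guava Cache so the example runs on the JDK alone.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;
import java.util.function.Function;

/**
 * Hypothetical stand-in for the status summary cache. The map plays the
 * role of a plain (non-loading) cache; the loader plays the role of the
 * database query. Neither name comes from HostRoleCommandDAO.
 */
public class LockedSummaryCache {
    private final Map<Long, String> cache = new ConcurrentHashMap<>();
    private final ReadWriteLock lock = new ReentrantReadWriteLock();
    private final Function<Long, String> loader; // assumed to never return null

    public LockedSummaryCache(Function<Long, String> loader) {
        this.loader = loader;
    }

    /** Returns the cached summary, loading it under the write lock on a miss. */
    public String get(long requestId) {
        lock.readLock().lock();
        try {
            String cached = cache.get(requestId);
            if (cached != null) {
                return cached;
            }
        } finally {
            lock.readLock().unlock();
        }

        lock.writeLock().lock();
        try {
            // Re-check: another thread may have loaded the entry in between.
            String cached = cache.get(requestId);
            if (cached == null) {
                // The load and the put both happen while we hold the lock,
                // so an invalidation cannot slip in between them.
                cached = loader.apply(requestId);
                cache.put(requestId, cached);
            }
            return cached;
        } finally {
            lock.writeLock().unlock();
        }
    }

    /** Removes the entry under the same write lock used for loading. */
    public void invalidate(long requestId) {
        lock.writeLock().lock();
        try {
            cache.remove(requestId);
        } finally {
            lock.writeLock().unlock();
        }
    }
}
```

The point of the design is that the load, the put, and the invalidation all serialize on one lock that we control, instead of the load being completed by the cache library on its own schedule.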


Bugs: AMBARI-15173
    https://issues.apache.org/jira/browse/AMBARI-15173


Repository: ambari


Description
-------

While performing an upgrade, it's possible for the status of a request/stage not to match that of its tasks. For example, the task could be {{HOLDING}} while the request is still {{IN_PROGRESS}}.

I believe that AMBARI-15011 is responsible for this issue. AMBARI-15011 introduced, among other things, a cache for the {{HostRoleCommandStatusSummaryDTO}}, which is an aggregation of the number of tasks a stage has in each state (PENDING, HOLDING, etc.).

This {{HostRoleCommandStatusSummaryDTO}} is used by {{CalculatedState}} to calculate a stage's and request's status based on the tasks. 

The problem is that {{ServerActionExecutor}} moves a task's state to {{HOLDING}} (reflected in the database correctly), but the cache invalidation happens inside the still-uncommitted transaction. A read that arrives after the invalidation but before the commit reloads the old summary, so stale data is re-cached. As a result, when we go to calculate the request and stage status, we get {{IN_PROGRESS}} instead of {{HOLDING}}.

{code}
{
  "href": "http://172.22.72.13:8080/api/v1/clusters/cl1/requests/61/stages/1?fields=*,tasks/*",
  "Stage": {
    "cluster_name": "cl1",
    "context": "Stop YARN Queues",
    "display_status": "IN_PROGRESS",
    "end_time": -1,
    "progress_percent": 35,
    "request_id": 61,
    "skippable": true,
    "stage_id": 1,
    "start_time": 1456227329191,
    "status": "IN_PROGRESS"
  },
  "tasks": [
    {
      "href": "http://172.22.72.13:8080/api/v1/clusters/cl1/requests/61/stages/1/tasks/754",
      "Tasks": {
        "attempt_cnt": 1,
        "cluster_name": "cl1",
        "command": "EXECUTE",
        "command_detail": "Before continuing, please stop all YARN queues. If yarn-site's yarn.resourcemanager.work-preserving-recovery.enabled is set to true, then you can skip this step since the clients will retry on their own.",
        "custom_command_name": "org.apache.ambari.server.serveraction.upgrades.ManualStageAction",
        "end_time": -1,
        "error_log": "errors-754.txt",
        "exit_code": 0,
        "host_name": "os-r6-mkqzcs-c10tom21unsecha-6.novalocal",
        "id": 754,
        "output_log": "output-754.txt",
        "request_id": 61,
        "role": "AMBARI_SERVER_ACTION",
        "stage_id": 1,
        "start_time": 1456227329191,
        "status": "HOLDING",
        "stderr": "",
        "stdout": "",
        "structured_out": {}
      }
    }
  ]
}
{code}
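
The ordering problem described above can be reproduced in miniature. The following is a single-threaded sketch that interleaves the steps by hand; the class, its fields, and the one-row "database" are hypothetical and only model the race, they are not Ambari code.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical model: one committed status, one pending write, one cache. */
public class StaleCacheDemo {
    private String committed = "IN_PROGRESS"; // what other transactions can see
    private String uncommitted = null;        // pending write in the open transaction
    private final Map<Long, String> cache = new HashMap<>();

    void write(String status) { uncommitted = status; }

    void commit() {
        if (uncommitted != null) {
            committed = uncommitted;
            uncommitted = null;
        }
    }

    /** A reader outside the transaction: sees only committed data on a miss. */
    String read(long stageId) {
        return cache.computeIfAbsent(stageId, id -> committed);
    }

    void invalidate(long stageId) { cache.remove(stageId); }

    /** Invalidation inside the open transaction: the cache ends up stale. */
    public static String invalidateBeforeCommit() {
        StaleCacheDemo d = new StaleCacheDemo();
        d.read(1L);         // warm the cache with IN_PROGRESS
        d.write("HOLDING"); // task moves to HOLDING, not yet committed
        d.invalidate(1L);   // invalidation happens too early
        d.read(1L);         // concurrent reader re-caches the old value
        d.commit();
        return d.read(1L);
    }

    /** Invalidation after the commit: the next read sees the new status. */
    public static String invalidateAfterCommit() {
        StaleCacheDemo d = new StaleCacheDemo();
        d.read(1L);
        d.write("HOLDING");
        d.commit();
        d.invalidate(1L);
        return d.read(1L);
    }

    public static void main(String[] args) {
        System.out.println(invalidateBeforeCommit()); // prints IN_PROGRESS
        System.out.println(invalidateAfterCommit());  // prints HOLDING
    }
}
```

The first case matches the response above: the task row says HOLDING while the cached summary still yields IN_PROGRESS.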


Diffs (updated)
-----

  ambari-server/src/main/java/org/apache/ambari/server/orm/dao/HostRoleCommandDAO.java 14dac79 

Diff: https://reviews.apache.org/r/43967/diff/


Testing
-------

Pending unit tests...


Thanks,

Jonathan Hurley

Re: Review Request 43967: Express Upgrade Stuck At Manual Prompt Due To HRC Status Calculation Cache Problem

Posted by Sebastian Toader <st...@hortonworks.com>.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/43967/#review121456
-----------------------------------------------------------


Ship it!

- Sebastian Toader

