Posted to dev@slider.apache.org by "Sandeep Nemuri (JIRA)" <ji...@apache.org> on 2016/08/01 10:43:20 UTC

[jira] [Updated] (SLIDER-1161) Improve regionserver status check in HBase Slider app package

     [ https://issues.apache.org/jira/browse/SLIDER-1161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sandeep Nemuri updated SLIDER-1161:
-----------------------------------
    Description: 
*PROBLEM* :

We use Slider to launch HBase containers.
The problem, step by step:
1. A region server goes into a long pause and loses its heartbeat with ZooKeeper.
2. HMaster notices this and marks the region server as DEAD.
3. However, the Slider agent only runs 'ps' on the region server process every heartbeat.monitor.interval (45000 ms in my case). Because it checks only that the process is alive, it does not consider the region server dead.
4. After the long pause, the region server finally recovers and contacts HMaster.
5. HMaster responds to the region server with YouAreAlreadyDeadException.
6. The region server then shuts itself down, and Slider notices that the process is no longer running.
7. Slider launches a new region server.

The issue, as the steps above show, is that there can be a long delay between steps 4 and 6. During that time the cluster runs with fewer region servers, which puts more load on the remaining ones.


The issue could be solved if Slider synced with HMaster to find out whether a region server is alive. It would then know immediately that HMaster has marked a region server as dead, and could bring that region server down and launch a new one.

  was:
*PROBLEM* :

We use Slider to launch HBase containers.
The problem, step by step:
1. A region server goes into a long pause and loses its heartbeat with ZooKeeper.
2. HMaster notices this and marks the region server as DEAD.
3. However, the Slider agent only runs 'ps' on the region server process every heartbeat.monitor.interval (45000 ms in my case). Because it checks only that the process is alive, it does not consider the region server dead.
4. After the long pause, the region server finally recovers and contacts HMaster.
5. HMaster responds to the region server with YouAreAlreadyDeadException.
6. The region server then shuts itself down, and Slider notices that the process is no longer running.
7. Slider launches a new region server.

The issue, as the steps above show, is that there can be a long delay between steps 4 and 6. During that time the cluster runs with fewer region servers, which puts more load on the remaining ones.



> Improve regionserver status check in HBase Slider app package
> -------------------------------------------------------------
>
>                 Key: SLIDER-1161
>                 URL: https://issues.apache.org/jira/browse/SLIDER-1161
>             Project: Slider
>          Issue Type: Improvement
>          Components: app-package
>    Affects Versions: Slider 0.80
>         Environment: RHEL-6 (64 Bit)
>            Reporter: Sandeep Nemuri
>
> *PROBLEM* :
> We use Slider to launch HBase containers.
> The problem, step by step:
> 1. A region server goes into a long pause and loses its heartbeat with ZooKeeper.
> 2. HMaster notices this and marks the region server as DEAD.
> 3. However, the Slider agent only runs 'ps' on the region server process every heartbeat.monitor.interval (45000 ms in my case). Because it checks only that the process is alive, it does not consider the region server dead.
> 4. After the long pause, the region server finally recovers and contacts HMaster.
> 5. HMaster responds to the region server with YouAreAlreadyDeadException.
> 6. The region server then shuts itself down, and Slider notices that the process is no longer running.
> 7. Slider launches a new region server.
> The issue, as the steps above show, is that there can be a long delay between steps 4 and 6. During that time the cluster runs with fewer region servers, which puts more load on the remaining ones.
> The issue could be solved if Slider synced with HMaster to find out whether a region server is alive. It would then know immediately that HMaster has marked a region server as dead, and could bring that region server down and launch a new one.
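
To make the proposal concrete, here is a rough sketch of what a combined status check might look like. It is not the existing Slider agent code. The function names, the status strings, and the dead_servers input are all hypothetical, and how the list of dead servers would actually be fetched from HMaster is deliberately left open. The process check assumes a POSIX host.

```python
import os


def process_is_running(pid):
    """Process-only check, similar in spirit to the current 'ps' test.

    Assumes a POSIX host, where signal 0 only probes for existence.
    """
    try:
        os.kill(pid, 0)  # signal 0: existence check, nothing is delivered
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def region_server_status(pid, server_name, dead_servers):
    """Combine the process check with HMaster's view of the region server.

    dead_servers is a collection of server names that HMaster currently
    reports as dead. Obtaining it is outside this sketch (hypothetical input).
    The returned status strings are illustrative only.
    """
    if not process_is_running(pid):
        return "STOPPED"
    if server_name in dead_servers:
        # The process is alive, but HMaster has already given up on it.
        return "DEAD_ON_MASTER"
    return "RUNNING"
```

With a check along these lines, the agent could treat DEAD_ON_MASTER the same way it treats a missing process, so the replacement region server is requested without waiting for the old one to shut itself down.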



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)