Posted to issues@hbase.apache.org by "Greg Bowyer (JIRA)" <ji...@apache.org> on 2011/02/17 18:33:24 UTC

[jira] Created: (HBASE-3545) Possible liveness issue with MasterServerAddress in HRegionServer getMaster

Possible liveness issue with MasterServerAddress in HRegionServer getMaster 
----------------------------------------------------------------------------

                 Key: HBASE-3545
                 URL: https://issues.apache.org/jira/browse/HBASE-3545
             Project: HBase
          Issue Type: Bug
          Components: regionserver
    Affects Versions: 0.90.0
         Environment: 4 Node test cluster
2x HBase master
3x ZooKeeper nodes
4x RS
            Reporter: Greg Bowyer


As part of our evaluation of HBase we have been testing failure scenarios to see how HBase fails in certain situations.

One of these is the outright failure of an HBase master.

What presently happens, if an HBase master is shut down, is that the standby master becomes the active master in ZooKeeper. At the same time the region servers fail to connect to the dead master and typically fail their own heartbeats as part of the reportForDuty() method.

Following this, the region server attempts to get a connection to a working HBase master. Inside the getMaster() method, the first action is to get the address of a potentially working master server from ZooKeeper. The code then enters a tight loop in which it keeps attempting to connect to the address of the master found in ZooKeeper.

Unfortunately, it appears that during master fail-over it is possible to get the address of the old, broken master. This address is then put into the connection attempt loop, and the region server tries indefinitely to connect to the failed, nonexistent master. At this point nothing is able to break the loop in getMaster(), so the RS is unable to contact the new master.
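
For illustration only, the loop described above can be sketched as follows. This is not the HBase code and not necessarily what the attached patch does. It is a minimal sketch that assumes the remedy is to re-read the master address from ZooKeeper on every attempt rather than once before the loop. The names MasterLookupSketch, connectToMaster, addressSource and tryConnect are hypothetical, and the attempt cap exists only so that the example terminates.

```java
import java.util.function.Predicate;
import java.util.function.Supplier;

// Hypothetical sketch; none of these names come from HBase.
final class MasterLookupSketch {

    /**
     * Tries to reach a master, asking addressSource for the current address
     * on EVERY attempt. addressSource stands in for a ZooKeeper read and
     * tryConnect stands in for the RPC connection attempt.
     *
     * Returns the address that accepted the connection, or null if no
     * attempt succeeded or the thread was interrupted.
     */
    static String connectToMaster(Supplier<String> addressSource,
                                  Predicate<String> tryConnect,
                                  int maxAttempts,
                                  long sleepMillis) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            // Re-read inside the loop so a stale address from before the
            // fail-over is replaced once the new master has registered.
            String address = addressSource.get();
            if (address != null && tryConnect.test(address)) {
                return address;
            }
            try {
                Thread.sleep(sleepMillis);
            } catch (InterruptedException e) {
                // Preserve the interrupt for the caller and give up.
                Thread.currentThread().interrupt();
                return null;
            }
        }
        return null;
    }
}
```

If the address were instead read once before the loop, every iteration would retry the same dead address, which is the liveness problem reported here.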

At the same time the new master is waiting patiently for the existing region servers, as reported in ZooKeeper, to re-establish contact with it.

Attached is a patch that rectifies this issue in our test cluster for both the 0.90.0 tag and trunk versions (as of git SHA b72a24f71b67598e4077a9d1452f903082b0a9b7) of HBase.

This patch is also available in a forked repository here: https://github.com/GregBowyer/hbase/commit/543f5903731ef6bbfd58c990e04a2c635e5c94b4

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] Updated: (HBASE-3545) Possible liveness issue with MasterServerAddress in HRegionServer getMaster

Posted by "Greg Bowyer (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-3545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Greg Bowyer updated HBASE-3545:
-------------------------------

    Status: Open  (was: Patch Available)



[jira] Updated: (HBASE-3545) Possible liveness issue with MasterServerAddress in HRegionServer getMaster

Posted by "Jean-Daniel Cryans (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-3545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jean-Daniel Cryans updated HBASE-3545:
--------------------------------------

    Fix Version/s: 0.90.2



[jira] Resolved: (HBASE-3545) Possible liveness issue with MasterServerAddress in HRegionServer getMaster

Posted by "stack (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-3545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack resolved HBASE-3545.
--------------------------

      Resolution: Fixed
    Hadoop Flags: [Reviewed]

Committed to trunk and branch (will be available in 0.90.2). Thanks for the patch, Greg.



[jira] Updated: (HBASE-3545) Possible liveness issue with MasterServerAddress in HRegionServer getMaster

Posted by "Greg Bowyer (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-3545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Greg Bowyer updated HBASE-3545:
-------------------------------

    Attachment: 0001-Fixed-issue-where-the-region-servers-may-never-conne.patch



[jira] Commented: (HBASE-3545) Possible liveness issue with MasterServerAddress in HRegionServer getMaster

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-3545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12996152#comment-12996152 ] 

Hudson commented on HBASE-3545:
-------------------------------

Integrated in HBase-TRUNK #1746 (See [https://hudson.apache.org/hudson/job/HBase-TRUNK/1746/])
    HBASE-3545 Possible liveness issue with MasterServerAddress in HRegionServer getMaster


