You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@hbase.apache.org by "Kannan Muthukkaruppan (JIRA)" <ji...@apache.org> on 2010/07/22 22:21:50 UTC

[jira] Created: (HBASE-2866) Region permanently offlined

Region permanently offlined 
----------------------------

                 Key: HBASE-2866
                 URL: https://issues.apache.org/jira/browse/HBASE-2866
             Project: HBase
          Issue Type: Bug
            Reporter: Kannan Muthukkaruppan
            Priority: Blocker


After split, master attempts to reassign a region to a region server. Occasionally, such a region can get permanently offlined.


Master:
---------
{code}
2010-07-22 01:26:00,914 INFO org.apache.hadoop.hbase.master.ServerManager: Processing MSG_REPORT_SPLIT_INCLUDES_DAUGHTERS: test1,6512200000,1279784117114.6466481aa931f8c1fa87622735487a72.: Daughters; test1,6512200000,1279787158624.6ead25ae677116cc88fc5420bb39d52e., test1,6531790000,1279787\
158624.8d5490bfc166c687657cb09203bd7d44. from test024.test.xyz.com,60020,1279780567744; 1 of 1                                                                                                                                                                                                     
2010-07-22 01:26:00,935 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Creating UNASSIGNED region 8d5490bfc166c687657cb09203bd7d44 in state = M2ZK_REGION_OFFLINE

2010-07-22 01:26:00,935 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Creating UNASSIGNED region 8d5490bfc166c687657cb09203bd7d44 in state = M2ZK_REGION_OFFLINE

2010-07-22 01:26:00,945 INFO org.apache.hadoop.hbase.master.RegionManager: Assigning region test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44. to test024.test.xyz.com,60020,1279780567744

2010-07-22 01:26:00,949 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: While updating UNASSIGNED region 8d5490bfc166c687657cb09203bd7d44 exists, state = M2ZK_REGION_OFFLINE

2010-07-22 01:26:00,954 DEBUG org.apache.hadoop.hbase.master.RegionManager: Created UNASSIGNED zNode test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44. in state M2ZK_REGION_OFFLINE
{code}

-------------------

Region Server:

{code}
2010-07-22 01:26:00,947 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN: test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.
2010-07-22 01:26:00,947 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN: test1,6512200000,1279787158624.6ead25ae677116cc88fc5420bb39d52e.
2010-07-22 01:26:00,947 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Worker: MSG_REGION_OPEN: test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.
2010-07-22 01:26:00,948 DEBUG org.apache.hadoop.hbase.regionserver.RSZookeeperUpdater: Updating ZNode /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44 with [RS2ZK_REGION_OPENING] expected version = 0
2010-07-22 01:26:00,952 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event, state: SyncConnected, type: NodeDataChanged, path: /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44
2010-07-22 01:26:00,974 WARN org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: <msgstorectrl001.test.xyz.com,msgstorectrl021.test.xyz.com,msgstorectrl041.test.xyz.com,msgstorectrl061.test.xyz.com,msgstorectrl081.ash2.facebook\
.com:/hbase,test024.test.xyz.com,60020,1279780567744>Failed to write data to ZooKeeper
org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:106)
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
        at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1038)
        at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.writeZNode(ZooKeeperWrapper.java:1062)
        at org.apache.hadoop.hbase.regionserver.RSZookeeperUpdater.updateZKWithEventData(RSZookeeperUpdater.java:161)
        at org.apache.hadoop.hbase.regionserver.RSZookeeperUpdater.startRegionOpenEvent(RSZookeeperUpdater.java:115)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.openRegion(HRegionServer.java:1428)
        at org.apache.hadoop.hbase.regionserver.HRegionServer$Worker.run(HRegionServer.java:1337)
        at java.lang.Thread.run(Thread.java:619)
2010-07-22 01:26:00,975 ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: Error opening test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.
java.io.IOException: org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44
        at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.writeZNode(ZooKeeperWrapper.java:1072)
{code}

Meta:
-----

Relevant section of META.

Note that these are the only two entries for the problem region. The first one is the parent region (and this problem
region is its splitB).  For the next one, note that there is no "info:server" and "info:serverstartcode" columns.

{code}
 test1,6512200000,12797841 column=info:splitB, timestamp=1279787160693, value=\x00\x0A6551820000\x00
 17114.6466481aa931f8c1fa8 \x00\x00\x01)\xF9BL`@test1,6531790000,1279787158624.8d5490bfc166c687657cb
 7622735487a72.            09203bd7d44.\x00\x0A6531790000\x00\x00\x00\x05\x05test1\x00\x00\x00\x00\x
                           00\x02\x00\x00\x00\x07IS_ROOT\x00\x00\x00\x05false\x00\x00\x00\x07IS_META
                           \x00\x00\x00\x05false\x00\x00\x00\x01\x08\x07actions\x00\x00\x00\x08\x00\
                           x00\x00\x0BBLOOMFILTER\x00\x00\x00\x04NONE\x00\x00\x00\x11REPLICATION_SCO
                           PE\x00\x00\x00\x010\x00\x00\x00\x0BCOMPRESSION\x00\x00\x00\x04NONE\x00\x0
                           0\x00\x08VERSIONS\x00\x00\x00\x013\x00\x00\x00\x03TTL\x00\x00\x00\x0A2147
                           483647\x00\x00\x00\x09BLOCKSIZE\x00\x00\x00\x0565536\x00\x00\x00\x09IN_ME
                           MORY\x00\x00\x00\x05false\x00\x00\x00\x0ABLOCKCACHE\x00\x00\x00\x04true\x
                           FE\xA0\xFD\xC5

 ..

 test1,6531790000,12797871 column=info:regioninfo, timestamp=1279787160782, value=REGION => {NAME =>
 58624.8d5490bfc166c687657  'test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.', STAR
 cb09203bd7d44.            TKEY => '6531790000', ENDKEY => '6551820000', ENCODED => 8d5490bfc166c687
                           657cb09203bd7d44, TABLE => {{NAME => 'test1', FAMILIES => [{NAME => 'acti
                           ons', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', VERSIONS => '3', C
                           OMPRESSION => 'NONE', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMOR
                           Y => 'false', BLOCKCACHE => 'true'}]}}
{code}


I think Karthik has a handle on the first part (i.e. why the RS ran into the version mismatch, and aborted opening the region). He'll add details to the JIRA. But what we aren't clear about at this stage is why the base scanner didn't kick in and try to reassign the region.

BTW, HBase "hbck" reported this as well (which was good!):

{code}
Number of Tables: 5
Number of live region servers:92
Number of dead region servers:0
.........
ERROR: Region test1,6512200000,1279784117114.6466481aa931f8c1fa87622735487a72. is not served by any region server  but is listed in META to be on server null
ERROR: Region test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44. is not served by any region server  but is listed in META to be on server null
{code}




-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HBASE-2866) Region permanently offlined

Posted by "HBase Review Board (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-2866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12891825#action_12891825 ] 

HBase Review Board commented on HBASE-2866:
-------------------------------------------

Message from: "Karthik Ranganathan" <ka...@gmail.com>

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://review.hbase.org/r/380/
-----------------------------------------------------------

(Updated 2010-07-23 15:08:16.286786)


Review request for hbase, stack and Kannan Muthukkaruppan.


Changes
-------

Addressed some comments (re-use variables, removed white spaces)


Summary
-------

Region permanently offlined - if the ZNode is already in the target state, do not update it again.


This addresses bug HBASE-2866.
    http://issues.apache.org/jira/browse/HBASE-2866


Diffs (updated)
-----

  trunk/src/main/java/org/apache/hadoop/hbase/master/ZKUnassignedWatcher.java 967128 
  trunk/src/main/java/org/apache/hadoop/hbase/zookeeper/ZooKeeperWrapper.java 967128 

Diff: http://review.hbase.org/r/380/diff


Testing
-------

Ran unit tests, went through fine (except TestRowAtomicity, which is known to be failing).


Thanks,

Karthik




> Region permanently offlined 
> ----------------------------
>
>                 Key: HBASE-2866
>                 URL: https://issues.apache.org/jira/browse/HBASE-2866
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Kannan Muthukkaruppan
>            Assignee: Karthik Ranganathan
>            Priority: Blocker
>         Attachments: master.log
>
>
> After split, master attempts to reassign a region to a region server. Occasionally, such a region can get permanently offlined.
> Master:
> ---------
> {code}
> 2010-07-22 01:26:00,914 INFO org.apache.hadoop.hbase.master.ServerManager: Processing MSG_REPORT_SPLIT_INCLUDES_DAUGHTERS: test1,6512200000,1279784117114.6466481aa931f8c1fa87622735487a72.: Daughters; test1,6512200000,1279787158624.6ead25ae677116cc88fc5420bb39d52e., test1,6531790000,1279787\
> 158624.8d5490bfc166c687657cb09203bd7d44. from test024.test.xyz.com,60020,1279780567744; 1 of 1                                                                                                                                                                                                     
> 2010-07-22 01:26:00,935 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Creating UNASSIGNED region 8d5490bfc166c687657cb09203bd7d44 in state = M2ZK_REGION_OFFLINE
> 2010-07-22 01:26:00,935 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Creating UNASSIGNED region 8d5490bfc166c687657cb09203bd7d44 in state = M2ZK_REGION_OFFLINE
> 2010-07-22 01:26:00,945 INFO org.apache.hadoop.hbase.master.RegionManager: Assigning region test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44. to test024.test.xyz.com,60020,1279780567744
> 2010-07-22 01:26:00,949 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: While updating UNASSIGNED region 8d5490bfc166c687657cb09203bd7d44 exists, state = M2ZK_REGION_OFFLINE
> 2010-07-22 01:26:00,954 DEBUG org.apache.hadoop.hbase.master.RegionManager: Created UNASSIGNED zNode test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44. in state M2ZK_REGION_OFFLINE
> {code}
> -------------------
> Region Server:
> {code}
> 2010-07-22 01:26:00,947 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN: test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.
> 2010-07-22 01:26:00,947 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN: test1,6512200000,1279787158624.6ead25ae677116cc88fc5420bb39d52e.
> 2010-07-22 01:26:00,947 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Worker: MSG_REGION_OPEN: test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.
> 2010-07-22 01:26:00,948 DEBUG org.apache.hadoop.hbase.regionserver.RSZookeeperUpdater: Updating ZNode /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44 with [RS2ZK_REGION_OPENING] expected version = 0
> 2010-07-22 01:26:00,952 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event, state: SyncConnected, type: NodeDataChanged, path: /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44
> 2010-07-22 01:26:00,974 WARN org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: <msgstorectrl001.test.xyz.com,msgstorectrl021.test.xyz.com,msgstorectrl041.test.xyz.com,msgstorectrl061.test.xyz.com,msgstorectrl081.ash2.facebook\
> .com:/hbase,test024.test.xyz.com,60020,1279780567744>Failed to write data to ZooKeeper
> org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:106)
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
>         at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1038)
>         at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.writeZNode(ZooKeeperWrapper.java:1062)
>         at org.apache.hadoop.hbase.regionserver.RSZookeeperUpdater.updateZKWithEventData(RSZookeeperUpdater.java:161)
>         at org.apache.hadoop.hbase.regionserver.RSZookeeperUpdater.startRegionOpenEvent(RSZookeeperUpdater.java:115)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer.openRegion(HRegionServer.java:1428)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer$Worker.run(HRegionServer.java:1337)
>         at java.lang.Thread.run(Thread.java:619)
> 2010-07-22 01:26:00,975 ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: Error opening test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.
> java.io.IOException: org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44
>         at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.writeZNode(ZooKeeperWrapper.java:1072)
> {code}
> Meta:
> -----
> Relevant section of META.
> Note that these are the only two entries for the problem region. The first one is the parent region (and this problem
> region is its splitB).  For the next one, note that there is no "info:server" and "info:serverstartcode" columns.
> {code}
>  test1,6512200000,12797841 column=info:splitB, timestamp=1279787160693, value=\x00\x0A6551820000\x00
>  17114.6466481aa931f8c1fa8 \x00\x00\x01)\xF9BL`@test1,6531790000,1279787158624.8d5490bfc166c687657cb
>  7622735487a72.            09203bd7d44.\x00\x0A6531790000\x00\x00\x00\x05\x05test1\x00\x00\x00\x00\x
>                            00\x02\x00\x00\x00\x07IS_ROOT\x00\x00\x00\x05false\x00\x00\x00\x07IS_META
>                            \x00\x00\x00\x05false\x00\x00\x00\x01\x08\x07actions\x00\x00\x00\x08\x00\
>                            x00\x00\x0BBLOOMFILTER\x00\x00\x00\x04NONE\x00\x00\x00\x11REPLICATION_SCO
>                            PE\x00\x00\x00\x010\x00\x00\x00\x0BCOMPRESSION\x00\x00\x00\x04NONE\x00\x0
>                            0\x00\x08VERSIONS\x00\x00\x00\x013\x00\x00\x00\x03TTL\x00\x00\x00\x0A2147
>                            483647\x00\x00\x00\x09BLOCKSIZE\x00\x00\x00\x0565536\x00\x00\x00\x09IN_ME
>                            MORY\x00\x00\x00\x05false\x00\x00\x00\x0ABLOCKCACHE\x00\x00\x00\x04true\x
>                            FE\xA0\xFD\xC5
>  ..
>  test1,6531790000,12797871 column=info:regioninfo, timestamp=1279787160782, value=REGION => {NAME =>
>  58624.8d5490bfc166c687657  'test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.', STAR
>  cb09203bd7d44.            TKEY => '6531790000', ENDKEY => '6551820000', ENCODED => 8d5490bfc166c687
>                            657cb09203bd7d44, TABLE => {{NAME => 'test1', FAMILIES => [{NAME => 'acti
>                            ons', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', VERSIONS => '3', C
>                            OMPRESSION => 'NONE', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMOR
>                            Y => 'false', BLOCKCACHE => 'true'}]}}
> {code}
> I think Karthik has a handle on the first part (i.e. why the RS ran into the version mismatch, and aborted opening the region). He'll add details to the JIRA. But what we aren't clear about at this stage is why the base scanner didn't kick in and try to reassign the region.
> BTW, HBase "hbck" reported this as well (which was good!):
> {code}
> Number of Tables: 5
> Number of live region servers:92
> Number of dead region servers:0
> .........
> ERROR: Region test1,6512200000,1279784117114.6466481aa931f8c1fa87622735487a72. is not served by any region server  but is listed in META to be on server null
> ERROR: Region test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44. is not served by any region server  but is listed in META to be on server null
> {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HBASE-2866) Region permanently offlined

Posted by "HBase Review Board (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-2866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12891822#action_12891822 ] 

HBase Review Board commented on HBASE-2866:
-------------------------------------------

Message from: "Karthik Ranganathan" <ka...@gmail.com>

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://review.hbase.org/r/380/
-----------------------------------------------------------

(Updated 2010-07-23 14:53:37.123965)


Review request for hbase, stack and Kannan Muthukkaruppan.


Changes
-------

Good catch Kannan... updated the diff with comments this time.


Summary
-------

Region permanently offlined - if the ZNode is already in the target state, do not update it again.


This addresses bug HBASE-2866.
    http://issues.apache.org/jira/browse/HBASE-2866


Diffs (updated)
-----

  trunk/src/main/java/org/apache/hadoop/hbase/master/ZKUnassignedWatcher.java 967128 
  trunk/src/main/java/org/apache/hadoop/hbase/zookeeper/ZooKeeperWrapper.java 967128 

Diff: http://review.hbase.org/r/380/diff


Testing
-------

Ran unit tests, went through fine (except TestRowAtomicity, which is known to be failing).


Thanks,

Karthik




> Region permanently offlined 
> ----------------------------
>
>                 Key: HBASE-2866
>                 URL: https://issues.apache.org/jira/browse/HBASE-2866
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Kannan Muthukkaruppan
>            Assignee: Karthik Ranganathan
>            Priority: Blocker
>         Attachments: master.log
>
>
> After split, master attempts to reassign a region to a region server. Occasionally, such a region can get permanently offlined.
> Master:
> ---------
> {code}
> 2010-07-22 01:26:00,914 INFO org.apache.hadoop.hbase.master.ServerManager: Processing MSG_REPORT_SPLIT_INCLUDES_DAUGHTERS: test1,6512200000,1279784117114.6466481aa931f8c1fa87622735487a72.: Daughters; test1,6512200000,1279787158624.6ead25ae677116cc88fc5420bb39d52e., test1,6531790000,1279787\
> 158624.8d5490bfc166c687657cb09203bd7d44. from test024.test.xyz.com,60020,1279780567744; 1 of 1                                                                                                                                                                                                     
> 2010-07-22 01:26:00,935 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Creating UNASSIGNED region 8d5490bfc166c687657cb09203bd7d44 in state = M2ZK_REGION_OFFLINE
> 2010-07-22 01:26:00,935 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Creating UNASSIGNED region 8d5490bfc166c687657cb09203bd7d44 in state = M2ZK_REGION_OFFLINE
> 2010-07-22 01:26:00,945 INFO org.apache.hadoop.hbase.master.RegionManager: Assigning region test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44. to test024.test.xyz.com,60020,1279780567744
> 2010-07-22 01:26:00,949 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: While updating UNASSIGNED region 8d5490bfc166c687657cb09203bd7d44 exists, state = M2ZK_REGION_OFFLINE
> 2010-07-22 01:26:00,954 DEBUG org.apache.hadoop.hbase.master.RegionManager: Created UNASSIGNED zNode test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44. in state M2ZK_REGION_OFFLINE
> {code}
> -------------------
> Region Server:
> {code}
> 2010-07-22 01:26:00,947 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN: test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.
> 2010-07-22 01:26:00,947 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN: test1,6512200000,1279787158624.6ead25ae677116cc88fc5420bb39d52e.
> 2010-07-22 01:26:00,947 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Worker: MSG_REGION_OPEN: test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.
> 2010-07-22 01:26:00,948 DEBUG org.apache.hadoop.hbase.regionserver.RSZookeeperUpdater: Updating ZNode /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44 with [RS2ZK_REGION_OPENING] expected version = 0
> 2010-07-22 01:26:00,952 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event, state: SyncConnected, type: NodeDataChanged, path: /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44
> 2010-07-22 01:26:00,974 WARN org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: <msgstorectrl001.test.xyz.com,msgstorectrl021.test.xyz.com,msgstorectrl041.test.xyz.com,msgstorectrl061.test.xyz.com,msgstorectrl081.ash2.facebook\
> .com:/hbase,test024.test.xyz.com,60020,1279780567744>Failed to write data to ZooKeeper
> org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:106)
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
>         at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1038)
>         at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.writeZNode(ZooKeeperWrapper.java:1062)
>         at org.apache.hadoop.hbase.regionserver.RSZookeeperUpdater.updateZKWithEventData(RSZookeeperUpdater.java:161)
>         at org.apache.hadoop.hbase.regionserver.RSZookeeperUpdater.startRegionOpenEvent(RSZookeeperUpdater.java:115)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer.openRegion(HRegionServer.java:1428)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer$Worker.run(HRegionServer.java:1337)
>         at java.lang.Thread.run(Thread.java:619)
> 2010-07-22 01:26:00,975 ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: Error opening test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.
> java.io.IOException: org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44
>         at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.writeZNode(ZooKeeperWrapper.java:1072)
> {code}
> Meta:
> -----
> Relevant section of META.
> Note that these are the only two entries for the problem region. The first one is the parent region (and this problem
> region is its splitB).  For the next one, note that there is no "info:server" and "info:serverstartcode" columns.
> {code}
>  test1,6512200000,12797841 column=info:splitB, timestamp=1279787160693, value=\x00\x0A6551820000\x00
>  17114.6466481aa931f8c1fa8 \x00\x00\x01)\xF9BL`@test1,6531790000,1279787158624.8d5490bfc166c687657cb
>  7622735487a72.            09203bd7d44.\x00\x0A6531790000\x00\x00\x00\x05\x05test1\x00\x00\x00\x00\x
>                            00\x02\x00\x00\x00\x07IS_ROOT\x00\x00\x00\x05false\x00\x00\x00\x07IS_META
>                            \x00\x00\x00\x05false\x00\x00\x00\x01\x08\x07actions\x00\x00\x00\x08\x00\
>                            x00\x00\x0BBLOOMFILTER\x00\x00\x00\x04NONE\x00\x00\x00\x11REPLICATION_SCO
>                            PE\x00\x00\x00\x010\x00\x00\x00\x0BCOMPRESSION\x00\x00\x00\x04NONE\x00\x0
>                            0\x00\x08VERSIONS\x00\x00\x00\x013\x00\x00\x00\x03TTL\x00\x00\x00\x0A2147
>                            483647\x00\x00\x00\x09BLOCKSIZE\x00\x00\x00\x0565536\x00\x00\x00\x09IN_ME
>                            MORY\x00\x00\x00\x05false\x00\x00\x00\x0ABLOCKCACHE\x00\x00\x00\x04true\x
>                            FE\xA0\xFD\xC5
>  ..
>  test1,6531790000,12797871 column=info:regioninfo, timestamp=1279787160782, value=REGION => {NAME =>
>  58624.8d5490bfc166c687657  'test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.', STAR
>  cb09203bd7d44.            TKEY => '6531790000', ENDKEY => '6551820000', ENCODED => 8d5490bfc166c687
>                            657cb09203bd7d44, TABLE => {{NAME => 'test1', FAMILIES => [{NAME => 'acti
>                            ons', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', VERSIONS => '3', C
>                            OMPRESSION => 'NONE', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMOR
>                            Y => 'false', BLOCKCACHE => 'true'}]}}
> {code}
> I think Karthik has a handle on the first part (i.e. why the RS ran into the version mismatch, and aborted opening the region). He'll add details to the JIRA. But what we aren't clear about at this stage is why the base scanner didn't kick in and try to reassign the region.
> BTW, HBase "hbck" reported this as well (which was good!):
> {code}
> Number of Tables: 5
> Number of live region servers:92
> Number of dead region servers:0
> .........
> ERROR: Region test1,6512200000,1279784117114.6466481aa931f8c1fa87622735487a72. is not served by any region server  but is listed in META to be on server null
> ERROR: Region test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44. is not served by any region server  but is listed in META to be on server null
> {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HBASE-2866) Region permanently offlined

Posted by "Karthik Ranganathan (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-2866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12891716#action_12891716 ] 

Karthik Ranganathan commented on HBASE-2866:
--------------------------------------------

One thing we can try is to change the state of the region to "CLOSED" in UNASSIGNED in zk...

Alternatively, is it possible to edit META somehow to set the region unassigned? 

> Region permanently offlined 
> ----------------------------
>
>                 Key: HBASE-2866
>                 URL: https://issues.apache.org/jira/browse/HBASE-2866
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Kannan Muthukkaruppan
>            Assignee: Karthik Ranganathan
>            Priority: Blocker
>         Attachments: master.log
>
>
> After split, master attempts to reassign a region to a region server. Occasionally, such a region can get permanently offlined.
> Master:
> ---------
> {code}
> 2010-07-22 01:26:00,914 INFO org.apache.hadoop.hbase.master.ServerManager: Processing MSG_REPORT_SPLIT_INCLUDES_DAUGHTERS: test1,6512200000,1279784117114.6466481aa931f8c1fa87622735487a72.: Daughters; test1,6512200000,1279787158624.6ead25ae677116cc88fc5420bb39d52e., test1,6531790000,1279787\
> 158624.8d5490bfc166c687657cb09203bd7d44. from test024.test.xyz.com,60020,1279780567744; 1 of 1                                                                                                                                                                                                     
> 2010-07-22 01:26:00,935 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Creating UNASSIGNED region 8d5490bfc166c687657cb09203bd7d44 in state = M2ZK_REGION_OFFLINE
> 2010-07-22 01:26:00,935 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Creating UNASSIGNED region 8d5490bfc166c687657cb09203bd7d44 in state = M2ZK_REGION_OFFLINE
> 2010-07-22 01:26:00,945 INFO org.apache.hadoop.hbase.master.RegionManager: Assigning region test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44. to test024.test.xyz.com,60020,1279780567744
> 2010-07-22 01:26:00,949 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: While updating UNASSIGNED region 8d5490bfc166c687657cb09203bd7d44 exists, state = M2ZK_REGION_OFFLINE
> 2010-07-22 01:26:00,954 DEBUG org.apache.hadoop.hbase.master.RegionManager: Created UNASSIGNED zNode test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44. in state M2ZK_REGION_OFFLINE
> {code}
> -------------------
> Region Server:
> {code}
> 2010-07-22 01:26:00,947 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN: test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.
> 2010-07-22 01:26:00,947 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN: test1,6512200000,1279787158624.6ead25ae677116cc88fc5420bb39d52e.
> 2010-07-22 01:26:00,947 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Worker: MSG_REGION_OPEN: test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.
> 2010-07-22 01:26:00,948 DEBUG org.apache.hadoop.hbase.regionserver.RSZookeeperUpdater: Updating ZNode /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44 with [RS2ZK_REGION_OPENING] expected version = 0
> 2010-07-22 01:26:00,952 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event, state: SyncConnected, type: NodeDataChanged, path: /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44
> 2010-07-22 01:26:00,974 WARN org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: <msgstorectrl001.test.xyz.com,msgstorectrl021.test.xyz.com,msgstorectrl041.test.xyz.com,msgstorectrl061.test.xyz.com,msgstorectrl081.ash2.facebook\
> .com:/hbase,test024.test.xyz.com,60020,1279780567744>Failed to write data to ZooKeeper
> org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:106)
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
>         at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1038)
>         at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.writeZNode(ZooKeeperWrapper.java:1062)
>         at org.apache.hadoop.hbase.regionserver.RSZookeeperUpdater.updateZKWithEventData(RSZookeeperUpdater.java:161)
>         at org.apache.hadoop.hbase.regionserver.RSZookeeperUpdater.startRegionOpenEvent(RSZookeeperUpdater.java:115)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer.openRegion(HRegionServer.java:1428)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer$Worker.run(HRegionServer.java:1337)
>         at java.lang.Thread.run(Thread.java:619)
> 2010-07-22 01:26:00,975 ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: Error opening test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.
> java.io.IOException: org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44
>         at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.writeZNode(ZooKeeperWrapper.java:1072)
> {code}
> Meta:
> -----
> Relevant section of META.
> Note that these are the only two entries for the problem region. The first one is the parent region (and this problem
> region is its splitB).  For the next one, note that there is no "info:server" and "info:serverstartcode" columns.
> {code}
>  test1,6512200000,12797841 column=info:splitB, timestamp=1279787160693, value=\x00\x0A6551820000\x00
>  17114.6466481aa931f8c1fa8 \x00\x00\x01)\xF9BL`@test1,6531790000,1279787158624.8d5490bfc166c687657cb
>  7622735487a72.            09203bd7d44.\x00\x0A6531790000\x00\x00\x00\x05\x05test1\x00\x00\x00\x00\x
>                            00\x02\x00\x00\x00\x07IS_ROOT\x00\x00\x00\x05false\x00\x00\x00\x07IS_META
>                            \x00\x00\x00\x05false\x00\x00\x00\x01\x08\x07actions\x00\x00\x00\x08\x00\
>                            x00\x00\x0BBLOOMFILTER\x00\x00\x00\x04NONE\x00\x00\x00\x11REPLICATION_SCO
>                            PE\x00\x00\x00\x010\x00\x00\x00\x0BCOMPRESSION\x00\x00\x00\x04NONE\x00\x0
>                            0\x00\x08VERSIONS\x00\x00\x00\x013\x00\x00\x00\x03TTL\x00\x00\x00\x0A2147
>                            483647\x00\x00\x00\x09BLOCKSIZE\x00\x00\x00\x0565536\x00\x00\x00\x09IN_ME
>                            MORY\x00\x00\x00\x05false\x00\x00\x00\x0ABLOCKCACHE\x00\x00\x00\x04true\x
>                            FE\xA0\xFD\xC5
>  ..
>  test1,6531790000,12797871 column=info:regioninfo, timestamp=1279787160782, value=REGION => {NAME =>
>  58624.8d5490bfc166c687657  'test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.', STAR
>  cb09203bd7d44.            TKEY => '6531790000', ENDKEY => '6551820000', ENCODED => 8d5490bfc166c687
>                            657cb09203bd7d44, TABLE => {{NAME => 'test1', FAMILIES => [{NAME => 'acti
>                            ons', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', VERSIONS => '3', C
>                            OMPRESSION => 'NONE', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMOR
>                            Y => 'false', BLOCKCACHE => 'true'}]}}
> {code}
> I think Karthik has a handle on the first part (i.e. why the RS ran into the version mismatch, and aborted opening the region). He'll add details to the JIRA. But what we aren't clear about at this stage is why the base scanner didn't kick in and try to reassign the region.
> BTW, HBase "hbck" reported this as well (which was good!):
> {code}
> Number of Tables: 5
> Number of live region servers:92
> Number of dead region servers:0
> .........
> ERROR: Region test1,6512200000,1279784117114.6466481aa931f8c1fa87622735487a72. is not served by any region server  but is listed in META to be on server null
> ERROR: Region test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44. is not served by any region server  but is listed in META to be on server null
> {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HBASE-2866) Region permanently offlined

Posted by "HBase Review Board (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-2866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12891818#action_12891818 ] 

HBase Review Board commented on HBASE-2866:
-------------------------------------------

Message from: "Jean-Daniel Cryans" <jd...@apache.org>

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://review.hbase.org/r/380/#review471
-----------------------------------------------------------


Some nits, I'm also trying it out (with the update=true fix)


trunk/src/main/java/org/apache/hadoop/hbase/zookeeper/ZooKeeperWrapper.java
<http://review.hbase.org/r/380/#comment1962>

    Remove all the trailing white spaces



trunk/src/main/java/org/apache/hadoop/hbase/zookeeper/ZooKeeperWrapper.java
<http://review.hbase.org/r/380/#comment1963>

    Reuse curState and newState instead



trunk/src/main/java/org/apache/hadoop/hbase/zookeeper/ZooKeeperWrapper.java
<http://review.hbase.org/r/380/#comment1961>

    Simply return when you figure that you should, then you can get rid of "update"


- Jean-Daniel





> Region permanently offlined 
> ----------------------------
>
>                 Key: HBASE-2866
>                 URL: https://issues.apache.org/jira/browse/HBASE-2866
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Kannan Muthukkaruppan
>            Assignee: Karthik Ranganathan
>            Priority: Blocker
>         Attachments: master.log
>
>
> After split, master attempts to reassign a region to a region server. Occasionally, such a region can get permanently offlined.
> Master:
> ---------
> {code}
> 2010-07-22 01:26:00,914 INFO org.apache.hadoop.hbase.master.ServerManager: Processing MSG_REPORT_SPLIT_INCLUDES_DAUGHTERS: test1,6512200000,1279784117114.6466481aa931f8c1fa87622735487a72.: Daughters; test1,6512200000,1279787158624.6ead25ae677116cc88fc5420bb39d52e., test1,6531790000,1279787\
> 158624.8d5490bfc166c687657cb09203bd7d44. from test024.test.xyz.com,60020,1279780567744; 1 of 1                                                                                                                                                                                                     
> 2010-07-22 01:26:00,935 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Creating UNASSIGNED region 8d5490bfc166c687657cb09203bd7d44 in state = M2ZK_REGION_OFFLINE
> 2010-07-22 01:26:00,935 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Creating UNASSIGNED region 8d5490bfc166c687657cb09203bd7d44 in state = M2ZK_REGION_OFFLINE
> 2010-07-22 01:26:00,945 INFO org.apache.hadoop.hbase.master.RegionManager: Assigning region test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44. to test024.test.xyz.com,60020,1279780567744
> 2010-07-22 01:26:00,949 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: While updating UNASSIGNED region 8d5490bfc166c687657cb09203bd7d44 exists, state = M2ZK_REGION_OFFLINE
> 2010-07-22 01:26:00,954 DEBUG org.apache.hadoop.hbase.master.RegionManager: Created UNASSIGNED zNode test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44. in state M2ZK_REGION_OFFLINE
> {code}
> -------------------
> Region Server:
> {code}
> 2010-07-22 01:26:00,947 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN: test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.
> 2010-07-22 01:26:00,947 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN: test1,6512200000,1279787158624.6ead25ae677116cc88fc5420bb39d52e.
> 2010-07-22 01:26:00,947 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Worker: MSG_REGION_OPEN: test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.
> 2010-07-22 01:26:00,948 DEBUG org.apache.hadoop.hbase.regionserver.RSZookeeperUpdater: Updating ZNode /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44 with [RS2ZK_REGION_OPENING] expected version = 0
> 2010-07-22 01:26:00,952 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event, state: SyncConnected, type: NodeDataChanged, path: /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44
> 2010-07-22 01:26:00,974 WARN org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: <msgstorectrl001.test.xyz.com,msgstorectrl021.test.xyz.com,msgstorectrl041.test.xyz.com,msgstorectrl061.test.xyz.com,msgstorectrl081.ash2.facebook\
> .com:/hbase,test024.test.xyz.com,60020,1279780567744>Failed to write data to ZooKeeper
> org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:106)
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
>         at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1038)
>         at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.writeZNode(ZooKeeperWrapper.java:1062)
>         at org.apache.hadoop.hbase.regionserver.RSZookeeperUpdater.updateZKWithEventData(RSZookeeperUpdater.java:161)
>         at org.apache.hadoop.hbase.regionserver.RSZookeeperUpdater.startRegionOpenEvent(RSZookeeperUpdater.java:115)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer.openRegion(HRegionServer.java:1428)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer$Worker.run(HRegionServer.java:1337)
>         at java.lang.Thread.run(Thread.java:619)
> 2010-07-22 01:26:00,975 ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: Error opening test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.
> java.io.IOException: org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44
>         at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.writeZNode(ZooKeeperWrapper.java:1072)
> {code}
> Meta:
> -----
> Relevant section of META.
> Note that these are the only two entries for the problem region. The first one is the parent region (and this problem
> region is its splitB).  For the next one, note that there is no "info:server" and "info:serverstartcode" columns.
> {code}
>  test1,6512200000,12797841 column=info:splitB, timestamp=1279787160693, value=\x00\x0A6551820000\x00
>  17114.6466481aa931f8c1fa8 \x00\x00\x01)\xF9BL`@test1,6531790000,1279787158624.8d5490bfc166c687657cb
>  7622735487a72.            09203bd7d44.\x00\x0A6531790000\x00\x00\x00\x05\x05test1\x00\x00\x00\x00\x
>                            00\x02\x00\x00\x00\x07IS_ROOT\x00\x00\x00\x05false\x00\x00\x00\x07IS_META
>                            \x00\x00\x00\x05false\x00\x00\x00\x01\x08\x07actions\x00\x00\x00\x08\x00\
>                            x00\x00\x0BBLOOMFILTER\x00\x00\x00\x04NONE\x00\x00\x00\x11REPLICATION_SCO
>                            PE\x00\x00\x00\x010\x00\x00\x00\x0BCOMPRESSION\x00\x00\x00\x04NONE\x00\x0
>                            0\x00\x08VERSIONS\x00\x00\x00\x013\x00\x00\x00\x03TTL\x00\x00\x00\x0A2147
>                            483647\x00\x00\x00\x09BLOCKSIZE\x00\x00\x00\x0565536\x00\x00\x00\x09IN_ME
>                            MORY\x00\x00\x00\x05false\x00\x00\x00\x0ABLOCKCACHE\x00\x00\x00\x04true\x
>                            FE\xA0\xFD\xC5
>  ..
>  test1,6531790000,12797871 column=info:regioninfo, timestamp=1279787160782, value=REGION => {NAME =>
>  58624.8d5490bfc166c687657  'test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.', STAR
>  cb09203bd7d44.            TKEY => '6531790000', ENDKEY => '6551820000', ENCODED => 8d5490bfc166c687
>                            657cb09203bd7d44, TABLE => {{NAME => 'test1', FAMILIES => [{NAME => 'acti
>                            ons', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', VERSIONS => '3', C
>                            OMPRESSION => 'NONE', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMOR
>                            Y => 'false', BLOCKCACHE => 'true'}]}}
> {code}
> I think Karthik has a handle on the first part (i.e. why the RS ran into the version mismatch, and aborted opening the region). He'll add details to the JIRA. But what we aren't clear about at this stage is why the base scanner didn't kick in and try to reassign the region.
> BTW, HBase "hbck" reported this as well (which was good!):
> {code}
> Number of Tables: 5
> Number of live region servers:92
> Number of dead region servers:0
> .........
> ERROR: Region test1,6512200000,1279784117114.6466481aa931f8c1fa87622735487a72. is not served by any region server  but is listed in META to be on server null
> ERROR: Region test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44. is not served by any region server  but is listed in META to be on server null
> {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HBASE-2866) Region permanently offlined

Posted by "HBase Review Board (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-2866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12891806#action_12891806 ] 

HBase Review Board commented on HBASE-2866:
-------------------------------------------

Message from: "Karthik Ranganathan" <ka...@gmail.com>

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://review.hbase.org/r/380/
-----------------------------------------------------------

(Updated 2010-07-23 14:26:01.718168)


Review request for hbase, stack and Kannan Muthukkaruppan.


Changes
-------

Adding hbase group


Summary
-------

Region permanently offlined - if the ZNode is already in the target state, do not update it again.


This addresses bug HBASE-2866.
    http://issues.apache.org/jira/browse/HBASE-2866


Diffs
-----

  trunk/src/main/java/org/apache/hadoop/hbase/master/ZKUnassignedWatcher.java 967128 
  trunk/src/main/java/org/apache/hadoop/hbase/zookeeper/ZooKeeperWrapper.java 967128 

Diff: http://review.hbase.org/r/380/diff


Testing
-------

Ran unit tests, went through fine (except TestRowAtomicity, which is known to be failing).


Thanks,

Karthik




> Region permanently offlined 
> ----------------------------
>
>                 Key: HBASE-2866
>                 URL: https://issues.apache.org/jira/browse/HBASE-2866
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Kannan Muthukkaruppan
>            Assignee: Karthik Ranganathan
>            Priority: Blocker
>         Attachments: master.log
>
>
> After split, master attempts to reassign a region to a region server. Occasionally, such a region can get permanently offlined.
> Master:
> ---------
> {code}
> 2010-07-22 01:26:00,914 INFO org.apache.hadoop.hbase.master.ServerManager: Processing MSG_REPORT_SPLIT_INCLUDES_DAUGHTERS: test1,6512200000,1279784117114.6466481aa931f8c1fa87622735487a72.: Daughters; test1,6512200000,1279787158624.6ead25ae677116cc88fc5420bb39d52e., test1,6531790000,1279787\
> 158624.8d5490bfc166c687657cb09203bd7d44. from test024.test.xyz.com,60020,1279780567744; 1 of 1                                                                                                                                                                                                     
> 2010-07-22 01:26:00,935 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Creating UNASSIGNED region 8d5490bfc166c687657cb09203bd7d44 in state = M2ZK_REGION_OFFLINE
> 2010-07-22 01:26:00,935 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Creating UNASSIGNED region 8d5490bfc166c687657cb09203bd7d44 in state = M2ZK_REGION_OFFLINE
> 2010-07-22 01:26:00,945 INFO org.apache.hadoop.hbase.master.RegionManager: Assigning region test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44. to test024.test.xyz.com,60020,1279780567744
> 2010-07-22 01:26:00,949 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: While updating UNASSIGNED region 8d5490bfc166c687657cb09203bd7d44 exists, state = M2ZK_REGION_OFFLINE
> 2010-07-22 01:26:00,954 DEBUG org.apache.hadoop.hbase.master.RegionManager: Created UNASSIGNED zNode test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44. in state M2ZK_REGION_OFFLINE
> {code}
> -------------------
> Region Server:
> {code}
> 2010-07-22 01:26:00,947 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN: test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.
> 2010-07-22 01:26:00,947 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN: test1,6512200000,1279787158624.6ead25ae677116cc88fc5420bb39d52e.
> 2010-07-22 01:26:00,947 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Worker: MSG_REGION_OPEN: test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.
> 2010-07-22 01:26:00,948 DEBUG org.apache.hadoop.hbase.regionserver.RSZookeeperUpdater: Updating ZNode /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44 with [RS2ZK_REGION_OPENING] expected version = 0
> 2010-07-22 01:26:00,952 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event, state: SyncConnected, type: NodeDataChanged, path: /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44
> 2010-07-22 01:26:00,974 WARN org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: <msgstorectrl001.test.xyz.com,msgstorectrl021.test.xyz.com,msgstorectrl041.test.xyz.com,msgstorectrl061.test.xyz.com,msgstorectrl081.ash2.facebook\
> .com:/hbase,test024.test.xyz.com,60020,1279780567744>Failed to write data to ZooKeeper
> org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:106)
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
>         at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1038)
>         at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.writeZNode(ZooKeeperWrapper.java:1062)
>         at org.apache.hadoop.hbase.regionserver.RSZookeeperUpdater.updateZKWithEventData(RSZookeeperUpdater.java:161)
>         at org.apache.hadoop.hbase.regionserver.RSZookeeperUpdater.startRegionOpenEvent(RSZookeeperUpdater.java:115)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer.openRegion(HRegionServer.java:1428)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer$Worker.run(HRegionServer.java:1337)
>         at java.lang.Thread.run(Thread.java:619)
> 2010-07-22 01:26:00,975 ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: Error opening test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.
> java.io.IOException: org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44
>         at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.writeZNode(ZooKeeperWrapper.java:1072)
> {code}
> Meta:
> -----
> Relevant section of META.
> Note that these are the only two entries for the problem region. The first one is the parent region (and this problem
> region is its splitB).  For the next one, note that there is no "info:server" and "info:serverstartcode" columns.
> {code}
>  test1,6512200000,12797841 column=info:splitB, timestamp=1279787160693, value=\x00\x0A6551820000\x00
>  17114.6466481aa931f8c1fa8 \x00\x00\x01)\xF9BL`@test1,6531790000,1279787158624.8d5490bfc166c687657cb
>  7622735487a72.            09203bd7d44.\x00\x0A6531790000\x00\x00\x00\x05\x05test1\x00\x00\x00\x00\x
>                            00\x02\x00\x00\x00\x07IS_ROOT\x00\x00\x00\x05false\x00\x00\x00\x07IS_META
>                            \x00\x00\x00\x05false\x00\x00\x00\x01\x08\x07actions\x00\x00\x00\x08\x00\
>                            x00\x00\x0BBLOOMFILTER\x00\x00\x00\x04NONE\x00\x00\x00\x11REPLICATION_SCO
>                            PE\x00\x00\x00\x010\x00\x00\x00\x0BCOMPRESSION\x00\x00\x00\x04NONE\x00\x0
>                            0\x00\x08VERSIONS\x00\x00\x00\x013\x00\x00\x00\x03TTL\x00\x00\x00\x0A2147
>                            483647\x00\x00\x00\x09BLOCKSIZE\x00\x00\x00\x0565536\x00\x00\x00\x09IN_ME
>                            MORY\x00\x00\x00\x05false\x00\x00\x00\x0ABLOCKCACHE\x00\x00\x00\x04true\x
>                            FE\xA0\xFD\xC5
>  ..
>  test1,6531790000,12797871 column=info:regioninfo, timestamp=1279787160782, value=REGION => {NAME =>
>  58624.8d5490bfc166c687657  'test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.', STAR
>  cb09203bd7d44.            TKEY => '6531790000', ENDKEY => '6551820000', ENCODED => 8d5490bfc166c687
>                            657cb09203bd7d44, TABLE => {{NAME => 'test1', FAMILIES => [{NAME => 'acti
>                            ons', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', VERSIONS => '3', C
>                            OMPRESSION => 'NONE', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMOR
>                            Y => 'false', BLOCKCACHE => 'true'}]}}
> {code}
> I think Karthik has a handle on the first part (i.e. why the RS ran into the version mismatch, and aborted opening the region). He'll add details to the JIRA. But what we aren't clear about at this stage is why the base scanner didn't kick in and try to reassign the region.
> BTW, HBase "hbck" reported this as well (which was good!):
> {code}
> Number of Tables: 5
> Number of live region servers:92
> Number of dead region servers:0
> .........
> ERROR: Region test1,6512200000,1279784117114.6466481aa931f8c1fa87622735487a72. is not served by any region server  but is listed in META to be on server null
> ERROR: Region test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44. is not served by any region server  but is listed in META to be on server null
> {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HBASE-2866) Region permanently offlined

Posted by "stack (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-2866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12891766#action_12891766 ] 

stack commented on HBASE-2866:
------------------------------

@Kannan

bq. ...So need to look into what was happening with the other region...

Yes. Would help improve hbck tool.

bq. ...And, surprisingly though, ZK UNASSIGNED node shows 4 entries....

Whats up w/ that?

So, jgray and karthik, the current state of code in master is that its in transition still.. we have not yet hit the end point -- that there is still a bunch of change coming.  Right?  Can we have a fixup for this issue for now?  I'd like to roll a new 0.89.x but w/ a fix for this (sounds like you have it Karthik).

@Kannan Restart master is how its addressed currently.  Want to add a little tool to UI?

@Karthik A region is unassigned in .META. if it does not have a server and startcode as this one does.

> Region permanently offlined 
> ----------------------------
>
>                 Key: HBASE-2866
>                 URL: https://issues.apache.org/jira/browse/HBASE-2866
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Kannan Muthukkaruppan
>            Assignee: Karthik Ranganathan
>            Priority: Blocker
>         Attachments: master.log
>
>
> After split, master attempts to reassign a region to a region server. Occasionally, such a region can get permanently offlined.
> Master:
> ---------
> {code}
> 2010-07-22 01:26:00,914 INFO org.apache.hadoop.hbase.master.ServerManager: Processing MSG_REPORT_SPLIT_INCLUDES_DAUGHTERS: test1,6512200000,1279784117114.6466481aa931f8c1fa87622735487a72.: Daughters; test1,6512200000,1279787158624.6ead25ae677116cc88fc5420bb39d52e., test1,6531790000,1279787\
> 158624.8d5490bfc166c687657cb09203bd7d44. from test024.test.xyz.com,60020,1279780567744; 1 of 1                                                                                                                                                                                                     
> 2010-07-22 01:26:00,935 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Creating UNASSIGNED region 8d5490bfc166c687657cb09203bd7d44 in state = M2ZK_REGION_OFFLINE
> 2010-07-22 01:26:00,935 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Creating UNASSIGNED region 8d5490bfc166c687657cb09203bd7d44 in state = M2ZK_REGION_OFFLINE
> 2010-07-22 01:26:00,945 INFO org.apache.hadoop.hbase.master.RegionManager: Assigning region test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44. to test024.test.xyz.com,60020,1279780567744
> 2010-07-22 01:26:00,949 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: While updating UNASSIGNED region 8d5490bfc166c687657cb09203bd7d44 exists, state = M2ZK_REGION_OFFLINE
> 2010-07-22 01:26:00,954 DEBUG org.apache.hadoop.hbase.master.RegionManager: Created UNASSIGNED zNode test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44. in state M2ZK_REGION_OFFLINE
> {code}
> -------------------
> Region Server:
> {code}
> 2010-07-22 01:26:00,947 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN: test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.
> 2010-07-22 01:26:00,947 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN: test1,6512200000,1279787158624.6ead25ae677116cc88fc5420bb39d52e.
> 2010-07-22 01:26:00,947 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Worker: MSG_REGION_OPEN: test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.
> 2010-07-22 01:26:00,948 DEBUG org.apache.hadoop.hbase.regionserver.RSZookeeperUpdater: Updating ZNode /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44 with [RS2ZK_REGION_OPENING] expected version = 0
> 2010-07-22 01:26:00,952 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event, state: SyncConnected, type: NodeDataChanged, path: /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44
> 2010-07-22 01:26:00,974 WARN org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: <msgstorectrl001.test.xyz.com,msgstorectrl021.test.xyz.com,msgstorectrl041.test.xyz.com,msgstorectrl061.test.xyz.com,msgstorectrl081.ash2.facebook\
> .com:/hbase,test024.test.xyz.com,60020,1279780567744>Failed to write data to ZooKeeper
> org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:106)
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
>         at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1038)
>         at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.writeZNode(ZooKeeperWrapper.java:1062)
>         at org.apache.hadoop.hbase.regionserver.RSZookeeperUpdater.updateZKWithEventData(RSZookeeperUpdater.java:161)
>         at org.apache.hadoop.hbase.regionserver.RSZookeeperUpdater.startRegionOpenEvent(RSZookeeperUpdater.java:115)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer.openRegion(HRegionServer.java:1428)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer$Worker.run(HRegionServer.java:1337)
>         at java.lang.Thread.run(Thread.java:619)
> 2010-07-22 01:26:00,975 ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: Error opening test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.
> java.io.IOException: org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44
>         at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.writeZNode(ZooKeeperWrapper.java:1072)
> {code}
> Meta:
> -----
> Relevant section of META.
> Note that these are the only two entries for the problem region. The first one is the parent region (and this problem
> region is its splitB).  For the next one, note that there is no "info:server" and "info:serverstartcode" columns.
> {code}
>  test1,6512200000,12797841 column=info:splitB, timestamp=1279787160693, value=\x00\x0A6551820000\x00
>  17114.6466481aa931f8c1fa8 \x00\x00\x01)\xF9BL`@test1,6531790000,1279787158624.8d5490bfc166c687657cb
>  7622735487a72.            09203bd7d44.\x00\x0A6531790000\x00\x00\x00\x05\x05test1\x00\x00\x00\x00\x
>                            00\x02\x00\x00\x00\x07IS_ROOT\x00\x00\x00\x05false\x00\x00\x00\x07IS_META
>                            \x00\x00\x00\x05false\x00\x00\x00\x01\x08\x07actions\x00\x00\x00\x08\x00\
>                            x00\x00\x0BBLOOMFILTER\x00\x00\x00\x04NONE\x00\x00\x00\x11REPLICATION_SCO
>                            PE\x00\x00\x00\x010\x00\x00\x00\x0BCOMPRESSION\x00\x00\x00\x04NONE\x00\x0
>                            0\x00\x08VERSIONS\x00\x00\x00\x013\x00\x00\x00\x03TTL\x00\x00\x00\x0A2147
>                            483647\x00\x00\x00\x09BLOCKSIZE\x00\x00\x00\x0565536\x00\x00\x00\x09IN_ME
>                            MORY\x00\x00\x00\x05false\x00\x00\x00\x0ABLOCKCACHE\x00\x00\x00\x04true\x
>                            FE\xA0\xFD\xC5
>  ..
>  test1,6531790000,12797871 column=info:regioninfo, timestamp=1279787160782, value=REGION => {NAME =>
>  58624.8d5490bfc166c687657  'test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.', STAR
>  cb09203bd7d44.            TKEY => '6531790000', ENDKEY => '6551820000', ENCODED => 8d5490bfc166c687
>                            657cb09203bd7d44, TABLE => {{NAME => 'test1', FAMILIES => [{NAME => 'acti
>                            ons', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', VERSIONS => '3', C
>                            OMPRESSION => 'NONE', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMOR
>                            Y => 'false', BLOCKCACHE => 'true'}]}}
> {code}
> I think Karthik has a handle on the first part (i.e. why the RS ran into the version mismatch, and aborted opening the region). He'll add details to the JIRA. But what we aren't clear about at this stage is why the base scanner didn't kick in and try to reassign the region.
> BTW, HBase "hbck" reported this as well (which was good!):
> {code}
> Number of Tables: 5
> Number of live region servers:92
> Number of dead region servers:0
> .........
> ERROR: Region test1,6512200000,1279784117114.6466481aa931f8c1fa87622735487a72. is not served by any region server  but is listed in META to be on server null
> ERROR: Region test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44. is not served by any region server  but is listed in META to be on server null
> {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HBASE-2866) Region permanently offlined

Posted by "Karthik Ranganathan (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-2866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12891804#action_12891804 ] 

Karthik Ranganathan commented on HBASE-2866:
--------------------------------------------


Stack - just uploaded a review at http://review.hbase.org/r/380/

> Region permanently offlined 
> ----------------------------
>
>                 Key: HBASE-2866
>                 URL: https://issues.apache.org/jira/browse/HBASE-2866
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Kannan Muthukkaruppan
>            Assignee: Karthik Ranganathan
>            Priority: Blocker
>         Attachments: master.log
>
>
> After split, master attempts to reassign a region to a region server. Occasionally, such a region can get permanently offlined.
> Master:
> ---------
> {code}
> 2010-07-22 01:26:00,914 INFO org.apache.hadoop.hbase.master.ServerManager: Processing MSG_REPORT_SPLIT_INCLUDES_DAUGHTERS: test1,6512200000,1279784117114.6466481aa931f8c1fa87622735487a72.: Daughters; test1,6512200000,1279787158624.6ead25ae677116cc88fc5420bb39d52e., test1,6531790000,1279787\
> 158624.8d5490bfc166c687657cb09203bd7d44. from test024.test.xyz.com,60020,1279780567744; 1 of 1                                                                                                                                                                                                     
> 2010-07-22 01:26:00,935 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Creating UNASSIGNED region 8d5490bfc166c687657cb09203bd7d44 in state = M2ZK_REGION_OFFLINE
> 2010-07-22 01:26:00,935 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Creating UNASSIGNED region 8d5490bfc166c687657cb09203bd7d44 in state = M2ZK_REGION_OFFLINE
> 2010-07-22 01:26:00,945 INFO org.apache.hadoop.hbase.master.RegionManager: Assigning region test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44. to test024.test.xyz.com,60020,1279780567744
> 2010-07-22 01:26:00,949 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: While updating UNASSIGNED region 8d5490bfc166c687657cb09203bd7d44 exists, state = M2ZK_REGION_OFFLINE
> 2010-07-22 01:26:00,954 DEBUG org.apache.hadoop.hbase.master.RegionManager: Created UNASSIGNED zNode test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44. in state M2ZK_REGION_OFFLINE
> {code}
> -------------------
> Region Server:
> {code}
> 2010-07-22 01:26:00,947 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN: test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.
> 2010-07-22 01:26:00,947 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN: test1,6512200000,1279787158624.6ead25ae677116cc88fc5420bb39d52e.
> 2010-07-22 01:26:00,947 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Worker: MSG_REGION_OPEN: test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.
> 2010-07-22 01:26:00,948 DEBUG org.apache.hadoop.hbase.regionserver.RSZookeeperUpdater: Updating ZNode /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44 with [RS2ZK_REGION_OPENING] expected version = 0
> 2010-07-22 01:26:00,952 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event, state: SyncConnected, type: NodeDataChanged, path: /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44
> 2010-07-22 01:26:00,974 WARN org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: <msgstorectrl001.test.xyz.com,msgstorectrl021.test.xyz.com,msgstorectrl041.test.xyz.com,msgstorectrl061.test.xyz.com,msgstorectrl081.ash2.facebook\
> .com:/hbase,test024.test.xyz.com,60020,1279780567744>Failed to write data to ZooKeeper
> org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:106)
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
>         at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1038)
>         at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.writeZNode(ZooKeeperWrapper.java:1062)
>         at org.apache.hadoop.hbase.regionserver.RSZookeeperUpdater.updateZKWithEventData(RSZookeeperUpdater.java:161)
>         at org.apache.hadoop.hbase.regionserver.RSZookeeperUpdater.startRegionOpenEvent(RSZookeeperUpdater.java:115)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer.openRegion(HRegionServer.java:1428)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer$Worker.run(HRegionServer.java:1337)
>         at java.lang.Thread.run(Thread.java:619)
> 2010-07-22 01:26:00,975 ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: Error opening test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.
> java.io.IOException: org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44
>         at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.writeZNode(ZooKeeperWrapper.java:1072)
> {code}
> Meta:
> -----
> Relevant section of META.
> Note that these are the only two entries for the problem region. The first one is the parent region (and this problem
> region is its splitB).  For the next one, note that there is no "info:server" and "info:serverstartcode" columns.
> {code}
>  test1,6512200000,12797841 column=info:splitB, timestamp=1279787160693, value=\x00\x0A6551820000\x00
>  17114.6466481aa931f8c1fa8 \x00\x00\x01)\xF9BL`@test1,6531790000,1279787158624.8d5490bfc166c687657cb
>  7622735487a72.            09203bd7d44.\x00\x0A6531790000\x00\x00\x00\x05\x05test1\x00\x00\x00\x00\x
>                            00\x02\x00\x00\x00\x07IS_ROOT\x00\x00\x00\x05false\x00\x00\x00\x07IS_META
>                            \x00\x00\x00\x05false\x00\x00\x00\x01\x08\x07actions\x00\x00\x00\x08\x00\
>                            x00\x00\x0BBLOOMFILTER\x00\x00\x00\x04NONE\x00\x00\x00\x11REPLICATION_SCO
>                            PE\x00\x00\x00\x010\x00\x00\x00\x0BCOMPRESSION\x00\x00\x00\x04NONE\x00\x0
>                            0\x00\x08VERSIONS\x00\x00\x00\x013\x00\x00\x00\x03TTL\x00\x00\x00\x0A2147
>                            483647\x00\x00\x00\x09BLOCKSIZE\x00\x00\x00\x0565536\x00\x00\x00\x09IN_ME
>                            MORY\x00\x00\x00\x05false\x00\x00\x00\x0ABLOCKCACHE\x00\x00\x00\x04true\x
>                            FE\xA0\xFD\xC5
>  ..
>  test1,6531790000,12797871 column=info:regioninfo, timestamp=1279787160782, value=REGION => {NAME =>
>  58624.8d5490bfc166c687657  'test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.', STAR
>  cb09203bd7d44.            TKEY => '6531790000', ENDKEY => '6551820000', ENCODED => 8d5490bfc166c687
>                            657cb09203bd7d44, TABLE => {{NAME => 'test1', FAMILIES => [{NAME => 'acti
>                            ons', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', VERSIONS => '3', C
>                            OMPRESSION => 'NONE', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMOR
>                            Y => 'false', BLOCKCACHE => 'true'}]}}
> {code}
> I think Karthik has a handle on the first part (i.e. why the RS ran into the version mismatch, and aborted opening the region). He'll add details to the JIRA. But what we aren't clear about at this stage is why the base scanner didn't kick in and try to reassign the region.
> BTW, HBase "hbck" reported this as well (which was good!):
> {code}
> Number of Tables: 5
> Number of live region servers:92
> Number of dead region servers:0
> .........
> ERROR: Region test1,6512200000,1279784117114.6466481aa931f8c1fa87622735487a72. is not served by any region server  but is listed in META to be on server null
> ERROR: Region test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44. is not served by any region server  but is listed in META to be on server null
> {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Resolved: (HBASE-2866) Region permanently offlined

Posted by "Jean-Daniel Cryans (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-2866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jean-Daniel Cryans resolved HBASE-2866.
---------------------------------------

     Hadoop Flags: [Reviewed]
    Fix Version/s: 0.90.0
       Resolution: Fixed

Committed to trunk, thanks Karthik!

> Region permanently offlined 
> ----------------------------
>
>                 Key: HBASE-2866
>                 URL: https://issues.apache.org/jira/browse/HBASE-2866
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Kannan Muthukkaruppan
>            Assignee: Karthik Ranganathan
>            Priority: Blocker
>             Fix For: 0.90.0
>
>         Attachments: master.log
>
>
> After split, master attempts to reassign a region to a region server. Occasionally, such a region can get permanently offlined.
> Master:
> ---------
> {code}
> 2010-07-22 01:26:00,914 INFO org.apache.hadoop.hbase.master.ServerManager: Processing MSG_REPORT_SPLIT_INCLUDES_DAUGHTERS: test1,6512200000,1279784117114.6466481aa931f8c1fa87622735487a72.: Daughters; test1,6512200000,1279787158624.6ead25ae677116cc88fc5420bb39d52e., test1,6531790000,1279787\
> 158624.8d5490bfc166c687657cb09203bd7d44. from test024.test.xyz.com,60020,1279780567744; 1 of 1                                                                                                                                                                                                     
> 2010-07-22 01:26:00,935 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Creating UNASSIGNED region 8d5490bfc166c687657cb09203bd7d44 in state = M2ZK_REGION_OFFLINE
> 2010-07-22 01:26:00,935 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Creating UNASSIGNED region 8d5490bfc166c687657cb09203bd7d44 in state = M2ZK_REGION_OFFLINE
> 2010-07-22 01:26:00,945 INFO org.apache.hadoop.hbase.master.RegionManager: Assigning region test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44. to test024.test.xyz.com,60020,1279780567744
> 2010-07-22 01:26:00,949 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: While updating UNASSIGNED region 8d5490bfc166c687657cb09203bd7d44 exists, state = M2ZK_REGION_OFFLINE
> 2010-07-22 01:26:00,954 DEBUG org.apache.hadoop.hbase.master.RegionManager: Created UNASSIGNED zNode test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44. in state M2ZK_REGION_OFFLINE
> {code}
> -------------------
> Region Server:
> {code}
> 2010-07-22 01:26:00,947 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN: test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.
> 2010-07-22 01:26:00,947 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN: test1,6512200000,1279787158624.6ead25ae677116cc88fc5420bb39d52e.
> 2010-07-22 01:26:00,947 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Worker: MSG_REGION_OPEN: test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.
> 2010-07-22 01:26:00,948 DEBUG org.apache.hadoop.hbase.regionserver.RSZookeeperUpdater: Updating ZNode /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44 with [RS2ZK_REGION_OPENING] expected version = 0
> 2010-07-22 01:26:00,952 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event, state: SyncConnected, type: NodeDataChanged, path: /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44
> 2010-07-22 01:26:00,974 WARN org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: <msgstorectrl001.test.xyz.com,msgstorectrl021.test.xyz.com,msgstorectrl041.test.xyz.com,msgstorectrl061.test.xyz.com,msgstorectrl081.ash2.facebook\
> .com:/hbase,test024.test.xyz.com,60020,1279780567744>Failed to write data to ZooKeeper
> org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:106)
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
>         at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1038)
>         at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.writeZNode(ZooKeeperWrapper.java:1062)
>         at org.apache.hadoop.hbase.regionserver.RSZookeeperUpdater.updateZKWithEventData(RSZookeeperUpdater.java:161)
>         at org.apache.hadoop.hbase.regionserver.RSZookeeperUpdater.startRegionOpenEvent(RSZookeeperUpdater.java:115)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer.openRegion(HRegionServer.java:1428)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer$Worker.run(HRegionServer.java:1337)
>         at java.lang.Thread.run(Thread.java:619)
> 2010-07-22 01:26:00,975 ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: Error opening test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.
> java.io.IOException: org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44
>         at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.writeZNode(ZooKeeperWrapper.java:1072)
> {code}
> Meta:
> -----
> Relevant section of META.
> Note that these are the only two entries for the problem region. The first one is the parent region (and this problem
> region is its splitB).  For the next one, note that there is no "info:server" and "info:serverstartcode" columns.
> {code}
>  test1,6512200000,12797841 column=info:splitB, timestamp=1279787160693, value=\x00\x0A6551820000\x00
>  17114.6466481aa931f8c1fa8 \x00\x00\x01)\xF9BL`@test1,6531790000,1279787158624.8d5490bfc166c687657cb
>  7622735487a72.            09203bd7d44.\x00\x0A6531790000\x00\x00\x00\x05\x05test1\x00\x00\x00\x00\x
>                            00\x02\x00\x00\x00\x07IS_ROOT\x00\x00\x00\x05false\x00\x00\x00\x07IS_META
>                            \x00\x00\x00\x05false\x00\x00\x00\x01\x08\x07actions\x00\x00\x00\x08\x00\
>                            x00\x00\x0BBLOOMFILTER\x00\x00\x00\x04NONE\x00\x00\x00\x11REPLICATION_SCO
>                            PE\x00\x00\x00\x010\x00\x00\x00\x0BCOMPRESSION\x00\x00\x00\x04NONE\x00\x0
>                            0\x00\x08VERSIONS\x00\x00\x00\x013\x00\x00\x00\x03TTL\x00\x00\x00\x0A2147
>                            483647\x00\x00\x00\x09BLOCKSIZE\x00\x00\x00\x0565536\x00\x00\x00\x09IN_ME
>                            MORY\x00\x00\x00\x05false\x00\x00\x00\x0ABLOCKCACHE\x00\x00\x00\x04true\x
>                            FE\xA0\xFD\xC5
>  ..
>  test1,6531790000,12797871 column=info:regioninfo, timestamp=1279787160782, value=REGION => {NAME =>
>  58624.8d5490bfc166c687657  'test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.', STAR
>  cb09203bd7d44.            TKEY => '6531790000', ENDKEY => '6551820000', ENCODED => 8d5490bfc166c687
>                            657cb09203bd7d44, TABLE => {{NAME => 'test1', FAMILIES => [{NAME => 'acti
>                            ons', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', VERSIONS => '3', C
>                            OMPRESSION => 'NONE', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMOR
>                            Y => 'false', BLOCKCACHE => 'true'}]}}
> {code}
> I think Karthik has a handle on the first part (i.e. why the RS ran into the version mismatch, and aborted opening the region). He'll add details to the JIRA. But what we aren't clear about at this stage is why the base scanner didn't kick in and try to reassign the region.
> BTW, HBase "hbck" reported this as well (which was good!):
> {code}
> Number of Tables: 5
> Number of live region servers:92
> Number of dead region servers:0
> .........
> ERROR: Region test1,6512200000,1279784117114.6466481aa931f8c1fa87622735487a72. is not served by any region server  but is listed in META to be on server null
> ERROR: Region test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44. is not served by any region server  but is listed in META to be on server null
> {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HBASE-2866) Region permanently offlined

Posted by "HBase Review Board (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-2866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12891832#action_12891832 ] 

HBase Review Board commented on HBASE-2866:
-------------------------------------------

Message from: "Jean-Daniel Cryans" <jd...@apache.org>

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://review.hbase.org/r/380/#review473
-----------------------------------------------------------

Ship it!


+1 LGTM and TestAdmin passes on my machine without flinching

- Jean-Daniel





> Region permanently offlined 
> ----------------------------
>
>                 Key: HBASE-2866
>                 URL: https://issues.apache.org/jira/browse/HBASE-2866
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Kannan Muthukkaruppan
>            Assignee: Karthik Ranganathan
>            Priority: Blocker
>         Attachments: master.log
>
>
> After split, master attempts to reassign a region to a region server. Occasionally, such a region can get permanently offlined.
> Master:
> ---------
> {code}
> 2010-07-22 01:26:00,914 INFO org.apache.hadoop.hbase.master.ServerManager: Processing MSG_REPORT_SPLIT_INCLUDES_DAUGHTERS: test1,6512200000,1279784117114.6466481aa931f8c1fa87622735487a72.: Daughters; test1,6512200000,1279787158624.6ead25ae677116cc88fc5420bb39d52e., test1,6531790000,1279787\
> 158624.8d5490bfc166c687657cb09203bd7d44. from test024.test.xyz.com,60020,1279780567744; 1 of 1                                                                                                                                                                                                     
> 2010-07-22 01:26:00,935 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Creating UNASSIGNED region 8d5490bfc166c687657cb09203bd7d44 in state = M2ZK_REGION_OFFLINE
> 2010-07-22 01:26:00,935 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Creating UNASSIGNED region 8d5490bfc166c687657cb09203bd7d44 in state = M2ZK_REGION_OFFLINE
> 2010-07-22 01:26:00,945 INFO org.apache.hadoop.hbase.master.RegionManager: Assigning region test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44. to test024.test.xyz.com,60020,1279780567744
> 2010-07-22 01:26:00,949 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: While updating UNASSIGNED region 8d5490bfc166c687657cb09203bd7d44 exists, state = M2ZK_REGION_OFFLINE
> 2010-07-22 01:26:00,954 DEBUG org.apache.hadoop.hbase.master.RegionManager: Created UNASSIGNED zNode test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44. in state M2ZK_REGION_OFFLINE
> {code}
> -------------------
> Region Server:
> {code}
> 2010-07-22 01:26:00,947 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN: test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.
> 2010-07-22 01:26:00,947 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN: test1,6512200000,1279787158624.6ead25ae677116cc88fc5420bb39d52e.
> 2010-07-22 01:26:00,947 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Worker: MSG_REGION_OPEN: test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.
> 2010-07-22 01:26:00,948 DEBUG org.apache.hadoop.hbase.regionserver.RSZookeeperUpdater: Updating ZNode /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44 with [RS2ZK_REGION_OPENING] expected version = 0
> 2010-07-22 01:26:00,952 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event, state: SyncConnected, type: NodeDataChanged, path: /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44
> 2010-07-22 01:26:00,974 WARN org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: <msgstorectrl001.test.xyz.com,msgstorectrl021.test.xyz.com,msgstorectrl041.test.xyz.com,msgstorectrl061.test.xyz.com,msgstorectrl081.ash2.facebook\
> .com:/hbase,test024.test.xyz.com,60020,1279780567744>Failed to write data to ZooKeeper
> org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:106)
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
>         at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1038)
>         at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.writeZNode(ZooKeeperWrapper.java:1062)
>         at org.apache.hadoop.hbase.regionserver.RSZookeeperUpdater.updateZKWithEventData(RSZookeeperUpdater.java:161)
>         at org.apache.hadoop.hbase.regionserver.RSZookeeperUpdater.startRegionOpenEvent(RSZookeeperUpdater.java:115)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer.openRegion(HRegionServer.java:1428)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer$Worker.run(HRegionServer.java:1337)
>         at java.lang.Thread.run(Thread.java:619)
> 2010-07-22 01:26:00,975 ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: Error opening test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.
> java.io.IOException: org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44
>         at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.writeZNode(ZooKeeperWrapper.java:1072)
> {code}
> Meta:
> -----
> Relevant section of META.
> Note that these are the only two entries for the problem region. The first one is the parent region (and this problem
> region is its splitB).  For the next one, note that there is no "info:server" and "info:serverstartcode" columns.
> {code}
>  test1,6512200000,12797841 column=info:splitB, timestamp=1279787160693, value=\x00\x0A6551820000\x00
>  17114.6466481aa931f8c1fa8 \x00\x00\x01)\xF9BL`@test1,6531790000,1279787158624.8d5490bfc166c687657cb
>  7622735487a72.            09203bd7d44.\x00\x0A6531790000\x00\x00\x00\x05\x05test1\x00\x00\x00\x00\x
>                            00\x02\x00\x00\x00\x07IS_ROOT\x00\x00\x00\x05false\x00\x00\x00\x07IS_META
>                            \x00\x00\x00\x05false\x00\x00\x00\x01\x08\x07actions\x00\x00\x00\x08\x00\
>                            x00\x00\x0BBLOOMFILTER\x00\x00\x00\x04NONE\x00\x00\x00\x11REPLICATION_SCO
>                            PE\x00\x00\x00\x010\x00\x00\x00\x0BCOMPRESSION\x00\x00\x00\x04NONE\x00\x0
>                            0\x00\x08VERSIONS\x00\x00\x00\x013\x00\x00\x00\x03TTL\x00\x00\x00\x0A2147
>                            483647\x00\x00\x00\x09BLOCKSIZE\x00\x00\x00\x0565536\x00\x00\x00\x09IN_ME
>                            MORY\x00\x00\x00\x05false\x00\x00\x00\x0ABLOCKCACHE\x00\x00\x00\x04true\x
>                            FE\xA0\xFD\xC5
>  ..
>  test1,6531790000,12797871 column=info:regioninfo, timestamp=1279787160782, value=REGION => {NAME =>
>  58624.8d5490bfc166c687657  'test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.', STAR
>  cb09203bd7d44.            TKEY => '6531790000', ENDKEY => '6551820000', ENCODED => 8d5490bfc166c687
>                            657cb09203bd7d44, TABLE => {{NAME => 'test1', FAMILIES => [{NAME => 'acti
>                            ons', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', VERSIONS => '3', C
>                            OMPRESSION => 'NONE', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMOR
>                            Y => 'false', BLOCKCACHE => 'true'}]}}
> {code}
> I think Karthik has a handle on the first part (i.e. why the RS ran into the version mismatch, and aborted opening the region). He'll add details to the JIRA. But what we aren't clear about at this stage is why the base scanner didn't kick in and try to reassign the region.
> BTW, HBase "hbck" reported this as well (which was good!):
> {code}
> Number of Tables: 5
> Number of live region servers:92
> Number of dead region servers:0
> .........
> ERROR: Region test1,6512200000,1279784117114.6466481aa931f8c1fa87622735487a72. is not served by any region server  but is listed in META to be on server null
> ERROR: Region test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44. is not served by any region server  but is listed in META to be on server null
> {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HBASE-2866) Region permanently offlined

Posted by "stack (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-2866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12891483#action_12891483 ] 

stack commented on HBASE-2866:
------------------------------

@Kannan Thanks.  Looking at master and at code, my thought is that the fixup code didn't run because that region is stuck in transition.   Here is where we'd skip out starting at about #562 in BaseScanner:

{code}
    synchronized (this.master.getRegionManager()) {
      /* We don't assign regions that are offline, in transition or were on
       * a dead server. Regions that were on a dead server will get reassigned
       * by ProcessServerShutdown
       */
      if (info.isOffline() ||
        this.master.getRegionManager().regionIsInTransition(info.getRegionNameAsString()) ||
         // St.Ack ^^^^^^^^^^ My guess is we are in here^^^^^^^^^
          (serverName != null && this.master.getServerManager().isDead(serverName))) {
        return;
      }
{code}

I think 'status' in shell:

{code}
hbase(main):003:0> status 'detailed'
version 0.89.0-SNAPSHOT
0 regionsInTransition
1 live servers
    192.168.1.157:49248 1279864501042
        requests=0, regions=3, usedHeap=32, maxHeap=994
        .META.,,1
            stores=2, storefiles=0, storefileSizeMB=0, memstoreSizeMB=0, storefileIndexSizeMB=0
        x,,1279864569260.65c4857477eb31bff0fafae4797a90d8.
            stores=1, storefiles=0, storefileSizeMB=0, memstoreSizeMB=0, storefileIndexSizeMB=0
        -ROOT-,,0
            stores=1, storefiles=1, storefileSizeMB=0, memstoreSizeMB=0, storefileIndexSizeMB=0
0 dead servers
{code}

@Karthik Give me a clue as to what you are thinking and I'll have a go at fixing this one if you don't have the time boss.

> Region permanently offlined 
> ----------------------------
>
>                 Key: HBASE-2866
>                 URL: https://issues.apache.org/jira/browse/HBASE-2866
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Kannan Muthukkaruppan
>            Assignee: Karthik Ranganathan
>            Priority: Blocker
>         Attachments: master.log
>
>
> After split, master attempts to reassign a region to a region server. Occasionally, such a region can get permanently offlined.
> Master:
> ---------
> {code}
> 2010-07-22 01:26:00,914 INFO org.apache.hadoop.hbase.master.ServerManager: Processing MSG_REPORT_SPLIT_INCLUDES_DAUGHTERS: test1,6512200000,1279784117114.6466481aa931f8c1fa87622735487a72.: Daughters; test1,6512200000,1279787158624.6ead25ae677116cc88fc5420bb39d52e., test1,6531790000,1279787\
> 158624.8d5490bfc166c687657cb09203bd7d44. from test024.test.xyz.com,60020,1279780567744; 1 of 1                                                                                                                                                                                                     
> 2010-07-22 01:26:00,935 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Creating UNASSIGNED region 8d5490bfc166c687657cb09203bd7d44 in state = M2ZK_REGION_OFFLINE
> 2010-07-22 01:26:00,935 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Creating UNASSIGNED region 8d5490bfc166c687657cb09203bd7d44 in state = M2ZK_REGION_OFFLINE
> 2010-07-22 01:26:00,945 INFO org.apache.hadoop.hbase.master.RegionManager: Assigning region test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44. to test024.test.xyz.com,60020,1279780567744
> 2010-07-22 01:26:00,949 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: While updating UNASSIGNED region 8d5490bfc166c687657cb09203bd7d44 exists, state = M2ZK_REGION_OFFLINE
> 2010-07-22 01:26:00,954 DEBUG org.apache.hadoop.hbase.master.RegionManager: Created UNASSIGNED zNode test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44. in state M2ZK_REGION_OFFLINE
> {code}
> -------------------
> Region Server:
> {code}
> 2010-07-22 01:26:00,947 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN: test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.
> 2010-07-22 01:26:00,947 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN: test1,6512200000,1279787158624.6ead25ae677116cc88fc5420bb39d52e.
> 2010-07-22 01:26:00,947 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Worker: MSG_REGION_OPEN: test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.
> 2010-07-22 01:26:00,948 DEBUG org.apache.hadoop.hbase.regionserver.RSZookeeperUpdater: Updating ZNode /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44 with [RS2ZK_REGION_OPENING] expected version = 0
> 2010-07-22 01:26:00,952 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event, state: SyncConnected, type: NodeDataChanged, path: /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44
> 2010-07-22 01:26:00,974 WARN org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: <msgstorectrl001.test.xyz.com,msgstorectrl021.test.xyz.com,msgstorectrl041.test.xyz.com,msgstorectrl061.test.xyz.com,msgstorectrl081.ash2.facebook\
> .com:/hbase,test024.test.xyz.com,60020,1279780567744>Failed to write data to ZooKeeper
> org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:106)
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
>         at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1038)
>         at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.writeZNode(ZooKeeperWrapper.java:1062)
>         at org.apache.hadoop.hbase.regionserver.RSZookeeperUpdater.updateZKWithEventData(RSZookeeperUpdater.java:161)
>         at org.apache.hadoop.hbase.regionserver.RSZookeeperUpdater.startRegionOpenEvent(RSZookeeperUpdater.java:115)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer.openRegion(HRegionServer.java:1428)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer$Worker.run(HRegionServer.java:1337)
>         at java.lang.Thread.run(Thread.java:619)
> 2010-07-22 01:26:00,975 ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: Error opening test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.
> java.io.IOException: org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44
>         at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.writeZNode(ZooKeeperWrapper.java:1072)
> {code}
> Meta:
> -----
> Relevant section of META.
> Note that these are the only two entries for the problem region. The first one is the parent region (and this problem
> region is its splitB).  For the next one, note that there is no "info:server" and "info:serverstartcode" columns.
> {code}
>  test1,6512200000,12797841 column=info:splitB, timestamp=1279787160693, value=\x00\x0A6551820000\x00
>  17114.6466481aa931f8c1fa8 \x00\x00\x01)\xF9BL`@test1,6531790000,1279787158624.8d5490bfc166c687657cb
>  7622735487a72.            09203bd7d44.\x00\x0A6531790000\x00\x00\x00\x05\x05test1\x00\x00\x00\x00\x
>                            00\x02\x00\x00\x00\x07IS_ROOT\x00\x00\x00\x05false\x00\x00\x00\x07IS_META
>                            \x00\x00\x00\x05false\x00\x00\x00\x01\x08\x07actions\x00\x00\x00\x08\x00\
>                            x00\x00\x0BBLOOMFILTER\x00\x00\x00\x04NONE\x00\x00\x00\x11REPLICATION_SCO
>                            PE\x00\x00\x00\x010\x00\x00\x00\x0BCOMPRESSION\x00\x00\x00\x04NONE\x00\x0
>                            0\x00\x08VERSIONS\x00\x00\x00\x013\x00\x00\x00\x03TTL\x00\x00\x00\x0A2147
>                            483647\x00\x00\x00\x09BLOCKSIZE\x00\x00\x00\x0565536\x00\x00\x00\x09IN_ME
>                            MORY\x00\x00\x00\x05false\x00\x00\x00\x0ABLOCKCACHE\x00\x00\x00\x04true\x
>                            FE\xA0\xFD\xC5
>  ..
>  test1,6531790000,12797871 column=info:regioninfo, timestamp=1279787160782, value=REGION => {NAME =>
>  58624.8d5490bfc166c687657  'test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.', STAR
>  cb09203bd7d44.            TKEY => '6531790000', ENDKEY => '6551820000', ENCODED => 8d5490bfc166c687
>                            657cb09203bd7d44, TABLE => {{NAME => 'test1', FAMILIES => [{NAME => 'acti
>                            ons', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', VERSIONS => '3', C
>                            OMPRESSION => 'NONE', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMOR
>                            Y => 'false', BLOCKCACHE => 'true'}]}}
> {code}
> I think Karthik has a handle on the first part (i.e. why the RS ran into the version mismatch, and aborted opening the region). He'll add details to the JIRA. But what we aren't clear about at this stage is why the base scanner didn't kick in and try to reassign the region.
> BTW, HBase "hbck" reported this as well (which was good!):
> {code}
> Number of Tables: 5
> Number of live region servers:92
> Number of dead region servers:0
> .........
> ERROR: Region test1,6512200000,1279784117114.6466481aa931f8c1fa87622735487a72. is not served by any region server  but is listed in META to be on server null
> ERROR: Region test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44. is not served by any region server  but is listed in META to be on server null
> {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HBASE-2866) Region permanently offlined

Posted by "HBase Review Board (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-2866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12891814#action_12891814 ] 

HBase Review Board commented on HBASE-2866:
-------------------------------------------

Message from: "Kannan Muthukkaruppan" <ka...@facebook.com>

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://review.hbase.org/r/380/#review470
-----------------------------------------------------------



trunk/src/main/java/org/apache/hadoop/hbase/zookeeper/ZooKeeperWrapper.java
<http://review.hbase.org/r/380/#comment1960>

    should this be:
    
     update = true;
    
    


- Kannan





> Region permanently offlined 
> ----------------------------
>
>                 Key: HBASE-2866
>                 URL: https://issues.apache.org/jira/browse/HBASE-2866
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Kannan Muthukkaruppan
>            Assignee: Karthik Ranganathan
>            Priority: Blocker
>         Attachments: master.log
>
>
> After split, master attempts to reassign a region to a region server. Occasionally, such a region can get permanently offlined.
> Master:
> ---------
> {code}
> 2010-07-22 01:26:00,914 INFO org.apache.hadoop.hbase.master.ServerManager: Processing MSG_REPORT_SPLIT_INCLUDES_DAUGHTERS: test1,6512200000,1279784117114.6466481aa931f8c1fa87622735487a72.: Daughters; test1,6512200000,1279787158624.6ead25ae677116cc88fc5420bb39d52e., test1,6531790000,1279787\
> 158624.8d5490bfc166c687657cb09203bd7d44. from test024.test.xyz.com,60020,1279780567744; 1 of 1                                                                                                                                                                                                     
> 2010-07-22 01:26:00,935 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Creating UNASSIGNED region 8d5490bfc166c687657cb09203bd7d44 in state = M2ZK_REGION_OFFLINE
> 2010-07-22 01:26:00,935 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Creating UNASSIGNED region 8d5490bfc166c687657cb09203bd7d44 in state = M2ZK_REGION_OFFLINE
> 2010-07-22 01:26:00,945 INFO org.apache.hadoop.hbase.master.RegionManager: Assigning region test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44. to test024.test.xyz.com,60020,1279780567744
> 2010-07-22 01:26:00,949 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: While updating UNASSIGNED region 8d5490bfc166c687657cb09203bd7d44 exists, state = M2ZK_REGION_OFFLINE
> 2010-07-22 01:26:00,954 DEBUG org.apache.hadoop.hbase.master.RegionManager: Created UNASSIGNED zNode test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44. in state M2ZK_REGION_OFFLINE
> {code}
> -------------------
> Region Server:
> {code}
> 2010-07-22 01:26:00,947 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN: test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.
> 2010-07-22 01:26:00,947 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN: test1,6512200000,1279787158624.6ead25ae677116cc88fc5420bb39d52e.
> 2010-07-22 01:26:00,947 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Worker: MSG_REGION_OPEN: test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.
> 2010-07-22 01:26:00,948 DEBUG org.apache.hadoop.hbase.regionserver.RSZookeeperUpdater: Updating ZNode /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44 with [RS2ZK_REGION_OPENING] expected version = 0
> 2010-07-22 01:26:00,952 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event, state: SyncConnected, type: NodeDataChanged, path: /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44
> 2010-07-22 01:26:00,974 WARN org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: <msgstorectrl001.test.xyz.com,msgstorectrl021.test.xyz.com,msgstorectrl041.test.xyz.com,msgstorectrl061.test.xyz.com,msgstorectrl081.ash2.facebook\
> .com:/hbase,test024.test.xyz.com,60020,1279780567744>Failed to write data to ZooKeeper
> org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:106)
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
>         at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1038)
>         at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.writeZNode(ZooKeeperWrapper.java:1062)
>         at org.apache.hadoop.hbase.regionserver.RSZookeeperUpdater.updateZKWithEventData(RSZookeeperUpdater.java:161)
>         at org.apache.hadoop.hbase.regionserver.RSZookeeperUpdater.startRegionOpenEvent(RSZookeeperUpdater.java:115)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer.openRegion(HRegionServer.java:1428)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer$Worker.run(HRegionServer.java:1337)
>         at java.lang.Thread.run(Thread.java:619)
> 2010-07-22 01:26:00,975 ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: Error opening test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.
> java.io.IOException: org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44
>         at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.writeZNode(ZooKeeperWrapper.java:1072)
> {code}
> Meta:
> -----
> Relevant section of META.
> Note that these are the only two entries for the problem region. The first one is the parent region (and this problem
> region is its splitB).  For the next one, note that there is no "info:server" and "info:serverstartcode" columns.
> {code}
>  test1,6512200000,12797841 column=info:splitB, timestamp=1279787160693, value=\x00\x0A6551820000\x00
>  17114.6466481aa931f8c1fa8 \x00\x00\x01)\xF9BL`@test1,6531790000,1279787158624.8d5490bfc166c687657cb
>  7622735487a72.            09203bd7d44.\x00\x0A6531790000\x00\x00\x00\x05\x05test1\x00\x00\x00\x00\x
>                            00\x02\x00\x00\x00\x07IS_ROOT\x00\x00\x00\x05false\x00\x00\x00\x07IS_META
>                            \x00\x00\x00\x05false\x00\x00\x00\x01\x08\x07actions\x00\x00\x00\x08\x00\
>                            x00\x00\x0BBLOOMFILTER\x00\x00\x00\x04NONE\x00\x00\x00\x11REPLICATION_SCO
>                            PE\x00\x00\x00\x010\x00\x00\x00\x0BCOMPRESSION\x00\x00\x00\x04NONE\x00\x0
>                            0\x00\x08VERSIONS\x00\x00\x00\x013\x00\x00\x00\x03TTL\x00\x00\x00\x0A2147
>                            483647\x00\x00\x00\x09BLOCKSIZE\x00\x00\x00\x0565536\x00\x00\x00\x09IN_ME
>                            MORY\x00\x00\x00\x05false\x00\x00\x00\x0ABLOCKCACHE\x00\x00\x00\x04true\x
>                            FE\xA0\xFD\xC5
>  ..
>  test1,6531790000,12797871 column=info:regioninfo, timestamp=1279787160782, value=REGION => {NAME =>
>  58624.8d5490bfc166c687657  'test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.', STAR
>  cb09203bd7d44.            TKEY => '6531790000', ENDKEY => '6551820000', ENCODED => 8d5490bfc166c687
>                            657cb09203bd7d44, TABLE => {{NAME => 'test1', FAMILIES => [{NAME => 'acti
>                            ons', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', VERSIONS => '3', C
>                            OMPRESSION => 'NONE', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMOR
>                            Y => 'false', BLOCKCACHE => 'true'}]}}
> {code}
> I think Karthik has a handle on the first part (i.e. why the RS ran into the version mismatch, and aborted opening the region). He'll add details to the JIRA. But what we aren't clear about at this stage is why the base scanner didn't kick in and try to reassign the region.
> BTW, HBase "hbck" reported this as well (which was good!):
> {code}
> Number of Tables: 5
> Number of live region servers:92
> Number of dead region servers:0
> .........
> ERROR: Region test1,6512200000,1279784117114.6466481aa931f8c1fa87622735487a72. is not served by any region server  but is listed in META to be on server null
> ERROR: Region test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44. is not served by any region server  but is listed in META to be on server null
> {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Assigned: (HBASE-2866) Region permanently offlined

Posted by "Kannan Muthukkaruppan (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-2866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kannan Muthukkaruppan reassigned HBASE-2866:
--------------------------------------------

    Assignee: Karthik Ranganathan

> Region permanently offlined 
> ----------------------------
>
>                 Key: HBASE-2866
>                 URL: https://issues.apache.org/jira/browse/HBASE-2866
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Kannan Muthukkaruppan
>            Assignee: Karthik Ranganathan
>            Priority: Blocker
>
> After split, master attempts to reassign a region to a region server. Occasionally, such a region can get permanently offlined.
> Master:
> ---------
> {code}
> 2010-07-22 01:26:00,914 INFO org.apache.hadoop.hbase.master.ServerManager: Processing MSG_REPORT_SPLIT_INCLUDES_DAUGHTERS: test1,6512200000,1279784117114.6466481aa931f8c1fa87622735487a72.: Daughters; test1,6512200000,1279787158624.6ead25ae677116cc88fc5420bb39d52e., test1,6531790000,1279787\
> 158624.8d5490bfc166c687657cb09203bd7d44. from test024.test.xyz.com,60020,1279780567744; 1 of 1                                                                                                                                                                                                     
> 2010-07-22 01:26:00,935 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Creating UNASSIGNED region 8d5490bfc166c687657cb09203bd7d44 in state = M2ZK_REGION_OFFLINE
> 2010-07-22 01:26:00,935 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Creating UNASSIGNED region 8d5490bfc166c687657cb09203bd7d44 in state = M2ZK_REGION_OFFLINE
> 2010-07-22 01:26:00,945 INFO org.apache.hadoop.hbase.master.RegionManager: Assigning region test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44. to test024.test.xyz.com,60020,1279780567744
> 2010-07-22 01:26:00,949 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: While updating UNASSIGNED region 8d5490bfc166c687657cb09203bd7d44 exists, state = M2ZK_REGION_OFFLINE
> 2010-07-22 01:26:00,954 DEBUG org.apache.hadoop.hbase.master.RegionManager: Created UNASSIGNED zNode test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44. in state M2ZK_REGION_OFFLINE
> {code}
> -------------------
> Region Server:
> {code}
> 2010-07-22 01:26:00,947 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN: test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.
> 2010-07-22 01:26:00,947 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN: test1,6512200000,1279787158624.6ead25ae677116cc88fc5420bb39d52e.
> 2010-07-22 01:26:00,947 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Worker: MSG_REGION_OPEN: test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.
> 2010-07-22 01:26:00,948 DEBUG org.apache.hadoop.hbase.regionserver.RSZookeeperUpdater: Updating ZNode /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44 with [RS2ZK_REGION_OPENING] expected version = 0
> 2010-07-22 01:26:00,952 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event, state: SyncConnected, type: NodeDataChanged, path: /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44
> 2010-07-22 01:26:00,974 WARN org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: <msgstorectrl001.test.xyz.com,msgstorectrl021.test.xyz.com,msgstorectrl041.test.xyz.com,msgstorectrl061.test.xyz.com,msgstorectrl081.ash2.facebook\
> .com:/hbase,test024.test.xyz.com,60020,1279780567744>Failed to write data to ZooKeeper
> org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:106)
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
>         at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1038)
>         at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.writeZNode(ZooKeeperWrapper.java:1062)
>         at org.apache.hadoop.hbase.regionserver.RSZookeeperUpdater.updateZKWithEventData(RSZookeeperUpdater.java:161)
>         at org.apache.hadoop.hbase.regionserver.RSZookeeperUpdater.startRegionOpenEvent(RSZookeeperUpdater.java:115)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer.openRegion(HRegionServer.java:1428)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer$Worker.run(HRegionServer.java:1337)
>         at java.lang.Thread.run(Thread.java:619)
> 2010-07-22 01:26:00,975 ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: Error opening test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.
> java.io.IOException: org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44
>         at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.writeZNode(ZooKeeperWrapper.java:1072)
> {code}
> Meta:
> -----
> Relevant section of META.
> Note that these are the only two entries for the problem region. The first one is the parent region (and this problem
> region is its splitB).  For the next one, note that there is no "info:server" and "info:serverstartcode" columns.
> {code}
>  test1,6512200000,12797841 column=info:splitB, timestamp=1279787160693, value=\x00\x0A6551820000\x00
>  17114.6466481aa931f8c1fa8 \x00\x00\x01)\xF9BL`@test1,6531790000,1279787158624.8d5490bfc166c687657cb
>  7622735487a72.            09203bd7d44.\x00\x0A6531790000\x00\x00\x00\x05\x05test1\x00\x00\x00\x00\x
>                            00\x02\x00\x00\x00\x07IS_ROOT\x00\x00\x00\x05false\x00\x00\x00\x07IS_META
>                            \x00\x00\x00\x05false\x00\x00\x00\x01\x08\x07actions\x00\x00\x00\x08\x00\
>                            x00\x00\x0BBLOOMFILTER\x00\x00\x00\x04NONE\x00\x00\x00\x11REPLICATION_SCO
>                            PE\x00\x00\x00\x010\x00\x00\x00\x0BCOMPRESSION\x00\x00\x00\x04NONE\x00\x0
>                            0\x00\x08VERSIONS\x00\x00\x00\x013\x00\x00\x00\x03TTL\x00\x00\x00\x0A2147
>                            483647\x00\x00\x00\x09BLOCKSIZE\x00\x00\x00\x0565536\x00\x00\x00\x09IN_ME
>                            MORY\x00\x00\x00\x05false\x00\x00\x00\x0ABLOCKCACHE\x00\x00\x00\x04true\x
>                            FE\xA0\xFD\xC5
>  ..
>  test1,6531790000,12797871 column=info:regioninfo, timestamp=1279787160782, value=REGION => {NAME =>
>  58624.8d5490bfc166c687657  'test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.', STAR
>  cb09203bd7d44.            TKEY => '6531790000', ENDKEY => '6551820000', ENCODED => 8d5490bfc166c687
>                            657cb09203bd7d44, TABLE => {{NAME => 'test1', FAMILIES => [{NAME => 'acti
>                            ons', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', VERSIONS => '3', C
>                            OMPRESSION => 'NONE', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMOR
>                            Y => 'false', BLOCKCACHE => 'true'}]}}
> {code}
> I think Karthik has a handle on the first part (i.e. why the RS ran into the version mismatch, and aborted opening the region). He'll add details to the JIRA. But what we aren't clear about at this stage is why the base scanner didn't kick in and try to reassign the region.
> BTW, HBase "hbck" reported this as well (which was good!):
> {code}
> Number of Tables: 5
> Number of live region servers:92
> Number of dead region servers:0
> .........
> ERROR: Region test1,6512200000,1279784117114.6466481aa931f8c1fa87622735487a72. is not served by any region server  but is listed in META to be on server null
> ERROR: Region test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44. is not served by any region server  but is listed in META to be on server null
> {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HBASE-2866) Region permanently offlined

Posted by "HBase Review Board (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-2866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12891831#action_12891831 ] 

HBase Review Board commented on HBASE-2866:
-------------------------------------------

Message from: stack@duboce.net

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://review.hbase.org/r/380/#review474
-----------------------------------------------------------

Ship it!


Just one comment


trunk/src/main/java/org/apache/hadoop/hbase/zookeeper/ZooKeeperWrapper.java
<http://review.hbase.org/r/380/#comment1965>

    What is the watcher being triggered if data is not changing?


- stack





> Region permanently offlined 
> ----------------------------
>
>                 Key: HBASE-2866
>                 URL: https://issues.apache.org/jira/browse/HBASE-2866
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Kannan Muthukkaruppan
>            Assignee: Karthik Ranganathan
>            Priority: Blocker
>         Attachments: master.log
>
>
> After split, master attempts to reassign a region to a region server. Occasionally, such a region can get permanently offlined.
> Master:
> ---------
> {code}
> 2010-07-22 01:26:00,914 INFO org.apache.hadoop.hbase.master.ServerManager: Processing MSG_REPORT_SPLIT_INCLUDES_DAUGHTERS: test1,6512200000,1279784117114.6466481aa931f8c1fa87622735487a72.: Daughters; test1,6512200000,1279787158624.6ead25ae677116cc88fc5420bb39d52e., test1,6531790000,1279787\
> 158624.8d5490bfc166c687657cb09203bd7d44. from test024.test.xyz.com,60020,1279780567744; 1 of 1                                                                                                                                                                                                     
> 2010-07-22 01:26:00,935 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Creating UNASSIGNED region 8d5490bfc166c687657cb09203bd7d44 in state = M2ZK_REGION_OFFLINE
> 2010-07-22 01:26:00,935 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Creating UNASSIGNED region 8d5490bfc166c687657cb09203bd7d44 in state = M2ZK_REGION_OFFLINE
> 2010-07-22 01:26:00,945 INFO org.apache.hadoop.hbase.master.RegionManager: Assigning region test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44. to test024.test.xyz.com,60020,1279780567744
> 2010-07-22 01:26:00,949 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: While updating UNASSIGNED region 8d5490bfc166c687657cb09203bd7d44 exists, state = M2ZK_REGION_OFFLINE
> 2010-07-22 01:26:00,954 DEBUG org.apache.hadoop.hbase.master.RegionManager: Created UNASSIGNED zNode test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44. in state M2ZK_REGION_OFFLINE
> {code}
> -------------------
> Region Server:
> {code}
> 2010-07-22 01:26:00,947 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN: test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.
> 2010-07-22 01:26:00,947 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN: test1,6512200000,1279787158624.6ead25ae677116cc88fc5420bb39d52e.
> 2010-07-22 01:26:00,947 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Worker: MSG_REGION_OPEN: test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.
> 2010-07-22 01:26:00,948 DEBUG org.apache.hadoop.hbase.regionserver.RSZookeeperUpdater: Updating ZNode /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44 with [RS2ZK_REGION_OPENING] expected version = 0
> 2010-07-22 01:26:00,952 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event, state: SyncConnected, type: NodeDataChanged, path: /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44
> 2010-07-22 01:26:00,974 WARN org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: <msgstorectrl001.test.xyz.com,msgstorectrl021.test.xyz.com,msgstorectrl041.test.xyz.com,msgstorectrl061.test.xyz.com,msgstorectrl081.ash2.facebook\
> .com:/hbase,test024.test.xyz.com,60020,1279780567744>Failed to write data to ZooKeeper
> org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:106)
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
>         at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1038)
>         at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.writeZNode(ZooKeeperWrapper.java:1062)
>         at org.apache.hadoop.hbase.regionserver.RSZookeeperUpdater.updateZKWithEventData(RSZookeeperUpdater.java:161)
>         at org.apache.hadoop.hbase.regionserver.RSZookeeperUpdater.startRegionOpenEvent(RSZookeeperUpdater.java:115)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer.openRegion(HRegionServer.java:1428)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer$Worker.run(HRegionServer.java:1337)
>         at java.lang.Thread.run(Thread.java:619)
> 2010-07-22 01:26:00,975 ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: Error opening test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.
> java.io.IOException: org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44
>         at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.writeZNode(ZooKeeperWrapper.java:1072)
> {code}
> Meta:
> -----
> Relevant section of META.
> Note that these are the only two entries for the problem region. The first one is the parent region (and this problem
> region is its splitB).  For the next one, note that there is no "info:server" and "info:serverstartcode" columns.
> {code}
>  test1,6512200000,12797841 column=info:splitB, timestamp=1279787160693, value=\x00\x0A6551820000\x00
>  17114.6466481aa931f8c1fa8 \x00\x00\x01)\xF9BL`@test1,6531790000,1279787158624.8d5490bfc166c687657cb
>  7622735487a72.            09203bd7d44.\x00\x0A6531790000\x00\x00\x00\x05\x05test1\x00\x00\x00\x00\x
>                            00\x02\x00\x00\x00\x07IS_ROOT\x00\x00\x00\x05false\x00\x00\x00\x07IS_META
>                            \x00\x00\x00\x05false\x00\x00\x00\x01\x08\x07actions\x00\x00\x00\x08\x00\
>                            x00\x00\x0BBLOOMFILTER\x00\x00\x00\x04NONE\x00\x00\x00\x11REPLICATION_SCO
>                            PE\x00\x00\x00\x010\x00\x00\x00\x0BCOMPRESSION\x00\x00\x00\x04NONE\x00\x0
>                            0\x00\x08VERSIONS\x00\x00\x00\x013\x00\x00\x00\x03TTL\x00\x00\x00\x0A2147
>                            483647\x00\x00\x00\x09BLOCKSIZE\x00\x00\x00\x0565536\x00\x00\x00\x09IN_ME
>                            MORY\x00\x00\x00\x05false\x00\x00\x00\x0ABLOCKCACHE\x00\x00\x00\x04true\x
>                            FE\xA0\xFD\xC5
>  ..
>  test1,6531790000,12797871 column=info:regioninfo, timestamp=1279787160782, value=REGION => {NAME =>
>  58624.8d5490bfc166c687657  'test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.', STAR
>  cb09203bd7d44.            TKEY => '6531790000', ENDKEY => '6551820000', ENCODED => 8d5490bfc166c687
>                            657cb09203bd7d44, TABLE => {{NAME => 'test1', FAMILIES => [{NAME => 'acti
>                            ons', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', VERSIONS => '3', C
>                            OMPRESSION => 'NONE', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMOR
>                            Y => 'false', BLOCKCACHE => 'true'}]}}
> {code}
> I think Karthik has a handle on the first part (i.e. why the RS ran into the version mismatch, and aborted opening the region). He'll add details to the JIRA. But what we aren't clear about at this stage is why the base scanner didn't kick in and try to reassign the region.
> BTW, HBase "hbck" reported this as well (which was good!):
> {code}
> Number of Tables: 5
> Number of live region servers:92
> Number of dead region servers:0
> .........
> ERROR: Region test1,6512200000,1279784117114.6466481aa931f8c1fa87622735487a72. is not served by any region server  but is listed in META to be on server null
> ERROR: Region test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44. is not served by any region server  but is listed in META to be on server null
> {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HBASE-2866) Region permanently offlined

Posted by "Karthik Ranganathan (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-2866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12891649#action_12891649 ] 

Karthik Ranganathan commented on HBASE-2866:
--------------------------------------------

Hey Stack,

Have a fix ready - testing now, will put it up in a bit.

Fix is simple: we get into this situation because we update the same region in transition in ZK again and again, which bumps up the revision number of the ZNode. This causes the update to fail. So if the ZNode is already in the target state, do not update it again.

The above explanation is super-cryptic :), so will sync up with you on the issue and the fix.

> Region permanently offlined 
> ----------------------------
>
>                 Key: HBASE-2866
>                 URL: https://issues.apache.org/jira/browse/HBASE-2866
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Kannan Muthukkaruppan
>            Assignee: Karthik Ranganathan
>            Priority: Blocker
>         Attachments: master.log
>
>
> After split, master attempts to reassign a region to a region server. Occasionally, such a region can get permanently offlined.
> Master:
> ---------
> {code}
> 2010-07-22 01:26:00,914 INFO org.apache.hadoop.hbase.master.ServerManager: Processing MSG_REPORT_SPLIT_INCLUDES_DAUGHTERS: test1,6512200000,1279784117114.6466481aa931f8c1fa87622735487a72.: Daughters; test1,6512200000,1279787158624.6ead25ae677116cc88fc5420bb39d52e., test1,6531790000,1279787\
> 158624.8d5490bfc166c687657cb09203bd7d44. from test024.test.xyz.com,60020,1279780567744; 1 of 1                                                                                                                                                                                                     
> 2010-07-22 01:26:00,935 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Creating UNASSIGNED region 8d5490bfc166c687657cb09203bd7d44 in state = M2ZK_REGION_OFFLINE
> 2010-07-22 01:26:00,935 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Creating UNASSIGNED region 8d5490bfc166c687657cb09203bd7d44 in state = M2ZK_REGION_OFFLINE
> 2010-07-22 01:26:00,945 INFO org.apache.hadoop.hbase.master.RegionManager: Assigning region test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44. to test024.test.xyz.com,60020,1279780567744
> 2010-07-22 01:26:00,949 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: While updating UNASSIGNED region 8d5490bfc166c687657cb09203bd7d44 exists, state = M2ZK_REGION_OFFLINE
> 2010-07-22 01:26:00,954 DEBUG org.apache.hadoop.hbase.master.RegionManager: Created UNASSIGNED zNode test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44. in state M2ZK_REGION_OFFLINE
> {code}
> -------------------
> Region Server:
> {code}
> 2010-07-22 01:26:00,947 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN: test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.
> 2010-07-22 01:26:00,947 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN: test1,6512200000,1279787158624.6ead25ae677116cc88fc5420bb39d52e.
> 2010-07-22 01:26:00,947 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Worker: MSG_REGION_OPEN: test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.
> 2010-07-22 01:26:00,948 DEBUG org.apache.hadoop.hbase.regionserver.RSZookeeperUpdater: Updating ZNode /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44 with [RS2ZK_REGION_OPENING] expected version = 0
> 2010-07-22 01:26:00,952 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event, state: SyncConnected, type: NodeDataChanged, path: /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44
> 2010-07-22 01:26:00,974 WARN org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: <msgstorectrl001.test.xyz.com,msgstorectrl021.test.xyz.com,msgstorectrl041.test.xyz.com,msgstorectrl061.test.xyz.com,msgstorectrl081.ash2.facebook\
> .com:/hbase,test024.test.xyz.com,60020,1279780567744>Failed to write data to ZooKeeper
> org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:106)
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
>         at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1038)
>         at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.writeZNode(ZooKeeperWrapper.java:1062)
>         at org.apache.hadoop.hbase.regionserver.RSZookeeperUpdater.updateZKWithEventData(RSZookeeperUpdater.java:161)
>         at org.apache.hadoop.hbase.regionserver.RSZookeeperUpdater.startRegionOpenEvent(RSZookeeperUpdater.java:115)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer.openRegion(HRegionServer.java:1428)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer$Worker.run(HRegionServer.java:1337)
>         at java.lang.Thread.run(Thread.java:619)
> 2010-07-22 01:26:00,975 ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: Error opening test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.
> java.io.IOException: org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44
>         at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.writeZNode(ZooKeeperWrapper.java:1072)
> {code}
> Meta:
> -----
> Relevant section of META.
> Note that these are the only two entries for the problem region. The first one is the parent region (and this problem
> region is its splitB).  For the next one, note that there is no "info:server" and "info:serverstartcode" columns.
> {code}
>  test1,6512200000,12797841 column=info:splitB, timestamp=1279787160693, value=\x00\x0A6551820000\x00
>  17114.6466481aa931f8c1fa8 \x00\x00\x01)\xF9BL`@test1,6531790000,1279787158624.8d5490bfc166c687657cb
>  7622735487a72.            09203bd7d44.\x00\x0A6531790000\x00\x00\x00\x05\x05test1\x00\x00\x00\x00\x
>                            00\x02\x00\x00\x00\x07IS_ROOT\x00\x00\x00\x05false\x00\x00\x00\x07IS_META
>                            \x00\x00\x00\x05false\x00\x00\x00\x01\x08\x07actions\x00\x00\x00\x08\x00\
>                            x00\x00\x0BBLOOMFILTER\x00\x00\x00\x04NONE\x00\x00\x00\x11REPLICATION_SCO
>                            PE\x00\x00\x00\x010\x00\x00\x00\x0BCOMPRESSION\x00\x00\x00\x04NONE\x00\x0
>                            0\x00\x08VERSIONS\x00\x00\x00\x013\x00\x00\x00\x03TTL\x00\x00\x00\x0A2147
>                            483647\x00\x00\x00\x09BLOCKSIZE\x00\x00\x00\x0565536\x00\x00\x00\x09IN_ME
>                            MORY\x00\x00\x00\x05false\x00\x00\x00\x0ABLOCKCACHE\x00\x00\x00\x04true\x
>                            FE\xA0\xFD\xC5
>  ..
>  test1,6531790000,12797871 column=info:regioninfo, timestamp=1279787160782, value=REGION => {NAME =>
>  58624.8d5490bfc166c687657  'test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.', STAR
>  cb09203bd7d44.            TKEY => '6531790000', ENDKEY => '6551820000', ENCODED => 8d5490bfc166c687
>                            657cb09203bd7d44, TABLE => {{NAME => 'test1', FAMILIES => [{NAME => 'acti
>                            ons', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', VERSIONS => '3', C
>                            OMPRESSION => 'NONE', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMOR
>                            Y => 'false', BLOCKCACHE => 'true'}]}}
> {code}
> I think Karthik has a handle on the first part (i.e. why the RS ran into the version mismatch, and aborted opening the region). He'll add details to the JIRA. But what we aren't clear about at this stage is why the base scanner didn't kick in and try to reassign the region.
> BTW, HBase "hbck" reported this as well (which was good!):
> {code}
> Number of Tables: 5
> Number of live region servers:92
> Number of dead region servers:0
> .........
> ERROR: Region test1,6512200000,1279784117114.6466481aa931f8c1fa87622735487a72. is not served by any region server  but is listed in META to be on server null
> ERROR: Region test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44. is not served by any region server  but is listed in META to be on server null
> {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HBASE-2866) Region permanently offlined

Posted by "Kannan Muthukkaruppan (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-2866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kannan Muthukkaruppan updated HBASE-2866:
-----------------------------------------

    Attachment: master.log

@Stack: Attached the master's log.

> Region permanently offlined 
> ----------------------------
>
>                 Key: HBASE-2866
>                 URL: https://issues.apache.org/jira/browse/HBASE-2866
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Kannan Muthukkaruppan
>            Assignee: Karthik Ranganathan
>            Priority: Blocker
>         Attachments: master.log
>
>
> After split, master attempts to reassign a region to a region server. Occasionally, such a region can get permanently offlined.
> Master:
> ---------
> {code}
> 2010-07-22 01:26:00,914 INFO org.apache.hadoop.hbase.master.ServerManager: Processing MSG_REPORT_SPLIT_INCLUDES_DAUGHTERS: test1,6512200000,1279784117114.6466481aa931f8c1fa87622735487a72.: Daughters; test1,6512200000,1279787158624.6ead25ae677116cc88fc5420bb39d52e., test1,6531790000,1279787\
> 158624.8d5490bfc166c687657cb09203bd7d44. from test024.test.xyz.com,60020,1279780567744; 1 of 1                                                                                                                                                                                                     
> 2010-07-22 01:26:00,935 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Creating UNASSIGNED region 8d5490bfc166c687657cb09203bd7d44 in state = M2ZK_REGION_OFFLINE
> 2010-07-22 01:26:00,935 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Creating UNASSIGNED region 8d5490bfc166c687657cb09203bd7d44 in state = M2ZK_REGION_OFFLINE
> 2010-07-22 01:26:00,945 INFO org.apache.hadoop.hbase.master.RegionManager: Assigning region test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44. to test024.test.xyz.com,60020,1279780567744
> 2010-07-22 01:26:00,949 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: While updating UNASSIGNED region 8d5490bfc166c687657cb09203bd7d44 exists, state = M2ZK_REGION_OFFLINE
> 2010-07-22 01:26:00,954 DEBUG org.apache.hadoop.hbase.master.RegionManager: Created UNASSIGNED zNode test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44. in state M2ZK_REGION_OFFLINE
> {code}
> -------------------
> Region Server:
> {code}
> 2010-07-22 01:26:00,947 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN: test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.
> 2010-07-22 01:26:00,947 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN: test1,6512200000,1279787158624.6ead25ae677116cc88fc5420bb39d52e.
> 2010-07-22 01:26:00,947 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Worker: MSG_REGION_OPEN: test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.
> 2010-07-22 01:26:00,948 DEBUG org.apache.hadoop.hbase.regionserver.RSZookeeperUpdater: Updating ZNode /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44 with [RS2ZK_REGION_OPENING] expected version = 0
> 2010-07-22 01:26:00,952 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event, state: SyncConnected, type: NodeDataChanged, path: /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44
> 2010-07-22 01:26:00,974 WARN org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: <msgstorectrl001.test.xyz.com,msgstorectrl021.test.xyz.com,msgstorectrl041.test.xyz.com,msgstorectrl061.test.xyz.com,msgstorectrl081.ash2.facebook\
> .com:/hbase,test024.test.xyz.com,60020,1279780567744>Failed to write data to ZooKeeper
> org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:106)
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
>         at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1038)
>         at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.writeZNode(ZooKeeperWrapper.java:1062)
>         at org.apache.hadoop.hbase.regionserver.RSZookeeperUpdater.updateZKWithEventData(RSZookeeperUpdater.java:161)
>         at org.apache.hadoop.hbase.regionserver.RSZookeeperUpdater.startRegionOpenEvent(RSZookeeperUpdater.java:115)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer.openRegion(HRegionServer.java:1428)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer$Worker.run(HRegionServer.java:1337)
>         at java.lang.Thread.run(Thread.java:619)
> 2010-07-22 01:26:00,975 ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: Error opening test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.
> java.io.IOException: org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44
>         at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.writeZNode(ZooKeeperWrapper.java:1072)
> {code}
> Meta:
> -----
> Relevant section of META.
> Note that these are the only two entries for the problem region. The first one is the parent region (and this problem
> region is its splitB).  For the next one, note that there is no "info:server" and "info:serverstartcode" columns.
> {code}
>  test1,6512200000,12797841 column=info:splitB, timestamp=1279787160693, value=\x00\x0A6551820000\x00
>  17114.6466481aa931f8c1fa8 \x00\x00\x01)\xF9BL`@test1,6531790000,1279787158624.8d5490bfc166c687657cb
>  7622735487a72.            09203bd7d44.\x00\x0A6531790000\x00\x00\x00\x05\x05test1\x00\x00\x00\x00\x
>                            00\x02\x00\x00\x00\x07IS_ROOT\x00\x00\x00\x05false\x00\x00\x00\x07IS_META
>                            \x00\x00\x00\x05false\x00\x00\x00\x01\x08\x07actions\x00\x00\x00\x08\x00\
>                            x00\x00\x0BBLOOMFILTER\x00\x00\x00\x04NONE\x00\x00\x00\x11REPLICATION_SCO
>                            PE\x00\x00\x00\x010\x00\x00\x00\x0BCOMPRESSION\x00\x00\x00\x04NONE\x00\x0
>                            0\x00\x08VERSIONS\x00\x00\x00\x013\x00\x00\x00\x03TTL\x00\x00\x00\x0A2147
>                            483647\x00\x00\x00\x09BLOCKSIZE\x00\x00\x00\x0565536\x00\x00\x00\x09IN_ME
>                            MORY\x00\x00\x00\x05false\x00\x00\x00\x0ABLOCKCACHE\x00\x00\x00\x04true\x
>                            FE\xA0\xFD\xC5
>  ..
>  test1,6531790000,12797871 column=info:regioninfo, timestamp=1279787160782, value=REGION => {NAME =>
>  58624.8d5490bfc166c687657  'test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.', STAR
>  cb09203bd7d44.            TKEY => '6531790000', ENDKEY => '6551820000', ENCODED => 8d5490bfc166c687
>                            657cb09203bd7d44, TABLE => {{NAME => 'test1', FAMILIES => [{NAME => 'acti
>                            ons', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', VERSIONS => '3', C
>                            OMPRESSION => 'NONE', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMOR
>                            Y => 'false', BLOCKCACHE => 'true'}]}}
> {code}
> I think Karthik has a handle on the first part (i.e. why the RS ran into the version mismatch, and aborted opening the region). He'll add details to the JIRA. But what we aren't clear about at this stage is why the base scanner didn't kick in and try to reassign the region.
> BTW, HBase "hbck" reported this as well (which was good!):
> {code}
> Number of Tables: 5
> Number of live region servers:92
> Number of dead region servers:0
> .........
> ERROR: Region test1,6512200000,1279784117114.6466481aa931f8c1fa87622735487a72. is not served by any region server  but is listed in META to be on server null
> ERROR: Region test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44. is not served by any region server  but is listed in META to be on server null
> {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HBASE-2866) Region permanently offlined

Posted by "Kannan Muthukkaruppan (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-2866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12891689#action_12891689 ] 

Kannan Muthukkaruppan commented on HBASE-2866:
----------------------------------------------

Stack:

{code}
hbase(main):002:0> status 'detailed'
status 'detailed'
version 0.21.0-SNAPSHOT
1 regionsInTransition
    name=test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44., state=PENDING_OPEN
92 live servers
...
{code}

So we find one region in PENDING_OPEN state. HBase "hbck" though complained about 2 regions. (So need to look into what was happening with the other region).

And, surprisingly though, ZK UNASSIGNED node shows 4 entries.

{code}
hbase(main):006:0> zk "ls /hbase/UNASSIGNED"
zk "ls /hbase/UNASSIGNED"
[9d675cd0b61c44c6605e752490a36eaf, 2e8c1def6dfe3ee1b8fc965629a93041, 9697bd45dc3d62d7e6b519a13d668062, 8d5490bfc166c687657cb09203bd7d44]
{code}



> Region permanently offlined 
> ----------------------------
>
>                 Key: HBASE-2866
>                 URL: https://issues.apache.org/jira/browse/HBASE-2866
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Kannan Muthukkaruppan
>            Assignee: Karthik Ranganathan
>            Priority: Blocker
>         Attachments: master.log
>
>
> After split, master attempts to reassign a region to a region server. Occasionally, such a region can get permanently offlined.
> Master:
> ---------
> {code}
> 2010-07-22 01:26:00,914 INFO org.apache.hadoop.hbase.master.ServerManager: Processing MSG_REPORT_SPLIT_INCLUDES_DAUGHTERS: test1,6512200000,1279784117114.6466481aa931f8c1fa87622735487a72.: Daughters; test1,6512200000,1279787158624.6ead25ae677116cc88fc5420bb39d52e., test1,6531790000,1279787\
> 158624.8d5490bfc166c687657cb09203bd7d44. from test024.test.xyz.com,60020,1279780567744; 1 of 1                                                                                                                                                                                                     
> 2010-07-22 01:26:00,935 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Creating UNASSIGNED region 8d5490bfc166c687657cb09203bd7d44 in state = M2ZK_REGION_OFFLINE
> 2010-07-22 01:26:00,935 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Creating UNASSIGNED region 8d5490bfc166c687657cb09203bd7d44 in state = M2ZK_REGION_OFFLINE
> 2010-07-22 01:26:00,945 INFO org.apache.hadoop.hbase.master.RegionManager: Assigning region test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44. to test024.test.xyz.com,60020,1279780567744
> 2010-07-22 01:26:00,949 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: While updating UNASSIGNED region 8d5490bfc166c687657cb09203bd7d44 exists, state = M2ZK_REGION_OFFLINE
> 2010-07-22 01:26:00,954 DEBUG org.apache.hadoop.hbase.master.RegionManager: Created UNASSIGNED zNode test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44. in state M2ZK_REGION_OFFLINE
> {code}
> -------------------
> Region Server:
> {code}
> 2010-07-22 01:26:00,947 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN: test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.
> 2010-07-22 01:26:00,947 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN: test1,6512200000,1279787158624.6ead25ae677116cc88fc5420bb39d52e.
> 2010-07-22 01:26:00,947 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Worker: MSG_REGION_OPEN: test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.
> 2010-07-22 01:26:00,948 DEBUG org.apache.hadoop.hbase.regionserver.RSZookeeperUpdater: Updating ZNode /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44 with [RS2ZK_REGION_OPENING] expected version = 0
> 2010-07-22 01:26:00,952 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event, state: SyncConnected, type: NodeDataChanged, path: /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44
> 2010-07-22 01:26:00,974 WARN org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: <msgstorectrl001.test.xyz.com,msgstorectrl021.test.xyz.com,msgstorectrl041.test.xyz.com,msgstorectrl061.test.xyz.com,msgstorectrl081.ash2.facebook\
> .com:/hbase,test024.test.xyz.com,60020,1279780567744>Failed to write data to ZooKeeper
> org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:106)
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
>         at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1038)
>         at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.writeZNode(ZooKeeperWrapper.java:1062)
>         at org.apache.hadoop.hbase.regionserver.RSZookeeperUpdater.updateZKWithEventData(RSZookeeperUpdater.java:161)
>         at org.apache.hadoop.hbase.regionserver.RSZookeeperUpdater.startRegionOpenEvent(RSZookeeperUpdater.java:115)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer.openRegion(HRegionServer.java:1428)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer$Worker.run(HRegionServer.java:1337)
>         at java.lang.Thread.run(Thread.java:619)
> 2010-07-22 01:26:00,975 ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: Error opening test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.
> java.io.IOException: org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44
>         at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.writeZNode(ZooKeeperWrapper.java:1072)
> {code}
> Meta:
> -----
> Relevant section of META.
> Note that these are the only two entries for the problem region. The first one is the parent region (and this problem
> region is its splitB).  For the next one, note that there is no "info:server" and "info:serverstartcode" columns.
> {code}
>  test1,6512200000,12797841 column=info:splitB, timestamp=1279787160693, value=\x00\x0A6551820000\x00
>  17114.6466481aa931f8c1fa8 \x00\x00\x01)\xF9BL`@test1,6531790000,1279787158624.8d5490bfc166c687657cb
>  7622735487a72.            09203bd7d44.\x00\x0A6531790000\x00\x00\x00\x05\x05test1\x00\x00\x00\x00\x
>                            00\x02\x00\x00\x00\x07IS_ROOT\x00\x00\x00\x05false\x00\x00\x00\x07IS_META
>                            \x00\x00\x00\x05false\x00\x00\x00\x01\x08\x07actions\x00\x00\x00\x08\x00\
>                            x00\x00\x0BBLOOMFILTER\x00\x00\x00\x04NONE\x00\x00\x00\x11REPLICATION_SCO
>                            PE\x00\x00\x00\x010\x00\x00\x00\x0BCOMPRESSION\x00\x00\x00\x04NONE\x00\x0
>                            0\x00\x08VERSIONS\x00\x00\x00\x013\x00\x00\x00\x03TTL\x00\x00\x00\x0A2147
>                            483647\x00\x00\x00\x09BLOCKSIZE\x00\x00\x00\x0565536\x00\x00\x00\x09IN_ME
>                            MORY\x00\x00\x00\x05false\x00\x00\x00\x0ABLOCKCACHE\x00\x00\x00\x04true\x
>                            FE\xA0\xFD\xC5
>  ..
>  test1,6531790000,12797871 column=info:regioninfo, timestamp=1279787160782, value=REGION => {NAME =>
>  58624.8d5490bfc166c687657  'test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.', STAR
>  cb09203bd7d44.            TKEY => '6531790000', ENDKEY => '6551820000', ENCODED => 8d5490bfc166c687
>                            657cb09203bd7d44, TABLE => {{NAME => 'test1', FAMILIES => [{NAME => 'acti
>                            ons', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', VERSIONS => '3', C
>                            OMPRESSION => 'NONE', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMOR
>                            Y => 'false', BLOCKCACHE => 'true'}]}}
> {code}
> I think Karthik has a handle on the first part (i.e. why the RS ran into the version mismatch, and aborted opening the region). He'll add details to the JIRA. But what we aren't clear about at this stage is why the base scanner didn't kick in and try to reassign the region.
> BTW, HBase "hbck" reported this as well (which was good!):
> {code}
> Number of Tables: 5
> Number of live region servers:92
> Number of dead region servers:0
> .........
> ERROR: Region test1,6512200000,1279784117114.6466481aa931f8c1fa87622735487a72. is not served by any region server  but is listed in META to be on server null
> ERROR: Region test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44. is not served by any region server  but is listed in META to be on server null
> {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HBASE-2866) Region permanently offlined

Posted by "Kannan Muthukkaruppan (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-2866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12891694#action_12891694 ] 

Kannan Muthukkaruppan commented on HBASE-2866:
----------------------------------------------

Wondering how this case is handled...

Master asks a RS to open a region, and I guess adds 
it to regions in transition. If RS dies before even starting
to work on the region, is there some timeout mechanism
that kicks in and master realized it needs to give this
region to someone else?

---------

For now, what's the manual steps to be taken to get master 
to reassign the region? Restarting the master would work
I suppose for now. Any other ideas?






> Region permanently offlined 
> ----------------------------
>
>                 Key: HBASE-2866
>                 URL: https://issues.apache.org/jira/browse/HBASE-2866
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Kannan Muthukkaruppan
>            Assignee: Karthik Ranganathan
>            Priority: Blocker
>         Attachments: master.log
>
>
> After split, master attempts to reassign a region to a region server. Occasionally, such a region can get permanently offlined.
> Master:
> ---------
> {code}
> 2010-07-22 01:26:00,914 INFO org.apache.hadoop.hbase.master.ServerManager: Processing MSG_REPORT_SPLIT_INCLUDES_DAUGHTERS: test1,6512200000,1279784117114.6466481aa931f8c1fa87622735487a72.: Daughters; test1,6512200000,1279787158624.6ead25ae677116cc88fc5420bb39d52e., test1,6531790000,1279787\
> 158624.8d5490bfc166c687657cb09203bd7d44. from test024.test.xyz.com,60020,1279780567744; 1 of 1                                                                                                                                                                                                     
> 2010-07-22 01:26:00,935 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Creating UNASSIGNED region 8d5490bfc166c687657cb09203bd7d44 in state = M2ZK_REGION_OFFLINE
> 2010-07-22 01:26:00,935 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Creating UNASSIGNED region 8d5490bfc166c687657cb09203bd7d44 in state = M2ZK_REGION_OFFLINE
> 2010-07-22 01:26:00,945 INFO org.apache.hadoop.hbase.master.RegionManager: Assigning region test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44. to test024.test.xyz.com,60020,1279780567744
> 2010-07-22 01:26:00,949 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: While updating UNASSIGNED region 8d5490bfc166c687657cb09203bd7d44 exists, state = M2ZK_REGION_OFFLINE
> 2010-07-22 01:26:00,954 DEBUG org.apache.hadoop.hbase.master.RegionManager: Created UNASSIGNED zNode test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44. in state M2ZK_REGION_OFFLINE
> {code}
> -------------------
> Region Server:
> {code}
> 2010-07-22 01:26:00,947 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN: test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.
> 2010-07-22 01:26:00,947 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN: test1,6512200000,1279787158624.6ead25ae677116cc88fc5420bb39d52e.
> 2010-07-22 01:26:00,947 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Worker: MSG_REGION_OPEN: test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.
> 2010-07-22 01:26:00,948 DEBUG org.apache.hadoop.hbase.regionserver.RSZookeeperUpdater: Updating ZNode /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44 with [RS2ZK_REGION_OPENING] expected version = 0
> 2010-07-22 01:26:00,952 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event, state: SyncConnected, type: NodeDataChanged, path: /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44
> 2010-07-22 01:26:00,974 WARN org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: <msgstorectrl001.test.xyz.com,msgstorectrl021.test.xyz.com,msgstorectrl041.test.xyz.com,msgstorectrl061.test.xyz.com,msgstorectrl081.ash2.facebook\
> .com:/hbase,test024.test.xyz.com,60020,1279780567744>Failed to write data to ZooKeeper
> org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:106)
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
>         at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1038)
>         at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.writeZNode(ZooKeeperWrapper.java:1062)
>         at org.apache.hadoop.hbase.regionserver.RSZookeeperUpdater.updateZKWithEventData(RSZookeeperUpdater.java:161)
>         at org.apache.hadoop.hbase.regionserver.RSZookeeperUpdater.startRegionOpenEvent(RSZookeeperUpdater.java:115)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer.openRegion(HRegionServer.java:1428)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer$Worker.run(HRegionServer.java:1337)
>         at java.lang.Thread.run(Thread.java:619)
> 2010-07-22 01:26:00,975 ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: Error opening test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.
> java.io.IOException: org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44
>         at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.writeZNode(ZooKeeperWrapper.java:1072)
> {code}
> Meta:
> -----
> Relevant section of META.
> Note that these are the only two entries for the problem region. The first one is the parent region (and this problem
> region is its splitB).  For the next one, note that there is no "info:server" and "info:serverstartcode" columns.
> {code}
>  test1,6512200000,12797841 column=info:splitB, timestamp=1279787160693, value=\x00\x0A6551820000\x00
>  17114.6466481aa931f8c1fa8 \x00\x00\x01)\xF9BL`@test1,6531790000,1279787158624.8d5490bfc166c687657cb
>  7622735487a72.            09203bd7d44.\x00\x0A6531790000\x00\x00\x00\x05\x05test1\x00\x00\x00\x00\x
>                            00\x02\x00\x00\x00\x07IS_ROOT\x00\x00\x00\x05false\x00\x00\x00\x07IS_META
>                            \x00\x00\x00\x05false\x00\x00\x00\x01\x08\x07actions\x00\x00\x00\x08\x00\
>                            x00\x00\x0BBLOOMFILTER\x00\x00\x00\x04NONE\x00\x00\x00\x11REPLICATION_SCO
>                            PE\x00\x00\x00\x010\x00\x00\x00\x0BCOMPRESSION\x00\x00\x00\x04NONE\x00\x0
>                            0\x00\x08VERSIONS\x00\x00\x00\x013\x00\x00\x00\x03TTL\x00\x00\x00\x0A2147
>                            483647\x00\x00\x00\x09BLOCKSIZE\x00\x00\x00\x0565536\x00\x00\x00\x09IN_ME
>                            MORY\x00\x00\x00\x05false\x00\x00\x00\x0ABLOCKCACHE\x00\x00\x00\x04true\x
>                            FE\xA0\xFD\xC5
>  ..
>  test1,6531790000,12797871 column=info:regioninfo, timestamp=1279787160782, value=REGION => {NAME =>
>  58624.8d5490bfc166c687657  'test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.', STAR
>  cb09203bd7d44.            TKEY => '6531790000', ENDKEY => '6551820000', ENCODED => 8d5490bfc166c687
>                            657cb09203bd7d44, TABLE => {{NAME => 'test1', FAMILIES => [{NAME => 'acti
>                            ons', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', VERSIONS => '3', C
>                            OMPRESSION => 'NONE', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMOR
>                            Y => 'false', BLOCKCACHE => 'true'}]}}
> {code}
> I think Karthik has a handle on the first part (i.e. why the RS ran into the version mismatch, and aborted opening the region). He'll add details to the JIRA. But what we aren't clear about at this stage is why the base scanner didn't kick in and try to reassign the region.
> BTW, HBase "hbck" reported this as well (which was good!):
> {code}
> Number of Tables: 5
> Number of live region servers:92
> Number of dead region servers:0
> .........
> ERROR: Region test1,6512200000,1279784117114.6466481aa931f8c1fa87622735487a72. is not served by any region server  but is listed in META to be on server null
> ERROR: Region test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44. is not served by any region server  but is listed in META to be on server null
> {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HBASE-2866) Region permanently offlined

Posted by "Jean-Daniel Cryans (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-2866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12891381#action_12891381 ] 

Jean-Daniel Cryans commented on HBASE-2866:
-------------------------------------------

This issue looks like the one in TestAdmin that fails every now and then on Hudson like: http://hudson.zones.apache.org/hudson/job/HBase-TRUNK/1397/

> Region permanently offlined 
> ----------------------------
>
>                 Key: HBASE-2866
>                 URL: https://issues.apache.org/jira/browse/HBASE-2866
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Kannan Muthukkaruppan
>            Assignee: Karthik Ranganathan
>            Priority: Blocker
>
> After split, master attempts to reassign a region to a region server. Occasionally, such a region can get permanently offlined.
> Master:
> ---------
> {code}
> 2010-07-22 01:26:00,914 INFO org.apache.hadoop.hbase.master.ServerManager: Processing MSG_REPORT_SPLIT_INCLUDES_DAUGHTERS: test1,6512200000,1279784117114.6466481aa931f8c1fa87622735487a72.: Daughters; test1,6512200000,1279787158624.6ead25ae677116cc88fc5420bb39d52e., test1,6531790000,1279787\
> 158624.8d5490bfc166c687657cb09203bd7d44. from test024.test.xyz.com,60020,1279780567744; 1 of 1                                                                                                                                                                                                     
> 2010-07-22 01:26:00,935 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Creating UNASSIGNED region 8d5490bfc166c687657cb09203bd7d44 in state = M2ZK_REGION_OFFLINE
> 2010-07-22 01:26:00,935 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Creating UNASSIGNED region 8d5490bfc166c687657cb09203bd7d44 in state = M2ZK_REGION_OFFLINE
> 2010-07-22 01:26:00,945 INFO org.apache.hadoop.hbase.master.RegionManager: Assigning region test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44. to test024.test.xyz.com,60020,1279780567744
> 2010-07-22 01:26:00,949 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: While updating UNASSIGNED region 8d5490bfc166c687657cb09203bd7d44 exists, state = M2ZK_REGION_OFFLINE
> 2010-07-22 01:26:00,954 DEBUG org.apache.hadoop.hbase.master.RegionManager: Created UNASSIGNED zNode test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44. in state M2ZK_REGION_OFFLINE
> {code}
> -------------------
> Region Server:
> {code}
> 2010-07-22 01:26:00,947 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN: test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.
> 2010-07-22 01:26:00,947 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN: test1,6512200000,1279787158624.6ead25ae677116cc88fc5420bb39d52e.
> 2010-07-22 01:26:00,947 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Worker: MSG_REGION_OPEN: test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.
> 2010-07-22 01:26:00,948 DEBUG org.apache.hadoop.hbase.regionserver.RSZookeeperUpdater: Updating ZNode /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44 with [RS2ZK_REGION_OPENING] expected version = 0
> 2010-07-22 01:26:00,952 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event, state: SyncConnected, type: NodeDataChanged, path: /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44
> 2010-07-22 01:26:00,974 WARN org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: <msgstorectrl001.test.xyz.com,msgstorectrl021.test.xyz.com,msgstorectrl041.test.xyz.com,msgstorectrl061.test.xyz.com,msgstorectrl081.ash2.facebook\
> .com:/hbase,test024.test.xyz.com,60020,1279780567744>Failed to write data to ZooKeeper
> org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:106)
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
>         at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1038)
>         at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.writeZNode(ZooKeeperWrapper.java:1062)
>         at org.apache.hadoop.hbase.regionserver.RSZookeeperUpdater.updateZKWithEventData(RSZookeeperUpdater.java:161)
>         at org.apache.hadoop.hbase.regionserver.RSZookeeperUpdater.startRegionOpenEvent(RSZookeeperUpdater.java:115)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer.openRegion(HRegionServer.java:1428)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer$Worker.run(HRegionServer.java:1337)
>         at java.lang.Thread.run(Thread.java:619)
> 2010-07-22 01:26:00,975 ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: Error opening test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.
> java.io.IOException: org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44
>         at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.writeZNode(ZooKeeperWrapper.java:1072)
> {code}
> Meta:
> -----
> Relevant section of META.
> Note that these are the only two entries for the problem region. The first one is the parent region (and this problem
> region is its splitB).  For the next one, note that there is no "info:server" and "info:serverstartcode" columns.
> {code}
>  test1,6512200000,12797841 column=info:splitB, timestamp=1279787160693, value=\x00\x0A6551820000\x00
>  17114.6466481aa931f8c1fa8 \x00\x00\x01)\xF9BL`@test1,6531790000,1279787158624.8d5490bfc166c687657cb
>  7622735487a72.            09203bd7d44.\x00\x0A6531790000\x00\x00\x00\x05\x05test1\x00\x00\x00\x00\x
>                            00\x02\x00\x00\x00\x07IS_ROOT\x00\x00\x00\x05false\x00\x00\x00\x07IS_META
>                            \x00\x00\x00\x05false\x00\x00\x00\x01\x08\x07actions\x00\x00\x00\x08\x00\
>                            x00\x00\x0BBLOOMFILTER\x00\x00\x00\x04NONE\x00\x00\x00\x11REPLICATION_SCO
>                            PE\x00\x00\x00\x010\x00\x00\x00\x0BCOMPRESSION\x00\x00\x00\x04NONE\x00\x0
>                            0\x00\x08VERSIONS\x00\x00\x00\x013\x00\x00\x00\x03TTL\x00\x00\x00\x0A2147
>                            483647\x00\x00\x00\x09BLOCKSIZE\x00\x00\x00\x0565536\x00\x00\x00\x09IN_ME
>                            MORY\x00\x00\x00\x05false\x00\x00\x00\x0ABLOCKCACHE\x00\x00\x00\x04true\x
>                            FE\xA0\xFD\xC5
>  ..
>  test1,6531790000,12797871 column=info:regioninfo, timestamp=1279787160782, value=REGION => {NAME =>
>  58624.8d5490bfc166c687657  'test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.', STAR
>  cb09203bd7d44.            TKEY => '6531790000', ENDKEY => '6551820000', ENCODED => 8d5490bfc166c687
>                            657cb09203bd7d44, TABLE => {{NAME => 'test1', FAMILIES => [{NAME => 'acti
>                            ons', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', VERSIONS => '3', C
>                            OMPRESSION => 'NONE', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMOR
>                            Y => 'false', BLOCKCACHE => 'true'}]}}
> {code}
> I think Karthik has a handle on the first part (i.e. why the RS ran into the version mismatch, and aborted opening the region). He'll add details to the JIRA. But what we aren't clear about at this stage is why the base scanner didn't kick in and try to reassign the region.
> BTW, HBase "hbck" reported this as well (which was good!):
> {code}
> Number of Tables: 5
> Number of live region servers:92
> Number of dead region servers:0
> .........
> ERROR: Region test1,6512200000,1279784117114.6466481aa931f8c1fa87622735487a72. is not served by any region server  but is listed in META to be on server null
> ERROR: Region test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44. is not served by any region server  but is listed in META to be on server null
> {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.