You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@hbase.apache.org by "Kannan Muthukkaruppan (JIRA)" <ji...@apache.org> on 2010/07/22 22:21:50 UTC
[jira] Created: (HBASE-2866) Region permanently offlined
Region permanently offlined
----------------------------
Key: HBASE-2866
URL: https://issues.apache.org/jira/browse/HBASE-2866
Project: HBase
Issue Type: Bug
Reporter: Kannan Muthukkaruppan
Priority: Blocker
After split, master attempts to reassign a region to a region server. Occasionally, such a region can get permanently offlined.
Master:
---------
{code}
2010-07-22 01:26:00,914 INFO org.apache.hadoop.hbase.master.ServerManager: Processing MSG_REPORT_SPLIT_INCLUDES_DAUGHTERS: test1,6512200000,1279784117114.6466481aa931f8c1fa87622735487a72.: Daughters; test1,6512200000,1279787158624.6ead25ae677116cc88fc5420bb39d52e., test1,6531790000,1279787\
158624.8d5490bfc166c687657cb09203bd7d44. from test024.test.xyz.com,60020,1279780567744; 1 of 1
2010-07-22 01:26:00,935 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Creating UNASSIGNED region 8d5490bfc166c687657cb09203bd7d44 in state = M2ZK_REGION_OFFLINE
2010-07-22 01:26:00,935 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Creating UNASSIGNED region 8d5490bfc166c687657cb09203bd7d44 in state = M2ZK_REGION_OFFLINE
2010-07-22 01:26:00,945 INFO org.apache.hadoop.hbase.master.RegionManager: Assigning region test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44. to test024.test.xyz.com,60020,1279780567744
2010-07-22 01:26:00,949 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: While updating UNASSIGNED region 8d5490bfc166c687657cb09203bd7d44 exists, state = M2ZK_REGION_OFFLINE
2010-07-22 01:26:00,954 DEBUG org.apache.hadoop.hbase.master.RegionManager: Created UNASSIGNED zNode test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44. in state M2ZK_REGION_OFFLINE
{code}
-------------------
Region Server:
{code}
2010-07-22 01:26:00,947 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN: test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.
2010-07-22 01:26:00,947 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN: test1,6512200000,1279787158624.6ead25ae677116cc88fc5420bb39d52e.
2010-07-22 01:26:00,947 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Worker: MSG_REGION_OPEN: test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.
2010-07-22 01:26:00,948 DEBUG org.apache.hadoop.hbase.regionserver.RSZookeeperUpdater: Updating ZNode /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44 with [RS2ZK_REGION_OPENING] expected version = 0
2010-07-22 01:26:00,952 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event, state: SyncConnected, type: NodeDataChanged, path: /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44
2010-07-22 01:26:00,974 WARN org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: <msgstorectrl001.test.xyz.com,msgstorectrl021.test.xyz.com,msgstorectrl041.test.xyz.com,msgstorectrl061.test.xyz.com,msgstorectrl081.ash2.facebook\
.com:/hbase,test024.test.xyz.com,60020,1279780567744>Failed to write data to ZooKeeper
org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44
at org.apache.zookeeper.KeeperException.create(KeeperException.java:106)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1038)
at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.writeZNode(ZooKeeperWrapper.java:1062)
at org.apache.hadoop.hbase.regionserver.RSZookeeperUpdater.updateZKWithEventData(RSZookeeperUpdater.java:161)
at org.apache.hadoop.hbase.regionserver.RSZookeeperUpdater.startRegionOpenEvent(RSZookeeperUpdater.java:115)
at org.apache.hadoop.hbase.regionserver.HRegionServer.openRegion(HRegionServer.java:1428)
at org.apache.hadoop.hbase.regionserver.HRegionServer$Worker.run(HRegionServer.java:1337)
at java.lang.Thread.run(Thread.java:619)
2010-07-22 01:26:00,975 ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: Error opening test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.
java.io.IOException: org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44
at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.writeZNode(ZooKeeperWrapper.java:1072)
{code}
Meta:
-----
Relevant section of META.
Note that these are the only two entries for the problem region. The first one is the parent region (and this problem
region is its splitB). For the next one, note that there is no "info:server" and "info:serverstartcode" columns.
{code}
test1,6512200000,12797841 column=info:splitB, timestamp=1279787160693, value=\x00\x0A6551820000\x00
17114.6466481aa931f8c1fa8 \x00\x00\x01)\xF9BL`@test1,6531790000,1279787158624.8d5490bfc166c687657cb
7622735487a72. 09203bd7d44.\x00\x0A6531790000\x00\x00\x00\x05\x05test1\x00\x00\x00\x00\x
00\x02\x00\x00\x00\x07IS_ROOT\x00\x00\x00\x05false\x00\x00\x00\x07IS_META
\x00\x00\x00\x05false\x00\x00\x00\x01\x08\x07actions\x00\x00\x00\x08\x00\
x00\x00\x0BBLOOMFILTER\x00\x00\x00\x04NONE\x00\x00\x00\x11REPLICATION_SCO
PE\x00\x00\x00\x010\x00\x00\x00\x0BCOMPRESSION\x00\x00\x00\x04NONE\x00\x0
0\x00\x08VERSIONS\x00\x00\x00\x013\x00\x00\x00\x03TTL\x00\x00\x00\x0A2147
483647\x00\x00\x00\x09BLOCKSIZE\x00\x00\x00\x0565536\x00\x00\x00\x09IN_ME
MORY\x00\x00\x00\x05false\x00\x00\x00\x0ABLOCKCACHE\x00\x00\x00\x04true\x
FE\xA0\xFD\xC5
..
test1,6531790000,12797871 column=info:regioninfo, timestamp=1279787160782, value=REGION => {NAME =>
58624.8d5490bfc166c687657 'test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.', STAR
cb09203bd7d44. TKEY => '6531790000', ENDKEY => '6551820000', ENCODED => 8d5490bfc166c687
657cb09203bd7d44, TABLE => {{NAME => 'test1', FAMILIES => [{NAME => 'acti
ons', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', VERSIONS => '3', C
OMPRESSION => 'NONE', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMOR
Y => 'false', BLOCKCACHE => 'true'}]}}
{code}
I think Karthik has a handle on the first part (i.e. why the RS ran into the version mismatch, and aborted opening the region). He'll add details to the JIRA. But what we aren't clear about at this stage is why the base scanner didn't kick in and try to reassign the region.
BTW, HBase "hbck" reported this as well (which was good!):
{code}
Number of Tables: 5
Number of live region servers:92
Number of dead region servers:0
.........
ERROR: Region test1,6512200000,1279784117114.6466481aa931f8c1fa87622735487a72. is not served by any region server but is listed in META to be on server null
ERROR: Region test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44. is not served by any region server but is listed in META to be on server null
{code}
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (HBASE-2866) Region permanently offlined
Posted by "HBase Review Board (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HBASE-2866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12891825#action_12891825 ]
HBase Review Board commented on HBASE-2866:
-------------------------------------------
Message from: "Karthik Ranganathan" <ka...@gmail.com>
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://review.hbase.org/r/380/
-----------------------------------------------------------
(Updated 2010-07-23 15:08:16.286786)
Review request for hbase, stack and Kannan Muthukkaruppan.
Changes
-------
Addressed some comments (re-use variables, removed white spaces)
Summary
-------
Region permanently offlined - if the ZNode is already in the target state, do not update it again.
This addresses bug HBASE-2866.
http://issues.apache.org/jira/browse/HBASE-2866
Diffs (updated)
-----
trunk/src/main/java/org/apache/hadoop/hbase/master/ZKUnassignedWatcher.java 967128
trunk/src/main/java/org/apache/hadoop/hbase/zookeeper/ZooKeeperWrapper.java 967128
Diff: http://review.hbase.org/r/380/diff
Testing
-------
Ran unit tests, went through fine (except TestRowAtomicity, which is known to be failing).
Thanks,
Karthik
> Region permanently offlined
> ----------------------------
>
> Key: HBASE-2866
> URL: https://issues.apache.org/jira/browse/HBASE-2866
> Project: HBase
> Issue Type: Bug
> Reporter: Kannan Muthukkaruppan
> Assignee: Karthik Ranganathan
> Priority: Blocker
> Attachments: master.log
>
>
> After split, master attempts to reassign a region to a region server. Occasionally, such a region can get permanently offlined.
> Master:
> ---------
> {code}
> 2010-07-22 01:26:00,914 INFO org.apache.hadoop.hbase.master.ServerManager: Processing MSG_REPORT_SPLIT_INCLUDES_DAUGHTERS: test1,6512200000,1279784117114.6466481aa931f8c1fa87622735487a72.: Daughters; test1,6512200000,1279787158624.6ead25ae677116cc88fc5420bb39d52e., test1,6531790000,1279787\
> 158624.8d5490bfc166c687657cb09203bd7d44. from test024.test.xyz.com,60020,1279780567744; 1 of 1
> 2010-07-22 01:26:00,935 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Creating UNASSIGNED region 8d5490bfc166c687657cb09203bd7d44 in state = M2ZK_REGION_OFFLINE
> 2010-07-22 01:26:00,935 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Creating UNASSIGNED region 8d5490bfc166c687657cb09203bd7d44 in state = M2ZK_REGION_OFFLINE
> 2010-07-22 01:26:00,945 INFO org.apache.hadoop.hbase.master.RegionManager: Assigning region test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44. to test024.test.xyz.com,60020,1279780567744
> 2010-07-22 01:26:00,949 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: While updating UNASSIGNED region 8d5490bfc166c687657cb09203bd7d44 exists, state = M2ZK_REGION_OFFLINE
> 2010-07-22 01:26:00,954 DEBUG org.apache.hadoop.hbase.master.RegionManager: Created UNASSIGNED zNode test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44. in state M2ZK_REGION_OFFLINE
> {code}
> -------------------
> Region Server:
> {code}
> 2010-07-22 01:26:00,947 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN: test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.
> 2010-07-22 01:26:00,947 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN: test1,6512200000,1279787158624.6ead25ae677116cc88fc5420bb39d52e.
> 2010-07-22 01:26:00,947 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Worker: MSG_REGION_OPEN: test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.
> 2010-07-22 01:26:00,948 DEBUG org.apache.hadoop.hbase.regionserver.RSZookeeperUpdater: Updating ZNode /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44 with [RS2ZK_REGION_OPENING] expected version = 0
> 2010-07-22 01:26:00,952 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event, state: SyncConnected, type: NodeDataChanged, path: /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44
> 2010-07-22 01:26:00,974 WARN org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: <msgstorectrl001.test.xyz.com,msgstorectrl021.test.xyz.com,msgstorectrl041.test.xyz.com,msgstorectrl061.test.xyz.com,msgstorectrl081.ash2.facebook\
> .com:/hbase,test024.test.xyz.com,60020,1279780567744>Failed to write data to ZooKeeper
> org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44
> at org.apache.zookeeper.KeeperException.create(KeeperException.java:106)
> at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
> at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1038)
> at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.writeZNode(ZooKeeperWrapper.java:1062)
> at org.apache.hadoop.hbase.regionserver.RSZookeeperUpdater.updateZKWithEventData(RSZookeeperUpdater.java:161)
> at org.apache.hadoop.hbase.regionserver.RSZookeeperUpdater.startRegionOpenEvent(RSZookeeperUpdater.java:115)
> at org.apache.hadoop.hbase.regionserver.HRegionServer.openRegion(HRegionServer.java:1428)
> at org.apache.hadoop.hbase.regionserver.HRegionServer$Worker.run(HRegionServer.java:1337)
> at java.lang.Thread.run(Thread.java:619)
> 2010-07-22 01:26:00,975 ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: Error opening test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.
> java.io.IOException: org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44
> at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.writeZNode(ZooKeeperWrapper.java:1072)
> {code}
> Meta:
> -----
> Relevant section of META.
> Note that these are the only two entries for the problem region. The first one is the parent region (and this problem
> region is its splitB). For the next one, note that there is no "info:server" and "info:serverstartcode" columns.
> {code}
> test1,6512200000,12797841 column=info:splitB, timestamp=1279787160693, value=\x00\x0A6551820000\x00
> 17114.6466481aa931f8c1fa8 \x00\x00\x01)\xF9BL`@test1,6531790000,1279787158624.8d5490bfc166c687657cb
> 7622735487a72. 09203bd7d44.\x00\x0A6531790000\x00\x00\x00\x05\x05test1\x00\x00\x00\x00\x
> 00\x02\x00\x00\x00\x07IS_ROOT\x00\x00\x00\x05false\x00\x00\x00\x07IS_META
> \x00\x00\x00\x05false\x00\x00\x00\x01\x08\x07actions\x00\x00\x00\x08\x00\
> x00\x00\x0BBLOOMFILTER\x00\x00\x00\x04NONE\x00\x00\x00\x11REPLICATION_SCO
> PE\x00\x00\x00\x010\x00\x00\x00\x0BCOMPRESSION\x00\x00\x00\x04NONE\x00\x0
> 0\x00\x08VERSIONS\x00\x00\x00\x013\x00\x00\x00\x03TTL\x00\x00\x00\x0A2147
> 483647\x00\x00\x00\x09BLOCKSIZE\x00\x00\x00\x0565536\x00\x00\x00\x09IN_ME
> MORY\x00\x00\x00\x05false\x00\x00\x00\x0ABLOCKCACHE\x00\x00\x00\x04true\x
> FE\xA0\xFD\xC5
> ..
> test1,6531790000,12797871 column=info:regioninfo, timestamp=1279787160782, value=REGION => {NAME =>
> 58624.8d5490bfc166c687657 'test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.', STAR
> cb09203bd7d44. TKEY => '6531790000', ENDKEY => '6551820000', ENCODED => 8d5490bfc166c687
> 657cb09203bd7d44, TABLE => {{NAME => 'test1', FAMILIES => [{NAME => 'acti
> ons', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', VERSIONS => '3', C
> OMPRESSION => 'NONE', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMOR
> Y => 'false', BLOCKCACHE => 'true'}]}}
> {code}
> I think Karthik has a handle on the first part (i.e. why the RS ran into the version mismatch, and aborted opening the region). He'll add details to the JIRA. But what we aren't clear about at this stage is why the base scanner didn't kick in and try to reassign the region.
> BTW, HBase "hbck" reported this as well (which was good!):
> {code}
> Number of Tables: 5
> Number of live region servers:92
> Number of dead region servers:0
> .........
> ERROR: Region test1,6512200000,1279784117114.6466481aa931f8c1fa87622735487a72. is not served by any region server but is listed in META to be on server null
> ERROR: Region test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44. is not served by any region server but is listed in META to be on server null
> {code}
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (HBASE-2866) Region permanently offlined
Posted by "HBase Review Board (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HBASE-2866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12891822#action_12891822 ]
HBase Review Board commented on HBASE-2866:
-------------------------------------------
Message from: "Karthik Ranganathan" <ka...@gmail.com>
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://review.hbase.org/r/380/
-----------------------------------------------------------
(Updated 2010-07-23 14:53:37.123965)
Review request for hbase, stack and Kannan Muthukkaruppan.
Changes
-------
Good catch Kannan... updated the diff with comments this time.
Summary
-------
Region permanently offlined - if the ZNode is already in the target state, do not update it again.
This addresses bug HBASE-2866.
http://issues.apache.org/jira/browse/HBASE-2866
Diffs (updated)
-----
trunk/src/main/java/org/apache/hadoop/hbase/master/ZKUnassignedWatcher.java 967128
trunk/src/main/java/org/apache/hadoop/hbase/zookeeper/ZooKeeperWrapper.java 967128
Diff: http://review.hbase.org/r/380/diff
Testing
-------
Ran unit tests, went through fine (except TestRowAtomicity, which is known to be failing).
Thanks,
Karthik
> Region permanently offlined
> ----------------------------
>
> Key: HBASE-2866
> URL: https://issues.apache.org/jira/browse/HBASE-2866
> Project: HBase
> Issue Type: Bug
> Reporter: Kannan Muthukkaruppan
> Assignee: Karthik Ranganathan
> Priority: Blocker
> Attachments: master.log
>
>
> After split, master attempts to reassign a region to a region server. Occasionally, such a region can get permanently offlined.
> Master:
> ---------
> {code}
> 2010-07-22 01:26:00,914 INFO org.apache.hadoop.hbase.master.ServerManager: Processing MSG_REPORT_SPLIT_INCLUDES_DAUGHTERS: test1,6512200000,1279784117114.6466481aa931f8c1fa87622735487a72.: Daughters; test1,6512200000,1279787158624.6ead25ae677116cc88fc5420bb39d52e., test1,6531790000,1279787\
> 158624.8d5490bfc166c687657cb09203bd7d44. from test024.test.xyz.com,60020,1279780567744; 1 of 1
> 2010-07-22 01:26:00,935 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Creating UNASSIGNED region 8d5490bfc166c687657cb09203bd7d44 in state = M2ZK_REGION_OFFLINE
> 2010-07-22 01:26:00,935 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Creating UNASSIGNED region 8d5490bfc166c687657cb09203bd7d44 in state = M2ZK_REGION_OFFLINE
> 2010-07-22 01:26:00,945 INFO org.apache.hadoop.hbase.master.RegionManager: Assigning region test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44. to test024.test.xyz.com,60020,1279780567744
> 2010-07-22 01:26:00,949 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: While updating UNASSIGNED region 8d5490bfc166c687657cb09203bd7d44 exists, state = M2ZK_REGION_OFFLINE
> 2010-07-22 01:26:00,954 DEBUG org.apache.hadoop.hbase.master.RegionManager: Created UNASSIGNED zNode test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44. in state M2ZK_REGION_OFFLINE
> {code}
> -------------------
> Region Server:
> {code}
> 2010-07-22 01:26:00,947 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN: test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.
> 2010-07-22 01:26:00,947 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN: test1,6512200000,1279787158624.6ead25ae677116cc88fc5420bb39d52e.
> 2010-07-22 01:26:00,947 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Worker: MSG_REGION_OPEN: test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.
> 2010-07-22 01:26:00,948 DEBUG org.apache.hadoop.hbase.regionserver.RSZookeeperUpdater: Updating ZNode /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44 with [RS2ZK_REGION_OPENING] expected version = 0
> 2010-07-22 01:26:00,952 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event, state: SyncConnected, type: NodeDataChanged, path: /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44
> 2010-07-22 01:26:00,974 WARN org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: <msgstorectrl001.test.xyz.com,msgstorectrl021.test.xyz.com,msgstorectrl041.test.xyz.com,msgstorectrl061.test.xyz.com,msgstorectrl081.ash2.facebook\
> .com:/hbase,test024.test.xyz.com,60020,1279780567744>Failed to write data to ZooKeeper
> org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44
> at org.apache.zookeeper.KeeperException.create(KeeperException.java:106)
> at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
> at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1038)
> at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.writeZNode(ZooKeeperWrapper.java:1062)
> at org.apache.hadoop.hbase.regionserver.RSZookeeperUpdater.updateZKWithEventData(RSZookeeperUpdater.java:161)
> at org.apache.hadoop.hbase.regionserver.RSZookeeperUpdater.startRegionOpenEvent(RSZookeeperUpdater.java:115)
> at org.apache.hadoop.hbase.regionserver.HRegionServer.openRegion(HRegionServer.java:1428)
> at org.apache.hadoop.hbase.regionserver.HRegionServer$Worker.run(HRegionServer.java:1337)
> at java.lang.Thread.run(Thread.java:619)
> 2010-07-22 01:26:00,975 ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: Error opening test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.
> java.io.IOException: org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44
> at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.writeZNode(ZooKeeperWrapper.java:1072)
> {code}
> Meta:
> -----
> Relevant section of META.
> Note that these are the only two entries for the problem region. The first one is the parent region (and this problem
> region is its splitB). For the next one, note that there is no "info:server" and "info:serverstartcode" columns.
> {code}
> test1,6512200000,12797841 column=info:splitB, timestamp=1279787160693, value=\x00\x0A6551820000\x00
> 17114.6466481aa931f8c1fa8 \x00\x00\x01)\xF9BL`@test1,6531790000,1279787158624.8d5490bfc166c687657cb
> 7622735487a72. 09203bd7d44.\x00\x0A6531790000\x00\x00\x00\x05\x05test1\x00\x00\x00\x00\x
> 00\x02\x00\x00\x00\x07IS_ROOT\x00\x00\x00\x05false\x00\x00\x00\x07IS_META
> \x00\x00\x00\x05false\x00\x00\x00\x01\x08\x07actions\x00\x00\x00\x08\x00\
> x00\x00\x0BBLOOMFILTER\x00\x00\x00\x04NONE\x00\x00\x00\x11REPLICATION_SCO
> PE\x00\x00\x00\x010\x00\x00\x00\x0BCOMPRESSION\x00\x00\x00\x04NONE\x00\x0
> 0\x00\x08VERSIONS\x00\x00\x00\x013\x00\x00\x00\x03TTL\x00\x00\x00\x0A2147
> 483647\x00\x00\x00\x09BLOCKSIZE\x00\x00\x00\x0565536\x00\x00\x00\x09IN_ME
> MORY\x00\x00\x00\x05false\x00\x00\x00\x0ABLOCKCACHE\x00\x00\x00\x04true\x
> FE\xA0\xFD\xC5
> ..
> test1,6531790000,12797871 column=info:regioninfo, timestamp=1279787160782, value=REGION => {NAME =>
> 58624.8d5490bfc166c687657 'test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.', STAR
> cb09203bd7d44. TKEY => '6531790000', ENDKEY => '6551820000', ENCODED => 8d5490bfc166c687
> 657cb09203bd7d44, TABLE => {{NAME => 'test1', FAMILIES => [{NAME => 'acti
> ons', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', VERSIONS => '3', C
> OMPRESSION => 'NONE', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMOR
> Y => 'false', BLOCKCACHE => 'true'}]}}
> {code}
> I think Karthik has a handle on the first part (i.e. why the RS ran into the version mismatch, and aborted opening the region). He'll add details to the JIRA. But what we aren't clear about at this stage is why the base scanner didn't kick in and try to reassign the region.
> BTW, HBase "hbck" reported this as well (which was good!):
> {code}
> Number of Tables: 5
> Number of live region servers:92
> Number of dead region servers:0
> .........
> ERROR: Region test1,6512200000,1279784117114.6466481aa931f8c1fa87622735487a72. is not served by any region server but is listed in META to be on server null
> ERROR: Region test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44. is not served by any region server but is listed in META to be on server null
> {code}
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (HBASE-2866) Region permanently offlined
Posted by "Karthik Ranganathan (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HBASE-2866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12891716#action_12891716 ]
Karthik Ranganathan commented on HBASE-2866:
--------------------------------------------
One thing we can try is to change the state of the region to "CLOSED" in UNASSIGNED in zk...
Alternatively, is it possible to edit META somehow to set the region unassigned?
> Region permanently offlined
> ----------------------------
>
> Key: HBASE-2866
> URL: https://issues.apache.org/jira/browse/HBASE-2866
> Project: HBase
> Issue Type: Bug
> Reporter: Kannan Muthukkaruppan
> Assignee: Karthik Ranganathan
> Priority: Blocker
> Attachments: master.log
>
>
> After split, master attempts to reassign a region to a region server. Occasionally, such a region can get permanently offlined.
> Master:
> ---------
> {code}
> 2010-07-22 01:26:00,914 INFO org.apache.hadoop.hbase.master.ServerManager: Processing MSG_REPORT_SPLIT_INCLUDES_DAUGHTERS: test1,6512200000,1279784117114.6466481aa931f8c1fa87622735487a72.: Daughters; test1,6512200000,1279787158624.6ead25ae677116cc88fc5420bb39d52e., test1,6531790000,1279787\
> 158624.8d5490bfc166c687657cb09203bd7d44. from test024.test.xyz.com,60020,1279780567744; 1 of 1
> 2010-07-22 01:26:00,935 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Creating UNASSIGNED region 8d5490bfc166c687657cb09203bd7d44 in state = M2ZK_REGION_OFFLINE
> 2010-07-22 01:26:00,935 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Creating UNASSIGNED region 8d5490bfc166c687657cb09203bd7d44 in state = M2ZK_REGION_OFFLINE
> 2010-07-22 01:26:00,945 INFO org.apache.hadoop.hbase.master.RegionManager: Assigning region test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44. to test024.test.xyz.com,60020,1279780567744
> 2010-07-22 01:26:00,949 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: While updating UNASSIGNED region 8d5490bfc166c687657cb09203bd7d44 exists, state = M2ZK_REGION_OFFLINE
> 2010-07-22 01:26:00,954 DEBUG org.apache.hadoop.hbase.master.RegionManager: Created UNASSIGNED zNode test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44. in state M2ZK_REGION_OFFLINE
> {code}
> -------------------
> Region Server:
> {code}
> 2010-07-22 01:26:00,947 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN: test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.
> 2010-07-22 01:26:00,947 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN: test1,6512200000,1279787158624.6ead25ae677116cc88fc5420bb39d52e.
> 2010-07-22 01:26:00,947 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Worker: MSG_REGION_OPEN: test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.
> 2010-07-22 01:26:00,948 DEBUG org.apache.hadoop.hbase.regionserver.RSZookeeperUpdater: Updating ZNode /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44 with [RS2ZK_REGION_OPENING] expected version = 0
> 2010-07-22 01:26:00,952 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event, state: SyncConnected, type: NodeDataChanged, path: /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44
> 2010-07-22 01:26:00,974 WARN org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: <msgstorectrl001.test.xyz.com,msgstorectrl021.test.xyz.com,msgstorectrl041.test.xyz.com,msgstorectrl061.test.xyz.com,msgstorectrl081.ash2.facebook\
> .com:/hbase,test024.test.xyz.com,60020,1279780567744>Failed to write data to ZooKeeper
> org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44
> at org.apache.zookeeper.KeeperException.create(KeeperException.java:106)
> at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
> at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1038)
> at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.writeZNode(ZooKeeperWrapper.java:1062)
> at org.apache.hadoop.hbase.regionserver.RSZookeeperUpdater.updateZKWithEventData(RSZookeeperUpdater.java:161)
> at org.apache.hadoop.hbase.regionserver.RSZookeeperUpdater.startRegionOpenEvent(RSZookeeperUpdater.java:115)
> at org.apache.hadoop.hbase.regionserver.HRegionServer.openRegion(HRegionServer.java:1428)
> at org.apache.hadoop.hbase.regionserver.HRegionServer$Worker.run(HRegionServer.java:1337)
> at java.lang.Thread.run(Thread.java:619)
> 2010-07-22 01:26:00,975 ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: Error opening test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.
> java.io.IOException: org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44
> at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.writeZNode(ZooKeeperWrapper.java:1072)
> {code}
> Meta:
> -----
> Relevant section of META.
> Note that these are the only two entries for the problem region. The first one is the parent region (and this problem
> region is its splitB). For the next one, note that there is no "info:server" and "info:serverstartcode" columns.
> {code}
> test1,6512200000,12797841 column=info:splitB, timestamp=1279787160693, value=\x00\x0A6551820000\x00
> 17114.6466481aa931f8c1fa8 \x00\x00\x01)\xF9BL`@test1,6531790000,1279787158624.8d5490bfc166c687657cb
> 7622735487a72. 09203bd7d44.\x00\x0A6531790000\x00\x00\x00\x05\x05test1\x00\x00\x00\x00\x
> 00\x02\x00\x00\x00\x07IS_ROOT\x00\x00\x00\x05false\x00\x00\x00\x07IS_META
> \x00\x00\x00\x05false\x00\x00\x00\x01\x08\x07actions\x00\x00\x00\x08\x00\
> x00\x00\x0BBLOOMFILTER\x00\x00\x00\x04NONE\x00\x00\x00\x11REPLICATION_SCO
> PE\x00\x00\x00\x010\x00\x00\x00\x0BCOMPRESSION\x00\x00\x00\x04NONE\x00\x0
> 0\x00\x08VERSIONS\x00\x00\x00\x013\x00\x00\x00\x03TTL\x00\x00\x00\x0A2147
> 483647\x00\x00\x00\x09BLOCKSIZE\x00\x00\x00\x0565536\x00\x00\x00\x09IN_ME
> MORY\x00\x00\x00\x05false\x00\x00\x00\x0ABLOCKCACHE\x00\x00\x00\x04true\x
> FE\xA0\xFD\xC5
> ..
> test1,6531790000,12797871 column=info:regioninfo, timestamp=1279787160782, value=REGION => {NAME =>
> 58624.8d5490bfc166c687657 'test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.', STAR
> cb09203bd7d44. TKEY => '6531790000', ENDKEY => '6551820000', ENCODED => 8d5490bfc166c687
> 657cb09203bd7d44, TABLE => {{NAME => 'test1', FAMILIES => [{NAME => 'acti
> ons', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', VERSIONS => '3', C
> OMPRESSION => 'NONE', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMOR
> Y => 'false', BLOCKCACHE => 'true'}]}}
> {code}
> I think Karthik has a handle on the first part (i.e. why the RS ran into the version mismatch, and aborted opening the region). He'll add details to the JIRA. But what we aren't clear about at this stage is why the base scanner didn't kick in and try to reassign the region.
> BTW, HBase "hbck" reported this as well (which was good!):
> {code}
> Number of Tables: 5
> Number of live region servers:92
> Number of dead region servers:0
> .........
> ERROR: Region test1,6512200000,1279784117114.6466481aa931f8c1fa87622735487a72. is not served by any region server but is listed in META to be on server null
> ERROR: Region test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44. is not served by any region server but is listed in META to be on server null
> {code}
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (HBASE-2866) Region permanently offlined
Posted by "HBase Review Board (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HBASE-2866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12891818#action_12891818 ]
HBase Review Board commented on HBASE-2866:
-------------------------------------------
Message from: "Jean-Daniel Cryans" <jd...@apache.org>
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://review.hbase.org/r/380/#review471
-----------------------------------------------------------
Some nits, I'm also trying it out (with the update=true fix)
trunk/src/main/java/org/apache/hadoop/hbase/zookeeper/ZooKeeperWrapper.java
<http://review.hbase.org/r/380/#comment1962>
Remove all the trailing white spaces
trunk/src/main/java/org/apache/hadoop/hbase/zookeeper/ZooKeeperWrapper.java
<http://review.hbase.org/r/380/#comment1963>
Reuse curState and newState instead
trunk/src/main/java/org/apache/hadoop/hbase/zookeeper/ZooKeeperWrapper.java
<http://review.hbase.org/r/380/#comment1961>
Simply return when you figure that you should, then you can get rid of "update"
- Jean-Daniel
> Region permanently offlined
> ----------------------------
>
> Key: HBASE-2866
> URL: https://issues.apache.org/jira/browse/HBASE-2866
> Project: HBase
> Issue Type: Bug
> Reporter: Kannan Muthukkaruppan
> Assignee: Karthik Ranganathan
> Priority: Blocker
> Attachments: master.log
>
>
> After split, master attempts to reassign a region to a region server. Occasionally, such a region can get permanently offlined.
> Master:
> ---------
> {code}
> 2010-07-22 01:26:00,914 INFO org.apache.hadoop.hbase.master.ServerManager: Processing MSG_REPORT_SPLIT_INCLUDES_DAUGHTERS: test1,6512200000,1279784117114.6466481aa931f8c1fa87622735487a72.: Daughters; test1,6512200000,1279787158624.6ead25ae677116cc88fc5420bb39d52e., test1,6531790000,1279787\
> 158624.8d5490bfc166c687657cb09203bd7d44. from test024.test.xyz.com,60020,1279780567744; 1 of 1
> 2010-07-22 01:26:00,935 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Creating UNASSIGNED region 8d5490bfc166c687657cb09203bd7d44 in state = M2ZK_REGION_OFFLINE
> 2010-07-22 01:26:00,935 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Creating UNASSIGNED region 8d5490bfc166c687657cb09203bd7d44 in state = M2ZK_REGION_OFFLINE
> 2010-07-22 01:26:00,945 INFO org.apache.hadoop.hbase.master.RegionManager: Assigning region test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44. to test024.test.xyz.com,60020,1279780567744
> 2010-07-22 01:26:00,949 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: While updating UNASSIGNED region 8d5490bfc166c687657cb09203bd7d44 exists, state = M2ZK_REGION_OFFLINE
> 2010-07-22 01:26:00,954 DEBUG org.apache.hadoop.hbase.master.RegionManager: Created UNASSIGNED zNode test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44. in state M2ZK_REGION_OFFLINE
> {code}
> -------------------
> Region Server:
> {code}
> 2010-07-22 01:26:00,947 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN: test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.
> 2010-07-22 01:26:00,947 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN: test1,6512200000,1279787158624.6ead25ae677116cc88fc5420bb39d52e.
> 2010-07-22 01:26:00,947 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Worker: MSG_REGION_OPEN: test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.
> 2010-07-22 01:26:00,948 DEBUG org.apache.hadoop.hbase.regionserver.RSZookeeperUpdater: Updating ZNode /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44 with [RS2ZK_REGION_OPENING] expected version = 0
> 2010-07-22 01:26:00,952 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event, state: SyncConnected, type: NodeDataChanged, path: /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44
> 2010-07-22 01:26:00,974 WARN org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: <msgstorectrl001.test.xyz.com,msgstorectrl021.test.xyz.com,msgstorectrl041.test.xyz.com,msgstorectrl061.test.xyz.com,msgstorectrl081.ash2.facebook\
> .com:/hbase,test024.test.xyz.com,60020,1279780567744>Failed to write data to ZooKeeper
> org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44
> at org.apache.zookeeper.KeeperException.create(KeeperException.java:106)
> at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
> at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1038)
> at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.writeZNode(ZooKeeperWrapper.java:1062)
> at org.apache.hadoop.hbase.regionserver.RSZookeeperUpdater.updateZKWithEventData(RSZookeeperUpdater.java:161)
> at org.apache.hadoop.hbase.regionserver.RSZookeeperUpdater.startRegionOpenEvent(RSZookeeperUpdater.java:115)
> at org.apache.hadoop.hbase.regionserver.HRegionServer.openRegion(HRegionServer.java:1428)
> at org.apache.hadoop.hbase.regionserver.HRegionServer$Worker.run(HRegionServer.java:1337)
> at java.lang.Thread.run(Thread.java:619)
> 2010-07-22 01:26:00,975 ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: Error opening test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.
> java.io.IOException: org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44
> at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.writeZNode(ZooKeeperWrapper.java:1072)
> {code}
> Meta:
> -----
> Relevant section of META.
> Note that these are the only two entries for the problem region. The first one is the parent region (and this problem
> region is its splitB). For the next one, note that there is no "info:server" and "info:serverstartcode" columns.
> {code}
> test1,6512200000,12797841 column=info:splitB, timestamp=1279787160693, value=\x00\x0A6551820000\x00
> 17114.6466481aa931f8c1fa8 \x00\x00\x01)\xF9BL`@test1,6531790000,1279787158624.8d5490bfc166c687657cb
> 7622735487a72. 09203bd7d44.\x00\x0A6531790000\x00\x00\x00\x05\x05test1\x00\x00\x00\x00\x
> 00\x02\x00\x00\x00\x07IS_ROOT\x00\x00\x00\x05false\x00\x00\x00\x07IS_META
> \x00\x00\x00\x05false\x00\x00\x00\x01\x08\x07actions\x00\x00\x00\x08\x00\
> x00\x00\x0BBLOOMFILTER\x00\x00\x00\x04NONE\x00\x00\x00\x11REPLICATION_SCO
> PE\x00\x00\x00\x010\x00\x00\x00\x0BCOMPRESSION\x00\x00\x00\x04NONE\x00\x0
> 0\x00\x08VERSIONS\x00\x00\x00\x013\x00\x00\x00\x03TTL\x00\x00\x00\x0A2147
> 483647\x00\x00\x00\x09BLOCKSIZE\x00\x00\x00\x0565536\x00\x00\x00\x09IN_ME
> MORY\x00\x00\x00\x05false\x00\x00\x00\x0ABLOCKCACHE\x00\x00\x00\x04true\x
> FE\xA0\xFD\xC5
> ..
> test1,6531790000,12797871 column=info:regioninfo, timestamp=1279787160782, value=REGION => {NAME =>
> 58624.8d5490bfc166c687657 'test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.', STAR
> cb09203bd7d44. TKEY => '6531790000', ENDKEY => '6551820000', ENCODED => 8d5490bfc166c687
> 657cb09203bd7d44, TABLE => {{NAME => 'test1', FAMILIES => [{NAME => 'acti
> ons', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', VERSIONS => '3', C
> OMPRESSION => 'NONE', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMOR
> Y => 'false', BLOCKCACHE => 'true'}]}}
> {code}
> I think Karthik has a handle on the first part (i.e. why the RS ran into the version mismatch, and aborted opening the region). He'll add details to the JIRA. But what we aren't clear about at this stage is why the base scanner didn't kick in and try to reassign the region.
> BTW, HBase "hbck" reported this as well (which was good!):
> {code}
> Number of Tables: 5
> Number of live region servers:92
> Number of dead region servers:0
> .........
> ERROR: Region test1,6512200000,1279784117114.6466481aa931f8c1fa87622735487a72. is not served by any region server but is listed in META to be on server null
> ERROR: Region test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44. is not served by any region server but is listed in META to be on server null
> {code}
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (HBASE-2866) Region permanently offlined
Posted by "HBase Review Board (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HBASE-2866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12891806#action_12891806 ]
HBase Review Board commented on HBASE-2866:
-------------------------------------------
Message from: "Karthik Ranganathan" <ka...@gmail.com>
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://review.hbase.org/r/380/
-----------------------------------------------------------
(Updated 2010-07-23 14:26:01.718168)
Review request for hbase, stack and Kannan Muthukkaruppan.
Changes
-------
Adding hbase group
Summary
-------
Region permanently offlined - if the ZNode is already in the target state, do not update it again.
This addresses bug HBASE-2866.
http://issues.apache.org/jira/browse/HBASE-2866
Diffs
-----
trunk/src/main/java/org/apache/hadoop/hbase/master/ZKUnassignedWatcher.java 967128
trunk/src/main/java/org/apache/hadoop/hbase/zookeeper/ZooKeeperWrapper.java 967128
Diff: http://review.hbase.org/r/380/diff
Testing
-------
Ran unit tests, went through fine (except TestRowAtomicity, which is known to be failing).
Thanks,
Karthik
> Region permanently offlined
> ----------------------------
>
> Key: HBASE-2866
> URL: https://issues.apache.org/jira/browse/HBASE-2866
> Project: HBase
> Issue Type: Bug
> Reporter: Kannan Muthukkaruppan
> Assignee: Karthik Ranganathan
> Priority: Blocker
> Attachments: master.log
>
>
> After split, master attempts to reassign a region to a region server. Occasionally, such a region can get permanently offlined.
> Master:
> ---------
> {code}
> 2010-07-22 01:26:00,914 INFO org.apache.hadoop.hbase.master.ServerManager: Processing MSG_REPORT_SPLIT_INCLUDES_DAUGHTERS: test1,6512200000,1279784117114.6466481aa931f8c1fa87622735487a72.: Daughters; test1,6512200000,1279787158624.6ead25ae677116cc88fc5420bb39d52e., test1,6531790000,1279787\
> 158624.8d5490bfc166c687657cb09203bd7d44. from test024.test.xyz.com,60020,1279780567744; 1 of 1
> 2010-07-22 01:26:00,935 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Creating UNASSIGNED region 8d5490bfc166c687657cb09203bd7d44 in state = M2ZK_REGION_OFFLINE
> 2010-07-22 01:26:00,935 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Creating UNASSIGNED region 8d5490bfc166c687657cb09203bd7d44 in state = M2ZK_REGION_OFFLINE
> 2010-07-22 01:26:00,945 INFO org.apache.hadoop.hbase.master.RegionManager: Assigning region test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44. to test024.test.xyz.com,60020,1279780567744
> 2010-07-22 01:26:00,949 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: While updating UNASSIGNED region 8d5490bfc166c687657cb09203bd7d44 exists, state = M2ZK_REGION_OFFLINE
> 2010-07-22 01:26:00,954 DEBUG org.apache.hadoop.hbase.master.RegionManager: Created UNASSIGNED zNode test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44. in state M2ZK_REGION_OFFLINE
> {code}
> -------------------
> Region Server:
> {code}
> 2010-07-22 01:26:00,947 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN: test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.
> 2010-07-22 01:26:00,947 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN: test1,6512200000,1279787158624.6ead25ae677116cc88fc5420bb39d52e.
> 2010-07-22 01:26:00,947 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Worker: MSG_REGION_OPEN: test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.
> 2010-07-22 01:26:00,948 DEBUG org.apache.hadoop.hbase.regionserver.RSZookeeperUpdater: Updating ZNode /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44 with [RS2ZK_REGION_OPENING] expected version = 0
> 2010-07-22 01:26:00,952 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event, state: SyncConnected, type: NodeDataChanged, path: /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44
> 2010-07-22 01:26:00,974 WARN org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: <msgstorectrl001.test.xyz.com,msgstorectrl021.test.xyz.com,msgstorectrl041.test.xyz.com,msgstorectrl061.test.xyz.com,msgstorectrl081.ash2.facebook\
> .com:/hbase,test024.test.xyz.com,60020,1279780567744>Failed to write data to ZooKeeper
> org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44
> at org.apache.zookeeper.KeeperException.create(KeeperException.java:106)
> at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
> at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1038)
> at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.writeZNode(ZooKeeperWrapper.java:1062)
> at org.apache.hadoop.hbase.regionserver.RSZookeeperUpdater.updateZKWithEventData(RSZookeeperUpdater.java:161)
> at org.apache.hadoop.hbase.regionserver.RSZookeeperUpdater.startRegionOpenEvent(RSZookeeperUpdater.java:115)
> at org.apache.hadoop.hbase.regionserver.HRegionServer.openRegion(HRegionServer.java:1428)
> at org.apache.hadoop.hbase.regionserver.HRegionServer$Worker.run(HRegionServer.java:1337)
> at java.lang.Thread.run(Thread.java:619)
> 2010-07-22 01:26:00,975 ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: Error opening test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.
> java.io.IOException: org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44
> at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.writeZNode(ZooKeeperWrapper.java:1072)
> {code}
> Meta:
> -----
> Relevant section of META.
> Note that these are the only two entries for the problem region. The first one is the parent region (and this problem
> region is its splitB). For the next one, note that there is no "info:server" and "info:serverstartcode" columns.
> {code}
> test1,6512200000,12797841 column=info:splitB, timestamp=1279787160693, value=\x00\x0A6551820000\x00
> 17114.6466481aa931f8c1fa8 \x00\x00\x01)\xF9BL`@test1,6531790000,1279787158624.8d5490bfc166c687657cb
> 7622735487a72. 09203bd7d44.\x00\x0A6531790000\x00\x00\x00\x05\x05test1\x00\x00\x00\x00\x
> 00\x02\x00\x00\x00\x07IS_ROOT\x00\x00\x00\x05false\x00\x00\x00\x07IS_META
> \x00\x00\x00\x05false\x00\x00\x00\x01\x08\x07actions\x00\x00\x00\x08\x00\
> x00\x00\x0BBLOOMFILTER\x00\x00\x00\x04NONE\x00\x00\x00\x11REPLICATION_SCO
> PE\x00\x00\x00\x010\x00\x00\x00\x0BCOMPRESSION\x00\x00\x00\x04NONE\x00\x0
> 0\x00\x08VERSIONS\x00\x00\x00\x013\x00\x00\x00\x03TTL\x00\x00\x00\x0A2147
> 483647\x00\x00\x00\x09BLOCKSIZE\x00\x00\x00\x0565536\x00\x00\x00\x09IN_ME
> MORY\x00\x00\x00\x05false\x00\x00\x00\x0ABLOCKCACHE\x00\x00\x00\x04true\x
> FE\xA0\xFD\xC5
> ..
> test1,6531790000,12797871 column=info:regioninfo, timestamp=1279787160782, value=REGION => {NAME =>
> 58624.8d5490bfc166c687657 'test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.', STAR
> cb09203bd7d44. TKEY => '6531790000', ENDKEY => '6551820000', ENCODED => 8d5490bfc166c687
> 657cb09203bd7d44, TABLE => {{NAME => 'test1', FAMILIES => [{NAME => 'acti
> ons', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', VERSIONS => '3', C
> OMPRESSION => 'NONE', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMOR
> Y => 'false', BLOCKCACHE => 'true'}]}}
> {code}
> I think Karthik has a handle on the first part (i.e. why the RS ran into the version mismatch, and aborted opening the region). He'll add details to the JIRA. But what we aren't clear about at this stage is why the base scanner didn't kick in and try to reassign the region.
> BTW, HBase "hbck" reported this as well (which was good!):
> {code}
> Number of Tables: 5
> Number of live region servers:92
> Number of dead region servers:0
> .........
> ERROR: Region test1,6512200000,1279784117114.6466481aa931f8c1fa87622735487a72. is not served by any region server but is listed in META to be on server null
> ERROR: Region test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44. is not served by any region server but is listed in META to be on server null
> {code}
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (HBASE-2866) Region permanently offlined
Posted by "stack (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HBASE-2866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12891766#action_12891766 ]
stack commented on HBASE-2866:
------------------------------
@Kannan
bq. ...So need to look into what was happening with the other region...
Yes. Would help improve hbck tool.
bq. ...And, surprisingly though, ZK UNASSIGNED node shows 4 entries....
Whats up w/ that?
So, jgray and karthik, the current state of code in master is that its in transition still.. we have not yet hit the end point -- that there is still a bunch of change coming. Right? Can we have a fixup for this issue for now? I'd like to roll a new 0.89.x but w/ a fix for this (sounds like you have it Karthik).
@Kannan Restart master is how its addressed currently. Want to add a little tool to UI?
@Karthik A region is unassigned in .META. if it does not have a server and startcode as this one does.
> Region permanently offlined
> ----------------------------
>
> Key: HBASE-2866
> URL: https://issues.apache.org/jira/browse/HBASE-2866
> Project: HBase
> Issue Type: Bug
> Reporter: Kannan Muthukkaruppan
> Assignee: Karthik Ranganathan
> Priority: Blocker
> Attachments: master.log
>
>
> After split, master attempts to reassign a region to a region server. Occasionally, such a region can get permanently offlined.
> Master:
> ---------
> {code}
> 2010-07-22 01:26:00,914 INFO org.apache.hadoop.hbase.master.ServerManager: Processing MSG_REPORT_SPLIT_INCLUDES_DAUGHTERS: test1,6512200000,1279784117114.6466481aa931f8c1fa87622735487a72.: Daughters; test1,6512200000,1279787158624.6ead25ae677116cc88fc5420bb39d52e., test1,6531790000,1279787\
> 158624.8d5490bfc166c687657cb09203bd7d44. from test024.test.xyz.com,60020,1279780567744; 1 of 1
> 2010-07-22 01:26:00,935 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Creating UNASSIGNED region 8d5490bfc166c687657cb09203bd7d44 in state = M2ZK_REGION_OFFLINE
> 2010-07-22 01:26:00,935 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Creating UNASSIGNED region 8d5490bfc166c687657cb09203bd7d44 in state = M2ZK_REGION_OFFLINE
> 2010-07-22 01:26:00,945 INFO org.apache.hadoop.hbase.master.RegionManager: Assigning region test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44. to test024.test.xyz.com,60020,1279780567744
> 2010-07-22 01:26:00,949 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: While updating UNASSIGNED region 8d5490bfc166c687657cb09203bd7d44 exists, state = M2ZK_REGION_OFFLINE
> 2010-07-22 01:26:00,954 DEBUG org.apache.hadoop.hbase.master.RegionManager: Created UNASSIGNED zNode test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44. in state M2ZK_REGION_OFFLINE
> {code}
> -------------------
> Region Server:
> {code}
> 2010-07-22 01:26:00,947 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN: test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.
> 2010-07-22 01:26:00,947 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN: test1,6512200000,1279787158624.6ead25ae677116cc88fc5420bb39d52e.
> 2010-07-22 01:26:00,947 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Worker: MSG_REGION_OPEN: test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.
> 2010-07-22 01:26:00,948 DEBUG org.apache.hadoop.hbase.regionserver.RSZookeeperUpdater: Updating ZNode /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44 with [RS2ZK_REGION_OPENING] expected version = 0
> 2010-07-22 01:26:00,952 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event, state: SyncConnected, type: NodeDataChanged, path: /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44
> 2010-07-22 01:26:00,974 WARN org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: <msgstorectrl001.test.xyz.com,msgstorectrl021.test.xyz.com,msgstorectrl041.test.xyz.com,msgstorectrl061.test.xyz.com,msgstorectrl081.ash2.facebook\
> .com:/hbase,test024.test.xyz.com,60020,1279780567744>Failed to write data to ZooKeeper
> org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44
> at org.apache.zookeeper.KeeperException.create(KeeperException.java:106)
> at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
> at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1038)
> at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.writeZNode(ZooKeeperWrapper.java:1062)
> at org.apache.hadoop.hbase.regionserver.RSZookeeperUpdater.updateZKWithEventData(RSZookeeperUpdater.java:161)
> at org.apache.hadoop.hbase.regionserver.RSZookeeperUpdater.startRegionOpenEvent(RSZookeeperUpdater.java:115)
> at org.apache.hadoop.hbase.regionserver.HRegionServer.openRegion(HRegionServer.java:1428)
> at org.apache.hadoop.hbase.regionserver.HRegionServer$Worker.run(HRegionServer.java:1337)
> at java.lang.Thread.run(Thread.java:619)
> 2010-07-22 01:26:00,975 ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: Error opening test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.
> java.io.IOException: org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44
> at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.writeZNode(ZooKeeperWrapper.java:1072)
> {code}
> Meta:
> -----
> Relevant section of META.
> Note that these are the only two entries for the problem region. The first one is the parent region (and this problem
> region is its splitB). For the next one, note that there is no "info:server" and "info:serverstartcode" columns.
> {code}
> test1,6512200000,12797841 column=info:splitB, timestamp=1279787160693, value=\x00\x0A6551820000\x00
> 17114.6466481aa931f8c1fa8 \x00\x00\x01)\xF9BL`@test1,6531790000,1279787158624.8d5490bfc166c687657cb
> 7622735487a72. 09203bd7d44.\x00\x0A6531790000\x00\x00\x00\x05\x05test1\x00\x00\x00\x00\x
> 00\x02\x00\x00\x00\x07IS_ROOT\x00\x00\x00\x05false\x00\x00\x00\x07IS_META
> \x00\x00\x00\x05false\x00\x00\x00\x01\x08\x07actions\x00\x00\x00\x08\x00\
> x00\x00\x0BBLOOMFILTER\x00\x00\x00\x04NONE\x00\x00\x00\x11REPLICATION_SCO
> PE\x00\x00\x00\x010\x00\x00\x00\x0BCOMPRESSION\x00\x00\x00\x04NONE\x00\x0
> 0\x00\x08VERSIONS\x00\x00\x00\x013\x00\x00\x00\x03TTL\x00\x00\x00\x0A2147
> 483647\x00\x00\x00\x09BLOCKSIZE\x00\x00\x00\x0565536\x00\x00\x00\x09IN_ME
> MORY\x00\x00\x00\x05false\x00\x00\x00\x0ABLOCKCACHE\x00\x00\x00\x04true\x
> FE\xA0\xFD\xC5
> ..
> test1,6531790000,12797871 column=info:regioninfo, timestamp=1279787160782, value=REGION => {NAME =>
> 58624.8d5490bfc166c687657 'test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.', STAR
> cb09203bd7d44. TKEY => '6531790000', ENDKEY => '6551820000', ENCODED => 8d5490bfc166c687
> 657cb09203bd7d44, TABLE => {{NAME => 'test1', FAMILIES => [{NAME => 'acti
> ons', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', VERSIONS => '3', C
> OMPRESSION => 'NONE', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMOR
> Y => 'false', BLOCKCACHE => 'true'}]}}
> {code}
> I think Karthik has a handle on the first part (i.e. why the RS ran into the version mismatch, and aborted opening the region). He'll add details to the JIRA. But what we aren't clear about at this stage is why the base scanner didn't kick in and try to reassign the region.
> BTW, HBase "hbck" reported this as well (which was good!):
> {code}
> Number of Tables: 5
> Number of live region servers:92
> Number of dead region servers:0
> .........
> ERROR: Region test1,6512200000,1279784117114.6466481aa931f8c1fa87622735487a72. is not served by any region server but is listed in META to be on server null
> ERROR: Region test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44. is not served by any region server but is listed in META to be on server null
> {code}
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (HBASE-2866) Region permanently offlined
Posted by "Karthik Ranganathan (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HBASE-2866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12891804#action_12891804 ]
Karthik Ranganathan commented on HBASE-2866:
--------------------------------------------
Stack - just uploaded a review at http://review.hbase.org/r/380/
> Region permanently offlined
> ----------------------------
>
> Key: HBASE-2866
> URL: https://issues.apache.org/jira/browse/HBASE-2866
> Project: HBase
> Issue Type: Bug
> Reporter: Kannan Muthukkaruppan
> Assignee: Karthik Ranganathan
> Priority: Blocker
> Attachments: master.log
>
>
> After split, master attempts to reassign a region to a region server. Occasionally, such a region can get permanently offlined.
> Master:
> ---------
> {code}
> 2010-07-22 01:26:00,914 INFO org.apache.hadoop.hbase.master.ServerManager: Processing MSG_REPORT_SPLIT_INCLUDES_DAUGHTERS: test1,6512200000,1279784117114.6466481aa931f8c1fa87622735487a72.: Daughters; test1,6512200000,1279787158624.6ead25ae677116cc88fc5420bb39d52e., test1,6531790000,1279787\
> 158624.8d5490bfc166c687657cb09203bd7d44. from test024.test.xyz.com,60020,1279780567744; 1 of 1
> 2010-07-22 01:26:00,935 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Creating UNASSIGNED region 8d5490bfc166c687657cb09203bd7d44 in state = M2ZK_REGION_OFFLINE
> 2010-07-22 01:26:00,935 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Creating UNASSIGNED region 8d5490bfc166c687657cb09203bd7d44 in state = M2ZK_REGION_OFFLINE
> 2010-07-22 01:26:00,945 INFO org.apache.hadoop.hbase.master.RegionManager: Assigning region test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44. to test024.test.xyz.com,60020,1279780567744
> 2010-07-22 01:26:00,949 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: While updating UNASSIGNED region 8d5490bfc166c687657cb09203bd7d44 exists, state = M2ZK_REGION_OFFLINE
> 2010-07-22 01:26:00,954 DEBUG org.apache.hadoop.hbase.master.RegionManager: Created UNASSIGNED zNode test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44. in state M2ZK_REGION_OFFLINE
> {code}
> -------------------
> Region Server:
> {code}
> 2010-07-22 01:26:00,947 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN: test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.
> 2010-07-22 01:26:00,947 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN: test1,6512200000,1279787158624.6ead25ae677116cc88fc5420bb39d52e.
> 2010-07-22 01:26:00,947 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Worker: MSG_REGION_OPEN: test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.
> 2010-07-22 01:26:00,948 DEBUG org.apache.hadoop.hbase.regionserver.RSZookeeperUpdater: Updating ZNode /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44 with [RS2ZK_REGION_OPENING] expected version = 0
> 2010-07-22 01:26:00,952 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event, state: SyncConnected, type: NodeDataChanged, path: /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44
> 2010-07-22 01:26:00,974 WARN org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: <msgstorectrl001.test.xyz.com,msgstorectrl021.test.xyz.com,msgstorectrl041.test.xyz.com,msgstorectrl061.test.xyz.com,msgstorectrl081.ash2.facebook\
> .com:/hbase,test024.test.xyz.com,60020,1279780567744>Failed to write data to ZooKeeper
> org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44
> at org.apache.zookeeper.KeeperException.create(KeeperException.java:106)
> at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
> at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1038)
> at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.writeZNode(ZooKeeperWrapper.java:1062)
> at org.apache.hadoop.hbase.regionserver.RSZookeeperUpdater.updateZKWithEventData(RSZookeeperUpdater.java:161)
> at org.apache.hadoop.hbase.regionserver.RSZookeeperUpdater.startRegionOpenEvent(RSZookeeperUpdater.java:115)
> at org.apache.hadoop.hbase.regionserver.HRegionServer.openRegion(HRegionServer.java:1428)
> at org.apache.hadoop.hbase.regionserver.HRegionServer$Worker.run(HRegionServer.java:1337)
> at java.lang.Thread.run(Thread.java:619)
> 2010-07-22 01:26:00,975 ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: Error opening test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.
> java.io.IOException: org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44
> at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.writeZNode(ZooKeeperWrapper.java:1072)
> {code}
> Meta:
> -----
> Relevant section of META.
> Note that these are the only two entries for the problem region. The first one is the parent region (and this problem
> region is its splitB). For the next one, note that there is no "info:server" and "info:serverstartcode" columns.
> {code}
> test1,6512200000,12797841 column=info:splitB, timestamp=1279787160693, value=\x00\x0A6551820000\x00
> 17114.6466481aa931f8c1fa8 \x00\x00\x01)\xF9BL`@test1,6531790000,1279787158624.8d5490bfc166c687657cb
> 7622735487a72. 09203bd7d44.\x00\x0A6531790000\x00\x00\x00\x05\x05test1\x00\x00\x00\x00\x
> 00\x02\x00\x00\x00\x07IS_ROOT\x00\x00\x00\x05false\x00\x00\x00\x07IS_META
> \x00\x00\x00\x05false\x00\x00\x00\x01\x08\x07actions\x00\x00\x00\x08\x00\
> x00\x00\x0BBLOOMFILTER\x00\x00\x00\x04NONE\x00\x00\x00\x11REPLICATION_SCO
> PE\x00\x00\x00\x010\x00\x00\x00\x0BCOMPRESSION\x00\x00\x00\x04NONE\x00\x0
> 0\x00\x08VERSIONS\x00\x00\x00\x013\x00\x00\x00\x03TTL\x00\x00\x00\x0A2147
> 483647\x00\x00\x00\x09BLOCKSIZE\x00\x00\x00\x0565536\x00\x00\x00\x09IN_ME
> MORY\x00\x00\x00\x05false\x00\x00\x00\x0ABLOCKCACHE\x00\x00\x00\x04true\x
> FE\xA0\xFD\xC5
> ..
> test1,6531790000,12797871 column=info:regioninfo, timestamp=1279787160782, value=REGION => {NAME =>
> 58624.8d5490bfc166c687657 'test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.', STAR
> cb09203bd7d44. TKEY => '6531790000', ENDKEY => '6551820000', ENCODED => 8d5490bfc166c687
> 657cb09203bd7d44, TABLE => {{NAME => 'test1', FAMILIES => [{NAME => 'acti
> ons', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', VERSIONS => '3', C
> OMPRESSION => 'NONE', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMOR
> Y => 'false', BLOCKCACHE => 'true'}]}}
> {code}
> I think Karthik has a handle on the first part (i.e. why the RS ran into the version mismatch, and aborted opening the region). He'll add details to the JIRA. But what we aren't clear about at this stage is why the base scanner didn't kick in and try to reassign the region.
> BTW, HBase "hbck" reported this as well (which was good!):
> {code}
> Number of Tables: 5
> Number of live region servers:92
> Number of dead region servers:0
> .........
> ERROR: Region test1,6512200000,1279784117114.6466481aa931f8c1fa87622735487a72. is not served by any region server but is listed in META to be on server null
> ERROR: Region test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44. is not served by any region server but is listed in META to be on server null
> {code}
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Resolved: (HBASE-2866) Region permanently offlined
Posted by "Jean-Daniel Cryans (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HBASE-2866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jean-Daniel Cryans resolved HBASE-2866.
---------------------------------------
Hadoop Flags: [Reviewed]
Fix Version/s: 0.90.0
Resolution: Fixed
Committed to trunk, thanks Karthik!
> Region permanently offlined
> ----------------------------
>
> Key: HBASE-2866
> URL: https://issues.apache.org/jira/browse/HBASE-2866
> Project: HBase
> Issue Type: Bug
> Reporter: Kannan Muthukkaruppan
> Assignee: Karthik Ranganathan
> Priority: Blocker
> Fix For: 0.90.0
>
> Attachments: master.log
>
>
> After split, master attempts to reassign a region to a region server. Occasionally, such a region can get permanently offlined.
> Master:
> ---------
> {code}
> 2010-07-22 01:26:00,914 INFO org.apache.hadoop.hbase.master.ServerManager: Processing MSG_REPORT_SPLIT_INCLUDES_DAUGHTERS: test1,6512200000,1279784117114.6466481aa931f8c1fa87622735487a72.: Daughters; test1,6512200000,1279787158624.6ead25ae677116cc88fc5420bb39d52e., test1,6531790000,1279787\
> 158624.8d5490bfc166c687657cb09203bd7d44. from test024.test.xyz.com,60020,1279780567744; 1 of 1
> 2010-07-22 01:26:00,935 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Creating UNASSIGNED region 8d5490bfc166c687657cb09203bd7d44 in state = M2ZK_REGION_OFFLINE
> 2010-07-22 01:26:00,935 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Creating UNASSIGNED region 8d5490bfc166c687657cb09203bd7d44 in state = M2ZK_REGION_OFFLINE
> 2010-07-22 01:26:00,945 INFO org.apache.hadoop.hbase.master.RegionManager: Assigning region test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44. to test024.test.xyz.com,60020,1279780567744
> 2010-07-22 01:26:00,949 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: While updating UNASSIGNED region 8d5490bfc166c687657cb09203bd7d44 exists, state = M2ZK_REGION_OFFLINE
> 2010-07-22 01:26:00,954 DEBUG org.apache.hadoop.hbase.master.RegionManager: Created UNASSIGNED zNode test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44. in state M2ZK_REGION_OFFLINE
> {code}
> -------------------
> Region Server:
> {code}
> 2010-07-22 01:26:00,947 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN: test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.
> 2010-07-22 01:26:00,947 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN: test1,6512200000,1279787158624.6ead25ae677116cc88fc5420bb39d52e.
> 2010-07-22 01:26:00,947 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Worker: MSG_REGION_OPEN: test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.
> 2010-07-22 01:26:00,948 DEBUG org.apache.hadoop.hbase.regionserver.RSZookeeperUpdater: Updating ZNode /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44 with [RS2ZK_REGION_OPENING] expected version = 0
> 2010-07-22 01:26:00,952 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event, state: SyncConnected, type: NodeDataChanged, path: /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44
> 2010-07-22 01:26:00,974 WARN org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: <msgstorectrl001.test.xyz.com,msgstorectrl021.test.xyz.com,msgstorectrl041.test.xyz.com,msgstorectrl061.test.xyz.com,msgstorectrl081.ash2.facebook\
> .com:/hbase,test024.test.xyz.com,60020,1279780567744>Failed to write data to ZooKeeper
> org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44
> at org.apache.zookeeper.KeeperException.create(KeeperException.java:106)
> at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
> at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1038)
> at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.writeZNode(ZooKeeperWrapper.java:1062)
> at org.apache.hadoop.hbase.regionserver.RSZookeeperUpdater.updateZKWithEventData(RSZookeeperUpdater.java:161)
> at org.apache.hadoop.hbase.regionserver.RSZookeeperUpdater.startRegionOpenEvent(RSZookeeperUpdater.java:115)
> at org.apache.hadoop.hbase.regionserver.HRegionServer.openRegion(HRegionServer.java:1428)
> at org.apache.hadoop.hbase.regionserver.HRegionServer$Worker.run(HRegionServer.java:1337)
> at java.lang.Thread.run(Thread.java:619)
> 2010-07-22 01:26:00,975 ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: Error opening test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.
> java.io.IOException: org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44
> at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.writeZNode(ZooKeeperWrapper.java:1072)
> {code}
> Meta:
> -----
> Relevant section of META.
> Note that these are the only two entries for the problem region. The first one is the parent region (and this problem
> region is its splitB). For the next one, note that there is no "info:server" and "info:serverstartcode" columns.
> {code}
> test1,6512200000,12797841 column=info:splitB, timestamp=1279787160693, value=\x00\x0A6551820000\x00
> 17114.6466481aa931f8c1fa8 \x00\x00\x01)\xF9BL`@test1,6531790000,1279787158624.8d5490bfc166c687657cb
> 7622735487a72. 09203bd7d44.\x00\x0A6531790000\x00\x00\x00\x05\x05test1\x00\x00\x00\x00\x
> 00\x02\x00\x00\x00\x07IS_ROOT\x00\x00\x00\x05false\x00\x00\x00\x07IS_META
> \x00\x00\x00\x05false\x00\x00\x00\x01\x08\x07actions\x00\x00\x00\x08\x00\
> x00\x00\x0BBLOOMFILTER\x00\x00\x00\x04NONE\x00\x00\x00\x11REPLICATION_SCO
> PE\x00\x00\x00\x010\x00\x00\x00\x0BCOMPRESSION\x00\x00\x00\x04NONE\x00\x0
> 0\x00\x08VERSIONS\x00\x00\x00\x013\x00\x00\x00\x03TTL\x00\x00\x00\x0A2147
> 483647\x00\x00\x00\x09BLOCKSIZE\x00\x00\x00\x0565536\x00\x00\x00\x09IN_ME
> MORY\x00\x00\x00\x05false\x00\x00\x00\x0ABLOCKCACHE\x00\x00\x00\x04true\x
> FE\xA0\xFD\xC5
> ..
> test1,6531790000,12797871 column=info:regioninfo, timestamp=1279787160782, value=REGION => {NAME =>
> 58624.8d5490bfc166c687657 'test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.', STAR
> cb09203bd7d44. TKEY => '6531790000', ENDKEY => '6551820000', ENCODED => 8d5490bfc166c687
> 657cb09203bd7d44, TABLE => {{NAME => 'test1', FAMILIES => [{NAME => 'acti
> ons', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', VERSIONS => '3', C
> OMPRESSION => 'NONE', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMOR
> Y => 'false', BLOCKCACHE => 'true'}]}}
> {code}
> I think Karthik has a handle on the first part (i.e. why the RS ran into the version mismatch, and aborted opening the region). He'll add details to the JIRA. But what we aren't clear about at this stage is why the base scanner didn't kick in and try to reassign the region.
> BTW, HBase "hbck" reported this as well (which was good!):
> {code}
> Number of Tables: 5
> Number of live region servers:92
> Number of dead region servers:0
> .........
> ERROR: Region test1,6512200000,1279784117114.6466481aa931f8c1fa87622735487a72. is not served by any region server but is listed in META to be on server null
> ERROR: Region test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44. is not served by any region server but is listed in META to be on server null
> {code}
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (HBASE-2866) Region permanently offlined
Posted by "HBase Review Board (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HBASE-2866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12891832#action_12891832 ]
HBase Review Board commented on HBASE-2866:
-------------------------------------------
Message from: "Jean-Daniel Cryans" <jd...@apache.org>
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://review.hbase.org/r/380/#review473
-----------------------------------------------------------
Ship it!
+1 LGTM and TestAdmin passes on my machine without flinching
- Jean-Daniel
> Region permanently offlined
> ----------------------------
>
> Key: HBASE-2866
> URL: https://issues.apache.org/jira/browse/HBASE-2866
> Project: HBase
> Issue Type: Bug
> Reporter: Kannan Muthukkaruppan
> Assignee: Karthik Ranganathan
> Priority: Blocker
> Attachments: master.log
>
>
> After split, master attempts to reassign a region to a region server. Occasionally, such a region can get permanently offlined.
> Master:
> ---------
> {code}
> 2010-07-22 01:26:00,914 INFO org.apache.hadoop.hbase.master.ServerManager: Processing MSG_REPORT_SPLIT_INCLUDES_DAUGHTERS: test1,6512200000,1279784117114.6466481aa931f8c1fa87622735487a72.: Daughters; test1,6512200000,1279787158624.6ead25ae677116cc88fc5420bb39d52e., test1,6531790000,1279787\
> 158624.8d5490bfc166c687657cb09203bd7d44. from test024.test.xyz.com,60020,1279780567744; 1 of 1
> 2010-07-22 01:26:00,935 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Creating UNASSIGNED region 8d5490bfc166c687657cb09203bd7d44 in state = M2ZK_REGION_OFFLINE
> 2010-07-22 01:26:00,935 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Creating UNASSIGNED region 8d5490bfc166c687657cb09203bd7d44 in state = M2ZK_REGION_OFFLINE
> 2010-07-22 01:26:00,945 INFO org.apache.hadoop.hbase.master.RegionManager: Assigning region test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44. to test024.test.xyz.com,60020,1279780567744
> 2010-07-22 01:26:00,949 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: While updating UNASSIGNED region 8d5490bfc166c687657cb09203bd7d44 exists, state = M2ZK_REGION_OFFLINE
> 2010-07-22 01:26:00,954 DEBUG org.apache.hadoop.hbase.master.RegionManager: Created UNASSIGNED zNode test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44. in state M2ZK_REGION_OFFLINE
> {code}
> -------------------
> Region Server:
> {code}
> 2010-07-22 01:26:00,947 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN: test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.
> 2010-07-22 01:26:00,947 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN: test1,6512200000,1279787158624.6ead25ae677116cc88fc5420bb39d52e.
> 2010-07-22 01:26:00,947 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Worker: MSG_REGION_OPEN: test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.
> 2010-07-22 01:26:00,948 DEBUG org.apache.hadoop.hbase.regionserver.RSZookeeperUpdater: Updating ZNode /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44 with [RS2ZK_REGION_OPENING] expected version = 0
> 2010-07-22 01:26:00,952 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event, state: SyncConnected, type: NodeDataChanged, path: /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44
> 2010-07-22 01:26:00,974 WARN org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: <msgstorectrl001.test.xyz.com,msgstorectrl021.test.xyz.com,msgstorectrl041.test.xyz.com,msgstorectrl061.test.xyz.com,msgstorectrl081.ash2.facebook\
> .com:/hbase,test024.test.xyz.com,60020,1279780567744>Failed to write data to ZooKeeper
> org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44
> at org.apache.zookeeper.KeeperException.create(KeeperException.java:106)
> at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
> at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1038)
> at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.writeZNode(ZooKeeperWrapper.java:1062)
> at org.apache.hadoop.hbase.regionserver.RSZookeeperUpdater.updateZKWithEventData(RSZookeeperUpdater.java:161)
> at org.apache.hadoop.hbase.regionserver.RSZookeeperUpdater.startRegionOpenEvent(RSZookeeperUpdater.java:115)
> at org.apache.hadoop.hbase.regionserver.HRegionServer.openRegion(HRegionServer.java:1428)
> at org.apache.hadoop.hbase.regionserver.HRegionServer$Worker.run(HRegionServer.java:1337)
> at java.lang.Thread.run(Thread.java:619)
> 2010-07-22 01:26:00,975 ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: Error opening test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.
> java.io.IOException: org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44
> at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.writeZNode(ZooKeeperWrapper.java:1072)
> {code}
> Meta:
> -----
> Relevant section of META.
> Note that these are the only two entries for the problem region. The first one is the parent region (and this problem
> region is its splitB). For the next one, note that there is no "info:server" and "info:serverstartcode" columns.
> {code}
> test1,6512200000,12797841 column=info:splitB, timestamp=1279787160693, value=\x00\x0A6551820000\x00
> 17114.6466481aa931f8c1fa8 \x00\x00\x01)\xF9BL`@test1,6531790000,1279787158624.8d5490bfc166c687657cb
> 7622735487a72. 09203bd7d44.\x00\x0A6531790000\x00\x00\x00\x05\x05test1\x00\x00\x00\x00\x
> 00\x02\x00\x00\x00\x07IS_ROOT\x00\x00\x00\x05false\x00\x00\x00\x07IS_META
> \x00\x00\x00\x05false\x00\x00\x00\x01\x08\x07actions\x00\x00\x00\x08\x00\
> x00\x00\x0BBLOOMFILTER\x00\x00\x00\x04NONE\x00\x00\x00\x11REPLICATION_SCO
> PE\x00\x00\x00\x010\x00\x00\x00\x0BCOMPRESSION\x00\x00\x00\x04NONE\x00\x0
> 0\x00\x08VERSIONS\x00\x00\x00\x013\x00\x00\x00\x03TTL\x00\x00\x00\x0A2147
> 483647\x00\x00\x00\x09BLOCKSIZE\x00\x00\x00\x0565536\x00\x00\x00\x09IN_ME
> MORY\x00\x00\x00\x05false\x00\x00\x00\x0ABLOCKCACHE\x00\x00\x00\x04true\x
> FE\xA0\xFD\xC5
> ..
> test1,6531790000,12797871 column=info:regioninfo, timestamp=1279787160782, value=REGION => {NAME =>
> 58624.8d5490bfc166c687657 'test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.', STAR
> cb09203bd7d44. TKEY => '6531790000', ENDKEY => '6551820000', ENCODED => 8d5490bfc166c687
> 657cb09203bd7d44, TABLE => {{NAME => 'test1', FAMILIES => [{NAME => 'acti
> ons', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', VERSIONS => '3', C
> OMPRESSION => 'NONE', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMOR
> Y => 'false', BLOCKCACHE => 'true'}]}}
> {code}
> I think Karthik has a handle on the first part (i.e. why the RS ran into the version mismatch, and aborted opening the region). He'll add details to the JIRA. But what we aren't clear about at this stage is why the base scanner didn't kick in and try to reassign the region.
> BTW, HBase "hbck" reported this as well (which was good!):
> {code}
> Number of Tables: 5
> Number of live region servers:92
> Number of dead region servers:0
> .........
> ERROR: Region test1,6512200000,1279784117114.6466481aa931f8c1fa87622735487a72. is not served by any region server but is listed in META to be on server null
> ERROR: Region test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44. is not served by any region server but is listed in META to be on server null
> {code}
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (HBASE-2866) Region permanently offlined
Posted by "stack (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HBASE-2866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12891483#action_12891483 ]
stack commented on HBASE-2866:
------------------------------
@Kannan Thanks. Looking at master and at code, my thought is that the fixup code didn't run because that region is stuck in transition. Here is where we'd skip out starting at about #562 in BaseScanner:
{code}
synchronized (this.master.getRegionManager()) {
/* We don't assign regions that are offline, in transition or were on
* a dead server. Regions that were on a dead server will get reassigned
* by ProcessServerShutdown
*/
if (info.isOffline() ||
this.master.getRegionManager().regionIsInTransition(info.getRegionNameAsString()) ||
// St.Ack ^^^^^^^^^^ My guess is we are in here^^^^^^^^^
(serverName != null && this.master.getServerManager().isDead(serverName))) {
return;
}
{code}
I think 'status' in shell:
{code}
hbase(main):003:0> status 'detailed'
version 0.89.0-SNAPSHOT
0 regionsInTransition
1 live servers
192.168.1.157:49248 1279864501042
requests=0, regions=3, usedHeap=32, maxHeap=994
.META.,,1
stores=2, storefiles=0, storefileSizeMB=0, memstoreSizeMB=0, storefileIndexSizeMB=0
x,,1279864569260.65c4857477eb31bff0fafae4797a90d8.
stores=1, storefiles=0, storefileSizeMB=0, memstoreSizeMB=0, storefileIndexSizeMB=0
-ROOT-,,0
stores=1, storefiles=1, storefileSizeMB=0, memstoreSizeMB=0, storefileIndexSizeMB=0
0 dead servers
{code}
@Karthik Give me a clue as to what you are thinking and I'll have a go at fixing this one if you don't have the time boss.
> Region permanently offlined
> ----------------------------
>
> Key: HBASE-2866
> URL: https://issues.apache.org/jira/browse/HBASE-2866
> Project: HBase
> Issue Type: Bug
> Reporter: Kannan Muthukkaruppan
> Assignee: Karthik Ranganathan
> Priority: Blocker
> Attachments: master.log
>
>
> After split, master attempts to reassign a region to a region server. Occasionally, such a region can get permanently offlined.
> Master:
> ---------
> {code}
> 2010-07-22 01:26:00,914 INFO org.apache.hadoop.hbase.master.ServerManager: Processing MSG_REPORT_SPLIT_INCLUDES_DAUGHTERS: test1,6512200000,1279784117114.6466481aa931f8c1fa87622735487a72.: Daughters; test1,6512200000,1279787158624.6ead25ae677116cc88fc5420bb39d52e., test1,6531790000,1279787\
> 158624.8d5490bfc166c687657cb09203bd7d44. from test024.test.xyz.com,60020,1279780567744; 1 of 1
> 2010-07-22 01:26:00,935 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Creating UNASSIGNED region 8d5490bfc166c687657cb09203bd7d44 in state = M2ZK_REGION_OFFLINE
> 2010-07-22 01:26:00,935 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Creating UNASSIGNED region 8d5490bfc166c687657cb09203bd7d44 in state = M2ZK_REGION_OFFLINE
> 2010-07-22 01:26:00,945 INFO org.apache.hadoop.hbase.master.RegionManager: Assigning region test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44. to test024.test.xyz.com,60020,1279780567744
> 2010-07-22 01:26:00,949 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: While updating UNASSIGNED region 8d5490bfc166c687657cb09203bd7d44 exists, state = M2ZK_REGION_OFFLINE
> 2010-07-22 01:26:00,954 DEBUG org.apache.hadoop.hbase.master.RegionManager: Created UNASSIGNED zNode test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44. in state M2ZK_REGION_OFFLINE
> {code}
> -------------------
> Region Server:
> {code}
> 2010-07-22 01:26:00,947 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN: test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.
> 2010-07-22 01:26:00,947 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN: test1,6512200000,1279787158624.6ead25ae677116cc88fc5420bb39d52e.
> 2010-07-22 01:26:00,947 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Worker: MSG_REGION_OPEN: test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.
> 2010-07-22 01:26:00,948 DEBUG org.apache.hadoop.hbase.regionserver.RSZookeeperUpdater: Updating ZNode /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44 with [RS2ZK_REGION_OPENING] expected version = 0
> 2010-07-22 01:26:00,952 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event, state: SyncConnected, type: NodeDataChanged, path: /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44
> 2010-07-22 01:26:00,974 WARN org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: <msgstorectrl001.test.xyz.com,msgstorectrl021.test.xyz.com,msgstorectrl041.test.xyz.com,msgstorectrl061.test.xyz.com,msgstorectrl081.ash2.facebook\
> .com:/hbase,test024.test.xyz.com,60020,1279780567744>Failed to write data to ZooKeeper
> org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44
> at org.apache.zookeeper.KeeperException.create(KeeperException.java:106)
> at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
> at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1038)
> at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.writeZNode(ZooKeeperWrapper.java:1062)
> at org.apache.hadoop.hbase.regionserver.RSZookeeperUpdater.updateZKWithEventData(RSZookeeperUpdater.java:161)
> at org.apache.hadoop.hbase.regionserver.RSZookeeperUpdater.startRegionOpenEvent(RSZookeeperUpdater.java:115)
> at org.apache.hadoop.hbase.regionserver.HRegionServer.openRegion(HRegionServer.java:1428)
> at org.apache.hadoop.hbase.regionserver.HRegionServer$Worker.run(HRegionServer.java:1337)
> at java.lang.Thread.run(Thread.java:619)
> 2010-07-22 01:26:00,975 ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: Error opening test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.
> java.io.IOException: org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44
> at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.writeZNode(ZooKeeperWrapper.java:1072)
> {code}
> Meta:
> -----
> Relevant section of META.
> Note that these are the only two entries for the problem region. The first one is the parent region (and this problem
> region is its splitB). For the next one, note that there is no "info:server" and "info:serverstartcode" columns.
> {code}
> test1,6512200000,12797841 column=info:splitB, timestamp=1279787160693, value=\x00\x0A6551820000\x00
> 17114.6466481aa931f8c1fa8 \x00\x00\x01)\xF9BL`@test1,6531790000,1279787158624.8d5490bfc166c687657cb
> 7622735487a72. 09203bd7d44.\x00\x0A6531790000\x00\x00\x00\x05\x05test1\x00\x00\x00\x00\x
> 00\x02\x00\x00\x00\x07IS_ROOT\x00\x00\x00\x05false\x00\x00\x00\x07IS_META
> \x00\x00\x00\x05false\x00\x00\x00\x01\x08\x07actions\x00\x00\x00\x08\x00\
> x00\x00\x0BBLOOMFILTER\x00\x00\x00\x04NONE\x00\x00\x00\x11REPLICATION_SCO
> PE\x00\x00\x00\x010\x00\x00\x00\x0BCOMPRESSION\x00\x00\x00\x04NONE\x00\x0
> 0\x00\x08VERSIONS\x00\x00\x00\x013\x00\x00\x00\x03TTL\x00\x00\x00\x0A2147
> 483647\x00\x00\x00\x09BLOCKSIZE\x00\x00\x00\x0565536\x00\x00\x00\x09IN_ME
> MORY\x00\x00\x00\x05false\x00\x00\x00\x0ABLOCKCACHE\x00\x00\x00\x04true\x
> FE\xA0\xFD\xC5
> ..
> test1,6531790000,12797871 column=info:regioninfo, timestamp=1279787160782, value=REGION => {NAME =>
> 58624.8d5490bfc166c687657 'test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.', STAR
> cb09203bd7d44. TKEY => '6531790000', ENDKEY => '6551820000', ENCODED => 8d5490bfc166c687
> 657cb09203bd7d44, TABLE => {{NAME => 'test1', FAMILIES => [{NAME => 'acti
> ons', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', VERSIONS => '3', C
> OMPRESSION => 'NONE', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMOR
> Y => 'false', BLOCKCACHE => 'true'}]}}
> {code}
> I think Karthik has a handle on the first part (i.e. why the RS ran into the version mismatch, and aborted opening the region). He'll add details to the JIRA. But what we aren't clear about at this stage is why the base scanner didn't kick in and try to reassign the region.
> BTW, HBase "hbck" reported this as well (which was good!):
> {code}
> Number of Tables: 5
> Number of live region servers:92
> Number of dead region servers:0
> .........
> ERROR: Region test1,6512200000,1279784117114.6466481aa931f8c1fa87622735487a72. is not served by any region server but is listed in META to be on server null
> ERROR: Region test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44. is not served by any region server but is listed in META to be on server null
> {code}
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (HBASE-2866) Region permanently offlined
Posted by "HBase Review Board (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HBASE-2866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12891814#action_12891814 ]
HBase Review Board commented on HBASE-2866:
-------------------------------------------
Message from: "Kannan Muthukkaruppan" <ka...@facebook.com>
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://review.hbase.org/r/380/#review470
-----------------------------------------------------------
trunk/src/main/java/org/apache/hadoop/hbase/zookeeper/ZooKeeperWrapper.java
<http://review.hbase.org/r/380/#comment1960>
should this be:
update = true;
- Kannan
> Region permanently offlined
> ----------------------------
>
> Key: HBASE-2866
> URL: https://issues.apache.org/jira/browse/HBASE-2866
> Project: HBase
> Issue Type: Bug
> Reporter: Kannan Muthukkaruppan
> Assignee: Karthik Ranganathan
> Priority: Blocker
> Attachments: master.log
>
>
> After split, master attempts to reassign a region to a region server. Occasionally, such a region can get permanently offlined.
> Master:
> ---------
> {code}
> 2010-07-22 01:26:00,914 INFO org.apache.hadoop.hbase.master.ServerManager: Processing MSG_REPORT_SPLIT_INCLUDES_DAUGHTERS: test1,6512200000,1279784117114.6466481aa931f8c1fa87622735487a72.: Daughters; test1,6512200000,1279787158624.6ead25ae677116cc88fc5420bb39d52e., test1,6531790000,1279787\
> 158624.8d5490bfc166c687657cb09203bd7d44. from test024.test.xyz.com,60020,1279780567744; 1 of 1
> 2010-07-22 01:26:00,935 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Creating UNASSIGNED region 8d5490bfc166c687657cb09203bd7d44 in state = M2ZK_REGION_OFFLINE
> 2010-07-22 01:26:00,935 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Creating UNASSIGNED region 8d5490bfc166c687657cb09203bd7d44 in state = M2ZK_REGION_OFFLINE
> 2010-07-22 01:26:00,945 INFO org.apache.hadoop.hbase.master.RegionManager: Assigning region test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44. to test024.test.xyz.com,60020,1279780567744
> 2010-07-22 01:26:00,949 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: While updating UNASSIGNED region 8d5490bfc166c687657cb09203bd7d44 exists, state = M2ZK_REGION_OFFLINE
> 2010-07-22 01:26:00,954 DEBUG org.apache.hadoop.hbase.master.RegionManager: Created UNASSIGNED zNode test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44. in state M2ZK_REGION_OFFLINE
> {code}
> -------------------
> Region Server:
> {code}
> 2010-07-22 01:26:00,947 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN: test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.
> 2010-07-22 01:26:00,947 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN: test1,6512200000,1279787158624.6ead25ae677116cc88fc5420bb39d52e.
> 2010-07-22 01:26:00,947 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Worker: MSG_REGION_OPEN: test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.
> 2010-07-22 01:26:00,948 DEBUG org.apache.hadoop.hbase.regionserver.RSZookeeperUpdater: Updating ZNode /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44 with [RS2ZK_REGION_OPENING] expected version = 0
> 2010-07-22 01:26:00,952 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event, state: SyncConnected, type: NodeDataChanged, path: /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44
> 2010-07-22 01:26:00,974 WARN org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: <msgstorectrl001.test.xyz.com,msgstorectrl021.test.xyz.com,msgstorectrl041.test.xyz.com,msgstorectrl061.test.xyz.com,msgstorectrl081.ash2.facebook\
> .com:/hbase,test024.test.xyz.com,60020,1279780567744>Failed to write data to ZooKeeper
> org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44
> at org.apache.zookeeper.KeeperException.create(KeeperException.java:106)
> at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
> at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1038)
> at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.writeZNode(ZooKeeperWrapper.java:1062)
> at org.apache.hadoop.hbase.regionserver.RSZookeeperUpdater.updateZKWithEventData(RSZookeeperUpdater.java:161)
> at org.apache.hadoop.hbase.regionserver.RSZookeeperUpdater.startRegionOpenEvent(RSZookeeperUpdater.java:115)
> at org.apache.hadoop.hbase.regionserver.HRegionServer.openRegion(HRegionServer.java:1428)
> at org.apache.hadoop.hbase.regionserver.HRegionServer$Worker.run(HRegionServer.java:1337)
> at java.lang.Thread.run(Thread.java:619)
> 2010-07-22 01:26:00,975 ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: Error opening test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.
> java.io.IOException: org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44
> at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.writeZNode(ZooKeeperWrapper.java:1072)
> {code}
> Meta:
> -----
> Relevant section of META.
> Note that these are the only two entries for the problem region. The first one is the parent region (and this problem
> region is its splitB). For the next one, note that there is no "info:server" and "info:serverstartcode" columns.
> {code}
> test1,6512200000,12797841 column=info:splitB, timestamp=1279787160693, value=\x00\x0A6551820000\x00
> 17114.6466481aa931f8c1fa8 \x00\x00\x01)\xF9BL`@test1,6531790000,1279787158624.8d5490bfc166c687657cb
> 7622735487a72. 09203bd7d44.\x00\x0A6531790000\x00\x00\x00\x05\x05test1\x00\x00\x00\x00\x
> 00\x02\x00\x00\x00\x07IS_ROOT\x00\x00\x00\x05false\x00\x00\x00\x07IS_META
> \x00\x00\x00\x05false\x00\x00\x00\x01\x08\x07actions\x00\x00\x00\x08\x00\
> x00\x00\x0BBLOOMFILTER\x00\x00\x00\x04NONE\x00\x00\x00\x11REPLICATION_SCO
> PE\x00\x00\x00\x010\x00\x00\x00\x0BCOMPRESSION\x00\x00\x00\x04NONE\x00\x0
> 0\x00\x08VERSIONS\x00\x00\x00\x013\x00\x00\x00\x03TTL\x00\x00\x00\x0A2147
> 483647\x00\x00\x00\x09BLOCKSIZE\x00\x00\x00\x0565536\x00\x00\x00\x09IN_ME
> MORY\x00\x00\x00\x05false\x00\x00\x00\x0ABLOCKCACHE\x00\x00\x00\x04true\x
> FE\xA0\xFD\xC5
> ..
> test1,6531790000,12797871 column=info:regioninfo, timestamp=1279787160782, value=REGION => {NAME =>
> 58624.8d5490bfc166c687657 'test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.', STAR
> cb09203bd7d44. TKEY => '6531790000', ENDKEY => '6551820000', ENCODED => 8d5490bfc166c687
> 657cb09203bd7d44, TABLE => {{NAME => 'test1', FAMILIES => [{NAME => 'acti
> ons', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', VERSIONS => '3', C
> OMPRESSION => 'NONE', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMOR
> Y => 'false', BLOCKCACHE => 'true'}]}}
> {code}
> I think Karthik has a handle on the first part (i.e. why the RS ran into the version mismatch, and aborted opening the region). He'll add details to the JIRA. But what we aren't clear about at this stage is why the base scanner didn't kick in and try to reassign the region.
> BTW, HBase "hbck" reported this as well (which was good!):
> {code}
> Number of Tables: 5
> Number of live region servers:92
> Number of dead region servers:0
> .........
> ERROR: Region test1,6512200000,1279784117114.6466481aa931f8c1fa87622735487a72. is not served by any region server but is listed in META to be on server null
> ERROR: Region test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44. is not served by any region server but is listed in META to be on server null
> {code}
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Assigned: (HBASE-2866) Region permanently offlined
Posted by "Kannan Muthukkaruppan (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HBASE-2866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Kannan Muthukkaruppan reassigned HBASE-2866:
--------------------------------------------
Assignee: Karthik Ranganathan
> Region permanently offlined
> ----------------------------
>
> Key: HBASE-2866
> URL: https://issues.apache.org/jira/browse/HBASE-2866
> Project: HBase
> Issue Type: Bug
> Reporter: Kannan Muthukkaruppan
> Assignee: Karthik Ranganathan
> Priority: Blocker
>
> After split, master attempts to reassign a region to a region server. Occasionally, such a region can get permanently offlined.
> Master:
> ---------
> {code}
> 2010-07-22 01:26:00,914 INFO org.apache.hadoop.hbase.master.ServerManager: Processing MSG_REPORT_SPLIT_INCLUDES_DAUGHTERS: test1,6512200000,1279784117114.6466481aa931f8c1fa87622735487a72.: Daughters; test1,6512200000,1279787158624.6ead25ae677116cc88fc5420bb39d52e., test1,6531790000,1279787\
> 158624.8d5490bfc166c687657cb09203bd7d44. from test024.test.xyz.com,60020,1279780567744; 1 of 1
> 2010-07-22 01:26:00,935 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Creating UNASSIGNED region 8d5490bfc166c687657cb09203bd7d44 in state = M2ZK_REGION_OFFLINE
> 2010-07-22 01:26:00,935 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Creating UNASSIGNED region 8d5490bfc166c687657cb09203bd7d44 in state = M2ZK_REGION_OFFLINE
> 2010-07-22 01:26:00,945 INFO org.apache.hadoop.hbase.master.RegionManager: Assigning region test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44. to test024.test.xyz.com,60020,1279780567744
> 2010-07-22 01:26:00,949 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: While updating UNASSIGNED region 8d5490bfc166c687657cb09203bd7d44 exists, state = M2ZK_REGION_OFFLINE
> 2010-07-22 01:26:00,954 DEBUG org.apache.hadoop.hbase.master.RegionManager: Created UNASSIGNED zNode test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44. in state M2ZK_REGION_OFFLINE
> {code}
> -------------------
> Region Server:
> {code}
> 2010-07-22 01:26:00,947 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN: test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.
> 2010-07-22 01:26:00,947 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN: test1,6512200000,1279787158624.6ead25ae677116cc88fc5420bb39d52e.
> 2010-07-22 01:26:00,947 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Worker: MSG_REGION_OPEN: test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.
> 2010-07-22 01:26:00,948 DEBUG org.apache.hadoop.hbase.regionserver.RSZookeeperUpdater: Updating ZNode /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44 with [RS2ZK_REGION_OPENING] expected version = 0
> 2010-07-22 01:26:00,952 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event, state: SyncConnected, type: NodeDataChanged, path: /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44
> 2010-07-22 01:26:00,974 WARN org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: <msgstorectrl001.test.xyz.com,msgstorectrl021.test.xyz.com,msgstorectrl041.test.xyz.com,msgstorectrl061.test.xyz.com,msgstorectrl081.ash2.facebook\
> .com:/hbase,test024.test.xyz.com,60020,1279780567744>Failed to write data to ZooKeeper
> org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44
> at org.apache.zookeeper.KeeperException.create(KeeperException.java:106)
> at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
> at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1038)
> at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.writeZNode(ZooKeeperWrapper.java:1062)
> at org.apache.hadoop.hbase.regionserver.RSZookeeperUpdater.updateZKWithEventData(RSZookeeperUpdater.java:161)
> at org.apache.hadoop.hbase.regionserver.RSZookeeperUpdater.startRegionOpenEvent(RSZookeeperUpdater.java:115)
> at org.apache.hadoop.hbase.regionserver.HRegionServer.openRegion(HRegionServer.java:1428)
> at org.apache.hadoop.hbase.regionserver.HRegionServer$Worker.run(HRegionServer.java:1337)
> at java.lang.Thread.run(Thread.java:619)
> 2010-07-22 01:26:00,975 ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: Error opening test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.
> java.io.IOException: org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44
> at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.writeZNode(ZooKeeperWrapper.java:1072)
> {code}
> Meta:
> -----
> Relevant section of META.
> Note that these are the only two entries for the problem region. The first one is the parent region (and this problem
> region is its splitB). For the next one, note that there is no "info:server" and "info:serverstartcode" columns.
> {code}
> test1,6512200000,12797841 column=info:splitB, timestamp=1279787160693, value=\x00\x0A6551820000\x00
> 17114.6466481aa931f8c1fa8 \x00\x00\x01)\xF9BL`@test1,6531790000,1279787158624.8d5490bfc166c687657cb
> 7622735487a72. 09203bd7d44.\x00\x0A6531790000\x00\x00\x00\x05\x05test1\x00\x00\x00\x00\x
> 00\x02\x00\x00\x00\x07IS_ROOT\x00\x00\x00\x05false\x00\x00\x00\x07IS_META
> \x00\x00\x00\x05false\x00\x00\x00\x01\x08\x07actions\x00\x00\x00\x08\x00\
> x00\x00\x0BBLOOMFILTER\x00\x00\x00\x04NONE\x00\x00\x00\x11REPLICATION_SCO
> PE\x00\x00\x00\x010\x00\x00\x00\x0BCOMPRESSION\x00\x00\x00\x04NONE\x00\x0
> 0\x00\x08VERSIONS\x00\x00\x00\x013\x00\x00\x00\x03TTL\x00\x00\x00\x0A2147
> 483647\x00\x00\x00\x09BLOCKSIZE\x00\x00\x00\x0565536\x00\x00\x00\x09IN_ME
> MORY\x00\x00\x00\x05false\x00\x00\x00\x0ABLOCKCACHE\x00\x00\x00\x04true\x
> FE\xA0\xFD\xC5
> ..
> test1,6531790000,12797871 column=info:regioninfo, timestamp=1279787160782, value=REGION => {NAME =>
> 58624.8d5490bfc166c687657 'test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.', STAR
> cb09203bd7d44. TKEY => '6531790000', ENDKEY => '6551820000', ENCODED => 8d5490bfc166c687
> 657cb09203bd7d44, TABLE => {{NAME => 'test1', FAMILIES => [{NAME => 'acti
> ons', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', VERSIONS => '3', C
> OMPRESSION => 'NONE', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMOR
> Y => 'false', BLOCKCACHE => 'true'}]}}
> {code}
> I think Karthik has a handle on the first part (i.e. why the RS ran into the version mismatch, and aborted opening the region). He'll add details to the JIRA. But what we aren't clear about at this stage is why the base scanner didn't kick in and try to reassign the region.
> BTW, HBase "hbck" reported this as well (which was good!):
> {code}
> Number of Tables: 5
> Number of live region servers:92
> Number of dead region servers:0
> .........
> ERROR: Region test1,6512200000,1279784117114.6466481aa931f8c1fa87622735487a72. is not served by any region server but is listed in META to be on server null
> ERROR: Region test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44. is not served by any region server but is listed in META to be on server null
> {code}
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (HBASE-2866) Region permanently offlined
Posted by "HBase Review Board (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HBASE-2866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12891831#action_12891831 ]
HBase Review Board commented on HBASE-2866:
-------------------------------------------
Message from: stack@duboce.net
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://review.hbase.org/r/380/#review474
-----------------------------------------------------------
Ship it!
Just one comment
trunk/src/main/java/org/apache/hadoop/hbase/zookeeper/ZooKeeperWrapper.java
<http://review.hbase.org/r/380/#comment1965>
What is the watcher being triggered if data is not changing?
- stack
> Region permanently offlined
> ----------------------------
>
> Key: HBASE-2866
> URL: https://issues.apache.org/jira/browse/HBASE-2866
> Project: HBase
> Issue Type: Bug
> Reporter: Kannan Muthukkaruppan
> Assignee: Karthik Ranganathan
> Priority: Blocker
> Attachments: master.log
>
>
> After split, master attempts to reassign a region to a region server. Occasionally, such a region can get permanently offlined.
> Master:
> ---------
> {code}
> 2010-07-22 01:26:00,914 INFO org.apache.hadoop.hbase.master.ServerManager: Processing MSG_REPORT_SPLIT_INCLUDES_DAUGHTERS: test1,6512200000,1279784117114.6466481aa931f8c1fa87622735487a72.: Daughters; test1,6512200000,1279787158624.6ead25ae677116cc88fc5420bb39d52e., test1,6531790000,1279787\
> 158624.8d5490bfc166c687657cb09203bd7d44. from test024.test.xyz.com,60020,1279780567744; 1 of 1
> 2010-07-22 01:26:00,935 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Creating UNASSIGNED region 8d5490bfc166c687657cb09203bd7d44 in state = M2ZK_REGION_OFFLINE
> 2010-07-22 01:26:00,935 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Creating UNASSIGNED region 8d5490bfc166c687657cb09203bd7d44 in state = M2ZK_REGION_OFFLINE
> 2010-07-22 01:26:00,945 INFO org.apache.hadoop.hbase.master.RegionManager: Assigning region test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44. to test024.test.xyz.com,60020,1279780567744
> 2010-07-22 01:26:00,949 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: While updating UNASSIGNED region 8d5490bfc166c687657cb09203bd7d44 exists, state = M2ZK_REGION_OFFLINE
> 2010-07-22 01:26:00,954 DEBUG org.apache.hadoop.hbase.master.RegionManager: Created UNASSIGNED zNode test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44. in state M2ZK_REGION_OFFLINE
> {code}
> -------------------
> Region Server:
> {code}
> 2010-07-22 01:26:00,947 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN: test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.
> 2010-07-22 01:26:00,947 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN: test1,6512200000,1279787158624.6ead25ae677116cc88fc5420bb39d52e.
> 2010-07-22 01:26:00,947 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Worker: MSG_REGION_OPEN: test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.
> 2010-07-22 01:26:00,948 DEBUG org.apache.hadoop.hbase.regionserver.RSZookeeperUpdater: Updating ZNode /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44 with [RS2ZK_REGION_OPENING] expected version = 0
> 2010-07-22 01:26:00,952 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event, state: SyncConnected, type: NodeDataChanged, path: /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44
> 2010-07-22 01:26:00,974 WARN org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: <msgstorectrl001.test.xyz.com,msgstorectrl021.test.xyz.com,msgstorectrl041.test.xyz.com,msgstorectrl061.test.xyz.com,msgstorectrl081.ash2.facebook\
> .com:/hbase,test024.test.xyz.com,60020,1279780567744>Failed to write data to ZooKeeper
> org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44
> at org.apache.zookeeper.KeeperException.create(KeeperException.java:106)
> at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
> at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1038)
> at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.writeZNode(ZooKeeperWrapper.java:1062)
> at org.apache.hadoop.hbase.regionserver.RSZookeeperUpdater.updateZKWithEventData(RSZookeeperUpdater.java:161)
> at org.apache.hadoop.hbase.regionserver.RSZookeeperUpdater.startRegionOpenEvent(RSZookeeperUpdater.java:115)
> at org.apache.hadoop.hbase.regionserver.HRegionServer.openRegion(HRegionServer.java:1428)
> at org.apache.hadoop.hbase.regionserver.HRegionServer$Worker.run(HRegionServer.java:1337)
> at java.lang.Thread.run(Thread.java:619)
> 2010-07-22 01:26:00,975 ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: Error opening test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.
> java.io.IOException: org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44
> at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.writeZNode(ZooKeeperWrapper.java:1072)
> {code}
> Meta:
> -----
> Relevant section of META.
> Note that these are the only two entries for the problem region. The first one is the parent region (and this problem
> region is its splitB). For the next one, note that there is no "info:server" and "info:serverstartcode" columns.
> {code}
> test1,6512200000,12797841 column=info:splitB, timestamp=1279787160693, value=\x00\x0A6551820000\x00
> 17114.6466481aa931f8c1fa8 \x00\x00\x01)\xF9BL`@test1,6531790000,1279787158624.8d5490bfc166c687657cb
> 7622735487a72. 09203bd7d44.\x00\x0A6531790000\x00\x00\x00\x05\x05test1\x00\x00\x00\x00\x
> 00\x02\x00\x00\x00\x07IS_ROOT\x00\x00\x00\x05false\x00\x00\x00\x07IS_META
> \x00\x00\x00\x05false\x00\x00\x00\x01\x08\x07actions\x00\x00\x00\x08\x00\
> x00\x00\x0BBLOOMFILTER\x00\x00\x00\x04NONE\x00\x00\x00\x11REPLICATION_SCO
> PE\x00\x00\x00\x010\x00\x00\x00\x0BCOMPRESSION\x00\x00\x00\x04NONE\x00\x0
> 0\x00\x08VERSIONS\x00\x00\x00\x013\x00\x00\x00\x03TTL\x00\x00\x00\x0A2147
> 483647\x00\x00\x00\x09BLOCKSIZE\x00\x00\x00\x0565536\x00\x00\x00\x09IN_ME
> MORY\x00\x00\x00\x05false\x00\x00\x00\x0ABLOCKCACHE\x00\x00\x00\x04true\x
> FE\xA0\xFD\xC5
> ..
> test1,6531790000,12797871 column=info:regioninfo, timestamp=1279787160782, value=REGION => {NAME =>
> 58624.8d5490bfc166c687657 'test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.', STAR
> cb09203bd7d44. TKEY => '6531790000', ENDKEY => '6551820000', ENCODED => 8d5490bfc166c687
> 657cb09203bd7d44, TABLE => {{NAME => 'test1', FAMILIES => [{NAME => 'acti
> ons', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', VERSIONS => '3', C
> OMPRESSION => 'NONE', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMOR
> Y => 'false', BLOCKCACHE => 'true'}]}}
> {code}
> I think Karthik has a handle on the first part (i.e. why the RS ran into the version mismatch, and aborted opening the region). He'll add details to the JIRA. But what we aren't clear about at this stage is why the base scanner didn't kick in and try to reassign the region.
> BTW, HBase "hbck" reported this as well (which was good!):
> {code}
> Number of Tables: 5
> Number of live region servers:92
> Number of dead region servers:0
> .........
> ERROR: Region test1,6512200000,1279784117114.6466481aa931f8c1fa87622735487a72. is not served by any region server but is listed in META to be on server null
> ERROR: Region test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44. is not served by any region server but is listed in META to be on server null
> {code}
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (HBASE-2866) Region permanently offlined
Posted by "Karthik Ranganathan (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HBASE-2866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12891649#action_12891649 ]
Karthik Ranganathan commented on HBASE-2866:
--------------------------------------------
Hey Stack,
Have a fix ready - testing now, will put it up in a bit.
Fix is simple: we get into this situation because we update the same region in transition in ZK again and again, which bumps up the revision number of the ZNode. This causes the update to fail. So if the ZNode is already in the target state, do not update it again.
The above explanation is super-cryptic :), so will sync up with you on the issue and the fix.
> Region permanently offlined
> ----------------------------
>
> Key: HBASE-2866
> URL: https://issues.apache.org/jira/browse/HBASE-2866
> Project: HBase
> Issue Type: Bug
> Reporter: Kannan Muthukkaruppan
> Assignee: Karthik Ranganathan
> Priority: Blocker
> Attachments: master.log
>
>
> After split, master attempts to reassign a region to a region server. Occasionally, such a region can get permanently offlined.
> Master:
> ---------
> {code}
> 2010-07-22 01:26:00,914 INFO org.apache.hadoop.hbase.master.ServerManager: Processing MSG_REPORT_SPLIT_INCLUDES_DAUGHTERS: test1,6512200000,1279784117114.6466481aa931f8c1fa87622735487a72.: Daughters; test1,6512200000,1279787158624.6ead25ae677116cc88fc5420bb39d52e., test1,6531790000,1279787\
> 158624.8d5490bfc166c687657cb09203bd7d44. from test024.test.xyz.com,60020,1279780567744; 1 of 1
> 2010-07-22 01:26:00,935 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Creating UNASSIGNED region 8d5490bfc166c687657cb09203bd7d44 in state = M2ZK_REGION_OFFLINE
> 2010-07-22 01:26:00,935 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Creating UNASSIGNED region 8d5490bfc166c687657cb09203bd7d44 in state = M2ZK_REGION_OFFLINE
> 2010-07-22 01:26:00,945 INFO org.apache.hadoop.hbase.master.RegionManager: Assigning region test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44. to test024.test.xyz.com,60020,1279780567744
> 2010-07-22 01:26:00,949 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: While updating UNASSIGNED region 8d5490bfc166c687657cb09203bd7d44 exists, state = M2ZK_REGION_OFFLINE
> 2010-07-22 01:26:00,954 DEBUG org.apache.hadoop.hbase.master.RegionManager: Created UNASSIGNED zNode test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44. in state M2ZK_REGION_OFFLINE
> {code}
> -------------------
> Region Server:
> {code}
> 2010-07-22 01:26:00,947 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN: test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.
> 2010-07-22 01:26:00,947 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN: test1,6512200000,1279787158624.6ead25ae677116cc88fc5420bb39d52e.
> 2010-07-22 01:26:00,947 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Worker: MSG_REGION_OPEN: test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.
> 2010-07-22 01:26:00,948 DEBUG org.apache.hadoop.hbase.regionserver.RSZookeeperUpdater: Updating ZNode /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44 with [RS2ZK_REGION_OPENING] expected version = 0
> 2010-07-22 01:26:00,952 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event, state: SyncConnected, type: NodeDataChanged, path: /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44
> 2010-07-22 01:26:00,974 WARN org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: <msgstorectrl001.test.xyz.com,msgstorectrl021.test.xyz.com,msgstorectrl041.test.xyz.com,msgstorectrl061.test.xyz.com,msgstorectrl081.ash2.facebook\
> .com:/hbase,test024.test.xyz.com,60020,1279780567744>Failed to write data to ZooKeeper
> org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44
> at org.apache.zookeeper.KeeperException.create(KeeperException.java:106)
> at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
> at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1038)
> at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.writeZNode(ZooKeeperWrapper.java:1062)
> at org.apache.hadoop.hbase.regionserver.RSZookeeperUpdater.updateZKWithEventData(RSZookeeperUpdater.java:161)
> at org.apache.hadoop.hbase.regionserver.RSZookeeperUpdater.startRegionOpenEvent(RSZookeeperUpdater.java:115)
> at org.apache.hadoop.hbase.regionserver.HRegionServer.openRegion(HRegionServer.java:1428)
> at org.apache.hadoop.hbase.regionserver.HRegionServer$Worker.run(HRegionServer.java:1337)
> at java.lang.Thread.run(Thread.java:619)
> 2010-07-22 01:26:00,975 ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: Error opening test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.
> java.io.IOException: org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44
> at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.writeZNode(ZooKeeperWrapper.java:1072)
> {code}
> Meta:
> -----
> Relevant section of META.
> Note that these are the only two entries for the problem region. The first one is the parent region (and this problem
> region is its splitB). For the next one, note that there is no "info:server" and "info:serverstartcode" columns.
> {code}
> test1,6512200000,12797841 column=info:splitB, timestamp=1279787160693, value=\x00\x0A6551820000\x00
> 17114.6466481aa931f8c1fa8 \x00\x00\x01)\xF9BL`@test1,6531790000,1279787158624.8d5490bfc166c687657cb
> 7622735487a72. 09203bd7d44.\x00\x0A6531790000\x00\x00\x00\x05\x05test1\x00\x00\x00\x00\x
> 00\x02\x00\x00\x00\x07IS_ROOT\x00\x00\x00\x05false\x00\x00\x00\x07IS_META
> \x00\x00\x00\x05false\x00\x00\x00\x01\x08\x07actions\x00\x00\x00\x08\x00\
> x00\x00\x0BBLOOMFILTER\x00\x00\x00\x04NONE\x00\x00\x00\x11REPLICATION_SCO
> PE\x00\x00\x00\x010\x00\x00\x00\x0BCOMPRESSION\x00\x00\x00\x04NONE\x00\x0
> 0\x00\x08VERSIONS\x00\x00\x00\x013\x00\x00\x00\x03TTL\x00\x00\x00\x0A2147
> 483647\x00\x00\x00\x09BLOCKSIZE\x00\x00\x00\x0565536\x00\x00\x00\x09IN_ME
> MORY\x00\x00\x00\x05false\x00\x00\x00\x0ABLOCKCACHE\x00\x00\x00\x04true\x
> FE\xA0\xFD\xC5
> ..
> test1,6531790000,12797871 column=info:regioninfo, timestamp=1279787160782, value=REGION => {NAME =>
> 58624.8d5490bfc166c687657 'test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.', STAR
> cb09203bd7d44. TKEY => '6531790000', ENDKEY => '6551820000', ENCODED => 8d5490bfc166c687
> 657cb09203bd7d44, TABLE => {{NAME => 'test1', FAMILIES => [{NAME => 'acti
> ons', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', VERSIONS => '3', C
> OMPRESSION => 'NONE', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMOR
> Y => 'false', BLOCKCACHE => 'true'}]}}
> {code}
> I think Karthik has a handle on the first part (i.e. why the RS ran into the version mismatch, and aborted opening the region). He'll add details to the JIRA. But what we aren't clear about at this stage is why the base scanner didn't kick in and try to reassign the region.
> BTW, HBase "hbck" reported this as well (which was good!):
> {code}
> Number of Tables: 5
> Number of live region servers:92
> Number of dead region servers:0
> .........
> ERROR: Region test1,6512200000,1279784117114.6466481aa931f8c1fa87622735487a72. is not served by any region server but is listed in META to be on server null
> ERROR: Region test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44. is not served by any region server but is listed in META to be on server null
> {code}
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Updated: (HBASE-2866) Region permanently offlined
Posted by "Kannan Muthukkaruppan (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HBASE-2866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Kannan Muthukkaruppan updated HBASE-2866:
-----------------------------------------
Attachment: master.log
@Stack: Attached the master's log.
> Region permanently offlined
> ----------------------------
>
> Key: HBASE-2866
> URL: https://issues.apache.org/jira/browse/HBASE-2866
> Project: HBase
> Issue Type: Bug
> Reporter: Kannan Muthukkaruppan
> Assignee: Karthik Ranganathan
> Priority: Blocker
> Attachments: master.log
>
>
> After split, master attempts to reassign a region to a region server. Occasionally, such a region can get permanently offlined.
> Master:
> ---------
> {code}
> 2010-07-22 01:26:00,914 INFO org.apache.hadoop.hbase.master.ServerManager: Processing MSG_REPORT_SPLIT_INCLUDES_DAUGHTERS: test1,6512200000,1279784117114.6466481aa931f8c1fa87622735487a72.: Daughters; test1,6512200000,1279787158624.6ead25ae677116cc88fc5420bb39d52e., test1,6531790000,1279787\
> 158624.8d5490bfc166c687657cb09203bd7d44. from test024.test.xyz.com,60020,1279780567744; 1 of 1
> 2010-07-22 01:26:00,935 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Creating UNASSIGNED region 8d5490bfc166c687657cb09203bd7d44 in state = M2ZK_REGION_OFFLINE
> 2010-07-22 01:26:00,935 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Creating UNASSIGNED region 8d5490bfc166c687657cb09203bd7d44 in state = M2ZK_REGION_OFFLINE
> 2010-07-22 01:26:00,945 INFO org.apache.hadoop.hbase.master.RegionManager: Assigning region test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44. to test024.test.xyz.com,60020,1279780567744
> 2010-07-22 01:26:00,949 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: While updating UNASSIGNED region 8d5490bfc166c687657cb09203bd7d44 exists, state = M2ZK_REGION_OFFLINE
> 2010-07-22 01:26:00,954 DEBUG org.apache.hadoop.hbase.master.RegionManager: Created UNASSIGNED zNode test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44. in state M2ZK_REGION_OFFLINE
> {code}
> -------------------
> Region Server:
> {code}
> 2010-07-22 01:26:00,947 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN: test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.
> 2010-07-22 01:26:00,947 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN: test1,6512200000,1279787158624.6ead25ae677116cc88fc5420bb39d52e.
> 2010-07-22 01:26:00,947 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Worker: MSG_REGION_OPEN: test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.
> 2010-07-22 01:26:00,948 DEBUG org.apache.hadoop.hbase.regionserver.RSZookeeperUpdater: Updating ZNode /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44 with [RS2ZK_REGION_OPENING] expected version = 0
> 2010-07-22 01:26:00,952 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event, state: SyncConnected, type: NodeDataChanged, path: /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44
> 2010-07-22 01:26:00,974 WARN org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: <msgstorectrl001.test.xyz.com,msgstorectrl021.test.xyz.com,msgstorectrl041.test.xyz.com,msgstorectrl061.test.xyz.com,msgstorectrl081.ash2.facebook\
> .com:/hbase,test024.test.xyz.com,60020,1279780567744>Failed to write data to ZooKeeper
> org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44
> at org.apache.zookeeper.KeeperException.create(KeeperException.java:106)
> at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
> at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1038)
> at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.writeZNode(ZooKeeperWrapper.java:1062)
> at org.apache.hadoop.hbase.regionserver.RSZookeeperUpdater.updateZKWithEventData(RSZookeeperUpdater.java:161)
> at org.apache.hadoop.hbase.regionserver.RSZookeeperUpdater.startRegionOpenEvent(RSZookeeperUpdater.java:115)
> at org.apache.hadoop.hbase.regionserver.HRegionServer.openRegion(HRegionServer.java:1428)
> at org.apache.hadoop.hbase.regionserver.HRegionServer$Worker.run(HRegionServer.java:1337)
> at java.lang.Thread.run(Thread.java:619)
> 2010-07-22 01:26:00,975 ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: Error opening test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.
> java.io.IOException: org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44
> at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.writeZNode(ZooKeeperWrapper.java:1072)
> {code}
> Meta:
> -----
> Relevant section of META.
> Note that these are the only two entries for the problem region. The first one is the parent region (and this problem
> region is its splitB). For the next one, note that there is no "info:server" and "info:serverstartcode" columns.
> {code}
> test1,6512200000,12797841 column=info:splitB, timestamp=1279787160693, value=\x00\x0A6551820000\x00
> 17114.6466481aa931f8c1fa8 \x00\x00\x01)\xF9BL`@test1,6531790000,1279787158624.8d5490bfc166c687657cb
> 7622735487a72. 09203bd7d44.\x00\x0A6531790000\x00\x00\x00\x05\x05test1\x00\x00\x00\x00\x
> 00\x02\x00\x00\x00\x07IS_ROOT\x00\x00\x00\x05false\x00\x00\x00\x07IS_META
> \x00\x00\x00\x05false\x00\x00\x00\x01\x08\x07actions\x00\x00\x00\x08\x00\
> x00\x00\x0BBLOOMFILTER\x00\x00\x00\x04NONE\x00\x00\x00\x11REPLICATION_SCO
> PE\x00\x00\x00\x010\x00\x00\x00\x0BCOMPRESSION\x00\x00\x00\x04NONE\x00\x0
> 0\x00\x08VERSIONS\x00\x00\x00\x013\x00\x00\x00\x03TTL\x00\x00\x00\x0A2147
> 483647\x00\x00\x00\x09BLOCKSIZE\x00\x00\x00\x0565536\x00\x00\x00\x09IN_ME
> MORY\x00\x00\x00\x05false\x00\x00\x00\x0ABLOCKCACHE\x00\x00\x00\x04true\x
> FE\xA0\xFD\xC5
> ..
> test1,6531790000,12797871 column=info:regioninfo, timestamp=1279787160782, value=REGION => {NAME =>
> 58624.8d5490bfc166c687657 'test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.', STAR
> cb09203bd7d44. TKEY => '6531790000', ENDKEY => '6551820000', ENCODED => 8d5490bfc166c687
> 657cb09203bd7d44, TABLE => {{NAME => 'test1', FAMILIES => [{NAME => 'acti
> ons', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', VERSIONS => '3', C
> OMPRESSION => 'NONE', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMOR
> Y => 'false', BLOCKCACHE => 'true'}]}}
> {code}
> I think Karthik has a handle on the first part (i.e. why the RS ran into the version mismatch, and aborted opening the region). He'll add details to the JIRA. But what we aren't clear about at this stage is why the base scanner didn't kick in and try to reassign the region.
> BTW, HBase "hbck" reported this as well (which was good!):
> {code}
> Number of Tables: 5
> Number of live region servers:92
> Number of dead region servers:0
> .........
> ERROR: Region test1,6512200000,1279784117114.6466481aa931f8c1fa87622735487a72. is not served by any region server but is listed in META to be on server null
> ERROR: Region test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44. is not served by any region server but is listed in META to be on server null
> {code}
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (HBASE-2866) Region permanently offlined
Posted by "Kannan Muthukkaruppan (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HBASE-2866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12891689#action_12891689 ]
Kannan Muthukkaruppan commented on HBASE-2866:
----------------------------------------------
Stack:
{code}
hbase(main):002:0> status 'detailed'
status 'detailed'
version 0.21.0-SNAPSHOT
1 regionsInTransition
name=test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44., state=PENDING_OPEN
92 live servers
...
{code}
So we find one region in PENDING_OPEN state. HBase "hbck" though complained about 2 regions. (So need to look into what was happening with the other region).
And, surprisingly though, ZK UNASSIGNED node shows 4 entries.
{code}
hbase(main):006:0> zk "ls /hbase/UNASSIGNED"
zk "ls /hbase/UNASSIGNED"
[9d675cd0b61c44c6605e752490a36eaf, 2e8c1def6dfe3ee1b8fc965629a93041, 9697bd45dc3d62d7e6b519a13d668062, 8d5490bfc166c687657cb09203bd7d44]
{code}
> Region permanently offlined
> ----------------------------
>
> Key: HBASE-2866
> URL: https://issues.apache.org/jira/browse/HBASE-2866
> Project: HBase
> Issue Type: Bug
> Reporter: Kannan Muthukkaruppan
> Assignee: Karthik Ranganathan
> Priority: Blocker
> Attachments: master.log
>
>
> After split, master attempts to reassign a region to a region server. Occasionally, such a region can get permanently offlined.
> Master:
> ---------
> {code}
> 2010-07-22 01:26:00,914 INFO org.apache.hadoop.hbase.master.ServerManager: Processing MSG_REPORT_SPLIT_INCLUDES_DAUGHTERS: test1,6512200000,1279784117114.6466481aa931f8c1fa87622735487a72.: Daughters; test1,6512200000,1279787158624.6ead25ae677116cc88fc5420bb39d52e., test1,6531790000,1279787\
> 158624.8d5490bfc166c687657cb09203bd7d44. from test024.test.xyz.com,60020,1279780567744; 1 of 1
> 2010-07-22 01:26:00,935 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Creating UNASSIGNED region 8d5490bfc166c687657cb09203bd7d44 in state = M2ZK_REGION_OFFLINE
> 2010-07-22 01:26:00,935 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Creating UNASSIGNED region 8d5490bfc166c687657cb09203bd7d44 in state = M2ZK_REGION_OFFLINE
> 2010-07-22 01:26:00,945 INFO org.apache.hadoop.hbase.master.RegionManager: Assigning region test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44. to test024.test.xyz.com,60020,1279780567744
> 2010-07-22 01:26:00,949 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: While updating UNASSIGNED region 8d5490bfc166c687657cb09203bd7d44 exists, state = M2ZK_REGION_OFFLINE
> 2010-07-22 01:26:00,954 DEBUG org.apache.hadoop.hbase.master.RegionManager: Created UNASSIGNED zNode test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44. in state M2ZK_REGION_OFFLINE
> {code}
> -------------------
> Region Server:
> {code}
> 2010-07-22 01:26:00,947 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN: test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.
> 2010-07-22 01:26:00,947 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN: test1,6512200000,1279787158624.6ead25ae677116cc88fc5420bb39d52e.
> 2010-07-22 01:26:00,947 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Worker: MSG_REGION_OPEN: test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.
> 2010-07-22 01:26:00,948 DEBUG org.apache.hadoop.hbase.regionserver.RSZookeeperUpdater: Updating ZNode /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44 with [RS2ZK_REGION_OPENING] expected version = 0
> 2010-07-22 01:26:00,952 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event, state: SyncConnected, type: NodeDataChanged, path: /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44
> 2010-07-22 01:26:00,974 WARN org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: <msgstorectrl001.test.xyz.com,msgstorectrl021.test.xyz.com,msgstorectrl041.test.xyz.com,msgstorectrl061.test.xyz.com,msgstorectrl081.ash2.facebook\
> .com:/hbase,test024.test.xyz.com,60020,1279780567744>Failed to write data to ZooKeeper
> org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44
> at org.apache.zookeeper.KeeperException.create(KeeperException.java:106)
> at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
> at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1038)
> at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.writeZNode(ZooKeeperWrapper.java:1062)
> at org.apache.hadoop.hbase.regionserver.RSZookeeperUpdater.updateZKWithEventData(RSZookeeperUpdater.java:161)
> at org.apache.hadoop.hbase.regionserver.RSZookeeperUpdater.startRegionOpenEvent(RSZookeeperUpdater.java:115)
> at org.apache.hadoop.hbase.regionserver.HRegionServer.openRegion(HRegionServer.java:1428)
> at org.apache.hadoop.hbase.regionserver.HRegionServer$Worker.run(HRegionServer.java:1337)
> at java.lang.Thread.run(Thread.java:619)
> 2010-07-22 01:26:00,975 ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: Error opening test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.
> java.io.IOException: org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44
> at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.writeZNode(ZooKeeperWrapper.java:1072)
> {code}
> Meta:
> -----
> Relevant section of META.
> Note that these are the only two entries for the problem region. The first one is the parent region (and this problem
> region is its splitB). For the next one, note that there is no "info:server" and "info:serverstartcode" columns.
> {code}
> test1,6512200000,12797841 column=info:splitB, timestamp=1279787160693, value=\x00\x0A6551820000\x00
> 17114.6466481aa931f8c1fa8 \x00\x00\x01)\xF9BL`@test1,6531790000,1279787158624.8d5490bfc166c687657cb
> 7622735487a72. 09203bd7d44.\x00\x0A6531790000\x00\x00\x00\x05\x05test1\x00\x00\x00\x00\x
> 00\x02\x00\x00\x00\x07IS_ROOT\x00\x00\x00\x05false\x00\x00\x00\x07IS_META
> \x00\x00\x00\x05false\x00\x00\x00\x01\x08\x07actions\x00\x00\x00\x08\x00\
> x00\x00\x0BBLOOMFILTER\x00\x00\x00\x04NONE\x00\x00\x00\x11REPLICATION_SCO
> PE\x00\x00\x00\x010\x00\x00\x00\x0BCOMPRESSION\x00\x00\x00\x04NONE\x00\x0
> 0\x00\x08VERSIONS\x00\x00\x00\x013\x00\x00\x00\x03TTL\x00\x00\x00\x0A2147
> 483647\x00\x00\x00\x09BLOCKSIZE\x00\x00\x00\x0565536\x00\x00\x00\x09IN_ME
> MORY\x00\x00\x00\x05false\x00\x00\x00\x0ABLOCKCACHE\x00\x00\x00\x04true\x
> FE\xA0\xFD\xC5
> ..
> test1,6531790000,12797871 column=info:regioninfo, timestamp=1279787160782, value=REGION => {NAME =>
> 58624.8d5490bfc166c687657 'test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.', STAR
> cb09203bd7d44. TKEY => '6531790000', ENDKEY => '6551820000', ENCODED => 8d5490bfc166c687
> 657cb09203bd7d44, TABLE => {{NAME => 'test1', FAMILIES => [{NAME => 'acti
> ons', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', VERSIONS => '3', C
> OMPRESSION => 'NONE', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMOR
> Y => 'false', BLOCKCACHE => 'true'}]}}
> {code}
> I think Karthik has a handle on the first part (i.e. why the RS ran into the version mismatch, and aborted opening the region). He'll add details to the JIRA. But what we aren't clear about at this stage is why the base scanner didn't kick in and try to reassign the region.
> BTW, HBase "hbck" reported this as well (which was good!):
> {code}
> Number of Tables: 5
> Number of live region servers:92
> Number of dead region servers:0
> .........
> ERROR: Region test1,6512200000,1279784117114.6466481aa931f8c1fa87622735487a72. is not served by any region server but is listed in META to be on server null
> ERROR: Region test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44. is not served by any region server but is listed in META to be on server null
> {code}
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (HBASE-2866) Region permanently offlined
Posted by "Kannan Muthukkaruppan (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HBASE-2866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12891694#action_12891694 ]
Kannan Muthukkaruppan commented on HBASE-2866:
----------------------------------------------
Wondering how this case is handled...
Master asks a RS to open a region, and I guess adds
it to regions in transition. If RS dies before even starting
to work on the region, is there some timeout mechanism
that kicks in and master realized it needs to give this
region to someone else?
---------
For now, what's the manual steps to be taken to get master
to reassign the region? Restarting the master would work
I suppose for now. Any other ideas?
> Region permanently offlined
> ----------------------------
>
> Key: HBASE-2866
> URL: https://issues.apache.org/jira/browse/HBASE-2866
> Project: HBase
> Issue Type: Bug
> Reporter: Kannan Muthukkaruppan
> Assignee: Karthik Ranganathan
> Priority: Blocker
> Attachments: master.log
>
>
> After split, master attempts to reassign a region to a region server. Occasionally, such a region can get permanently offlined.
> Master:
> ---------
> {code}
> 2010-07-22 01:26:00,914 INFO org.apache.hadoop.hbase.master.ServerManager: Processing MSG_REPORT_SPLIT_INCLUDES_DAUGHTERS: test1,6512200000,1279784117114.6466481aa931f8c1fa87622735487a72.: Daughters; test1,6512200000,1279787158624.6ead25ae677116cc88fc5420bb39d52e., test1,6531790000,1279787\
> 158624.8d5490bfc166c687657cb09203bd7d44. from test024.test.xyz.com,60020,1279780567744; 1 of 1
> 2010-07-22 01:26:00,935 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Creating UNASSIGNED region 8d5490bfc166c687657cb09203bd7d44 in state = M2ZK_REGION_OFFLINE
> 2010-07-22 01:26:00,935 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Creating UNASSIGNED region 8d5490bfc166c687657cb09203bd7d44 in state = M2ZK_REGION_OFFLINE
> 2010-07-22 01:26:00,945 INFO org.apache.hadoop.hbase.master.RegionManager: Assigning region test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44. to test024.test.xyz.com,60020,1279780567744
> 2010-07-22 01:26:00,949 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: While updating UNASSIGNED region 8d5490bfc166c687657cb09203bd7d44 exists, state = M2ZK_REGION_OFFLINE
> 2010-07-22 01:26:00,954 DEBUG org.apache.hadoop.hbase.master.RegionManager: Created UNASSIGNED zNode test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44. in state M2ZK_REGION_OFFLINE
> {code}
> -------------------
> Region Server:
> {code}
> 2010-07-22 01:26:00,947 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN: test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.
> 2010-07-22 01:26:00,947 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN: test1,6512200000,1279787158624.6ead25ae677116cc88fc5420bb39d52e.
> 2010-07-22 01:26:00,947 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Worker: MSG_REGION_OPEN: test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.
> 2010-07-22 01:26:00,948 DEBUG org.apache.hadoop.hbase.regionserver.RSZookeeperUpdater: Updating ZNode /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44 with [RS2ZK_REGION_OPENING] expected version = 0
> 2010-07-22 01:26:00,952 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event, state: SyncConnected, type: NodeDataChanged, path: /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44
> 2010-07-22 01:26:00,974 WARN org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: <msgstorectrl001.test.xyz.com,msgstorectrl021.test.xyz.com,msgstorectrl041.test.xyz.com,msgstorectrl061.test.xyz.com,msgstorectrl081.ash2.facebook\
> .com:/hbase,test024.test.xyz.com,60020,1279780567744>Failed to write data to ZooKeeper
> org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44
> at org.apache.zookeeper.KeeperException.create(KeeperException.java:106)
> at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
> at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1038)
> at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.writeZNode(ZooKeeperWrapper.java:1062)
> at org.apache.hadoop.hbase.regionserver.RSZookeeperUpdater.updateZKWithEventData(RSZookeeperUpdater.java:161)
> at org.apache.hadoop.hbase.regionserver.RSZookeeperUpdater.startRegionOpenEvent(RSZookeeperUpdater.java:115)
> at org.apache.hadoop.hbase.regionserver.HRegionServer.openRegion(HRegionServer.java:1428)
> at org.apache.hadoop.hbase.regionserver.HRegionServer$Worker.run(HRegionServer.java:1337)
> at java.lang.Thread.run(Thread.java:619)
> 2010-07-22 01:26:00,975 ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: Error opening test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.
> java.io.IOException: org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44
> at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.writeZNode(ZooKeeperWrapper.java:1072)
> {code}
> Meta:
> -----
> Relevant section of META.
> Note that these are the only two entries for the problem region. The first one is the parent region (and this problem
> region is its splitB). For the next one, note that there is no "info:server" and "info:serverstartcode" columns.
> {code}
> test1,6512200000,12797841 column=info:splitB, timestamp=1279787160693, value=\x00\x0A6551820000\x00
> 17114.6466481aa931f8c1fa8 \x00\x00\x01)\xF9BL`@test1,6531790000,1279787158624.8d5490bfc166c687657cb
> 7622735487a72. 09203bd7d44.\x00\x0A6531790000\x00\x00\x00\x05\x05test1\x00\x00\x00\x00\x
> 00\x02\x00\x00\x00\x07IS_ROOT\x00\x00\x00\x05false\x00\x00\x00\x07IS_META
> \x00\x00\x00\x05false\x00\x00\x00\x01\x08\x07actions\x00\x00\x00\x08\x00\
> x00\x00\x0BBLOOMFILTER\x00\x00\x00\x04NONE\x00\x00\x00\x11REPLICATION_SCO
> PE\x00\x00\x00\x010\x00\x00\x00\x0BCOMPRESSION\x00\x00\x00\x04NONE\x00\x0
> 0\x00\x08VERSIONS\x00\x00\x00\x013\x00\x00\x00\x03TTL\x00\x00\x00\x0A2147
> 483647\x00\x00\x00\x09BLOCKSIZE\x00\x00\x00\x0565536\x00\x00\x00\x09IN_ME
> MORY\x00\x00\x00\x05false\x00\x00\x00\x0ABLOCKCACHE\x00\x00\x00\x04true\x
> FE\xA0\xFD\xC5
> ..
> test1,6531790000,12797871 column=info:regioninfo, timestamp=1279787160782, value=REGION => {NAME =>
> 58624.8d5490bfc166c687657 'test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.', STAR
> cb09203bd7d44. TKEY => '6531790000', ENDKEY => '6551820000', ENCODED => 8d5490bfc166c687
> 657cb09203bd7d44, TABLE => {{NAME => 'test1', FAMILIES => [{NAME => 'acti
> ons', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', VERSIONS => '3', C
> OMPRESSION => 'NONE', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMOR
> Y => 'false', BLOCKCACHE => 'true'}]}}
> {code}
> I think Karthik has a handle on the first part (i.e. why the RS ran into the version mismatch, and aborted opening the region). He'll add details to the JIRA. But what we aren't clear about at this stage is why the base scanner didn't kick in and try to reassign the region.
> BTW, HBase "hbck" reported this as well (which was good!):
> {code}
> Number of Tables: 5
> Number of live region servers:92
> Number of dead region servers:0
> .........
> ERROR: Region test1,6512200000,1279784117114.6466481aa931f8c1fa87622735487a72. is not served by any region server but is listed in META to be on server null
> ERROR: Region test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44. is not served by any region server but is listed in META to be on server null
> {code}
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (HBASE-2866) Region permanently offlined
Posted by "Jean-Daniel Cryans (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HBASE-2866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12891381#action_12891381 ]
Jean-Daniel Cryans commented on HBASE-2866:
-------------------------------------------
This issue looks like the one in TestAdmin that fails every now and then on Hudson like: http://hudson.zones.apache.org/hudson/job/HBase-TRUNK/1397/
> Region permanently offlined
> ----------------------------
>
> Key: HBASE-2866
> URL: https://issues.apache.org/jira/browse/HBASE-2866
> Project: HBase
> Issue Type: Bug
> Reporter: Kannan Muthukkaruppan
> Assignee: Karthik Ranganathan
> Priority: Blocker
>
> After split, master attempts to reassign a region to a region server. Occasionally, such a region can get permanently offlined.
> Master:
> ---------
> {code}
> 2010-07-22 01:26:00,914 INFO org.apache.hadoop.hbase.master.ServerManager: Processing MSG_REPORT_SPLIT_INCLUDES_DAUGHTERS: test1,6512200000,1279784117114.6466481aa931f8c1fa87622735487a72.: Daughters; test1,6512200000,1279787158624.6ead25ae677116cc88fc5420bb39d52e., test1,6531790000,1279787\
> 158624.8d5490bfc166c687657cb09203bd7d44. from test024.test.xyz.com,60020,1279780567744; 1 of 1
> 2010-07-22 01:26:00,935 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Creating UNASSIGNED region 8d5490bfc166c687657cb09203bd7d44 in state = M2ZK_REGION_OFFLINE
> 2010-07-22 01:26:00,935 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Creating UNASSIGNED region 8d5490bfc166c687657cb09203bd7d44 in state = M2ZK_REGION_OFFLINE
> 2010-07-22 01:26:00,945 INFO org.apache.hadoop.hbase.master.RegionManager: Assigning region test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44. to test024.test.xyz.com,60020,1279780567744
> 2010-07-22 01:26:00,949 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: While updating UNASSIGNED region 8d5490bfc166c687657cb09203bd7d44 exists, state = M2ZK_REGION_OFFLINE
> 2010-07-22 01:26:00,954 DEBUG org.apache.hadoop.hbase.master.RegionManager: Created UNASSIGNED zNode test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44. in state M2ZK_REGION_OFFLINE
> {code}
> -------------------
> Region Server:
> {code}
> 2010-07-22 01:26:00,947 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN: test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.
> 2010-07-22 01:26:00,947 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN: test1,6512200000,1279787158624.6ead25ae677116cc88fc5420bb39d52e.
> 2010-07-22 01:26:00,947 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Worker: MSG_REGION_OPEN: test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.
> 2010-07-22 01:26:00,948 DEBUG org.apache.hadoop.hbase.regionserver.RSZookeeperUpdater: Updating ZNode /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44 with [RS2ZK_REGION_OPENING] expected version = 0
> 2010-07-22 01:26:00,952 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event, state: SyncConnected, type: NodeDataChanged, path: /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44
> 2010-07-22 01:26:00,974 WARN org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: <msgstorectrl001.test.xyz.com,msgstorectrl021.test.xyz.com,msgstorectrl041.test.xyz.com,msgstorectrl061.test.xyz.com,msgstorectrl081.ash2.facebook\
> .com:/hbase,test024.test.xyz.com,60020,1279780567744>Failed to write data to ZooKeeper
> org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44
> at org.apache.zookeeper.KeeperException.create(KeeperException.java:106)
> at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
> at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1038)
> at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.writeZNode(ZooKeeperWrapper.java:1062)
> at org.apache.hadoop.hbase.regionserver.RSZookeeperUpdater.updateZKWithEventData(RSZookeeperUpdater.java:161)
> at org.apache.hadoop.hbase.regionserver.RSZookeeperUpdater.startRegionOpenEvent(RSZookeeperUpdater.java:115)
> at org.apache.hadoop.hbase.regionserver.HRegionServer.openRegion(HRegionServer.java:1428)
> at org.apache.hadoop.hbase.regionserver.HRegionServer$Worker.run(HRegionServer.java:1337)
> at java.lang.Thread.run(Thread.java:619)
> 2010-07-22 01:26:00,975 ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: Error opening test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.
> java.io.IOException: org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44
> at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.writeZNode(ZooKeeperWrapper.java:1072)
> {code}
> Meta:
> -----
> Relevant section of META.
> Note that these are the only two entries for the problem region. The first one is the parent region (and this problem
> region is its splitB). For the next one, note that there is no "info:server" and "info:serverstartcode" columns.
> {code}
> test1,6512200000,12797841 column=info:splitB, timestamp=1279787160693, value=\x00\x0A6551820000\x00
> 17114.6466481aa931f8c1fa8 \x00\x00\x01)\xF9BL`@test1,6531790000,1279787158624.8d5490bfc166c687657cb
> 7622735487a72. 09203bd7d44.\x00\x0A6531790000\x00\x00\x00\x05\x05test1\x00\x00\x00\x00\x
> 00\x02\x00\x00\x00\x07IS_ROOT\x00\x00\x00\x05false\x00\x00\x00\x07IS_META
> \x00\x00\x00\x05false\x00\x00\x00\x01\x08\x07actions\x00\x00\x00\x08\x00\
> x00\x00\x0BBLOOMFILTER\x00\x00\x00\x04NONE\x00\x00\x00\x11REPLICATION_SCO
> PE\x00\x00\x00\x010\x00\x00\x00\x0BCOMPRESSION\x00\x00\x00\x04NONE\x00\x0
> 0\x00\x08VERSIONS\x00\x00\x00\x013\x00\x00\x00\x03TTL\x00\x00\x00\x0A2147
> 483647\x00\x00\x00\x09BLOCKSIZE\x00\x00\x00\x0565536\x00\x00\x00\x09IN_ME
> MORY\x00\x00\x00\x05false\x00\x00\x00\x0ABLOCKCACHE\x00\x00\x00\x04true\x
> FE\xA0\xFD\xC5
> ..
> test1,6531790000,12797871 column=info:regioninfo, timestamp=1279787160782, value=REGION => {NAME =>
> 58624.8d5490bfc166c687657 'test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.', STAR
> cb09203bd7d44. TKEY => '6531790000', ENDKEY => '6551820000', ENCODED => 8d5490bfc166c687
> 657cb09203bd7d44, TABLE => {{NAME => 'test1', FAMILIES => [{NAME => 'acti
> ons', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', VERSIONS => '3', C
> OMPRESSION => 'NONE', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMOR
> Y => 'false', BLOCKCACHE => 'true'}]}}
> {code}
> I think Karthik has a handle on the first part (i.e. why the RS ran into the version mismatch, and aborted opening the region). He'll add details to the JIRA. But what we aren't clear about at this stage is why the base scanner didn't kick in and try to reassign the region.
> BTW, HBase "hbck" reported this as well (which was good!):
> {code}
> Number of Tables: 5
> Number of live region servers:92
> Number of dead region servers:0
> .........
> ERROR: Region test1,6512200000,1279784117114.6466481aa931f8c1fa87622735487a72. is not served by any region server but is listed in META to be on server null
> ERROR: Region test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44. is not served by any region server but is listed in META to be on server null
> {code}
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.