Posted to common-dev@hadoop.apache.org by "Chris Kline (JIRA)" <ji...@apache.org> on 2007/12/29 02:19:43 UTC

[jira] Created: (HADOOP-2500) [HBase] Unreadable region kills region servers

[HBase] Unreadable region kills region servers
----------------------------------------------

                 Key: HADOOP-2500
                 URL: https://issues.apache.org/jira/browse/HADOOP-2500
             Project: Hadoop
          Issue Type: Bug
          Components: contrib/hbase
         Environment: CentOS 5
            Reporter: Chris Kline


Background: The name node (also a DataNode and RegionServer) in our cluster ran out of disk space.  I freed some space, restarted HDFS, and fsck reported corruption in an HBase file.  I cleared up that corruption and restarted HBase.  I was still unable to read anything from HBase even though HDFS was now healthy.

The following was gathered from the log files.  When HMaster starts up, it finds a region that is no good (Key: 17_125736271):
2007-12-24 09:07:14,342 DEBUG org.apache.hadoop.hbase.HMaster: Current assignment of spider_pages,17_125736271,1198286140018 is no good

HMaster then assigns this region to RegionServer X.60:
2007-12-24 09:07:17,126 INFO org.apache.hadoop.hbase.HMaster: assigning region spider_pages,17_125736271,1198286140018 to server 10.100.11.60:60020
2007-12-24 09:07:20,152 DEBUG org.apache.hadoop.hbase.HMaster: Received MSG_REPORT_PROCESS_OPEN : spider_pages,17_125736271,1198286140018 from 10.100.11.60:60020

The RegionServer has trouble reading that region (from the RegionServer log on X.60).  Note that the worker thread exits:
2007-12-24 09:07:22,611 DEBUG org.apache.hadoop.hbase.HStore: starting spider_pages,17_125736271,1198286140018/meta (2062710340/meta with reconstruction log: (/data/hbase1/hregion_2062710340/oldlogfile.log
2007-12-24 09:07:22,620 DEBUG org.apache.hadoop.hbase.HStore: maximum sequence id for hstore spider_pages,17_125736271,1198286140018/meta (2062710340/meta) is 4549496
2007-12-24 09:07:22,622 ERROR org.apache.hadoop.hbase.HRegionServer: error opening region spider_pages,17_125736271,1198286140018
java.io.EOFException
        at java.io.DataInputStream.readFully(DataInputStream.java:180)
        at java.io.DataInputStream.readFully(DataInputStream.java:152)
        at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1383)
        at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1360)
        at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1349)
        at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1344)
        at org.apache.hadoop.hbase.HStore.doReconstructionLog(HStore.java:697)
        at org.apache.hadoop.hbase.HStore.<init>(HStore.java:632)
        at org.apache.hadoop.hbase.HRegion.<init>(HRegion.java:288)
        at org.apache.hadoop.hbase.HRegionServer.openRegion(HRegionServer.java:1211)
        at org.apache.hadoop.hbase.HRegionServer$Worker.run(HRegionServer.java:1162)
        at java.lang.Thread.run(Thread.java:619)
2007-12-24 09:07:22,623 FATAL org.apache.hadoop.hbase.HRegionServer: Unhandled exception
java.lang.NullPointerException
        at org.apache.hadoop.hbase.HRegionServer.reportClose(HRegionServer.java:1095)
        at org.apache.hadoop.hbase.HRegionServer.openRegion(HRegionServer.java:1217)
        at org.apache.hadoop.hbase.HRegionServer$Worker.run(HRegionServer.java:1162)
        at java.lang.Thread.run(Thread.java:619)
2007-12-24 09:07:22,623 INFO org.apache.hadoop.hbase.HRegionServer: worker thread exiting

The HMaster then tries to assign the same region to X.60 again and fails.  The HMaster tries to assign the region to X.31 with the same result (X.31 worker thread exits).
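
Reading the two traces together, it looks as though the EOFException is caught and logged inside openRegion, and the NullPointerException is then thrown from reportClose on the cleanup path, where nothing catches it, so the worker thread exits. A minimal sketch of a worker loop that survives a failing task is below. It is plain Java with hypothetical names (ResilientWorker, submit, failures); it is not the HRegionServer code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class ResilientWorker implements Runnable {
    // Hypothetical stand-in for queued "open this region" requests.
    private final BlockingQueue<Runnable> tasks = new LinkedBlockingQueue<>();
    private final List<String> failures = new ArrayList<>();
    // Sentinel task that tells the loop to stop.
    private static final Runnable STOP = () -> { };

    public void submit(Runnable task) { tasks.add(task); }

    public void shutdown() { tasks.add(STOP); }

    public synchronized List<String> failures() {
        return new ArrayList<>(failures);
    }

    @Override
    public void run() {
        while (true) {
            Runnable task;
            try {
                task = tasks.take();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            if (task == STOP) {
                return;
            }
            try {
                task.run();
            } catch (RuntimeException e) {
                // Record the failure and keep serving later tasks, instead of
                // letting one bad task end the whole worker thread.
                synchronized (this) {
                    failures.add(e.toString());
                }
            }
        }
    }
}
```

With a loop like this, a NullPointerException in one task would be recorded and the next queued task would still run.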

The file it is complaining about, /data/hbase1/hregion_2062710340/oldlogfile.log, is a zero-length file in HDFS.  After deleting that file and restarting HBase, HBase appears to be back to normal.
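
The EOFException fits a zero-length file: java.io.DataInputStream.readFully throws it when the stream ends before the requested bytes have been read. The sketch below reproduces that with an in-memory byte array standing in for the file, and shows the kind of length check that would skip an empty log. The class name, the method names and the 4-byte header size are assumptions made for illustration; this is not the SequenceFile format or the HStore code.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.UncheckedIOException;

public class LogHeaderCheck {
    // Hypothetical header size, for illustration only.
    static final int HEADER_BYTES = 4;

    /** Tries to read a header; returns true if the data ended too early. */
    static boolean readHitsEof(byte[] fileContents) {
        try (DataInputStream in =
                 new DataInputStream(new ByteArrayInputStream(fileContents))) {
            in.readFully(new byte[HEADER_BYTES]); // EOFException if too few bytes
            return false;
        } catch (EOFException e) {
            return true;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Guard: an empty log has nothing to replay, so do not open it. */
    static boolean shouldReplay(long fileLength) {
        return fileLength > 0;
    }
}
```

In a real deployment the length would come from the file system's metadata for the log file rather than from an array.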

One thing I can't figure out is that the HMaster log shows several entries after the worker thread on X.60 has exited, suggesting that the RegionServer is still talking with HMaster:
2007-12-24 09:08:23,349 DEBUG org.apache.hadoop.hbase.HMaster: Received MSG_REPORT_PROCESS_OPEN : spider_pages,17_125736271,1198286140018 from 10.100.11.60:60020
2007-12-24 09:10:29,543 DEBUG org.apache.hadoop.hbase.HMaster: Received MSG_REPORT_PROCESS_OPEN : spider_pages,17_125736271,1198286140018 from 10.100.11.60:60020

There is no corresponding entry in the RegionServer's log.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HADOOP-2500) [HBase] Unreadable region kills region servers

Posted by "stack (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-2500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HADOOP-2500:
--------------------------


Address for 0.16 release

[jira] Commented: (HADOOP-2500) [HBase] Unreadable region kills region servers

Posted by "Hudson (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-2500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12558593#action_12558593 ] 

Hudson commented on HADOOP-2500:
--------------------------------

Integrated in Hadoop-Nightly #365 (See [http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Nightly/365/])

> [HBase] Unreadable region kills region servers
> ----------------------------------------------
>
>                 Key: HADOOP-2500
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2500
>             Project: Hadoop
>          Issue Type: Bug
>          Components: contrib/hbase
>         Environment: CentOS 5
>            Reporter: Chris Kline
>            Assignee: Jim Kellerman
>            Priority: Critical
>             Fix For: 0.16.0

[jira] Commented: (HADOOP-2500) [HBase] Unreadable region kills region servers

Posted by "Jim Kellerman (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-2500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12558032#action_12558032 ] 

Jim Kellerman commented on HADOOP-2500:
---------------------------------------

Bryan Duxbury wrote:
> At the very least, we should not assign a region to a region server if it is detected as "no good".

That is an unfortunate wording of a log message in the Master. It is saying that the current 
assignment of the region is no good because the information it read from the meta region
had a server or start code that did not match a known server. It does not mean that the
master thinks the region itself is no good.
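
As a rough illustration of that check, an assignment read from the meta region would be treated as current only if the recorded server is known and its start code matches the one the master has for that server. This sketch uses hypothetical names (AssignmentCheck, serverStarted, isAssignmentCurrent) and is not the HMaster code:

```java
import java.util.HashMap;
import java.util.Map;

public class AssignmentCheck {
    // Hypothetical view of the servers the master knows: address -> start code.
    private final Map<String, Long> knownServers = new HashMap<>();

    public void serverStarted(String address, long startCode) {
        knownServers.put(address, startCode);
    }

    /** True only if the recorded server is known and the start code matches. */
    public boolean isAssignmentCurrent(String recordedAddress, long recordedStartCode) {
        Long liveStartCode = knownServers.get(recordedAddress);
        return liveStartCode != null && liveStartCode == recordedStartCode;
    }
}
```

A false result here says only that the recorded assignment is stale, for example after a server restart. It says nothing about whether the region's files are readable.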

> Also, if a RegionServer tries to access a region and it has difficulties, it should report to the
> master that it can't read the region, and the master should stop trying to serve it.
> From a more general standpoint, maybe when a bad region is detected, its files should be 
> moved to a different location and generally excluded from the cluster. This would allow you to 
> recover from problems better.

Yes, we absolutely need to do something, just not sure exactly what yet.

One thing is certain: zero-length files should be ignored or deleted.


[jira] Assigned: (HADOOP-2500) [HBase] Unreadable region kills region servers

Posted by "Jim Kellerman (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-2500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jim Kellerman reassigned HADOOP-2500:
-------------------------------------

    Assignee: Jim Kellerman

[jira] Commented: (HADOOP-2500) [HBase] Unreadable region kills region servers

Posted by "Bryan Duxbury (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-2500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12555405#action_12555405 ] 

Bryan Duxbury commented on HADOOP-2500:
---------------------------------------

At the very least, we should not assign a region to a region server if it is detected as "no good". Also, if a RegionServer tries to access a region and it has difficulties, it should report to the master that it can't read the region, and the master should stop trying to serve it.

From a more general standpoint, maybe when a bad region is detected, its files should be moved to a different location and generally excluded from the cluster. This would allow you to recover from problems better.
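
One way the "stop trying to serve it" idea could look on the master side is a counter of failed opens per region, with the region treated as unassignable once a limit is reached. This is only a sketch of the proposal; the class, its methods and the attempt limit are all hypothetical, and nothing like it is claimed to exist in HBase.

```java
import java.util.HashMap;
import java.util.Map;

public class OpenFailureTracker {
    private final int maxAttempts;
    private final Map<String, Integer> failedOpens = new HashMap<>();

    public OpenFailureTracker(int maxAttempts) {
        this.maxAttempts = maxAttempts;
    }

    /**
     * Called when a region server reports that it could not open a region.
     * Returns true once the region has reached the failure limit.
     */
    public boolean recordFailure(String regionName) {
        int count = failedOpens.merge(regionName, 1, Integer::sum);
        return count >= maxAttempts;
    }

    /** A region stays assignable until it has failed maxAttempts times. */
    public boolean isAssignable(String regionName) {
        return failedOpens.getOrDefault(regionName, 0) < maxAttempts;
    }
}
```

Moving the region's files aside, as suggested above, would be a separate step taken once recordFailure returns true.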

[jira] Updated: (HADOOP-2500) [HBase] Unreadable region kills region servers

Posted by "stack (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-2500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HADOOP-2500:
--------------------------

    Priority: Critical  (was: Major)

Marked the issue critical



[jira] Resolved: (HADOOP-2500) [HBase] Unreadable region kills region servers

Posted by "Jim Kellerman (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-2500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jim Kellerman resolved HADOOP-2500.
-----------------------------------

       Resolution: Fixed
    Fix Version/s: 0.16.0

The patch submitted for HADOOP-2587 incorporates the fix for this issue. Tests passed. Committed.

> [HBase] Unreadable region kills region servers
> ----------------------------------------------
>
>                 Key: HADOOP-2500
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2500
>             Project: Hadoop
>          Issue Type: Bug
>          Components: contrib/hbase
>         Environment: CentOS 5
>            Reporter: Chris Kline
>            Assignee: Jim Kellerman
>            Priority: Critical
>             Fix For: 0.16.0
>


[jira] Commented: (HADOOP-2500) [HBase] Unreadable region kills region servers

Posted by "Bryan Duxbury (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-2500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12558036#action_12558036 ] 

Bryan Duxbury commented on HADOOP-2500:
---------------------------------------

So, we should:

 * Change the "no good" message to something a tad more descriptive, like "assignment of region is invalid"
 * Enumerate the known ways that a RegionServer can fail to serve a region, trap those problems, and figure out what responses we'd like to give to those events
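
The second point could be approached roughly as follows. This is only a sketch of the idea, not the HBase code or the fix that was committed: the class RegionOpenSketch, the enum OpenFailure, the interface RegionOpener, and the method tryOpen are all hypothetical. The point it illustrates is that each known failure is trapped and turned into a value the caller can report, so the worker thread keeps running instead of exiting on an unhandled exception.

```java
import java.io.EOFException;
import java.io.IOException;

public class RegionOpenSketch {

    // Hypothetical enumeration of ways opening a region can fail.
    enum OpenFailure { NONE, TRUNCATED_FILE, IO_ERROR, UNEXPECTED }

    // Hypothetical stand-in for whatever actually opens the region.
    interface RegionOpener {
        void open(String regionName) throws IOException;
    }

    // Traps each known failure mode and returns it as a value,
    // so the calling thread never dies from an exception thrown here.
    static OpenFailure tryOpen(String regionName, RegionOpener opener) {
        try {
            opener.open(regionName);
            return OpenFailure.NONE;
        } catch (EOFException e) {
            // For example, a zero-length or truncated log file.
            return OpenFailure.TRUNCATED_FILE;
        } catch (IOException e) {
            return OpenFailure.IO_ERROR;
        } catch (RuntimeException e) {
            // For example, a NullPointerException in cleanup code.
            return OpenFailure.UNEXPECTED;
        }
    }

    public static void main(String[] args) {
        OpenFailure result = tryOpen("example_region", name -> {
            throw new EOFException("log file is empty");
        });
        // The caller decides the response: report to the master, retry, skip, etc.
        System.out.println("open result: " + result);
    }
}
```

What response belongs to each failure (reassign elsewhere, mark the region offline, alert an operator) is the open question raised above and is deliberately left out of the sketch.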
 

