Posted to user@hbase.apache.org by Chet Murthy <ch...@watson.ibm.com> on 2011/01/06 07:37:38 UTC

perplexing HBase bug: looking for where to learn how to debug

I've just started using hbase, and have encountered a perplexing bug.
The bug occurs on one set of Linux boxes, and not on another set, even
though they're both x86_64 Linux, and both are running -identical- JVM
releases.

I've attached a description of the problem below, but really, what I'm
wondering is whether there's a description someplace of the various places
to turn on instrumentation in hbase, so I can figure out what's wrong.  I
plan to do a lot of work with hbase in the future, so knowing how to
debug it is in some sense more important than finding out the fix for
this particular bug.

I really am looking to learn how to fish here.  I'm sure I can slowly
dig around and find all the various tracing facilities and such, but I
figured there might be a cheat-sheet someplace ....

Thanks,
--chet--

================================================================

Basically, I set up hadoop 0.20.0 + hbase 0.20.6, in a cluster with 1
namenode, and anywhere from 2-5 datanodes which are also
regionservers.  I'm running a single zookeeper node, since this is
just for testing.  Furthermore, all these machines are isolated,
high-performance, SMP, with lots of memory.  Modern Intel/AMD boxes.

The cluster which "works" runs Fedora 9 on Opteron, and the one that
"fails" runs RHEL5 on Intel Xeon (something-or-other -- I forget).

The test I'm running is the Yahoo! Cloud Serving Benchmark (YCSB).  I'm just
trying to load 1m records, and on the cluster that fails, I get,
variously:

(a) a load will fail with an error like:

com.yahoo.ycsb.DBException: 
org.apache.hadoop.hbase.client.RetriesExhaustedException: Trying to contact 
region server  -- nothing found, no 'location' returned, tableName=usertable, 
reload=true -- for region , row 'user1000015788', but failed after 11 
attempts.
Exceptions:
org.apache.hadoop.hbase.client.NoServerForRegionException: No server address
listed in .META. for region usertable,,1294095537393
org.apache.hadoop.hbase.client.NoServerForRegionException: No server address
listed in .META. for region usertable,,1294095537393

(b) a load will succeed, but there won't be 1m rows (where I use the
"count" command in "hbase shell" to count).

(c) sometimes, a "truncate" will fail, with an error of the form
above.  The step which fails is the "disable" step.

Java stack-dumps from the regionservers don't show any threads doing
anything interesting.  I don't know how to interrogate Zookeeper;
perhaps there's something messed-up in there ....

RE: perplexing HBase bug: looking for where to learn how to debug

Posted by Jonathan Gray <jg...@fb.com>.

The first step to debugging HBase is usually going through the Master and RegionServer logs.  Sometimes it can be more art than science but a majority of our debugging is done with log analysis.

If you can find specific offending regions, you can parse through the logs looking for mentions of that region and see where things went wrong.
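As a rough illustration of that kind of log search, here is a minimal Python sketch that prints every line mentioning a given region across a set of log files. The function name `lines_mentioning` and the script name in the usage line are made up for this example, and the log file paths are whatever your installation writes; nothing here is part of HBase itself.

```python
import sys
from pathlib import Path


def lines_mentioning(region, log_paths):
    """Yield (file name, line number, line) for each log line containing `region`."""
    for path in log_paths:
        # errors="replace" keeps the scan going if a log has stray bytes
        with open(path, errors="replace") as log:
            for number, line in enumerate(log, start=1):
                if region in line:
                    yield Path(path).name, number, line.rstrip("\n")


if __name__ == "__main__":
    if len(sys.argv) < 3:
        # hypothetical script name; region string taken from the error above
        print("usage: region_grep.py 'usertable,,1294095537393' LOGFILE...")
    else:
        region, *paths = sys.argv[1:]
        for name, number, line in lines_mentioning(region, paths):
            print(f"{name}:{number}: {line}")
```

Running it over the Master log and each RegionServer log with the region name from the exception gives the lines for that region from each file in turn, which is usually enough to see where an assignment went missing.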

If you're just getting started with HBase, I would also recommend working with the latest 0.90RC, as issues like the ones you're seeing have been fixed since the release you're running.

JG

> -----Original Message-----
> From: Chet Murthy [mailto:chet@watson.ibm.com]
> Sent: Wednesday, January 05, 2011 10:38 PM
> To: user@hbase.apache.org; dev@hbase.apache.org
> Subject: perplexing HBase bug: looking for where to learn how to debug
> 
> 
> I've just started using hbase, and have encountered a perplexing bug.
> The bug occurs on one set of Linux boxes, and not on another set, even
> though they're both x86_64 Linux, and both are running -identical- JVM
> releases.
> 
> I've attached a description of the problem below, but really, what I'm
> wondering is, if there's a description someplace of various places to turn on
> instrumentation in hbase, so I can figure out what's wrong.  I plan to do a lot
> of work with hbase in the future, so knowing how to debug it is in some
> sense more important than finding out the fix for this particular bug.
> 
> I really am looking to learn how to fish here.  I'm sure I can slowly dig around
> and find all the various tracing facilities and such, but I figured there might be a
> cheat-sheet someplace ....
> 
> Thanks,
> --chet--
> 
> ================================================================
> 
> Basically, I set up hadoop 0.20.0 + hbase 0.20.6, in a cluster with 1 namenode,
> and anywhere from 2-5 datanodes which are also regionservers.  I'm running
> a single zookeeper node, since this is just for testing.  Furthermore, all these
> machines are isolated, high-performance, SMP, with lots of memory.
> Modern Intel/AMD boxes.
> 
> The cluster which "works" runs Fedora 9 on Opteron, and the one that "fails"
> runs RHEL5 on Intel Xeon (something-or-other -- I forget).
> 
> The test I'm running is the Yahoo! Cloud Serving Benchmark (YCSB).  I'm just trying to
> load 1m records, and on the cluster that fails, I get,
> variously:
> 
> (a) a load will fail with an error like:
> 
> com.yahoo.ycsb.DBException:
> org.apache.hadoop.hbase.client.RetriesExhaustedException: Trying to
> contact region server  -- nothing found, no 'location' returned,
> tableName=usertable, reload=true -- for region , row 'user1000015788', but
> failed after 11 attempts.
> Exceptions:
> org.apache.hadoop.hbase.client.NoServerForRegionException: No server
> address listed in .META. for region usertable,,1294095537393
> org.apache.hadoop.hbase.client.NoServerForRegionException: No server
> address listed in .META. for region usertable,,1294095537393
> 
> (b) a load will succeed, but there won't be 1m rows (where I use the "count"
> command in "hbase shell" to count).
> 
> (c) sometimes, a "truncate" will fail, with an error of the form above.  The
> step which fails is the "disable" step.
> 
> Java stack-dumps from the regionservers don't show any threads doing
> anything interesting.  I don't know how to interrogate Zookeeper; perhaps
> there's something messed-up in there ....
