Posted to user@hbase.apache.org by James Baldassari <jb...@gmail.com> on 2010/06/08 22:57:46 UTC

Re: HBase fail-over/reliability issues

Well it took almost a month, but I finally saw this problem in the wild
again (0.20.3 + HBASE-2180), and this time I think I have more info about
what happened.  The symptoms were the same as usual: read throughput drops
drastically, the HBase client throws tons of errors, and at least one region
server has lots of errors in its log.  I can't figure out exactly what went
wrong, but I think this is the series of events:

1. A region split occurred, and region server rts-hdfs01 opened a new region
(users,ce5ba693689ed0648c99fb87b5eb2b53,1275964270542)
2. There was a duplicate region assignment (the new region from #1 above),
and it appears that both rts-hdfs01 and rts-hdfs02 had that region for about
15 seconds
3. During the time that rts-hdfs01 and rts-hdfs02 had the same region, they
both started a compaction on that region almost simultaneously
4. rts-hdfs02 detected the duplicate region assignment and closed the region
while rts-hdfs01 was still running its compaction on that region
5. rts-hdfs01 got an error that its lease expired on some compaction
directory (caused by rts-hdfs02 closing the region, I assume), and it logged
an error that the compaction failed
6. rts-hdfs01 then logged lots of "Got error in response to OP_READ_BLOCK for
file", "No live nodes contain current block", and "Cannot open filename"
errors (the client sees these same errors; see the sketch just after this
list)
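
As a stopgap on the client side, since these failures surface as IOExceptions
straight out of HTable.get(), wrapping the get in a dumb retry at least masks
the shorter blips.  This is only a rough sketch against the 0.20 client API
(the helper, backoff, and hard-coded row key are illustrative, not what we
actually run):

  import java.io.IOException;

  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.Get;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.util.Bytes;

  // Hypothetical helper: retry a get a few times with a crude backoff before
  // giving up, since the "Cannot open filename" errors come out of
  // HTable.get() as plain IOExceptions.
  public class RetryingGet {

      static Result getWithRetry(HTable table, byte[] row, int attempts)
              throws IOException, InterruptedException {
          IOException last = null;
          for (int i = 0; i < attempts; i++) {
              try {
                  Get get = new Get(row);
                  get.addFamily(Bytes.toBytes("data")); // 'data' family on our users table
                  return table.get(get);
              } catch (IOException e) {
                  last = e; // e.g. Cannot open filename /hbase/users/...
                  Thread.sleep(1000L * (i + 1)); // crude linear backoff
              }
          }
          throw last; // attempts is assumed > 0
      }

      public static void main(String[] args) throws Exception {
          HBaseConfiguration conf = new HBaseConfiguration(); // reads hbase-site.xml
          HTable table = new HTable(conf, "users");
          Result result = getWithRetry(table,
              Bytes.toBytes("963aba6c5f351f5655abdc9db82a4cbd"), 3);
          System.out.println(result);
      }
  }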

I was able to fix the problem by simply restarting the region server on
rts-hdfs01.  After that region server came back up and the master finished
reassigning all the regions, everything was back to normal.  I didn't
restart any other region servers, any data nodes, or the master.  I'm still
not an expert in HBase internals, but it seems like this is some bug related
to region splitting and assignment of daughter regions.  Maybe this is a
known issue in 0.20.3?  Here are the relevant logs:

Region server log for rts-hdfs01, the region server that got the error about
the lease expiring and all the read errors:
http://pastebin.com/0L6fS4du
The problem starts on line 38.

Region server log for rts-hdfs02, the region server that got the duplicate
region (users,ce5ba693689ed0648c99fb87b5eb2b53,1275964270542):
http://pastebin.com/EmMtF4V8
See lines 29-36 where the duplicate region assignment occurs.

The master log: http://pastebin.com/YCa3L9e1
See lines 14-33.
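
One more note in case it helps with the diagnosis: a quick way to see which
server .META. currently claims is serving the suspect region is a one-off get
against .META.  This is a rough sketch with the 0.20 client API (I'm going
from memory on the exact Result accessors, so treat it as approximate):

  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.Get;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.util.Bytes;

  public class WhoServesRegion {
      public static void main(String[] args) throws Exception {
          HBaseConfiguration conf = new HBaseConfiguration();
          // .META. rows are keyed by region name: table,startkey,regionid
          HTable meta = new HTable(conf, ".META.");
          Get get = new Get(Bytes.toBytes(
              "users,ce5ba693689ed0648c99fb87b5eb2b53,1275964270542"));
          get.addFamily(Bytes.toBytes("info"));
          Result r = meta.get(get);
          // info:server should hold host:port of the assigned region server
          byte[] server = r.getValue(Bytes.toBytes("info"), Bytes.toBytes("server"));
          System.out.println(server == null ? "unassigned?" : Bytes.toString(server));
      }
  }

(.META. only ever has one row per region, so the brief double assignment
itself really only shows up in the region server logs.)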

Thanks,
James


On Tue, May 11, 2010 at 1:01 AM, Todd Lipcon <to...@cloudera.com> wrote:

> Hi James,
>
> I'd recommend just the following in your log4j properties to tone down
> the log volume:
>
> log4j.logger.org.apache.hadoop.fs.FSNamesystem.audit=WARN
>
> log4j.logger.org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace=WARN
>
> This will keep the INFO level logs that are very useful for debugging
> the issue, but shouldn't generate GB after GB.
>
> I'm both sorry and glad to hear your problem is reproducing every day.
> The bad news is you've got a problem, the good news is we should be
> able to get to the bottom of it much faster! :)
>
> -Todd
>
>
> On Mon, May 10, 2010 at 9:59 AM, James Baldassari <jb...@gmail.com>
> wrote:
> > Hi Todd,
> >
> > Our log files were getting to be several gigabytes in size at the INFO
> > level (particularly the datanode logs), so we changed the log level in all
> > log4j configs to be WARN.  Do you think we're potentially missing some
> > useful information at INFO and lower?  I could lower the log level if you
> > think it would help.
> >
> > I did go back as far as I could in the logs to try to find out what was
> > going on as soon as the problem appeared.  I didn't see anything
> > interesting, possibly because the log level was at WARN.  The only two
> > errors I see are these:
> >
> > 2010-05-06 19:15:56,176 ERROR
> > org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(
> > 10.25.119.115:50010,
> > storageID=DS-2139827131-10.25.119.115-50010-1266376030910, infoPort=50075,
> > ipcPort=50020):DataXceiver
> > java.io.EOFException
> >
> > This happens once every 5 minutes like clockwork.  I'm not sure if this has
> > anything to do with the problem.  Then all of a sudden those errors about
> > the block being invalid started showing up:
> >
> > 2010-05-06 19:17:37,946 WARN
> > org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(
> > 10.25.119.115:50010,
> > storageID=DS-2139827131-10.25.119.115-50010-1266376030910, infoPort=50075,
> > ipcPort=50020):Got exception while serving blk_-26696347696547536_49275 to
> > /10.24.166.67:
> > java.io.IOException: Block blk_-26696347696547536_49275 is not valid.
> >
> > So let me know if I should change the log level and whether I need to
> > enable NN clienttrace.  I don't think I'll have to wait long for this
> > problem to reappear.  It seems to be happening almost every day now.
> >
> > Thanks,
> > James
> >
> >
> > On Sat, May 8, 2010 at 6:43 PM, Todd Lipcon <to...@cloudera.com> wrote:
> >
> >> Hi James,
> >>
> >> You'll need to go farther back in the logs to find what happened to the
> >> block that caused it to get deleted.  All of the logs below are too late
> >> (the block's already gone, we need to figure out why).
> >>
> >> Can you look backwards through the past several days of the NN logs?  Have
> >> you disabled the NN clienttrace log in log4j.properties?
> >>
> >> -Todd
> >>
> >> On Fri, May 7, 2010 at 11:17 PM, James Baldassari <jbaldassari@gmail.com>
> >> wrote:
> >>
> >> > OK, these logs are huge, so I'm just going to post the first 1,000 lines
> >> > from each for now.  Let me know if it would be helpful to have more.  The
> >> > namenode logs didn't contain either of the strings you were interested
> >> > in.  A few of the datanode logs had '4841840178880951849':
> >> >
> >> > http://pastebin.com/4MxzkxjW
> >> > http://pastebin.com/C7XErfTJ
> >> > http://pastebin.com/tn8pk2Xa
> >> >
> >> > One of the region servers had log entries about the region 73382377:
> >> >
> >> > http://pastebin.com/8uh6hQxQ
> >> >
> >> > We rebooted the name node as well as all data nodes and region servers;
> >> > then we ran a major compaction on both of our tables.  The cluster seems
> >> > stable at this point, but I have a bad feeling that this problem may hit
> >> > us again soon because it has happened twice in the past two days now.
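> >> >
> >> > In case it's useful to anyone else following this thread: the major
> >> > compaction can also be kicked off programmatically instead of from the
> >> > shell.  This is just a rough sketch against the 0.20 HBaseAdmin API as I
> >> > understand it, not the tooling we actually used:
> >> >
> >> >   import org.apache.hadoop.hbase.HBaseConfiguration;
> >> >   import org.apache.hadoop.hbase.client.HBaseAdmin;
> >> >
> >> >   public class CompactUsers {
> >> >       public static void main(String[] args) throws Exception {
> >> >           HBaseConfiguration conf = new HBaseConfiguration(); // reads hbase-site.xml
> >> >           HBaseAdmin admin = new HBaseAdmin(conf);
> >> >           // Asks the region servers to major compact every region of
> >> >           // the 'users' table; the request is asynchronous.
> >> >           admin.majorCompact("users");
> >> >       }
> >> >   }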
> >> >
> >> > -James
> >> >
> >> >
> >> > On Sat, May 8, 2010 at 12:30 AM, Todd Lipcon <to...@cloudera.com>
> >> > wrote:
> >> >
> >> > > If you can grep for '4841840178880951849' as well
> >> > > as /hbase/users/73382377/data/312780071564432169 across all of your
> >> > > datanode logs plus your NN, and put that online somewhere, that would
> >> > > be great.  If you can grep with -C 20 to get some context that would
> >> > > help as well.
> >> > >
> >> > > Grepping for the region in question (73382377) in the RS logs would
> >> > > also be helpful.
> >> > >
> >> > > Thanks
> >> > > -Todd
> >> > >
> >> > > On Fri, May 7, 2010 at 9:16 PM, James Baldassari <jbaldassari@gmail.com>
> >> > > wrote:
> >> > >
> >> > > > On Sat, May 8, 2010 at 12:02 AM, Stack <st...@duboce.net> wrote:
> >> > > >
> >> > > > > On Fri, May 7, 2010 at 8:27 PM, James Baldassari <jbaldassari@gmail.com>
> >> > > > > wrote:
> >> > > > > > java.io.IOException: Cannot open filename
> >> > > > > > /hbase/users/73382377/data/312780071564432169
> >> > > > > >
> >> > > > > This is the regionserver log?  Is this deploying the region?  It
> >> > > > > fails?
> >> > > > >
> >> > > >
> >> > > > This error is on the client accessing HBase.  This exception was
> >> > > > thrown on a get call to an HTable instance.  I'm not sure if it was
> >> > > > deploying the region.  All I know is that the system had been running
> >> > > > with all regions available (as far as I know), and then all of a
> >> > > > sudden these errors started showing up on the client.
> >> > > >
> >> > > >
> >> > > > >
> >> > > > > > Our cluster throughput goes from around 3k requests/second down
> >> > > > > > to 500-1000 and does not recover without manual intervention.
> >> > > > > > The region server log for that region says:
> >> > > > > >
> >> > > > > > WARN org.apache.hadoop.hdfs.DFSClient: Failed to connect to
> >> > > > > > /10.24.166.74:50010 for file
> >> > > > > > /hbase/users/73382377/data/312780071564432169 for block
> >> > > > > > -4841840178880951849:java.io.IOException: Got error in response
> >> > > > > > to OP_READ_BLOCK for file
> >> > > > > > /hbase/users/73382377/data/312780071564432169 for block
> >> > > > > > -4841840178880951849
> >> > > > > >
> >> > > > > > INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 40 on
> >> > > > > > 60020, call get([B@25f907b4, row=963aba6c5f351f5655abdc9db82a4cbd,
> >> > > > > > maxVersions=1, timeRange=[0,9223372036854775807),
> >> > > > > > families={(family=data, columns=ALL}) from 10.24.117.100:2365:
> >> > > > > > error: java.io.IOException: Cannot open filename
> >> > > > > > /hbase/users/73382377/data/312780071564432169
> >> > > > > > java.io.IOException: Cannot open filename
> >> > > > > > /hbase/users/73382377/data/312780071564432169
> >> > > > > >
> >> > > > > > The datanode log for 10.24.166.74 says:
> >> > > > > >
> >> > > > > > WARN org.apache.hadoop.hdfs.server.datanode.DataNode:
> >> > > > > > DatanodeRegistration(10.24.166.74:50010,
> >> > > > > > storageID=DS-14401423-10.24.166.74-50010-1270741415211,
> >> > > > > > infoPort=50075, ipcPort=50020):
> >> > > > > > Got exception while serving blk_-4841840178880951849_50277 to
> >> > > > > > /10.25.119.113:
> >> > > > > > java.io.IOException: Block blk_-4841840178880951849_50277 is not
> >> > > > > > valid.
> >> > > > > >
> >> > > > >
> >> > > > > What's your hadoop?  Is it 0.20.2 or CDH?  Any patches?
> >> > > > >
> >> > > >
> >> > > > Hadoop is vanilla CDH 2.  HBase is 0.20.3 + HBASE-2180.
> >> > > >
> >> > > >
> >> > > > >
> >> > > > >
> >> > > > > > Running a major compaction on the users table fixed the problem
> >> > > > > > the first time it happened, but this time the major compaction
> >> > > > > > didn't fix it, so we're in the process of rebooting the whole
> >> > > > > > cluster.  I'm wondering a few things:
> >> > > > > >
> >> > > > > > 1. What could trigger this problem?
> >> > > > > > 2. Why can't the system fail over to another
> >> > > > > >    block/file/datanode/region server?  We're using 3x replication
> >> > > > > >    in HDFS, and we have 8 data nodes which double as our region
> >> > > > > >    servers.
> >> > > > > > 3. Are there any best practices for achieving high availability
> >> > > > > >    in an HBase cluster?  How can I configure the system to
> >> > > > > >    gracefully (and automatically) handle these types of problems?
> >> > > > > >
> >> > > > >
> >> > > > > Let us know what your hadoop is and then we figure more on the
> >> > > > > issues above.
> >> > > > >
> >> > > >
> >> > > > If you need complete stack traces or any additional information,
> >> > > > please let me know.
> >> > > >
> >> > > >
> >> > > > > Thanks James,
> >> > > > > St.Ack
> >> > > > > P.S. It's an eight node cluster on what kinda hw?  (You've probably
> >> > > > > said in the past and I can dig through mail -- just say -- and then
> >> > > > > what kind of loading are you applying?  Ditto for if you've said
> >> > > > > this already)
> >> > > > >
> >> > > >
> >> > >
> >> > >
> >> > >
> >> > > --
> >> > > Todd Lipcon
> >> > > Software Engineer, Cloudera
> >> > >
> >> >
> >>
> >>
> >>
> >> --
> >> Todd Lipcon
> >> Software Engineer, Cloudera
> >>
> >
>
>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
>