Posted to user@hbase.apache.org by Vidhyashankar Venkataraman <vi...@yahoo-inc.com> on 2011/01/28 08:56:05 UTC

Unresponsive master in Hbase 0.90.0

Hi

I am not sure whether this was observed before but I have been running into issues with a hung master.

This is on a 700 node cluster with Hbase 0.90.0 and Hadoop 0.20.x.

Every now and then, the master fails to respond (any request throws MasterNotRunningException), gets stuck, and remains unresponsive. The master process is alive. The last line in the master log says "Waiting for split writer threads to finish", logged just after startup (log messages shown below), and there are no more log messages after that even after a couple of hours. jstack -F on the process throws a DebuggerException on every thread and reports no deadlocks.

Is there any other way to monitor the master? I didn't observe this in my small-scale (40-node) tests: Hbase 0.90 worked just fine.

Cheers
Vidhya





2011-01-28 07:35:49,866 INFO org.apache.hadoop.hbase.master.MasterFileSystem: Log folder hdfs://b3110120.yst.yahoo.net:4600/hbase/.logs/b3110270.yst.yahoo.net,60020,1296199618314 belongs to an existing region server
2011-01-28 07:35:49,866 INFO org.apache.hadoop.hbase.master.MasterFileSystem: Log folder hdfs://b3110120.yst.yahoo.net:4600/hbase/.logs/b3110271.yst.yahoo.net,60020,1296183218615 doesn't belong to a known region server, splitting
2011-01-28 07:35:49,867 INFO org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Splitting 1 hlog(s) in hdfs://b3110120.yst.yahoo.net:4600/hbase/.logs/b3110271.yst.yahoo.net,60020,1296183218615
2011-01-28 07:35:49,867 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Writer thread Thread[WriterThread-0,5,main]: starting
2011-01-28 07:35:49,867 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Writer thread Thread[WriterThread-1,5,main]: starting
2011-01-28 07:35:49,867 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Splitting hlog 1 of 1: hdfs://b3110120.yst.yahoo.net:4600/hbase/.logs/b3110271.yst.yahoo.net,60020,1296183218615/b3110271.yst.yahoo.net%3A60020.1296183219266, length=0
2011-01-28 07:35:49,867 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Writer thread Thread[WriterThread-2,5,main]: starting
2011-01-28 07:35:49,867 WARN org.apache.hadoop.hbase.util.FSUtils: Running on HDFS without append enabled may result in data loss
2011-01-28 07:35:49,867 WARN org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: File hdfs://b3110120.yst.yahoo.net:4600/hbase/.logs/b3110271.yst.yahoo.net,60020,1296183218615/b3110271.yst.yahoo.net%3A60020.1296183219266 might be still open, length is 0
2011-01-28 07:35:49,869 WARN org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Could not open hdfs://b3110120.yst.yahoo.net:4600/hbase/.logs/b3110271.yst.yahoo.net,60020,1296183218615/b3110271.yst.yahoo.net%3A60020.1296183219266 for reading. File is empty
java.io.EOFException
2011-01-28 07:35:49,875 INFO org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Archived processed log hdfs://b3110120.yst.yahoo.net:4600/hbase/.logs/b3110271.yst.yahoo.net,60020,1296183218615/b3110271.yst.yahoo.net%3A60020.1296183219266 to hdfs://b3110120.yst.yahoo.net:4600/hbase/.oldlogs/b3110271.yst.yahoo.net%3A60020.1296183219266
2011-01-28 07:35:49,877 INFO org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Waiting for split writer threads to finish

Re: Unresponsive master in Hbase 0.90.0

Posted by Stack <st...@duboce.net>.
It's not in the manual yet, Vidhya.  Assignment has completely changed
in 0.90.  No more do we assign by adding payload to the heartbeat.
Now we assign by direct rpc from master to regionserver with master
and regionserver moving the region through state changes up in zk
until region is successfully opened (OFFLINE->OPENING->OPENED -- with
OPENING done a few times to make sure there has not been intercession
because operations were taking too long).
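In code, the transitions described above amount to something like the following small state machine (an illustrative sketch with invented names, not the HBase 0.90 source):

```java
// Illustrative sketch of the region-open transitions described above.
// Class and method names are invented; this is not HBase source.
public class RegionOpenStates {
    enum State { OFFLINE, OPENING, OPENED }

    // Master puts the region's znode in OFFLINE; the regionserver moves it
    // OFFLINE -> OPENING (re-asserting OPENING while the open is in
    // progress, so the master can tell it hasn't stalled) -> OPENED.
    static boolean isLegalTransition(State from, State to) {
        switch (from) {
            case OFFLINE: return to == State.OPENING;
            case OPENING: return to == State.OPENING || to == State.OPENED;
            default:      return false; // OPENED ends the open sequence
        }
    }

    public static void main(String[] args) {
        State[] path = { State.OFFLINE, State.OPENING, State.OPENING, State.OPENED };
        for (int i = 1; i < path.length; i++) {
            if (!isLegalTransition(path[i - 1], path[i])) {
                throw new AssertionError("illegal transition at step " + i);
            }
        }
        System.out.println("OFFLINE->OPENING->OPENING->OPENED is a legal open sequence");
    }
}
```

Repeating OPENING is deliberate: it lets the master distinguish a slow-but-alive open from a stalled one.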

Let me explain more:

In new master, on fresh startup (as opposed to a master joining an
already running cluster), after waiting on regionserver checkin and
having assigned root and meta, we'll take a distinct 'fresh startup'
code path.  We scan .META. for all entries.  We then give all entries
to the load balancer.  It produces an assignment plan that is per
regionserver based.  We then fire up a little executor service that
will run a bounded number of threads concurrently.  Each running
thread manages the bulk assign of regions to a particular
regionserver.
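The bulk-assign shape described above (one bounded pool, one task per regionserver) might look roughly like this; names are hypothetical and the real master does far more bookkeeping per task:

```java
// Sketch of bulk assignment: a bounded thread pool where each task hands
// one regionserver its whole list of regions. Hypothetical names only.
import java.util.*;
import java.util.concurrent.*;

public class BulkAssignSketch {
    // Run the per-regionserver assignments on a bounded pool and return
    // how many regions were handed out in total.
    static int bulkAssign(Map<String, List<String>> plan, int poolSize)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        List<Future<Integer>> results = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : plan.entrySet()) {
            // One task per regionserver: in the real master this would be a
            // single RPC carrying the full region list for that server.
            results.add(pool.submit(() -> e.getValue().size()));
        }
        int assigned = 0;
        for (Future<Integer> f : results) assigned += f.get();
        pool.shutdown();
        return assigned;
    }

    public static void main(String[] args) throws Exception {
        // Assignment "plan": regionserver -> regions it should open.
        Map<String, List<String>> plan = new LinkedHashMap<>();
        plan.put("rs1", Arrays.asList("region-a", "region-b"));
        plan.put("rs2", Arrays.asList("region-c"));
        System.out.println("assigned " + bulkAssign(plan, 2) + " regions");
    }
}
```

The bounded pool is what keeps a 700-node startup from opening 700 concurrent assignment threads at once.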

Assignment now can be in general a little slower because all state
transitions are mediated via zookeeper rather than in-memory in the
master.  But, in this special bulk assign startup mode, we make use of
zk's async ops so we do bulk state transition changes up in zk rather
than manage individual changes, so it all runs faster.  There is a new
rpc where we can dump on the RS all the regions it's to open.  ZK timeouts
during this startup phase are all extended.

The bulk assign threads per regionserver stay up until all regions
have opened on an individual regionserver, then the executor finishes
and the next runs (We could be better here -- especially on a cluster
of 700 nodes).  I spent time timing this stuff and I'd say bulk assign
even with the async zk ops is probably slower than how we used to do it,
but not by much.

The new master logs are very different from the old, so it might take a
while to get your head around what's going on.  Hopefully you can
avoid having to do this.

What are you seeing?

St.Ack



On Mon, Jan 31, 2011 at 10:19 AM, Vidhyashankar Venkataraman
<vi...@yahoo-inc.com> wrote:
> Yes, I will file an issue after collecting the right logs.
>
> We will try finding the cause of the META server choke.
>
> Another question: the master still seems to be taking (a lot of) time to load the table during startup: I found that the regions percheckin config variable isnt used anymore. I havent looked at that part of the code yet, but what is now the master's part in assigning regions in 0.90? (Can you let me know if they are explained in Hbase docs in the release?)
>
> Thank you
> Vidhya
>
> On 1/31/11 10:06 AM, "Stack" <st...@duboce.net> wrote:
>
> On Mon, Jan 31, 2011 at 9:54 AM, Vidhyashankar Venkataraman
> <vi...@yahoo-inc.com> wrote:
>> The Hbase cluster doesn't have the master problems with hadoop-append turned on: we will try finding out why it wasn't working with a non-append version of hadoop (with a previous version of hadoop, it was getting stuck while splitting logs).
>>
>
> I'd say don't bother Vidhya.  You should run w/ append anyways.
> Meantime file an issue if you don't mind and dump in there your data.
> We need to take care of this so others don't trip over what you saw.
> I'm sure that plenty of users will innocently try to bring up 0.90.0
> on an Hadoop w/o append.
>
>> But there are other issues now (with append turned on) which we are trying to resolve. The region server that's hosting the META region is getting choked after a table was loaded with around 100 regions per server (this is likely the target load that we wanted to have and this worked in 0.89 with the same number of nodes and Hbase 0.90 worked fine with 40 nodes and that's why I started straight with this number). The node can be pinged, but not accessible through ssh and I am unable to perform most hbase operations on the cluster as a result.
>>
>>   Can the RS hosting META be a potential bottleneck in the system at all? (I will try shutting down that particular node and see what happens).
>>
>
> At 700 nodes scale, its quite possible we're doing something dumb.
> Any data you can glean to help us here would be appreciated.  I'd have
> thought that 0.90.0 would put less load on .META. since we've removed
> some of the reasons for .META. access.
>
> St.Ack
>
>

Re: Unresponsive master in Hbase 0.90.0

Posted by Vidhyashankar Venkataraman <vi...@yahoo-inc.com>.
Yes, I will file an issue after collecting the right logs.

We will try finding the cause of the META server choke.

Another question: the master still seems to be taking (a lot of) time to load the table during startup: I found that the regions percheckin config variable isn't used anymore. I haven't looked at that part of the code yet, but what is now the master's part in assigning regions in 0.90? (Can you let me know if it's explained in the Hbase docs in the release?)

Thank you
Vidhya

On 1/31/11 10:06 AM, "Stack" <st...@duboce.net> wrote:

On Mon, Jan 31, 2011 at 9:54 AM, Vidhyashankar Venkataraman
<vi...@yahoo-inc.com> wrote:
> The Hbase cluster doesn't have the master problems with hadoop-append turned on: we will try finding out why it wasn't working with a non-append version of hadoop (with a previous version of hadoop, it was getting stuck while splitting logs).
>

I'd say don't bother Vidhya.  You should run w/ append anyways.
Meantime file an issue if you don't mind and dump in there your data.
We need to take care of this so others don't trip over what you saw.
I'm sure that plenty of users will innocently try to bring up 0.90.0
on an Hadoop w/o append.

> But there are other issues now (with append turned on) which we are trying to resolve. The region server that's hosting the META region is getting choked after a table was loaded with around 100 regions per server (this is likely the target load that we wanted to have and this worked in 0.89 with the same number of nodes and Hbase 0.90 worked fine with 40 nodes and that's why I started straight with this number). The node can be pinged, but not accessible through ssh and I am unable to perform most hbase operations on the cluster as a result.
>
>   Can the RS hosting META be a potential bottleneck in the system at all? (I will try shutting down that particular node and see what happens).
>

At 700 nodes scale, its quite possible we're doing something dumb.
Any data you can glean to help us here would be appreciated.  I'd have
thought that 0.90.0 would put less load on .META. since we've removed
some of the reasons for .META. access.

St.Ack


Re: Unresponsive master in Hbase 0.90.0

Posted by Stack <st...@duboce.net>.
On Mon, Jan 31, 2011 at 9:54 AM, Vidhyashankar Venkataraman
<vi...@yahoo-inc.com> wrote:
> The Hbase cluster doesn't have the master problems with hadoop-append turned on: we will try finding out why it wasn't working with a non-append version of hadoop (with a previous version of hadoop, it was getting stuck while splitting logs).
>

I'd say don't bother Vidhya.  You should run w/ append anyways.
Meantime file an issue if you don't mind and dump in there your data.
We need to take care of this so others don't trip over what you saw.
I'm sure that plenty of users will innocently try to bring up 0.90.0
on an Hadoop w/o append.

> But there are other issues now (with append turned on) which we are trying to resolve. The region server that's hosting the META region is getting choked after a table was loaded with around 100 regions per server (this is likely the target load that we wanted to have and this worked in 0.89 with the same number of nodes and Hbase 0.90 worked fine with 40 nodes and that's why I started straight with this number). The node can be pinged, but not accessible through ssh and I am unable to perform most hbase operations on the cluster as a result.
>
>   Can the RS hosting META be a potential bottleneck in the system at all? (I will try shutting down that particular node and see what happens).
>

At 700-node scale, it's quite possible we're doing something dumb.
Any data you can glean to help us here would be appreciated.  I'd have
thought that 0.90.0 would put less load on .META. since we've removed
some of the reasons for .META. access.

St.Ack

Re: Unresponsive master in Hbase 0.90.0

Posted by Vidhyashankar Venkataraman <vi...@yahoo-inc.com>.
The Hbase cluster doesn't have the master problems with hadoop-append turned on: we will try finding out why it wasn't working with a non-append version of hadoop (with a previous version of hadoop, it was getting stuck while splitting logs).

But there are other issues now (with append turned on) which we are trying to resolve. The region server that's hosting the META region is getting choked after a table was loaded with around 100 regions per server. (This is likely the target load that we wanted to have; it worked in 0.89 with the same number of nodes, and Hbase 0.90 worked fine with 40 nodes, which is why I started straight with this number.) The node can be pinged but is not accessible through ssh, and I am unable to perform most hbase operations on the cluster as a result.

   Can the RS hosting META be a potential bottleneck in the system at all? (I will try shutting down that particular node and see what happens).

Vidhya


On 1/28/11 3:49 PM, "Vidhyashankar Venkataraman" <vi...@yahoo-inc.com> wrote:

64 bit Java 1.6.

Why is the master even trying to issue a split with an empty log/region in hand? ( private List<Path> splitLog(final FileStatus[] logfiles)  )

V

On 1/28/11 3:06 PM, "Todd Lipcon" <to...@cloudera.com> wrote:

The 16000 second sleep is really strange... never seen anything like it.

What JVM are you running?

-Todd

On Fri, Jan 28, 2011 at 11:29 AM, Stack <st...@duboce.net> wrote:

> On Fri, Jan 28, 2011 at 11:23 AM, Vidhyashankar Venkataraman
> <vi...@yahoo-inc.com> wrote:
> > We are working on trying to fix this (cc'ed Adam as well).
> >
> >>> Hmm.. maybe before you restart remove the directory
> >>> hdfs://b3110120.yst.yahoo.net:4600/hbase/.logs/ completely so no files
> >>> to be processed on restart.
> >
> > This one, I had tried during one of the attempts: and it created new logs
> directory and still hung at some point which I think was the same point. (I
> will have to dig in to see what exactly happened).
> >
> > We havent yet looked at that part of the code, but why is the master even
> trying to issue a split with an empty log/region in hand?
> >
>
> Can you tar up one of these regionserver dirs and put it somewhere I
> can pull?  I'll try it over here.
> St.Ack
>



--
Todd Lipcon
Software Engineer, Cloudera



Re: Unresponsive master in Hbase 0.90.0

Posted by Vidhyashankar Venkataraman <vi...@yahoo-inc.com>.
64 bit Java 1.6.

Why is the master even trying to issue a split with an empty log/region in hand? ( private List<Path> splitLog(final FileStatus[] logfiles)  )

V

On 1/28/11 3:06 PM, "Todd Lipcon" <to...@cloudera.com> wrote:

The 16000 second sleep is really strange... never seen anything like it.

What JVM are you running?

-Todd

On Fri, Jan 28, 2011 at 11:29 AM, Stack <st...@duboce.net> wrote:

> On Fri, Jan 28, 2011 at 11:23 AM, Vidhyashankar Venkataraman
> <vi...@yahoo-inc.com> wrote:
> > We are working on trying to fix this (cc'ed Adam as well).
> >
> >>> Hmm.. maybe before you restart remove the directory
> >>> hdfs://b3110120.yst.yahoo.net:4600/hbase/.logs/ completely so no files
> >>> to be processed on restart.
> >
> > This one, I had tried during one of the attempts: and it created new logs
> directory and still hung at some point which I think was the same point. (I
> will have to dig in to see what exactly happened).
> >
> > We havent yet looked at that part of the code, but why is the master even
> trying to issue a split with an empty log/region in hand?
> >
>
> Can you tar up one of these regionserver dirs and put it somewhere I
> can pull?  I'll try it over here.
> St.Ack
>



--
Todd Lipcon
Software Engineer, Cloudera


Re: Unresponsive master in Hbase 0.90.0

Posted by Todd Lipcon <to...@cloudera.com>.
The 16000 second sleep is really strange... never seen anything like it.

What JVM are you running?

-Todd

On Fri, Jan 28, 2011 at 11:29 AM, Stack <st...@duboce.net> wrote:

> On Fri, Jan 28, 2011 at 11:23 AM, Vidhyashankar Venkataraman
> <vi...@yahoo-inc.com> wrote:
> > We are working on trying to fix this (cc'ed Adam as well).
> >
> >>> Hmm.. maybe before you restart remove the directory
> >>> hdfs://b3110120.yst.yahoo.net:4600/hbase/.logs/ completely so no files
> >>> to be processed on restart.
> >
> > This one, I had tried during one of the attempts: and it created new logs
> directory and still hung at some point which I think was the same point. (I
> will have to dig in to see what exactly happened).
> >
> > We havent yet looked at that part of the code, but why is the master even
> trying to issue a split with an empty log/region in hand?
> >
>
> Can you tar up one of these regionserver dirs and put it somewhere I
> can pull?  I'll try it over here.
> St.Ack
>



-- 
Todd Lipcon
Software Engineer, Cloudera
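The "We slept Xms instead of Yms" warning Todd is reacting to comes from measuring real elapsed time around a periodic sleep: when the gap is enormous, the whole JVM was paused (GC, swap, VM freeze) rather than merely scheduled late. A minimal sketch of that check, not the actual HBase Sleeper source:

```java
// Sketch of an oversleep check: compare requested vs. actual sleep time
// and flag large gaps, which usually mean a JVM-wide pause (GC, swap).
// Not the actual org.apache.hadoop.hbase.util.Sleeper implementation.
public class OversleepCheck {
    // Returns a warning string when we overslept by more than one full
    // period, null when the sleep was close enough to what was asked for.
    static String checkSlept(long requestedMs, long actualMs) {
        if (actualMs - requestedMs > requestedMs) {
            return "We slept " + actualMs + "ms instead of " + requestedMs
                + "ms, this is likely due to a long pause of the process";
        }
        return null;
    }

    public static void main(String[] args) {
        // The figures from the master log in this thread.
        System.out.println(checkSlept(10000, 16625510));
    }
}
```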

Re: Unresponsive master in Hbase 0.90.0

Posted by Stack <st...@duboce.net>.
On Fri, Jan 28, 2011 at 11:23 AM, Vidhyashankar Venkataraman
<vi...@yahoo-inc.com> wrote:
> We are working on trying to fix this (cc'ed Adam as well).
>
>>> Hmm.. maybe before you restart remove the directory
>>> hdfs://b3110120.yst.yahoo.net:4600/hbase/.logs/ completely so no files
>>> to be processed on restart.
>
> This one, I had tried during one of the attempts: and it created new logs directory and still hung at some point which I think was the same point. (I will have to dig in to see what exactly happened).
>
> We havent yet looked at that part of the code, but why is the master even trying to issue a split with an empty log/region in hand?
>

Can you tar up one of these regionserver dirs and put it somewhere I
can pull?  I'll try it over here.
St.Ack

Re: Unresponsive master in Hbase 0.90.0

Posted by Vidhyashankar Venkataraman <vi...@yahoo-inc.com>.
We are working on trying to fix this (cc'ed Adam as well).

>> Hmm.. maybe before you restart remove the directory
>> hdfs://b3110120.yst.yahoo.net:4600/hbase/.logs/ completely so no files
>> to be processed on restart.

This one I had tried during one of the attempts: it created a new logs directory and still hung, at what I think was the same point. (I will have to dig in to see what exactly happened.)

We haven't yet looked at that part of the code, but why is the master even trying to issue a split with an empty log/region in hand?
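On the question above: a splitter could in principle screen out zero-length logs before opening them, since a zero-length WAL on a non-append HDFS usually means the file was never closed and carries nothing recoverable. A rough sketch of such a guard, with hypothetical names; this is not the actual HLogSplitter code:

```java
// Sketch: filter out zero-length logs before attempting to split them,
// instead of opening them and hitting EOFException. Hypothetical names.
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SkipEmptyLogs {
    // Given log file name -> length, keep only the logs worth splitting.
    static List<String> logsWorthSplitting(Map<String, Long> logLengths) {
        List<String> keep = new ArrayList<>();
        for (Map.Entry<String, Long> e : logLengths.entrySet()) {
            if (e.getValue() == 0L) {
                // Mirrors the "might be still open, length is 0" warning.
                System.out.println("Skipping " + e.getKey()
                    + ": length is 0, file might still be open");
            } else {
                keep.add(e.getKey());
            }
        }
        return keep;
    }

    public static void main(String[] args) {
        Map<String, Long> logs = new LinkedHashMap<>();
        logs.put("hlog-1", 0L);   // never closed: nothing to recover
        logs.put("hlog-2", 4096L);
        System.out.println("to split: " + logsWorthSplitting(logs));
    }
}
```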

Re: Unresponsive master in Hbase 0.90.0

Posted by Stack <st...@duboce.net>.
On Fri, Jan 28, 2011 at 10:40 AM, Vidhyashankar Venkataraman
<vi...@yahoo-inc.com> wrote:
>>> Is this new cluster start or master joining an already running cluster (looks
>>> like former).
>
> Either way, I get this problem. In particular, these logs were pulled out after I had done a createTable with boundaries (around 100 empty regions per node) and shut it down and then restarted. A similar thing happened when master was restarted on a running cluster.
>

If you move aside
hdfs://b3110120.yst.yahoo.net:4600/hbase/.logs/b3110271.yst.yahoo.net,60020,1296183218615,
does it come up then?




>>> Can you fix this?  Can you run w/ a working append?
> I will run with hadoop append and see what happens.
>

Well, we should figure this issue first if you don't mind but yeah,
you should be up on append.


>>> Can you do 'kill -QUIT PID' and see if anything shows in the .out file for hbase?
>>> It looks like a master is hung for sure.   My guess would be that
>>> we're making some presumption based on presence of append.  Lets
>>>figure it and fix.
>
> Some updates on the what happened with the master:
> 5 hours after the master was sleeping, I saw some activity in the logs (a thread timeout I suppose): Note the first line here was the last line in the log snippet I posted in the previous mail. I checked whether there was any mistake with the gc: I looked at the vmstat/gc logs and I couldn't find any swaps/ serious gc happening. The master has crashed with a number
>
> 2011-01-28 07:35:49,877 INFO org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Waiting for split writer threads to finish
> 2011-01-28 12:12:55,309 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 16625510ms instead of 10000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://wiki.apache.org/hadoop/Hbase/Troubleshooting#A9


That's an impressive sleep time, V.


> 2011-01-28 12:12:55,309 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 16629300ms for sessionid 0x32dcb848b98010e, closing socket connection and attempting reconnect
> 2011-01-28 12:12:55,309 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 16674993ms instead of 60000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://wiki.apache.org/hadoop/Hbase/Troubleshooting#A9
> 2011-01-28 12:12:55,310 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 16635680ms for sessionid 0x2dcb848b210138, closing socket connection and attempting reconnect
> 2011-01-28 12:12:55,310 INFO org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Split writers finished
> 2011-01-28 12:12:55,310 INFO org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: hlog file splitting completed in 16625444 ms for hdfs://b3110120.yst.yahoo.net:4600/hbase/.logs/b3110271.yst.yahoo.net,60020,1296183218615


So, it 'completed'.

I wonder what it was doing over that time period.  If you move aside
the above dir, does it come up?  If you put it back, do you repro the
above?  If you repro, can you do the kill -QUIT PID a few times?  It'd
be real interesting to see what it's up to.


> 2011-01-28 12:12:55,310 INFO org.apache.hadoop.hbase.master.MasterFileSystem: Log folder hdfs://b3110120.yst.yahoo.net:4600/hbase/.logs/b3110271.yst.yahoo.net,60020,1296185180336 doesn't belong to a known region server, splitting
> 2011-01-28 12:12:55,315 WARN org.apache.hadoop.hbase.master.LogCleaner: Error while cleaning the logs
> java.io.IOException: Call to b3110120.yst.yahoo.net/67.195.46.238:4600 failed on local exception: java.io.InterruptedIOException: Interruped while waiting for IO on channel java.nio.channels.SocketChannel[connected local=/67.195.48.110:56229 remote=b3110120.yst.yahoo.net/67.195.46.238:4600]. 59999 millis timeout left.
>        at org.apache.hadoop.ipc.Client.wrapException(Client.java:776)
>        at org.apache.hadoop.ipc.Client.call(Client.java:744)
>        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
>        at $Proxy5.getListing(Unknown Source)
>        at sun.reflect.GeneratedMethodAccessor8.invoke(Unknown Source)
>        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>        at java.lang.reflect.Method.invoke(Method.java:597)
>        at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
>        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
>        at $Proxy5.getListing(Unknown Source)
>        at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:606)
>        at org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:252)
>        at org.apache.hadoop.hbase.master.LogCleaner.chore(LogCleaner.java:128)
>        at org.apache.hadoop.hbase.Chore.run(Chore.java:66)
>        at org.apache.hadoop.hbase.master.LogCleaner.run(LogCleaner.java:167)
>
> And after that a lot of exceptions like the following:
> 2011-01-28 12:12:55,464 INFO org.apache.hadoop.hbase.master.MasterFileSystem: Log folder hdfs://b3110120.yst.yahoo.net:4600/hbase/.logs/b3110566.yst.yahoo.net,60020,1296199618571 belongs to an existing region server
> 2011-01-28 12:12:55,464 INFO org.apache.hadoop.hbase.master.MasterFileSystem: Log folder hdfs://b3110120.yst.yahoo.net:4600/hbase/.logs/b3110567.yst.yahoo.net,60020,1296183218152 doesn't belong to a known region server, splitting
> 2011-01-28 12:12:55,464 ERROR org.apache.hadoop.hbase.master.MasterFileSystem: Failed splitting hdfs://b3110120.yst.yahoo.net:4600/hbase/.logs/b3110567.yst.yahoo.net,60020,1296183218152
> java.io.IOException: Filesystem closed
>        at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:222)
>        at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:613)
>        at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:453)
>        at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:643)
>        at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLog(HLogSplitter.java:177)
>        at org.apache.hadoop.hbase.master.MasterFileSystem.splitLog(MasterFileSystem.java:196)
>        at org.apache.hadoop.hbase.master.MasterFileSystem.splitLogAfterStartup(MasterFileSystem.java:180)
>        at org.apache.hadoop.hbase.master.HMaster.finishInitialization(HMaster.java:378)
>        at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:277)
> 2011-01-28 12:12:55,464 INFO org.apache.hadoop.hbase.master.MasterFileSystem: Log folder hdfs://b3110120.yst.yahoo.net:4600/hbase/.logs/b3110567.yst.yahoo.net,60020,1296185180085 doesn't belong to a known region server, splitting
> 2011-01-28 12:12:55,464 ERROR org.apache.hadoop.hbase.master.MasterFileSystem: Failed splitting hdfs://b3110120.yst.yahoo.net:4600/hbase/.logs/b3110567.yst.yahoo.net,60020,1296185180085
> java.io.IOException: Filesystem closed
>        at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:222)
>        at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:613)
>        at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:453)
>        at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:643)
>        at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLog(HLogSplitter.java:177)
>        at org.apache.hadoop.hbase.master.MasterFileSystem.splitLog(MasterFileSystem.java:196)
>        at org.apache.hadoop.hbase.master.MasterFileSystem.splitLogAfterStartup(MasterFileSystem.java:180)
>        at org.apache.hadoop.hbase.master.HMaster.finishInitialization(HMaster.java:378)
>        at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:277)
>
>


Hmm.. maybe before you restart, remove the directory
hdfs://b3110120.yst.yahoo.net:4600/hbase/.logs/ completely so there are
no files to be processed on restart.

I wonder if there is some incompatibility such that we can't process
WALs written with older hbase (I presume the above WALs were written
with a version of hbase before 0.90?).

St.Ack



> Cheers
> V
>
>
> On 1/28/11 8:54 AM, "Stack" <st...@duboce.net> wrote:
>
> On Thu, Jan 27, 2011 at 11:56 PM, Vidhyashankar Venkataraman
> <vi...@yahoo-inc.com> wrote:
>> 2011-01-28 07:35:49,866 INFO org.apache.hadoop.hbase.master.MasterFileSystem: Log folder hdfs://b3110120.yst.yahoo.net:4600/hbase/.logs/b3110270.yst.yahoo.net,60020,1296199618314 belongs to an existing region server
>> 2011-01-28 07:35:49,866 INFO org.apache.hadoop.hbase.master.MasterFileSystem: Log folder hdfs://b3110120.yst.yahoo.net:4600/hbase/.logs/b3110271.yst.yahoo.net,60020,1296183218615 doesn't belong to a known region server, splitting
>
> Vidhya:
>
> This looks like a master that is starting up.  Is that right? Is this
> new cluster start or master joining an already running cluster (looks
> like former).
>
>
>> 2011-01-28 07:35:49,867 INFO org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Splitting 1 hlog(s) in hdfs://b3110120.yst.yahoo.net:4600/hbase/.logs/b3110271.yst.yahoo.net,60020,1296183218615
>> 2011-01-28 07:35:49,867 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Writer thread Thread[WriterThread-0,5,main]: starting
>> 2011-01-28 07:35:49,867 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Writer thread Thread[WriterThread-1,5,main]: starting
>> 2011-01-28 07:35:49,867 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Splitting hlog 1 of 1: hdfs://b3110120.yst.yahoo.net:4600/hbase/.logs/b3110271.yst.yahoo.net,60020,1296183218615/b3110271.yst.yahoo.net%3A60020.1296183219266, length=0
>> 2011-01-28 07:35:49,867 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Writer thread Thread[WriterThread-2,5,main]: starting
>> 2011-01-28 07:35:49,867 WARN org.apache.hadoop.hbase.util.FSUtils: Running on HDFS without append enabled may result in data loss
>
>
> Can you fix this?  Can you run w/ a working append?
>
>
>> 2011-01-28 07:35:49,867 WARN org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: File hdfs://b3110120.yst.yahoo.net:4600/hbase/.logs/b3110271.yst.yahoo.net,60020,1296183218615/b3110271.yst.yahoo.net%3A60020.1296183219266 might be still open, length is 0
>> 2011-01-28 07:35:49,869 WARN org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Could not open hdfs://b3110120.yst.yahoo.net:4600/hbase/.logs/b3110271.yst.yahoo.net,60020,1296183218615/b3110271.yst.yahoo.net%3A60020.1296183219266 for reading. File is empty
>> java.io.EOFException
>> 2011-01-28 07:35:49,875 INFO org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Archived processed log hdfs://b3110120.yst.yahoo.net:4600/hbase/.logs/b3110271.yst.yahoo.net,60020,1296183218615/b3110271.yst.yahoo.net%3A60020.1296183219266 to hdfs://b3110120.yst.yahoo.net:4600/hbase/.oldlogs/b3110271.yst.yahoo.net%3A60020.1296183219266
>> 2011-01-28 07:35:49,877 INFO org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Waiting for split writer threads to finish
>>
>
> Can you do 'kill -QUIT PID' and see if anything shows in the .out file
> for hbase?
>
> It looks like a master is hung for sure.   My guess would be that
> we're making some presumption based on presence of append.  Lets
> figure it and fix.
>
> Good on you V,
> St.Ack
>
>

Re: Unresponsive master in Hbase 0.90.0

Posted by Vidhyashankar Venkataraman <vi...@yahoo-inc.com>.
>> Is this new cluster start or master joining an already running cluster (looks
>> like former).

Either way, I get this problem. In particular, these logs were pulled out after I had done a createTable with boundaries (around 100 empty regions per node) and shut it down and then restarted. A similar thing happened when master was restarted on a running cluster.

>> Can you fix this?  Can you run w/ a working append?
I will run with hadoop append and see what happens.

>> Can you do 'kill -QUIT PID' and see if anything shows in the .out file for hbase?
>> It looks like a master is hung for sure.   My guess would be that
>> we're making some presumption based on presence of append.  Lets
>>figure it and fix.

Some updates on what happened with the master:
About 5 hours after the master went quiet, I saw some activity in the logs (a thread timeout, I suppose). Note the first line here was the last line in the log snippet I posted in the previous mail. I checked whether the gc was at fault: I looked at the vmstat/gc logs and couldn't find any swapping or serious gc happening. The master has since crashed with a number of exceptions (shown below).

2011-01-28 07:35:49,877 INFO org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Waiting for split writer threads to finish
2011-01-28 12:12:55,309 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 16625510ms instead of 10000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://wiki.apache.org/hadoop/Hbase/Troubleshooting#A9
2011-01-28 12:12:55,309 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 16629300ms for sessionid 0x32dcb848b98010e, closing socket connection and attempting reconnect
2011-01-28 12:12:55,309 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 16674993ms instead of 60000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://wiki.apache.org/hadoop/Hbase/Troubleshooting#A9
2011-01-28 12:12:55,310 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 16635680ms for sessionid 0x2dcb848b210138, closing socket connection and attempting reconnect
2011-01-28 12:12:55,310 INFO org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Split writers finished
2011-01-28 12:12:55,310 INFO org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: hlog file splitting completed in 16625444 ms for hdfs://b3110120.yst.yahoo.net:4600/hbase/.logs/b3110271.yst.yahoo.net,60020,1296183218615
2011-01-28 12:12:55,310 INFO org.apache.hadoop.hbase.master.MasterFileSystem: Log folder hdfs://b3110120.yst.yahoo.net:4600/hbase/.logs/b3110271.yst.yahoo.net,60020,1296185180336 doesn't belong to a known region server, splitting
2011-01-28 12:12:55,315 WARN org.apache.hadoop.hbase.master.LogCleaner: Error while cleaning the logs
java.io.IOException: Call to b3110120.yst.yahoo.net/67.195.46.238:4600 failed on local exception: java.io.InterruptedIOException: Interruped while waiting for IO on channel java.nio.channels.SocketChannel[connected local=/67.195.48.110:56229 remote=b3110120.yst.yahoo.net/67.195.46.238:4600]. 59999 millis timeout left.
        at org.apache.hadoop.ipc.Client.wrapException(Client.java:776)
        at org.apache.hadoop.ipc.Client.call(Client.java:744)
        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
        at $Proxy5.getListing(Unknown Source)
        at sun.reflect.GeneratedMethodAccessor8.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
        at $Proxy5.getListing(Unknown Source)
        at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:606)
        at org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:252)
        at org.apache.hadoop.hbase.master.LogCleaner.chore(LogCleaner.java:128)
        at org.apache.hadoop.hbase.Chore.run(Chore.java:66)
        at org.apache.hadoop.hbase.master.LogCleaner.run(LogCleaner.java:167)
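The Sleeper warnings above come from comparing how long a timed sleep actually took against the period that was requested; when the observed time vastly exceeds the period, the JVM was almost certainly stalled (GC pause, swapping, or a frozen host). A minimal sketch of that check, with an illustrative class name and threshold rather than HBase's actual code:

```java
// Sketch of a pause detector in the spirit of HBase's Sleeper warning:
// if a timed sleep overshoots its requested period by a large margin,
// the JVM was likely stalled rather than merely late.
public class PauseDetector {
    // Warn when the observed sleep exceeds the requested period by more
    // than this slack, in milliseconds (illustrative threshold).
    private static final long SLACK_MS = 10_000L;

    public static boolean isSuspiciousPause(long requestedMs, long observedMs) {
        return observedMs - requestedMs > SLACK_MS;
    }

    public static void main(String[] args) {
        // Values taken from the log above: slept 16625510ms instead of 10000ms.
        System.out.println(isSuspiciousPause(10_000L, 16_625_510L)); // true
        System.out.println(isSuspiciousPause(10_000L, 10_050L));     // false
    }
}
```

Note that the detector cannot distinguish a GC pause from any other stall (swapping, a frozen NFS mount, a hung kernel), which is why the log message only says the pause is "likely" GC.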

And after that a lot of exceptions like the following:
2011-01-28 12:12:55,464 INFO org.apache.hadoop.hbase.master.MasterFileSystem: Log folder hdfs://b3110120.yst.yahoo.net:4600/hbase/.logs/b3110566.yst.yahoo.net,60020,1296199618571 belongs to an existing region server
2011-01-28 12:12:55,464 INFO org.apache.hadoop.hbase.master.MasterFileSystem: Log folder hdfs://b3110120.yst.yahoo.net:4600/hbase/.logs/b3110567.yst.yahoo.net,60020,1296183218152 doesn't belong to a known region server, splitting
2011-01-28 12:12:55,464 ERROR org.apache.hadoop.hbase.master.MasterFileSystem: Failed splitting hdfs://b3110120.yst.yahoo.net:4600/hbase/.logs/b3110567.yst.yahoo.net,60020,1296183218152
java.io.IOException: Filesystem closed
        at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:222)
        at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:613)
        at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:453)
        at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:643)
        at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLog(HLogSplitter.java:177)
        at org.apache.hadoop.hbase.master.MasterFileSystem.splitLog(MasterFileSystem.java:196)
        at org.apache.hadoop.hbase.master.MasterFileSystem.splitLogAfterStartup(MasterFileSystem.java:180)
        at org.apache.hadoop.hbase.master.HMaster.finishInitialization(HMaster.java:378)
        at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:277)
2011-01-28 12:12:55,464 INFO org.apache.hadoop.hbase.master.MasterFileSystem: Log folder hdfs://b3110120.yst.yahoo.net:4600/hbase/.logs/b3110567.yst.yahoo.net,60020,1296185180085 doesn't belong to a known region server, splitting
2011-01-28 12:12:55,464 ERROR org.apache.hadoop.hbase.master.MasterFileSystem: Failed splitting hdfs://b3110120.yst.yahoo.net:4600/hbase/.logs/b3110567.yst.yahoo.net,60020,1296185180085
java.io.IOException: Filesystem closed
        at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:222)
        at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:613)
        at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:453)
        at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:643)
        at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLog(HLogSplitter.java:177)
        at org.apache.hadoop.hbase.master.MasterFileSystem.splitLog(MasterFileSystem.java:196)
        at org.apache.hadoop.hbase.master.MasterFileSystem.splitLogAfterStartup(MasterFileSystem.java:180)
        at org.apache.hadoop.hbase.master.HMaster.finishInitialization(HMaster.java:378)
        at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:277)
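The "Filesystem closed" exceptions are the signature of calling into an HDFS client after it has been closed: DFSClient checks an open flag at the start of every operation and fails fast once close() has run (here, plausibly a side effect of the interrupt seen in the LogCleaner trace). An illustrative sketch of that guard pattern, not the real DFSClient code:

```java
import java.io.IOException;

// Illustrative sketch of the checkOpen pattern visible in the stack
// trace: once close() flips the flag, every subsequent operation fails
// fast with "Filesystem closed" instead of touching a dead connection.
public class ClosableClient {
    private volatile boolean clientRunning = true;

    void checkOpen() throws IOException {
        if (!clientRunning) {
            throw new IOException("Filesystem closed");
        }
    }

    void close() {
        clientRunning = false;
    }

    boolean exists(String path) throws IOException {
        checkOpen();  // every call is guarded, as in DFSClient
        return true;  // stand-in for a real namenode RPC
    }

    public static void main(String[] args) throws IOException {
        ClosableClient client = new ClosableClient();
        System.out.println(client.exists("/hbase/.logs")); // true while open
        client.close();
        try {
            client.exists("/hbase/.logs");
        } catch (IOException e) {
            System.out.println(e.getMessage()); // Filesystem closed
        }
    }
}
```

Because HBase shares one cached FileSystem instance, once anything closes it every later split attempt fails immediately with this same exception, which matches the burst of identical errors above.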


Cheers
V


On 1/28/11 8:54 AM, "Stack" <st...@duboce.net> wrote:

On Thu, Jan 27, 2011 at 11:56 PM, Vidhyashankar Venkataraman
<vi...@yahoo-inc.com> wrote:
> 2011-01-28 07:35:49,866 INFO org.apache.hadoop.hbase.master.MasterFileSystem: Log folder hdfs://b3110120.yst.yahoo.net:4600/hbase/.logs/b3110270.yst.yahoo.net,60020,1296199618314 belongs to an existing region server
> 2011-01-28 07:35:49,866 INFO org.apache.hadoop.hbase.master.MasterFileSystem: Log folder hdfs://b3110120.yst.yahoo.net:4600/hbase/.logs/b3110271.yst.yahoo.net,60020,1296183218615 doesn't belong to a known region server, splitting

Vidhya:

This looks like a master that is starting up.  Is that right? Is this
a new cluster start, or a master joining an already running cluster?
(It looks like the former.)


> 2011-01-28 07:35:49,867 INFO org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Splitting 1 hlog(s) in hdfs://b3110120.yst.yahoo.net:4600/hbase/.logs/b3110271.yst.yahoo.net,60020,1296183218615
> 2011-01-28 07:35:49,867 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Writer thread Thread[WriterThread-0,5,main]: starting
> 2011-01-28 07:35:49,867 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Writer thread Thread[WriterThread-1,5,main]: starting
> 2011-01-28 07:35:49,867 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Splitting hlog 1 of 1: hdfs://b3110120.yst.yahoo.net:4600/hbase/.logs/b3110271.yst.yahoo.net,60020,1296183218615/b3110271.yst.yahoo.net%3A60020.1296183219266, length=0
> 2011-01-28 07:35:49,867 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Writer thread Thread[WriterThread-2,5,main]: starting
> 2011-01-28 07:35:49,867 WARN org.apache.hadoop.hbase.util.FSUtils: Running on HDFS without append enabled may result in data loss


Can you fix this?  Can you run w/ a working append?


> 2011-01-28 07:35:49,867 WARN org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: File hdfs://b3110120.yst.yahoo.net:4600/hbase/.logs/b3110271.yst.yahoo.net,60020,1296183218615/b3110271.yst.yahoo.net%3A60020.1296183219266 might be still open, length is 0
> 2011-01-28 07:35:49,869 WARN org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Could not open hdfs://b3110120.yst.yahoo.net:4600/hbase/.logs/b3110271.yst.yahoo.net,60020,1296183218615/b3110271.yst.yahoo.net%3A60020.1296183219266 for reading. File is emptyjava.io.EOFException
> 2011-01-28 07:35:49,875 INFO org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Archived processed log hdfs://b3110120.yst.yahoo.net:4600/hbase/.logs/b3110271.yst.yahoo.net,60020,1296183218615/b3110271.yst.yahoo.net%3A60020.1296183219266 to hdfs://b3110120.yst.yahoo.net:4600/hbase/.oldlogs/b3110271.yst.yahoo.net%3A60020.1296183219266
> 2011-01-28 07:35:49,877 INFO org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Waiting for split writer threads to finish
>

Can you do 'kill -QUIT PID' and see if anything shows in the .out file
for hbase?

It looks like the master is hung for sure.  My guess would be that
we're making some presumption based on the presence of append.  Let's
figure it out and fix it.

Good on you V,
St.Ack
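Besides 'kill -QUIT <pid>' (which writes the dump to the JVM's stdout, i.e. the HBase .out file) and jstack, a thread dump can also be captured in-process through the standard management API. This is a generic JMX sketch, not HBase code:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Dump all live threads with stack traces, similar in spirit to what
// 'kill -QUIT <pid>' prints to the JVM's stdout (.out file for HBase).
public class ThreadDump {
    public static String dump() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        StringBuilder sb = new StringBuilder();
        // dumpAllThreads(lockedMonitors, lockedSynchronizers) also reports
        // held locks, which helps spot what a stuck writer thread waits on.
        for (ThreadInfo info : mx.dumpAllThreads(true, true)) {
            sb.append(info.toString());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(dump());
    }
}
```

When jstack -F fails with DebuggerException, as reported above, an in-process dump like this (or jstack without -F against a live, non-core process) can still show where the "Waiting for split writer threads to finish" wait is parked.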

