Posted to dev@hbase.apache.org by lars hofhansl <la...@apache.org> on 2013/05/09 08:39:23 UTC

All region server died due to "Parent directory doesn't exist"

We just had all RegionServers die in a test cluster, all with the following exception.
(This is CDH4.2.1 with HBase 0.94.7 built against it.)

Strangely, HDFS is up and running (I can ls all directories, create files in them, etc., and HDFS's fsck reports that all is well), yet we had the RSs die with this.
This almost looks like a race where the directories under .logs were yanked away while they were still in use.

I plan to investigate this further. In any event, has anybody seen this issue (or anything similar to this) before?
When this happened there was no load on the cluster (other than some writes from OTSDB).

Thanks.

-- Lars

2013-05-08 16:02:41,178 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server <host>,60020,1367614452787: IOE in log roller
java.io.IOException: Exception in createWriter
        at org.apache.hadoop.hbase.regionserver.wal.HLogFileSystem.createWriter(HLogFileSystem.java:66)
        at org.apache.hadoop.hbase.regionserver.wal.HLog.createWriterInstance(HLog.java:715)
        at org.apache.hadoop.hbase.regionserver.wal.HLog.rollWriter(HLog.java:648)
        at org.apache.hadoop.hbase.regionserver.LogRoller.run(LogRoller.java:95)
        at java.lang.Thread.run(Thread.java:662)
Caused by: java.io.IOException: cannot get log writer
        at org.apache.hadoop.hbase.regionserver.wal.HLog.createWriter(HLog.java:771)
        at org.apache.hadoop.hbase.regionserver.wal.HLogFileSystem.createWriter(HLogFileSystem.java:60)
        ... 4 more
Caused by: java.io.IOException: java.io.FileNotFoundException: Parent directory doesn't exist: /hbase/.logs/<host>,60020,1367614452787
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.verifyParentDir(FSNamesystem.java:1726)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:1848)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:1770)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:1747)
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:418)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.create(ClientNamenodeProtocolServerSideTranslatorPB.java:205)
        at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:44068)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:453)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1002)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1695)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1691)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1689)

        at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter.init(SequenceFileLogWriter.java:173)
        at org.apache.hadoop.hbase.regionserver.wal.HLog.createWriter(HLog.java:768)
        ... 5 more
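
For anyone digging through similar region server logs: the useful part of a trace like the one above is the innermost "Caused by:" line. Below is a minimal Python sketch that pulls out that line and the path the NameNode reported as missing. It is not part of HBase or Hadoop; the names `innermost_cause`, `missing_parent_dir` and `SAMPLE` are made up for illustration, and `SAMPLE` is just an excerpt of the trace above.

```python
import re

# Excerpt of the trace posted above (only a few representative lines).
SAMPLE = """\
java.io.IOException: Exception in createWriter
        at org.apache.hadoop.hbase.regionserver.wal.HLog.rollWriter(HLog.java:648)
Caused by: java.io.IOException: cannot get log writer
        at org.apache.hadoop.hbase.regionserver.wal.HLog.createWriter(HLog.java:771)
Caused by: java.io.IOException: java.io.FileNotFoundException: Parent directory doesn't exist: /hbase/.logs/<host>,60020,1367614452787
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.verifyParentDir(FSNamesystem.java:1726)
"""

CAUSE_RE = re.compile(r"^Caused by: (.*)$")
PARENT_RE = re.compile(r"Parent directory doesn't exist: (\S+)")


def innermost_cause(trace):
    """Return the message of the last 'Caused by:' line, or None if there is none."""
    last = None
    for line in trace.splitlines():
        match = CAUSE_RE.match(line.strip())
        if match:
            last = match.group(1)
    return last


def missing_parent_dir(trace):
    """Return the path named in a 'Parent directory doesn't exist' message, or None."""
    match = PARENT_RE.search(trace)
    return match.group(1) if match else None


if __name__ == "__main__":
    print(innermost_cause(SAMPLE))
    print(missing_parent_dir(SAMPLE))
```

Run against the excerpt, the second function yields the region server's own directory under /hbase/.logs, which is the directory to look for (possibly under a different name) on HDFS.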

Re: All region server died due to "Parent directory doesn't exist"

Posted by lars hofhansl <la...@apache.org>.

Another symptom is that about 1h before the RSs started dying I get logs like this:
2013-05-08 15:02:50,723 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Unable to open a reader, sleeping 1000 times


Replication is not the problem here, but it indicates that it suddenly can no longer read the log files.
There is nothing interesting in the master log, and as I said HDFS is fine.

-- Lars



Re: All region server died due to "Parent directory doesn't exist"

Posted by lars hofhansl <la...@apache.org>.

Nope. That does not appear to be the problem.


________________________________
 From: Enis Söztutar <en...@gmail.com>
To: "dev@hbase.apache.org" <de...@hbase.apache.org>; lars hofhansl <la...@apache.org> 
Sent: Thursday, May 9, 2013 10:01 PM
Subject: Re: All region server died due to "Parent directory doesn't exist"
 

But you see the zookeeper session timeout events in RS logs, and the master
says that zk session for the RS's has expired, right?


Re: All region server died due to "Parent directory doesn't exist"

Posted by Enis Söztutar <en...@gmail.com>.

But you see the zookeeper session timeout events in RS logs, and the master
says that zk session for the RS's has expired, right?


Re: All region server died due to "Parent directory doesn't exist"

Posted by lars hofhansl <la...@apache.org>.

Still looking. Stack and Himanshu are looking too (thanks again!).

What I do know is that it has to do with the fencing mechanism during log splitting.
Until I bounced HDFS and ZK (ZK probably being the culprit), each started RegionServer would immediately be fenced off (its log directory renamed).
Probably by the ServerShutdownHandler (SSH).

It is not clear what caused the first RS to die. While there is no direct evidence, from the logs it looks like the log directory was just suddenly renamed.
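
To make the failure mode concrete, here is a small simulation of fencing by rename. It uses a local temporary directory in place of HDFS and plain file opens in place of the WAL writer, so it is only an analogy under those assumptions and not HBase code; `simulate_fencing` and the file names `wal.1` / `wal.2` are hypothetical. One actor (the "master") renames the server's log directory to ...-splitting, and the other (the "region server") then tries to roll its log by creating a new file under the old path.

```python
import os
import tempfile


def simulate_fencing():
    """Rename a 'WAL directory' out from under a writer and report what happens.

    Returns a tuple (outcome, files_in_splitting_dir).
    """
    with tempfile.TemporaryDirectory() as root:
        wal_dir = os.path.join(root, "logs", "host,60020,1367614452787")
        os.makedirs(wal_dir)

        # The "region server" writes its current log file and closes it.
        with open(os.path.join(wal_dir, "wal.1"), "w") as current:
            current.write("edit-1\n")

        # The "master" decides the server is dead and fences it off by
        # renaming the whole directory before splitting the logs.
        splitting_dir = wal_dir + "-splitting"
        os.rename(wal_dir, splitting_dir)

        # The "region server" is in fact still alive and rolls its log,
        # which means creating a new file under the old directory name.
        try:
            with open(os.path.join(wal_dir, "wal.2"), "w") as rolled:
                rolled.write("edit-2\n")
            outcome = "rolled"
        except FileNotFoundError:
            # Local analogue of "Parent directory doesn't exist".
            outcome = "aborted"

        return outcome, sorted(os.listdir(splitting_dir))


if __name__ == "__main__":
    print(simulate_fencing())
```

The log roll fails because the parent directory is gone, while the file written before the rename is still intact under the -splitting name. That matches what we saw: the RS aborts in the log roller, and the old logs remain available to be split and replayed.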

I'll spend more time in the logs and also watch for this happening again.

We did find another misconfigured cluster that had some services pointed at this cluster. It does not look like that was actually a problem - there is no evidence in the logs that this actually caused a problem, but it made this deploy somewhat "special".


-- Lars



________________________________
 From: Enis Söztutar <en...@gmail.com>
To: "dev@hbase.apache.org" <de...@hbase.apache.org>; lars hofhansl <la...@apache.org> 
Sent: Thursday, May 9, 2013 6:10 PM
Subject: Re: All region server died due to "Parent directory doesn't exist"
 


Were we able to find the root cause?



On Thu, May 9, 2013 at 11:28 AM, lars hofhansl <la...@apache.org> wrote:

Good news is that as far as I can tell no data was lost.
>Eventually all logs were split and replayed.
>
>
>
>-- Lars
>
>
>
>----- Original Message -----
>
>From: lars hofhansl <la...@apache.org>
>To: HBase Dev List <de...@hbase.apache.org>
>
>Cc:
>Sent: Thursday, May 9, 2013 11:13 AM
>Subject: Re: All region server died due to "Parent directory doesn't exist"
>
>Thanks Stack.
>
>I sent the logs.
>Also, I have since bounced HDFS and ZK and the problem is gone now (I can start RSs again and they stay up). Something got into a weird state.
>
>
>-- Lars
>
>
>
>________________________________
>From: Stack <st...@duboce.net>
>To: HBase Dev List <de...@hbase.apache.org>; lars hofhansl <la...@apache.org>
>Sent: Thursday, May 9, 2013 10:34 AM
>Subject: Re: All region server died due to "Parent directory doesn't exist"
>
>
>
>Want to send me a regionserver log Lars? (off-list)
>St.Ack
>
>
>
>On Thu, May 9, 2013 at 10:03 AM, lars hofhansl <la...@apache.org> wrote:
>
>Thanks Ted and Varun.
>>
>>
>>Let me check on the .META. server.
>>
>>
>>The majority (13) of the RSs died within 2 minutes. The remaining 3 died over the following 10 minutes.
>>So that would point to a general issue. I did not see any ZK issues but I'll double check.
>>
>>
>>It is just interesting that even now, if I start an RS it aborts within a minute or two, because of this issue.
>>
>>
>>-- Lars
>>
>>
>>----- Original Message -----
>>From: Ted Yu <yu...@gmail.com>
>>To: dev@hbase.apache.org
>>
>>Cc:
>>Sent: Thursday, May 9, 2013 9:51 AM
>>Subject: Re: All region server died due to "Parent directory doesn't exist"
>>
>>Thanks Varun for sharing your experience.
>>
>>Lars:
>>Was the server carrying .META. functioning properly around the time when
>>you observed the problem ?
>>
>>Cheers
>>
>>On Thu, May 9, 2013 at 9:41 AM, Varun Sharma <va...@pinterest.com> wrote:
>>
>>> I meant no NTP/clock synchronization between the ZooKeeper quorum and the
>>> HBase cluster. I am not sure if you are seeing the exact same issue though.
>>> We did not have mass failures at the same time due to this.
>>>
>>> Thanks
>>> Varun
>>>
>>>
>>> On Thu, May 9, 2013 at 9:39 AM, Varun Sharma <va...@pinterest.com> wrote:
>>>
>>> > Btw, I am not 100% sure, but I have seen something like this before:
>>> >
>>> > 1) ZK connection flakiness causes ephemeral nodes to expire
>>> > 2) Master detects failure and renames the logs into a splitting directory
>>> > - this is intentional so that in case that region server comes back up,
>>> > it cannot write to the logs being split
>>> > 3) Region server dies because the log is renamed
>>> >
>>> > So, the yanking away of files is done by the HBase master and is expected
>>> > if the master feels the server is dead. We found that the region server
>>> > logs DFS exceptions like crazy (1000s of them) in that case, and we always
>>> > suspected that this was some kind of DFS error, but when we went back to
>>> > the point where it started, we found some ZooKeeper session issues.
>>> >
>>> > We had two cases of this, caused by either super high load or missing
>>> > NTP/clock synchronization between the clusters.
>>> >
>>> > Thanks
>>> > Varun
>>> >
>>> >
>>> > On Thu, May 9, 2013 at 9:16 AM, lars hofhansl <la...@apache.org> wrote:
>>> >
>>> >> Thanks Ted. I'll do the same.
>>> >>
>>> >>
>>> >> ----- Original Message -----
>>> >> From: Ted Yu <yu...@gmail.com>
>>> >> To: dev@hbase.apache.org; lars hofhansl <la...@apache.org>
>>> >> Cc:
>>> >> Sent: Thursday, May 9, 2013 9:07 AM
>>> >> Subject: Re: All region server died due to "Parent directory doesn't
>>> >> exist"
>>> >>
>>> >> I went through the patch for HBASE-7824 one more time and didn't find
>>> >> direct correlation to the issue Lars reported.
>>> >>
>>> >> I am going over the other JIRAs in Lars' list.
>>> >>
>>> >> Cheers
>>> >>
>>> >> On Thu, May 9, 2013 at 8:48 AM, lars hofhansl <la...@apache.org> wrote:
>>> >>
>>> >> > I will try. I do not think this is the issue, though.
>>> >> >
>>> >> > The master is up in my case.
>>> >> > Right now the cluster is in a state where each region server aborts
>>> >> > itself shortly after being started (which coincides with having its
>>> >> > log directory renamed to ...-splitting).
>>> >> >
>>> >> >
>>> >> > This is a test cluster and I could just start from scratch... This
>>> >> > appears to be a serious enough problem, though, and I would like to
>>> >> > track down the issue.
>>> >> >
>>> >> > -- Lars
>>> >> >
>>> >> >
>>> >> >
>>> >> > ----- Original Message -----
>>> >> > From: Ted Yu <yu...@gmail.com>
>>> >> > To: "dev@hbase.apache.org" <de...@hbase.apache.org>
>>> >> > Cc: "dev@hbase.apache.org" <de...@hbase.apache.org>
>>> >> > Sent: Thursday, May 9, 2013 2:04 AM
>>> >> > Subject: Re: All region server died due to "Parent directory doesn't exist"
>>> >> >
>>> >> > The config came from HBASE-7824.
>>> >> >
>>> >> > There are other JIRAs in Lars' list which are related to log splitting.
>>> >> >
>>> >> > I think more investigation is needed.
>>> >> >
>>> >> > Cheers
>>> >> >
>>> >> > On May 9, 2013, at 1:59 AM, Andrew Purtell <ap...@apache.org> wrote:
>>> >> >
>>> >> > > So that is HBASE-7824, right?
>>> >> > >
>>> >> > > On Thu, May 9, 2013 at 4:33 PM, Ted Yu <yu...@gmail.com> wrote:
>>> >> > >
>>> >> > >> hbase.master.wait.for.log.splitting
>>> >> > >
>>> >> > >
>>> >> > >
>>> >> > >
>>> >> > > --
>>> >> > > Best regards,
>>> >> > >
>>> >> > >   - Andy
>>> >> > >
>>> >> > > Problems worthy of attack prove their worth by hitting back. - Piet Hein
>>> >> > > (via Tom White)
>>> >> >
>>> >> >
>>> >>
>>> >>
>>> >
>>>
>>
>>
>

Re: All region server died due to "Parent directory doesn't exist"

Posted by Enis Söztutar <en...@gmail.com>.

Were we able to find the root cause?


On Thu, May 9, 2013 at 11:28 AM, lars hofhansl <la...@apache.org> wrote:

> Good news is that as far as I can tell no data was lost.
> Eventually all logs were split and replayed.
>
>
> -- Lars
>
>
>
> ----- Original Message -----
> From: lars hofhansl <la...@apache.org>
> To: HBase Dev List <de...@hbase.apache.org>
> Cc:
> Sent: Thursday, May 9, 2013 11:13 AM
> Subject: Re: All region server died due to "Parent directory doesn't exist"
>
> Thanks Stack.
>
> I sent the logs.
> Also, I have since bounced HDFS and ZK and the problem is gone now (I can
> start RSs again and they stay up). Something got into a weird state.
>
>
> -- Lars
>
>
>
> ________________________________
> From: Stack <st...@duboce.net>
> To: HBase Dev List <de...@hbase.apache.org>; lars hofhansl <larsh@apache.org
> >
> Sent: Thursday, May 9, 2013 10:34 AM
> Subject: Re: All region server died due to "Parent directory doesn't exist"
>
>
>
> Want to send me a regionserver log Lars? (off-list)
> St.Ack
>
>
>
> On Thu, May 9, 2013 at 10:03 AM, lars hofhansl <la...@apache.org> wrote:
>
> Thanks Ted and Varun.
> >
> >
> >Let me check on the .META. server.
> >
> >
> >The majority (13) of the RSs died within 2 minutes. The remaining 3 died
> over the following 10 minutes.
> >So that would point to general issue. I did not see any ZK issues but
> I'll double check.
> >
> >
> >It is just interesting that even now, if I start and RS it aborts within
> a minute or two, because of this issue.
> >
> >
> >-- Lars
> >
> >
> >----- Original Message -----
> >From: Ted Yu <yu...@gmail.com>
> >To: dev@hbase.apache.org
> >
> >Cc:
> >Sent: Thursday, May 9, 2013 9:51 AM
> >Subject: Re: All region server died due to "Parent directory doesn't
> exist"
> >
> >Thanks Varun for sharing your experience.
> >
> >Lars:
> >Was the server carrying .META. functioning properly around the time when
> >you observed the problem ?
> >
> >Cheers
> >
> >On Thu, May 9, 2013 at 9:41 AM, Varun Sharma <va...@pinterest.com> wrote:
> >
> >> I meant no NTP/clock synchronization b/w zookeeper quorum and the HBase
> >> cluster. I am not sure if you are seeing the exact same issue though. We
> >> did not have mass failures at the same time due to this..
> >>
> >> Thanks
> >> Varun
> >>
> >>
> >> On Thu, May 9, 2013 at 9:39 AM, Varun Sharma <va...@pinterest.com>
> wrote:
> >>
> >> > Btw, I am not 100 % sure but I have some seen something like this
> before:
> >> >
> >> > 1) ZK connection flakiness causes ephemeral nodes to expire
> >> > 2) Master detects failure and renames the logs into a splitting
> directory
> >> > - this is intentional so that in case that region server comes back
> up,
> >> it
> >> > cannot write to the logs being split
> >> > 3) Region server dies because the log is renamed
> >> >
> >> > So, the yanking away of files is done by the HBase master and is
> expected
> >> > if the master feels the server is dead. We found that the Region
> server
> >> > logs DFS exceptions like crazy (1000s of them) in that case and we
> always
> >> > suspected that this is some kind of DFS error but when we really go
> upto
> >> > the point where it started, we found some zookeeper session issues.
> >> >
> >> > We had two cases of this - either super high load or NTP/no clock
> >> > synchronization b/w the clusters causing this issue for us.
> >> >
> >> > Thanks
> >> > Varun
> >> >
> >> >
> >> > On Thu, May 9, 2013 at 9:16 AM, lars hofhansl <la...@apache.org>
> wrote:
> >> >
> >> >> Thanks Ted. I'll do the same.
> >> >>
> >> >>
> >> >> ----- Original Message -----
> >> >> From: Ted Yu <yu...@gmail.com>
> >> >> To: dev@hbase.apache.org; lars hofhansl <la...@apache.org>
> >> >> Cc:
> >> >> Sent: Thursday, May 9, 2013 9:07 AM
> >> >> Subject: Re: All region server died due to "Parent directory doesn't
> >> >> exist"
> >> >>
> >> >> I went through the patch for HBASE-7824 one more time and didn't find
> >> >> direct correlation to the issue Lars reported.
> >> >>
> >> >> I am going over the other JIRAs in Lars' list.
> >> >>
> >> >> Cheers
> >> >>
> >> >> On Thu, May 9, 2013 at 8:48 AM, lars hofhansl <la...@apache.org>
> wrote:
> >> >>
> >> >> > I will try. I do not think this is the issue, though.
> >> >> >
> >> >> > The master is up in my case.
> >> >> > Right now the cluster is in a state where each region server aborts
> >> >> itself
> >> >> > shortly after being started (which coincides with having it's log
> >> >> directory
> >> >> > renamed to ...-splitting).
> >> >> >
> >> >> >
> >> >> > This is a test cluster and I could just start from scratch... This
> >> >> appears
> >> >> > to be a serious enough problem, though, and I would like to track
> down
> >> >> the
> >> >> > issue.
> >> >> >
> >> >> > -- Lars
> >> >> >
> >> >> >
> >> >> >
> >> >> > ----- Original Message -----
> >> >> > From: Ted Yu <yu...@gmail.com>
> >> >> > To: "dev@hbase.apache.org" <de...@hbase.apache.org>
> >> >> > Cc: "dev@hbase.apache.org" <de...@hbase.apache.org>
> >> >> > Sent: Thursday, May 9, 2013 2:04 AM
> >> >> > Subject: Re: All region server died due to "Parent directory
> doesn't
> >> >> exist"
> >> >> >
> >> >> > The config came from hbase-7824.
> >> >> >
> >> >> > There are other JIRAs in Lars' list which are related to log
> >> splitting.
> >> >> >
> >> >> > I think more investigation is needed.
> >> >> >
> >> >> > Cheers
> >> >> >
> >> >> > On May 9, 2013, at 1:59 AM, Andrew Purtell <ap...@apache.org>
> >> wrote:
> >> >> >
> >> >> > > So that is HBASE-7824, right?
> >> >> > >
> >> >> > > On Thu, May 9, 2013 at 4:33 PM, Ted Yu <yu...@gmail.com>
> wrote:
> >> >> > >
> >> >> > >> hbase.master.wait.for.log.splitting
> >> >> > >
> >> >> > >
> >> >> > >
> >> >> > >
> >> >> > > --
> >> >> > > Best regards,
> >> >> > >
> >> >> > >   - Andy
> >> >> > >
> >> >> > > Problems worthy of attack prove their worth by hitting back. -
> Piet
> >> >> Hein
> >> >> > > (via Tom White)
> >> >> >
> >> >> >
> >> >>
> >> >>
> >> >
> >>
> >
> >
>

Re: All region server died due to "Parent directory doesn't exist"

Posted by lars hofhansl <la...@apache.org>.

Good news is that as far as I can tell no data was lost.
Eventually all logs were split and replayed.


-- Lars



----- Original Message -----
From: lars hofhansl <la...@apache.org>
To: HBase Dev List <de...@hbase.apache.org>
Cc: 
Sent: Thursday, May 9, 2013 11:13 AM
Subject: Re: All region server died due to "Parent directory doesn't exist"

Thanks Stack.

I sent the logs.
Also, I have since bounced HDFS and ZK and the problem is gone now (I can start RSs again and they stay up). Something got into a weird state.


-- Lars



________________________________
From: Stack <st...@duboce.net>
To: HBase Dev List <de...@hbase.apache.org>; lars hofhansl <la...@apache.org> 
Sent: Thursday, May 9, 2013 10:34 AM
Subject: Re: All region server died due to "Parent directory doesn't exist"



Want to send me a regionserver log Lars? (off-list)
St.Ack



On Thu, May 9, 2013 at 10:03 AM, lars hofhansl <la...@apache.org> wrote:

Thanks Ted and Varun.
>
>
>Let me check on the .META. server.
>
>
>The majority (13) of the RSs died within 2 minutes. The remaining 3 died over the following 10 minutes.
>So that would point to general issue. I did not see any ZK issues but I'll double check.
>
>
>It is just interesting that even now, if I start and RS it aborts within a minute or two, because of this issue.
>
>
>-- Lars
>
>
>----- Original Message -----
>From: Ted Yu <yu...@gmail.com>
>To: dev@hbase.apache.org
>
>Cc:
>Sent: Thursday, May 9, 2013 9:51 AM
>Subject: Re: All region server died due to "Parent directory doesn't exist"
>
>Thanks Varun for sharing your experience.
>
>Lars:
>Was the server carrying .META. functioning properly around the time when
>you observed the problem ?
>
>Cheers
>
>On Thu, May 9, 2013 at 9:41 AM, Varun Sharma <va...@pinterest.com> wrote:
>
>> I meant no NTP/clock synchronization b/w zookeeper quorum and the HBase
>> cluster. I am not sure if you are seeing the exact same issue though. We
>> did not have mass failures at the same time due to this..
>>
>> Thanks
>> Varun
>>
>>
>> On Thu, May 9, 2013 at 9:39 AM, Varun Sharma <va...@pinterest.com> wrote:
>>
>> > Btw, I am not 100 % sure but I have some seen something like this before:
>> >
>> > 1) ZK connection flakiness causes ephemeral nodes to expire
>> > 2) Master detects failure and renames the logs into a splitting directory
>> > - this is intentional so that in case that region server comes back up,
>> it
>> > cannot write to the logs being split
>> > 3) Region server dies because the log is renamed
>> >
>> > So, the yanking away of files is done by the HBase master and is expected
>> > if the master feels the server is dead. We found that the Region server
>> > logs DFS exceptions like crazy (1000s of them) in that case and we always
>> > suspected that this is some kind of DFS error but when we really go upto
>> > the point where it started, we found some zookeeper session issues.
>> >
>> > We had two cases of this - either super high load or NTP/no clock
>> > synchronization b/w the clusters causing this issue for us.
>> >
>> > Thanks
>> > Varun
>> >
>> >
>> > On Thu, May 9, 2013 at 9:16 AM, lars hofhansl <la...@apache.org> wrote:
>> >
>> >> Thanks Ted. I'll do the same.
>> >>
>> >>
>> >> ----- Original Message -----
>> >> From: Ted Yu <yu...@gmail.com>
>> >> To: dev@hbase.apache.org; lars hofhansl <la...@apache.org>
>> >> Cc:
>> >> Sent: Thursday, May 9, 2013 9:07 AM
>> >> Subject: Re: All region server died due to "Parent directory doesn't
>> >> exist"
>> >>
>> >> I went through the patch for HBASE-7824 one more time and didn't find
>> >> direct correlation to the issue Lars reported.
>> >>
>> >> I am going over the other JIRAs in Lars' list.
>> >>
>> >> Cheers
>> >>
>> >> On Thu, May 9, 2013 at 8:48 AM, lars hofhansl <la...@apache.org> wrote:
>> >>
>> >> > I will try. I do not think this is the issue, though.
>> >> >
>> >> > The master is up in my case.
>> >> > Right now the cluster is in a state where each region server aborts
>> >> itself
>> >> > shortly after being started (which coincides with having it's log
>> >> directory
>> >> > renamed to ...-splitting).
>> >> >
>> >> >
>> >> > This is a test cluster and I could just start from scratch... This
>> >> appears
>> >> > to be a serious enough problem, though, and I would like to track down
>> >> the
>> >> > issue.
>> >> >
>> >> > -- Lars
>> >> >
>> >> >
>> >> >
>> >> > ----- Original Message -----
>> >> > From: Ted Yu <yu...@gmail.com>
>> >> > To: "dev@hbase.apache.org" <de...@hbase.apache.org>
>> >> > Cc: "dev@hbase.apache.org" <de...@hbase.apache.org>
>> >> > Sent: Thursday, May 9, 2013 2:04 AM
>> >> > Subject: Re: All region server died due to "Parent directory doesn't
>> >> exist"
>> >> >
>> >> > The config came from hbase-7824.
>> >> >
>> >> > There are other JIRAs in Lars' list which are related to log
>> splitting.
>> >> >
>> >> > I think more investigation is needed.
>> >> >
>> >> > Cheers
>> >> >
>> >> > On May 9, 2013, at 1:59 AM, Andrew Purtell <ap...@apache.org>
>> wrote:
>> >> >
>> >> > > So that is HBASE-7824, right?
>> >> > >
>> >> > > On Thu, May 9, 2013 at 4:33 PM, Ted Yu <yu...@gmail.com> wrote:
>> >> > >
>> >> > >> hbase.master.wait.for.log.splitting
>> >> > >
>> >> > >
>> >> > >
>> >> > >
>> >> > > --
>> >> > > Best regards,
>> >> > >
>> >> > >   - Andy
>> >> > >
>> >> > > Problems worthy of attack prove their worth by hitting back. - Piet
>> >> Hein
>> >> > > (via Tom White)
>> >> >
>> >> >
>> >>
>> >>
>> >
>>
>
>

Re: All region server died due to "Parent directory doesn't exist"

Posted by lars hofhansl <la...@apache.org>.

Thanks Stack.

I sent the logs.
Also, I have since bounced HDFS and ZK and the problem is gone now (I can start RSs again and they stay up). Something got into a weird state.


-- Lars



________________________________
 From: Stack <st...@duboce.net>
To: HBase Dev List <de...@hbase.apache.org>; lars hofhansl <la...@apache.org> 
Sent: Thursday, May 9, 2013 10:34 AM
Subject: Re: All region server died due to "Parent directory doesn't exist"
 


Want to send me a regionserver log Lars? (off-list)
St.Ack



On Thu, May 9, 2013 at 10:03 AM, lars hofhansl <la...@apache.org> wrote:

Thanks Ted and Varun.
>
>
>Let me check on the .META. server.
>
>
>The majority (13) of the RSs died within 2 minutes. The remaining 3 died over the following 10 minutes.
>So that would point to general issue. I did not see any ZK issues but I'll double check.
>
>
>It is just interesting that even now, if I start and RS it aborts within a minute or two, because of this issue.
>
>
>-- Lars
>
>
>----- Original Message -----
>From: Ted Yu <yu...@gmail.com>
>To: dev@hbase.apache.org
>
>Cc:
>Sent: Thursday, May 9, 2013 9:51 AM
>Subject: Re: All region server died due to "Parent directory doesn't exist"
>
>Thanks Varun for sharing your experience.
>
>Lars:
>Was the server carrying .META. functioning properly around the time when
>you observed the problem ?
>
>Cheers
>
>On Thu, May 9, 2013 at 9:41 AM, Varun Sharma <va...@pinterest.com> wrote:
>
>> I meant no NTP/clock synchronization b/w zookeeper quorum and the HBase
>> cluster. I am not sure if you are seeing the exact same issue though. We
>> did not have mass failures at the same time due to this..
>>
>> Thanks
>> Varun
>>
>>
>> On Thu, May 9, 2013 at 9:39 AM, Varun Sharma <va...@pinterest.com> wrote:
>>
>> > Btw, I am not 100 % sure but I have some seen something like this before:
>> >
>> > 1) ZK connection flakiness causes ephemeral nodes to expire
>> > 2) Master detects failure and renames the logs into a splitting directory
>> > - this is intentional so that in case that region server comes back up,
>> it
>> > cannot write to the logs being split
>> > 3) Region server dies because the log is renamed
>> >
>> > So, the yanking away of files is done by the HBase master and is expected
>> > if the master feels the server is dead. We found that the Region server
>> > logs DFS exceptions like crazy (1000s of them) in that case and we always
>> > suspected that this is some kind of DFS error but when we really go upto
>> > the point where it started, we found some zookeeper session issues.
>> >
>> > We had two cases of this - either super high load or NTP/no clock
>> > synchronization b/w the clusters causing this issue for us.
>> >
>> > Thanks
>> > Varun
>> >
>> >
>> > On Thu, May 9, 2013 at 9:16 AM, lars hofhansl <la...@apache.org> wrote:
>> >
>> >> Thanks Ted. I'll do the same.
>> >>
>> >>
>> >> ----- Original Message -----
>> >> From: Ted Yu <yu...@gmail.com>
>> >> To: dev@hbase.apache.org; lars hofhansl <la...@apache.org>
>> >> Cc:
>> >> Sent: Thursday, May 9, 2013 9:07 AM
>> >> Subject: Re: All region server died due to "Parent directory doesn't
>> >> exist"
>> >>
>> >> I went through the patch for HBASE-7824 one more time and didn't find
>> >> direct correlation to the issue Lars reported.
>> >>
>> >> I am going over the other JIRAs in Lars' list.
>> >>
>> >> Cheers
>> >>
>> >> On Thu, May 9, 2013 at 8:48 AM, lars hofhansl <la...@apache.org> wrote:
>> >>
>> >> > I will try. I do not think this is the issue, though.
>> >> >
>> >> > The master is up in my case.
>> >> > Right now the cluster is in a state where each region server aborts
>> >> itself
>> >> > shortly after being started (which coincides with having it's log
>> >> directory
>> >> > renamed to ...-splitting).
>> >> >
>> >> >
>> >> > This is a test cluster and I could just start from scratch... This
>> >> appears
>> >> > to be a serious enough problem, though, and I would like to track down
>> >> the
>> >> > issue.
>> >> >
>> >> > -- Lars
>> >> >
>> >> >
>> >> >
>> >> > ----- Original Message -----
>> >> > From: Ted Yu <yu...@gmail.com>
>> >> > To: "dev@hbase.apache.org" <de...@hbase.apache.org>
>> >> > Cc: "dev@hbase.apache.org" <de...@hbase.apache.org>
>> >> > Sent: Thursday, May 9, 2013 2:04 AM
>> >> > Subject: Re: All region server died due to "Parent directory doesn't
>> >> exist"
>> >> >
>> >> > The config came from hbase-7824.
>> >> >
>> >> > There are other JIRAs in Lars' list which are related to log
>> splitting.
>> >> >
>> >> > I think more investigation is needed.
>> >> >
>> >> > Cheers
>> >> >
>> >> > On May 9, 2013, at 1:59 AM, Andrew Purtell <ap...@apache.org>
>> wrote:
>> >> >
>> >> > > So that is HBASE-7824, right?
>> >> > >
>> >> > > On Thu, May 9, 2013 at 4:33 PM, Ted Yu <yu...@gmail.com> wrote:
>> >> > >
>> >> > >> hbase.master.wait.for.log.splitting
>> >> > >
>> >> > >
>> >> > >
>> >> > >
>> >> > > --
>> >> > > Best regards,
>> >> > >
>> >> > >   - Andy
>> >> > >
>> >> > > Problems worthy of attack prove their worth by hitting back. - Piet
>> >> Hein
>> >> > > (via Tom White)
>> >> >
>> >> >
>> >>
>> >>
>> >
>>
>
>

Re: All region server died due to "Parent directory doesn't exist"

Posted by Stack <st...@duboce.net>.

Want to send me a regionserver log Lars? (off-list)
St.Ack


On Thu, May 9, 2013 at 10:03 AM, lars hofhansl <la...@apache.org> wrote:

> Thanks Ted and Varun.
>
>
> Let me check on the .META. server.
>
>
> The majority (13) of the RSs died within 2 minutes. The remaining 3 died
> over the following 10 minutes.
> So that would point to general issue. I did not see any ZK issues but I'll
> double check.
>
>
> It is just interesting that even now, if I start and RS it aborts within a
> minute or two, because of this issue.
>
> -- Lars
>
>
> ----- Original Message -----
> From: Ted Yu <yu...@gmail.com>
> To: dev@hbase.apache.org
> Cc:
> Sent: Thursday, May 9, 2013 9:51 AM
> Subject: Re: All region server died due to "Parent directory doesn't exist"
>
> Thanks Varun for sharing your experience.
>
> Lars:
> Was the server carrying .META. functioning properly around the time when
> you observed the problem ?
>
> Cheers
>
> On Thu, May 9, 2013 at 9:41 AM, Varun Sharma <va...@pinterest.com> wrote:
>
> > I meant no NTP/clock synchronization b/w zookeeper quorum and the HBase
> > cluster. I am not sure if you are seeing the exact same issue though. We
> > did not have mass failures at the same time due to this..
> >
> > Thanks
> > Varun
> >
> >
> > On Thu, May 9, 2013 at 9:39 AM, Varun Sharma <va...@pinterest.com>
> wrote:
> >
> > > Btw, I am not 100 % sure but I have some seen something like this
> before:
> > >
> > > 1) ZK connection flakiness causes ephemeral nodes to expire
> > > 2) Master detects failure and renames the logs into a splitting
> directory
> > > - this is intentional so that in case that region server comes back up,
> > it
> > > cannot write to the logs being split
> > > 3) Region server dies because the log is renamed
> > >
> > > So, the yanking away of files is done by the HBase master and is
> expected
> > > if the master feels the server is dead. We found that the Region server
> > > logs DFS exceptions like crazy (1000s of them) in that case and we
> always
> > > suspected that this is some kind of DFS error but when we really go
> upto
> > > the point where it started, we found some zookeeper session issues.
> > >
> > > We had two cases of this - either super high load or NTP/no clock
> > > synchronization b/w the clusters causing this issue for us.
> > >
> > > Thanks
> > > Varun
> > >
> > >
> > > On Thu, May 9, 2013 at 9:16 AM, lars hofhansl <la...@apache.org>
> wrote:
> > >
> > >> Thanks Ted. I'll do the same.
> > >>
> > >>
> > >> ----- Original Message -----
> > >> From: Ted Yu <yu...@gmail.com>
> > >> To: dev@hbase.apache.org; lars hofhansl <la...@apache.org>
> > >> Cc:
> > >> Sent: Thursday, May 9, 2013 9:07 AM
> > >> Subject: Re: All region server died due to "Parent directory doesn't
> > >> exist"
> > >>
> > >> I went through the patch for HBASE-7824 one more time and didn't find
> > >> direct correlation to the issue Lars reported.
> > >>
> > >> I am going over the other JIRAs in Lars' list.
> > >>
> > >> Cheers
> > >>
> > >> On Thu, May 9, 2013 at 8:48 AM, lars hofhansl <la...@apache.org>
> wrote:
> > >>
> > >> > I will try. I do not think this is the issue, though.
> > >> >
> > >> > The master is up in my case.
> > >> > Right now the cluster is in a state where each region server aborts
> > >> itself
> > >> > shortly after being started (which coincides with having it's log
> > >> directory
> > >> > renamed to ...-splitting).
> > >> >
> > >> >
> > >> > This is a test cluster and I could just start from scratch... This
> > >> appears
> > >> > to be a serious enough problem, though, and I would like to track
> down
> > >> the
> > >> > issue.
> > >> >
> > >> > -- Lars
> > >> >
> > >> >
> > >> >
> > >> > ----- Original Message -----
> > >> > From: Ted Yu <yu...@gmail.com>
> > >> > To: "dev@hbase.apache.org" <de...@hbase.apache.org>
> > >> > Cc: "dev@hbase.apache.org" <de...@hbase.apache.org>
> > >> > Sent: Thursday, May 9, 2013 2:04 AM
> > >> > Subject: Re: All region server died due to "Parent directory doesn't
> > >> exist"
> > >> >
> > >> > The config came from hbase-7824.
> > >> >
> > >> > There are other JIRAs in Lars' list which are related to log
> > splitting.
> > >> >
> > >> > I think more investigation is needed.
> > >> >
> > >> > Cheers
> > >> >
> > >> > On May 9, 2013, at 1:59 AM, Andrew Purtell <ap...@apache.org>
> > wrote:
> > >> >
> > >> > > So that is HBASE-7824, right?
> > >> > >
> > >> > > On Thu, May 9, 2013 at 4:33 PM, Ted Yu <yu...@gmail.com>
> wrote:
> > >> > >
> > >> > >> hbase.master.wait.for.log.splitting
> > >> > >
> > >> > >
> > >> > >
> > >> > >
> > >> > > --
> > >> > > Best regards,
> > >> > >
> > >> > >   - Andy
> > >> > >
> > >> > > Problems worthy of attack prove their worth by hitting back. -
> Piet
> > >> Hein
> > >> > > (via Tom White)
> > >> >
> > >> >
> > >>
> > >>
> > >
> >
>
>

Re: All region server died due to "Parent directory doesn't exist"

Posted by lars hofhansl <la...@apache.org>.

Thanks Ted and Varun.


Let me check on the .META. server.


The majority (13) of the RSs died within 2 minutes. The remaining 3 died over the following 10 minutes.
So that would point to a general issue. I did not see any ZK issues but I'll double check.


It is just interesting that even now, if I start an RS, it aborts within a minute or two because of this issue.

-- Lars


----- Original Message -----
From: Ted Yu <yu...@gmail.com>
To: dev@hbase.apache.org
Cc: 
Sent: Thursday, May 9, 2013 9:51 AM
Subject: Re: All region server died due to "Parent directory doesn't exist"

Thanks Varun for sharing your experience.

Lars:
Was the server carrying .META. functioning properly around the time when
you observed the problem ?

Cheers

On Thu, May 9, 2013 at 9:41 AM, Varun Sharma <va...@pinterest.com> wrote:

> I meant no NTP/clock synchronization b/w zookeeper quorum and the HBase
> cluster. I am not sure if you are seeing the exact same issue though. We
> did not have mass failures at the same time due to this..
>
> Thanks
> Varun
>
>
> On Thu, May 9, 2013 at 9:39 AM, Varun Sharma <va...@pinterest.com> wrote:
>
> > Btw, I am not 100 % sure but I have some seen something like this before:
> >
> > 1) ZK connection flakiness causes ephemeral nodes to expire
> > 2) Master detects failure and renames the logs into a splitting directory
> > - this is intentional so that in case that region server comes back up,
> it
> > cannot write to the logs being split
> > 3) Region server dies because the log is renamed
> >
> > So, the yanking away of files is done by the HBase master and is expected
> > if the master feels the server is dead. We found that the Region server
> > logs DFS exceptions like crazy (1000s of them) in that case and we always
> > suspected that this is some kind of DFS error but when we really go upto
> > the point where it started, we found some zookeeper session issues.
> >
> > We had two cases of this - either super high load or NTP/no clock
> > synchronization b/w the clusters causing this issue for us.
> >
> > Thanks
> > Varun
> >
> >
> > On Thu, May 9, 2013 at 9:16 AM, lars hofhansl <la...@apache.org> wrote:
> >
> >> Thanks Ted. I'll do the same.
> >>
> >>
> >> ----- Original Message -----
> >> From: Ted Yu <yu...@gmail.com>
> >> To: dev@hbase.apache.org; lars hofhansl <la...@apache.org>
> >> Cc:
> >> Sent: Thursday, May 9, 2013 9:07 AM
> >> Subject: Re: All region server died due to "Parent directory doesn't
> >> exist"
> >>
> >> I went through the patch for HBASE-7824 one more time and didn't find
> >> direct correlation to the issue Lars reported.
> >>
> >> I am going over the other JIRAs in Lars' list.
> >>
> >> Cheers
> >>
> >> On Thu, May 9, 2013 at 8:48 AM, lars hofhansl <la...@apache.org> wrote:
> >>
> >> > I will try. I do not think this is the issue, though.
> >> >
> >> > The master is up in my case.
> >> > Right now the cluster is in a state where each region server aborts
> >> itself
> >> > shortly after being started (which coincides with having it's log
> >> directory
> >> > renamed to ...-splitting).
> >> >
> >> >
> >> > This is a test cluster and I could just start from scratch... This
> >> appears
> >> > to be a serious enough problem, though, and I would like to track down
> >> the
> >> > issue.
> >> >
> >> > -- Lars
> >> >
> >> >
> >> >
> >> > ----- Original Message -----
> >> > From: Ted Yu <yu...@gmail.com>
> >> > To: "dev@hbase.apache.org" <de...@hbase.apache.org>
> >> > Cc: "dev@hbase.apache.org" <de...@hbase.apache.org>
> >> > Sent: Thursday, May 9, 2013 2:04 AM
> >> > Subject: Re: All region server died due to "Parent directory doesn't
> >> exist"
> >> >
> >> > The config came from hbase-7824.
> >> >
> >> > There are other JIRAs in Lars' list which are related to log
> splitting.
> >> >
> >> > I think more investigation is needed.
> >> >
> >> > Cheers
> >> >
> >> > On May 9, 2013, at 1:59 AM, Andrew Purtell <ap...@apache.org>
> wrote:
> >> >
> >> > > So that is HBASE-7824, right?
> >> > >
> >> > > On Thu, May 9, 2013 at 4:33 PM, Ted Yu <yu...@gmail.com> wrote:
> >> > >
> >> > >> hbase.master.wait.for.log.splitting
> >> > >
> >> > >
> >> > >
> >> > >
> >> > > --
> >> > > Best regards,
> >> > >
> >> > >   - Andy
> >> > >
> >> > > Problems worthy of attack prove their worth by hitting back. - Piet
> >> Hein
> >> > > (via Tom White)
> >> >
> >> >
> >>
> >>
> >
>

Re: All region server died due to "Parent directory doesn't exist"

Posted by Ted Yu <yu...@gmail.com>.

Thanks Varun for sharing your experience.

Lars:
Was the server carrying .META. functioning properly around the time when
you observed the problem?

Cheers

On Thu, May 9, 2013 at 9:41 AM, Varun Sharma <va...@pinterest.com> wrote:

> I meant no NTP/clock synchronization b/w zookeeper quorum and the HBase
> cluster. I am not sure if you are seeing the exact same issue though. We
> did not have mass failures at the same time due to this..
>
> Thanks
> Varun
>
>
> On Thu, May 9, 2013 at 9:39 AM, Varun Sharma <va...@pinterest.com> wrote:
>
> > Btw, I am not 100 % sure but I have some seen something like this before:
> >
> > 1) ZK connection flakiness causes ephemeral nodes to expire
> > 2) Master detects failure and renames the logs into a splitting directory
> > - this is intentional so that in case that region server comes back up,
> it
> > cannot write to the logs being split
> > 3) Region server dies because the log is renamed
> >
> > So, the yanking away of files is done by the HBase master and is expected
> > if the master feels the server is dead. We found that the Region server
> > logs DFS exceptions like crazy (1000s of them) in that case and we always
> > suspected that this is some kind of DFS error but when we really go upto
> > the point where it started, we found some zookeeper session issues.
> >
> > We had two cases of this - either super high load or NTP/no clock
> > synchronization b/w the clusters causing this issue for us.
> >
> > Thanks
> > Varun
> >
> >
> > On Thu, May 9, 2013 at 9:16 AM, lars hofhansl <la...@apache.org> wrote:
> >
> >> Thanks Ted. I'll do the same.
> >>
> >>
> >> ----- Original Message -----
> >> From: Ted Yu <yu...@gmail.com>
> >> To: dev@hbase.apache.org; lars hofhansl <la...@apache.org>
> >> Cc:
> >> Sent: Thursday, May 9, 2013 9:07 AM
> >> Subject: Re: All region server died due to "Parent directory doesn't
> >> exist"
> >>
> >> I went through the patch for HBASE-7824 one more time and didn't find
> >> direct correlation to the issue Lars reported.
> >>
> >> I am going over the other JIRAs in Lars' list.
> >>
> >> Cheers
> >>
> >> On Thu, May 9, 2013 at 8:48 AM, lars hofhansl <la...@apache.org> wrote:
> >>
> >> > I will try. I do not think this is the issue, though.
> >> >
> >> > The master is up in my case.
> >> > Right now the cluster is in a state where each region server aborts
> >> itself
> >> > shortly after being started (which coincides with having it's log
> >> directory
> >> > renamed to ...-splitting).
> >> >
> >> >
> >> > This is a test cluster and I could just start from scratch... This
> >> appears
> >> > to be a serious enough problem, though, and I would like to track down
> >> the
> >> > issue.
> >> >
> >> > -- Lars
> >> >
> >> >
> >> >
> >> > ----- Original Message -----
> >> > From: Ted Yu <yu...@gmail.com>
> >> > To: "dev@hbase.apache.org" <de...@hbase.apache.org>
> >> > Cc: "dev@hbase.apache.org" <de...@hbase.apache.org>
> >> > Sent: Thursday, May 9, 2013 2:04 AM
> >> > Subject: Re: All region server died due to "Parent directory doesn't
> >> exist"
> >> >
> >> > The config came from hbase-7824.
> >> >
> >> > There are other JIRAs in Lars' list which are related to log
> splitting.
> >> >
> >> > I think more investigation is needed.
> >> >
> >> > Cheers
> >> >
> >> > On May 9, 2013, at 1:59 AM, Andrew Purtell <ap...@apache.org>
> wrote:
> >> >
> >> > > So that is HBASE-7824, right?
> >> > >
> >> > > On Thu, May 9, 2013 at 4:33 PM, Ted Yu <yu...@gmail.com> wrote:
> >> > >
> >> > >> hbase.master.wait.for.log.splitting
> >> > >
> >> > >
> >> > >
> >> > >
> >> > > --
> >> > > Best regards,
> >> > >
> >> > >   - Andy
> >> > >
> >> > > Problems worthy of attack prove their worth by hitting back. - Piet
> >> Hein
> >> > > (via Tom White)
> >> >
> >> >
> >>
> >>
> >
>

Re: All region server died due to "Parent directory doesn't exist"

Posted by Varun Sharma <va...@pinterest.com>.

I meant no NTP/clock synchronization b/w the ZooKeeper quorum and the HBase
cluster. I am not sure if you are seeing the exact same issue, though. We
did not have mass failures at the same time due to this.

Thanks
Varun


On Thu, May 9, 2013 at 9:39 AM, Varun Sharma <va...@pinterest.com> wrote:

> Btw, I am not 100 % sure but I have some seen something like this before:
>
> 1) ZK connection flakiness causes ephemeral nodes to expire
> 2) Master detects failure and renames the logs into a splitting directory
> - this is intentional so that in case that region server comes back up, it
> cannot write to the logs being split
> 3) Region server dies because the log is renamed
>
> So, the yanking away of files is done by the HBase master and is expected
> if the master feels the server is dead. We found that the Region server
> logs DFS exceptions like crazy (1000s of them) in that case and we always
> suspected that this is some kind of DFS error but when we really go upto
> the point where it started, we found some zookeeper session issues.
>
> We had two cases of this - either super high load or NTP/no clock
> synchronization b/w the clusters causing this issue for us.
>
> Thanks
> Varun
>
>
> On Thu, May 9, 2013 at 9:16 AM, lars hofhansl <la...@apache.org> wrote:
>
>> Thanks Ted. I'll do the same.
>>
>>
>> ----- Original Message -----
>> From: Ted Yu <yu...@gmail.com>
>> To: dev@hbase.apache.org; lars hofhansl <la...@apache.org>
>> Cc:
>> Sent: Thursday, May 9, 2013 9:07 AM
>> Subject: Re: All region server died due to "Parent directory doesn't
>> exist"
>>
>> I went through the patch for HBASE-7824 one more time and didn't find
>> direct correlation to the issue Lars reported.
>>
>> I am going over the other JIRAs in Lars' list.
>>
>> Cheers
>>
>> On Thu, May 9, 2013 at 8:48 AM, lars hofhansl <la...@apache.org> wrote:
>>
>> > I will try. I do not think this is the issue, though.
>> >
>> > The master is up in my case.
>> > Right now the cluster is in a state where each region server aborts
>> itself
>> > shortly after being started (which coincides with having it's log
>> directory
>> > renamed to ...-splitting).
>> >
>> >
>> > This is a test cluster and I could just start from scratch... This
>> appears
>> > to be a serious enough problem, though, and I would like to track down
>> the
>> > issue.
>> >
>> > -- Lars
>> >
>> >
>> >
>> > ----- Original Message -----
>> > From: Ted Yu <yu...@gmail.com>
>> > To: "dev@hbase.apache.org" <de...@hbase.apache.org>
>> > Cc: "dev@hbase.apache.org" <de...@hbase.apache.org>
>> > Sent: Thursday, May 9, 2013 2:04 AM
>> > Subject: Re: All region server died due to "Parent directory doesn't
>> exist"
>> >
>> > The config came from hbase-7824.
>> >
>> > There are other JIRAs in Lars' list which are related to log splitting.
>> >
>> > I think more investigation is needed.
>> >
>> > Cheers
>> >
>> > On May 9, 2013, at 1:59 AM, Andrew Purtell <ap...@apache.org> wrote:
>> >
>> > > So that is HBASE-7824, right?
>> > >
>> > > On Thu, May 9, 2013 at 4:33 PM, Ted Yu <yu...@gmail.com> wrote:
>> > >
>> > >> hbase.master.wait.for.log.splitting
>> > >
>> > >
>> > >
>> > >
>> > > --
>> > > Best regards,
>> > >
>> > >   - Andy
>> > >
>> > > Problems worthy of attack prove their worth by hitting back. - Piet
>> Hein
>> > > (via Tom White)
>> >
>> >
>>
>>
>

Re: All region server died due to "Parent directory doesn't exist"

Posted by Varun Sharma <va...@pinterest.com>.

Btw, I am not 100% sure, but I have seen something like this before:

1) ZK connection flakiness causes ephemeral nodes to expire
2) Master detects failure and renames the logs into a splitting directory -
this is intentional so that in case that region server comes back up, it
cannot write to the logs being split
3) Region server dies because the log is renamed
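
The three steps above can be sketched on a local filesystem. This is only an analogy, not HBase or HDFS code: the directory layout mirrors /hbase/.logs/<server>, while the function name, the server name, and the file names are made up for illustration. It shows why a writer that still holds the old path fails once the parent directory has been renamed to ...-splitting.

```python
import tempfile
from pathlib import Path


def simulate_log_dir_fencing(server_name: str) -> str:
    """Local-filesystem stand-in for the sequence above; returns the error type name."""
    with tempfile.TemporaryDirectory() as root:
        logs = Path(root) / ".logs"
        server_dir = logs / server_name
        server_dir.mkdir(parents=True)
        # Hypothetical file name standing in for the region server's current log.
        (server_dir / "wal.0").write_text("edits")

        # Step 2: the master decides the server is dead and renames its log dir.
        splitting_dir = logs / (server_name + "-splitting")
        server_dir.rename(splitting_dir)

        # Step 3: the still-running region server rolls its log and tries to
        # create a new file under the old path, whose parent no longer exists.
        try:
            (server_dir / "wal.1").write_text("more edits")
        except FileNotFoundError as err:
            return type(err).__name__
        return "no error"


if __name__ == "__main__":
    print(simulate_log_dir_fencing("rs1,60020,1"))
```

On HDFS the equivalent failure surfaces from the NameNode as the "Parent directory doesn't exist" message in the stack trace at the top of this thread.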

So, the yanking away of files is done by the HBase master and is expected
if the master thinks the server is dead. We found that the region server
logs DFS exceptions like crazy (1000s of them) in that case, and we always
suspected some kind of DFS error, but when we went all the way back to
the point where it started, we found some ZooKeeper session issues.
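
One way to do that kind of triage is to scan a region server log for the first ZooKeeper-looking line and the first DFS-looking line and compare their positions. The sketch below is a rough illustration only: the two regular expressions are assumptions about what such lines contain (the DFS one is taken from the exception text in this thread), and the function names are made up. Adjust the patterns to whatever your logs actually print.

```python
import re
from typing import Iterable, Optional, Tuple

# Assumed patterns; they are guesses, not an exhaustive list of HBase messages.
ZK_PATTERN = re.compile(r"session expired|SessionExpiredException", re.IGNORECASE)
DFS_PATTERN = re.compile(r"Parent directory doesn't exist|FileNotFoundException")


def first_matches(lines: Iterable[str]) -> Tuple[Optional[int], Optional[int]]:
    """Return (index of first ZK-looking line, index of first DFS-looking line)."""
    first_zk = None
    first_dfs = None
    for i, line in enumerate(lines):
        if first_zk is None and ZK_PATTERN.search(line):
            first_zk = i
        if first_dfs is None and DFS_PATTERN.search(line):
            first_dfs = i
    return first_zk, first_dfs


def zk_preceded_dfs(lines: Iterable[str]) -> bool:
    """True only if both kinds of line occur and the ZK one comes first."""
    zk, dfs = first_matches(lines)
    return zk is not None and dfs is not None and zk < dfs
```

If the ZooKeeper line comes first, the flood of DFS exceptions is more likely a symptom of the master fencing the server than a real HDFS fault, though this ordering alone does not prove it.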

We had two cases of this: either super high load or a lack of NTP/clock
synchronization b/w the clusters caused this issue for us.

Thanks
Varun


On Thu, May 9, 2013 at 9:16 AM, lars hofhansl <la...@apache.org> wrote:

> Thanks Ted. I'll do the same.
>
>
> ----- Original Message -----
> From: Ted Yu <yu...@gmail.com>
> To: dev@hbase.apache.org; lars hofhansl <la...@apache.org>
> Cc:
> Sent: Thursday, May 9, 2013 9:07 AM
> Subject: Re: All region server died due to "Parent directory doesn't exist"
>
> I went through the patch for HBASE-7824 one more time and didn't find
> direct correlation to the issue Lars reported.
>
> I am going over the other JIRAs in Lars' list.
>
> Cheers
>
> On Thu, May 9, 2013 at 8:48 AM, lars hofhansl <la...@apache.org> wrote:
>
> > I will try. I do not think this is the issue, though.
> >
> > The master is up in my case.
> > Right now the cluster is in a state where each region server aborts itself
> > shortly after being started (which coincides with having its log directory
> > renamed to ...-splitting).
> >
> >
> > This is a test cluster and I could just start from scratch... This appears
> > to be a serious enough problem, though, and I would like to track down the
> > issue.
> >
> > -- Lars
> >
> >
> >
> > ----- Original Message -----
> > From: Ted Yu <yu...@gmail.com>
> > To: "dev@hbase.apache.org" <de...@hbase.apache.org>
> > Cc: "dev@hbase.apache.org" <de...@hbase.apache.org>
> > Sent: Thursday, May 9, 2013 2:04 AM
> > Subject: Re: All region server died due to "Parent directory doesn't exist"
> >
> > The config came from hbase-7824.
> >
> > There are other JIRAs in Lars' list which are related to log splitting.
> >
> > I think more investigation is needed.
> >
> > Cheers
> >
> > On May 9, 2013, at 1:59 AM, Andrew Purtell <ap...@apache.org> wrote:
> >
> > > So that is HBASE-7824, right?
> > >
> > > On Thu, May 9, 2013 at 4:33 PM, Ted Yu <yu...@gmail.com> wrote:
> > >
> > >> hbase.master.wait.for.log.splitting
> > >
> > >
> > >
> > >
> > > --
> > > Best regards,
> > >
> > >   - Andy
> > >
> > > Problems worthy of attack prove their worth by hitting back. - Piet Hein
> > > (via Tom White)
> >
> >
>
>

Re: All region server died due to "Parent directory doesn't exist"

Posted by lars hofhansl <la...@apache.org>.

Thanks Ted. I'll do the same.


----- Original Message -----
From: Ted Yu <yu...@gmail.com>
To: dev@hbase.apache.org; lars hofhansl <la...@apache.org>
Cc: 
Sent: Thursday, May 9, 2013 9:07 AM
Subject: Re: All region server died due to "Parent directory doesn't exist"

I went through the patch for HBASE-7824 one more time and didn't find
direct correlation to the issue Lars reported.

I am going over the other JIRAs in Lars' list.

Cheers

On Thu, May 9, 2013 at 8:48 AM, lars hofhansl <la...@apache.org> wrote:

> I will try. I do not think this is the issue, though.
>
> The master is up in my case.
> Right now the cluster is in a state where each region server aborts itself
> shortly after being started (which coincides with having its log directory
> renamed to ...-splitting).
>
>
> This is a test cluster and I could just start from scratch... This appears
> to be a serious enough problem, though, and I would like to track down the
> issue.
>
> -- Lars
>
>
>
> ----- Original Message -----
> From: Ted Yu <yu...@gmail.com>
> To: "dev@hbase.apache.org" <de...@hbase.apache.org>
> Cc: "dev@hbase.apache.org" <de...@hbase.apache.org>
> Sent: Thursday, May 9, 2013 2:04 AM
> Subject: Re: All region server died due to "Parent directory doesn't exist"
>
> The config came from hbase-7824.
>
> There are other JIRAs in Lars' list which are related to log splitting.
>
> I think more investigation is needed.
>
> Cheers
>
> On May 9, 2013, at 1:59 AM, Andrew Purtell <ap...@apache.org> wrote:
>
> > So that is HBASE-7824, right?
> >
> > On Thu, May 9, 2013 at 4:33 PM, Ted Yu <yu...@gmail.com> wrote:
> >
> >> hbase.master.wait.for.log.splitting
> >
> >
> >
> >
> > --
> > Best regards,
> >
> >   - Andy
> >
> > Problems worthy of attack prove their worth by hitting back. - Piet Hein
> > (via Tom White)
>
>

Re: All region server died due to "Parent directory doesn't exist"

Posted by Ted Yu <yu...@gmail.com>.

I went through the patch for HBASE-7824 one more time and didn't find
direct correlation to the issue Lars reported.

I am going over the other JIRAs in Lars' list.

Cheers

On Thu, May 9, 2013 at 8:48 AM, lars hofhansl <la...@apache.org> wrote:

> I will try. I do not think this is the issue, though.
>
> The master is up in my case.
> Right now the cluster is in a state where each region server aborts itself
> shortly after being started (which coincides with having its log directory
> renamed to ...-splitting).
>
>
> This is a test cluster and I could just start from scratch... This appears
> to be a serious enough problem, though, and I would like to track down the
> issue.
>
> -- Lars
>
>
>
> ----- Original Message -----
> From: Ted Yu <yu...@gmail.com>
> To: "dev@hbase.apache.org" <de...@hbase.apache.org>
> Cc: "dev@hbase.apache.org" <de...@hbase.apache.org>
> Sent: Thursday, May 9, 2013 2:04 AM
> Subject: Re: All region server died due to "Parent directory doesn't exist"
>
> The config came from hbase-7824.
>
> There are other JIRAs in Lars' list which are related to log splitting.
>
> I think more investigation is needed.
>
> Cheers
>
> On May 9, 2013, at 1:59 AM, Andrew Purtell <ap...@apache.org> wrote:
>
> > So that is HBASE-7824, right?
> >
> > On Thu, May 9, 2013 at 4:33 PM, Ted Yu <yu...@gmail.com> wrote:
> >
> >> hbase.master.wait.for.log.splitting
> >
> >
> >
> >
> > --
> > Best regards,
> >
> >   - Andy
> >
> > Problems worthy of attack prove their worth by hitting back. - Piet Hein
> > (via Tom White)
>
>

Re: All region server died due to "Parent directory doesn't exist"

Posted by lars hofhansl <la...@apache.org>.

I will try. I do not think this is the issue, though.

The master is up in my case.
Right now the cluster is in a state where each region server aborts itself shortly after being started (which coincides with having its log directory renamed to ...-splitting).

This is a test cluster and I could just start from scratch... This appears to be a serious enough problem, though, and I would like to track down the issue.

-- Lars

----- Original Message -----
From: Ted Yu <yu...@gmail.com>
To: "dev@hbase.apache.org" <de...@hbase.apache.org>
Cc: "dev@hbase.apache.org" <de...@hbase.apache.org>
Sent: Thursday, May 9, 2013 2:04 AM
Subject: Re: All region server died due to "Parent directory doesn't exist"

The config came from hbase-7824. 

There are other JIRAs in Lars' list which are related to log splitting. 

I think more investigation is needed. 

Cheers

On May 9, 2013, at 1:59 AM, Andrew Purtell <ap...@apache.org> wrote:

> So that is HBASE-7824, right?
> 
> On Thu, May 9, 2013 at 4:33 PM, Ted Yu <yu...@gmail.com> wrote:
> 
>> hbase.master.wait.for.log.splitting
> 
> 
> 
> 
> -- 
> Best regards,
> 
>   - Andy
> 
> Problems worthy of attack prove their worth by hitting back. - Piet Hein
> (via Tom White)

Re: All region server died due to "Parent directory doesn't exist"

Posted by Andrew Purtell <ap...@apache.org>.

Yes, I was asking about the patch that introduced that configuration
setting.


On Thu, May 9, 2013 at 5:04 PM, Ted Yu <yu...@gmail.com> wrote:

> The config came from hbase-7824.
>
> There are other JIRAs in Lars' list which are related to log splitting.
>
> I think more investigation is needed.
>
> Cheers
>
> On May 9, 2013, at 1:59 AM, Andrew Purtell <ap...@apache.org> wrote:
>
> > So that is HBASE-7824, right?
> >
> > On Thu, May 9, 2013 at 4:33 PM, Ted Yu <yu...@gmail.com> wrote:
> >
> >> hbase.master.wait.for.log.splitting
> >
> >
> >
> >
> > --
> > Best regards,
> >
> >   - Andy
> >
> > Problems worthy of attack prove their worth by hitting back. - Piet Hein
> > (via Tom White)
>



-- 
Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein
(via Tom White)

Re: All region server died due to "Parent directory doesn't exist"

Posted by Ted Yu <yu...@gmail.com>.

The config came from hbase-7824. 

There are other JIRAs in Lars' list which are related to log splitting. 

I think more investigation is needed. 

Cheers

On May 9, 2013, at 1:59 AM, Andrew Purtell <ap...@apache.org> wrote:

> So that is HBASE-7824, right?
> 
> On Thu, May 9, 2013 at 4:33 PM, Ted Yu <yu...@gmail.com> wrote:
> 
>> hbase.master.wait.for.log.splitting
> 
> 
> 
> 
> -- 
> Best regards,
> 
>   - Andy
> 
> Problems worthy of attack prove their worth by hitting back. - Piet Hein
> (via Tom White)

Re: All region server died due to "Parent directory doesn't exist"

Posted by Andrew Purtell <ap...@apache.org>.

So that is HBASE-7824, right?

On Thu, May 9, 2013 at 4:33 PM, Ted Yu <yu...@gmail.com> wrote:

> hbase.master.wait.for.log.splitting
>




-- 
Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein
(via Tom White)

Re: All region server died due to "Parent directory doesn't exist"

Posted by Ted Yu <yu...@gmail.com>.

What was the value of the hbase.master.wait.for.log.splitting config parameter?
The default value is false.
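
One way to answer that question is to read the property straight out of the cluster's hbase-site.xml. The following is a small sketch in Python using only the standard library; the function name, the sample XML, and the example.com host are assumptions for illustration, and the fallback of "false" is simply the default stated above.

```python
import xml.etree.ElementTree as ET


def get_hbase_property(site_xml_text, name, default=None):
    """Return the <value> of the named <property>, or default if it is not set."""
    root = ET.fromstring(site_xml_text)
    for prop in root.iter("property"):
        if prop.findtext("name", default="").strip() == name:
            return prop.findtext("value", default="").strip()
    return default


# Hypothetical hbase-site.xml content that does not set the property explicitly.
SAMPLE_SITE_XML = """<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://namenode.example.com:8020/hbase</value>
  </property>
</configuration>"""

value = get_hbase_property(
    SAMPLE_SITE_XML, "hbase.master.wait.for.log.splitting", default="false"
)
```

This only reports what the file says; a value set elsewhere (for example programmatically) would not show up here.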

Cheers

On May 9, 2013, at 12:41 AM, lars hofhansl <la...@apache.org> wrote:

> Potential jiras that went into 0.94.7 that could be responsible:
> HBASE-7824
> HBASE-8246
> HBASE-8276
> HBASE-8288
> HBASE-8212
> HBASE-8081
> HBASE-8211
> HBASE-8211
> 
> 
> -- Lars
> 
> ----- Original Message -----
> From: lars hofhansl <la...@apache.org>
> To: "dev@hbase.apache.org" <de...@hbase.apache.org>; lars hofhansl <la...@apache.org>
> Cc: 
> Sent: Thursday, May 9, 2013 12:23 AM
> Subject: Re: All region server died due to "Parent directory doesn't exist"
> 
> All the directories in .logs have the -splitting suffix, so this seems by design.
> The problem is that even though all logs are split, each time I start up a region server now, its log dir is renamed to ...-splitting and the region server shuts itself down.
> 
> -- Lars
> 
> 
> 
> ----- Original Message -----
> From: lars hofhansl <la...@apache.org>
> To: hbase-dev <de...@hbase.apache.org>
> Cc: 
> Sent: Wednesday, May 8, 2013 11:39 PM
> Subject: All region server died due to "Parent directory doesn't exist"
> 
> We just had all RegionServers die in a test cluster. All with the following exception.
> (This is CDH4.2.1 with HBase 0.94.7 built against it)
> 
> Strangely HDFS is up and running (I can ls all directories, create files in it, etc. HDFS's fsck reports that all is well), yet we had the RSs die with this.
> This almost looks like a race where the directories under .logs were yanked away while they were still in use.
> 
> I plan to investigate this further. In any event, has anybody seen this issue (or anything similar to this) before?
> When this happened there was no load on the cluster (other than some writes from OTSDB).
> 
> Thanks.
> 
> -- Lars
> 
> 2013-05-08 16:02:41,178 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server <host>,60020,1367614452787: IOE in log roller
> java.io.IOException: Exception in createWriter
>         at org.apache.hadoop.hbase.regionserver.wal.HLogFileSystem.createWriter(HLogFileSystem.java:66)
>         at org.apache.hadoop.hbase.regionserver.wal.HLog.createWriterInstance(HLog.java:715)
>         at org.apache.hadoop.hbase.regionserver.wal.HLog.rollWriter(HLog.java:648)
>         at org.apache.hadoop.hbase.regionserver.LogRoller.run(LogRoller.java:95)
>         at java.lang.Thread.run(Thread.java:662)
> Caused by: java.io.IOException: cannot get log writer
>         at org.apache.hadoop.hbase.regionserver.wal.HLog.createWriter(HLog.java:771)
>         at org.apache.hadoop.hbase.regionserver.wal.HLogFileSystem.createWriter(HLogFileSystem.java:60)
>         ... 4 more
> Caused by: java.io.IOException: java.io.FileNotFoundException: Parent directory doesn't exist: /hbase/.logs/<host>,60020,1367614452787
>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.verifyParentDir(FSNamesystem.java:1726)
>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:1848)
>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:1770)
>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:1747)
>         at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:418)
>         at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.create(ClientNamenodeProtocolServerSideTranslatorPB.java:205)
>         at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:44068)
>         at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:453)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1002)
>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1695)
>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1691)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:396)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1689)
> 
>         at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter.init(SequenceFileLogWriter.java:173)
>         at org.apache.hadoop.hbase.regionserver.wal.HLog.createWriter(HLog.java:768)
>         ... 5 more
>

Re: All region server died due to "Parent directory doesn't exist"

Posted by lars hofhansl <la...@apache.org>.

Potential jiras that went into 0.94.7 that could be responsible:
HBASE-7824
HBASE-8246
HBASE-8276
HBASE-8288
HBASE-8212
HBASE-8081
HBASE-8211
HBASE-8211


-- Lars

----- Original Message -----
From: lars hofhansl <la...@apache.org>
To: "dev@hbase.apache.org" <de...@hbase.apache.org>; lars hofhansl <la...@apache.org>
Cc: 
Sent: Thursday, May 9, 2013 12:23 AM
Subject: Re: All region server died due to "Parent directory doesn't exist"

All the directories in .logs have the -splitting suffix, so this seems by design.
The problem is that even though all logs are split, each time I start up a region server now, its log dir is renamed to ...-splitting and the region server shuts itself down.

-- Lars



----- Original Message -----
From: lars hofhansl <la...@apache.org>
To: hbase-dev <de...@hbase.apache.org>
Cc: 
Sent: Wednesday, May 8, 2013 11:39 PM
Subject: All region server died due to "Parent directory doesn't exist"

We just had all RegionServers die in a test cluster. All with the following exception.
(This is CDH4.2.1 with HBase 0.94.7 built against it)

Strangely HDFS is up and running (I can ls all directories, create files in it, etc. HDFS's fsck reports that all is well), yet we had the RSs die with this.
This almost looks like a race where the directories under .logs were yanked away while they were still in use.

I plan to investigate this further. In any event, has anybody seen this issue (or anything similar to this) before?
When this happened there was no load on the cluster (other than some writes from OTSDB).

Thanks.

-- Lars

2013-05-08 16:02:41,178 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server <host>,60020,1367614452787: IOE in log roller
java.io.IOException: Exception in createWriter
        at org.apache.hadoop.hbase.regionserver.wal.HLogFileSystem.createWriter(HLogFileSystem.java:66)
        at org.apache.hadoop.hbase.regionserver.wal.HLog.createWriterInstance(HLog.java:715)
        at org.apache.hadoop.hbase.regionserver.wal.HLog.rollWriter(HLog.java:648)
        at org.apache.hadoop.hbase.regionserver.LogRoller.run(LogRoller.java:95)
        at java.lang.Thread.run(Thread.java:662)
Caused by: java.io.IOException: cannot get log writer
        at org.apache.hadoop.hbase.regionserver.wal.HLog.createWriter(HLog.java:771)
        at org.apache.hadoop.hbase.regionserver.wal.HLogFileSystem.createWriter(HLogFileSystem.java:60)
        ... 4 more
Caused by: java.io.IOException: java.io.FileNotFoundException: Parent directory doesn't exist: /hbase/.logs/<host>,60020,1367614452787
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.verifyParentDir(FSNamesystem.java:1726)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:1848)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:1770)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:1747)
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:418)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.create(ClientNamenodeProtocolServerSideTranslatorPB.java:205)
        at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:44068)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:453)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1002)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1695)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1691)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1689)

        at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter.init(SequenceFileLogWriter.java:173)
        at org.apache.hadoop.hbase.regionserver.wal.HLog.createWriter(HLog.java:768)
        ... 5 more

Re: All region server died due to "Parent directory doesn't exist"

Posted by lars hofhansl <la...@apache.org>.

All the directories in .logs have the -splitting suffix, so this seems by design.
The problem is that even though all logs are split, each time I start up a region server now, its log dir is renamed to ...-splitting and the region server shuts itself down.
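
For anyone wanting to check the same thing on their own cluster, here is a rough Python sketch that classifies directory names from a listing of .logs. It assumes the names follow the host,port,startcode form seen in the exception (with an optional -splitting suffix); the helper names and the example.com hosts are made up for illustration, and getting the listing itself (e.g. from an HDFS ls) is left out.

```python
SPLITTING_SUFFIX = "-splitting"


def parse_log_dir(dirname):
    """Split a .logs directory name into host, port, start code and a splitting flag."""
    splitting = dirname.endswith(SPLITTING_SUFFIX)
    server = dirname[: -len(SPLITTING_SUFFIX)] if splitting else dirname
    host, port, startcode = server.rsplit(",", 2)
    return {
        "host": host,
        "port": int(port),
        "startcode": int(startcode),
        "splitting": splitting,
    }


def summarize(dirnames):
    """Count how many log directories are being split versus still live."""
    parsed = [parse_log_dir(d) for d in dirnames]
    live = [p for p in parsed if not p["splitting"]]
    return {"total": len(parsed), "splitting": len(parsed) - len(live), "live": len(live)}


# Hypothetical listing: two directories already renamed, one still live.
example = [
    "rs1.example.com,60020,1367614452787-splitting",
    "rs2.example.com,60020,1367614452790-splitting",
    "rs1.example.com,60020,1368086400000",
]
summary = summarize(example)
```

A summary with "live" at zero right after a region server start would match the state described here.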

-- Lars

----- Original Message -----
From: lars hofhansl <la...@apache.org>
To: hbase-dev <de...@hbase.apache.org>
Cc: 
Sent: Wednesday, May 8, 2013 11:39 PM
Subject: All region server died due to "Parent directory doesn't exist"

We just had all RegionServers die in a test cluster. All with the following exception.
(This is CDH4.2.1 with HBase 0.94.7 built against it)

Strangely HDFS is up and running (I can ls all directories, create files in it, etc. HDFS's fsck reports that all is well), yet we had the RSs die with this.
This almost looks like a race where the directories under .logs were yanked away while they were still in use.

I plan to investigate this further. In any event, has anybody seen this issue (or anything similar to this) before?
When this happened there was no load on the cluster (other than some writes from OTSDB).

Thanks.

-- Lars

2013-05-08 16:02:41,178 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server <host>,60020,1367614452787: IOE in log roller
java.io.IOException: Exception in createWriter
        at org.apache.hadoop.hbase.regionserver.wal.HLogFileSystem.createWriter(HLogFileSystem.java:66)
        at org.apache.hadoop.hbase.regionserver.wal.HLog.createWriterInstance(HLog.java:715)
        at org.apache.hadoop.hbase.regionserver.wal.HLog.rollWriter(HLog.java:648)
        at org.apache.hadoop.hbase.regionserver.LogRoller.run(LogRoller.java:95)
        at java.lang.Thread.run(Thread.java:662)
Caused by: java.io.IOException: cannot get log writer
        at org.apache.hadoop.hbase.regionserver.wal.HLog.createWriter(HLog.java:771)
        at org.apache.hadoop.hbase.regionserver.wal.HLogFileSystem.createWriter(HLogFileSystem.java:60)
        ... 4 more
Caused by: java.io.IOException: java.io.FileNotFoundException: Parent directory doesn't exist: /hbase/.logs/<host>,60020,1367614452787
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.verifyParentDir(FSNamesystem.java:1726)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:1848)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:1770)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:1747)
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:418)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.create(ClientNamenodeProtocolServerSideTranslatorPB.java:205)
        at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:44068)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:453)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1002)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1695)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1691)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1689)

        at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter.init(SequenceFileLogWriter.java:173)
        at org.apache.hadoop.hbase.regionserver.wal.HLog.createWriter(HLog.java:768)
        ... 5 more
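
When scanning many region server logs for this failure, it can help to pull out just the "Caused by:" chain and look at the last entry. Below is a small Python sketch under that assumption; the function name is hypothetical and the sample text is an abbreviated copy of the trace above, not a complete log.

```python
def cause_chain(trace_text):
    """Return the messages of all 'Caused by:' lines, outermost first."""
    prefix = "Caused by: "
    chain = []
    for line in trace_text.splitlines():
        line = line.strip()
        if line.startswith(prefix):
            chain.append(line[len(prefix):])
    return chain


# Abbreviated from the trace in this message; stack frames mostly omitted.
SAMPLE_TRACE = """java.io.IOException: Exception in createWriter
        at org.apache.hadoop.hbase.regionserver.wal.HLog.rollWriter(HLog.java:648)
Caused by: java.io.IOException: cannot get log writer
        ... 4 more
Caused by: java.io.IOException: java.io.FileNotFoundException: Parent directory doesn't exist: /hbase/.logs/<host>,60020,1367614452787
        ... 5 more
"""

chain = cause_chain(SAMPLE_TRACE)
root_cause = chain[-1]
```

Here root_cause is the "Parent directory doesn't exist" line, which is the part worth searching for across servers.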