Posted to dev@hbase.apache.org by Zhoushuaifeng <zh...@huawei.com> on 2011/08/10 04:09:16 UTC

TestRollingRestart fail occasionally

Hi,
I run TestRollingRestart (0.90.3), and it fails occasionally. The failing log shows that log splitting gets stuck in a loop: recoverFileLease keeps failing, the while() loop never ends, and the test times out and fails.

Here are some of the logs:
After the primary master is restarted, not all of the RSs have connected before the master stops waiting:
TRR: Restarting primary master
INFO  [Master:0;linux1.site:35977] master.ServerManager(660): Waiting on regionserver(s) count to settle; currently=3
2011-07-06 09:12:56,331 INFO  [Master:0;linux1.site:35977] master.ServerManager(660): Waiting on regionserver(s) count to settle; currently=3
2011-07-06 09:12:57,831 INFO  [Master:0;linux1.site:35977] master.ServerManager(648): Finished waiting for regionserver count to settle; count=3, sleptFor=4500
2011-07-06 09:12:57,831 INFO  [Master:0;linux1.site:35977] master.ServerManager(674): Exiting wait on regionserver(s) to checkin; count=3, stopped=false, count of regions out on cluster=22
2011-07-06 09:12:57,834 INFO  [Master:0;linux1.site:35977] master.MasterFileSystem(180): Log folder hdfs://localhost:41078/user/root/.logs/linux1.site,54949,1309914772108 doesn't belong to a known region server, splitting

But after the master has started splitting, another RS connects:
2011-07-06 09:13:54,243 INFO  [RegionServer:3;linux1.site,54949,1309914772108] regionserver.HRegionServer(1456): Attempting connect to Master server at linux1.site:35977
2011-07-06 09:13:54,243 INFO  [RegionServer:3;linux1.site,54949,1309914772108] regionserver.HRegionServer(1475): Connected to master at linux1.site:35977

Then, lease recovery during log splitting may encounter AlreadyBeingCreatedException and log this:
2011-07-06 09:13:57,929 WARN  [Master:0;linux1.site:35977] util.FSUtils(715): Waited 60087ms for lease recovery on hdfs://localhost:41078/user/root/.logs/linux1.site,54949,1309914772108/linux1.site%3A54949.1309914772175:org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: failed to create file /user/root/.logs/linux1.site,54949,1309914772108/linux1.site%3A54949.1309914772175 for DFSClient_hb_m_linux1.site:35977_1309914773252 on client 127.0.0.1, because this file is already being created by DFSClient_hb_rs_linux1.site,54949,1309914772108_1309914772161 on 127.0.0.1
                at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLeaseInternal(FSNamesystem.java:1202)
                at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLease(FSNamesystem.java:1157)
                at org.apache.hadoop.hdfs.server.namenode.NameNode.recoverLease(NameNode.java:404)
                at sun.reflect.GeneratedMethodAccessor36.invoke(Unknown Source)
                at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
                at java.lang.reflect.Method.invoke(Method.java:597)
                at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508)
                at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:961)
                at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:957)
                at java.security.AccessController.doPrivileged(Native Method)
                at javax.security.auth.Subject.doAs(Subject.java:396)
                at org.apache.hadoop.ipc.Server$Handler.run(Server.java:955)

This log keeps repeating for about 14 minutes, and then the test fails.
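
For context, the repeating WARN above comes from the retry loop in FSUtils.recoverFileLease(). The following is only a simplified sketch of such a loop, not the actual 0.90 source (the exact recoverLease() signature also varies across branch-0.20-append revisions); it shows why the message repeats: the retries are unbounded, so as long as the previous writer still holds the lease, the loop spins until the test times out.

// Simplified sketch only; not the actual 0.90 FSUtils code.
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public final class LeaseRecoverySketch {
  public static void recoverFileLease(DistributedFileSystem dfs, Path p)
      throws IOException {
    long start = System.currentTimeMillis();
    boolean recovered = false;
    while (!recovered) {              // the while() that never ends
      try {
        dfs.recoverLease(p);          // NameNode may reject this with
        recovered = true;             // AlreadyBeingCreatedException while the
      } catch (IOException e) {       // old writer still holds the lease
        System.err.println("Waited " + (System.currentTimeMillis() - start)
            + "ms for lease recovery on " + p + ": " + e);
        // Unbounded retries: if the "dead" RS is actually alive and keeps
        // renewing its lease, recovery never succeeds and log splitting
        // spins here until the surrounding test times out.
        try {
          Thread.sleep(1000);         // pause before the next attempt
        } catch (InterruptedException ie) {
          Thread.currentThread().interrupt();
          throw new IOException("Interrupted waiting for lease recovery on " + p);
        }
      }
    }
  }
}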

The test may fail occasionally because the master's default waiting time is 4500ms: usually that is enough for all the RSs to check in, but sometimes it is not, and an RS that checks in late may interfere with the lease recovery.
This may be a bug, and it may be related to HBASE-4177.
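
To illustrate the timing, here is a rough sketch of the bounded settle-wait suggested by the ServerManager log lines above; the interval/timeout constants and the loop shape are assumptions, not the exact 0.90 code. Once the bound expires, any .logs folder whose owner has not checked in yet looks like a dead server's log, so the master starts splitting a log that the still-alive RS holds open, which is what makes the lease recovery above fail.

// Rough sketch of a bounded wait for region servers to check in; the
// interval/timeout values and method names are illustrative assumptions.
import java.util.Set;

public final class SettleWaitSketch {
  static final long INTERVAL_MS = 1500; // assumed poll interval
  static final long TIMEOUT_MS  = 4500; // matches "sleptFor=4500" in the log

  /** Waits until the set of checked-in servers stops changing or the bound expires. */
  static int waitForRegionServersToSettle(Set<String> onlineServers)
      throws InterruptedException {
    long slept = 0;
    int last = -1;
    int count = onlineServers.size();
    while (slept < TIMEOUT_MS && count != last) {
      last = count;
      Thread.sleep(INTERVAL_MS);
      slept += INTERVAL_MS;
      count = onlineServers.size();
    }
    // An RS that checks in after this point is not in onlineServers yet, so
    // its log folder "doesn't belong to a known region server" and gets split
    // even though the server is alive and still holds the HDFS lease.
    return count;
  }
}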


Zhou Shuaifeng(Frank)



RE: TestRollingRestart fail occasionally

Posted by Zhoushuaifeng <zh...@huawei.com>.
I ran this test on 0.90.4, and it also fails sometimes.
I ran it 212 times; it failed 5 times.

linux10:/opt/buildhbase/hbase-0.90.4 # grep "BUILD SUCCESS" TestRollingRestart1.log -c
207
linux10:/opt/buildhbase/hbase-0.90.4 # grep "BUILD F" TestRollingRestart1.log -c
5



Zhou Shuaifeng(Frank)

-----Original Message-----
From: Zhoushuaifeng [mailto:zhoushuaifeng@huawei.com] 
Sent: Friday, August 12, 2011 9:00 AM
To: dev@hbase.apache.org
Subject: RE: TestRollingRestart fail occasionally

OK, thanks. I will download the latest 0.90.4 and run more tests on it. If I encounter this problem again, I will share more info.

Zhou Shuaifeng(Frank)

-----Original Message-----
From: Ted Yu [mailto:yuzhihong@gmail.com] 
Sent: Friday, August 12, 2011 12:13 AM
To: dev@hbase.apache.org
Subject: Re: TestRollingRestart fail occasionally

I will give this higher priority if the test fails on Jenkins.

FYI

RE: TestRollingRestart fail occasionally

Posted by Zhoushuaifeng <zh...@huawei.com>.
OK, thanks. I will download the latest 0.90.4 and run more tests on it. If I encounter this problem again, I will share more info.

Zhou Shuaifeng(Frank)

-----Original Message-----
From: Ted Yu [mailto:yuzhihong@gmail.com] 
Sent: Friday, August 12, 2011 12:13 AM
To: dev@hbase.apache.org
Subject: Re: TestRollingRestart fail occasionally

I will give this higher priority if the test fails on Jenkins.

FYI

Re: TestRollingRestart fail occasionally

Posted by Ted Yu <yu...@gmail.com>.
I will give this higher priority if the test fails on Jenkins.

FYI

On Wed, Aug 10, 2011 at 7:01 PM, Zhoushuaifeng <zh...@huawei.com> wrote:

> Thanks Ted,
> It happens only occasionally; running the test many times may reproduce it once.
> Can you analyze the code and the log and check whether it's a real problem?
> Either way, I will keep testing.
>
> Zhou Shuaifeng(Frank)
>
>
> -----Original Message-----
> From: Ted Yu [mailto:yuzhihong@gmail.com]
> Sent: Thursday, August 11, 2011 9:53 AM
> To: dev@hbase.apache.org
> Subject: Re: TestRollingRestart fail occasionally
>
> 0.90.4 was just released. Can you run TestRollingRestart using the latest
> 0.90 branch?
>
> This test didn't fail on
> https://builds.apache.org/view/G-L/view/HBase/job/hbase-0.90/
>
> Cheers
>
> On Wed, Aug 10, 2011 at 6:21 PM, Zhoushuaifeng <zhoushuaifeng@huawei.com> wrote:
>
> > Has anyone else run this test and encountered the failure? And what's
> > your opinion?
> >
> > Zhou Shuaifeng(Frank)
> >

RE: TestRollingRestart fail occasionally

Posted by Zhoushuaifeng <zh...@huawei.com>.
Thanks Ted,
It happens only occasionally; running the test many times may reproduce it once.
Can you analyze the code and the log and check whether it's a real problem?
Either way, I will keep testing.

Zhou Shuaifeng(Frank)


-----Original Message-----
From: Ted Yu [mailto:yuzhihong@gmail.com] 
Sent: Thursday, August 11, 2011 9:53 AM
To: dev@hbase.apache.org
Subject: Re: TestRollingRestart fail occasionally

0.90.4 was just released. Can you run TestRollingRestart using the latest
0.90 branch?

This test didn't fail on
https://builds.apache.org/view/G-L/view/HBase/job/hbase-0.90/

Cheers

On Wed, Aug 10, 2011 at 6:21 PM, Zhoushuaifeng <zh...@huawei.com> wrote:

> Has anyone else run this test and encountered the failure? And what's
> your opinion?
>
> Zhou Shuaifeng(Frank)
>

Re: TestRollingRestart fail occasionally

Posted by Ted Yu <yu...@gmail.com>.
0.90.4 was just released. Can you run TestRollingRestart using the latest
0.90 branch?

This test didn't fail on
https://builds.apache.org/view/G-L/view/HBase/job/hbase-0.90/

Cheers

On Wed, Aug 10, 2011 at 6:21 PM, Zhoushuaifeng <zh...@huawei.com> wrote:

> Has anyone else run this test and encountered the failure? And what's
> your opinion?
>
> Zhou Shuaifeng(Frank)
>

RE: TestRollingRestart fail occasionally

Posted by Zhoushuaifeng <zh...@huawei.com>.
Has anyone else run this test and encountered the failure? And what's your opinion?

Zhou Shuaifeng(Frank)


RE: TestRollingRestart fail occasionally

Posted by Zhoushuaifeng <zh...@huawei.com>.
OK, I will update the hadoop jar and test more.

Zhou Shuaifeng(Frank)

-----Original Message-----
From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of Stack
Sent: Wednesday, August 17, 2011 12:02 AM
To: dev@hbase.apache.org
Subject: Re: TestRollingRestart fail occasionally

On Tue, Aug 16, 2011 at 4:49 AM, Zhoushuaifeng <zh...@huawei.com> wrote:
> My hadoop core is 0.20-append-r1056497.
> Should I file an issue and attach more logs for analysis?
>

I'd suggest that you build a hadoop jar from the tip of the
branch-0.20-append, i.e. r1069242.
St.Ack

Re: TestRollingRestart fail occasionally

Posted by Stack <st...@duboce.net>.
On Tue, Aug 16, 2011 at 4:49 AM, Zhoushuaifeng <zh...@huawei.com> wrote:
> My hadoop core is 0.20-append-r1056497.
> Should I file an issue and attach more logs for analysis?
>

I'd suggest that you build a hadoop jar from the tip of the
branch-0.20-append, i.e. r1069242.
St.Ack

RE: TestRollingRestart fail occasionally

Posted by Zhoushuaifeng <zh...@huawei.com>.
My hadoop core is 0.20-append-r1056497.
Should I file an issue and attach more logs for analysis?

Zhou Shuaifeng(Frank)


-----Original Message-----
From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of Stack
Sent: Tuesday, August 16, 2011 12:13 PM
To: dev@hbase.apache.org
Subject: Re: TestRollingRestart fail occasionally

If you update the hadoop that hbase ships with to the tip of the
branch-0.20-append, does it fail then?  The tip has hdfs-1554, whereas
what hbase ships with does not.

St.Ack


Re: TestRollingRestart fail occasionally

Posted by Stack <st...@duboce.net>.
If you update the hadoop that hbase ships with to the tip of the
branch-0.20-append, does it fail then?  The tip has hdfs-1554, whereas
what hbase ships with does not.

St.Ack
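
For reference, my understanding of hdfs-1554 is that it lets the caller of recoverLease() see whether the lease was actually recovered (a boolean return) instead of inferring it from exceptions. Assuming that signature, a caller could bound the wait instead of spinning; this is only a hedged sketch under that assumption, not HBase's actual fix:

// Hedged sketch: assumes the append-branch tip's recoverLease(Path) returns a
// boolean meaning "lease fully recovered" (my reading of hdfs-1554).
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public final class BoundedLeaseRecoverySketch {
  static boolean recoverLeaseWithBound(DistributedFileSystem dfs, Path p,
      long timeoutMs) throws IOException, InterruptedException {
    long deadline = System.currentTimeMillis() + timeoutMs;
    while (System.currentTimeMillis() < deadline) {
      if (dfs.recoverLease(p)) {   // true once the previous writer's lease is gone
        return true;
      }
      Thread.sleep(1000);          // lease not recovered yet, try again shortly
    }
    return false;                  // caller can give up instead of looping forever
  }
}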
