Posted to user@hbase.apache.org by Sean MacDonald <se...@opendns.com> on 2013/04/22 18:22:20 UTC

Snapshot Export Problem

Hello, 

I am using HBase 0.94.6 on CDH 4.2 and trying to export a snapshot to another cluster (also CDH 4.2), but this is failing repeatedly. The table I am trying to export is approximately 4TB in size and has 10GB regions. Each of the map jobs runs for about 6 minutes and appears to be running properly, but then fails with a message like the following:

2013-04-22 16:12:50,699 WARN org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /hbase/.archive/queries/533fcbb7858ef34b103a4f8804fa8719/d/651e974dafb64eefb9c49032aec4a35b File does not exist. Holder DFSClient_NONMAPREDUCE_-192704511_1 does not have any open files.
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2396)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2387)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2183)
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:481)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:297)
        at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:44080)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:453)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1002)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1695)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1691)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1689)

I was able to see the file that the LeaseExpiredException mentions on the destination cluster before the exception happened (it is gone afterwards).
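
In case it helps, the export is launched with the stock ExportSnapshot tool. A minimal sketch of the kind of invocation involved (the snapshot name, destination path, and mapper count below are placeholders, not the real values used here) is:

    hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot \
      -snapshot queries_snapshot \
      -copy-to hdfs://namenode-backup:8020/hbase \
      -mappers 16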

Any help that could be provided in resolving this would be greatly appreciated.

Thanks and have a great day,

Sean 



Re: Snapshot Export Problem

Posted by Matteo Bertozzi <th...@gmail.com>.
The chown is not the main problem; the export can go on even without
changing ownership.
I've filed HBASE-8455 to address the main problem, which is related to the
reference file link names.
Thanks, Sean, for the logs!

Matteo



On Mon, Apr 29, 2013 at 5:53 PM, Ted Yu <yu...@gmail.com> wrote:

> Looks like permission issue. Can you try running ExportSnapshot as user who
> has enough privilege ?
>
>
>    1. 2013-04-29 16:40:38,059 ERROR
>    org.apache.hadoop.hbase.snapshot.ExportSnapshot: Unable to set the
>    owner/group for
>
>  file=hdfs://namenode-backup:8020/users/sean/hbase_test/.archive/queries/991625ef6c2a3db259dc984c990e823d/d/29384f58e6964b1a9044590988a390d3
>    2. org.apache.hadoop.security.AccessControlException: Non-super user
>    cannot change owner.
>
>
> On Mon, Apr 29, 2013 at 9:50 AM, Sean MacDonald <se...@opendns.com> wrote:
>
> > Hi Matteo,
> >
> > I've posted the snapshot information here:
> >
> > http://pastebin.com/ZgDfH2pT
> >
> > and the stack trace here:
> >
> > http://pastebin.com/GBQT3zdd
> >
> > Thanks,
> >
> > Sean
> >
> >
> > On Friday, 26 April, 2013 at 2:16 PM, Matteo Bertozzi wrote:
> >
> > > Hey Sean,
> > >
> > > could you provide us the full stack trace of the FileNotFoundException
> > > Unable to open link
> > > and also the output of: hbase
> > org.apache.hadoop.hbase.snapshot.SnapshotInfo
> > > -files -stats -snapshot SNAPSHOT_NAME
> > > to give us a better idea of what is the state of the snapshot
> > >
> > > Thanks!
> > >
> > >
> > > On Fri, Apr 26, 2013 at 9:51 PM, Sean MacDonald <sean@opendns.com
> (mailto:
> > sean@opendns.com)> wrote:
> > >
> > > > Hi Jon,
> > > >
> > > > I've actually discovered another issue with snapshot export. If you
> > have a
> > > > region that has recently split and you take a snapshot of that table
> > and
> > > > try to export it while the children still have references to the
> files
> > in
> > > > the split parent, the files will not be transferred and will be
> > counted in
> > > > the missing total. You end with error messages like:
> > > >
> > > > java.io.FileNotFoundException: Unable to open link:
> > > > org.apache.hadoop.hbase.io.HLogLink
> > > >
> > > > Please let me know if you would like any additional information.
> > > >
> > > > Thanks and have a great day,
> > > >
> > > > Sean
> > > >
> > > >
> > > > On Wednesday, 24 April, 2013 at 9:19 AM, Sean MacDonald wrote:
> > > >
> > > > > Hi Jon,
> > > > >
> > > > > No problem. We do have snapshots enabled on the target cluster, and
> > we
> > > > are using the default hfile archiver settings on both clusters.
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Sean
> > > > >
> > > > >
> > > > > On Tuesday, 23 April, 2013 at 1:54 PM, Jonathan Hsieh wrote:
> > > > >
> > > > > > Sean,
> > > > > >
> > > > > > Thanks for finding this problem. Can you provide some more
> > information
> > > > so
> > > > > > that we can try to duplicate and fix this problem?
> > > > > >
> > > > > > Are snapshots on on the target cluster?
> > > > > > What are the hfile archiver settings in your hbase-site.xml on
> both
> > > > > > clusters?
> > > > > >
> > > > > > Thanks,
> > > > > > Jon.
> > > > > >
> > > > > >
> > > > > > On Mon, Apr 22, 2013 at 4:47 PM, Sean MacDonald <
> sean@opendns.com(mailto:
> > sean@opendns.com)(mailto:
> > > > sean@opendns.com (mailto:sean@opendns.com))> wrote:
> > > > > >
> > > > > > > It looks like you can't export a snapshot to a running cluster
> > or it
> > > > will
> > > > > > > start cleaning up files from the archive after a period of
> time.
> > I
> > > > > >
> > > > >
> > > >
> > > >
> > > > have
> > > > > > > turned off HBase on the destination cluster and the export is
> > > > > >
> > > > >
> > > >
> > > >
> > > > working as
> > > > > > > expected now.
> > > > > > >
> > > > > > > Sean
> > > > > > >
> > > > > > >
> > > > > > > On Monday, 22 April, 2013 at 9:22 AM, Sean MacDonald wrote:
> > > > > > >
> > > > > > > > Hello,
> > > > > > > >
> > > > > > > > I am using HBase 0.94.6 on CDH 4.2 and trying to export a
> > snapshot
> > > > to
> > > > > > > another cluster (also CDH 4.2), but this is failing repeatedly.
> > The
> > > > > >
> > > > >
> > > >
> > > >
> > > > table I
> > > > > > > am trying to export is approximately 4TB in size and has 10GB
> > > > > >
> > > > >
> > > >
> > > >
> > > > regions. Each
> > > > > > > of the map jobs runs for about 6 minutes and appears to be
> > running
> > > > > > > properly, but then fails with a message like the following:
> > > > > > > >
> > > > > > > > 2013-04-22 16:12:50,699 WARN
> org.apache.hadoop.hdfs.DFSClient:
> > > > > > > DataStreamer Exception
> > > > > >
> > > > >
> > > >
> > > >
> > > >
> >
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
> > > > > > > No lease on
> > > > > >
> > > > >
> > > >
> > > >
> > > >
> >
> /hbase/.archive/queries/533fcbb7858ef34b103a4f8804fa8719/d/651e974dafb64eefb9c49032aec4a35b
> > > > > > > File does not exist. Holder DFSClient_NONMAPREDUCE_-192704511_1
> > does
> > > > > >
> > > > >
> > > >
> > > >
> > > > not
> > > > > > > have any open files. at
> > > > > >
> > > > >
> > > >
> > > >
> > > >
> >
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2396)
> > > > > > > at
> > > > > >
> > > > >
> > > >
> > > >
> > > >
> >
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2387)
> > > > > > > at
> > > > > >
> > > > >
> > > >
> > > >
> > > >
> >
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2183)
> > > > > > > at
> > > > > >
> > > > >
> > > >
> > > >
> > > >
> >
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:481)
> > > > > > > at
> > > > > >
> > > > >
> > > >
> > > >
> > > >
> >
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:297)
> > > > > > > at
> > > > > >
> > > > >
> > > >
> > > >
> > > >
> >
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtoc
> > > > > > > ol
> > > > > > > >
> $2.callBlockingMethod(ClientNamenodeProtocolProtos.java:44080)
> > at
> > > > > > >
> > > > > >
> > > > >
> > > >
> > > >
> > > >
> >
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:453)
> > > > > > > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1002) at
> > > > > > > org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1695) at
> > > > > > > org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1691) at
> > > > > > > java.security.AccessController.doPrivileged(Native Method) at
> > > > > > > javax.security.auth.Subject.doAs(Subject.java:396) at
> > > > > >
> > > > >
> > > >
> > > >
> > > >
> >
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
> > > > > > > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1689)
> > > > > > > >
> > > > > > > > I was able to see the file that the LeaseExpiredException
> > mentions
> > > > on
> > > > > > > the destination cluster before the exception happened (it is
> gone
> > > > > > > afterwards).
> > > > > > > >
> > > > > > > > Any help that could be provided in resolving this would be
> > greatly
> > > > > > > appreciated.
> > > > > > > >
> > > > > > > > Thanks and have a great day,
> > > > > > > >
> > > > > > > > Sean
> > > > > >
> > > > > >
> > > > > > --
> > > > > > // Jonathan Hsieh (shay)
> > > > > > // Software Engineer, Cloudera
> > > > > > // jon@cloudera.com (mailto:jon@cloudera.com)
> > > > >
> > > >
> > >
> >
> >
> >
> >
>

Re: Snapshot Export Problem

Posted by Ted Yu <yu...@gmail.com>.
Looks like a permission issue. Can you try running ExportSnapshot as a user who
has sufficient privileges?


   1. 2013-04-29 16:40:38,059 ERROR
   org.apache.hadoop.hbase.snapshot.ExportSnapshot: Unable to set the
   owner/group for
   file=hdfs://namenode-backup:8020/users/sean/hbase_test/.archive/queries/991625ef6c2a3db259dc984c990e823d/d/29384f58e6964b1a9044590988a390d3
   2. org.apache.hadoop.security.AccessControlException: Non-super user
   cannot change owner.
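
For example, a minimal sketch (placeholder snapshot name, paths, and user; this assumes the clusters are not kerberized and that the user you run as is an HDFS superuser on the destination, so the chown is allowed):

    sudo -u hbase hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot \
      -snapshot queries_snapshot \
      -copy-to hdfs://namenode-backup:8020/hbase \
      -chuser hbase -chgroup hbase \
      -mappers 16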


On Mon, Apr 29, 2013 at 9:50 AM, Sean MacDonald <se...@opendns.com> wrote:

> Hi Matteo,
>
> I've posted the snapshot information here:
>
> http://pastebin.com/ZgDfH2pT
>
> and the stack trace here:
>
> http://pastebin.com/GBQT3zdd
>
> Thanks,
>
> Sean
>
>
> On Friday, 26 April, 2013 at 2:16 PM, Matteo Bertozzi wrote:
>
> > Hey Sean,
> >
> > could you provide us the full stack trace of the FileNotFoundException
> > Unable to open link
> > and also the output of: hbase
> org.apache.hadoop.hbase.snapshot.SnapshotInfo
> > -files -stats -snapshot SNAPSHOT_NAME
> > to give us a better idea of what is the state of the snapshot
> >
> > Thanks!
> >
> >
> > On Fri, Apr 26, 2013 at 9:51 PM, Sean MacDonald <sean@opendns.com(mailto:
> sean@opendns.com)> wrote:
> >
> > > Hi Jon,
> > >
> > > I've actually discovered another issue with snapshot export. If you
> have a
> > > region that has recently split and you take a snapshot of that table
> and
> > > try to export it while the children still have references to the files
> in
> > > the split parent, the files will not be transferred and will be
> counted in
> > > the missing total. You end with error messages like:
> > >
> > > java.io.FileNotFoundException: Unable to open link:
> > > org.apache.hadoop.hbase.io.HLogLink
> > >
> > > Please let me know if you would like any additional information.
> > >
> > > Thanks and have a great day,
> > >
> > > Sean
> > >
> > >
> > > On Wednesday, 24 April, 2013 at 9:19 AM, Sean MacDonald wrote:
> > >
> > > > Hi Jon,
> > > >
> > > > No problem. We do have snapshots enabled on the target cluster, and
> we
> > > are using the default hfile archiver settings on both clusters.
> > > >
> > > > Thanks,
> > > >
> > > > Sean
> > > >
> > > >
> > > > On Tuesday, 23 April, 2013 at 1:54 PM, Jonathan Hsieh wrote:
> > > >
> > > > > Sean,
> > > > >
> > > > > Thanks for finding this problem. Can you provide some more
> information
> > > so
> > > > > that we can try to duplicate and fix this problem?
> > > > >
> > > > > Are snapshots on on the target cluster?
> > > > > What are the hfile archiver settings in your hbase-site.xml on both
> > > > > clusters?
> > > > >
> > > > > Thanks,
> > > > > Jon.
> > > > >
> > > > >
> > > > > On Mon, Apr 22, 2013 at 4:47 PM, Sean MacDonald <sean@opendns.com(mailto:
> sean@opendns.com)(mailto:
> > > sean@opendns.com (mailto:sean@opendns.com))> wrote:
> > > > >
> > > > > > It looks like you can't export a snapshot to a running cluster
> or it
> > > will
> > > > > > start cleaning up files from the archive after a period of time.
> I
> > > > >
> > > >
> > >
> > >
> > > have
> > > > > > turned off HBase on the destination cluster and the export is
> > > > >
> > > >
> > >
> > >
> > > working as
> > > > > > expected now.
> > > > > >
> > > > > > Sean
> > > > > >
> > > > > >
> > > > > > On Monday, 22 April, 2013 at 9:22 AM, Sean MacDonald wrote:
> > > > > >
> > > > > > > Hello,
> > > > > > >
> > > > > > > I am using HBase 0.94.6 on CDH 4.2 and trying to export a
> snapshot
> > > to
> > > > > > another cluster (also CDH 4.2), but this is failing repeatedly.
> The
> > > > >
> > > >
> > >
> > >
> > > table I
> > > > > > am trying to export is approximately 4TB in size and has 10GB
> > > > >
> > > >
> > >
> > >
> > > regions. Each
> > > > > > of the map jobs runs for about 6 minutes and appears to be
> running
> > > > > > properly, but then fails with a message like the following:
> > > > > > >
> > > > > > > 2013-04-22 16:12:50,699 WARN org.apache.hadoop.hdfs.DFSClient:
> > > > > > DataStreamer Exception
> > > > >
> > > >
> > >
> > >
> > >
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
> > > > > > No lease on
> > > > >
> > > >
> > >
> > >
> > >
> /hbase/.archive/queries/533fcbb7858ef34b103a4f8804fa8719/d/651e974dafb64eefb9c49032aec4a35b
> > > > > > File does not exist. Holder DFSClient_NONMAPREDUCE_-192704511_1
> does
> > > > >
> > > >
> > >
> > >
> > > not
> > > > > > have any open files. at
> > > > >
> > > >
> > >
> > >
> > >
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2396)
> > > > > > at
> > > > >
> > > >
> > >
> > >
> > >
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2387)
> > > > > > at
> > > > >
> > > >
> > >
> > >
> > >
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2183)
> > > > > > at
> > > > >
> > > >
> > >
> > >
> > >
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:481)
> > > > > > at
> > > > >
> > > >
> > >
> > >
> > >
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:297)
> > > > > > at
> > > > >
> > > >
> > >
> > >
> > >
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtoc
> > > > > > ol
> > > > > > > $2.callBlockingMethod(ClientNamenodeProtocolProtos.java:44080)
> at
> > > > > >
> > > > >
> > > >
> > >
> > >
> > >
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:453)
> > > > > > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1002) at
> > > > > > org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1695) at
> > > > > > org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1691) at
> > > > > > java.security.AccessController.doPrivileged(Native Method) at
> > > > > > javax.security.auth.Subject.doAs(Subject.java:396) at
> > > > >
> > > >
> > >
> > >
> > >
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
> > > > > > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1689)
> > > > > > >
> > > > > > > I was able to see the file that the LeaseExpiredException
> mentions
> > > on
> > > > > > the destination cluster before the exception happened (it is gone
> > > > > > afterwards).
> > > > > > >
> > > > > > > Any help that could be provided in resolving this would be
> greatly
> > > > > > appreciated.
> > > > > > >
> > > > > > > Thanks and have a great day,
> > > > > > >
> > > > > > > Sean
> > > > >
> > > > >
> > > > > --
> > > > > // Jonathan Hsieh (shay)
> > > > > // Software Engineer, Cloudera
> > > > > // jon@cloudera.com (mailto:jon@cloudera.com)
> > > >
> > >
> >
>
>
>
>

Re: Snapshot Export Problem

Posted by Sean MacDonald <se...@opendns.com>.
Hi Matteo, 

I've posted the snapshot information here:

http://pastebin.com/ZgDfH2pT

and the stack trace here:

http://pastebin.com/GBQT3zdd

Thanks,

Sean 


On Friday, 26 April, 2013 at 2:16 PM, Matteo Bertozzi wrote:

> Hey Sean,
> 
> could you provide us the full stack trace of the FileNotFoundException
> Unable to open link
> and also the output of: hbase org.apache.hadoop.hbase.snapshot.SnapshotInfo
> -files -stats -snapshot SNAPSHOT_NAME
> to give us a better idea of what is the state of the snapshot
> 
> Thanks!
> 
> 
> On Fri, Apr 26, 2013 at 9:51 PM, Sean MacDonald <sean@opendns.com (mailto:sean@opendns.com)> wrote:
> 
> > Hi Jon,
> > 
> > I've actually discovered another issue with snapshot export. If you have a
> > region that has recently split and you take a snapshot of that table and
> > try to export it while the children still have references to the files in
> > the split parent, the files will not be transferred and will be counted in
> > the missing total. You end with error messages like:
> > 
> > java.io.FileNotFoundException: Unable to open link:
> > org.apache.hadoop.hbase.io.HLogLink
> > 
> > Please let me know if you would like any additional information.
> > 
> > Thanks and have a great day,
> > 
> > Sean
> > 
> > 
> > On Wednesday, 24 April, 2013 at 9:19 AM, Sean MacDonald wrote:
> > 
> > > Hi Jon,
> > > 
> > > No problem. We do have snapshots enabled on the target cluster, and we
> > are using the default hfile archiver settings on both clusters.
> > > 
> > > Thanks,
> > > 
> > > Sean
> > > 
> > > 
> > > On Tuesday, 23 April, 2013 at 1:54 PM, Jonathan Hsieh wrote:
> > > 
> > > > Sean,
> > > > 
> > > > Thanks for finding this problem. Can you provide some more information
> > so
> > > > that we can try to duplicate and fix this problem?
> > > > 
> > > > Are snapshots on on the target cluster?
> > > > What are the hfile archiver settings in your hbase-site.xml on both
> > > > clusters?
> > > > 
> > > > Thanks,
> > > > Jon.
> > > > 
> > > > 
> > > > On Mon, Apr 22, 2013 at 4:47 PM, Sean MacDonald <sean@opendns.com (mailto:sean@opendns.com)(mailto:
> > sean@opendns.com (mailto:sean@opendns.com))> wrote:
> > > > 
> > > > > It looks like you can't export a snapshot to a running cluster or it
> > will
> > > > > start cleaning up files from the archive after a period of time. I
> > > > 
> > > 
> > 
> > 
> > have
> > > > > turned off HBase on the destination cluster and the export is
> > > > 
> > > 
> > 
> > 
> > working as
> > > > > expected now.
> > > > > 
> > > > > Sean
> > > > > 
> > > > > 
> > > > > On Monday, 22 April, 2013 at 9:22 AM, Sean MacDonald wrote:
> > > > > 
> > > > > > Hello,
> > > > > > 
> > > > > > I am using HBase 0.94.6 on CDH 4.2 and trying to export a snapshot
> > to
> > > > > another cluster (also CDH 4.2), but this is failing repeatedly. The
> > > > 
> > > 
> > 
> > 
> > table I
> > > > > am trying to export is approximately 4TB in size and has 10GB
> > > > 
> > > 
> > 
> > 
> > regions. Each
> > > > > of the map jobs runs for about 6 minutes and appears to be running
> > > > > properly, but then fails with a message like the following:
> > > > > > 
> > > > > > 2013-04-22 16:12:50,699 WARN org.apache.hadoop.hdfs.DFSClient:
> > > > > DataStreamer Exception
> > > > 
> > > 
> > 
> > 
> > org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
> > > > > No lease on
> > > > 
> > > 
> > 
> > 
> > /hbase/.archive/queries/533fcbb7858ef34b103a4f8804fa8719/d/651e974dafb64eefb9c49032aec4a35b
> > > > > File does not exist. Holder DFSClient_NONMAPREDUCE_-192704511_1 does
> > > > 
> > > 
> > 
> > 
> > not
> > > > > have any open files. at
> > > > 
> > > 
> > 
> > 
> > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2396)
> > > > > at
> > > > 
> > > 
> > 
> > 
> > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2387)
> > > > > at
> > > > 
> > > 
> > 
> > 
> > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2183)
> > > > > at
> > > > 
> > > 
> > 
> > 
> > org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:481)
> > > > > at
> > > > 
> > > 
> > 
> > 
> > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:297)
> > > > > at
> > > > 
> > > 
> > 
> > 
> > org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtoc
> > > > > ol
> > > > > > $2.callBlockingMethod(ClientNamenodeProtocolProtos.java:44080) at
> > > > > 
> > > > 
> > > 
> > 
> > 
> > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:453)
> > > > > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1002) at
> > > > > org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1695) at
> > > > > org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1691) at
> > > > > java.security.AccessController.doPrivileged(Native Method) at
> > > > > javax.security.auth.Subject.doAs(Subject.java:396) at
> > > > 
> > > 
> > 
> > 
> > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
> > > > > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1689)
> > > > > > 
> > > > > > I was able to see the file that the LeaseExpiredException mentions
> > on
> > > > > the destination cluster before the exception happened (it is gone
> > > > > afterwards).
> > > > > > 
> > > > > > Any help that could be provided in resolving this would be greatly
> > > > > appreciated.
> > > > > > 
> > > > > > Thanks and have a great day,
> > > > > > 
> > > > > > Sean
> > > > 
> > > > 
> > > > --
> > > > // Jonathan Hsieh (shay)
> > > > // Software Engineer, Cloudera
> > > > // jon@cloudera.com (mailto:jon@cloudera.com)
> > > 
> > 
> 




Re: Snapshot Export Problem

Posted by Matteo Bertozzi <th...@gmail.com>.
Hey Sean,

Could you provide us the full stack trace of the FileNotFoundException
("Unable to open link"), and also the output of:

hbase org.apache.hadoop.hbase.snapshot.SnapshotInfo -files -stats -snapshot SNAPSHOT_NAME

to give us a better idea of the state of the snapshot?

Thanks!


On Fri, Apr 26, 2013 at 9:51 PM, Sean MacDonald <se...@opendns.com> wrote:

> Hi Jon,
>
> I've actually discovered another issue with snapshot export. If you have a
> region that has recently split and you take a snapshot of that table and
> try to export it while the children still have references to the files in
> the split parent, the files will not be transferred and will be counted in
> the missing total. You end with error messages like:
>
> java.io.FileNotFoundException: Unable to open link:
> org.apache.hadoop.hbase.io.HLogLink
>
> Please let me know if you would like any additional information.
>
> Thanks and have a great day,
>
> Sean
>
>
> On Wednesday, 24 April, 2013 at 9:19 AM, Sean MacDonald wrote:
>
> > Hi Jon,
> >
> > No problem. We do have snapshots enabled on the target cluster, and we
> are using the default hfile archiver settings on both clusters.
> >
> > Thanks,
> >
> > Sean
> >
> >
> > On Tuesday, 23 April, 2013 at 1:54 PM, Jonathan Hsieh wrote:
> >
> > > Sean,
> > >
> > > Thanks for finding this problem. Can you provide some more information
> so
> > > that we can try to duplicate and fix this problem?
> > >
> > > Are snapshots on on the target cluster?
> > > What are the hfile archiver settings in your hbase-site.xml on both
> > > clusters?
> > >
> > > Thanks,
> > > Jon.
> > >
> > >
> > > On Mon, Apr 22, 2013 at 4:47 PM, Sean MacDonald <sean@opendns.com(mailto:
> sean@opendns.com)> wrote:
> > >
> > > > It looks like you can't export a snapshot to a running cluster or it
> will
> > > > start cleaning up files from the archive after a period of time. I
> have
> > > > turned off HBase on the destination cluster and the export is
> working as
> > > > expected now.
> > > >
> > > > Sean
> > > >
> > > >
> > > > On Monday, 22 April, 2013 at 9:22 AM, Sean MacDonald wrote:
> > > >
> > > > > Hello,
> > > > >
> > > > > I am using HBase 0.94.6 on CDH 4.2 and trying to export a snapshot
> to
> > > > another cluster (also CDH 4.2), but this is failing repeatedly. The
> table I
> > > > am trying to export is approximately 4TB in size and has 10GB
> regions. Each
> > > > of the map jobs runs for about 6 minutes and appears to be running
> > > > properly, but then fails with a message like the following:
> > > > >
> > > > > 2013-04-22 16:12:50,699 WARN org.apache.hadoop.hdfs.DFSClient:
> > > > DataStreamer Exception
> > > >
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
> > > > No lease on
> > > >
> /hbase/.archive/queries/533fcbb7858ef34b103a4f8804fa8719/d/651e974dafb64eefb9c49032aec4a35b
> > > > File does not exist. Holder DFSClient_NONMAPREDUCE_-192704511_1 does
> not
> > > > have any open files. at
> > > >
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2396)
> > > > at
> > > >
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2387)
> > > > at
> > > >
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2183)
> > > > at
> > > >
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:481)
> > > > at
> > > >
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:297)
> > > > at
> > > >
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtoc
> > > > ol
> > > > > $2.callBlockingMethod(ClientNamenodeProtocolProtos.java:44080) at
> > > >
> > > >
> > > >
> > > >
> > > >
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:453)
> > > > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1002) at
> > > > org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1695) at
> > > > org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1691) at
> > > > java.security.AccessController.doPrivileged(Native Method) at
> > > > javax.security.auth.Subject.doAs(Subject.java:396) at
> > > >
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
> > > > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1689)
> > > > >
> > > > > I was able to see the file that the LeaseExpiredException mentions
> on
> > > > the destination cluster before the exception happened (it is gone
> > > > afterwards).
> > > > >
> > > > > Any help that could be provided in resolving this would be greatly
> > > > appreciated.
> > > > >
> > > > > Thanks and have a great day,
> > > > >
> > > > > Sean
> > >
> > >
> > > --
> > > // Jonathan Hsieh (shay)
> > > // Software Engineer, Cloudera
> > > // jon@cloudera.com (mailto:jon@cloudera.com)
> >
>
>
>
>

Re: Snapshot Export Problem

Posted by Sean MacDonald <se...@opendns.com>.
Hi Jon, 

I've actually discovered another issue with snapshot export. If you have a region that has recently split, take a snapshot of that table, and try to export it while the children still hold references to the files in the split parent, those files will not be transferred and will be counted in the missing total. You end up with error messages like:

java.io.FileNotFoundException: Unable to open link: org.apache.hadoop.hbase.io.HLogLink
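
As a quick check (a sketch, assuming the default 0.94.6 layout where completed snapshots live under /hbase/.snapshot; the snapshot name is a placeholder), you can list the snapshot's reference files on the source cluster to see which ones still point back at the split parent's files:

    hadoop fs -ls -R /hbase/.snapshot/SNAPSHOT_NAME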

Please let me know if you would like any additional information.

Thanks and have a great day,

Sean 


On Wednesday, 24 April, 2013 at 9:19 AM, Sean MacDonald wrote:

> Hi Jon, 
> 
> No problem. We do have snapshots enabled on the target cluster, and we are using the default hfile archiver settings on both clusters.
> 
> Thanks,
> 
> Sean 
> 
> 
> On Tuesday, 23 April, 2013 at 1:54 PM, Jonathan Hsieh wrote:
> 
> > Sean,
> > 
> > Thanks for finding this problem. Can you provide some more information so
> > that we can try to duplicate and fix this problem?
> > 
> > Are snapshots on on the target cluster?
> > What are the hfile archiver settings in your hbase-site.xml on both
> > clusters?
> > 
> > Thanks,
> > Jon.
> > 
> > 
> > On Mon, Apr 22, 2013 at 4:47 PM, Sean MacDonald <sean@opendns.com (mailto:sean@opendns.com)> wrote:
> > 
> > > It looks like you can't export a snapshot to a running cluster or it will
> > > start cleaning up files from the archive after a period of time. I have
> > > turned off HBase on the destination cluster and the export is working as
> > > expected now.
> > > 
> > > Sean
> > > 
> > > 
> > > On Monday, 22 April, 2013 at 9:22 AM, Sean MacDonald wrote:
> > > 
> > > > Hello,
> > > > 
> > > > I am using HBase 0.94.6 on CDH 4.2 and trying to export a snapshot to
> > > another cluster (also CDH 4.2), but this is failing repeatedly. The table I
> > > am trying to export is approximately 4TB in size and has 10GB regions. Each
> > > of the map jobs runs for about 6 minutes and appears to be running
> > > properly, but then fails with a message like the following:
> > > > 
> > > > 2013-04-22 16:12:50,699 WARN org.apache.hadoop.hdfs.DFSClient:
> > > DataStreamer Exception
> > > org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
> > > No lease on
> > > /hbase/.archive/queries/533fcbb7858ef34b103a4f8804fa8719/d/651e974dafb64eefb9c49032aec4a35b
> > > File does not exist. Holder DFSClient_NONMAPREDUCE_-192704511_1 does not
> > > have any open files. at
> > > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2396)
> > > at
> > > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2387)
> > > at
> > > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2183)
> > > at
> > > org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:481)
> > > at
> > > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:297)
> > > at
> > > org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtoc
> > > ol
> > > > $2.callBlockingMethod(ClientNamenodeProtocolProtos.java:44080) at
> > > 
> > > 
> > > 
> > > 
> > > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:453)
> > > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1002) at
> > > org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1695) at
> > > org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1691) at
> > > java.security.AccessController.doPrivileged(Native Method) at
> > > javax.security.auth.Subject.doAs(Subject.java:396) at
> > > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
> > > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1689)
> > > > 
> > > > I was able to see the file that the LeaseExpiredException mentions on
> > > the destination cluster before the exception happened (it is gone
> > > afterwards).
> > > > 
> > > > Any help that could be provided in resolving this would be greatly
> > > appreciated.
> > > > 
> > > > Thanks and have a great day,
> > > > 
> > > > Sean
> > 
> > 
> > -- 
> > // Jonathan Hsieh (shay)
> > // Software Engineer, Cloudera
> > // jon@cloudera.com (mailto:jon@cloudera.com)
> 




Re: Snapshot Export Problem

Posted by Sean MacDonald <se...@opendns.com>.
Hi Jon, 

No problem. We do have snapshots enabled on the target cluster, and we are using the default hfile archiver settings on both clusters.

Thanks,

Sean 


On Tuesday, 23 April, 2013 at 1:54 PM, Jonathan Hsieh wrote:

> Sean,
> 
> Thanks for finding this problem. Can you provide some more information so
> that we can try to duplicate and fix this problem?
> 
> Are snapshots on on the target cluster?
> What are the hfile archiver settings in your hbase-site.xml on both
> clusters?
> 
> Thanks,
> Jon.
> 
> 
> On Mon, Apr 22, 2013 at 4:47 PM, Sean MacDonald <sean@opendns.com (mailto:sean@opendns.com)> wrote:
> 
> > It looks like you can't export a snapshot to a running cluster or it will
> > start cleaning up files from the archive after a period of time. I have
> > turned off HBase on the destination cluster and the export is working as
> > expected now.
> > 
> > Sean
> > 
> > 
> > On Monday, 22 April, 2013 at 9:22 AM, Sean MacDonald wrote:
> > 
> > > Hello,
> > > 
> > > I am using HBase 0.94.6 on CDH 4.2 and trying to export a snapshot to
> > another cluster (also CDH 4.2), but this is failing repeatedly. The table I
> > am trying to export is approximately 4TB in size and has 10GB regions. Each
> > of the map jobs runs for about 6 minutes and appears to be running
> > properly, but then fails with a message like the following:
> > > 
> > > 2013-04-22 16:12:50,699 WARN org.apache.hadoop.hdfs.DFSClient:
> > DataStreamer Exception
> > org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
> > No lease on
> > /hbase/.archive/queries/533fcbb7858ef34b103a4f8804fa8719/d/651e974dafb64eefb9c49032aec4a35b
> > File does not exist. Holder DFSClient_NONMAPREDUCE_-192704511_1 does not
> > have any open files. at
> > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2396)
> > at
> > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2387)
> > at
> > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2183)
> > at
> > org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:481)
> > at
> > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:297)
> > at
> > org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtoc
> > ol
> > > $2.callBlockingMethod(ClientNamenodeProtocolProtos.java:44080) at
> > 
> > 
> > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:453)
> > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1002) at
> > org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1695) at
> > org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1691) at
> > java.security.AccessController.doPrivileged(Native Method) at
> > javax.security.auth.Subject.doAs(Subject.java:396) at
> > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
> > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1689)
> > > 
> > > I was able to see the file that the LeaseExpiredException mentions on
> > the destination cluster before the exception happened (it is gone
> > afterwards).
> > > 
> > > Any help that could be provided in resolving this would be greatly
> > appreciated.
> > > 
> > > Thanks and have a great day,
> > > 
> > > Sean
> 
> 
> -- 
> // Jonathan Hsieh (shay)
> // Software Engineer, Cloudera
> // jon@cloudera.com (mailto:jon@cloudera.com)




Re: Snapshot Export Problem

Posted by Jonathan Hsieh <jo...@cloudera.com>.
Sean,

Thanks for finding this problem.  Can you provide some more information so
that we can try to duplicate and fix this problem?

Are snapshots enabled on the target cluster?
What are the hfile archiver settings in your hbase-site.xml on both
clusters?
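
For reference, a quick way to pull those out is to grep each cluster's config file; the path below is the usual CDH location, so adjust as needed:

    grep -B1 -A2 -E "archive|cleaner" /etc/hbase/conf/hbase-site.xml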

Thanks,
Jon.


On Mon, Apr 22, 2013 at 4:47 PM, Sean MacDonald <se...@opendns.com> wrote:

> It looks like you can't export a snapshot to a running cluster or it will
> start cleaning up files from the archive after a period of time. I have
> turned off HBase on the destination cluster and the export is working as
> expected now.
>
> Sean
>
>
> On Monday, 22 April, 2013 at 9:22 AM, Sean MacDonald wrote:
>
> > Hello,
> >
> > I am using HBase 0.94.6 on CDH 4.2 and trying to export a snapshot to
> another cluster (also CDH 4.2), but this is failing repeatedly. The table I
> am trying to export is approximately 4TB in size and has 10GB regions. Each
> of the map jobs runs for about 6 minutes and appears to be running
> properly, but then fails with a message like the following:
> >
> > 2013-04-22 16:12:50,699 WARN org.apache.hadoop.hdfs.DFSClient:
> DataStreamer Exception
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
> No lease on
> /hbase/.archive/queries/533fcbb7858ef34b103a4f8804fa8719/d/651e974dafb64eefb9c49032aec4a35b
> File does not exist. Holder DFSClient_NONMAPREDUCE_-192704511_1 does not
> have any open files. at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2396)
> at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2387)
> at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2183)
> at
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:481)
> at
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:297)
> at
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtoc
> ol
> > $2.callBlockingMethod(ClientNamenodeProtocolProtos.java:44080) at
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:453)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1002) at
> org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1695) at
> org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1691) at
> java.security.AccessController.doPrivileged(Native Method) at
> javax.security.auth.Subject.doAs(Subject.java:396) at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1689)
> >
> > I was able to see the file that the LeaseExpiredException mentions on
> the destination cluster before the exception happened (it is gone
> afterwards).
> >
> > Any help that could be provided in resolving this would be greatly
> appreciated.
> >
> > Thanks and have a great day,
> >
> > Sean
>
>
>


-- 
// Jonathan Hsieh (shay)
// Software Engineer, Cloudera
// jon@cloudera.com

Re: help why do my regionservers shut themselves down?

Posted by Kevin O'dell <ke...@cloudera.com>.
Hi Kaveh,

  How large is the heap you are using? Also, what GC settings do you
have in place? Your main issue looks to be here:

2013-04-22 16:47:21,843 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer:
ABORTING region server serverName=d1r1n17.prod.plutoz.com,60020,1366657358443,
load=(requests=5392, regions=196, usedHeap=1063, maxHeap=3966):
regionserver:60020-0x13dd980d2ab8661-0x13dd980d2ab8661
regionserver:60020-0x13dd980d2ab8661-0x13dd980d2ab8661 received expired from ZooKeeper, aborting
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired
        at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:352)
        at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:270)
        at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:523)
        at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:499)

The interesting part comes afterwards:

2013-04-22 16:47:21,900 WARN org.apache.hadoop.hbase.regionserver.wal.HLog:
Too many consecutive RollWriter requests, it's a sign of the total number
of live datanodes is lower than the tolerable replicas.

  Are you also seeing your Datanodes drop off the network or become dead
nodes?  My thought is this could be networking issues, oversubscribed
nodes, or GC issues.
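
If GC logging isn't enabled yet, a minimal sketch of how to get those logs (add to hbase-env.sh on each regionserver and restart it; the log path is a placeholder and locations differ on CDH-managed installs):

    # standard HotSpot GC logging flags; writes one GC log per regionserver JVM
    export HBASE_OPTS="$HBASE_OPTS -verbose:gc -XX:+PrintGCDetails \
      -XX:+PrintGCDateStamps -Xloggc:/var/log/hbase/gc-regionserver.log"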



On Tue, Apr 23, 2013 at 12:47 AM, kaveh minooie <ka...@plutoz.com> wrote:

> thanks everyone for responding.
>
> No I don't have the GC logs. I don't even know how i can get that. but it
> seems that the regionserver did recovere from that and then gets into
> trouble here:
>
>
> 2013-04-22 16:47:56,830 INFO org.apache.hadoop.hbase.**regionserver.HRegion:
> compaction interrupted by user:
> java.io.**InterruptedIOException: Aborting compaction of store f in
> region t1_webpage,com.pandora.www:**http/shaggy,1366670139658.**
> 9f565d5da3468c0725e590dc232abc**23. because user requested stop.
>
> the part that I don't understand is what it means when it say "compaction
> interrupted by user"!
>
> and to answer your question ted, I am using 0.90.6 over hadoop 1.1.1 ( i
> can't upgrade since gora so far only works with .90.x ) and no everything
> was normal as far as I could say the map jobs were staggering since, i
> assume, the hbase became unresponsive  ( the web interface start showing
> exception and that is how i figured out that that regionserver was down) ,
> while i was restarting this one ( through the status command in shell ) i
> noticed that two more regionserver went down ( with identicall error , the
> second one, not the one about GC pause ) but once I restarted the
> regionservers (using hbase-daemon.sh)  everything went back to normal.  but
> this keeps happening and as a result i can't left my jobs unsupervised.
>
> thanks,
>
>
> On 04/22/2013 07:35 PM, Ted Yu wrote:
>
>> Kaveh:
>> What version of HBase are you using ?
>> Around 2013-04-22 16:47:56, did you observe anything else happening in
>> your
>> cluster ? See below:
>>
>> 2013-04-22 16:47:56,830 INFO org.apache.hadoop.hbase.****
>> regionserver.HRegion:
>> compaction interrupted by user:
>> java.io.****InterruptedIOException: Aborting compaction of store f in
>> region
>> t1_webpage,com.pandora.www:****http/shaggy,1366670139658.****9f565d5
>> da3468c0725e590dc232abc**23. because user requested stop.
>>          at org.apache.hadoop.hbase.****regionserver.Store.compact(****
>> Store.
>> java:998)
>>          at org.apache.hadoop.hbase.****regionserver.Store.compact(****
>> Store.
>> java:779)
>>          at org.apache.hadoop.hbase.****regionserver.HRegion.****
>> compactStores(
>> HRegion.java:**776)
>>
>> On Mon, Apr 22, 2013 at 6:46 PM, Jean-Marc Spaggiari <
>> jean-marc@spaggiari.org> wrote:
>>
>>  Hi Kaveh,
>>>
>>> the respons is maybe already displayed on the logs you sent ;)
>>>
>>> "This disconnect could have been caused by a network partition or a
>>> long-running GC pause, either way it's recommended that you verify
>>> your environment."
>>>
>>> Do you have GC logs? Have you tried anything to solve that?
>>>
>>> JM
>>>
>>> 2013/4/22 kaveh minooie <ka...@plutoz.com>:
>>>
>>>> Hi
>>>>
>>>> after a few mapreduce jobs my regionservers shut themselves down. this
>>>> is
>>>> the latest time that this has happened:
>>>>
>>>> 2013-04-22 16:47:21,843 INFO
>>>>
>>>>  org.apache.hadoop.hbase.**client.HConnectionManager$**
>>> HConnectionImplementation:
>>>
>>>> This client just lost it's session with ZooKeeper, trying to reconnect.
>>>> 2013-04-22 16:47:21,843 FATAL
>>>> org.apache.hadoop.hbase.**regionserver.HRegionServer: ABORTING region
>>>>
>>> server
>>>
>>>> serverName=d1r1n17.prod.**plutoz.com <http://d1r1n17.prod.plutoz.com>
>>>> ,60020,**1366657358443, load=(requests=5
>>>> 392, regions=196, usedHeap=1063, maxHeap=3966):
>>>> regionserver:60020-**0x13dd980d2ab8661-**0x13dd980d2ab8661
>>>> regionserver:60020-**0x13dd980d2ab8661-**0x13dd980d2ab8661 received
>>>> expired
>>>>
>>> fr
>>>
>>>> om ZooKeeper, aborting
>>>> org.apache.zookeeper.**KeeperException$**SessionExpiredException:
>>>> KeeperErrorCode = Session expired
>>>>          at
>>>>
>>>>  org.apache.hadoop.hbase.**zookeeper.ZooKeeperWatcher.**
>>> connectionEvent(**ZooKeeperWatcher.java:352)
>>>
>>>>          at
>>>>
>>>>  org.apache.hadoop.hbase.**zookeeper.ZooKeeperWatcher.**
>>> process(ZooKeeperWatcher.java:**270)
>>>
>>>>          at
>>>>
>>>>  org.apache.zookeeper.**ClientCnxn$EventThread.**
>>> processEvent(ClientCnxn.java:**523)
>>>
>>>>          at
>>>> org.apache.zookeeper.**ClientCnxn$EventThread.run(**
>>>> ClientCnxn.java:499)
>>>> 2013-04-22 16:47:21,843 INFO
>>>>
>>>>  org.apache.hadoop.hbase.**client.HConnectionManager$**
>>> HConnectionImplementation:
>>>
>>>> Trying to reconnect to zookeeper.
>>>> 2013-04-22 16:47:21,844 INFO
>>>> org.apache.hadoop.hbase.**regionserver.HRegionServer: Dump of metrics:
>>>> requests=1794, regions=196, stores=1561, storefiles=1585,
>>>> storefileIndexSize=104, memstoreSize=306, compactionQueueSize=10,
>>>> flushQueueSize=0, usedHeap=1073, maxHeap=3966, blockCacheSize=661986032,
>>>> blockCacheFree=169901776, blockCacheCount=7242,
>>>>
>>> blockCacheHitCount=910925,
>>>
>>>> blockCacheMissCount=1558134, blockCacheEvictedCount=**1344753,
>>>> blockCacheHitRatio=36, blockCacheHitCachingRatio=40
>>>> 2013-04-22 16:47:21,844 INFO
>>>> org.apache.hadoop.hbase.**regionserver.HRegionServer: STOPPED:
>>>> regionserver:60020-**0x13dd980d2ab8661-**0x13dd980d2ab8661
>>>> regionserver:60020-**0x13dd980d2ab8661-**0x13dd980d2ab8661 received
>>>> expired
>>>>
>>> from
>>>
>>>> ZooKeeper, aborting
>>>> 2013-04-22 16:47:21,844 INFO org.apache.zookeeper.**ClientCnxn:
>>>> EventThread
>>>> shut down
>>>> 2013-04-22 16:47:21,900 WARN
>>>>
>>> org.apache.hadoop.hbase.**regionserver.wal.HLog:
>>>
>>>> Too many consecutive RollWriter requests, it's a sign of the total
>>>>
>>> number of
>>>
>>>> live datanodes is lower than the tolerable replicas.
>>>> 2013-04-22 16:47:22,341 INFO org.apache.zookeeper.**ZooKeeper:
>>>> Initiating
>>>> client connection, connectString=zk1:2181 sessionTimeout=180000
>>>> watcher=hconnection
>>>> 2013-04-22 16:47:22,357 INFO
>>>> org.apache.hadoop.hbase.**regionserver.HRegionServer: Waiting on 1
>>>> regions
>>>>
>>> to
>>>
>>>> close
>>>> 2013-04-22 16:47:22,394 INFO org.apache.zookeeper.**ClientCnxn: Opening
>>>>
>>> socket
>>>
>>>> connection to server d1r2n2.prod.plutoz.com/10.0.0.**66:2181<http://d1r2n2.prod.plutoz.com/10.0.0.66:2181>.
>>>> Will not
>>>>
>>> attempt
>>>
>>>> to authenticate using SASL (unknown error)
>>>> 2013-04-22 16:47:22,395 INFO org.apache.zookeeper.**ClientCnxn: Socket
>>>> connection established to d1r2n2.prod.plutoz.com/10.0.0.**66:2181<http://d1r2n2.prod.plutoz.com/10.0.0.66:2181>
>>>> ,
>>>>
>>> initiating
>>>
>>>> session
>>>> 2013-04-22 16:47:22,397 INFO org.apache.zookeeper.**ClientCnxn: Session
>>>> establishment complete on server d1r2n2.prod.plutoz.com/10.0.0.**
>>>> 66:2181 <http://d1r2n2.prod.plutoz.com/10.0.0.66:2181>,
>>>> sessionid = 0x13dd980d2abbf93, negotiated timeout = 40000
>>>> 2013-04-22 16:47:22,400 INFO
>>>>
>>>>  org.apache.hadoop.hbase.**client.HConnectionManager$**
>>> HConnectionImplementation:
>>>
>>>> Reconnected successfully. This disconnect could have been caused by a
>>>> network partition or a long-running GC pause, either way it's
>>>> recommended
>>>> that you verify your environment.
>>>> 2013-04-22 16:47:22,400 INFO org.apache.zookeeper.**ClientCnxn:
>>>> EventThread
>>>> shut down
>>>> 2013-04-22 16:47:56,830 INFO
>>>>
>>> org.apache.hadoop.hbase.**regionserver.HRegion:
>>>
>>>> compaction interrupted by user:
>>>> java.io.**InterruptedIOException: Aborting compaction of store f in
>>>> region
>>>>
>>>>  t1_webpage,com.pandora.www:**http/shaggy,1366670139658.**
>>> 9f565d5da3468c0725e590dc232abc**23.
>>>
>>>> because user requested stop.
>>>>          at
>>>> org.apache.hadoop.hbase.**regionserver.Store.compact(**Store.java:998)
>>>>          at
>>>> org.apache.hadoop.hbase.**regionserver.Store.compact(**Store.java:779)
>>>>          at
>>>>
>>>>  org.apache.hadoop.hbase.**regionserver.HRegion.**
>>> compactStores(HRegion.java:**776)
>>>
>>>>          at
>>>>
>>>>  org.apache.hadoop.hbase.**regionserver.HRegion.**
>>> compactStores(HRegion.java:**721)
>>>
>>>>          at
>>>>
>>>>  org.apache.hadoop.hbase.**regionserver.**CompactSplitThread.run(**
>>> CompactSplitThread.java:81)
>>>
>>>> 2013-04-22 16:47:56,830 INFO
>>>>
>>> org.apache.hadoop.hbase.**regionserver.HRegion:
>>>
>>>> aborted compaction on region
>>>>
>>>>  t1_webpage,com.pandora.www:**http/shaggy,1366670139658.**
>>> 9f565d5da3468c0725e590dc232abc**23.
>>>
>>>> after 5mins, 58sec
>>>> 2013-04-22 16:47:56,830 INFO
>>>> org.apache.hadoop.hbase.**regionserver.**CompactSplitThread:
>>>> regionserver60020.compactor exiting
>>>> 2013-04-22 16:47:56,832 INFO
>>>>
>>> org.apache.hadoop.hbase.**regionserver.HRegion:
>>>
>>>> Closed
>>>>
>>>>  t1_webpage,com.pandora.www:**http/shaggy,1366670139658.**
>>> 9f565d5da3468c0725e590dc232abc**23.
>>>
>>>> 2013-04-22 16:47:57,363 INFO
>>>>
>>> org.apache.hadoop.hbase.**regionserver.wal.HLog:
>>>
>>>> regionserver60020.logSyncer exiting
>>>> 2013-04-22 16:47:57,366 INFO org.apache.hadoop.hbase.**
>>>> regionserver.Leases:
>>>> regionserver60020 closing leases
>>>> 2013-04-22 16:47:57,366 INFO org.apache.hadoop.hbase.**
>>>> regionserver.Leases:
>>>> regionserver60020 closed leases
>>>> 2013-04-22 16:47:57,366 INFO
>>>> org.apache.hadoop.hbase.**regionserver.HRegionServer: regionserver60020
>>>> exiting
>>>> 2013-04-22 16:47:57,497 INFO
>>>> org.apache.hadoop.hbase.**regionserver.ShutdownHook: Shutdown hook
>>>>
>>> starting;
>>>
>>>> hbase.shutdown.hook=true; fsShutdownHook=Thread[Thread-**15,5,main]
>>>> 2013-04-22 16:47:57,497 INFO
>>>> org.apache.hadoop.hbase.**regionserver.HRegionServer: STOPPED: Shutdown
>>>>
>>> hook
>>>
>>>> 2013-04-22 16:47:57,497 INFO
>>>> org.apache.hadoop.hbase.**regionserver.ShutdownHook: Starting fs
>>>> shutdown
>>>>
>>> hook
>>>
>>>> thread.
>>>> 2013-04-22 16:47:57,504 INFO org.apache.hadoop.hbase.**
>>>> regionserver.Leases:
>>>> regionserver60020.leaseChecker closing leases
>>>> 2013-04-22 16:47:57,504 INFO org.apache.hadoop.hbase.**
>>>> regionserver.Leases:
>>>> regionserver60020.leaseChecker closed leases
>>>> 2013-04-22 16:47:57,598 INFO
>>>> org.apache.hadoop.hbase.**regionserver.ShutdownHook: Shutdown hook
>>>>
>>> finished.
>>>
>>>> I would appreciate it very much if someone could explain to me what just
>>>> happened here.
>>>>
>>>> thanks,
>>>>
>>>
>


-- 
Kevin O'Dell
Systems Engineer, Cloudera

Re: help why do my regionservers shut themselves down?

Posted by kaveh minooie <ka...@plutoz.com>.
thanks everyone for responding.

No, I don't have the GC logs; I don't even know how to get them. But 
it seems that the regionserver did recover from that and then gets into 
trouble here:

2013-04-22 16:47:56,830 INFO 
org.apache.hadoop.hbase.regionserver.HRegion: compaction interrupted by 
user:
java.io.InterruptedIOException: Aborting compaction of store f in region 
t1_webpage,com.pandora.www:http/shaggy,1366670139658.9f565d5da3468c0725e590dc232abc23. 
because user requested stop.

The part that I don't understand is what it means when it says 
"compaction interrupted by user"!

And to answer your question, Ted: I am using 0.90.6 over Hadoop 1.1.1 (I 
can't upgrade since Gora so far only works with 0.90.x), and no, everything 
was normal as far as I could tell. The map jobs were staggering since, I 
assume, HBase had become unresponsive (the web interface started showing 
exceptions, and that is how I figured out that that regionserver was down). 
While I was restarting this one (through the status command in the shell), I 
noticed that two more regionservers went down (with an identical error, the 
second one, not the one about the GC pause), but once I restarted the 
regionservers (using hbase-daemon.sh) everything went back to normal. But 
this keeps happening, and as a result I can't leave my jobs unsupervised.

thanks,

On 04/22/2013 07:35 PM, Ted Yu wrote:
> Kaveh:
> What version of HBase are you using ?
> Around 2013-04-22 16:47:56, did you observe anything else happening in your
> cluster ? See below:
>
> 2013-04-22 16:47:56,830 INFO org.apache.hadoop.hbase.**regionserver.HRegion:
> compaction interrupted by user:
> java.io.**InterruptedIOException: Aborting compaction of store f in region
> t1_webpage,com.pandora.www:**http/shaggy,1366670139658.**9f565d5
> da3468c0725e590dc232abc**23. because user requested stop.
>          at org.apache.hadoop.hbase.**regionserver.Store.compact(**Store.
> java:998)
>          at org.apache.hadoop.hbase.**regionserver.Store.compact(**Store.
> java:779)
>          at org.apache.hadoop.hbase.**regionserver.HRegion.**compactStores(
> HRegion.java:**776)
>
> On Mon, Apr 22, 2013 at 6:46 PM, Jean-Marc Spaggiari <
> jean-marc@spaggiari.org> wrote:
>
>> Hi Kaveh,
>>
>> the respons is maybe already displayed on the logs you sent ;)
>>
>> "This disconnect could have been caused by a network partition or a
>> long-running GC pause, either way it's recommended that you verify
>> your environment."
>>
>> Do you have GC logs? Have you tried anything to solve that?
>>
>> JM
>>
>> 2013/4/22 kaveh minooie <ka...@plutoz.com>:
>>> Hi
>>>
>>> after a few mapreduce jobs my regionservers shut themselves down. this is
>>> the latest time that this has happened:
>>>
>>> 2013-04-22 16:47:21,843 INFO
>>>
>> org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation:
>>> This client just lost it's session with ZooKeeper, trying to reconnect.
>>> 2013-04-22 16:47:21,843 FATAL
>>> org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region
>> server
>>> serverName=d1r1n17.prod.plutoz.com,60020,1366657358443, load=(requests=5
>>> 392, regions=196, usedHeap=1063, maxHeap=3966):
>>> regionserver:60020-0x13dd980d2ab8661-0x13dd980d2ab8661
>>> regionserver:60020-0x13dd980d2ab8661-0x13dd980d2ab8661 received expired
>> fr
>>> om ZooKeeper, aborting
>>> org.apache.zookeeper.KeeperException$SessionExpiredException:
>>> KeeperErrorCode = Session expired
>>>          at
>>>
>> org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:352)
>>>          at
>>>
>> org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:270)
>>>          at
>>>
>> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:523)
>>>          at
>>> org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:499)
>>> 2013-04-22 16:47:21,843 INFO
>>>
>> org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation:
>>> Trying to reconnect to zookeeper.
>>> 2013-04-22 16:47:21,844 INFO
>>> org.apache.hadoop.hbase.regionserver.HRegionServer: Dump of metrics:
>>> requests=1794, regions=196, stores=1561, storefiles=1585,
>>> storefileIndexSize=104, memstoreSize=306, compactionQueueSize=10,
>>> flushQueueSize=0, usedHeap=1073, maxHeap=3966, blockCacheSize=661986032,
>>> blockCacheFree=169901776, blockCacheCount=7242,
>> blockCacheHitCount=910925,
>>> blockCacheMissCount=1558134, blockCacheEvictedCount=1344753,
>>> blockCacheHitRatio=36, blockCacheHitCachingRatio=40
>>> 2013-04-22 16:47:21,844 INFO
>>> org.apache.hadoop.hbase.regionserver.HRegionServer: STOPPED:
>>> regionserver:60020-0x13dd980d2ab8661-0x13dd980d2ab8661
>>> regionserver:60020-0x13dd980d2ab8661-0x13dd980d2ab8661 received expired
>> from
>>> ZooKeeper, aborting
>>> 2013-04-22 16:47:21,844 INFO org.apache.zookeeper.ClientCnxn: EventThread
>>> shut down
>>> 2013-04-22 16:47:21,900 WARN
>> org.apache.hadoop.hbase.regionserver.wal.HLog:
>>> Too many consecutive RollWriter requests, it's a sign of the total
>> number of
>>> live datanodes is lower than the tolerable replicas.
>>> 2013-04-22 16:47:22,341 INFO org.apache.zookeeper.ZooKeeper: Initiating
>>> client connection, connectString=zk1:2181 sessionTimeout=180000
>>> watcher=hconnection
>>> 2013-04-22 16:47:22,357 INFO
>>> org.apache.hadoop.hbase.regionserver.HRegionServer: Waiting on 1 regions
>> to
>>> close
>>> 2013-04-22 16:47:22,394 INFO org.apache.zookeeper.ClientCnxn: Opening
>> socket
>>> connection to server d1r2n2.prod.plutoz.com/10.0.0.66:2181. Will not
>> attempt
>>> to authenticate using SASL (unknown error)
>>> 2013-04-22 16:47:22,395 INFO org.apache.zookeeper.ClientCnxn: Socket
>>> connection established to d1r2n2.prod.plutoz.com/10.0.0.66:2181,
>> initiating
>>> session
>>> 2013-04-22 16:47:22,397 INFO org.apache.zookeeper.ClientCnxn: Session
>>> establishment complete on server d1r2n2.prod.plutoz.com/10.0.0.66:2181,
>>> sessionid = 0x13dd980d2abbf93, negotiated timeout = 40000
>>> 2013-04-22 16:47:22,400 INFO
>>>
>> org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation:
>>> Reconnected successfully. This disconnect could have been caused by a
>>> network partition or a long-running GC pause, either way it's recommended
>>> that you verify your environment.
>>> 2013-04-22 16:47:22,400 INFO org.apache.zookeeper.ClientCnxn: EventThread
>>> shut down
>>> 2013-04-22 16:47:56,830 INFO
>> org.apache.hadoop.hbase.regionserver.HRegion:
>>> compaction interrupted by user:
>>> java.io.InterruptedIOException: Aborting compaction of store f in region
>>>
>> t1_webpage,com.pandora.www:http/shaggy,1366670139658.9f565d5da3468c0725e590dc232abc23.
>>> because user requested stop.
>>>          at
>>> org.apache.hadoop.hbase.regionserver.Store.compact(Store.java:998)
>>>          at
>>> org.apache.hadoop.hbase.regionserver.Store.compact(Store.java:779)
>>>          at
>>>
>> org.apache.hadoop.hbase.regionserver.HRegion.compactStores(HRegion.java:776)
>>>          at
>>>
>> org.apache.hadoop.hbase.regionserver.HRegion.compactStores(HRegion.java:721)
>>>          at
>>>
>> org.apache.hadoop.hbase.regionserver.CompactSplitThread.run(CompactSplitThread.java:81)
>>> 2013-04-22 16:47:56,830 INFO
>> org.apache.hadoop.hbase.regionserver.HRegion:
>>> aborted compaction on region
>>>
>> t1_webpage,com.pandora.www:http/shaggy,1366670139658.9f565d5da3468c0725e590dc232abc23.
>>> after 5mins, 58sec
>>> 2013-04-22 16:47:56,830 INFO
>>> org.apache.hadoop.hbase.regionserver.CompactSplitThread:
>>> regionserver60020.compactor exiting
>>> 2013-04-22 16:47:56,832 INFO
>> org.apache.hadoop.hbase.regionserver.HRegion:
>>> Closed
>>>
>> t1_webpage,com.pandora.www:http/shaggy,1366670139658.9f565d5da3468c0725e590dc232abc23.
>>> 2013-04-22 16:47:57,363 INFO
>> org.apache.hadoop.hbase.regionserver.wal.HLog:
>>> regionserver60020.logSyncer exiting
>>> 2013-04-22 16:47:57,366 INFO org.apache.hadoop.hbase.regionserver.Leases:
>>> regionserver60020 closing leases
>>> 2013-04-22 16:47:57,366 INFO org.apache.hadoop.hbase.regionserver.Leases:
>>> regionserver60020 closed leases
>>> 2013-04-22 16:47:57,366 INFO
>>> org.apache.hadoop.hbase.regionserver.HRegionServer: regionserver60020
>>> exiting
>>> 2013-04-22 16:47:57,497 INFO
>>> org.apache.hadoop.hbase.regionserver.ShutdownHook: Shutdown hook
>> starting;
>>> hbase.shutdown.hook=true; fsShutdownHook=Thread[Thread-15,5,main]
>>> 2013-04-22 16:47:57,497 INFO
>>> org.apache.hadoop.hbase.regionserver.HRegionServer: STOPPED: Shutdown
>> hook
>>> 2013-04-22 16:47:57,497 INFO
>>> org.apache.hadoop.hbase.regionserver.ShutdownHook: Starting fs shutdown
>> hook
>>> thread.
>>> 2013-04-22 16:47:57,504 INFO org.apache.hadoop.hbase.regionserver.Leases:
>>> regionserver60020.leaseChecker closing leases
>>> 2013-04-22 16:47:57,504 INFO org.apache.hadoop.hbase.regionserver.Leases:
>>> regionserver60020.leaseChecker closed leases
>>> 2013-04-22 16:47:57,598 INFO
>>> org.apache.hadoop.hbase.regionserver.ShutdownHook: Shutdown hook
>> finished.
>>> I would appreciate it very much if someone could explain to me what just
>>> happened here.
>>>
>>> thanks,


Re: help why do my regionservers shut themselves down?

Posted by Ted Yu <yu...@gmail.com>.
Kaveh:
What version of HBase are you using ?
Around 2013-04-22 16:47:56, did you observe anything else happening in your
cluster ? See below:

2013-04-22 16:47:56,830 INFO org.apache.hadoop.hbase.regionserver.HRegion:
compaction interrupted by user:
java.io.InterruptedIOException: Aborting compaction of store f in region
t1_webpage,com.pandora.www:http/shaggy,1366670139658.9f565d5da3468c0725e590dc232abc23.
because user requested stop.
        at org.apache.hadoop.hbase.regionserver.Store.compact(Store.java:998)
        at org.apache.hadoop.hbase.regionserver.Store.compact(Store.java:779)
        at org.apache.hadoop.hbase.regionserver.HRegion.compactStores(HRegion.java:776)
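
(If you're not sure which HBase version is actually deployed, a quick check --
assuming the hbase script is on your path -- is:

  $ hbase version

or the version string shown at the top of the region server web UI.)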

On Mon, Apr 22, 2013 at 6:46 PM, Jean-Marc Spaggiari <
jean-marc@spaggiari.org> wrote:

> Hi Kaveh,
>
> the response may already be in the logs you sent ;)
>
> "This disconnect could have been caused by a network partition or a
> long-running GC pause, either way it's recommended that you verify
> your environment."
>
> Do you have GC logs? Have you tried anything to solve that?
>
> JM
>
> 2013/4/22 kaveh minooie <ka...@plutoz.com>:
> >
> > Hi
> >
> > after a few mapreduce jobs my regionservers shut themselves down. this is
> > the latest time that this has happened:
> >
> > 2013-04-22 16:47:21,843 INFO
> >
> org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation:
> > This client just lost it's session with ZooKeeper, trying to reconnect.
> > 2013-04-22 16:47:21,843 FATAL
> > org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region
> server
> > serverName=d1r1n17.prod.plutoz.com,60020,1366657358443, load=(requests=5
> > 392, regions=196, usedHeap=1063, maxHeap=3966):
> > regionserver:60020-0x13dd980d2ab8661-0x13dd980d2ab8661
> > regionserver:60020-0x13dd980d2ab8661-0x13dd980d2ab8661 received expired
> fr
> > om ZooKeeper, aborting
> > org.apache.zookeeper.KeeperException$SessionExpiredException:
> > KeeperErrorCode = Session expired
> >         at
> >
> org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:352)
> >         at
> >
> org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:270)
> >         at
> >
> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:523)
> >         at
> > org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:499)
> > 2013-04-22 16:47:21,843 INFO
> >
> org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation:
> > Trying to reconnect to zookeeper.
> > 2013-04-22 16:47:21,844 INFO
> > org.apache.hadoop.hbase.regionserver.HRegionServer: Dump of metrics:
> > requests=1794, regions=196, stores=1561, storefiles=1585,
> > storefileIndexSize=104, memstoreSize=306, compactionQueueSize=10,
> > flushQueueSize=0, usedHeap=1073, maxHeap=3966, blockCacheSize=661986032,
> > blockCacheFree=169901776, blockCacheCount=7242,
> blockCacheHitCount=910925,
> > blockCacheMissCount=1558134, blockCacheEvictedCount=1344753,
> > blockCacheHitRatio=36, blockCacheHitCachingRatio=40
> > 2013-04-22 16:47:21,844 INFO
> > org.apache.hadoop.hbase.regionserver.HRegionServer: STOPPED:
> > regionserver:60020-0x13dd980d2ab8661-0x13dd980d2ab8661
> > regionserver:60020-0x13dd980d2ab8661-0x13dd980d2ab8661 received expired
> from
> > ZooKeeper, aborting
> > 2013-04-22 16:47:21,844 INFO org.apache.zookeeper.ClientCnxn: EventThread
> > shut down
> > 2013-04-22 16:47:21,900 WARN
> org.apache.hadoop.hbase.regionserver.wal.HLog:
> > Too many consecutive RollWriter requests, it's a sign of the total
> number of
> > live datanodes is lower than the tolerable replicas.
> > 2013-04-22 16:47:22,341 INFO org.apache.zookeeper.ZooKeeper: Initiating
> > client connection, connectString=zk1:2181 sessionTimeout=180000
> > watcher=hconnection
> > 2013-04-22 16:47:22,357 INFO
> > org.apache.hadoop.hbase.regionserver.HRegionServer: Waiting on 1 regions
> to
> > close
> > 2013-04-22 16:47:22,394 INFO org.apache.zookeeper.ClientCnxn: Opening
> socket
> > connection to server d1r2n2.prod.plutoz.com/10.0.0.66:2181. Will not
> attempt
> > to authenticate using SASL (unknown error)
> > 2013-04-22 16:47:22,395 INFO org.apache.zookeeper.ClientCnxn: Socket
> > connection established to d1r2n2.prod.plutoz.com/10.0.0.66:2181,
> initiating
> > session
> > 2013-04-22 16:47:22,397 INFO org.apache.zookeeper.ClientCnxn: Session
> > establishment complete on server d1r2n2.prod.plutoz.com/10.0.0.66:2181,
> > sessionid = 0x13dd980d2abbf93, negotiated timeout = 40000
> > 2013-04-22 16:47:22,400 INFO
> >
> org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation:
> > Reconnected successfully. This disconnect could have been caused by a
> > network partition or a long-running GC pause, either way it's recommended
> > that you verify your environment.
> > 2013-04-22 16:47:22,400 INFO org.apache.zookeeper.ClientCnxn: EventThread
> > shut down
> > 2013-04-22 16:47:56,830 INFO
> org.apache.hadoop.hbase.regionserver.HRegion:
> > compaction interrupted by user:
> > java.io.InterruptedIOException: Aborting compaction of store f in region
> >
> t1_webpage,com.pandora.www:http/shaggy,1366670139658.9f565d5da3468c0725e590dc232abc23.
> > because user requested stop.
> >         at
> > org.apache.hadoop.hbase.regionserver.Store.compact(Store.java:998)
> >         at
> > org.apache.hadoop.hbase.regionserver.Store.compact(Store.java:779)
> >         at
> >
> org.apache.hadoop.hbase.regionserver.HRegion.compactStores(HRegion.java:776)
> >         at
> >
> org.apache.hadoop.hbase.regionserver.HRegion.compactStores(HRegion.java:721)
> >         at
> >
> org.apache.hadoop.hbase.regionserver.CompactSplitThread.run(CompactSplitThread.java:81)
> > 2013-04-22 16:47:56,830 INFO
> org.apache.hadoop.hbase.regionserver.HRegion:
> > aborted compaction on region
> >
> t1_webpage,com.pandora.www:http/shaggy,1366670139658.9f565d5da3468c0725e590dc232abc23.
> > after 5mins, 58sec
> > 2013-04-22 16:47:56,830 INFO
> > org.apache.hadoop.hbase.regionserver.CompactSplitThread:
> > regionserver60020.compactor exiting
> > 2013-04-22 16:47:56,832 INFO
> org.apache.hadoop.hbase.regionserver.HRegion:
> > Closed
> >
> t1_webpage,com.pandora.www:http/shaggy,1366670139658.9f565d5da3468c0725e590dc232abc23.
> > 2013-04-22 16:47:57,363 INFO
> org.apache.hadoop.hbase.regionserver.wal.HLog:
> > regionserver60020.logSyncer exiting
> > 2013-04-22 16:47:57,366 INFO org.apache.hadoop.hbase.regionserver.Leases:
> > regionserver60020 closing leases
> > 2013-04-22 16:47:57,366 INFO org.apache.hadoop.hbase.regionserver.Leases:
> > regionserver60020 closed leases
> > 2013-04-22 16:47:57,366 INFO
> > org.apache.hadoop.hbase.regionserver.HRegionServer: regionserver60020
> > exiting
> > 2013-04-22 16:47:57,497 INFO
> > org.apache.hadoop.hbase.regionserver.ShutdownHook: Shutdown hook
> starting;
> > hbase.shutdown.hook=true; fsShutdownHook=Thread[Thread-15,5,main]
> > 2013-04-22 16:47:57,497 INFO
> > org.apache.hadoop.hbase.regionserver.HRegionServer: STOPPED: Shutdown
> hook
> > 2013-04-22 16:47:57,497 INFO
> > org.apache.hadoop.hbase.regionserver.ShutdownHook: Starting fs shutdown
> hook
> > thread.
> > 2013-04-22 16:47:57,504 INFO org.apache.hadoop.hbase.regionserver.Leases:
> > regionserver60020.leaseChecker closing leases
> > 2013-04-22 16:47:57,504 INFO org.apache.hadoop.hbase.regionserver.Leases:
> > regionserver60020.leaseChecker closed leases
> > 2013-04-22 16:47:57,598 INFO
> > org.apache.hadoop.hbase.regionserver.ShutdownHook: Shutdown hook
> finished.
> >
> > I would appreciate it very much if someone could explain to me what just
> > happened here.
> >
> > thanks,
>

Re: help why do my regionservers shut themselves down?

Posted by Jean-Marc Spaggiari <je...@spaggiari.org>.
Hi Kaveh,

the response may already be in the logs you sent ;)

"This disconnect could have been caused by a network partition or a
long-running GC pause, either way it's recommended that you verify
your environment."

Do you have GC logs? Have you tried anything to solve that?
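
If you don't have them yet, one way to turn GC logging on (the flags below are
just an example for a HotSpot JVM; adjust the log path to your install) is to
add something like this to hbase-env.sh on the region servers:

  export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS \
    -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
    -Xloggc:/var/log/hbase/regionserver-gc.log"

Also worth noting: your client asked for sessionTimeout=180000 but the
negotiated timeout came back as 40000, which usually means the ZooKeeper
servers are capping it (maxSessionTimeout defaults to 20 x tickTime). If long
GC pauses turn out to be the cause, you can raise that cap on your ZooKeeper
quorum and set zookeeper.session.timeout in hbase-site.xml to match, for
example (values here are illustrative only):

  # zoo.cfg
  tickTime=2000
  maxSessionTimeout=120000

  <!-- hbase-site.xml -->
  <property>
    <name>zookeeper.session.timeout</name>
    <value>120000</value>
  </property>

A bigger timeout only buys headroom, though; the real fix is to get the pauses
under control.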

JM

2013/4/22 kaveh minooie <ka...@plutoz.com>:
>
> Hi
>
> after a few mapreduce jobs my regionservers shut themselves down. this is
> the latest time that this has happened:
>
> 2013-04-22 16:47:21,843 INFO
> org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation:
> This client just lost it's session with ZooKeeper, trying to reconnect.
> 2013-04-22 16:47:21,843 FATAL
> org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server
> serverName=d1r1n17.prod.plutoz.com,60020,1366657358443, load=(requests=5
> 392, regions=196, usedHeap=1063, maxHeap=3966):
> regionserver:60020-0x13dd980d2ab8661-0x13dd980d2ab8661
> regionserver:60020-0x13dd980d2ab8661-0x13dd980d2ab8661 received expired fr
> om ZooKeeper, aborting
> org.apache.zookeeper.KeeperException$SessionExpiredException:
> KeeperErrorCode = Session expired
>         at
> org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:352)
>         at
> org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:270)
>         at
> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:523)
>         at
> org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:499)
> 2013-04-22 16:47:21,843 INFO
> org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation:
> Trying to reconnect to zookeeper.
> 2013-04-22 16:47:21,844 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: Dump of metrics:
> requests=1794, regions=196, stores=1561, storefiles=1585,
> storefileIndexSize=104, memstoreSize=306, compactionQueueSize=10,
> flushQueueSize=0, usedHeap=1073, maxHeap=3966, blockCacheSize=661986032,
> blockCacheFree=169901776, blockCacheCount=7242, blockCacheHitCount=910925,
> blockCacheMissCount=1558134, blockCacheEvictedCount=1344753,
> blockCacheHitRatio=36, blockCacheHitCachingRatio=40
> 2013-04-22 16:47:21,844 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: STOPPED:
> regionserver:60020-0x13dd980d2ab8661-0x13dd980d2ab8661
> regionserver:60020-0x13dd980d2ab8661-0x13dd980d2ab8661 received expired from
> ZooKeeper, aborting
> 2013-04-22 16:47:21,844 INFO org.apache.zookeeper.ClientCnxn: EventThread
> shut down
> 2013-04-22 16:47:21,900 WARN org.apache.hadoop.hbase.regionserver.wal.HLog:
> Too many consecutive RollWriter requests, it's a sign of the total number of
> live datanodes is lower than the tolerable replicas.
> 2013-04-22 16:47:22,341 INFO org.apache.zookeeper.ZooKeeper: Initiating
> client connection, connectString=zk1:2181 sessionTimeout=180000
> watcher=hconnection
> 2013-04-22 16:47:22,357 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: Waiting on 1 regions to
> close
> 2013-04-22 16:47:22,394 INFO org.apache.zookeeper.ClientCnxn: Opening socket
> connection to server d1r2n2.prod.plutoz.com/10.0.0.66:2181. Will not attempt
> to authenticate using SASL (unknown error)
> 2013-04-22 16:47:22,395 INFO org.apache.zookeeper.ClientCnxn: Socket
> connection established to d1r2n2.prod.plutoz.com/10.0.0.66:2181, initiating
> session
> 2013-04-22 16:47:22,397 INFO org.apache.zookeeper.ClientCnxn: Session
> establishment complete on server d1r2n2.prod.plutoz.com/10.0.0.66:2181,
> sessionid = 0x13dd980d2abbf93, negotiated timeout = 40000
> 2013-04-22 16:47:22,400 INFO
> org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation:
> Reconnected successfully. This disconnect could have been caused by a
> network partition or a long-running GC pause, either way it's recommended
> that you verify your environment.
> 2013-04-22 16:47:22,400 INFO org.apache.zookeeper.ClientCnxn: EventThread
> shut down
> 2013-04-22 16:47:56,830 INFO org.apache.hadoop.hbase.regionserver.HRegion:
> compaction interrupted by user:
> java.io.InterruptedIOException: Aborting compaction of store f in region
> t1_webpage,com.pandora.www:http/shaggy,1366670139658.9f565d5da3468c0725e590dc232abc23.
> because user requested stop.
>         at
> org.apache.hadoop.hbase.regionserver.Store.compact(Store.java:998)
>         at
> org.apache.hadoop.hbase.regionserver.Store.compact(Store.java:779)
>         at
> org.apache.hadoop.hbase.regionserver.HRegion.compactStores(HRegion.java:776)
>         at
> org.apache.hadoop.hbase.regionserver.HRegion.compactStores(HRegion.java:721)
>         at
> org.apache.hadoop.hbase.regionserver.CompactSplitThread.run(CompactSplitThread.java:81)
> 2013-04-22 16:47:56,830 INFO org.apache.hadoop.hbase.regionserver.HRegion:
> aborted compaction on region
> t1_webpage,com.pandora.www:http/shaggy,1366670139658.9f565d5da3468c0725e590dc232abc23.
> after 5mins, 58sec
> 2013-04-22 16:47:56,830 INFO
> org.apache.hadoop.hbase.regionserver.CompactSplitThread:
> regionserver60020.compactor exiting
> 2013-04-22 16:47:56,832 INFO org.apache.hadoop.hbase.regionserver.HRegion:
> Closed
> t1_webpage,com.pandora.www:http/shaggy,1366670139658.9f565d5da3468c0725e590dc232abc23.
> 2013-04-22 16:47:57,363 INFO org.apache.hadoop.hbase.regionserver.wal.HLog:
> regionserver60020.logSyncer exiting
> 2013-04-22 16:47:57,366 INFO org.apache.hadoop.hbase.regionserver.Leases:
> regionserver60020 closing leases
> 2013-04-22 16:47:57,366 INFO org.apache.hadoop.hbase.regionserver.Leases:
> regionserver60020 closed leases
> 2013-04-22 16:47:57,366 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: regionserver60020
> exiting
> 2013-04-22 16:47:57,497 INFO
> org.apache.hadoop.hbase.regionserver.ShutdownHook: Shutdown hook starting;
> hbase.shutdown.hook=true; fsShutdownHook=Thread[Thread-15,5,main]
> 2013-04-22 16:47:57,497 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: STOPPED: Shutdown hook
> 2013-04-22 16:47:57,497 INFO
> org.apache.hadoop.hbase.regionserver.ShutdownHook: Starting fs shutdown hook
> thread.
> 2013-04-22 16:47:57,504 INFO org.apache.hadoop.hbase.regionserver.Leases:
> regionserver60020.leaseChecker closing leases
> 2013-04-22 16:47:57,504 INFO org.apache.hadoop.hbase.regionserver.Leases:
> regionserver60020.leaseChecker closed leases
> 2013-04-22 16:47:57,598 INFO
> org.apache.hadoop.hbase.regionserver.ShutdownHook: Shutdown hook finished.
>
> I would appreciate it very much if someone could explain to me what just
> happened here.
>
> thanks,

Re: help why do my regionservers shut themselves down?

Posted by Leonid Fedotov <lf...@hortonworks.com>.
This could be a reason as well:
2013-04-22 16:47:21,900 WARN org.apache.hadoop.hbase.regionserver.wal.HLog: Too many consecutive RollWriter requests, it's a sign of the total number of live datanodes is lower than the tolerable replicas.
Make sure your cluster is in good health...
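
A quick way to check that, for example (run as a user with HDFS access; on
older Hadoop releases the same commands are available as "hadoop dfsadmin" and
"hadoop fsck"):

  $ hdfs dfsadmin -report | head -n 20
  $ hdfs fsck /hbase

If the number of live datanodes reported there is lower than your replication
factor, the log roller will keep printing exactly that warning.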


Thank you!

Sincerely,
Leonid Fedotov
On Apr 22, 2013, at 6:25 PM, kaveh minooie wrote:

> 
> Hi
> 
> after a few mapreduce jobs my regionservers shut themselves down. this is the latest time that this has happened:
> 
> 2013-04-22 16:47:21,843 INFO org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation: This client just lost it's session with ZooKeeper, trying to reconnect.
> 2013-04-22 16:47:21,843 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server serverName=d1r1n17.prod.plutoz.com,60020,1366657358443, load=(requests=5
> 392, regions=196, usedHeap=1063, maxHeap=3966): regionserver:60020-0x13dd980d2ab8661-0x13dd980d2ab8661 regionserver:60020-0x13dd980d2ab8661-0x13dd980d2ab8661 received expired fr
> om ZooKeeper, aborting
> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired
>        at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:352)
>        at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:270)
>        at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:523)
>        at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:499)
> 2013-04-22 16:47:21,843 INFO org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation: Trying to reconnect to zookeeper.
> 2013-04-22 16:47:21,844 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Dump of metrics: requests=1794, regions=196, stores=1561, storefiles=1585, storefileIndexSize=104, memstoreSize=306, compactionQueueSize=10, flushQueueSize=0, usedHeap=1073, maxHeap=3966, blockCacheSize=661986032, blockCacheFree=169901776, blockCacheCount=7242, blockCacheHitCount=910925, blockCacheMissCount=1558134, blockCacheEvictedCount=1344753, blockCacheHitRatio=36, blockCacheHitCachingRatio=40
> 2013-04-22 16:47:21,844 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: STOPPED: regionserver:60020-0x13dd980d2ab8661-0x13dd980d2ab8661 regionserver:60020-0x13dd980d2ab8661-0x13dd980d2ab8661 received expired from ZooKeeper, aborting
> 2013-04-22 16:47:21,844 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down
> 2013-04-22 16:47:21,900 WARN org.apache.hadoop.hbase.regionserver.wal.HLog: Too many consecutive RollWriter requests, it's a sign of the total number of live datanodes is lower than the tolerable replicas.
> 2013-04-22 16:47:22,341 INFO org.apache.zookeeper.ZooKeeper: Initiating client connection, connectString=zk1:2181 sessionTimeout=180000 watcher=hconnection
> 2013-04-22 16:47:22,357 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Waiting on 1 regions to close
> 2013-04-22 16:47:22,394 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server d1r2n2.prod.plutoz.com/10.0.0.66:2181. Will not attempt to authenticate using SASL (unknown error)
> 2013-04-22 16:47:22,395 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to d1r2n2.prod.plutoz.com/10.0.0.66:2181, initiating session
> 2013-04-22 16:47:22,397 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server d1r2n2.prod.plutoz.com/10.0.0.66:2181, sessionid = 0x13dd980d2abbf93, negotiated timeout = 40000
> 2013-04-22 16:47:22,400 INFO org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation: Reconnected successfully. This disconnect could have been caused by a network partition or a long-running GC pause, either way it's recommended that you verify your environment.
> 2013-04-22 16:47:22,400 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down
> 2013-04-22 16:47:56,830 INFO org.apache.hadoop.hbase.regionserver.HRegion: compaction interrupted by user:
> java.io.InterruptedIOException: Aborting compaction of store f in region t1_webpage,com.pandora.www:http/shaggy,1366670139658.9f565d5da3468c0725e590dc232abc23. because user requested stop.
>        at org.apache.hadoop.hbase.regionserver.Store.compact(Store.java:998)
>        at org.apache.hadoop.hbase.regionserver.Store.compact(Store.java:779)
>        at org.apache.hadoop.hbase.regionserver.HRegion.compactStores(HRegion.java:776)
>        at org.apache.hadoop.hbase.regionserver.HRegion.compactStores(HRegion.java:721)
>        at org.apache.hadoop.hbase.regionserver.CompactSplitThread.run(CompactSplitThread.java:81)
> 2013-04-22 16:47:56,830 INFO org.apache.hadoop.hbase.regionserver.HRegion: aborted compaction on region t1_webpage,com.pandora.www:http/shaggy,1366670139658.9f565d5da3468c0725e590dc232abc23. after 5mins, 58sec
> 2013-04-22 16:47:56,830 INFO org.apache.hadoop.hbase.regionserver.CompactSplitThread: regionserver60020.compactor exiting
> 2013-04-22 16:47:56,832 INFO org.apache.hadoop.hbase.regionserver.HRegion: Closed t1_webpage,com.pandora.www:http/shaggy,1366670139658.9f565d5da3468c0725e590dc232abc23.
> 2013-04-22 16:47:57,363 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: regionserver60020.logSyncer exiting
> 2013-04-22 16:47:57,366 INFO org.apache.hadoop.hbase.regionserver.Leases: regionserver60020 closing leases
> 2013-04-22 16:47:57,366 INFO org.apache.hadoop.hbase.regionserver.Leases: regionserver60020 closed leases
> 2013-04-22 16:47:57,366 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: regionserver60020 exiting
> 2013-04-22 16:47:57,497 INFO org.apache.hadoop.hbase.regionserver.ShutdownHook: Shutdown hook starting; hbase.shutdown.hook=true; fsShutdownHook=Thread[Thread-15,5,main]
> 2013-04-22 16:47:57,497 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: STOPPED: Shutdown hook
> 2013-04-22 16:47:57,497 INFO org.apache.hadoop.hbase.regionserver.ShutdownHook: Starting fs shutdown hook thread.
> 2013-04-22 16:47:57,504 INFO org.apache.hadoop.hbase.regionserver.Leases: regionserver60020.leaseChecker closing leases
> 2013-04-22 16:47:57,504 INFO org.apache.hadoop.hbase.regionserver.Leases: regionserver60020.leaseChecker closed leases
> 2013-04-22 16:47:57,598 INFO org.apache.hadoop.hbase.regionserver.ShutdownHook: Shutdown hook finished.
> 
> I would appreciate it very much if someone could explain to me what just happened here.
> 
> thanks,


help why do my regionservers shut themselves down?

Posted by kaveh minooie <ka...@plutoz.com>.
Hi

after a few mapreduce jobs my regionservers shut themselves down. this 
is the latest time that this has happened:

2013-04-22 16:47:21,843 INFO 
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation: 
This client just lost it's session with ZooKeeper, trying to reconnect.
2013-04-22 16:47:21,843 FATAL 
org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region 
server serverName=d1r1n17.prod.plutoz.com,60020,1366657358443, 
load=(requests=5
392, regions=196, usedHeap=1063, maxHeap=3966): 
regionserver:60020-0x13dd980d2ab8661-0x13dd980d2ab8661 
regionserver:60020-0x13dd980d2ab8661-0x13dd980d2ab8661 received expired fr
om ZooKeeper, aborting
org.apache.zookeeper.KeeperException$SessionExpiredException: 
KeeperErrorCode = Session expired
         at 
org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:352)
         at 
org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:270)
         at 
org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:523)
         at 
org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:499)
2013-04-22 16:47:21,843 INFO 
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation: 
Trying to reconnect to zookeeper.
2013-04-22 16:47:21,844 INFO 
org.apache.hadoop.hbase.regionserver.HRegionServer: Dump of metrics: 
requests=1794, regions=196, stores=1561, storefiles=1585, 
storefileIndexSize=104, memstoreSize=306, compactionQueueSize=10, 
flushQueueSize=0, usedHeap=1073, maxHeap=3966, blockCacheSize=661986032, 
blockCacheFree=169901776, blockCacheCount=7242, 
blockCacheHitCount=910925, blockCacheMissCount=1558134, 
blockCacheEvictedCount=1344753, blockCacheHitRatio=36, 
blockCacheHitCachingRatio=40
2013-04-22 16:47:21,844 INFO 
org.apache.hadoop.hbase.regionserver.HRegionServer: STOPPED: 
regionserver:60020-0x13dd980d2ab8661-0x13dd980d2ab8661 
regionserver:60020-0x13dd980d2ab8661-0x13dd980d2ab8661 received expired 
from ZooKeeper, aborting
2013-04-22 16:47:21,844 INFO org.apache.zookeeper.ClientCnxn: 
EventThread shut down
2013-04-22 16:47:21,900 WARN 
org.apache.hadoop.hbase.regionserver.wal.HLog: Too many consecutive 
RollWriter requests, it's a sign of the total number of live datanodes 
is lower than the tolerable replicas.
2013-04-22 16:47:22,341 INFO org.apache.zookeeper.ZooKeeper: Initiating 
client connection, connectString=zk1:2181 sessionTimeout=180000 
watcher=hconnection
2013-04-22 16:47:22,357 INFO 
org.apache.hadoop.hbase.regionserver.HRegionServer: Waiting on 1 regions 
to close
2013-04-22 16:47:22,394 INFO org.apache.zookeeper.ClientCnxn: Opening 
socket connection to server d1r2n2.prod.plutoz.com/10.0.0.66:2181. Will 
not attempt to authenticate using SASL (unknown error)
2013-04-22 16:47:22,395 INFO org.apache.zookeeper.ClientCnxn: Socket 
connection established to d1r2n2.prod.plutoz.com/10.0.0.66:2181, 
initiating session
2013-04-22 16:47:22,397 INFO org.apache.zookeeper.ClientCnxn: Session 
establishment complete on server d1r2n2.prod.plutoz.com/10.0.0.66:2181, 
sessionid = 0x13dd980d2abbf93, negotiated timeout = 40000
2013-04-22 16:47:22,400 INFO 
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation: 
Reconnected successfully. This disconnect could have been caused by a 
network partition or a long-running GC pause, either way it's 
recommended that you verify your environment.
2013-04-22 16:47:22,400 INFO org.apache.zookeeper.ClientCnxn: 
EventThread shut down
2013-04-22 16:47:56,830 INFO 
org.apache.hadoop.hbase.regionserver.HRegion: compaction interrupted by 
user:
java.io.InterruptedIOException: Aborting compaction of store f in region 
t1_webpage,com.pandora.www:http/shaggy,1366670139658.9f565d5da3468c0725e590dc232abc23. 
because user requested stop.
         at 
org.apache.hadoop.hbase.regionserver.Store.compact(Store.java:998)
         at 
org.apache.hadoop.hbase.regionserver.Store.compact(Store.java:779)
         at 
org.apache.hadoop.hbase.regionserver.HRegion.compactStores(HRegion.java:776)
         at 
org.apache.hadoop.hbase.regionserver.HRegion.compactStores(HRegion.java:721)
         at 
org.apache.hadoop.hbase.regionserver.CompactSplitThread.run(CompactSplitThread.java:81)
2013-04-22 16:47:56,830 INFO 
org.apache.hadoop.hbase.regionserver.HRegion: aborted compaction on 
region 
t1_webpage,com.pandora.www:http/shaggy,1366670139658.9f565d5da3468c0725e590dc232abc23. 
after 5mins, 58sec
2013-04-22 16:47:56,830 INFO 
org.apache.hadoop.hbase.regionserver.CompactSplitThread: 
regionserver60020.compactor exiting
2013-04-22 16:47:56,832 INFO 
org.apache.hadoop.hbase.regionserver.HRegion: Closed 
t1_webpage,com.pandora.www:http/shaggy,1366670139658.9f565d5da3468c0725e590dc232abc23.
2013-04-22 16:47:57,363 INFO 
org.apache.hadoop.hbase.regionserver.wal.HLog: 
regionserver60020.logSyncer exiting
2013-04-22 16:47:57,366 INFO 
org.apache.hadoop.hbase.regionserver.Leases: regionserver60020 closing 
leases
2013-04-22 16:47:57,366 INFO 
org.apache.hadoop.hbase.regionserver.Leases: regionserver60020 closed leases
2013-04-22 16:47:57,366 INFO 
org.apache.hadoop.hbase.regionserver.HRegionServer: regionserver60020 
exiting
2013-04-22 16:47:57,497 INFO 
org.apache.hadoop.hbase.regionserver.ShutdownHook: Shutdown hook 
starting; hbase.shutdown.hook=true; fsShutdownHook=Thread[Thread-15,5,main]
2013-04-22 16:47:57,497 INFO 
org.apache.hadoop.hbase.regionserver.HRegionServer: STOPPED: Shutdown hook
2013-04-22 16:47:57,497 INFO 
org.apache.hadoop.hbase.regionserver.ShutdownHook: Starting fs shutdown 
hook thread.
2013-04-22 16:47:57,504 INFO 
org.apache.hadoop.hbase.regionserver.Leases: 
regionserver60020.leaseChecker closing leases
2013-04-22 16:47:57,504 INFO 
org.apache.hadoop.hbase.regionserver.Leases: 
regionserver60020.leaseChecker closed leases
2013-04-22 16:47:57,598 INFO 
org.apache.hadoop.hbase.regionserver.ShutdownHook: Shutdown hook finished.

I would appreciate it very much if someone could explain to me what just 
happened here.

thanks,

Re: Snapshot Export Problem

Posted by Sean MacDonald <se...@opendns.com>.
It looks like you can't export a snapshot to a running cluster; after a period of time the destination cluster starts cleaning files out of its archive directory, which is what was breaking the export. I have turned off HBase on the destination cluster and the export is working as expected now.
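
For reference, the export command I'm running is along these lines (the
snapshot name, destination namenode and mapper count below are placeholders,
not our real values):

  hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot \
    -snapshot my_table_snapshot \
    -copy-to hdfs://destination-nn:8020/hbase \
    -mappers 16

My guess is that the archive cleaner on the destination cluster was removing
the copied files mid-export; with HBase stopped there, nothing touches the
archive directory and the copy completes.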

Sean  


On Monday, 22 April, 2013 at 9:22 AM, Sean MacDonald wrote:

> Hello, 
> 
> I am using HBase 0.94.6 on CDH 4.2 and trying to export a snapshot to another cluster (also CDH 4.2), but this is failing repeatedly. The table I am trying to export is approximately 4TB in size and has 10GB regions. Each of the map jobs runs for about 6 minutes and appears to be running properly, but then fails with a message like the following:
> 
> 2013-04-22 16:12:50,699 WARN org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /hbase/.archive/queries/533fcbb7858ef34b103a4f8804fa8719/d/651e974dafb64eefb9c49032aec4a35b File does not exist. Holder DFSClient_NONMAPREDUCE_-192704511_1 does not have any open files. at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2396) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2387) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2183) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:481) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:297) at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtoc
ol
> $2.callBlockingMethod(ClientNamenodeProtocolProtos.java:44080) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:453) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1002) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1695) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1691) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1689)
> 
> I was able to see the file that the LeaseExpiredException mentions on the destination cluster before the exception happened (it is gone afterwards).
> 
> Any help that could be provided in resolving this would be greatly appreciated.
> 
> Thanks and have a great day,
> 
> Sean