Posted to user@hbase.apache.org by "Gibbon, Robert, VF-Group" <Ro...@vodafone.com> on 2010/01/07 14:04:58 UTC

RE: Seeing errors after loading a fair amount of data. KeeperException$NoNodeException, IOException

Maybe you are running Red Hat? Just changing limits.conf won't work, I
think, because RH also enforces a system-wide cap on total open files
(4096 by default), unless you do something like this too:

echo "32768" > /proc/sys/fs/file-max
service network restart

To make it permanent edit /etc/sysctl.conf to include the line:
      fs.file-max = 32768

Maybe you already did that though, or you're using something else, or
maybe this is not your problem.
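
If it helps, here is a rough sketch of how I'd apply and then verify
both limits (I'm assuming the daemons run as a user called "hadoop";
substitute whichever user you actually use):

# re-read /etc/sysctl.conf without a reboot
sysctl -p
# confirm the system-wide cap took effect
cat /proc/sys/fs/file-max
# confirm the per-process cap in a fresh login shell,
# so PAM re-applies /etc/security/limits.conf
su - hadoop -c 'ulimit -n'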

Kind regards, 
Robert

-----Original Message-----
From: Marc Limotte [mailto:mslimotte@gmail.com] 
Sent: Donnerstag, 7. Januar 2010 09:40
To: hbase-user@hadoop.apache.org
Subject: Re: Seeing errors after loading a fair amount of data.
KeeperException$NoNodeException, IOException

Thanks for the response, Ryan.

I increased ulimit and xcievers.  My load job still dies, but the
RegionServers stay up and running, and I can use the hbase shell to
retrieve a row, so I guess HBase is still running.
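
For reference, the liveness check is just a one-liner like the
following (the row key is made up; 'feed' is the table I'm loading):

# pipe a read through the hbase shell as a quick sanity check
echo "get 'feed', 'somerow'" | hbase shell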

Doesn't seem to be swapping (still 2-3 GB free) and CPU usage is low:

top - 02:44:02 up 16 min,  1 user,  load average: 0.60, 0.47, 0.37
Tasks: 117 total,   1 running, 116 sleeping,   0 stopped,   0 zombie
Cpu(s):  2.0%us,  0.3%sy,  0.0%ni, 95.9%id,  1.8%wa,  0.0%hi,  0.0%si,  0.1%st
Mem:   7348132k total,  4506192k used,  2841940k free,    18788k buffers
Swap:        0k total,        0k used,        0k free,  2846528k cached

Current settings:
- open files for the user: between 2000 and 2500
- ulimit -n: 4096
- nofile in /etc/security/limits.conf: 32768
- dfs.datanode.max.xcievers in $HADOOP_HOME/conf/hdfs-site.xml: 2048
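
One thing I notice while gathering these numbers: ulimit -n still
reports 4096 even though limits.conf says 32768, so the higher limit
may not be reaching the daemons.  Here is roughly how I'm checking what
a running regionserver actually gets (matching on the HRegionServer
process name is just my guess):

# read the live limits of the regionserver process
# (/proc/<pid>/limits needs a reasonably recent kernel)
RS_PID=$(jps | awk '/HRegionServer/ {print $1}')
grep 'open files' /proc/$RS_PID/limits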

I also should have mentioned that this is running in Amazon EC2. I
checked out the recommendations for EC2 on the wiki, and hence this
particular run is on a 'c1.xlarge' instance, although most of my prior
testing has been on m1.large.


The jobtracker log has a bunch of these errors (each task, except for
the ones with a very small input file, fails after a series of retries
like this):

2010-01-07 02:43:23,719 INFO org.apache.hadoop.mapred.TaskInProgress: Error from
attempt_201001070230_0001_m_000018_0:
org.apache.hadoop.hbase.client.RetriesExhaustedException: Trying to contact region
server Some server, retryOnlyOne=true, index=0, islastrow=false, tries=9,
numtries=10, i=1620, listsize=3823,
region=feed,\x00\xFFn\xF5\x7F\xFF\xFE\xDA\xFDLi\xA4,1262714416618 for region
feed,\x00\xFFn\xF5\x7F\xFF\xFE\xDA\xFDLi\xA4,1262714416618,
row '\x01\x03Uz\x7F\xFF\xFE\xDB\x15\xEEt\xFD', but failed after 10 attempts.
Exceptions:
        at org.apache.hadoop.hbase.client.HConnectionManager$TableServers$Batch.process(HConnectionManager.java:1120)
        at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.processBatchOfRows(HConnectionManager.java:1201)
        at org.apache.hadoop.hbase.client.HTable.flushCommits(HTable.java:605)
        at org.apache.hadoop.hbase.mapred.TableOutputFormat$TableRecordWriter.close(TableOutputFormat.java:69)
        at org.apache.hadoop.mapred.MapTask$DirectMapOutputCollector.close(MapTask.java:631)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:363)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
        at org.apache.hadoop.mapred.Child.main(Child.java:170)


Following are the exceptions I see in the regionserver logs (but they
seem to show up at startup):


java.net.ConnectException: Connection refused
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574)
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:933)
2010-01-07 02:30:38,431 WARN org.apache.zookeeper.ClientCnxn: Ignoring exception during shutdown input
java.nio.channels.ClosedChannelException
        at sun.nio.ch.SocketChannelImpl.shutdownInput(SocketChannelImpl.java:638)
        at sun.nio.ch.SocketAdaptor.shutdownInput(SocketAdaptor.java:360)
        at org.apache.zookeeper.ClientCnxn$SendThread.cleanup(ClientCnxn.java:999)
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:970)
2010-01-07 02:30:38,431 WARN org.apache.zookeeper.ClientCnxn: Ignoring exception during shutdown output
java.nio.channels.ClosedChannelException
        at sun.nio.ch.SocketChannelImpl.shutdownOutput(SocketChannelImpl.java:649)
        at sun.nio.ch.SocketAdaptor.shutdownOutput(SocketAdaptor.java:368)
        at org.apache.zookeeper.ClientCnxn$SendThread.cleanup(ClientCnxn.java:1004)
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:970)
2010-01-07 02:30:38,544 WARN org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Failed to set watcher on ZNode /hbase/master
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/master
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:90)
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
        at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:780)
        at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.watchMasterAddress(ZooKeeperWrapper.java:304)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.watchMasterAddress(HRegionServer.java:385)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.reinitializeZooKeeper(HRegionServer.java:315)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.reinitialize(HRegionServer.java:306)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.<init>(HRegionServer.java:276)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.doMain(HRegionServer.java:2474)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.main(HRegionServer.java:2542)
2010-01-07 02:30:38,544 WARN org.apache.hadoop.hbase.regionserver.HRegionServer: Unable to set watcher on ZooKeeper master address. Retrying.

...

2010-01-07 01:49:34,756 INFO org.apache.zookeeper.ClientCnxn: Server connection successful
2010-01-07 01:49:34,765 WARN org.apache.zookeeper.ClientCnxn: Exception closing session 0x0 to sun.nio.ch.SelectionKeyImpl@2cba5bdb
java.io.IOException: Read error rc = -1 java.nio.DirectByteBuffer[pos=0 lim=4 cap=4]
    at org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:701)
    at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:945)
2010-01-07 01:49:34,766 WARN org.apache.zookeeper.ClientCnxn: Ignoring exception during shutdown input
java.net.SocketException: Transport endpoint is not connected
    at sun.nio.ch.SocketChannelImpl.shutdown(Native Method)
    at sun.nio.ch.SocketChannelImpl.shutdownInput(SocketChannelImpl.java:640)
    at sun.nio.ch.SocketAdaptor.shutdownInput(SocketAdaptor.java:360)
    at org.apache.zookeeper.ClientCnxn$SendThread.cleanup(ClientCnxn.java:999)
    at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:970)
2010-01-07 01:49:34,767 WARN org.apache.zookeeper.ClientCnxn: Ignoring exception during shutdown output
java.net.SocketException: Transport endpoint is not connected
    at sun.nio.ch.SocketChannelImpl.shutdown(Native Method)
    at sun.nio.ch.SocketChannelImpl.shutdownOutput(SocketChannelImpl.java:651)
    at sun.nio.ch.SocketAdaptor.shutdownOutput(SocketAdaptor.java:368)
    at org.apache.zookeeper.ClientCnxn$SendThread.cleanup(ClientCnxn.java:1004)
    at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:970)
2010-01-07 01:49:34,783 INFO org.apache.hadoop.hbase.master.RegionManager: -ROOT- region unset (but not set to be reassigned)
2010-01-07 01:49:34,813 INFO org.apache.hadoop.hbase.master.RegionManager: ROOT inserted into regionsInTransition
2010-01-07 01:49:35,015 INFO org.apache.zookeeper.ClientCnxn: Attempting connection to server slave3.cluster1/10.251.130.176:2181


-marc

2010/1/5 <hb...@hadoop.apache.org>

>
> ---------- Forwarded message ----------
> From: Ryan Rawson <ry...@gmail.com>
> To: hbase-user@hadoop.apache.org
> Date: Tue, 5 Jan 2010 16:19:49 -0800
> Subject: Re: Seeing errors after loading a fair amount of data.
> KeeperException$NoNodeException, IOException
>
> Hey,
>
> What we need is the fatal exception in the Regionserver log... just 
> looking though I suspect you might be running into HDFS tuning limits.
>  Xciever count and ulimit -n are the key settings you want to verify.
>
> Look in:
>
> http://hadoop.apache.org/hbase/docs/current/api/overview-summary.html#overview_description
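>
> A minimal sketch of those two checks, run as the user that owns the
> daemons (the paths below are just the usual defaults):
>
> # per-process file descriptor cap for this user
> ulimit -n
> # current xciever setting, if any (note the historical misspelling)
> grep -A1 'dfs.datanode.max.xcievers' $HADOOP_HOME/conf/hdfs-site.xml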
>
> Let us know!
> -ryan
>
> On Tue, Jan 5, 2010 at 4:06 PM, Marc Limotte <ms...@gmail.com> wrote:
> > I'm struggling with a problem that seems to manifest only after I 
> > load a fair amount of data.  I run a few map/reduce jobs and load 
> > about 80M rows successfully. Then I start another process to load 
> > another 35M or so rows, and things start breaking:
> >
> > - Most of the RegionServer processes die (4 out of 5) -- log message below.
> > - HMaster does not die, but seems unresponsive at the status web page
> >   (port 60030) -- log message below.
> > - HQuorumPeer(s) are still running.
> >
> >
> > I restart the entire cluster (full reboot) and try again, but the
> > problem occurs again immediately (similar process state and log
> > messages).
> >
> > It seems that if I truncate the table and restart, I get a similar
> > situation.  *I.e. I can load about 80M rows, but then the
> > RegionServers die and my jobs fail.*
> >
> > Small cluster: 5 nodes
> > 2 x 2 cores, 8gig memory, Fedora 8
> > Hbase 0.20.2 / Hadoop 0.20.2
> >
> > ---------
> >
> > *Master log contains errors like:*
> >
> > 2010-01-05 17:33:05,142 INFO org.apache.hadoop.hbase.master.ServerManager: slave2.cluster1,60020,1262730096922 znode expired
> > 2010-01-05 17:33:05,147 INFO org.apache.hadoop.hbase.master.RegionServerOperation: process shutdown of server slave2.cluster1,60020,1262730096922: logSplit: false, rootRescanned: false, numberOfMetaRegions: 1, onlineMetaRegions.size(): 1
> > 2010-01-05 17:33:05,156 INFO org.apache.hadoop.hbase.regionserver.HLog: Splitting 9 hlog(s) in hdfs://master.cluster1:9000/hbase/.logs/slave2.cluster1,60020,1262730096922
> > 2010-01-05 17:33:12,303 WARN org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception: org.apache.hadoop.ipc.RemoteException: java.io.IOException: File /hbase/feed/248946612/oldlogfile.log could only be replicated to 0 nodes, instead of 1
> >        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1267)
> >        at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:422)
> >        at sun.reflect.GeneratedMethodAccessor14.invoke(Unknown Source)
> >        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> >        at java.lang.reflect.Method.invoke(Method.java:597)
> >        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508)
> >        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:959)
> >        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:955)
> >        at java.security.AccessController.doPrivileged(Native Method)
> >        at javax.security.auth.Subject.doAs(Subject.java:396)
> >        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:953)
> >        at org.apache.hadoop.ipc.Client.call(Client.java:739)
> >        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
> >        at $Proxy0.addBlock(Unknown Source)
> >        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> >        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> >        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> >        at java.lang.reflect.Method.invoke(Method.java:597)
> >        at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
> >        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
> >        at $Proxy0.addBlock(Unknown Source)
> >        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.locateFollowingBlock(DFSClient.java:2906)
> >        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2788)
> >        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2078)
> >        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2264)
> >
> > 2010-01-05 17:33:12,303 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block null bad datanode[0] nodes == null
> > 2010-01-05 17:33:12,303 WARN org.apache.hadoop.hdfs.DFSClient: Could not get block locations. Source file "/hbase/feed/248946612/oldlogfile.log" - Aborting...
> >
> >
> > *And the region server logs contain this right after start up:*
> >
> > java.io.IOException: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /hbase/root-region-server
> >        at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.readAddressOrThrow(ZooKeeperWrapper.java:332)
> >        at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.readAddress(ZooKeeperWrapper.java:318)
> >        at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.readRootRegionLocation(ZooKeeperWrapper.java:231)
> >        at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:442)
> >        at java.lang.Thread.run(Thread.java:619)
> > Caused by: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /hbase/root-region-server
> >        at org.apache.zookeeper.KeeperException.create(KeeperException.java:102)
> >        at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
> >        at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:892)
> >        at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.readAddressOrThrow(ZooKeeperWrapper.java:328)
> >        ... 4 more
> >
> > *and*
> >
> > 2010-01-05 18:49:57,398 WARN org.apache.hadoop.hbase.regionserver.Store: Exception processing reconstruction log hdfs://master.cluster1:9000/hbase/feed/1281791924/oldlogfile.log opening comments -- continuing.  Probably lack-of-HADOOP-1700 causing DATA LOSS!
> > java.io.EOFException
> >        at java.io.DataInputStream.readFully(DataInputStream.java:180)
> >        at java.io.DataInputStream.readFully(DataInputStream.java:152)
> >        at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1450)
> >        at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1428)
> >        at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1417)
> >        at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1412)
> >        at org.apache.hadoop.hbase.regionserver.Store.doReconstructionLog(Store.java:318)
> >        at org.apache.hadoop.hbase.regionserver.Store.runReconstructionLog(Store.java:267)
> >        at org.apache.hadoop.hbase.regionserver.Store.<init>(Store.java:225)
> >        at org.apache.hadoop.hbase.regionserver.HRegion.instantiateHStore(HRegion.java:1500)
> >        at org.apache.hadoop.hbase.regionserver.HRegion.initialize(HRegion.java:305)
> >        at org.apache.hadoop.hbase.regionserver.HRegionServer.instantiateRegion(HRegionServer.java:1621)
> >        at org.apache.hadoop.hbase.regionserver.HRegionServer.openRegion(HRegionServer.java:1588)
> >        at org.apache.hadoop.hbase.regionserver.HRegionServer$Worker.run(HRegionServer.java:1508)
> >        at java.lang.Thread.run(Thread.java:619)
> >
> > This last one seems particularly strange, because HADOOP-1700 was
> > fixed in Hadoop 0.19.
> >
> > Any help on what these exceptions mean and what I can do about them
> > would be appreciated.
> >
> > -Marc
> >

Re: Seeing errors after loading a fair amount of data. KeeperException$NoNodeException, IOException

Posted by Marc Limotte <ms...@gmail.com>.
Thanks, Robert.

I'm using Fedora, so it probably works the same way as you suggest.
Setting the ulimit and xcievers as described on the Troubleshooting
page didn't seem to help, but I'm going to try again with your
suggestion.

Marc

On Thu, Jan 7, 2010 at 12:56 PM, Andrew Purtell <ap...@apache.org> wrote:

> Robert,
>
> Thanks for that. I updated the relevant section of the Troubleshooting page
> up on the HBase wiki with this advice.
>
> Best regards,
>
>   - Andy
>
>
>
> ----- Original Message ----
> > From: "Gibbon, Robert, VF-Group" <Ro...@vodafone.com>
> > To: hbase-user@hadoop.apache.org
> > Sent: Thu, January 7, 2010 5:04:58 AM
> > Subject: RE: Seeing errors after loading a fair amount of data.
> KeeperException$NoNodeException, IOException
> >
> > Maybe you are running Red Hat? Just changing limits.conf I think won't
> > work because RH has a maximum total open files across the whole system,
> > which is 4096 by default, unless you do something like this too
> >
> > echo "32768" > /proc/sys/fs/file-max
> > service network restart
> >
> > To make it permanent edit /etc/sysctl.conf to include the line:
> >       fs.file-max = 32768
> > Kind regards,
> > Robert
> [...]

Re: Seeing errors after loading a fair amount of data. KeeperException$NoNodeException, IOException

Posted by Andrew Purtell <ap...@apache.org>.
Robert,

Thanks for that. I updated the relevant section of the Troubleshooting page
up on the HBase wiki with this advice. 

Best regards,

   - Andy



----- Original Message ----
> From: "Gibbon, Robert, VF-Group" <Ro...@vodafone.com>
> To: hbase-user@hadoop.apache.org
> Sent: Thu, January 7, 2010 5:04:58 AM
> Subject: RE: Seeing errors after loading a fair amount of data. KeeperException$NoNodeException, IOException
> 
> Maybe you are running Red Hat? Just changing limits.conf I think won't
> work because RH has a maximum total open files across the whole system,
> which is 4096 by default, unless you do something like this too
> 
> echo "32768" > /proc/sys/fs/file-max
> service network restart
> 
> To make it permanent edit /etc/sysctl.conf to include the line:
>       fs.file-max = 32768
> Kind regards, 
> Robert
[...]