Posted to hdfs-user@hadoop.apache.org by jeff whiting <je...@qualtrics.com> on 2010/06/04 17:56:26 UTC

Lots of Different Kind of Datanode Errors

I had my HRegionServers go down due to an HDFS exception.  In the datanode logs I'm seeing a lot of different and varied exceptions.  I've increased the data xceiver count now, but these other ones don't make a lot of sense.

Among them are:

:2010-06-04 07:41:56,917 ERROR datanode.DataNode (DataXceiver.java:run(131)) - DatanodeRegistration(192.168.1.184:50010, storageID=DS-1601700079-192.168.1.184-50010-1274208308658, infoPort=50075, ipcPort=50020):DataXceiver
-java.io.EOFException
-	at java.io.DataInputStream.readByte(DataInputStream.java:250)
-	at org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:298)
-	at org.apache.hadoop.io.WritableUtils.readVInt(WritableUtils.java:319)
-	at org.apache.hadoop.io.Text.readString(Text.java:400)
-	at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:313)
-	at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:103)
-	at java.lang.Thread.run(Thread.java:619)


:2010-06-04 08:49:56,389 ERROR datanode.DataNode (DataXceiver.java:run(131)) - DatanodeRegistration(192.168.1.184:50010, storageID=DS-1601700079-192.168.1.184-50010-1274208308658, infoPort=50075, ipcPort=50020):DataXceiver
-java.io.IOException: Connection reset by peer
-	at sun.nio.ch.FileDispatcher.read0(Native Method)
-	at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
-	at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:233)
-	at sun.nio.ch.IOUtil.read(IOUtil.java:206)
-	at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:236)
-	at org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:55)
-	at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
-	at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)


:2010-06-04 05:36:54,840 ERROR datanode.DataNode (DataXceiver.java:run(131)) - DatanodeRegistration(192.168.1.184:50010, storageID=DS-1601700079-192.168.1.184-50010-1274208308658, infoPort=50075, ipcPort=50020):DataXceiver
-java.io.IOException: xceiverCount 2049 exceeds the limit of concurrent xcievers 2047
-	at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:88)
-	at java.lang.Thread.run(Thread.java:619)

:2010-06-04 05:36:48,848 ERROR datanode.DataNode (DataXceiver.java:run(131)) - DatanodeRegistration(192.168.1.184:50010, storageID=DS-1601700079-192.168.1.184-50010-1274208308658, infoPort=50075, ipcPort=50020):DataXceiver
-java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/192.168.1.184:50010 remote=/192.168.1.184:55349]
-	at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:246)
-	at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
-	at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
-	at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:313)
-	at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:400)
-	at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:180)
-	at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:95)
-	at java.lang.Thread.run(Thread.java:619)
--

The EOFException is the most common one I get.  I'm also unsure how I would get a connection reset by peer when I'm connecting locally.  Why is the file prematurely ending? Any idea of what is going on?

Thanks,
~Jeff

--
Jeff Whiting
Qualtrics Senior Software Engineer
jeffw@qualtrics.com






Re: Lots of Different Kind of Datanode Errors

Posted by Alex Kozlov <al...@cloudera.com>.
Hi Jeff,

Can you also check what your machine's swappiness is set to by running
'/sbin/sysctl vm.swappiness'?  HBase recommends setting it very low (0 or 5).
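
For example, a quick sketch (assuming a typical Linux box, run as root; the
sysctl path may differ on your distro):

  # check the current value
  /sbin/sysctl vm.swappiness
  # set it for the running kernel
  /sbin/sysctl -w vm.swappiness=5
  # make it stick across reboots
  echo "vm.swappiness = 5" >> /etc/sysctl.conf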

Alex K

On Fri, Jun 4, 2010 at 12:03 PM, Todd Lipcon <to...@cloudera.com> wrote:

> Hi Jeff,
>
> That seems like a reasonable config, but the error message you pasted
> indicated xceivers was set to 2048 instead of 4096.
>
> Also, in my experience SocketTimeoutExceptions are usually due to swapping.
> Verify that your machines aren't swapping when you're under load.
>
> BTW since this is hbase-related, may be better to move this to the hbase
> user list.
>
> -Todd
>
> On Fri, Jun 4, 2010 at 9:37 AM, Jeff Whiting <je...@qualtrics.com> wrote:
>
>>  I've tried to follow it the best I can.  I already increased the ulimit
>> to 32768.  This is what I now have in my hdfs-site.xml.  Am I missing
>> anything?
>> <?xml version="1.0"?>
>> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
>>
>> <!-- Put site-specific property overrides in this file. -->
>>
>> <configuration>
>> <property>
>>   <name>dfs.data.dir</name>
>>   <value>/media/sdb,/media/sdc,/media/sdd</value>
>> </property>
>>
>>   <property>
>>     <name>dfs.replication</name>
>>     <value>3</value>
>>   </property>
>>   <property>
>>     <name>dfs.datanode.max.xcievers</name>
>>     <value>4096</value>
>>   </property>
>>   <property>
>>     <name>dfs.datanode.handler.count</name>
>>     <value>10</value>
>>   </property>
>> </configuration>
>>
>>
>> .
>>
>> Todd Lipcon wrote:
>>
>> Hi Jeff,
>>
>>  Have you followed the HDFS configuration guide from the HBase wiki? You
>> need to bump up the transceiver count and probably ulimit as well. Looks
>> like you already tuned it to 2048, but that isn't high enough if you're
>> still getting the "exceeds the limit" message.
>>
>>  The EOFs and Connection Reset messages are when DFS clients are
>> disconnecting prematurely from a client stream (probably due to xceiver
>> errors on other streams)
>>
>>  -Todd
>>
>> On Fri, Jun 4, 2010 at 8:56 AM, jeff whiting <je...@qualtrics.com> wrote:
>>
>>> I had my HRegionServers go down due to hdfs exception.  In the datanode
>>> logs I'm seeing a lot of different and varied exceptions.  I've increased
>>> the data xceiver count now but these other ones don't make a lot of sense.
>>>
>>> Among them are:
>>>
>>> :2010-06-04 07:41:56,917 ERROR datanode.DataNode
>>> (DataXceiver.java:run(131)) - DatanodeRegistration(192.168.1.184:50010,
>>> storageID=DS-1601700079-192.168.1.184-50010-1274208308658, infoPort=50075,
>>> ipcPort=50020):DataXceiver
>>> -java.io.EOFException
>>> -       at java.io.DataInputStream.readByte(DataInputStream.java:250)
>>> -       at
>>> org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:298)
>>> -       at
>>> org.apache.hadoop.io.WritableUtils.readVInt(WritableUtils.java:319)
>>> -       at org.apache.hadoop.io.Text.readString(Text.java:400)
>>> -       at
>>> org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:313)
>>> -       at
>>> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:103)
>>> -       at java.lang.Thread.run(Thread.java:619)
>>>
>>>
>>> :2010-06-04 08:49:56,389 ERROR datanode.DataNode
>>> (DataXceiver.java:run(131)) - DatanodeRegistration(192.168.1.184:50010,
>>> storageID=DS-1601700079-192.168.1.184-50010-1274208308658, infoPort=50075,
>>> ipcPort=50020):DataXceiver
>>> -java.io.IOException: Connection reset by peer
>>> -       at sun.nio.ch.FileDispatcher.read0(Native Method)
>>> -       at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
>>> -       at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:233)
>>> -       at sun.nio.ch.IOUtil.read(IOUtil.java:206)
>>> -       at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:236)
>>> -       at
>>> org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:55)
>>> -       at
>>> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
>>> -       at
>>> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
>>>
>>>
>>> :2010-06-04 05:36:54,840 ERROR datanode.DataNode
>>> (DataXceiver.java:run(131)) - DatanodeRegistration(192.168.1.184:50010,
>>> storageID=DS-1601700079-192.168.1.184-50010-1274208308658, infoPort=50075,
>>> ipcPort=50020):DataXceiver
>>> -java.io.IOException: xceiverCount 2049 exceeds the limit of concurrent
>>> xcievers 2047
>>> -       at
>>> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:88)
>>> -       at java.lang.Thread.run(Thread.java:619)
>>>
>>> :2010-06-04 05:36:48,848 ERROR datanode.DataNode
>>> (DataXceiver.java:run(131)) - DatanodeRegistration(192.168.1.184:50010,
>>> storageID=DS-1601700079-192.168.1.184-50010-1274208308658, infoPort=50075,
>>> ipcPort=50020):DataXceiver
>>> -java.net.SocketTimeoutException: 480000 millis timeout while waiting for
>>> channel to be ready for write. ch :
>>> java.nio.channels.SocketChannel[connected local=/192.168.1.184:50010 remote=/
>>> 192.168.1.184:55349]
>>> -       at
>>> org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:246)
>>> -       at
>>> org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
>>> -       at
>>> org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
>>> -       at
>>> org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:313)
>>> -       at
>>> org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:400)
>>> -       at
>>> org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:180)
>>> -       at
>>> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:95)
>>> -       at java.lang.Thread.run(Thread.java:619)
>>> --
>>>
>>> The EOFException is the most common one I get.  I'm also unsure how I
>>> would get a connection reset by peer when I'm connecting locally.  Why is
>>> the file prematurely ending? Any idea of what is going on?
>>>
>>> Thanks,
>>> ~Jeff
>>>
>>> --
>>> Jeff Whiting
>>> Qualtrics Senior Software Engineer
>>> jeffw@qualtrics.com
>>>
>>>
>>>
>>>
>>>
>>>
>>
>>
>> --
>> Todd Lipcon
>> Software Engineer, Cloudera
>>
>>
>> --
>> Jeff Whiting
>> Qualtrics Senior Software Engineer
>> jeffw@qualtrics.com
>>
>>
>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
>

RE: Lots of Different Kind of Datanode Errors

Posted by Andrew Purtell <ap...@apache.org>.
HDFS-1148
   - Andy
From: Gokulakannan M <go...@huawei.com>
Subject: RE: Lots of Different Kind of Datanode Errors
To: hdfs-user@hadoop.apache.org, apurtell@apache.org
Date: Monday, June 7, 2010, 10:31 PM

Hi Andy,

            What is the reference of that fix?

 Thanks,
  Gokul

From: Andrew Purtell [mailto:apurtell@apache.org]
Sent: Tuesday, June 08, 2010 1:24 AM
To: hdfs-user@hadoop.apache.org
Subject: Re: Lots of Different Kind of Datanode Errors

  Current synchronization on FSDataset seems not quite right. Doing what
  amounted to applying Todd's patch that modifies FSDataSet to use reentrant
  rwlocks cleared up that type of problem for us.

    - Andy

  From: Jeff Whiting <je...@qualtrics.com>
  Subject: Re: Lots of Different Kind of Datanode Errors
  To: hdfs-user@hadoop.apache.org
  Date: Monday, June 7, 2010, 10:02 AM

  Thanks for the replies.  I have turned off swap on all the machines to
  prevent any swap problems.  I was pounding my hard drives quite hard.  I had
  a simulated 60 clients loading data as fast as I could into hbase with a map
  reduce export job going at the same time.  Would that scenario explain some
  of the errors I was seeing?

  Over the weekend under more of a normal load I haven't seen any exceptions
  except for about 6 of these:

  2010-06-05 03:46:41,229 ERROR datanode.DataNode (DataXceiver.java:run(131)) -
  DatanodeRegistration(192.168.0.98:50010,
  storageID=DS-1806250311-192.168.0.98-50010-1274208294562, infoPort=50075,
  ipcPort=50020):DataXceiver
  org.apache.hadoop.hdfs.server.datanode.BlockAlreadyExistsException: Block
  blk_-1677111232590888964_4471547 is valid, and cannot be written to.
      at
  org.apache.hadoop.hdfs.server.datanode.FSDataset.writeToBlock(FSDataset.java:999)

  The reason the config shows 4096 is because I increased the xceiver count
  after the first email message in this thread.

  ~Jeff

  Allen Wittenauer wrote:

    On Jun 4, 2010, at 12:03 PM, Todd Lipcon wrote:

      Hi Jeff,

      That seems like a reasonable config, but the error message you pasted
      indicated xceivers was set to 2048 instead of 4096.

      Also, in my experience SocketTimeoutExceptions are usually due to
      swapping. Verify that your machines aren't swapping when you're under
      load.

    Or doing any other heavy disk IO.

  --
  Jeff Whiting
  Qualtrics Senior Software Engineer
  jeffw@qualtrics.com

RE: Lots of Different Kind of Datanode Errors

Posted by Gokulakannan M <go...@huawei.com>.
Hi Andy,

            

            What is the reference of that fix?

 

 Thanks,

  Gokul

 

 

  _____  

From: Andrew Purtell [mailto:apurtell@apache.org] 
Sent: Tuesday, June 08, 2010 1:24 AM
To: hdfs-user@hadoop.apache.org
Subject: Re: Lots of Different Kind of Datanode Errors

 


Current synchronization on FSDataset seems not quite right. Doing what
amounted to applying Todd's patch that modifies FSDataSet to use reentrant
rwlocks cleared up that type of problem for us. 

 

  - Andy


From: Jeff Whiting <je...@qualtrics.com>
Subject: Re: Lots of Different Kind of Datanode Errors
To: hdfs-user@hadoop.apache.org
Date: Monday, June 7, 2010, 10:02 AM

Thanks for the replies.  I have turned off swap on all the machines to
prevent any swap problems.  I was pounding my hard drives quite hard.  I had
a simulated 60 clients loading data as fast as I could into hbase with a map
reduce export job going at the same time.  Would that scenario explain some
of the errors I was seeing?

Over the weekend under more of a normal load I haven't seen any exceptions
except for about 6 of these:
2010-06-05 03:46:41,229 ERROR datanode.DataNode (DataXceiver.java:run(131))
- DatanodeRegistration(192.168.0.98:50010,
storageID=DS-1806250311-192.168.0.98-50010-1274208294562, infoPort=50075,
ipcPort=50020):DataXceiver
org.apache.hadoop.hdfs.server.datanode.BlockAlreadyExistsException: Block
blk_-1677111232590888964_4471547 is valid, and cannot be written to.
    at
org.apache.hadoop.hdfs.server.datanode.FSDataset.writeToBlock(FSDataset.java
:999)

The reason the config shows 4096 is because I increased the xceiver count
after the first email message in this thread.

~Jeff

Allen Wittenauer wrote:

On Jun 4, 2010, at 12:03 PM, Todd Lipcon wrote:

  Hi Jeff,

  That seems like a reasonable config, but the error message you pasted
  indicated xceivers was set to 2048 instead of 4096.

  Also, in my experience SocketTimeoutExceptions are usually due to swapping.
  Verify that your machines aren't swapping when you're under load.

Or doing any other heavy disk IO.

-- 
Jeff Whiting
Qualtrics Senior Software Engineer
jeffw@qualtrics.com

 


Re: Lots of Different Kind of Datanode Errors

Posted by Andrew Purtell <ap...@apache.org>.
Current synchronization on FSDataset seems not quite right. Doing what amounted to applying Todd's patch that modifies FSDataSet to use reentrant rwlocks cleared up that type of problem for us. 
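
Not the actual patch, but roughly the idea is to replace coarse synchronized
sections with a reentrant read-write lock so that concurrent readers don't
serialize behind each other. A simplified, self-contained sketch (class and
method names here are made up for illustration, not FSDataset's real API):

  import java.util.HashMap;
  import java.util.Map;
  import java.util.concurrent.locks.ReentrantReadWriteLock;

  // Illustration only -- not the real FSDataset code.
  class BlockIndex {
      private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
      private final Map<Long, String> blockFiles = new HashMap<Long, String>();

      // Many readers can hold the read lock at the same time.
      String getBlockFile(long blockId) {
          lock.readLock().lock();
          try {
              return blockFiles.get(blockId);
          } finally {
              lock.readLock().unlock();
          }
      }

      // Writers take the exclusive write lock.
      void addBlock(long blockId, String file) {
          lock.writeLock().lock();
          try {
              blockFiles.put(blockId, file);
          } finally {
              lock.writeLock().unlock();
          }
      }
  }
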
  - Andy


From: Jeff Whiting <je...@qualtrics.com>
Subject: Re: Lots of Different Kind of Datanode Errors
To: hdfs-user@hadoop.apache.org
Date: Monday, June 7, 2010, 10:02 AM




  

 
Thanks for the replies.  I have turned off swap on all the machines to
prevent any swap problems.  I was pounding my hard drives quite hard. 
I had a simulated 60 clients loading data as fast as I could into hbase
with a map reduce export job going at the same time.  Would that
scenario explain some of the errors I was seeing?



Over the weekend under more of a normal load I haven't seen any
exceptions except for about 6 of these:

2010-06-05 03:46:41,229 ERROR datanode.DataNode
(DataXceiver.java:run(131)) - DatanodeRegistration(192.168.0.98:50010,
storageID=DS-1806250311-192.168.0.98-50010-1274208294562,
infoPort=50075, ipcPort=50020):DataXceiver

org.apache.hadoop.hdfs.server.datanode.BlockAlreadyExistsException:
Block blk_-1677111232590888964_4471547 is valid, and cannot be written
to.

    at
org.apache.hadoop.hdfs.server.datanode.FSDataset.writeToBlock(FSDataset.java:999)



The reason the config shows 4096 is because I increased the xceiver
count after the first email message in this thread.



~Jeff



Allen Wittenauer wrote:

  On Jun 4, 2010, at 12:03 PM, Todd Lipcon wrote:

    Hi Jeff,

    That seems like a reasonable config, but the error message you pasted
    indicated xceivers was set to 2048 instead of 4096.

    Also, in my experience SocketTimeoutExceptions are usually due to swapping.
    Verify that your machines aren't swapping when you're under load.

  Or doing any other heavy disk IO.

--
Jeff Whiting
Qualtrics Senior Software Engineer
jeffw@qualtrics.com

Re: Lots of Different Kind of Datanode Errors

Posted by Jeff Whiting <je...@qualtrics.com>.
Thanks for the replies.  I have turned off swap on all the machines to 
prevent any swap problems.  I was pounding my hard drives quite hard.  I 
had a simulated 60 clients loading data as fast as I could into hbase 
with a map reduce export job going at the same time.  Would that 
scenario explain some of the errors I was seeing?
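
(For anyone else doing the same, turning swap off amounted to roughly this on
each node -- a sketch, assuming Linux:)

  # disable all active swap devices right away
  swapoff -a
  # then comment out the swap line(s) in /etc/fstab so it stays off after reboot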

Over the weekend under more of a normal load I haven't seen any exceptions
except for about 6 of these:
2010-06-05 03:46:41,229 ERROR datanode.DataNode 
(DataXceiver.java:run(131)) - DatanodeRegistration(192.168.0.98:50010, 
storageID=DS-1806250311-192.168.0.98-50010-1274208294562, 
infoPort=50075, ipcPort=50020):DataXceiver
org.apache.hadoop.hdfs.server.datanode.BlockAlreadyExistsException: 
Block blk_-1677111232590888964_4471547 is valid, and cannot be written to.
    at 
org.apache.hadoop.hdfs.server.datanode.FSDataset.writeToBlock(FSDataset.java:999)

The reason the config shows 4096 is because I increased the xceiver 
count after the first email message in this thread.

~Jeff

Allen Wittenauer wrote:
> On Jun 4, 2010, at 12:03 PM, Todd Lipcon wrote:
>
>   
>> Hi Jeff,
>>
>> That seems like a reasonable config, but the error message you pasted indicated xceivers was set to 2048 instead of 4096.
>>
>> Also, in my experience SocketTimeoutExceptions are usually due to swapping. Verify that your machines aren't swapping when you're under load.
>>     
>
> Or doing any other heavy disk IO.
>
>   

-- 
Jeff Whiting
Qualtrics Senior Software Engineer
jeffw@qualtrics.com


Re: Lots of Different Kind of Datanode Errors

Posted by Allen Wittenauer <aw...@linkedin.com>.
On Jun 4, 2010, at 12:03 PM, Todd Lipcon wrote:

> Hi Jeff,
> 
> That seems like a reasonable config, but the error message you pasted indicated xceivers was set to 2048 instead of 4096.
> 
> Also, in my experience SocketTimeoutExceptions are usually due to swapping. Verify that your machines aren't swapping when you're under load.

Or doing any other heavy disk IO.


Re: Lots of Different Kind of Datanode Errors

Posted by Todd Lipcon <to...@cloudera.com>.
Hi Jeff,

That seems like a reasonable config, but the error message you pasted
indicated xceivers was set to 2048 instead of 4096.

Also, in my experience SocketTimeoutExceptions are usually due to swapping.
Verify that your machines aren't swapping when you're under load.
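
A quick way to check is to watch swap activity while the load is running -- a
rough sketch, assuming Linux nodes:

  # nonzero si/so columns mean pages are being swapped in/out
  vmstat 5
  # or look at how much swap is currently in use
  free -m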

BTW since this is hbase-related, may be better to move this to the hbase
user list.

-Todd

On Fri, Jun 4, 2010 at 9:37 AM, Jeff Whiting <je...@qualtrics.com> wrote:

>  I've tried to follow it the best I can.  I already increased the ulimit to
> 32768.  This is what I now have in my hdfs-site.xml.  Am I missing anything?
> <?xml version="1.0"?>
> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
>
> <!-- Put site-specific property overrides in this file. -->
>
> <configuration>
> <property>
>   <name>dfs.data.dir</name>
>   <value>/media/sdb,/media/sdc,/media/sdd</value>
> </property>
>
>   <property>
>     <name>dfs.replication</name>
>     <value>3</value>
>   </property>
>   <property>
>     <name>dfs.datanode.max.xcievers</name>
>     <value>4096</value>
>   </property>
>   <property>
>     <name>dfs.datanode.handler.count</name>
>     <value>10</value>
>   </property>
> </configuration>
>
>
> .
>
> Todd Lipcon wrote:
>
> Hi Jeff,
>
>  Have you followed the HDFS configuration guide from the HBase wiki? You
> need to bump up the transceiver count and probably ulimit as well. Looks
> like you already tuned it to 2048, but that isn't high enough if you're
> still getting the "exceeds the limit" message.
>
>  The EOFs and Connection Reset messages are when DFS clients are
> disconnecting prematurely from a client stream (probably due to xceiver
> errors on other streams)
>
>  -Todd
>
> On Fri, Jun 4, 2010 at 8:56 AM, jeff whiting <je...@qualtrics.com> wrote:
>
>> I had my HRegionServers go down due to hdfs exception.  In the datanode
>> logs I'm seeing a lot of different and varied exceptions.  I've increased
>> the data xceiver count now but these other ones don't make a lot of sense.
>>
>> Among them are:
>>
>> :2010-06-04 07:41:56,917 ERROR datanode.DataNode
>> (DataXceiver.java:run(131)) - DatanodeRegistration(192.168.1.184:50010,
>> storageID=DS-1601700079-192.168.1.184-50010-1274208308658, infoPort=50075,
>> ipcPort=50020):DataXceiver
>> -java.io.EOFException
>> -       at java.io.DataInputStream.readByte(DataInputStream.java:250)
>> -       at
>> org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:298)
>> -       at
>> org.apache.hadoop.io.WritableUtils.readVInt(WritableUtils.java:319)
>> -       at org.apache.hadoop.io.Text.readString(Text.java:400)
>> -       at
>> org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:313)
>> -       at
>> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:103)
>> -       at java.lang.Thread.run(Thread.java:619)
>>
>>
>> :2010-06-04 08:49:56,389 ERROR datanode.DataNode
>> (DataXceiver.java:run(131)) - DatanodeRegistration(192.168.1.184:50010,
>> storageID=DS-1601700079-192.168.1.184-50010-1274208308658, infoPort=50075,
>> ipcPort=50020):DataXceiver
>> -java.io.IOException: Connection reset by peer
>> -       at sun.nio.ch.FileDispatcher.read0(Native Method)
>> -       at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
>> -       at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:233)
>> -       at sun.nio.ch.IOUtil.read(IOUtil.java:206)
>> -       at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:236)
>> -       at
>> org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:55)
>> -       at
>> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
>> -       at
>> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
>>
>>
>> :2010-06-04 05:36:54,840 ERROR datanode.DataNode
>> (DataXceiver.java:run(131)) - DatanodeRegistration(192.168.1.184:50010,
>> storageID=DS-1601700079-192.168.1.184-50010-1274208308658, infoPort=50075,
>> ipcPort=50020):DataXceiver
>> -java.io.IOException: xceiverCount 2049 exceeds the limit of concurrent
>> xcievers 2047
>> -       at
>> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:88)
>> -       at java.lang.Thread.run(Thread.java:619)
>>
>> :2010-06-04 05:36:48,848 ERROR datanode.DataNode
>> (DataXceiver.java:run(131)) - DatanodeRegistration(192.168.1.184:50010,
>> storageID=DS-1601700079-192.168.1.184-50010-1274208308658, infoPort=50075,
>> ipcPort=50020):DataXceiver
>> -java.net.SocketTimeoutException: 480000 millis timeout while waiting for
>> channel to be ready for write. ch :
>> java.nio.channels.SocketChannel[connected local=/192.168.1.184:50010 remote=/
>> 192.168.1.184:55349]
>> -       at
>> org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:246)
>> -       at
>> org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
>> -       at
>> org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
>> -       at
>> org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:313)
>> -       at
>> org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:400)
>> -       at
>> org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:180)
>> -       at
>> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:95)
>> -       at java.lang.Thread.run(Thread.java:619)
>> --
>>
>> The EOFException is the most common one I get.  I'm also unsure how I
>> would get a connection reset by peer when I'm connecting locally.  Why is
>> the file prematurely ending? Any idea of what is going on?
>>
>> Thanks,
>> ~Jeff
>>
>> --
>> Jeff Whiting
>> Qualtrics Senior Software Engineer
>> jeffw@qualtrics.com
>>
>>
>>
>>
>>
>>
>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
>
>
> --
> Jeff Whiting
> Qualtrics Senior Software Engineer
> jeffw@qualtrics.com
>
>


-- 
Todd Lipcon
Software Engineer, Cloudera

Re: Lots of Different Kind of Datanode Errors

Posted by Jeff Whiting <je...@qualtrics.com>.
I've tried to follow it the best I can.  I already increased the ulimit 
to 32768.  This is what I now have in my hdfs-site.xml.  Am I missing 
anything?
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
<property>
  <name>dfs.data.dir</name>
  <value>/media/sdb,/media/sdc,/media/sdd</value>
</property>

  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
  <property>
    <name>dfs.datanode.max.xcievers</name>
    <value>4096</value>
  </property>
  <property>
    <name>dfs.datanode.handler.count</name>
    <value>10</value>
  </property>
</configuration>


.
Todd Lipcon wrote:
> Hi Jeff,
>
> Have you followed the HDFS configuration guide from the HBase wiki? 
> You need to bump up the transceiver count and probably ulimit as well. 
> Looks like you already tuned it to 2048, but that isn't high enough if
> you're still getting the "exceeds the limit" message.
>
> The EOFs and Connection Reset messages are when DFS clients are 
> disconnecting prematurely from a client stream (probably due to 
> xceiver errors on other streams)
>
> -Todd
>
> On Fri, Jun 4, 2010 at 8:56 AM, jeff whiting <jeffw@qualtrics.com 
> <ma...@qualtrics.com>> wrote:
>
>     I had my HRegionServers go down due to hdfs exception.  In the
>     datanode logs I'm seeing a lot of different and varied exceptions.
>      I've increased the data xceiver count now but these other ones
>     don't make a lot of sense.
>
>     Among them are:
>
>     :2010-06-04 07:41:56,917 ERROR datanode.DataNode
>     (DataXceiver.java:run(131)) -
>     DatanodeRegistration(192.168.1.184:50010
>     <http://192.168.1.184:50010>,
>     storageID=DS-1601700079-192.168.1.184-50010-1274208308658,
>     infoPort=50075, ipcPort=50020):DataXceiver
>     -java.io.EOFException
>     -       at java.io.DataInputStream.readByte(DataInputStream.java:250)
>     -       at
>     org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:298)
>     -       at
>     org.apache.hadoop.io.WritableUtils.readVInt(WritableUtils.java:319)
>     -       at org.apache.hadoop.io.Text.readString(Text.java:400)
>     -       at
>     org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:313)
>     -       at
>     org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:103)
>     -       at java.lang.Thread.run(Thread.java:619)
>
>
>     :2010-06-04 08:49:56,389 ERROR datanode.DataNode
>     (DataXceiver.java:run(131)) -
>     DatanodeRegistration(192.168.1.184:50010
>     <http://192.168.1.184:50010>,
>     storageID=DS-1601700079-192.168.1.184-50010-1274208308658,
>     infoPort=50075, ipcPort=50020):DataXceiver
>     -java.io.IOException: Connection reset by peer
>     -       at sun.nio.ch.FileDispatcher.read0(Native Method)
>     -       at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
>     -       at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:233)
>     -       at sun.nio.ch.IOUtil.read(IOUtil.java:206)
>     -       at
>     sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:236)
>     -       at
>     org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:55)
>     -       at
>     org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
>     -       at
>     org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
>
>
>     :2010-06-04 05:36:54,840 ERROR datanode.DataNode
>     (DataXceiver.java:run(131)) -
>     DatanodeRegistration(192.168.1.184:50010
>     <http://192.168.1.184:50010>,
>     storageID=DS-1601700079-192.168.1.184-50010-1274208308658,
>     infoPort=50075, ipcPort=50020):DataXceiver
>     -java.io.IOException: xceiverCount 2049 exceeds the limit of
>     concurrent xcievers 2047
>     -       at
>     org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:88)
>     -       at java.lang.Thread.run(Thread.java:619)
>
>     :2010-06-04 05:36:48,848 ERROR datanode.DataNode
>     (DataXceiver.java:run(131)) -
>     DatanodeRegistration(192.168.1.184:50010
>     <http://192.168.1.184:50010>,
>     storageID=DS-1601700079-192.168.1.184-50010-1274208308658,
>     infoPort=50075, ipcPort=50020):DataXceiver
>     -java.net.SocketTimeoutException: 480000 millis timeout while
>     waiting for channel to be ready for write. ch :
>     java.nio.channels.SocketChannel[connected
>     local=/192.168.1.184:50010 <http://192.168.1.184:50010>
>     remote=/192.168.1.184:55349 <http://192.168.1.184:55349>]
>     -       at
>     org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:246)
>     -       at
>     org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
>     -       at
>     org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
>     -       at
>     org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:313)
>     -       at
>     org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:400)
>     -       at
>     org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:180)
>     -       at
>     org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:95)
>     -       at java.lang.Thread.run(Thread.java:619)
>     --
>
>     The EOFException is the most common one I get.  I'm also unsure
>     how I would get a connection reset by peer when I'm connecting
>     locally.  Why is the file prematurely ending? Any idea of what is
>     going on?
>
>     Thanks,
>     ~Jeff
>
>     --
>     Jeff Whiting
>     Qualtrics Senior Software Engineer
>     jeffw@qualtrics.com <ma...@qualtrics.com>
>
>
>
>
>
>
>
>
> -- 
> Todd Lipcon
> Software Engineer, Cloudera

-- 
Jeff Whiting
Qualtrics Senior Software Engineer
jeffw@qualtrics.com


Re: Lots of Different Kind of Datanode Errors

Posted by Todd Lipcon <to...@cloudera.com>.
Hi Jeff,

Have you followed the HDFS configuration guide from the HBase wiki? You need
to bump up the transceiver count and probably ulimit as well. Looks like you
already tuned it to 2048, but that isn't high enough if you're still getting
the "exceeds the limit" message.

The EOFs and Connection Reset messages are when DFS clients are
disconnecting prematurely from a client stream (probably due to xceiver
errors on other streams)
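
On the ulimit side, a rough sketch (assuming the datanode and region server run
as a user named "hadoop" -- substitute whatever account you actually use):

  # check the current open-file limit for that user
  ulimit -n

  # /etc/security/limits.conf -- raise it persistently; takes effect on the
  # next login / daemon restart
  hadoop  -  nofile  32768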

-Todd

On Fri, Jun 4, 2010 at 8:56 AM, jeff whiting <je...@qualtrics.com> wrote:

> I had my HRegionServers go down due to hdfs exception.  In the datanode
> logs I'm seeing a lot of different and varied exceptions.  I've increased
> the data xceiver count now but these other ones don't make a lot of sense.
>
> Among them are:
>
> :2010-06-04 07:41:56,917 ERROR datanode.DataNode
> (DataXceiver.java:run(131)) - DatanodeRegistration(192.168.1.184:50010,
> storageID=DS-1601700079-192.168.1.184-50010-1274208308658, infoPort=50075,
> ipcPort=50020):DataXceiver
> -java.io.EOFException
> -       at java.io.DataInputStream.readByte(DataInputStream.java:250)
> -       at
> org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:298)
> -       at
> org.apache.hadoop.io.WritableUtils.readVInt(WritableUtils.java:319)
> -       at org.apache.hadoop.io.Text.readString(Text.java:400)
> -       at
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:313)
> -       at
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:103)
> -       at java.lang.Thread.run(Thread.java:619)
>
>
> :2010-06-04 08:49:56,389 ERROR datanode.DataNode
> (DataXceiver.java:run(131)) - DatanodeRegistration(192.168.1.184:50010,
> storageID=DS-1601700079-192.168.1.184-50010-1274208308658, infoPort=50075,
> ipcPort=50020):DataXceiver
> -java.io.IOException: Connection reset by peer
> -       at sun.nio.ch.FileDispatcher.read0(Native Method)
> -       at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
> -       at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:233)
> -       at sun.nio.ch.IOUtil.read(IOUtil.java:206)
> -       at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:236)
> -       at
> org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:55)
> -       at
> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
> -       at
> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
>
>
> :2010-06-04 05:36:54,840 ERROR datanode.DataNode
> (DataXceiver.java:run(131)) - DatanodeRegistration(192.168.1.184:50010,
> storageID=DS-1601700079-192.168.1.184-50010-1274208308658, infoPort=50075,
> ipcPort=50020):DataXceiver
> -java.io.IOException: xceiverCount 2049 exceeds the limit of concurrent
> xcievers 2047
> -       at
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:88)
> -       at java.lang.Thread.run(Thread.java:619)
>
> :2010-06-04 05:36:48,848 ERROR datanode.DataNode
> (DataXceiver.java:run(131)) - DatanodeRegistration(192.168.1.184:50010,
> storageID=DS-1601700079-192.168.1.184-50010-1274208308658, infoPort=50075,
> ipcPort=50020):DataXceiver
> -java.net.SocketTimeoutException: 480000 millis timeout while waiting for
> channel to be ready for write. ch :
> java.nio.channels.SocketChannel[connected local=/192.168.1.184:50010 remote=/
> 192.168.1.184:55349]
> -       at
> org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:246)
> -       at
> org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
> -       at
> org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
> -       at
> org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:313)
> -       at
> org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:400)
> -       at
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:180)
> -       at
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:95)
> -       at java.lang.Thread.run(Thread.java:619)
> --
>
> The EOFException is the most common one I get.  I'm also unsure how I would
> get a connection reset by peer when I'm connecting locally.  Why is the file
> prematurely ending? Any idea of what is going on?
>
> Thanks,
> ~Jeff
>
> --
> Jeff Whiting
> Qualtrics Senior Software Engineer
> jeffw@qualtrics.com
>
>
>
>
>
>


-- 
Todd Lipcon
Software Engineer, Cloudera