Posted to common-user@hadoop.apache.org by Florian Leibert <fl...@leibert.de> on 2009/09/24 22:36:30 UTC

DataXceiver error

Hi,
recently we've been seeing frequent SocketTimeoutExceptions (STEs) in our
datanodes. We had previously fixed this issue by upping the handler counts
and dfs.datanode.max.xcievers (note that "xciever" is misspelled in the
code as well - so we're just being consistent).
We're using 0.19 with a couple of patches - none of which should affect any
of the areas in the stack trace.

We had seen this before upping the limits on the xcievers - but these
settings already seem very high. We're running 102 nodes.

Any hints would be appreciated.

 <property>
    <name>dfs.datanode.handler.count</name>
    <value>300</value>
 </property>
 <property>
    <name>dfs.namenode.handler.count</name>
    <value>300</value>
 </property>
 <property>
    <name>dfs.datanode.max.xcievers</name>
    <value>2000</value>
 </property>


2009-09-24 17:48:13,648 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode:
DatanodeRegistration(10.16.160.79:50010,
storageID=DS-1662533511-10.16.160.79-50010-1219665628349, infoPort=50075,
ipcPort=50020):DataXceiver
java.net.SocketTimeoutException: 480000 millis timeout while waiting for
channel to be ready for write. ch :
java.nio.channels.SocketChannel[connected local=/10.16.160.79:50010
remote=/10.16.134.78:34280]
        at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:185)
        at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
        at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:293)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:387)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:179)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:94)
        at java.lang.Thread.run(Thread.java:619)

Re: DataXceiver error

Posted by Amandeep Khurana <am...@gmail.com>.
On Thu, Sep 24, 2009 at 4:09 PM, Florian Leibert <fl...@leibert.de> wrote:

> We can't really alter the jobs... This is a rather complex system with our
> own DSL for writing jobs so that other departments can use our data. The
> number of mappers is determined based on the number of input files
> involved...
>

Ok


>
> Setting this to 0 in a cluster where resources will be scarce at times
> doesn't really sound like a solution - I don't have any of these problems
> on
> our 30 node test cluster, so I can't really try it out there and setting
> the
> timeout to 0 on production doesn't give me a great deal of confidence...
>
>
Ok.. In that case, let's see if someone else can offer an alternative
workaround or solution.
Could you write a bit more about the kind of job: how compute-intensive it
is, the number of mappers (in one of the cases where it causes trouble), the
number of reducers, the number of tasks per node, whether the job fails or
just logs this exception and then retries the task and finishes, your
cluster configuration, etc.? That might give a better understanding of the
issue.



Re: DataXceiver error

Posted by Raghu Angadi <ra...@yahoo-inc.com>.
Amandeep Khurana wrote:
> On Thu, Sep 24, 2009 at 6:28 PM, Raghu Angadi <ra...@yahoo-inc.com> wrote:
> 
>> This exception is not related to max.xceivers.. though they are co-related.
>> Users who need a lot of xceivers tend to slow readers (nothing wrong with
>> that). And absolutely no relation to handler count.
>>
>> Is the exception actually resulting in task/job failures? If yes, with
>> 0.19, your only option is to set the timeout to 0 as Amandeep suggested.
>>
>> In 0.20 clients recover correctly from such errors. The failures because of
>> this exception should go away.
>>
>> Amandeep, you should need to set it to 0 if you are 0.20 based HBase.
>>
>>
> I should/shouldnt? I'm on 0.20 and have it set to 0... It just avoids the
> exception altogether and doesnt hurt the performance in any ways (I think
> so..).. Correct me if I'm wrong on this.

You shouldn't. Sorry.

It is not really a performance question. Without the timeout, the clients
could hold a few resources (threads, socket buffers) for too long, or maybe
forever. Of course, 8 minutes is probably too long to save you noticeable
resources.

In general, infinite timeouts like this are not good, especially for very
large clusters.

Raghu.



Re: DataXceiver error

Posted by Amandeep Khurana <am...@gmail.com>.
On Thu, Sep 24, 2009 at 6:28 PM, Raghu Angadi <ra...@yahoo-inc.com> wrote:

>
> This exception is not related to max.xceivers.. though they are co-related.
> Users who need a lot of xceivers tend to slow readers (nothing wrong with
> that). And absolutely no relation to handler count.
>
> Is the exception actually resulting in task/job failures? If yes, with
> 0.19, your only option is to set the timeout to 0 as Amandeep suggested.
>
> In 0.20 clients recover correctly from such errors. The failures because of
> this exception should go away.
>
> Amandeep, you should need to set it to 0 if you are 0.20 based HBase.
>
>
I should/shouldn't? I'm on 0.20 and have it set to 0... It just avoids the
exception altogether and doesn't hurt performance in any way (I think
so..). Correct me if I'm wrong on this.




Re: DataXceiver error

Posted by Raghu Angadi <ra...@yahoo-inc.com>.
This exception is not related to max.xceivers, though they are correlated:
users who need a lot of xceivers tend to have slow readers (nothing wrong
with that). And it has absolutely no relation to handler count.

Is the exception actually resulting in task/job failures? If yes, with 
0.19, your only option is to set the timeout to 0 as Amandeep suggested.

In 0.20 clients recover correctly from such errors. The failures because 
of this exception should go away.

Amandeep, you should need to set it to 0 if you are 0.20 based HBase.

Raghu.



Re: DataXceiver error

Posted by Florian Leibert <fl...@leibert.de>.
We can't really alter the jobs... This is a rather complex system with our
own DSL for writing jobs, so that other departments can use our data. The
number of mappers is determined by the number of input files involved...

Setting this to 0 in a cluster where resources will be scarce at times
doesn't really sound like a solution - I don't see any of these problems on
our 30-node test cluster, so I can't really try it out there, and setting
the timeout to 0 in production doesn't give me a great deal of confidence...



Re: DataXceiver error

Posted by Amandeep Khurana <am...@gmail.com>.
On Thu, Sep 24, 2009 at 3:39 PM, Florian Leibert <fl...@leibert.de> wrote:

> This happens maybe 4-5 times a day on an arbitrary node - it usually occurs
> during very intense jobs where there are 10s of thousands of map tasks
> scheduled...
>

Right.. So, the most probable reason is that the particular file being read
is being kept open during the computation, and that's causing the timeouts.
You can try altering your jobs and the number of tasks and see if you can
come up with a workaround.


> From what I gather in the code, this results from a write attempt - the
> selector seems to wait until it can write to a channel - setting this to 0
> might impact our cluster reliability, hence I'm not
>
>
Setting the timeout to 0 doesn't impact cluster reliability. We have it
set to 0 on our clusters as well, and it's a pretty normal thing to do.
However, we do it because we are using HBase, which is known to keep file
handles open for long periods. But setting the timeout to 0 doesn't impact
any of our non-HBase applications/jobs at all.. So, it's not a problem.



Re: DataXceiver error

Posted by Florian Leibert <fl...@leibert.de>.
This happens maybe 4-5 times a day on an arbitrary node - it usually occurs
during very intense jobs where tens of thousands of map tasks are
scheduled...
From what I gather in the code, this results from a write attempt - the
selector seems to wait until it can write to a channel - setting this to 0
might impact our cluster reliability, hence I'm not keen on doing that.
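
Roughly, the pattern I mean looks like the sketch below - just an
illustration of the java.nio selector behavior, not the actual Hadoop
SocketIOWithTimeout code (the class and method names here are made up).
Note that with a plain Selector, a zero timeout means "block indefinitely":

import java.io.IOException;
import java.net.SocketTimeoutException;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.SocketChannel;

public class WriteWaitSketch {
    // Illustration only -- not the actual SocketIOWithTimeout code.
    // Waits until the channel is writable, or throws once timeoutMs expires.
    // With timeoutMs == 0 this falls through to select(), which blocks
    // indefinitely -- effectively what dfs.datanode.socket.write.timeout=0
    // asks for.
    static void waitUntilWritable(SocketChannel channel, long timeoutMs)
            throws IOException {
        Selector selector = Selector.open();
        try {
            channel.configureBlocking(false);
            channel.register(selector, SelectionKey.OP_WRITE);
            int ready = (timeoutMs > 0) ? selector.select(timeoutMs)
                                        : selector.select();
            if (ready == 0 && timeoutMs > 0) {
                throw new SocketTimeoutException(timeoutMs
                        + " millis timeout while waiting for channel to be"
                        + " ready for write");
            }
        } finally {
            selector.close();
        }
    }
}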


Re: DataXceiver error

Posted by Amandeep Khurana <am...@gmail.com>.
What were you doing when you got this error? Did you monitor the resource
consumption while it was running?

The reason I suggested it is that sometimes file handles stay open for
longer than the timeout (intentionally, though), and that causes trouble..
So, people keep the timeout at 0 to solve this problem.


Amandeep Khurana
Computer Science Graduate Student
University of California, Santa Cruz



Re: DataXceiver error

Posted by Florian Leibert <fl...@leibert.de>.
I don't think setting the timeout to 0 is a good idea - after all, we have
a lot of writes going on, so it's expected that at times a resource isn't
available immediately. Am I missing something, or what's your reasoning for
assuming that the timeout value is the problem?


Re: DataXceiver error

Posted by Amandeep Khurana <am...@gmail.com>.
When do you get this error?

Try setting the timeout to 0. That'll remove the 480s timeout. The property
name is dfs.datanode.socket.write.timeout.
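
For example - just a sketch, in the same <property> format as the settings
you pasted (the datanodes would presumably need a restart to pick it up):

 <property>
    <name>dfs.datanode.socket.write.timeout</name>
    <value>0</value>
 </property>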

-ak



Amandeep Khurana
Computer Science Graduate Student
University of California, Santa Cruz

