Posted to user@nutch.apache.org by Emmanuel de Castro Santana <em...@gmail.com> on 2010/08/06 22:58:34 UTC

crawldb - DatanodeRegistration - EOFException

Hi all,

We are running Nutch on a 4-node cluster (3 tasktracker & datanode, 1
jobtracker & namenode).
These machines have pretty strong hardware, and fetch jobs run easily.

However, sometimes while the update job is running we see the following
exception:

2010-08-05 21:07:19,213 ERROR datanode.DataNode - DatanodeRegistration(
172.16.202.172:50010,
storageID=DS-246829865-172.16.202.172-50010-1280352366878, infoPort=50075,
ipcPort=50020):DataXceiver
java.io.EOFException
    at java.io.DataInputStream.readShort(DataInputStream.java:298)
    at
org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:79)
    at java.lang.Thread.run(Thread.java:619)
2010-08-05 21:07:19,222 DEBUG mortbay.log - EOF
2010-08-05 21:12:19,155 ERROR datanode.DataNode - DatanodeRegistration(
172.16.202.172:50010,
storageID=DS-246829865-172.16.202.172-50010-1280352366878, infoPort=50075,
ipcPort=50020):DataXceiver
java.io.EOFException
    at java.io.DataInputStream.readShort(DataInputStream.java:298)
    at
org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:79)
    at java.lang.Thread.run(Thread.java:619)
2010-08-05 21:12:19,164 DEBUG mortbay.log - EOF
2010-08-05 21:17:19,239 ERROR datanode.DataNode - DatanodeRegistration(
172.16.202.172:50010,
storageID=DS-246829865-172.16.202.172-50010-1280352366878, infoPort=50075,
ipcPort=50020):DataXceiver
java.io.EOFException
    at java.io.DataInputStream.readShort(DataInputStream.java:298)
    at
org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:79)
    at java.lang.Thread.run(Thread.java:619)


This exception shows up every 4 or 5 minutes.
This is the amount of data read and written by the job while those
exceptions appear (counter: map / reduce / total):

FILE_BYTES_READ        1,224,570,415    0                1,224,570,415
HDFS_BYTES_READ        1,405,131,713    0                1,405,131,713
FILE_BYTES_WRITTEN     2,501,562,342    1,224,570,187    3,726,132,529

Checking the file system with "bin/hadoop fsck" shows only HEALTHY blocks
most of the time, although there are times when the job history files
appear CORRUPT, as I can see with "bin/hadoop fsck -openforwrite".
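(For reference, the interval can be confirmed by pulling the timestamps out
of the datanode log; a sketch against a saved excerpt, where the file name
is an assumption - on a live node, point it at $HADOOP_LOG_DIR instead:)

```shell
# Sketch: count the DataXceiver errors and list their timestamps.
# "sample-datanode.log" is a saved excerpt used here for illustration;
# on a live node, use $HADOOP_LOG_DIR/hadoop-*-datanode-*.log instead.
cat > sample-datanode.log <<'EOF'
2010-08-05 21:07:19,213 ERROR datanode.DataNode - DatanodeRegistration(172.16.202.172:50010):DataXceiver
2010-08-05 21:12:19,155 ERROR datanode.DataNode - DatanodeRegistration(172.16.202.172:50010):DataXceiver
2010-08-05 21:17:19,239 ERROR datanode.DataNode - DatanodeRegistration(172.16.202.172:50010):DataXceiver
EOF

grep 'DataXceiver' sample-datanode.log | awk '{print $2}'   # one timestamp per error, ~5 min apart
grep -c 'DataXceiver' sample-datanode.log                   # number of occurrences
```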

dfs.block.size is 128 MB
the system open-files ulimit is set to 16384
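(For anyone comparing setups: that limit is typically raised in
/etc/security/limits.conf; a fragment, assuming the daemons run as a
"hadoop" user - use whatever account runs them on your nodes:)

```
# /etc/security/limits.conf fragment (sketch; "hadoop" is an assumed
# account name for whatever user runs the datanode/tasktracker daemons)
hadoop  soft  nofile  16384
hadoop  hard  nofile  16384
```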

The cluster has strong hardware, the network between the nodes is fast,
and there is plenty of disk space and memory on all nodes.
Given that, I suspect something in my current configuration is not fully
appropriate.

A short tip would be very welcome at this point.


Thanks in advance

Emmanuel de Castro Santana

Re: crawldb - DatanodeRegistration - EOFException

Posted by Scott Gonyea <sc...@aitrus.org>.
Do you have any more info on the setup to offer? What do performance metrics look like on the nodes? Network, disk, etc.?

What are you using to store the data on the namenode? What HDFS backend and hardware?

Also, what are the ulimits (-a) on your nodes? And how much memory per task?

sg

Sent from my iPhone


Re: crawldb - DatanodeRegistration - EOFException

Posted by Emmanuel de Castro Santana <em...@gmail.com>.
"Can you go from 1 machine to
another using whatever name appears"

Yes, I can go directly from one machine to the other.

"Also, what does it say when it shows that it's parsing?  By that, I mean:
Look at the job details, and see the status output from each node"

Status mostly shows as SUCCEEDED; even when that EOFException appears
frequently, Hadoop seems resilient to that kind of failure. There are
occasional failures, but I have lost no jobs so far.

Anyway, I fear those exceptions could start causing jobs to fail once the
demand increases.

"... it'd be good if you just paste overwhelming amounts of data to the
list.
 That'll make it easier to spot more obvious/potential issues."

I am not sure which data would be relevant for this case.

Nevertheless, here is some of our configuration:

mapred.tasktracker.map.tasks.maximum          7
mapred.map.tasks                              2
mapred.skip.map.auto.incr.proc.count          true
mapred.map.tasks.speculative.execution        true
mapred.map.max.attempts                       4

mapred.tasktracker.reduce.tasks.maximum       7
mapred.reduce.tasks                           1
mapred.reduce.copy.backoff                    300
mapred.skip.reduce.auto.incr.proc.count       true
mapred.reduce.slowstart.completed.maps        0.05
mapred.reduce.parallel.copies                 20
mapred.reduce.max.attempts                    4
mapred.reduce.tasks.speculative.execution     true

dfs.df.interval                               60000
dfs.namenode.decommission.interval            30
dfs.namenode.decommission.nodes.per.interval  5

We are slowly learning how each setting interacts with the others;
in our experience, this is the combination that has served us best so far.
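(Those properties live in mapred-site.xml; a fragment with two of the
values above, as a sketch:)

```
<!-- mapred-site.xml fragment (sketch; values taken from the list above) -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>7</value>
</property>
<property>
  <name>mapred.reduce.parallel.copies</name>
  <value>20</value>
</property>
```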

Emmanuel

2010/8/9 Scott Gonyea <me...@sgonyea.com>

> Without going too deep into it, one thing that crossed my mind:  How are
> you
> naming the nodes (DNS)?  When looking at the job tracker, what name does it
> show for the trackers, in the Machine List?  Can you go from 1 machine to
> another using whatever name appears?
>
> Also, what does it say when it shows that it's parsing?  By that, I mean:
> Look at the job details, and see the status output from each node.  Also,
> it'd be good if you just paste overwhelming amounts of data to the list.
>  That'll make it easier to spot more obvious/potential issues.
>
> sg
>



-- 
Emmanuel de Castro Santana

Re: crawldb - DatanodeRegistration - EOFException

Posted by Scott Gonyea <me...@sgonyea.com>.
Without going too deep into it, one thing that crossed my mind:  How are you
naming the nodes (DNS)?  When looking at the job tracker, what name does it
show for the trackers, in the Machine List?  Can you go from 1 machine to
another using whatever name appears?

Also, what does it say when it shows that it's parsing?  By that, I mean:
Look at the job details, and see the status output from each node.  Also,
it'd be good if you just paste overwhelming amounts of data to the list.
 That'll make it easier to spot more obvious/potential issues.

sg

On Mon, Aug 9, 2010 at 1:42 PM, Emmanuel de Castro Santana <
emmanuel.csantana@gmail.com> wrote:

> "What do performance metrics look like on the nodes? Network/Disk/etc?"
>
> I do not have exact metrics yet.
> However, 'top' command tells me that cpu usage gets significantly higher
> while parsing. I guess there is nothing to worry about it though.
> Most of the time cores are mostly idle and load average does not surpass
> 0.5
> (except when parsing).
>
> "what are the ulimits (-a)"
>
> ulimits are the same for all nodes, which means ...
>
> 16384 for open files
> 139264 for max user processes
> 32 for max locked memory
>
> "... so during peaks it would choke and drop packets"
>
> All nodes talk directly to each other through a switch, there are no long
> paths to cross.
> Don't really believe the problem is on network.
> It seems to be more likely that I am not using the proper Hadoop
> configurations.
>
> Emmanuel
>
>
> --
> Emmanuel de Castro Santana
>

Re: crawldb - DatanodeRegistration - EOFException

Posted by Emmanuel de Castro Santana <em...@gmail.com>.
"What do performance metrics look like on the nodes? Network/Disk/etc?"

I do not have exact metrics yet.
However, 'top' shows that CPU usage gets significantly higher while
parsing; I guess there is nothing to worry about there, though.
Most of the time the cores are mostly idle and the load average does not
go above 0.5 (except when parsing).

"what are the ulimits (-a)"

ulimits are the same on all nodes, namely:

16384 for open files
139264 for max user processes
32 for max locked memory

"... so during peaks it would choke and drop packets"

All nodes talk directly to each other through a single switch; there are
no long paths to cross.
I don't really believe the problem is in the network.
It seems more likely that I am not using the right Hadoop configuration.

Emmanuel


2010/8/7 Andrzej Bialecki <ab...@getopt.org>

> Hadoop network usage patterns are sometimes taxing for the network
> equipment - I've seen strange errors pop up in situations with cabling of
> poor quality, and even one case when everything was perfect except for the
> gigE switch - the switch was equipped with several gigE ports, and the
> vendor claimed it can support all ports simultaneously... but it's poor CPU
> was too underpowered to actually handle so many packets/sec from all ports,
> so during peaks it would choke and drop packets.
>
> --
> Best regards,
> Andrzej Bialecki     <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
>


-- 
Emmanuel de Castro Santana

Re: crawldb - DatanodeRegistration - EOFException

Posted by Andrzej Bialecki <ab...@getopt.org>.

Hadoop network usage patterns are sometimes taxing for network
equipment - I've seen strange errors pop up in setups with poor-quality
cabling, and even one case where everything was perfect except for the
gigE switch: it was equipped with several gigE ports, and the vendor
claimed it could support all ports simultaneously... but its poor CPU was
too underpowered to actually handle that many packets/sec from all ports,
so during peaks it would choke and drop packets.
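(If a choking switch or NIC is suspected, the drop counters in
/proc/net/dev are cheap to check on every node; a sketch parsing a saved
sample, since live counters are machine-specific - and "eth0" is an
assumption, substitute your interface name:)

```shell
# Sketch: pull the receive/transmit "drop" counters for eth0 out of /proc/net/dev.
# sample_net_dev is a saved copy used for illustration; on a live node,
# read /proc/net/dev directly. Nonzero, growing drop counts hint at the
# kind of packet loss described above.
cat > sample_net_dev <<'EOF'
Inter-|   Receive                                                |  Transmit
 face |bytes    packets errs drop fifo frame compressed multicast|bytes    packets errs drop fifo colls carrier compressed
  eth0: 1234567   10000    0   42    0     0          0         0  7654321    9000    0    7    0    0       0          0
EOF

# With ':' and spaces as separators, field 6 is RX drop and field 14 is TX drop.
awk -F'[: ]+' '/eth0/ {print "rx_drop=" $6, "tx_drop=" $14}' sample_net_dev
```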

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com