Posted to general@hadoop.apache.org by Evert Lammerts <Ev...@sara.nl> on 2011/05/11 11:23:25 UTC

Stability issue - dead DN's

Hi list,

I notice that whenever our Hadoop installation is put under heavy load we lose one or two (out of a total of five) datanodes. This results in IOExceptions and affects the overall performance of the job being run. Can anybody give me advice or best practices on a different configuration to increase stability? Below I've included the specs of the cluster, the Hadoop-related config and an example of what goes wrong and when. Any help is very much appreciated, and if I can provide any other info please let me know.

Cheers,
Evert

== What goes wrong, and when ==

Attached is a screenshot of Ganglia while the cluster is under load from a single job. This job:
* reads ~1TB from HDFS
* writes ~200GB to HDFS
* runs 288 Mappers and 35 Reducers

When the job runs it takes all available Map and Reduce slots. The system starts swapping and there is a short interval during which most cores are in WAIT. After that the job really starts running. At around the halfway point, one or two datanodes become unreachable and are marked as dead nodes. The number of under-replicated blocks becomes huge. Then some "java.io.IOException: Could not obtain block" exceptions are thrown in Mappers. The job does manage to finish successfully after around 3.5 hours, but my fear is that when we make the input much larger - which we want - the system becomes too unstable to finish the job.

Maybe worth mentioning - you never know what might help diagnostics. We notice that memory usage drops when we switch our keys from Text to LongWritable, and the Mappers finish in a fraction of the time. However, for some reason this results in much more network traffic and makes the Reducers extremely slow. We're working on figuring out what causes this.
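For what it's worth, the switch itself is just a change of the map output key type. A minimal sketch using the old 0.20 API (the class name and the Text value type below are placeholders for illustration, not our actual job):

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;

public class KeyTypeSwitch {
    // Map output keys go from Text to LongWritable - a fixed-size type that is
    // cheaper to hold and compare during the sort/shuffle. The value type is an
    // assumption for illustration only.
    public static void configure(JobConf job) {
        job.setMapOutputKeyClass(LongWritable.class);
        job.setMapOutputValueClass(Text.class);
    }
}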


== The cluster ==

We have a cluster that consists of 6 Sun Thumpers running Hadoop 0.20.2 on CentOS 5.5. One of them acts as NN and JT, the other 5 run DN's and TT's. Each node has:
* 16GB RAM
* 32GB swapspace
* 4 cores
* 11 LVM's of 4 x 500GB disks (2TB in total) for HDFS
* non-HDFS stuff on separate disks
* a 2x1GE bonded network interface for interconnects
* a 2x1GE bonded network interface for external access

I realize that this is not a well balanced system, but it's what we had available for a prototype environment. We're working on putting together a specification for a much larger production environment.


== Hadoop config ==

Here are some properties that I think might be relevant:

__CORE-SITE.XML__

fs.inmemory.size.mb: 200
mapreduce.task.io.sort.factor: 100
mapreduce.task.io.sort.mb: 200
# 1024*1024*4 bytes = 4 MB, the blocksize of the LVM's
io.file.buffer.size: 4194304

__HDFS-SITE.XML__

# 1024*1024*4*32 bytes = 128 MB, 32 times the blocksize of the LVM's
dfs.block.size: 134217728
# Only 5 DN's, but this shouldn't hurt
dfs.namenode.handler.count: 40
# This got rid of the occasional "Could not obtain block" errors
dfs.datanode.max.xcievers: 4096

__MAPRED-SITE.XML__

mapred.tasktracker.map.tasks.maximum: 4
mapred.tasktracker.reduce.tasks.maximum: 4
mapred.child.java.opts: -Xmx2560m
mapreduce.reduce.shuffle.parallelcopies: 20
mapreduce.map.java.opts: -Xmx512m
mapreduce.reduce.java.opts: -Xmx512m
# Compression codecs are configured and seem to work fine
mapred.compress.map.output: true
mapred.map.output.compression.codec: com.hadoop.compression.lzo.LzoCodec
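
For reference, a minimal sketch that prints what the Configuration machinery actually resolves for a few of the keys above when run against the cluster classpath (a null means the key is not set under that name in this version; the site file names are the standard ones):

import org.apache.hadoop.conf.Configuration;

public class DumpEffectiveConf {
    public static void main(String[] args) {
        // The default constructor loads core-site.xml from the classpath;
        // the other site files are added explicitly here.
        Configuration conf = new Configuration();
        conf.addResource("hdfs-site.xml");
        conf.addResource("mapred-site.xml");
        String[] keys = {
            "io.file.buffer.size",
            "dfs.block.size",
            "dfs.datanode.max.xcievers",
            "mapred.tasktracker.map.tasks.maximum",
            "mapred.child.java.opts"
        };
        for (String key : keys) {
            // get() returns null when a key is unknown or unset under that name
            System.out.println(key + " = " + conf.get(key));
        }
    }
}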


RE: Stability issue - dead DN's

Posted by Evert Lammerts <Ev...@sara.nl>.
> From my experience, hadoop loathes swap and you mention that all reduces and mappers are running (8 total) and from the ganglia screenshot I see that you have a thick crest of that purple swap.

I know, it's ugly isn't it :) My understanding is that this is partly due to forked processes though.

> If we do the math that means [ map.tasks.max * mapred.child.java.opts ]  +  [ reduce.tasks.max * mapred.child.java.opts ] => or [ 4 * 2.5G ] + [ 4 * 2.5G ] is greater than the amount of physical RAM in the machine.
> This doesn't account for the base tasktracker and datanode process + OS overhead and whatever else may be hoarding resources on the systems.

This makes me feel stupid :) You're right - I've just turned it down; we'll see how it performs now.

> I would play with this ratio, either less maps / reduces max - or lower your child.java.opts so that when you are fully subscribed you are not using more resource than the machine can offer.
> Also, setting mapred.reduce.slowstart.completed.maps to 1.00 or some other value close to 1 would be one way to guarantee that only 4 tasks - either maps or reduces - are running at once, and address (albeit in a duct-tape-like way) the oversubscription problem you are seeing (this represents the fraction of maps that should complete before initiating the reduce phase).

This is a new one for me. I get Allen's point that on a multi-tenant cluster this won't fix the problem, but the default is definitely not a good one. Starting reduce tasks as soon as map tasks start running is hardly ever useful, and just takes up slots that could be used by others.
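
For reference, a minimal sketch of setting this per job with the old 0.20 JobConf API rather than cluster-wide (whether 1.00 or something just below it is the right value is a judgment call):

import org.apache.hadoop.mapred.JobConf;

public class SlowstartTuning {
    // Schedule reducers only once (nearly) all maps have finished, so reduce
    // slots are not held idle during the map phase.
    public static void configure(JobConf job) {
        job.setFloat("mapred.reduce.slowstart.completed.maps", 1.0f);
    }
}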

Thanks a bunch for the suggestions!

Cheers,
Evert





RE: Stability issue - dead DN's

Posted by Evert Lammerts <Ev...@sara.nl>.
Just to check: when getting or putting data, the NN gives back hostnames of DN's to the client, and not IP addresses, right?
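
One way to check this from a client is to ask for the block locations of a file directly - BlockLocation exposes both the "names" (address:port strings) and the "hosts" (hostnames). A minimal sketch, assuming the HDFS path is passed as the first argument:

import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PrintBlockLocations {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path path = new Path(args[0]);                 // HDFS path to inspect
        FileStatus status = fs.getFileStatus(path);
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            // getNames() -> "address:port" strings, getHosts() -> hostnames
            System.out.println("names: " + Arrays.toString(block.getNames())
                + "  hosts: " + Arrays.toString(block.getHosts()));
        }
    }
}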

Cheers,
Evert

________________________________________
From: Evert Lammerts [Evert.Lammerts@sara.nl]
Sent: Saturday, May 14, 2011 10:53 AM
To: general@hadoop.apache.org
Subject: RE: Stability issue - dead DN's

Ok, I'll give this scenario a try (in spite of the intoxication ;-)).

= putting or getting a file =
A client will access the NameNode first and get a list of hostnames. These will resolve to addresses either in public or in private space, depending on whether the request to the nameserver was made by a machine in public or in private space. Each node has one NIC listening on its address in private space and one on its address in public space. The Hadoop daemons are bound to 0.0.0.0:*. Reverse DNS will return an address in private space when the client connects from one of the nodes, and (obviously) an address in public space when the request came through the WAN.

I'm not sure what could go wrong here... On Monday I'll recheck this scenario with our HPN guys as well.

Cheers,
Evert


________________________________________
From: Segel, Mike [msegel@navteq.com]
Sent: Saturday, May 14, 2011 12:33 AM
To: general@hadoop.apache.org
Subject: Re: Stability issue - dead DN's

Ok...

Hum, look, I've been force-fed a couple of margaritas, so my memory is a bit foggy...
You say your clients connect on nic A. Your cluster connects on nic B.

What happens when you want to upload a file from your client to HDFS? Or even access it?

... ;-)



Sent from a remote device. Please excuse any typos...

Mike Segel

On May 13, 2011, at 4:15 PM, "Evert Lammerts" <Ev...@sara.nl> wrote:

> Hi Mike,
>
> Thanks for trying to help out.
>
> I had a talk with our networking guys this afternoon. According to them (and this is way out of my area of expertise, so excuse any mistakes) multiple interfaces shouldn't be a problem. We could set up a nameserver to resolve hostnames to addresses in our private space when the request comes from one of the nodes, and route this traffic over a single interface. Any other request can be resolved to an address in the public space, which is bound to another interface. In our current setup we're not even resolving hostnames in our private address space through a nameserver - we do it with an ugly hack in /etc/hosts. And it seems to work alright.
>
> Having said that, our problems are still not completely gone even after adjusting the maximum allowed RAM for tasks - although things are a lot better. While writing this mail, three out of five DN's were marked as dead. There still is some swapping going on, but the cores are not spending any time in WAIT, so this shouldn't be the cause of anything. See below a trace from a dead DN - any thoughts are appreciated!
>
> Cheers,
> Evert
>
> 2011-05-13 23:13:27,716 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Received block blk_-9131821326787012529_2915672 src: /192.168.28.211:60136 dest: /192.168.28.214:50050 of size 382425
> 2011-05-13 23:13:27,915 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in receiveBlock for block blk_-9132067116195286882_130888 java.io.EOFException: while trying to read 3744913 bytes
> 2011-05-13 23:13:27,925 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /192.168.28.214:50050, dest: /192.168.28.214:35139, bytes: 0, op: HDFS_READ, cliID: DFSClient_attempt_201105131125_0025_m_001437_0, offset: 196608, srvID: DS-443352839-145.100.2.183-50050-1291128673616, blockid: blk_-9163184839986480695_4112368, duration: 6254000
> 2011-05-13 23:13:28,032 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Received block blk_-9149862728087355005_3793421 src: /192.168.28.210:41197 dest: /192.168.28.214:50050 of size 245767
> 2011-05-13 23:13:28,033 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Block blk_-9132067116195286882_130888 unfinalized and removed.
> 2011-05-13 23:13:28,033 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock blk_-9132067116195286882_130888 received exception java.io.EOFException: while trying to read 3744913 bytes
> 2011-05-13 23:13:28,033 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(192.168.28.214:50050, storageID=DS-443352839-145.100.2.183-50050-1291128673616, infoPort=50075, ipcPort=50020):DataXceiver
> java.io.EOFException: while trying to read 3744913 bytes
>        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readToBuf(BlockReceiver.java:270)
>        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readNextPacket(BlockReceiver.java:357)
>        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:378)
>        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:534)
>        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:417)
>        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:122)
> 2011-05-13 23:13:28,038 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /192.168.28.214:50050, dest: /192.168.28.214:32910, bytes: 0, op: HDFS_READ, cliID: DFSClient_attempt_201105131125_0025_m_001443_0, offset: 197632, srvID: DS-443352839-145.100.2.183-50050-1291128673616, blockid: blk_-9163184839986480695_4112368, duration: 4323000
> 2011-05-13 23:13:28,038 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /192.168.28.214:50050, dest: /192.168.28.214:35138, bytes: 0, op: HDFS_READ, cliID: DFSClient_attempt_201105131125_0025_m_001440_0, offset: 197120, srvID: DS-443352839-145.100.2.183-50050-1291128673616, blockid: blk_-9163184839986480695_4112368, duration: 5573000
> 2011-05-13 23:13:28,159 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /192.168.28.214:50050, dest: /192.168.28.212:38574, bytes: 0, op: HDFS_READ, cliID: DFSClient_attempt_201105131125_0025_m_001444_0, offset: 197632, srvID: DS-443352839-145.100.2.183-50050-1291128673616, blockid: blk_-9163184839986480695_4112368, duration: 16939000
> 2011-05-13 23:13:28,209 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Received block blk_-9123390874940601805_2898225 src: /192.168.28.210:44227 dest: /192.168.28.214:50050 of size 300441
> 2011-05-13 23:13:28,217 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /192.168.28.214:50050, dest: /192.168.28.213:42364, bytes: 0, op: HDFS_READ, cliID: DFSClient_attempt_201105131125_0025_m_001451_0, offset: 198656, srvID: DS-443352839-145.100.2.183-50050-1291128673616, blockid: blk_-9163184839986480695_4112368, duration: 5291000
> 2011-05-13 23:13:28,252 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /192.168.28.214:50050, dest: /192.168.28.214:32930, bytes: 0, op: HDFS_READ, cliID: DFSClient_attempt_201105131125_0025_m_001436_0, offset: 0, srvID: DS-443352839-145.100.2.183-50050-1291128673616, blockid: blk_-1800696633107072247_4099834, duration: 5099000
> 2011-05-13 23:13:28,256 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /192.168.28.214:50050, dest: /192.168.28.213:42363, bytes: 0, op: HDFS_READ, cliID: DFSClient_attempt_201105131125_0025_m_001458_0, offset: 199680, srvID: DS-443352839-145.100.2.183-50050-1291128673616, blockid: blk_-9163184839986480695_4112368, duration: 4945000
> 2011-05-13 23:13:28,257 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /192.168.28.214:50050, dest: /192.168.28.214:35137, bytes: 0, op: HDFS_READ, cliID: DFSClient_attempt_201105131125_0025_m_001436_0, offset: 196608, srvID: DS-443352839-145.100.2.183-50050-1291128673616, blockid: blk_-9163184839986480695_4112368, duration: 4159000
> 2011-05-13 23:13:28,258 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in receiveBlock for block blk_-9140444589483291821_3585975 java.io.EOFException: while trying to read 100 bytes
> 2011-05-13 23:13:28,258 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Block blk_-9140444589483291821_3585975 unfinalized and removed.
> 2011-05-13 23:13:28,258 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock blk_-9140444589483291821_3585975 received exception java.io.EOFException: while trying to read 100 bytes
> 2011-05-13 23:13:28,259 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(192.168.28.214:50050, storageID=DS-443352839-145.100.2.183-50050-1291128673616, infoPort=50075, ipcPort=50020):DataXceiver
> java.io.EOFException: while trying to read 100 bytes
>        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readToBuf(BlockReceiver.java:270)
>        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readNextPacket(BlockReceiver.java:357)
>        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:378)
>        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:534)
>        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:417)
>        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:122)
> 2011-05-13 23:13:28,264 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /192.168.28.214:50050, dest: /192.168.28.212:38553, bytes: 0, op: HDFS_READ, cliID: DFSClient_attempt_201105131125_0025_m_001441_0, offset: 0, srvID: DS-443352839-145.100.2.183-50050-1291128673616, blockid: blk_-5819719631677148140_4098274, duration: 5625000
> 2011-05-13 23:13:28,264 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /192.168.28.214:50050, dest: /192.168.28.212:38535, bytes: 0, op: HDFS_READ, cliID: DFSClient_attempt_201105131125_0025_m_001438_0, offset: 196608, srvID: DS-443352839-145.100.2.183-50050-1291128673616, blockid: blk_-9163184839986480695_4112368, duration: 4473000
> 2011-05-13 23:13:28,265 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(192.168.28.214:50050, storageID=DS-443352839-145.100.2.183-50050-1291128673616, infoPort=50075, ipcPort=50020): Exception writing block blk_-9150014886921014525_2267869 to mirror 192.168.28.213:50050
> java.io.IOException: The stream is closed
>        at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:108)
>        at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
>        at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
>        at java.io.DataOutputStream.flush(DataOutputStream.java:106)
>        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:540)
>        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:417)
>        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:122)
>
> 2011-05-13 23:13:28,265 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /192.168.28.214:50050, dest: /192.168.28.213:45484, bytes: 0, op: HDFS_READ, cliID: DFSClient_attempt_201105131125_0025_m_001432_0, offset: 0, srvID: DS-443352839-145.100.2.183-50050-1291128673616, blockid: blk_405051931214094755_4098504, duration: 5597000
> 2011-05-13 23:13:28,273 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Received block blk_-9150014886921014525_2267869 src: /192.168.28.211:49208 dest: /192.168.28.214:50050 of size 3033173
> 2011-05-13 23:13:28,313 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Received block blk_-9144765354308563975_3310572 src: /192.168.28.211:51592 dest: /192.168.28.214:50050 of size 242383
>
> ________________________________________
> From: Segel, Mike [msegel@navteq.com]
> Sent: Friday, May 13, 2011 2:36 PM
> To: general@hadoop.apache.org
> Cc: <cd...@cloudera.org>; <ge...@hadoop.apache.org>
> Subject: Re: Stability issue - dead DN's
>
> Bonded will work but you may not see the performance you would expect. If you need >1 GbE, go 10GbE - less headache and even more headroom.
>
> Multiple interfaces won't work. Or I should say didn't work in past releases.
> If you think about it, clients have to connect to each node. So having two interfaces and trying to manage them makes no sense.
>
> Add to this trying to manage this in DNS ... Why make more work for yourself?
> Going from memory... It looked like your rDNS had to match your hostnames, so your internal interfaces had to match hostnames, so you had an inverted network.
>
> If you draw out your network topology you end up with a ladder.
> You would be better off (IMHO) to create a subnet where only your edge servers are dual nic'd.
> But then if your cluster is for development... Now your PCs can't be used as clients...
>
> Does this make sense?
>
>
> Sent from a remote device. Please excuse any typos...
>
> Mike Segel
>
> On May 13, 2011, at 4:57 AM, "Evert Lammerts" <Ev...@sara.nl> wrote:
>
>> Hi Mike,
>>
>>> You really really don't want to do this.
>>> Long story short... It won't work.
>>
>> Can you elaborate? Are you talking about the bonded interfaces or about having a separated network for interconnects and external network? What can go wrong there?
>>
>>>
>>> Just a suggestion.. You don't want anyone on your cluster itself. They
>>> should interact with edge nodes, which are 'Hadoop aware'. Then your
>>> cluster has a single network to worry about.
>>
>> That's our current setup. We have a single headnode that is used as a SPOE. However, I'd like to change that on our future production system. We want to implement Kerberos for authentication, and let users interact with the cluster from their own machines. This would enable them to submit their jobs from their local IDE. My understanding is that the only way to do this is by opening up the Hadoop ports to the world: if people interact with HDFS they need to be able to interact with all nodes, right? What would be the argument against this?
>>
>> Cheers,
>> Evert
>>
>>>
>>>
>>> Sent from a remote device. Please excuse any typos...
>>>
>>> Mike Segel
>>>
>>> On May 11, 2011, at 11:45 AM, Allen Wittenauer <aw...@apache.org> wrote:
>>>
>>>>
>>>>
>>>>
>>>>
>>>>>> * a 2x1GE bonded network interface for interconnects
>>>>>> * a 2x1GE bonded network interface for external access
>>>>
>>>>  Multiple NICs on a box can sometimes cause big performance
>>> problems with Hadoop.  So watch your traffic carefully.
>>>>
>>>>
>>>>
>
>



        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:417)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:122)
2011-05-13 23:13:28,264 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /192.168.28.214:50050, dest: /192.168.28.212:38553, bytes: 0, op: HDFS_READ, cliID: DFSClient_attempt_201105131125_0025_m_001441_0, offset: 0, srvID: DS-443352839-145.100.2.183-50050-1291128673616, blockid: blk_-5819719631677148140_4098274, duration: 5625000
2011-05-13 23:13:28,264 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /192.168.28.214:50050, dest: /192.168.28.212:38535, bytes: 0, op: HDFS_READ, cliID: DFSClient_attempt_201105131125_0025_m_001438_0, offset: 196608, srvID: DS-443352839-145.100.2.183-50050-1291128673616, blockid: blk_-9163184839986480695_4112368, duration: 4473000
2011-05-13 23:13:28,265 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(192.168.28.214:50050, storageID=DS-443352839-145.100.2.183-50050-1291128673616, infoPort=50075, ipcPort=50020): Exception writing block blk_-9150014886921014525_2267869 to mirror 192.168.28.213:50050
java.io.IOException: The stream is closed
        at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:108)
        at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
        at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
        at java.io.DataOutputStream.flush(DataOutputStream.java:106)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:540)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:417)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:122)

2011-05-13 23:13:28,265 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /192.168.28.214:50050, dest: /192.168.28.213:45484, bytes: 0, op: HDFS_READ, cliID: DFSClient_attempt_201105131125_0025_m_001432_0, offset: 0, srvID: DS-443352839-145.100.2.183-50050-1291128673616, blockid: blk_405051931214094755_4098504, duration: 5597000
2011-05-13 23:13:28,273 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Received block blk_-9150014886921014525_2267869 src: /192.168.28.211:49208 dest: /192.168.28.214:50050 of size 3033173
2011-05-13 23:13:28,313 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Received block blk_-9144765354308563975_3310572 src: /192.168.28.211:51592 dest: /192.168.28.214:50050 of size 242383

________________________________________
From: Segel, Mike [msegel@navteq.com]
Sent: Friday, May 13, 2011 2:36 PM
To: general@hadoop.apache.org
Cc: <cd...@cloudera.org>; <ge...@hadoop.apache.org>
Subject: Re: Stability issue - dead DN's

Bonded will work but you may not see the performance you would expect. If you need more than 1GbE, go 10GbE: less headache and even more headroom.

Multiple interfaces won't work. Or I should say didn't work in past releases.
If you think about it, clients have to connect to each node. So having two interfaces and trying to manage them makes no sense.

Add to this trying to manage this in DNS ... Why make more work for yourself?
Going from memory... It looked like your rDNS had to match your hostnames, so your internal interfaces had to match hostnames and you ended up with an inverted network.

If you draw out your network topology you end up with a ladder.
You would be better off (IMHO) to create a subnet where only your edge servers are dual nic'd.
But then if your cluster is for development... Now your PCs can't be used as clients...

Does this make sense?


Sent from a remote device. Please excuse any typos...

Mike Segel

On May 13, 2011, at 4:57 AM, "Evert Lammerts" <Ev...@sara.nl> wrote:

> Hi Mike,
>
>> You really really don't want to do this.
>> Long story short... It won't work.
>
> Can you elaborate? Are you talking about the bonded interfaces or about having a separated network for interconnects and external network? What can go wrong there?
>
>>
>> Just a suggestion.. You don't want anyone on your cluster itself. They
>> should interact with edge nodes, which are 'Hadoop aware'. Then your
>> cluster has a single network to worry about.
>
> That's our current setup. We have a single headnode that is used as a SPOE. However, I'd like to change that on our future production system. We want to implement Kerberos for authentication and let users interact with the cluster from their own machines. This would enable them to submit their jobs from their local IDE. My understanding is that the only way to do this is by opening up the Hadoop ports to the world: if people interact with HDFS they need to be able to reach all nodes, right? What would be the argument against this?
>
> Cheers,
> Evert
>
>>
>>
>> Sent from a remote device. Please excuse any typos...
>>
>> Mike Segel
>>
>> On May 11, 2011, at 11:45 AM, Allen Wittenauer <aw...@apache.org> wrote:
>>
>>>
>>>
>>>
>>>
>>>>> * a 2x1GE bonded network interface for interconnects
>>>>> * a 2x1GE bonded network interface for external access
>>>
>>>   Multiple NICs on a box can sometimes cause big performance
>> problems with Hadoop.  So watch your traffic carefully.
>>>
>>>
>>>



Re: Stability issue - dead DN's

Posted by "Segel, Mike" <ms...@navteq.com>.
Bonded will work but you may not see the performance you would expect. If you need more than 1GbE, go 10GbE: less headache and even more headroom.

Multiple interfaces won't work. Or I should say didn't work in past releases. 
If you think about it, clients have to connect to each node. So having two interfaces and trying to manage them makes no sense. 

Add to this trying to manage this in DNS ... Why make more work for yourself?
Going from memory... It looked like your rDNS had to match your hostnames, so your internal interfaces had to match hostnames and you ended up with an inverted network.

If you draw out your network topology you end up with a ladder. 
You would be better off (IMHO) to create a subnet where only your edge servers are dual nic'd.
But then if your cluster is for development... Now your PCs can't be used as clients...

Does this make sense?


Sent from a remote device. Please excuse any typos...

Mike Segel

On May 13, 2011, at 4:57 AM, "Evert Lammerts" <Ev...@sara.nl> wrote:

> Hi Mike,
> 
>> You really really don't want to do this.
>> Long story short... It won't work.
> 
> Can you elaborate? Are you talking about the bonded interfaces or about having a separated network for interconnects and external network? What can go wrong there?
> 
>> 
>> Just a suggestion.. You don't want anyone on your cluster itself. They
>> should interact with edge nodes, which are 'Hadoop aware'. Then your
>> cluster has a single network to worry about.
> 
> That's our current setup. We have a single headnode that is used as a SPOE. However, I'd like to change that on our future production system. We want to implement Kerberos for authentication and let users interact with the cluster from their own machines. This would enable them to submit their jobs from their local IDE. My understanding is that the only way to do this is by opening up the Hadoop ports to the world: if people interact with HDFS they need to be able to reach all nodes, right? What would be the argument against this?
> 
> Cheers,
> Evert
> 
>> 
>> 
>> Sent from a remote device. Please excuse any typos...
>> 
>> Mike Segel
>> 
>> On May 11, 2011, at 11:45 AM, Allen Wittenauer <aw...@apache.org> wrote:
>> 
>>> 
>>> 
>>> 
>>> 
>>>>> * a 2x1GE bonded network interface for interconnects
>>>>> * a 2x1GE bonded network interface for external access
>>> 
>>>   Multiple NICs on a box can sometimes cause big performance
>> problems with Hadoop.  So watch your traffic carefully.
>>> 
>>> 
>>> 



RE: Stability issue - dead DN's

Posted by Evert Lammerts <Ev...@sara.nl>.
Hi Mike,

> You really really don't want to do this.
> Long story short... It won't work.

Can you elaborate? Are you talking about the bonded interfaces or about having a separated network for interconnects and external network? What can go wrong there?

>
> Just a suggestion.. You don't want anyone on your cluster itself. They
> should interact with edge nodes, which are 'Hadoop aware'. Then your
> cluster has a single network to worry about.

That's our current setup. We have a single headnode that is used as a SPOE. However, I'd like to change that on our future production system. We want to implement Kerberos for authentication and let users interact with the cluster from their own machines. This would enable them to submit their jobs from their local IDE. My understanding is that the only way to do this is by opening up the Hadoop ports to the world: if people interact with HDFS they need to be able to reach all nodes, right? What would be the argument against this?

Cheers,
Evert

>
>
> Sent from a remote device. Please excuse any typos...
>
> Mike Segel
>
> On May 11, 2011, at 11:45 AM, Allen Wittenauer <aw...@apache.org> wrote:
>
> >
> >
> >
> >
> >>> * a 2x1GE bonded network interface for interconnects
> >>> * a 2x1GE bonded network interface for external access
> >
> >    Multiple NICs on a box can sometimes cause big performance
> problems with Hadoop.  So watch your traffic carefully.
> >
> >
> >

Re: Stability issue - dead DN's

Posted by Michel Segel <mi...@hotmail.com>.
You really really don't want to do this.
Long story short... It won't work. 

Just a suggestion.. You don't want anyone on your cluster itself. They should interact with edge nodes, which are 'Hadoop aware'. Then your cluster has a single network to worry about.


Sent from a remote device. Please excuse any typos...

Mike Segel

On May 11, 2011, at 11:45 AM, Allen Wittenauer <aw...@apache.org> wrote:

> 
> 
> 
> 
>>> * a 2x1GE bonded network interface for interconnects
>>> * a 2x1GE bonded network interface for external access
> 
>    Multiple NICs on a box can sometimes cause big performance problems with Hadoop.  So watch your traffic carefully.
> 
> 
> 

Re: Stability issue - dead DN's

Posted by Allen Wittenauer <aw...@apache.org>.
On May 11, 2011, at 5:57 AM, Eric Fiala wrote:
> 
> If we do the math, that means [ map.tasks.max * mapred.child.java.opts ] +
> [ reduce.tasks.max * mapred.child.java.opts ], or [ 4 * 2.5G ] + [ 4 * 2.5G ] = 20G,
> which is greater than the 16G of physical RAM in the machine.
> This doesn't account for the base tasktracker and datanode processes + OS
> overhead and whatever else may be hoarding resources on the systems.

	+1 to what Eric said.

	You've exhausted memory and now the whole system is falling apart.  

> I would play with this ratio: either fewer maps / reduces max, or a lower
> child.java.opts, so that when you are fully subscribed you are not using
> more resources than the machine can offer.

	Yup.

> Also, setting mapred.reduce.slowstart.completed.maps to 1.00 or some other
> value close to 1 would be one way to guarantee that only 4 maps or 4 reduces
> run at once and to address (albeit in a duct-tape-like way) the
> oversubscription problem you are seeing (this property represents the fraction
> of maps that should complete before the reduce phase starts).

	slowstart isn't really going to help you much here.  All it takes is another job with the same settings running at the same time and processes will start dying again.  That said, the default for slowstart is incredibly stupid for the vast majority of workloads.  Something closer to .70 or .80 is more realistic.
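
	For example (just a sketch; the exact fraction is up to you):

	# mapred-site.xml - don't start reducers until ~80% of the maps are done
	mapred.reduce.slowstart.completed.maps: 0.80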


>> * a 2x1GE bonded network interface for interconnects
>> * a 2x1GE bonded network interface for external access

	Multiple NICs on a box can sometimes cause big performance problems with Hadoop.  So watch your traffic carefully.



Re: Stability issue - dead DN's

Posted by Eric Fiala <er...@fiala.ca>.
Evert,
From my experience, Hadoop loathes swap. You mention that all reducers and
mappers are running (8 total), and from the Ganglia screenshot I see that you
have a thick crest of that purple swap.

If we do the math, that means [ map.tasks.max * mapred.child.java.opts ] +
[ reduce.tasks.max * mapred.child.java.opts ], or [ 4 * 2.5G ] + [ 4 * 2.5G ] = 20G,
which is greater than the 16G of physical RAM in the machine.
This doesn't account for the base tasktracker and datanode processes + OS
overhead and whatever else may be hoarding resources on the systems.

I would play with this ratio: either fewer maps / reduces max, or a lower
child.java.opts, so that when you are fully subscribed you are not using
more resources than the machine can offer (see the sketch after this paragraph).
Also, setting mapred.reduce.slowstart.completed.maps to 1.00 or some other
value close to 1 would be one way to guarantee that only 4 maps or 4 reduces
run at once and to address (albeit in a duct-tape-like way) the
oversubscription problem you are seeing (this property represents the fraction
of maps that should complete before the reduce phase starts).
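
To make the ratio point concrete (these numbers are just examples, not
recommendations), either of the following keeps a fully subscribed node within
its 16G of RAM:

    # option 1: keep 4+4 slots, shrink the per-task heap
    mapred.tasktracker.map.tasks.maximum: 4
    mapred.tasktracker.reduce.tasks.maximum: 4
    mapred.child.java.opts: -Xmx1024m      # 8 tasks x 1G = 8G

    # option 2: keep the 2.5G heap, halve the slots
    mapred.tasktracker.map.tasks.maximum: 2
    mapred.tasktracker.reduce.tasks.maximum: 2
    mapred.child.java.opts: -Xmx2560m      # 4 tasks x 2.5G = 10G

Either way you leave headroom for the datanode, the tasktracker and the OS.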

hth

EF


On Wed, May 11, 2011 at 3:23 AM, Evert Lammerts <Ev...@sara.nl> wrote:

> Hi list,
>
> I notice that whenever our Hadoop installation is put under a heavy load we
> lose one or two (on a total of five) datanodes. This results in
> IOExceptions, and affects the overall performance of the job being run. Can
> anybody give me advise or best practices on a different configuration to
> increase the stability? Below I've included the specs of the cluster, the
> hadoop related config and an example of when which things go wrong. Any help
> is very much appreciated, and if I can provide any other info please let me
> know.
>
> Cheers,
> Evert
>
> == What goes wrong, and when ==
>
> See attached a screenshot of Ganglia when the cluster is under load of a
> single job. This job:
> * reads ~1TB from HDFS
> * writes ~200GB to HDFS
> * runs 288 Mappers and 35 Reducers
>
> When the job runs it takes all available Map and Reduce slots. The system
> starts swapping and there is a short time interval during which most cores
> are in WAIT. After that the job really starts running. At around half way,
> one or two datanodes become unreachable and are marked as dead nodes. The
> amount of under-replicated blocks becomes huge. Then some
> "java.io.IOException: Could not obtain block" are thrown in Mappers. The job
> does manage to finish successfully after around 3.5 hours, but my fear is
> that when we make the input much larger - which we want - the system becomes
> too unstable to finish the job.
>
> Maybe worth mentioning - never know what might help diagnostics.  We notice
> that memory usage becomes less when we switch our keys from Text to
> LongWritable. Also, the Mappers are done in a fraction of the time. However,
> this for some reason results in much more network traffic and makes Reducers
> extremely slow. We're working on figuring out what causes this.
>
>
> == The cluster ==
>
> We have a cluster that consists of 6 Sun Thumpers running Hadoop 0.20.2 on
> CentOS 5.5. One of them acts as NN and JT, the other 5 run DN's and TT's.
> Each node has:
> * 16GB RAM
> * 32GB swapspace
> * 4 cores
> * 11 LVM's of 4 x 500GB disks (2TB in total) for HDFS
> * non-HDFS stuff on separate disks
> * a 2x1GE bonded network interface for interconnects
> * a 2x1GE bonded network interface for external access
>
> I realize that this is not a well balanced system, but it's what we had
> available for a prototype environment. We're working on putting together a
> specification for a much larger production environment.
>
>
> == Hadoop config ==
>
> Here some properties that I think might be relevant:
>
> __CORE-SITE.XML__
>
> fs.inmemory.size.mb: 200
> mapreduce.task.io.sort.factor: 100
> mapreduce.task.io.sort.mb: 200
> # 1024*1024*4 MB, blocksize of the LVM's
> io.file.buffer.size: 4194304
>
> __HDFS-SITE.XML__
>
> # 1024*1024*4*32 MB, 32 times the blocksize of the LVM's
> dfs.block.size: 134217728
> # Only 5 DN's, but this shouldn't hurt
> dfs.namenode.handler.count: 40
> # This got rid of the occasional "Could not obtain block"'s
> dfs.datanode.max.xcievers: 4096
>
> __MAPRED-SITE.XML__
>
> mapred.tasktracker.map.tasks.maximum: 4
> mapred.tasktracker.reduce.tasks.maximum: 4
> mapred.child.java.opts: -Xmx2560m
> mapreduce.reduce.shuffle.parallelcopies: 20
> mapreduce.map.java.opts: -Xmx512m
> mapreduce.reduce.java.opts: -Xmx512m
> # Compression codecs are configured and seem to work fine
> mapred.compress.map.output: true
> mapred.map.output.compression.codec: com.hadoop.compression.lzo.LzoCodec
>
>

Re: Stability issue - dead DN's

Posted by Joey Echeverria <jo...@cloudera.com>.
Which version of hadoop are you running?

I'm pretty sure the problem is that you're overcommitting your RAM. Hadoop
really doesn't like swapping. I would try setting your
mapred.child.java.opts to
-Xmx1024m.
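
In mapred-site.xml that would look roughly like this (just a sketch of the one property):

    <property>
      <name>mapred.child.java.opts</name>
      <value>-Xmx1024m</value>
    </property>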

-Joey

On Wed, May 11, 2011 at 2:23 AM, Evert Lammerts <Ev...@sara.nl> wrote:
> Hi list,
>
> I notice that whenever our Hadoop installation is put under a heavy load we lose one or two (on a total of five) datanodes. This results in IOExceptions, and affects the overall performance of the job being run. Can anybody give me advise or best practices on a different configuration to increase the stability? Below I've included the specs of the cluster, the hadoop related config and an example of when which things go wrong. Any help is very much appreciated, and if I can provide any other info please let me know.
>
> Cheers,
> Evert
>
> == What goes wrong, and when ==
>
> See attached a screenshot of Ganglia when the cluster is under load of a single job. This job:
> * reads ~1TB from HDFS
> * writes ~200GB to HDFS
> * runs 288 Mappers and 35 Reducers
>
> When the job runs it takes all available Map and Reduce slots. The system starts swapping and there is a short time interval during which most cores are in WAIT. After that the job really starts running. At around half way, one or two datanodes become unreachable and are marked as dead nodes. The amount of under-replicated blocks becomes huge. Then some "java.io.IOException: Could not obtain block" are thrown in Mappers. The job does manage to finish successfully after around 3.5 hours, but my fear is that when we make the input much larger - which we want - the system becomes too unstable to finish the job.
>
> Maybe worth mentioning - never know what might help diagnostics.  We notice that memory usage becomes less when we switch our keys from Text to LongWritable. Also, the Mappers are done in a fraction of the time. However, this for some reason results in much more network traffic and makes Reducers extremely slow. We're working on figuring out what causes this.
>
>
> == The cluster ==
>
> We have a cluster that consists of 6 Sun Thumpers running Hadoop 0.20.2 on CentOS 5.5. One of them acts as NN and JT, the other 5 run DN's and TT's. Each node has:
> * 16GB RAM
> * 32GB swapspace
> * 4 cores
> * 11 LVM's of 4 x 500GB disks (2TB in total) for HDFS
> * non-HDFS stuff on separate disks
> * a 2x1GE bonded network interface for interconnects
> * a 2x1GE bonded network interface for external access
>
> I realize that this is not a well balanced system, but it's what we had available for a prototype environment. We're working on putting together a specification for a much larger production environment.
>
>
> == Hadoop config ==
>
> Here some properties that I think might be relevant:
>
> __CORE-SITE.XML__
>
> fs.inmemory.size.mb: 200
> mapreduce.task.io.sort.factor: 100
> mapreduce.task.io.sort.mb: 200
> # 1024*1024*4 MB, blocksize of the LVM's
> io.file.buffer.size: 4194304
>
> __HDFS-SITE.XML__
>
> # 1024*1024*4*32 MB, 32 times the blocksize of the LVM's
> dfs.block.size: 134217728
> # Only 5 DN's, but this shouldn't hurt
> dfs.namenode.handler.count: 40
> # This got rid of the occasional "Could not obtain block"'s
> dfs.datanode.max.xcievers: 4096
>
> __MAPRED-SITE.XML__
>
> mapred.tasktracker.map.tasks.maximum: 4
> mapred.tasktracker.reduce.tasks.maximum: 4
> mapred.child.java.opts: -Xmx2560m
> mapreduce.reduce.shuffle.parallelcopies: 20
> mapreduce.map.java.opts: -Xmx512m
> mapreduce.reduce.java.opts: -Xmx512m
> # Compression codecs are configured and seem to work fine
> mapred.compress.map.output: true
> mapred.map.output.compression.codec: com.hadoop.compression.lzo.LzoCodec
>
>



-- 
Joseph Echeverria
Cloudera, Inc.
443.305.9434

Re: Stability issue - dead DN's

Posted by James Seigel <ja...@tynt.com>.
Have you dug into the dead DN logs as well?

James

Sent from my mobile. Please excuse the typos.

On 2011-05-11, at 7:39 AM, Evert Lammerts <Ev...@sara.nl> wrote:

> Hi James,
>
> Hadoop version is 0.20.2 (find that and more on our setup also in my first mail, under heading "The cluster").
>
> Below are I) an example stack trace from losing a datanode and II) an example of a "Could not obtain block" IOException.
>
> Cheers,
> Evert
>
> 11/05/11 15:06:43 INFO hdfs.DFSClient: Failed to connect to /192.168.28.214:50050, add to deadNodes and continue
> java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch :
> java.nio.channels.SocketChannel[connected local=/192.168.28.209:50726 remote=/192.168.28.214:50050]
>        at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
>        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
>        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
>        at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
>        at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
>        at java.io.DataInputStream.readShort(DataInputStream.java:295)
>        at org.apache.hadoop.hdfs.DFSClient$BlockReader.newBlockReader(DFSClient.java:1478)
>        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1811)
>        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1948)
>        at java.io.DataInputStream.readFully(DataInputStream.java:178)
>        at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:63)
>        at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:101)
>        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1945)
>        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1845)
>        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1891)
>        at org.apache.mahout.common.iterator.sequencefile.SequenceFileIterator.computeNext(SequenceFileIterator.java:95)
>        at org.apache.mahout.common.iterator.sequencefile.SequenceFileIterator.computeNext(SequenceFileIterator.java:1)
>        at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:135)
>        at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:130)
>        at nl.liacs.infrawatch.hadoop.kmeans.RandomSeedGenerator.buildRandom(RandomSeedGenerator.java:85)
>        at nl.liacs.infrawatch.hadoop.kmeans.Job.run(Job.java:171)
>        at nl.liacs.infrawatch.hadoop.kmeans.Job.main(Job.java:74)
>        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>        at java.lang.reflect.Method.invoke(Method.java:597)
>        at org.apache.hadoop.util.RunJar.main(RunJar.java:186)
>
>
> 11/05/10 09:43:47 INFO mapred.JobClient:  map 82% reduce 17% 11/05/10 09:44:39 INFO mapred.JobClient: Task Id :
> attempt_201104121440_0122_m_000225_0, Status : FAILED
> java.io.IOException: Could not obtain block:
> blk_4397122445076815438_4097927
> file=/user/joaquin/data/20081201/20081201.039
>        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:1993)
>        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1800)
>        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1948)
>        at java.io.DataInputStream.read(DataInputStream.java:83)
>        at org.apache.hadoop.util.LineReader.readLine(LineReader.java:134)
>        at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.nextKeyValue(LineRecordReader.java:97)
>        at nl.liacs.infrawatch.hadoop.kmeans.KeyValueLineRecordReader.nextKeyValue(KeyValueLineRecordReader.java:94)
>        at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:455)
>        at org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67)
>        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)
>        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:646)
>        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:322)
>        at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
>        at java.security.AccessController.doPrivileged(Native Method)
>        at javax.security.auth.Subject.doAs(Subject.java:396)
>        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1115)
>        at org.apache.hadoop.mapred.Child.main(Child.java:262)
>
>> -----Original Message-----
>> From: James Seigel [mailto:james@tynt.com]
>> Sent: woensdag 11 mei 2011 14:54
>> To: general@hadoop.apache.org
>> Subject: Re: Stability issue - dead DN's
>>
>> Evert,
>>
>> What's the stack trace and what version of hadoop do you have installed
>> Sir!
>>
>> James.
>> On 2011-05-11, at 3:23 AM, Evert Lammerts wrote:
>>
>>> Hi list,
>>>
>>> I notice that whenever our Hadoop installation is put under a heavy
>> load we lose one or two (on a total of five) datanodes. This results in
>> IOExceptions, and affects the overall performance of the job being run.
>> Can anybody give me advise or best practices on a different
>> configuration to increase the stability? Below I've included the specs
>> of the cluster, the hadoop related config and an example of when which
>> things go wrong. Any help is very much appreciated, and if I can
>> provide any other info please let me know.
>>>
>>> Cheers,
>>> Evert
>>>
>>> == What goes wrong, and when ==
>>>
>>> See attached a screenshot of Ganglia when the cluster is under load
>> of a single job. This job:
>>> * reads ~1TB from HDFS
>>> * writes ~200GB to HDFS
>>> * runs 288 Mappers and 35 Reducers
>>>
>>> When the job runs it takes all available Map and Reduce slots. The
>> system starts swapping and there is a short time interval during which
>> most cores are in WAIT. After that the job really starts running. At
>> around half way, one or two datanodes become unreachable and are marked
>> as dead nodes. The amount of under-replicated blocks becomes huge. Then
>> some "java.io.IOException: Could not obtain block" are thrown in
>> Mappers. The job does manage to finish successfully after around 3.5
>> hours, but my fear is that when we make the input much larger - which
>> we want - the system becomes too unstable to finish the job.
>>>
>>> Maybe worth mentioning - never know what might help diagnostics.  We
>> notice that memory usage becomes less when we switch our keys from Text
>> to LongWritable. Also, the Mappers are done in a fraction of the time.
>> However, this for some reason results in much more network traffic and
>> makes Reducers extremely slow. We're working on figuring out what
>> causes this.
>>>
>>>
>>> == The cluster ==
>>>
>>> We have a cluster that consists of 6 Sun Thumpers running Hadoop
>> 0.20.2 on CentOS 5.5. One of them acts as NN and JT, the other 5 run
>> DN's and TT's. Each node has:
>>> * 16GB RAM
>>> * 32GB swapspace
>>> * 4 cores
>>> * 11 LVM's of 4 x 500GB disks (2TB in total) for HDFS
>>> * non-HDFS stuff on separate disks
>>> * a 2x1GE bonded network interface for interconnects
>>> * a 2x1GE bonded network interface for external access
>>>
>>> I realize that this is not a well balanced system, but it's what we
>> had available for a prototype environment. We're working on putting
>> together a specification for a much larger production environment.
>>>
>>>
>>> == Hadoop config ==
>>>
>>> Here some properties that I think might be relevant:
>>>
>>> __CORE-SITE.XML__
>>>
>>> fs.inmemory.size.mb: 200
>>> mapreduce.task.io.sort.factor: 100
>>> mapreduce.task.io.sort.mb: 200
>>> # 1024*1024*4 MB, blocksize of the LVM's
>>> io.file.buffer.size: 4194304
>>>
>>> __HDFS-SITE.XML__
>>>
>>> # 1024*1024*4*32 MB, 32 times the blocksize of the LVM's
>>> dfs.block.size: 134217728
>>> # Only 5 DN's, but this shouldn't hurt
>>> dfs.namenode.handler.count: 40
>>> # This got rid of the occasional "Could not obtain block"'s
>>> dfs.datanode.max.xcievers: 4096
>>>
>>> __MAPRED-SITE.XML__
>>>
>>> mapred.tasktracker.map.tasks.maximum: 4
>>> mapred.tasktracker.reduce.tasks.maximum: 4
>>> mapred.child.java.opts: -Xmx2560m
>>> mapreduce.reduce.shuffle.parallelcopies: 20
>>> mapreduce.map.java.opts: -Xmx512m
>>> mapreduce.reduce.java.opts: -Xmx512m
>>> # Compression codecs are configured and seem to work fine
>>> mapred.compress.map.output: true
>>> mapred.map.output.compression.codec:
>> com.hadoop.compression.lzo.LzoCodec
>>>
>

Re: Stability issue - dead DN's

Posted by James Seigel <ja...@tynt.com>.
I noticed you crossposted to the Cloudera list; what version of theirs are you running?

Cheers
James

Sent from my mobile. Please excuse the typos.

On 2011-05-11, at 7:39 AM, Evert Lammerts <Ev...@sara.nl> wrote:

> Hi James,
>
> Hadoop version is 0.20.2 (find that and more on our setup also in my first mail, under heading "The cluster").
>
> Below are I) an example stack trace from losing a datanode and II) an example of a "Could not obtain block" IOException.
>
> Cheers,
> Evert
>
> 11/05/11 15:06:43 INFO hdfs.DFSClient: Failed to connect to /192.168.28.214:50050, add to deadNodes and continue
> java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch :
> java.nio.channels.SocketChannel[connected local=/192.168.28.209:50726 remote=/192.168.28.214:50050]
>        at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
>        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
>        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
>        at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
>        at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
>        at java.io.DataInputStream.readShort(DataInputStream.java:295)
>        at org.apache.hadoop.hdfs.DFSClient$BlockReader.newBlockReader(DFSClient.java:1478)
>        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1811)
>        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1948)
>        at java.io.DataInputStream.readFully(DataInputStream.java:178)
>        at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:63)
>        at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:101)
>        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1945)
>        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1845)
>        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1891)
>        at org.apache.mahout.common.iterator.sequencefile.SequenceFileIterator.computeNext(SequenceFileIterator.java:95)
>        at org.apache.mahout.common.iterator.sequencefile.SequenceFileIterator.computeNext(SequenceFileIterator.java:1)
>        at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:135)
>        at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:130)
>        at nl.liacs.infrawatch.hadoop.kmeans.RandomSeedGenerator.buildRandom(RandomSeedGenerator.java:85)
>        at nl.liacs.infrawatch.hadoop.kmeans.Job.run(Job.java:171)
>        at nl.liacs.infrawatch.hadoop.kmeans.Job.main(Job.java:74)
>        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>        at java.lang.reflect.Method.invoke(Method.java:597)
>        at org.apache.hadoop.util.RunJar.main(RunJar.java:186)
>
>
> 11/05/10 09:43:47 INFO mapred.JobClient:  map 82% reduce 17% 11/05/10 09:44:39 INFO mapred.JobClient: Task Id :
> attempt_201104121440_0122_m_000225_0, Status : FAILED
> java.io.IOException: Could not obtain block:
> blk_4397122445076815438_4097927
> file=/user/joaquin/data/20081201/20081201.039
>        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:1993)
>        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1800)
>        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1948)
>        at java.io.DataInputStream.read(DataInputStream.java:83)
>        at org.apache.hadoop.util.LineReader.readLine(LineReader.java:134)
>        at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.nextKeyValue(LineRecordReader.java:97)
>        at nl.liacs.infrawatch.hadoop.kmeans.KeyValueLineRecordReader.nextKeyValue(KeyValueLineRecordReader.java:94)
>        at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:455)
>        at org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67)
>        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)
>        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:646)
>        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:322)
>        at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
>        at java.security.AccessController.doPrivileged(Native Method)
>        at javax.security.auth.Subject.doAs(Subject.java:396)
>        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1115)
>        at org.apache.hadoop.mapred.Child.main(Child.java:262)
>
>> -----Original Message-----
>> From: James Seigel [mailto:james@tynt.com]
>> Sent: woensdag 11 mei 2011 14:54
>> To: general@hadoop.apache.org
>> Subject: Re: Stability issue - dead DN's
>>
>> Evert,
>>
>> What's the stack trace and what version of hadoop do you have installed
>> Sir!
>>
>> James.
>> On 2011-05-11, at 3:23 AM, Evert Lammerts wrote:
>>
>>> Hi list,
>>>
>>> I notice that whenever our Hadoop installation is put under a heavy
>> load we lose one or two (on a total of five) datanodes. This results in
>> IOExceptions, and affects the overall performance of the job being run.
>> Can anybody give me advise or best practices on a different
>> configuration to increase the stability? Below I've included the specs
>> of the cluster, the hadoop related config and an example of when which
>> things go wrong. Any help is very much appreciated, and if I can
>> provide any other info please let me know.
>>>
>>> Cheers,
>>> Evert
>>>
>>> == What goes wrong, and when ==
>>>
>>> See attached a screenshot of Ganglia when the cluster is under load
>> of a single job. This job:
>>> * reads ~1TB from HDFS
>>> * writes ~200GB to HDFS
>>> * runs 288 Mappers and 35 Reducers
>>>
>>> When the job runs it takes all available Map and Reduce slots. The
>> system starts swapping and there is a short time interval during which
>> most cores are in WAIT. After that the job really starts running. At
>> around half way, one or two datanodes become unreachable and are marked
>> as dead nodes. The amount of under-replicated blocks becomes huge. Then
>> some "java.io.IOException: Could not obtain block" are thrown in
>> Mappers. The job does manage to finish successfully after around 3.5
>> hours, but my fear is that when we make the input much larger - which
>> we want - the system becomes too unstable to finish the job.
>>>
>>> Maybe worth mentioning - never know what might help diagnostics.  We
>> notice that memory usage becomes less when we switch our keys from Text
>> to LongWritable. Also, the Mappers are done in a fraction of the time.
>> However, this for some reason results in much more network traffic and
>> makes Reducers extremely slow. We're working on figuring out what
>> causes this.
>>>
>>>
>>> == The cluster ==
>>>
>>> We have a cluster that consists of 6 Sun Thumpers running Hadoop
>> 0.20.2 on CentOS 5.5. One of them acts as NN and JT, the other 5 run
>> DN's and TT's. Each node has:
>>> * 16GB RAM
>>> * 32GB swapspace
>>> * 4 cores
>>> * 11 LVM's of 4 x 500GB disks (2TB in total) for HDFS
>>> * non-HDFS stuff on separate disks
>>> * a 2x1GE bonded network interface for interconnects
>>> * a 2x1GE bonded network interface for external access
>>>
>>> I realize that this is not a well balanced system, but it's what we
>> had available for a prototype environment. We're working on putting
>> together a specification for a much larger production environment.
>>>
>>>
>>> == Hadoop config ==
>>>
>>> Here some properties that I think might be relevant:
>>>
>>> __CORE-SITE.XML__
>>>
>>> fs.inmemory.size.mb: 200
>>> mapreduce.task.io.sort.factor: 100
>>> mapreduce.task.io.sort.mb: 200
>>> # 1024*1024*4 MB, blocksize of the LVM's
>>> io.file.buffer.size: 4194304
>>>
>>> __HDFS-SITE.XML__
>>>
>>> # 1024*1024*4*32 MB, 32 times the blocksize of the LVM's
>>> dfs.block.size: 134217728
>>> # Only 5 DN's, but this shouldn't hurt
>>> dfs.namenode.handler.count: 40
>>> # This got rid of the occasional "Could not obtain block"'s
>>> dfs.datanode.max.xcievers: 4096
>>>
>>> __MAPRED-SITE.XML__
>>>
>>> mapred.tasktracker.map.tasks.maximum: 4
>>> mapred.tasktracker.reduce.tasks.maximum: 4
>>> mapred.child.java.opts: -Xmx2560m
>>> mapreduce.reduce.shuffle.parallelcopies: 20
>>> mapreduce.map.java.opts: -Xmx512m
>>> mapreduce.reduce.java.opts: -Xmx512m
>>> # Compression codecs are configured and seem to work fine
>>> mapred.compress.map.output: true
>>> mapred.map.output.compression.codec:
>> com.hadoop.compression.lzo.LzoCodec
>>>
>

RE: Stability issue - dead DN's

Posted by Evert Lammerts <Ev...@sara.nl>.
Hi James,

Hadoop version is 0.20.2 (find that and more on our setup also in my first mail, under heading "The cluster").

Below are I) an example stack trace from losing a datanode and II) an example of a "Could not obtain block" IOException.

Cheers,
Evert

11/05/11 15:06:43 INFO hdfs.DFSClient: Failed to connect to /192.168.28.214:50050, add to deadNodes and continue
java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch :
java.nio.channels.SocketChannel[connected local=/192.168.28.209:50726 remote=/192.168.28.214:50050]
        at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
        at java.io.DataInputStream.readShort(DataInputStream.java:295)
        at org.apache.hadoop.hdfs.DFSClient$BlockReader.newBlockReader(DFSClient.java:1478)
        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1811)
        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1948)
        at java.io.DataInputStream.readFully(DataInputStream.java:178)
        at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:63)
        at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:101)
        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1945)
        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1845)
        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1891)
        at org.apache.mahout.common.iterator.sequencefile.SequenceFileIterator.computeNext(SequenceFileIterator.java:95)
        at org.apache.mahout.common.iterator.sequencefile.SequenceFileIterator.computeNext(SequenceFileIterator.java:1)
        at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:135)
        at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:130)
        at nl.liacs.infrawatch.hadoop.kmeans.RandomSeedGenerator.buildRandom(RandomSeedGenerator.java:85)
        at nl.liacs.infrawatch.hadoop.kmeans.Job.run(Job.java:171)
        at nl.liacs.infrawatch.hadoop.kmeans.Job.main(Job.java:74)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:186)


11/05/10 09:43:47 INFO mapred.JobClient:  map 82% reduce 17%
11/05/10 09:44:39 INFO mapred.JobClient: Task Id : attempt_201104121440_0122_m_000225_0, Status : FAILED
java.io.IOException: Could not obtain block: blk_4397122445076815438_4097927 file=/user/joaquin/data/20081201/20081201.039
        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:1993)
        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1800)
        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1948)
        at java.io.DataInputStream.read(DataInputStream.java:83)
        at org.apache.hadoop.util.LineReader.readLine(LineReader.java:134)
        at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.nextKeyValue(LineRecordReader.java:97)
        at nl.liacs.infrawatch.hadoop.kmeans.KeyValueLineRecordReader.nextKeyValue(KeyValueLineRecordReader.java:94)
        at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:455)
        at org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:646)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:322)
        at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1115)
        at org.apache.hadoop.mapred.Child.main(Child.java:262)

> -----Original Message-----
> From: James Seigel [mailto:james@tynt.com]
> Sent: woensdag 11 mei 2011 14:54
> To: general@hadoop.apache.org
> Subject: Re: Stability issue - dead DN's
>
> Evert,
>
> What's the stack trace and what version of hadoop do you have installed
> Sir!
>
> James.
> On 2011-05-11, at 3:23 AM, Evert Lammerts wrote:
>
> > Hi list,
> >
> > I notice that whenever our Hadoop installation is put under a heavy
> load we lose one or two (on a total of five) datanodes. This results in
> IOExceptions, and affects the overall performance of the job being run.
> Can anybody give me advise or best practices on a different
> configuration to increase the stability? Below I've included the specs
> of the cluster, the hadoop related config and an example of when which
> things go wrong. Any help is very much appreciated, and if I can
> provide any other info please let me know.
> >
> > Cheers,
> > Evert
> >
> > == What goes wrong, and when ==
> >
> > See attached a screenshot of Ganglia when the cluster is under load
> of a single job. This job:
> > * reads ~1TB from HDFS
> > * writes ~200GB to HDFS
> > * runs 288 Mappers and 35 Reducers
> >
> > When the job runs it takes all available Map and Reduce slots. The
> system starts swapping and there is a short time interval during which
> most cores are in WAIT. After that the job really starts running. At
> around half way, one or two datanodes become unreachable and are marked
> as dead nodes. The amount of under-replicated blocks becomes huge. Then
> some "java.io.IOException: Could not obtain block" are thrown in
> Mappers. The job does manage to finish successfully after around 3.5
> hours, but my fear is that when we make the input much larger - which
> we want - the system becomes too unstable to finish the job.
> >
> > Maybe worth mentioning - never know what might help diagnostics.  We
> notice that memory usage becomes less when we switch our keys from Text
> to LongWritable. Also, the Mappers are done in a fraction of the time.
> However, this for some reason results in much more network traffic and
> makes Reducers extremely slow. We're working on figuring out what
> causes this.
> >
> >
> > == The cluster ==
> >
> > We have a cluster that consists of 6 Sun Thumpers running Hadoop
> 0.20.2 on CentOS 5.5. One of them acts as NN and JT, the other 5 run
> DN's and TT's. Each node has:
> > * 16GB RAM
> > * 32GB swapspace
> > * 4 cores
> > * 11 LVM's of 4 x 500GB disks (2TB in total) for HDFS
> > * non-HDFS stuff on separate disks
> > * a 2x1GE bonded network interface for interconnects
> > * a 2x1GE bonded network interface for external access
> >
> > I realize that this is not a well balanced system, but it's what we
> had available for a prototype environment. We're working on putting
> together a specification for a much larger production environment.
> >
> >
> > == Hadoop config ==
> >
> > Here some properties that I think might be relevant:
> >
> > __CORE-SITE.XML__
> >
> > fs.inmemory.size.mb: 200
> > mapreduce.task.io.sort.factor: 100
> > mapreduce.task.io.sort.mb: 200
> > # 1024*1024*4 MB, blocksize of the LVM's
> > io.file.buffer.size: 4194304
> >
> > __HDFS-SITE.XML__
> >
> > # 1024*1024*4*32 MB, 32 times the blocksize of the LVM's
> > dfs.block.size: 134217728
> > # Only 5 DN's, but this shouldn't hurt
> > dfs.namenode.handler.count: 40
> > # This got rid of the occasional "Could not obtain block"'s
> > dfs.datanode.max.xcievers: 4096
> >
> > __MAPRED-SITE.XML__
> >
> > mapred.tasktracker.map.tasks.maximum: 4
> > mapred.tasktracker.reduce.tasks.maximum: 4
> > mapred.child.java.opts: -Xmx2560m
> > mapreduce.reduce.shuffle.parallelcopies: 20
> > mapreduce.map.java.opts: -Xmx512m
> > mapreduce.reduce.java.opts: -Xmx512m
> > # Compression codecs are configured and seem to work fine
> > mapred.compress.map.output: true
> > mapred.map.output.compression.codec:
> com.hadoop.compression.lzo.LzoCodec
> >


Re: Stability issue - dead DN's

Posted by James Seigel <ja...@tynt.com>.
Evert,

What’s the stack trace and what version of hadoop do you have installed Sir!

James.
On 2011-05-11, at 3:23 AM, Evert Lammerts wrote:

> Hi list,
> 
> I notice that whenever our Hadoop installation is put under a heavy load we lose one or two (on a total of five) datanodes. This results in IOExceptions, and affects the overall performance of the job being run. Can anybody give me advise or best practices on a different configuration to increase the stability? Below I've included the specs of the cluster, the hadoop related config and an example of when which things go wrong. Any help is very much appreciated, and if I can provide any other info please let me know.
> 
> Cheers,
> Evert
> 
> == What goes wrong, and when ==
> 
> See attached a screenshot of Ganglia when the cluster is under load of a single job. This job:
> * reads ~1TB from HDFS
> * writes ~200GB to HDFS
> * runs 288 Mappers and 35 Reducers
> 
> When the job runs it takes all available Map and Reduce slots. The system starts swapping and there is a short time interval during which most cores are in WAIT. After that the job really starts running. At around half way, one or two datanodes become unreachable and are marked as dead nodes. The amount of under-replicated blocks becomes huge. Then some "java.io.IOException: Could not obtain block" are thrown in Mappers. The job does manage to finish successfully after around 3.5 hours, but my fear is that when we make the input much larger - which we want - the system becomes too unstable to finish the job.
> 
> Maybe worth mentioning - never know what might help diagnostics.  We notice that memory usage becomes less when we switch our keys from Text to LongWritable. Also, the Mappers are done in a fraction of the time. However, this for some reason results in much more network traffic and makes Reducers extremely slow. We're working on figuring out what causes this.
> 
> 
> == The cluster ==
> 
> We have a cluster that consists of 6 Sun Thumpers running Hadoop 0.20.2 on CentOS 5.5. One of them acts as NN and JT, the other 5 run DN's and TT's. Each node has:
> * 16GB RAM
> * 32GB swapspace
> * 4 cores
> * 11 LVM's of 4 x 500GB disks (2TB in total) for HDFS
> * non-HDFS stuff on separate disks
> * a 2x1GE bonded network interface for interconnects
> * a 2x1GE bonded network interface for external access
> 
> I realize that this is not a well balanced system, but it's what we had available for a prototype environment. We're working on putting together a specification for a much larger production environment.
> 
> 
> == Hadoop config ==
> 
> Here some properties that I think might be relevant:
> 
> __CORE-SITE.XML__
> 
> fs.inmemory.size.mb: 200
> mapreduce.task.io.sort.factor: 100
> mapreduce.task.io.sort.mb: 200
> # 1024*1024*4 MB, blocksize of the LVM's
> io.file.buffer.size: 4194304
> 
> __HDFS-SITE.XML__
> 
> # 1024*1024*4*32 MB, 32 times the blocksize of the LVM's
> dfs.block.size: 134217728
> # Only 5 DN's, but this shouldn't hurt
> dfs.namenode.handler.count: 40
> # This got rid of the occasional "Could not obtain block"'s
> dfs.datanode.max.xcievers: 4096
> 
> __MAPRED-SITE.XML__
> 
> mapred.tasktracker.map.tasks.maximum: 4
> mapred.tasktracker.reduce.tasks.maximum: 4
> mapred.child.java.opts: -Xmx2560m
> mapreduce.reduce.shuffle.parallelcopies: 20
> mapreduce.map.java.opts: -Xmx512m
> mapreduce.reduce.java.opts: -Xmx512m
> # Compression codecs are configured and seem to work fine
> mapred.compress.map.output: true
> mapred.map.output.compression.codec: com.hadoop.compression.lzo.LzoCodec
>