Posted to common-user@hadoop.apache.org by Sofia Georgiakaki <ge...@yahoo.com> on 2011/09/23 09:15:18 UTC

many killed tasks, long execution time


Good morning!

I would be grateful if anyone could help me with a serious problem that I'm facing.
I am trying to run a Hadoop job on a 12-node cluster (48-task capacity), and I have problems when dealing with large input data (10-20 GB), which get worse when I increase the number of reducers.
Many tasks get killed (for example, 25 out of the 148 map tasks and 15 out of the 40 reducers) and the job struggles to finish.

The job is heavy in general, as it builds an R-tree on HDFS.
During the reduce phase, I also create and write some binary files on HDFS using FSDataOutputStream, and I noticed that sometimes some tasks fail to write correctly to their binary file, throwing an IOException when they try to execute dataFileOut.write(m_buffer);.
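Roughly, the reduce-side write looks like the sketch below (a simplified version: the class name, the key/value types and the output path are placeholders, and the real code first fills m_buffer with the serialized R-tree node):

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.mapreduce.Reducer;

    public class RtreeReducer extends Reducer<IntWritable, BytesWritable, IntWritable, BytesWritable> {
        private byte[] m_buffer = new byte[0];   // filled with the serialized R-tree node in reduce()

        @Override
        protected void cleanup(Context context) throws IOException, InterruptedException {
            Configuration conf = context.getConfiguration();
            FileSystem fs = FileSystem.get(conf);
            // Example path only; note that it depends on the task ID, not the
            // attempt ID, so two attempts of the same task target the same file.
            Path outPath = new Path("/rtree/nodes/node-" + context.getTaskAttemptID().getTaskID().getId());
            FSDataOutputStream dataFileOut = fs.create(outPath);
            dataFileOut.write(m_buffer);   // <-- the call that throws the IOException
            dataFileOut.close();
        }
    }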

I'm using version 0.20.203, and I had also tested the code on 0.20.2 before (facing the same problems with killed tasks!).


I would appreciate any advice or ideas, as I have to finish my diploma thesis (it has already taken me a year, and I hope it won't take longer).

Thank you very much in advance
Sofia

Re: many killed tasks, long execution time

Posted by Robert Evans <ev...@yahoo-inc.com>.
Sofia,

Speculative execution is great as long as you are not writing data off to HDFS on the side. If you use a normal output format, it can handle putting your output in a temporary location with a unique name and then, in the cleanup step once the tasks have finished, moving the files to their final location.
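If you really do need side files, one option (just a sketch against the 0.20 "mapreduce" API; the class and method names below are made up, and the old mapred API offers FileOutputFormat.getWorkOutputPath(JobConf) for the same purpose) is to create them under the task's work output path, so they are promoted only when the attempt commits and a killed speculative attempt leaves nothing behind:

    import java.io.IOException;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.TaskInputOutputContext;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class SideFiles {
        // Write 'bytes' to a file under the task attempt's work output path.
        // Files there are moved into the job output directory only when the
        // attempt commits, so duplicate (speculative) attempts cannot clash.
        public static void writeSideFile(TaskInputOutputContext<?, ?, ?, ?> context,
                                         String name, byte[] bytes)
                throws IOException, InterruptedException {
            Path workDir = FileOutputFormat.getWorkOutputPath(context);
            Path file = new Path(workDir, name);
            FSDataOutputStream out = file.getFileSystem(context.getConfiguration()).create(file);
            out.write(bytes);
            out.close();
        }
    }

A reducer would then call something like writeSideFile(context, "rtree-node-" + partition, m_buffer) instead of creating a fixed HDFS path itself.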

From what you have said, it sounds almost like you are writing data out from your map task to HDFS and then reading that data back into your reduce task from HDFS. Is that correct? This may be the cause of your slowdown when you add in more reducers. It could also have something to do with the number of nodes that you have. You said you have a 12-node cluster. How many reducer slots are there per node? If there is only one, then you are adding in extra overhead, not just to launch a new reducer, but also because the distribution across the nodes is uneven: one node will run two reducers one right after the other, while all the others only have to run one.
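One quick way to sanity-check this (a rough sketch only: the property is the 0.20.x per-TaskTracker setting, and reading it on the client just reflects your local mapred-site.xml, so treat the value as an estimate) is to size the reducer count to a whole number of even waves over the cluster's reduce slots:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class ReducerSizing {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Per-TaskTracker reduce slots, as set in mapred-site.xml
            // (mapred.tasktracker.reduce.tasks.maximum, default 2 in 0.20.x).
            int slotsPerNode = conf.getInt("mapred.tasktracker.reduce.tasks.maximum", 2);
            int nodes = 12;                           // the 12-node cluster from this thread
            Job job = new Job(conf, "rtree-build");   // job name is just a placeholder
            // One even wave: every node runs the same number of reducers, so no
            // node is left running an extra reducer after the others finish.
            job.setNumReduceTasks(nodes * slotsPerNode);
            // ...set input/output paths and mapper/reducer classes, then submit.
        }
    }

With 12 nodes and, say, one reduce slot each, 13 reducers already forces a second, mostly idle wave.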

I think it is more likely something like the second issue than the first.

--Bobby Evans




Re: many killed tasks, long execution time

Posted by Sofia Georgiakaki <ge...@yahoo.com>.
Mr. Bobby, thank you for your reply.
The IOException was related to speculative execution: in my reducers I create some files that are written to HDFS, so on some occasions multiple task attempts tried to write the same file. I turned speculative execution off for the reduce tasks, and the problem was solved.
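For the record, I did it roughly like this (the property name is the 0.20.x one; with the old JobConf API the equivalent call is setReduceSpeculativeExecution(false)):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class DisableReduceSpeculation {
        public static Job configure() throws Exception {
            Configuration conf = new Configuration();
            // Turn speculative execution off for reduce tasks only, so a second
            // attempt of the same reducer never races to create the same HDFS
            // file (map-side speculation stays on).
            conf.setBoolean("mapred.reduce.tasks.speculative.execution", false);
            return new Job(conf, "rtree-build");   // job name is just a placeholder
        }
    }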

However, the major problem of the long execution time remains. I assume now that all those killed map tasks also had to do with speculative execution, so the source of the slowdown must be somewhere else.

I noticed that the average time for the map tasks (as well as the time at which the longest mapper finishes) increases as I increase the number of reducers! Is this normal? The input is always the same, as is the number of map tasks (158 map tasks executed on the 12-node cluster; each node has capacity for 4 map tasks).
In addition, the performance of the job is fine when the number of reducers is in the range 2-12, but if I increase the reducers further, the performance gets worse and worse...

Any ideas would be helpful!
Thank you!






Re: many killed tasks, long execution time

Posted by Robert Evans <ev...@yahoo-inc.com>.
Can you include the complete stack trace of the IOException you are seeing?

--Bobby Evans
