Posted to common-dev@hadoop.apache.org by "paul sutter (JIRA)" <ji...@apache.org> on 2006/04/18 00:21:22 UTC

[jira] Created: (HADOOP-141) Disk thrashing / task timeouts during map output copy phase

Disk thrashing / task timeouts during map output copy phase
-----------------------------------------------------------

         Key: HADOOP-141
         URL: http://issues.apache.org/jira/browse/HADOOP-141
     Project: Hadoop
        Type: Bug

  Components: mapred  
 Environment: linux
    Reporter: paul sutter



MapOutputProtocol connections time out because of system thrashing, and the same file is transferred over and over again, until the job makes no forward progress (medium-sized job, 500GB input file, map output about as large as the input, 10-node cluster).

There are several bugs behind this, but the following two changes improved matters considerably.

(1) 

The buffer size in MapOutputFile is currently hardcoded to 8192 bytes (for both reads and writes). Changing this buffer size to 256KB reduced the number of disk seeks, and the problem went away.

Ideally there would be a buffer size parameter for this that is separate from the DFS io buffer size.
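
A minimal sketch of what such a separate parameter could look like, assuming a plain file-to-file copy. The class name, the property name sketch.mapoutput.buffer.size, and the copy helper are hypothetical illustrations and are not taken from MapOutputFile or from any Hadoop configuration key.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

class MapOutputCopySketch {
    // Hypothetical property name; NOT a real Hadoop configuration key.
    static final String BUFFER_SIZE_PROPERTY = "sketch.mapoutput.buffer.size";
    // 256KB, the value tried in the report, instead of a hardcoded 8192.
    static final int DEFAULT_BUFFER_SIZE = 256 * 1024;

    // Reads the buffer size from a system property, falling back to the default.
    static int bufferSize() {
        return Integer.getInteger(BUFFER_SIZE_PROPERTY, DEFAULT_BUFFER_SIZE);
    }

    // Copies source to target, requesting up to bufferSize bytes per read call.
    // Returns the number of bytes copied.
    static long copy(Path source, Path target, int bufferSize) throws IOException {
        long total = 0;
        byte[] chunk = new byte[bufferSize];
        try (InputStream in = Files.newInputStream(source);
             OutputStream out = Files.newOutputStream(target)) {
            int n;
            while ((n = in.read(chunk)) != -1) {
                out.write(chunk, 0, n);
                total += n;
            }
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        Path src = Files.createTempFile("map-output", ".bin");
        Path dst = Files.createTempFile("map-output-copy", ".bin");
        Files.write(src, new byte[1_000_000]);
        long copied = copy(src, dst, bufferSize());
        System.out.println("copied " + copied + " bytes, buffer size " + bufferSize());
        Files.delete(src);
        Files.delete(dst);
    }
}
```

Keeping the value behind its own property means it can be tuned (for example with -Dsketch.mapoutput.buffer.size=65536) without touching the DFS io buffer size.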

(2)

I also added the following code to the socket configuration in both Server.java and Client.java. Disabling linger is a minor good idea in an environment with some packet loss (and you will have that when all the nodes get busy at once). The 256KB buffers are probably excessive, especially on a LAN, but it takes me two hours to test changes, so I haven't experimented.

socket.setSendBufferSize(256*1024);
socket.setReceiveBufferSize(256*1024);
socket.setSoLinger(false, 0);
socket.setKeepAlive(true);
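
For readers who want to try these settings in isolation, here is a self-contained sketch that applies the same four calls to an unconnected java.net.Socket and prints what the platform actually granted. The class and method names are hypothetical; only the java.net.Socket calls are real. Two caveats worth knowing: the requested buffer sizes are hints that the OS may adjust, and setSoLinger(false, 0) simply leaves SO_LINGER disabled, which is already the usual default.

```java
import java.io.IOException;
import java.net.Socket;

class SocketTuningSketch {
    static final int REQUESTED_BUFFER_BYTES = 256 * 1024;

    // Creates an unconnected socket with the options from the report applied.
    // Large receive buffers should be requested before connect() so they can
    // take effect for the connection.
    static Socket newTunedSocket() throws IOException {
        Socket socket = new Socket();
        socket.setSendBufferSize(REQUESTED_BUFFER_BYTES);    // hint only; OS may adjust
        socket.setReceiveBufferSize(REQUESTED_BUFFER_BYTES); // hint only; OS may adjust
        socket.setSoLinger(false, 0);                        // SO_LINGER disabled
        socket.setKeepAlive(true);                           // enable TCP keepalive probes
        return socket;
    }

    public static void main(String[] args) throws IOException {
        try (Socket socket = newTunedSocket()) {
            // Print the effective values rather than assuming the request was honored.
            System.out.println("send buffer:    " + socket.getSendBufferSize());
            System.out.println("receive buffer: " + socket.getReceiveBufferSize());
            System.out.println("linger:         " + socket.getSoLinger()); // -1 means disabled
            System.out.println("keepalive:      " + socket.getKeepAlive());
        }
    }
}
```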


-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira

[jira] Commented: (HADOOP-141) Disk thrashing / task timeouts during map output copy phase

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/HADOOP-141?page=comments#action_12375401 ] 

Doug Cutting commented on HADOOP-141:
-------------------------------------

Some timeouts during the copy phase may not be bad.  If too many nodes are transferring from a given node, then it may time out additional requests.  And if one node is already transferring from another node for one task, then attempts by a second task to transfer may time out (due to the shared connection pool).  These should not affect overall performance too much, especially if the timeout is relatively short.


[jira] Commented: (HADOOP-141) Disk thrashing / task timeouts during map output copy phase

Posted by "p sutter (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/HADOOP-141?page=comments#action_12442193 ] 
            
p sutter commented on HADOOP-141:
---------------------------------


   [[ Old comment, sent by email on Wed, 2 Aug 2006 13:47:05 -0700 ]]

Close it out! The new shuffle path is really great.





[jira] Commented: (HADOOP-141) Disk thrashing / task timeouts during map output copy phase

Posted by "paul sutter (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/HADOOP-141?page=comments#action_12374822 ] 

paul sutter commented on HADOOP-141:
------------------------------------

Reword that first sentence:

Reduce progress grinds to a halt, with lots of MapOutputProtocol timeouts and the same file being transferred over and over again, because of system thrashing.



Re: [jira] Commented: (HADOOP-141) Disk thrashing / task timeouts during map output copy phase

Posted by Eric Baldeschwieler <er...@yahoo-inc.com>.

Hmm,

The client is timing out while it is getting data?  Maybe as long as
it is getting data, it should reset its timer?  Maybe the server
should fail a client if it is busy?  This would let you make an
informed decision.
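
One way to picture the "reset its timer while data is arriving" idea is an inactivity timeout, which measures time since the last progress rather than time since the request started. The sketch below is only an illustration under that assumption: the class and method names are hypothetical and this is not how the Hadoop RPC client is implemented. Timestamps are passed in explicitly so the behavior is easy to follow.

```java
class InactivityTimeoutSketch {
    private final long timeoutMillis;
    private long lastProgressMillis;

    // startMillis is the time at which the transfer was requested.
    InactivityTimeoutSketch(long timeoutMillis, long startMillis) {
        this.timeoutMillis = timeoutMillis;
        this.lastProgressMillis = startMillis;
    }

    // Call whenever bytes arrive; this pushes the deadline forward.
    void recordProgress(long nowMillis) {
        lastProgressMillis = nowMillis;
    }

    // True only if no progress has been seen for at least timeoutMillis.
    boolean isTimedOut(long nowMillis) {
        return nowMillis - lastProgressMillis >= timeoutMillis;
    }

    public static void main(String[] args) {
        // 60-second inactivity timeout, transfer requested at t = 0.
        InactivityTimeoutSketch timer = new InactivityTimeoutSketch(60_000, 0);
        timer.recordProgress(50_000);                  // data arrives at t = 50s
        System.out.println(timer.isTimedOut(90_000));  // 40s since last data
        System.out.println(timer.isTimedOut(110_000)); // 60s since last data
    }
}
```

With a fixed deadline, a slow but steadily progressing transfer of a large map output file would be abandoned and restarted; with an inactivity deadline, only a transfer that has genuinely stalled is given up.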

On Apr 20, 2006, at 11:24 AM, paul sutter (JIRA) wrote:

>     [ http://issues.apache.org/jira/browse/HADOOP-141?page=comments#action_12375411 ]
>
> paul sutter commented on HADOOP-141:
> ------------------------------------
>
>
> A few timeouts would be fine. The problem is when the same files
> time out over and over again, and progress ceases completely.
>
> I was able to make the problem go away by increasing the number of  
> mappers by 6X, making the map output files 1/6th as large, so I  
> have given up on finding the problem.
>
> So here is the summary:
>
> - with 700MB map output files (18 mappers), original code: the job  
> would never progress past reduce progress of  17% or 18%.
> - with 700MB map output files (18 mappers), large buffers: the job  
> completed in 27 hours
> - with 120MB map output files (106 mappers), and large buffers: the  
> job completed in 6 hours
>
> I'm happy to share logs that include the timeouts and extended
> logging information on MapOutputFile.java if anyone is interested,
> but I won't post them here because they are several hundred megabytes.
>
> Otherwise I will continue to use the workaround of smaller map  
> output files.

[jira] Commented: (HADOOP-141) Disk thrashing / task timeouts during map output copy phase

Posted by "paul sutter (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/HADOOP-141?page=comments#action_12375411 ] 

paul sutter commented on HADOOP-141:
------------------------------------


A few timeouts would be fine. The problem is when the same files time out over and over again, and progress ceases completely.

I was able to make the problem go away by increasing the number of mappers by 6X, making the map output files 1/6th as large, so I have given up on finding the problem.

So here is the summary:

- with 700MB map output files (18 mappers), original code: the job would never progress past a reduce progress of 17% or 18%.
- with 700MB map output files (18 mappers), large buffers: the job completed in 27 hours.
- with 120MB map output files (106 mappers), and large buffers: the job completed in 6 hours.

I'm happy to share logs that include the timeouts and extended logging information on MapOutputFile.java if anyone is interested, but I won't post them here because they are several hundred megabytes.

Otherwise I will continue to use the workaround of smaller map output files.


[jira] Commented: (HADOOP-141) Disk thrashing / task timeouts during map output copy phase

Posted by "Owen O'Malley (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/HADOOP-141?page=comments#action_12425370 ] 
            
Owen O'Malley commented on HADOOP-141:
--------------------------------------

Paul,
   Is this still happening with the HTTP map output transfer, or can I close this?


[jira] Commented: (HADOOP-141) Disk thrashing / task timeouts during map output copy phase

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/HADOOP-141?page=comments#action_12375167 ] 

Doug Cutting commented on HADOOP-141:
-------------------------------------

It would be good to find which of these changes actually made the difference for you.  Each TaskTracker's map output server only accepts up to mapred.tasktracker.tasks.maximum connections at once.  Since this is typically around 2, I would be surprised if a small buffer size results in lots of seeks, since the OS should perform readaheads in its buffer cache.

What OS are you using?  What do you have mapred.tasktracker.tasks.maximum set to?

If on Linux, what does 'iostat -x 1 10' show when things are slow?  How about 'sar -n DEV 1 10'?



[jira] Resolved: (HADOOP-141) Disk thrashing / task timeouts during map output copy phase

Posted by "Owen O'Malley (JIRA)" <ji...@apache.org>.

     [ http://issues.apache.org/jira/browse/HADOOP-141?page=all ]

Owen O'Malley resolved HADOOP-141.
----------------------------------

    Fix Version/s: 0.3.0
       Resolution: Fixed
         Assignee: Owen O'Malley


[jira] Commented: (HADOOP-141) Disk thrashing / task timeouts during map output copy phase

Posted by "paul sutter (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/HADOOP-141?page=comments#action_12375204 ] 

paul sutter commented on HADOOP-141:
------------------------------------


As it turns out, my changes did not fix the problem, just changed the timing.

The thrashing was occurring because one reducer was in the merge phase and the other reducer was in the file copy phase.  The particular file that was failing was being copied from the local system.  I have the concurrent merges set to 24 and the task count set to 4.

I added logging statements, and the file was clearly being received in full by MapOutputFile, yet ReduceTaskRunner was getting a timeout on that file about 1 minute and 20 seconds later. It would request the file again and again, each time receiving it in full yet getting a timeout just over a minute later.

I did find two interesting bugs in RPC.java while trying to track this down (which I'm filing separately), but for now I am completely stumped.

At the moment the cluster is otherwise busy, so I can't do any more experiments until perhaps tomorrow. Any suggestions would be very welcome. We are using Linux, and I'll try the commands you suggested when I'm able to recreate it, but for now this does not look like a disk or TCP problem; it really looks like an RPC scheduling problem.
