Posted to common-dev@hadoop.apache.org by "Owen O'Malley (JIRA)" <ji...@apache.org> on 2006/05/26 08:51:30 UTC

[jira] Commented: (HADOOP-255) Client Calls are not cancelled after a call timeout

    [ http://issues.apache.org/jira/browse/HADOOP-255?page=comments#action_12413382 ] 

Owen O'Malley commented on HADOOP-255:
--------------------------------------

There are three parts to this. The first part is that I'm just about done with converting the map output transfer to use HTTP rather than RPC, so that aspect of this problem will go away.

The second part is that this behavior happens on all of the RPCs, not just map output transfer. However, sending a message now to potentially prevent a future message is not necessarily a winning game. If you did send a "cancel call" message, you'd probably want to remove the call from the server's work queue if it is still waiting to be processed.

It is tempting to try a strategy where you check the age of a call when you start processing it on the server and reject calls that are too old. The problem is that the client timeout is _not_ 10 seconds from the start of the call, but rather 10 seconds with no data received on the socket, which is hard for the server to estimate.

The final part is that the server cannot assume that the return message from an RPC call was received, and that is a problem. For example, I had a problem with pollForNewTask timing out and dropping tasks. I fixed that by adding a timeout so that after a task is assigned to a task tracker, if it does not show up in a task tracker status message within 10 minutes it is considered lost. However, this applies to _all_ of the RPC messages. You always need to make sure that if the return value of an RPC call were to disappear into thin air, the problem would be detected eventually. There are other instances of this kind of problem that still exist in the code and need to be identified and fixed.
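
To make the lost-task timeout concrete, here is a minimal sketch of the idea, not the actual JobTracker code. The class name PendingAssignments and its methods are hypothetical; the only thing taken from the fix described above is the rule that an assigned task that is not confirmed by a status message within the expiry interval is treated as lost.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch: remembers assigned tasks until a status message confirms them. */
class PendingAssignments {
    private final long expiryMillis;
    // task id -> time (ms) at which the task was handed out in an RPC reply
    private final Map<String, Long> unconfirmed = new ConcurrentHashMap<>();

    PendingAssignments(long expiryMillis) {
        this.expiryMillis = expiryMillis;
    }

    /** Call when a task is returned from an RPC whose reply may never arrive. */
    void assigned(String taskId, long nowMillis) {
        unconfirmed.put(taskId, nowMillis);
    }

    /** Call when a task tracker status message mentions the task. */
    void confirmed(String taskId) {
        unconfirmed.remove(taskId);
    }

    /** Removes and returns the tasks that stayed unconfirmed longer than the expiry interval. */
    List<String> expire(long nowMillis) {
        List<String> lost = new ArrayList<>();
        Iterator<Map.Entry<String, Long>> it = unconfirmed.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> entry = it.next();
            if (nowMillis - entry.getValue() > expiryMillis) {
                lost.add(entry.getKey());
                it.remove();
            }
        }
        return lost;
    }
}
```

A periodic thread would call expire() and hand the returned tasks back to the scheduler. The same pattern (record what you promised, confirm it through an independent message, expire what was never confirmed) is one way to cover other RPCs whose replies can vanish.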

> Client Calls are not cancelled after a call timeout
> ---------------------------------------------------
>
>          Key: HADOOP-255
>          URL: http://issues.apache.org/jira/browse/HADOOP-255
>      Project: Hadoop
>         Type: Bug

>   Components: ipc
>     Versions: 0.2.1
>  Environment: Tested on Linux 2.6
>     Reporter: Naveen Nalam

>
> In ipc/Client.java, if a call times out, a SocketTimeoutException is thrown but the Call object still exists on the queue.
> What I found was that when transferring very large amounts of data, it's common for queued-up calls to time out. Yet even though the caller is no longer waiting, the request is still serviced on the server and the data is sent to the client. After receiving the full response, the client calls callComplete(), which is a no-op since nobody is waiting.
> The problem is that the calls that time out will retry, and the system gets into a situation where data is being transferred around, but it's all data for timed-out requests and no progress is ever made.
> My quick solution to this was to add a "boolean timedout" to the Call object, which I set to true whenever the queued caller times out. Then, when the client starts to pull over the response data (in Connection::run), it first checks whether the Call is timed out and, if so, immediately closes the connection.
> I think a good fix for this is to queue requests on the client, and do a single sendParam only when there is no outstanding request. This will allow closing the connection when receiving a response for a request we no longer have pending, then reopening the connection and resending the next queued request. I can provide a patch for this, but I've seen a lot of recent activity in this area so I'd like to get some feedback first.
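
For readers following along, the "boolean timedout" workaround quoted above could look roughly like the sketch below. It is an illustration under assumptions, not the code in ipc/Client.java: the Call and PendingCalls classes here are hypothetical stand-ins, and the sketch assumes the receiver can learn the call id before it reads the response body.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical stand-in for the client-side Call object. */
class Call {
    final int id;
    // volatile: set by the waiting caller, read by the receiver thread
    private volatile boolean timedOut;

    Call(int id) {
        this.id = id;
    }

    void markTimedOut() {
        timedOut = true;
    }

    boolean isTimedOut() {
        return timedOut;
    }
}

/** Hypothetical table of outstanding calls shared by callers and the receiver thread. */
class PendingCalls {
    private final Map<Integer, Call> calls = new ConcurrentHashMap<>();

    void register(Call call) {
        calls.put(call.id, call);
    }

    /** Invoked by the waiting caller when its wait expires. */
    void timedOut(int id) {
        Call call = calls.get(id);
        if (call != null) {
            call.markTimedOut();
        }
    }

    /**
     * Invoked by the receiver once it knows the call id and before it reads the body.
     * Returns true if the body should be read, false if nobody is waiting for it
     * and the connection should be closed instead.
     */
    boolean shouldReadResponse(int id) {
        Call call = calls.get(id);
        if (call == null || call.isTimedOut()) {
            calls.remove(id);
            return false;
        }
        return true;
    }
}
```

When shouldReadResponse() returns false, the receiver would close the socket rather than pull over data that nobody is waiting for. Note that this only saves the client-side transfer; the server has still done the work, which is the concern raised in the comment above.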

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira