Posted to common-dev@hadoop.apache.org by "Naveen Nalam (JIRA)" <ji...@apache.org> on 2006/05/26 00:31:29 UTC

[jira] Created: (HADOOP-255) Client Calls are not cancelled after a call timeout

Client Calls are not cancelled after a call timeout
---------------------------------------------------

         Key: HADOOP-255
         URL: http://issues.apache.org/jira/browse/HADOOP-255
     Project: Hadoop
        Type: Bug

  Components: ipc  
    Versions: 0.2.1    
 Environment: Tested on Linux 2.6
    Reporter: Naveen Nalam


In ipc/Client.java, if a call times out, a SocketTimeoutException is thrown but the Call object still exists on the queue.

What I found was that when transferring very large amounts of data, it's common for queued-up calls to time out. Yet even though the caller is no longer waiting, the request is still serviced on the server and the data is sent to the client. After receiving the full response, the client calls callComplete(), which is a no-op since nobody is waiting.

The problem is that calls that time out are retried, and the system gets into a state where data is constantly being transferred, but it is all data for timed-out requests, so no progress is ever made.

My quick solution was to add a "boolean timedout" field to the Call object, which I set to true whenever the queued caller times out. Then, when the client starts to pull in the response data (in Connection::run), it first checks whether the Call has timed out and, if so, immediately closes the connection.
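
A minimal sketch of that workaround, with assumed names rather than the actual ipc/Client.java internals, might look like this:

    // Illustrative sketch only: field and method names are assumed, not the
    // real ipc/Client.java code.
    import java.io.DataInput;
    import java.io.IOException;
    import java.util.Hashtable;

    class TimedOutCallSketch {
        static class Call {
            final int id;
            volatile boolean timedout;   // set by the caller when it gives up waiting
            Call(int id) { this.id = id; }
        }

        private final Hashtable<Integer, Call> calls = new Hashtable<Integer, Call>();

        // Roughly what the receive loop (Connection::run) would do per response.
        // Returns true if the connection should be closed instead of reading on.
        boolean shouldDropResponse(DataInput in) throws IOException {
            int id = in.readInt();                    // call id echoed by the server
            Call call = calls.get(Integer.valueOf(id));
            return call == null || call.timedout;     // nobody is waiting: bail out
        }
    }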

I think a good fix for this is to queue requests on the client and issue a single sendParam only when there is no outstanding request. This would let the client close the connection when it receives a response for a request it no longer has pending, reopen the connection, and resend the next queued request. I can provide a patch for this, but I've seen a lot of recent activity in this area, so I'd like to get some feedback first.
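
And a rough sketch of that client-side queueing idea (again, names are assumed and connection details are glossed over):

    // Illustrative sketch only: one outstanding request per connection, with a
    // queue of pending calls behind it. Names are assumed, not real Hadoop code.
    import java.util.LinkedList;

    class SingleOutstandingSketch {
        static class Call {
            final int id;
            Call(int id) { this.id = id; }
        }

        private final LinkedList<Call> pending = new LinkedList<Call>();
        private Call outstanding;                 // at most one request on the wire

        synchronized void submit(Call call) {
            pending.addLast(call);
            maybeSendNext();
        }

        // Called when a response header carrying call id 'id' arrives.
        synchronized void responseReceived(int id) {
            if (outstanding == null || outstanding.id != id) {
                reconnect();                      // response for a call we gave up on:
            }                                     // drop the connection and start over
            outstanding = null;
            maybeSendNext();
        }

        private void maybeSendNext() {
            if (outstanding == null && !pending.isEmpty()) {
                outstanding = pending.removeFirst();
                // sendParam(outstanding);        // write the request to the server
            }
        }

        private void reconnect() {
            // close and reopen the socket (omitted in this sketch)
        }
    }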

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira


[jira] Updated: (HADOOP-255) Client Calls are not cancelled after a call timeout

Posted by "Owen O'Malley (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/HADOOP-255?page=all ]

Owen O'Malley updated HADOOP-255:
---------------------------------

    Attachment: rpc-timeout.patch

This patch has the rpc server handlers discard any call that is older than 60% of the ipc.timeout.
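
In rough terms (with assumed names; the attached patch is the authoritative change), the handler-side check is something like:

    // Sketch of an age check a handler could apply before processing a call.
    // Names and the exact threshold wiring are assumed, not the patch itself.
    class StaleCallCheckSketch {
        static class Call {
            final long receivedTime = System.currentTimeMillis();   // when it was queued
        }

        private final long maxCallAge;                // 60% of ipc.timeout, in ms

        StaleCallCheckSketch(long ipcTimeoutMillis) {
            this.maxCallAge = ipcTimeoutMillis * 6 / 10;
        }

        // A handler skips any call the client has almost certainly stopped waiting for.
        boolean shouldProcess(Call call) {
            return System.currentTimeMillis() - call.receivedTime <= maxCallAge;
        }
    }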



[jira] Assigned: (HADOOP-255) Client Calls are not cancelled after a call timeout

Posted by "Owen O'Malley (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/HADOOP-255?page=all ]

Owen O'Malley reassigned HADOOP-255:
------------------------------------

    Assignee: Owen O'Malley


[jira] Updated: (HADOOP-255) Client Calls are not cancelled after a call timeout

Posted by "Owen O'Malley (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/HADOOP-255?page=all ]

Owen O'Malley updated HADOOP-255:
---------------------------------

    Attachment: rpc-timeout-2.patch

This patch adds a fixed-size limit on the number of calls in the rpc server's queue. If the queue is full, the oldest call is discarded to make room for the new one.
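
A sketch of that kind of bounded queue (assumed names; the attached patch is the real change):

    // Fixed-capacity call queue: when full, the oldest entry is dropped so the
    // newest one can be queued. Illustrative only.
    import java.util.LinkedList;

    class BoundedCallQueueSketch<T> {
        private final LinkedList<T> queue = new LinkedList<T>();
        private final int maxSize;

        BoundedCallQueueSketch(int maxSize) { this.maxSize = maxSize; }

        synchronized void put(T call) {
            if (queue.size() >= maxSize) {
                queue.removeFirst();              // discard the oldest queued call
            }
            queue.addLast(call);
            notify();                             // wake a handler blocked in take()
        }

        synchronized T take() throws InterruptedException {
            while (queue.isEmpty()) {
                wait();
            }
            return queue.removeFirst();
        }
    }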


[jira] Commented: (HADOOP-255) Client Calls are not cancelled after a call timeout

Posted by "Owen O'Malley (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/HADOOP-255?page=comments#action_12440279 ] 
            
Owen O'Malley commented on HADOOP-255:
--------------------------------------

I'm going to hijack this bug. Clearly the original context was fixed by moving from the rpc getMapOutput to a jetty servlet. However, we are seeing cases where the dfs servers have trouble keeping up with the rpc calls. 

Therefore, I propose that we define a fraction of ipc.timeout as the maximum time an rpc call may wait before it is handed to a handler.


[jira] Commented: (HADOOP-255) Client Calls are not cancelled after a call timeout

Posted by "paul sutter (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/HADOOP-255?page=comments#action_12413505 ] 

paul sutter commented on HADOOP-255:
------------------------------------


Doug,

We realize all of this will change with all the great copy/sort-path work being done at Yahoo, but here's what we're seeing: When our individual map output files get large, like >1GB, we get endless cascading timeouts.

Naveen found the actual cause of this:

- The map output response capacity of each tasktracker is equal to the _average_ number of concurrent requests sent to each tasktracker. E.g., with two tasks per node, the capacity is two concurrent requests. The average number of requests is two, but that means some nodes get 4 requests while others get 0.

- Tasktrackers queue up over-the-limit requests. When the files are large, the time it takes to complete the previous requests exceeds the timeout interval. This means the client stops waiting for the file, and requests it again.

- Meanwhile, the old request for the file is eventually processed and runs to completion, since the current client RPC code merrily receives it although nobody is waiting for it. Since that file is over a gigabyte, it holds up the works for a long time, long enough that waiting tasks time out, causing still more requests.

So what happens is that you get an endless cycle of timed-out requests that are eventually satisfied, causing useless transfers of >1GB, which cause more timeouts, and so on.

Obviously, the new efforts at Yahoo will change this considerably. For example, _not_ multiplexing these transfers within a single TCP session will allow the session to be terminated on a timeout, which will prevent large, useless transfers from happening.

But I just wanted to explain what we were seeing, and why it is definitely uncool to proceed with large transfers when nobody will use the results.

We'll try out the new Yahoo code, and see what we get. I suspect that this will help a lot.



[jira] Updated: (HADOOP-255) Client Calls are not cancelled after a call timeout

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/HADOOP-255?page=all ]

Doug Cutting updated HADOOP-255:
--------------------------------

    Fix Version/s: 0.8.0
                       (was: 0.7.0)


[jira] Commented: (HADOOP-255) Client Calls are not cancelled after a call timeout

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/HADOOP-255?page=comments#action_12440526 ] 
            
Doug Cutting commented on HADOOP-255:
-------------------------------------

I worry about letting the queued calls grow without bound.  If a synchronized server implementation spends a long time on a single request, while lots of other requests are coming in from other clients, then the queue could end up exhausting memory.  So perhaps we should discard stale requests as new requests are queued rather than when they're dequeued, so that we can limit the size of the queue?


[jira] Updated: (HADOOP-255) Client Calls are not cancelled after a call timeout

Posted by "Owen O'Malley (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/HADOOP-255?page=all ]

Owen O'Malley updated HADOOP-255:
---------------------------------

           Status: Patch Available  (was: Open)
    Fix Version/s: 0.7.0


[jira] Commented: (HADOOP-255) Client Calls are not cancelled after a call timeout

Posted by "Sameer Paranjpye (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/HADOOP-255?page=comments#action_12440568 ] 
            
Sameer Paranjpye commented on HADOOP-255:
-----------------------------------------

+1 for Doug's comment.

The thread populating the call queue should add calls to the end of the queue, then examine the calls at the front for staleness and discard those that are too old.
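
Something along these lines (a sketch with assumed names, not a patch):

    // Enqueue-time eviction: new calls go on the tail, and anything at the head
    // older than the age limit is discarded immediately, so the queue cannot
    // grow without bound while handlers are busy. Illustrative only.
    import java.util.LinkedList;

    class EnqueueEvictionSketch {
        static class Call {
            final long receivedTime = System.currentTimeMillis();
        }

        private final LinkedList<Call> queue = new LinkedList<Call>();
        private final long maxCallAge;            // e.g. some fraction of ipc.timeout

        EnqueueEvictionSketch(long maxCallAge) { this.maxCallAge = maxCallAge; }

        synchronized void put(Call call) {
            long now = System.currentTimeMillis();
            while (!queue.isEmpty()
                   && now - queue.getFirst().receivedTime > maxCallAge) {
                queue.removeFirst();              // drop stale calls before adding new ones
            }
            queue.addLast(call);
            notify();                             // wake a handler blocked in take()
        }

        synchronized Call take() throws InterruptedException {
            while (queue.isEmpty()) {
                wait();
            }
            return queue.removeFirst();
        }
    }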




[jira] Updated: (HADOOP-255) Client Calls are not cancelled after a call timeout

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/HADOOP-255?page=all ]

Doug Cutting updated HADOOP-255:
--------------------------------

    Fix Version/s: 0.7.1
                       (was: 0.8.0)


[jira] Commented: (HADOOP-255) Client Calls are not cancelled after a call timeout

Posted by "Owen O'Malley (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/HADOOP-255?page=comments#action_12413382 ] 

Owen O'Malley commented on HADOOP-255:
--------------------------------------

There are three parts to this. The first part is that I'm just about done with converting the map output to use http rather than rpc. So that aspect of this problem will go away.

The second part is that this behavior happens on all of the rpcs, not just map output transfer. However, sending a message now to potentially prevent a future message is not necessarily a winning game. If you did send a "cancel call" message, you'd probably want to remove the call from the server's work queue if it is still waiting to be processed.

It is tempting to try a strategy where you check the age of a call when you start processing it on the server and reject messages that are too old, but the problem is that the client timeout is _not_ 10 seconds from the start of the call, but rather 10 seconds with no data received on the socket, which is hard for the server to estimate.

The final part is that the server cannot assume that the return message from an rpc call was received, and that is a problem in itself. For example, I had a problem with pollForNewTask timing out and dropping tasks. I fixed that by adding a timeout so that after a task is assigned to a task tracker, if it does not show up in a task tracker status message within 10 minutes it is considered lost. However, this applies to _all_ of the rpc messages. You always need to make sure that if the return value of an rpc call were to disappear into thin air, the problem would eventually be detected. There are other instances of this kind of problem that still exist in the code and need to be identified and fixed.
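
That kind of safety net amounts to bookkeeping roughly like the following (a sketch with assumed names, not the actual JobTracker code):

    // Sketch of a "lost task" safety net: remember when each task was handed
    // out, and reclaim it if the task tracker never reports it back.
    // Names and structure are assumed, not the actual code.
    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.Iterator;
    import java.util.List;
    import java.util.Map;

    class LostTaskSketch {
        private static final long LOST_TASK_TIMEOUT = 10 * 60 * 1000L;  // 10 minutes
        private final Map<String, Long> assignedAt = new HashMap<String, Long>();

        synchronized void taskAssigned(String taskId) {
            assignedAt.put(taskId, Long.valueOf(System.currentTimeMillis()));
        }

        synchronized void taskReported(String taskId) {
            assignedAt.remove(taskId);            // the assignment did make it through
        }

        // Run periodically: anything not acknowledged within the timeout is
        // considered lost and should be rescheduled.
        synchronized List<String> expireLostTasks() {
            List<String> lost = new ArrayList<String>();
            long now = System.currentTimeMillis();
            for (Iterator<Map.Entry<String, Long>> it = assignedAt.entrySet().iterator();
                 it.hasNext(); ) {
                Map.Entry<String, Long> e = it.next();
                if (now - e.getValue().longValue() > LOST_TASK_TIMEOUT) {
                    lost.add(e.getKey());
                    it.remove();
                }
            }
            return lost;
        }
    }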



[jira] Updated: (HADOOP-255) Client Calls are not cancelled after a call timeout

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/HADOOP-255?page=all ]

Doug Cutting updated HADOOP-255:
--------------------------------

    Fix Version/s: 0.7.0
                       (was: 0.7.1)

This was included in the 0.7.0 release, but mistakenly marked for 0.8.0.


[jira] Commented: (HADOOP-255) Client Calls are not cancelled after a call timeout

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/HADOOP-255?page=comments#action_12413490 ] 

Doug Cutting commented on HADOOP-255:
-------------------------------------

I think this is, in general, something that we won't fix.  It might be possible to improve things, but we cannot, without elaborate handshake protocols, guarantee that RPC responses are received by clients.  As Owen has indicated, we must instead make our applications tolerant of that.

Note that this is generally a problem for HTTP-based services too.  When someone hits "stop" in their browser for a slow request, the server generally continues to compute the request and only discovers that the connection has been closed when it attempts to write the response, if at all.  If the client times out after the server has flushed the response, then there's no way for the server to know this.

You suggest that we might queue requests on the client so that only a single request to a particular server is outstanding at a time. That would not work well for distributed search (Nutch's original IPC application). In distributed search, a front end typically has many queries outstanding. Each query is broadcast to a number of back-end servers, and different queries take different amounts of time. We do not want to make fast queries wait for slow queries, as that would make all queries slow and increase the burden on the front-end servers.

Would folks object if I resolve this as a WONTFIX bug?



[jira] Commented: (HADOOP-255) Client Calls are not cancelled after a call timeout

Posted by "Naveen Nalam (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/HADOOP-255?page=comments#action_12413498 ] 

Naveen Nalam commented on HADOOP-255:
-------------------------------------

Well, the problem I was seeing is that a getFile RPC request for, say, 1GB was issued, but then the Call object timed out on the client. Yet the 1GB was still transferred fully to the client and then discarded, since there was no waiting Call object. My system got into a situation where thousands of getFile requests were queued up per node in the server's tcp receive buffers, so you can see how no progress would ever have been made.

Are you suggesting that this is not going to be a problem because all the RPCs with large response bodies will now be done over HTTP? I can't see how leaving the code as it is would be fine, unless all RPCs are going to be very quick to service and have small response bodies.

You mentioned search queries as an example of an RPC that could be broadcast out. What would happen if the queries were taking too long to service and the client-side rpc requests were already timing out? I could see it getting into a similar situation where the servers become busy processing stale query requests.

So if all the RPCs will have small response bodies, then it seems fine to keep the connection open and just read in the response and throw it away. And then why not add a CANCEL-RPC request type that can be sent whenever a client request has timed out?
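
Purely as a sketch of what such a CANCEL-RPC might need on the server side (nothing like this exists in the current code; names are made up):

    // Hypothetical server-side bookkeeping for a CANCEL-RPC message type.
    import java.util.HashSet;
    import java.util.Set;

    class CancelRpcSketch {
        private final Set<Integer> cancelled = new HashSet<Integer>();

        // Called when a CANCEL-RPC frame naming call id 'id' is read off the wire.
        synchronized void cancel(int id) {
            cancelled.add(Integer.valueOf(id));
        }

        // Called by a handler just before it starts processing a queued call;
        // a cancelled call is simply skipped.
        synchronized boolean isCancelled(int id) {
            return cancelled.remove(Integer.valueOf(id));
        }
    }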



[jira] Updated: (HADOOP-255) Client Calls are not cancelled after a call timeout

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/HADOOP-255?page=all ]

Doug Cutting updated HADOOP-255:
--------------------------------

        Status: Resolved  (was: Patch Available)
    Resolution: Fixed

I just committed this.  Thanks, Owen!
