Posted to common-issues@hadoop.apache.org by "sam rash (JIRA)" <ji...@apache.org> on 2010/05/12 18:30:41 UTC

[jira] Commented: (HADOOP-6762) exception while doing RPC I/O closes channel

    [ https://issues.apache.org/jira/browse/HADOOP-6762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12866619#action_12866619 ] 

sam rash commented on HADOOP-6762:
----------------------------------

The general problem is that 'client' threads hold the socket and write to it to send RPCs.  If a client thread receives an interrupt while doing so, it leaves the socket in an unusable state.
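
To illustrate the mechanism, here is a minimal sketch that is not taken from the Hadoop code or the patch. It assumes a java.nio Pipe as a stand-in for the RPC socket, and the class and method names are hypothetical. It relies only on the documented InterruptibleChannel behavior: a blocking write attempted by a thread whose interrupt flag is set closes the channel and throws ClosedByInterruptException.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.ClosedByInterruptException;
import java.nio.channels.Pipe;

// Hypothetical demo class; a Pipe stands in for the shared RPC socket.
final class InterruptClosesChannel {

    /** Writes once with the interrupt flag set; returns whether the sink is still open afterwards. */
    static boolean openAfterInterruptedWrite(Pipe pipe) throws IOException {
        Thread.currentThread().interrupt();      // simulate an interrupt arriving before the write
        try {
            pipe.sink().write(ByteBuffer.wrap(new byte[] {1}));
        } catch (ClosedByInterruptException e) {
            // The interruptible channel closes itself before this exception is thrown.
        } finally {
            Thread.interrupted();                // clear the flag so later code is unaffected
        }
        return pipe.sink().isOpen();
    }

    public static void main(String[] args) throws IOException {
        Pipe pipe = Pipe.open();
        System.out.println("open after interrupted write: " + openAfterInterruptedWrite(pipe));
        pipe.source().close();
    }
}
```

Any other thread that shares the same channel would then see a closed channel on its next write, which matches the errors described for the second filesystem.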

I have a test for this general case and a patch that moves the actual writing to the socket into a thread owned by the Client object.  This means a client thread can be interrupted without ruining the socket for other clients.
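
The idea of moving the write off the caller's thread can be sketched as follows. This is not the attached patch: ChannelSender and its methods are hypothetical names, and a single-thread ExecutorService is assumed as the "thread owned by the Client". The point it shows is that an interrupt delivered to the caller surfaces as InterruptedException from Future.get(), while the thread that actually touches the channel is never interrupted.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.Pipe;
import java.nio.channels.WritableByteChannel;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical helper: all writes to the channel happen on one owned thread.
final class ChannelSender implements AutoCloseable {
    private final WritableByteChannel channel;
    private final ExecutorService sender;

    ChannelSender(WritableByteChannel channel) {
        this.channel = channel;
        this.sender = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "rpc-sender");
            t.setDaemon(true);
            return t;
        });
    }

    /** Queues the payload and waits for the write; only the wait is interruptible. */
    void send(byte[] payload) throws IOException, InterruptedException {
        Future<?> pending = sender.submit(() -> {
            ByteBuffer buf = ByteBuffer.wrap(payload);
            while (buf.hasRemaining()) {
                channel.write(buf);
            }
            return null;
        });
        try {
            // Deliberately no pending.cancel(true) on interrupt: that would
            // interrupt the sender thread and close the channel after all.
            pending.get();
        } catch (ExecutionException e) {
            Throwable cause = e.getCause();
            if (cause instanceof IOException) {
                throw (IOException) cause;
            }
            throw new IOException(cause);
        }
    }

    @Override
    public void close() {
        sender.shutdown();   // shutdown() does not interrupt a write in progress
    }

    public static void main(String[] args) throws Exception {
        Pipe pipe = Pipe.open();
        try (ChannelSender s = new ChannelSender(pipe.sink())) {
            Thread.currentThread().interrupt();      // the caller is interrupted...
            try {
                s.send(new byte[] {1});
            } catch (InterruptedException e) {
                // ...and sees the interrupt, but the write was not disturbed.
            }
            Thread.interrupted();                    // clear the flag if send() returned normally
            System.out.println("channel still open: " + pipe.sink().isOpen());
        }
    }
}
```

One consequence of this design is that an interrupted caller cannot tell whether its bytes were sent, since the queued write still runs to completion on the sender thread.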

Note: other socket errors may occur that make the socket unusable.  The patch doesn't handle these; it is only intended to help with the interrupt case, since that is common with FileSystem.close().

We might also want to consider finding a way to fail fast when the RPC goes bad.  As near as I can tell from watching this happen, the underlying RPC stays in a bad state until the filesystem is closed.  It seems like we could fail on one operation, detect the bad socket, and perhaps recreate the socket or the whole RPC object.  Not sure where this retry logic should go.
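
As a rough illustration of the detect-and-recreate idea only, the sketch below checks whether the channel is still open before each write and asks a factory for a fresh one if it is not. Everything here is an assumption made for illustration: ReconnectingWriter and ChannelFactory are hypothetical names, nothing like this is claimed to exist in Hadoop, and it does not answer where such logic belongs in the RPC layer.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.Pipe;
import java.nio.channels.WritableByteChannel;

// Hypothetical wrapper: replaces a closed channel instead of failing forever.
final class ReconnectingWriter {

    /** Hypothetical factory for new connections. */
    interface ChannelFactory {
        WritableByteChannel open() throws IOException;
    }

    private final ChannelFactory factory;
    private WritableByteChannel channel;

    ReconnectingWriter(ChannelFactory factory) throws IOException {
        this.factory = factory;
        this.channel = factory.open();
    }

    synchronized void write(byte[] payload) throws IOException {
        if (!channel.isOpen()) {
            // The previous channel was closed (for example by an interrupted
            // write on another thread); open a new one before this write.
            channel = factory.open();
        }
        ByteBuffer buf = ByteBuffer.wrap(payload);
        while (buf.hasRemaining()) {
            channel.write(buf);
        }
    }

    public static void main(String[] args) throws IOException {
        // A Pipe stands in for a real connection in this demo.
        ReconnectingWriter w = new ReconnectingWriter(() -> Pipe.open().sink());
        w.write(new byte[] {1});
    }
}
```

A real RPC client would also have to decide what to do with calls that were in flight on the old connection, which this sketch ignores.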

> exception while doing RPC I/O closes channel
> --------------------------------------------
>
>                 Key: HADOOP-6762
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6762
>             Project: Hadoop Common
>          Issue Type: Bug
>    Affects Versions: 0.20.2
>            Reporter: sam rash
>
> If a single process creates two unique FileSystem instances to the same NN using FileSystem.newInstance(), and one of them issues a close(), the leasechecker thread is interrupted.  This interrupt races with the RPC namenode.renew() and can cause a ClosedByInterruptException.  This closes the underlying channel, and the other filesystem, which shares the connection, will get errors.
