Posted to common-dev@hadoop.apache.org by "Steve Loughran (JIRA)" <ji...@apache.org> on 2009/05/28 13:28:45 UTC

[jira] Created: (HADOOP-5933) Make it harder to accidentally close a shared DFSClient

Make it harder to accidentally close a shared DFSClient
-------------------------------------------------------

                 Key: HADOOP-5933
                 URL: https://issues.apache.org/jira/browse/HADOOP-5933
             Project: Hadoop Core
          Issue Type: Improvement
          Components: fs
    Affects Versions: 0.21.0
            Reporter: Steve Loughran
            Priority: Minor


Every so often I get stack traces telling me that DFSClient is closed, usually in {{org.apache.hadoop.hdfs.DFSClient.checkOpen()}}. The root cause is usually that one thread has closed a shared client while another thread still has a reference to it. If the other thread then asks for a new client it will get one (and the cache repopulated), but if it has one already, then I get to see a stack trace.

It's effectively a race condition between clients in different threads. 
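To make the race concrete, here is a minimal sketch of the pattern (the class name and paths are hypothetical; {{FileSystem.get()}} hands back the cached instance discussed in the comments below):

{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical sketch: two pieces of code end up sharing one cached client.
public class SharedClientRace {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // "Thread A" gets the cached instance and holds on to it.
    FileSystem held = FileSystem.get(conf);

    // "Thread B" gets the *same* cached instance back, does a quick
    // existence check, then tidies up after itself.
    FileSystem fs = FileSystem.get(conf);    // same object as 'held'
    fs.exists(new Path("/some/file"));       // hypothetical path
    fs.close();                              // closes A's client too

    // A's next call now trips checkOpen() with "Filesystem closed".
    held.exists(new Path("/other/file"));
  }
}
{code}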

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-5933) Make it harder to accidentally close a shared DFSClient

Posted by "Steve Loughran (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-5933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12714436#action_12714436 ] 

Steve Loughran commented on HADOOP-5933:
----------------------------------------

There isn't direct caching for DFSClients, but there is in {{FileSystem.get()}}, which is how I've been getting DFSClient instances (and those of other filesystems).

Up until March I could have different things get filesystems, do some work, and then close them, but now in SVN_HEAD I'm seeing stack traces when different threads try to work with filesystem instances they have been holding on to. So one thread running a TaskTracker is happily spinning away, until something on a different thread does a quick check that a specific file exists in the filesystem and calls close() afterwards.

My stack traces are here: http://jira.smartfrog.org/jira/browse/SFOS-1208

The semantics of {{FileSystem.get()}} have changed; if I move my code to the new {{FileSystem.newInstance()}} method then things should work again. That doesn't mean we don't benefit from tracing who closed the instance, only that anyone else doing work in different threads who gets filesystem clients by way of {{FileSystem.get()}} is going to encounter the same problems. I just saw them first :)
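For anyone hitting the same thing, the workaround looks roughly like this (a sketch, assuming the {{FileSystem.newInstance()}} method mentioned above):

{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

Configuration conf = new Configuration();

// Cached instance: shared across the JVM, so never close it yourself.
FileSystem sharedFs = FileSystem.get(conf);

// Private instance: bypasses the cache, so this thread owns it and
// may safely close it when done.
FileSystem privateFs = FileSystem.newInstance(conf);
try {
  // ... work with privateFs ...
} finally {
  privateFs.close();   // only shuts down this instance
}
{code}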



[jira] Updated: (HADOOP-5933) Make it harder to accidentally close a shared DFSClient

Posted by "Steve Loughran (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-5933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steve Loughran updated HADOOP-5933:
-----------------------------------

    Attachment: HADOOP-5933.patch

This is the first solution: some extra diagnostics. Its main cost when the log is not set to debug is one extra reference.

I don't really like log settings changing program behaviour, so I'm not sure that anyone will want to check this patch in; it's just what I put together to track down my problems. The real problem is that the caching system isn't compatible with users of DFSClient calling close() on their clients.
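The shape of the diagnostic is roughly this (a sketch of the idea, not the patch itself; the field and message names are illustrative):

{code}
// Inside DFSClient (sketch). The "one extra reference" is closeTrace.
private volatile IOException closeTrace;

public synchronized void close() throws IOException {
  if (LOG.isDebugEnabled()) {
    // Record where the close came from; only pays the cost in debug.
    closeTrace = new IOException("DFSClient closed here");
  }
  // ... existing shutdown logic ...
  clientRunning = false;
}

private void checkOpen() throws IOException {
  if (!clientRunning) {
    IOException ioe = new IOException("Filesystem closed");
    if (closeTrace != null) {
      ioe.initCause(closeTrace);  // callers can see it via getCause()
    }
    throw ioe;
  }
}
{code}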



[jira] Commented: (HADOOP-5933) Make it harder to accidentally close a shared DFSClient

Posted by "Raghu Angadi (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-5933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12714121#action_12714121 ] 

Raghu Angadi commented on HADOOP-5933:
--------------------------------------

> If the other thread then asks for a new client it will get one (and the cache repopulated), but if it has one already, then I get to see a stack trace.

Steve, what is the issue here? I didn't think there was a cache for DFSClients. Can you post the stack trace from your test?



[jira] Commented: (HADOOP-5933) Make it harder to accidentally close a shared DFSClient

Posted by "Hong Tang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-5933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12714095#action_12714095 ] 

Hong Tang commented on HADOOP-5933:
-----------------------------------

I am not sure it would be a good idea to alter code execution paths based on logging levels - this could lead to harder-to-maintain-or-test code and to harder-to-understand logging messages (without reading the code, a user would expect debug-level logging to be a superset of info/warn logging).



[jira] Commented: (HADOOP-5933) Make it harder to accidentally close a shared DFSClient

Posted by "Raghu Angadi (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-5933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12714116#action_12714116 ] 

Raghu Angadi commented on HADOOP-5933:
--------------------------------------

> I am not sure if it would be a good idea to alter code execution paths based on logging levels

+1. If this feature is committed, the behavior should be the same with or without debug enabled. As a practical matter, it is pretty hard to ask users to enable debug, since that prints boatloads of other stuff.

+1 for the feature. Looking at how hard it is for users to debug such problems, this seems like a useful feature. Users still need to add code to call getCause()... that is ok.




[jira] Commented: (HADOOP-5933) Make it harder to accidentally close a shared DFSClient

Posted by "Steve Loughran (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-5933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12714027#action_12714027 ] 

Steve Loughran commented on HADOOP-5933:
----------------------------------------

Some options:
# If the log is at debug level, generate an exception in close() and save it until the next checkOpen() call is reached, then use that exception as the nested cause of the exception raised there.
# Some complicated reference-count mechanism, with its own leakage problems (sketched below).
# Add the ability to reopen things if they were in the cache and got purged.

I've done the first of these to track down problems, and while I now know where I shouldn't be calling close(), there's a risk that my code will now leak filesystem clients.
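For illustration, a sketch of what option 2 might look like, and where the leakage comes from (all names here are made up):

{code}
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical reference-counted client: close() only really closes
// when the last holder releases it.
private final AtomicInteger refCount = new AtomicInteger(1);

public FileSystem acquire() {
  refCount.incrementAndGet();
  return this;
}

public void close() throws IOException {
  if (refCount.decrementAndGet() == 0) {
    reallyClose();   // hypothetical: the actual shutdown
  }
  // The leakage problem: any holder that acquires but never calls
  // close() pins the count above zero, so the underlying client is
  // never shut down.
}
{code}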
