Posted to common-dev@hadoop.apache.org by "dhruba borthakur (JIRA)" <ji...@apache.org> on 2006/11/28 06:53:21 UTC

[jira] Created: (HADOOP-752) Possible locking issues in HDFS Namenode

Possible locking issues in HDFS Namenode
----------------------------------------

                 Key: HADOOP-752
                 URL: http://issues.apache.org/jira/browse/HADOOP-752
             Project: Hadoop
          Issue Type: Bug
          Components: dfs
            Reporter: dhruba borthakur
         Assigned To: dhruba borthakur


I have been investigating the cause of random Namenode memory corruption and memory overflows. Please comment.

 1. The functions datanodeReport() and DFSNodesStatus() do not acquire the global lock.
    This can race with another thread invoking registerDatanode(): registerDatanode()
    can remove a datanode (through wipeDatanode()) while the datanodeReport thread is
    still traversing the list of datanodes, which can throw exceptions (see the sketch
    after this list).

 2. The blocksMap is protected by the global lock, but setReplication() does not acquire
    that lock when it calls proccessOverReplicatedBlock(). This can corrupt blocksMap.
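
To illustrate item 1: below is a minimal, self-contained Java sketch (stand-in class and field names, not the actual FSNamesystem code) of why the traversal needs the same global lock that registration and removal hold. Here the object's monitor plays the role of the global lock; if the synchronized keyword is dropped from datanodeReport(), a concurrent wipeDatanode() typically makes the iteration throw ConcurrentModificationException or return inconsistent counts.

import java.util.Map;
import java.util.TreeMap;

// Stand-in for FSNamesystem: one global lock (this object's monitor) guards datanodeMap.
public class GlobalLockSketch {
    private final Map<String, String> datanodeMap = new TreeMap<>();

    // Analogous to wipeDatanode(): removes a datanode under the global lock.
    public synchronized void wipeDatanode(String nodeId) {
        datanodeMap.remove(nodeId);
    }

    // Analogous to datanodeReport()/DFSNodesStatus(): the bug was traversing
    // datanodeMap WITHOUT the global lock; the fix is to hold it (here, a
    // synchronized method) for the whole traversal.
    public synchronized int datanodeReport() {
        int live = 0;
        for (Map.Entry<String, String> e : datanodeMap.entrySet()) {
            if ("live".equals(e.getValue())) {
                live++;
            }
        }
        return live;
    }

    public synchronized void registerDatanode(String nodeId) {
        datanodeMap.put(nodeId, "live");
    }

    public static void main(String[] args) throws InterruptedException {
        GlobalLockSketch ns = new GlobalLockSketch();
        for (int i = 0; i < 1000; i++) {
            ns.registerDatanode("dn-" + i);
        }
        // One thread churns registrations/removals while another reports.
        Thread churn = new Thread(() -> {
            for (int i = 0; i < 1000; i++) {
                ns.wipeDatanode("dn-" + i);
                ns.registerDatanode("dn-" + i);
            }
        });
        Thread report = new Thread(() -> {
            for (int i = 0; i < 1000; i++) {
                ns.datanodeReport();
            }
        });
        churn.start();
        report.start();
        churn.join();
        report.join();
        System.out.println("live datanodes: " + ns.datanodeReport());
        // Removing 'synchronized' from datanodeReport() typically makes this
        // throw ConcurrentModificationException or print inconsistent counts.
    }
}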




-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] Commented: (HADOOP-752) Possible locking issues in HDFS Namenode

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/HADOOP-752?page=comments#action_12455440 ] 
            
Hadoop QA commented on HADOOP-752:
----------------------------------

+1. http://issues.apache.org/jira/secure/attachment/12346375/namenodelocking.patch applied and successfully tested against trunk revision 482393.


[jira] Commented: (HADOOP-752) Possible locking issues in HDFS Namenode

Posted by "Raghu Angadi (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/HADOOP-752?page=comments#action_12454079 ] 
            
Raghu Angadi commented on HADOOP-752:
-------------------------------------


DFSNodesStatus() locks 'heartBeats' and 'datanodeMap'. As you noted, these are not locked in registerDatanode().

I think we should have an explicitly stated locking policy: which locks are held to protect which state. That would help anyone writing new code or reading the existing code.
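
As a purely hypothetical illustration of what such a stated policy might look like (illustrative names, not the real FSNamesystem fields or a concrete proposal), it could be a class-level comment that records which lock guards which state and the required lock order:

/**
 * Hypothetical sketch of an explicitly stated locking policy:
 *
 *   - The object's monitor is the "global lock"; it protects blocksMap,
 *     datanodeMap and the namespace.
 *   - heartbeats has its own monitor; when both locks are needed, take the
 *     global lock first, then heartbeats, never the other way around.
 *   - Any method that traverses or mutates guarded state must hold the
 *     corresponding lock for the entire traversal or mutation.
 */
class LockingPolicySketch {
    private final Object heartbeats = new Object();

    // Example of the documented lock order: global lock, then heartbeats.
    synchronized void updateUnderBothLocks() {
        synchronized (heartbeats) {
            // ... read or update heartbeat state here ...
        }
    }
}

Writing the policy down next to the guarded fields makes it something new code can be reviewed against, rather than something each reader has to reverse-engineer.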



[jira] Updated: (HADOOP-752) Possible locking issues in HDFS Namenode

Posted by "dhruba borthakur (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/HADOOP-752?page=all ]

dhruba borthakur updated HADOOP-752:
------------------------------------

    Status: Patch Available  (was: Open)

Added locking to setReplication(), datanodeReport() and DFSNodesStatus().
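
The actual change is in the attached namenodelocking.patch and is not reproduced here. As a rough, hypothetical sketch of the shape of such a fix for item 2 (simplified stand-ins for blocksMap and the global lock, not the real FSNamesystem), setReplication() holds the object's monitor for the whole call, so the over-replicated-block handling only ever touches blocksMap under that lock:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified shape of the fix (not the actual patch).
public class SetReplicationLockSketch {
    // Stand-in for blocksMap: file -> datanodes currently holding a replica.
    private final Map<String, List<String>> blocksMap = new HashMap<>();

    public synchronized void addReplica(String file, String datanode) {
        blocksMap.computeIfAbsent(file, f -> new ArrayList<>()).add(datanode);
    }

    public synchronized boolean setReplication(String file, short replication) {
        if (!blocksMap.containsKey(file)) {
            return false;
        }
        // Previously the equivalent of this call ran without the global lock;
        // keeping it inside the synchronized method protects blocksMap.
        proccessOverReplicatedBlock(file, replication);   // name as written in the issue
        return true;
    }

    // Caller must hold the global lock; the assert documents that expectation.
    private void proccessOverReplicatedBlock(String file, short replication) {
        assert Thread.holdsLock(this);
        List<String> replicas = blocksMap.get(file);
        while (replicas.size() > replication) {
            replicas.remove(replicas.size() - 1);   // drop excess replicas
        }
    }

    public static void main(String[] args) {
        SetReplicationLockSketch ns = new SetReplicationLockSketch();
        for (int i = 0; i < 5; i++) {
            ns.addReplica("/user/data/file1", "dn-" + i);
        }
        ns.setReplication("/user/data/file1", (short) 3);
        System.out.println(ns.blocksMap.get("/user/data/file1"));  // [dn-0, dn-1, dn-2]
    }
}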


[jira] Updated: (HADOOP-752) Possible locking issues in HDFS Namenode

Posted by "dhruba borthakur (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/HADOOP-752?page=all ]

dhruba borthakur updated HADOOP-752:
------------------------------------

    Attachment: namenodelocking.patch

Added locking to setReplication(), datanodeReport() and DFSNodesStatus().


[jira] Updated: (HADOOP-752) Possible locking issues in HDFS Namenode

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/HADOOP-752?page=all ]

Doug Cutting updated HADOOP-752:
--------------------------------

           Status: Resolved  (was: Patch Available)
    Fix Version/s: 0.10.0
       Resolution: Fixed

I just committed this.  Thanks, Dhruba!
