Posted to common-dev@hadoop.apache.org by "Koji Noguchi (JIRA)" <ji...@apache.org> on 2008/09/03 20:05:44 UTC

[jira] Created: (HADOOP-4061) Large number of decommission freezes the Namenode

Large number of decommission freezes the Namenode
-------------------------------------------------

                 Key: HADOOP-4061
                 URL: https://issues.apache.org/jira/browse/HADOOP-4061
             Project: Hadoop Core
          Issue Type: Bug
          Components: dfs
    Affects Versions: 0.17.2
            Reporter: Koji Noguchi


On a 1900-node cluster, we tried decommissioning 400 nodes with 30k blocks each. The other 1500 nodes were almost empty.

When decommissioning started, the namenode's queue overflowed every 6 minutes.

Looking at the CPU usage, every 5 minutes the org.apache.hadoop.dfs.FSNamesystem$DecommissionedMonitor thread was taking 100% of the CPU for 1 minute, causing the queue to overflow.

{noformat}
  public synchronized void decommissionedDatanodeCheck() {
    for (Iterator<DatanodeDescriptor> it = datanodeMap.values().iterator();
         it.hasNext();) {
      DatanodeDescriptor node = it.next();
      checkDecommissionStateInternal(node);
    }
  }
{noformat}








[jira] Commented: (HADOOP-4061) Large number of decommission freezes the Namenode

Posted by "Konstantin Shvachko (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12650334#action_12650334 ] 

Konstantin Shvachko commented on HADOOP-4061:
---------------------------------------------

I think this code is actually simpler than the original version and, importantly, more efficient.
There is a clear iterator that hides all of the iteration logic inside.
If there is a way to avoid storing references, why store them?
The way Nicholas did this is by using standard Java library classes and methods.



[jira] Commented: (HADOOP-4061) Large number of decommission freezes the Namenode

Posted by "Nigel Daley (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12628677#action_12628677 ] 

Nigel Daley commented on HADOOP-4061:
-------------------------------------

This will need some repeatable testing at scale.  A decommission benchmark?



[jira] Commented: (HADOOP-4061) Large number of decommission freezes the Namenode

Posted by "Konstantin Shvachko (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12649543#action_12649543 ] 

Konstantin Shvachko commented on HADOOP-4061:
---------------------------------------------

Nicholas, so what happens if a data-node dies and is removed from datanodeMap?
It will still be in the DecommissionManager queue.
Can we implement some kind of iterator over nodes in the decommission-in-progress state by just scanning datanodeMap, without storing extra references to data-nodes?



[jira] Commented: (HADOOP-4061) Large number of decommission freezes the Namenode

Posted by "Tsz Wo (Nicholas), SZE (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12650430#action_12650430 ] 

Tsz Wo (Nicholas), SZE commented on HADOOP-4061:
------------------------------------------------

{noformat}
     [exec] +1 overall.  

     [exec]     +1 @author.  The patch does not contain any @author tags.

     [exec]     +1 tests included.  The patch appears to include 6 new or modified tests.

     [exec]     +1 javadoc.  The javadoc tool did not generate any warning messages.

     [exec]     +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

     [exec]     +1 findbugs.  The patch does not introduce any new Findbugs warnings.

     [exec]     +1 Eclipse classpath. The patch retains Eclipse classpath integrity.
{noformat}



[jira] Commented: (HADOOP-4061) Large number of decommission freezes the Namenode

Posted by "Konstantin Shvachko (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12650302#action_12650302 ] 

Konstantin Shvachko commented on HADOOP-4061:
---------------------------------------------

+1 I like the approach of using CyclicIterator.



[jira] Commented: (HADOOP-4061) Large number of decommission freezes the Namenode

Posted by "Raghu Angadi (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12628200#action_12628200 ] 

Raghu Angadi commented on HADOOP-4061:
--------------------------------------

A related fix that could reduce this load: the NameNode should delete a block from a decommissioned node once it has been replicated elsewhere. i.e.

If block A belongs to decommissioned node Dx and the NameNode replicates the block to another node Dy, then once Dy reports that it has the block, the NN should remove A from Dx. This should be handled in addStoredBlock, I guess.
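For illustration only, a rough sketch of that idea; the helper names below are hypothetical and this is not the actual FSNamesystem code:

{noformat}
// Hypothetical sketch: when a datanode reports a newly stored replica, drop the
// copy still held by a decommissioned node once the block has enough live replicas,
// treating it like any other excess replica.
void onReplicaReported(Block block, DatanodeDescriptor reporter) {
  if (countLiveReplicas(block) >= getReplication(block)) {      // hypothetical helpers
    for (DatanodeDescriptor d : replicasOf(block)) {
      if (d.isDecommissioned()) {
        scheduleReplicaDeletion(block, d);                      // hypothetical helper
      }
    }
  }
}
{noformat}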




[jira] Commented: (HADOOP-4061) Large number of decommission freezes the Namenode

Posted by "Raghu Angadi (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12628201#action_12628201 ] 

Raghu Angadi commented on HADOOP-4061:
--------------------------------------

Deleting A from Dx is correct because that is how the NN handles excess replicas in general.



[jira] Commented: (HADOOP-4061) Large number of decommission freezes the Namenode

Posted by "Konstantin Shvachko (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12649517#action_12649517 ] 

Konstantin Shvachko commented on HADOOP-4061:
---------------------------------------------

Raghu, I don't think we should redesign the decommission feature here. Removing blocks would be a redesign. The next question would be why these empty nodes are hanging around; let's shut them all down. Please check HADOOP-681. We should optimize the process while keeping the same behavior and semantics.
In the current design we treat decommissioned nodes as read-only.



[jira] Updated: (HADOOP-4061) Large number of decommission freezes the Namenode

Posted by "Tsz Wo (Nicholas), SZE (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tsz Wo (Nicholas), SZE updated HADOOP-4061:
-------------------------------------------

    Attachment: 4061_20081124b.patch

Suresh, thanks for taking a look at my patch!

> In CyclicIterable.java, navigablemap != null check not required
You are right.

> In DecommissionManager.java, move setting the firstkey in to the return condition.
We cannot move setting the firstkey since the iteration may end before ++count == numNodesPerCheck.

4061_20081124b.patch: removed the navigablemap != null check and changed some log messages.
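For illustration, a rough sketch of that check loop (not the patch itself), assuming datanodeMap is a NavigableMap<String, DatanodeDescriptor> and that firstkey and numNodesPerCheck are fields of the manager; CyclicIteration is the cyclic iterator discussed in this thread:

{noformat}
// Sketch: check at most numNodesPerCheck decommission-in-progress nodes per run,
// resuming the cyclic scan of datanodeMap where the previous run stopped.
void check() {
  int count = 0;
  for (Map.Entry<String, DatanodeDescriptor> entry :
       new CyclicIteration<String, DatanodeDescriptor>(datanodeMap, firstkey)) {
    // The resume point must be recorded on every step, because the cyclic pass
    // may run out of entries before ++count == numNodesPerCheck is ever reached.
    firstkey = entry.getKey();
    DatanodeDescriptor d = entry.getValue();
    if (d.isDecommissionInProgress()) {
      checkDecommissionStateInternal(d);
      if (++count == numNodesPerCheck) {
        return;  // limit for this interval reached
      }
    }
  }
}
{noformat}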



[jira] Updated: (HADOOP-4061) Large number of decommission freezes the Namenode

Posted by "Tsz Wo (Nicholas), SZE (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tsz Wo (Nicholas), SZE updated HADOOP-4061:
-------------------------------------------

    Attachment: 4061_20081124.patch

4061_20081124.patch: fixed a bug and added a test for CyclicIterator



[jira] Updated: (HADOOP-4061) Large number of decommission freezes the Namenode

Posted by "Raghu Angadi (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raghu Angadi updated HADOOP-4061:
---------------------------------

    Attachment: HADOOP-4061.patch

Is it efficient? I think it is the opposite: *every* next() operation is a tree search. Assuming that is ok, here is a patch that does pretty much the same thing and is 150+ lines smaller.  Once in a while one iteration might scan fewer than 5 nodes; that is ok, which is why we have the shorter 30-second sleep.

Nicholas, please attach your patch again so that Hudson picks up yours.





[jira] Commented: (HADOOP-4061) Large number of decommission freezes the Namenode

Posted by "Tsz Wo (Nicholas), SZE (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12649521#action_12649521 ] 

Tsz Wo (Nicholas), SZE commented on HADOOP-4061:
------------------------------------------------

Created HADOOP-4701.  Please continue the decommission performance improvement discussion there.  Thanks.



[jira] Updated: (HADOOP-4061) Large number of decommission freezes the Namenode

Posted by "Tsz Wo (Nicholas), SZE (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tsz Wo (Nicholas), SZE updated HADOOP-4061:
-------------------------------------------

        Assignee: Tsz Wo (Nicholas), SZE
    Release Note: Added a new conf property dfs.namenode.decommission.nodes.per.interval so that the NameNode checks the decommission status of x nodes every y seconds, where x is the value of dfs.namenode.decommission.nodes.per.interval and y is the value of dfs.namenode.decommission.interval.
    Hadoop Flags: [Incompatible change]
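For illustration, a small sketch of reading the two properties; the defaults shown (5 nodes, 30 seconds) are the values mentioned elsewhere in this thread, not necessarily the committed defaults:

{noformat}
import org.apache.hadoop.conf.Configuration;

public class DecommissionConfSketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // How many nodes to check per interval...
    int nodesPerInterval =
        conf.getInt("dfs.namenode.decommission.nodes.per.interval", 5);
    // ...and how long each interval is, in seconds.
    long intervalSeconds = conf.getLong("dfs.namenode.decommission.interval", 30);
    System.out.println("check " + nodesPerInterval + " nodes every "
        + intervalSeconds + " seconds");
  }
}
{noformat}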



[jira] Commented: (HADOOP-4061) Large number of decommission freezes the Namenode

Posted by "Raghu Angadi (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12649501#action_12649501 ] 

Raghu Angadi commented on HADOOP-4061:
--------------------------------------

> Deleting already decommissioned blocks as Raghu proposes is also not very good. Until the node is shut down its blocks can be  accessed for read. We don't want to change that.

The proposal is to delete the replica *after* the block is properly replicated. That seems like the right thing to do and a must for scalability.  The main CPU problem here comes from keeping all these excess blocks around.

I don't see any use in keeping over-replicated blocks. If it were useful, we should not delete any over-replicated block unless there is no space left.

Regarding the patch, the loop should count 5 decommissioned nodes rather than any 5 nodes. Otherwise, if you have 2000 nodes then each decommissioned node would be checked only once in 16 hours or so (even if you decommission just one).




[jira] Commented: (HADOOP-4061) Large number of decommission freezes the Namenode

Posted by "Konstantin Shvachko (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12649821#action_12649821 ] 

Konstantin Shvachko commented on HADOOP-4061:
---------------------------------------------

This should work, but maybe it would be even better not to create any references at all.
I propose to store the last checked {{DatanodeDescriptor lastDN}} in DatanodeManager and get a {{datanodeMap.tail(lastDN, nonInclusive)}} on each iteration. The tail map can be used instead of the queue, since it has the analogous methods {{getFirstEntry()}} and {{pollFirstEntry()}}.



[jira] Commented: (HADOOP-4061) Large number of decommission freezes the Namenode

Posted by "Raghu Angadi (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12650304#action_12650304 ] 

Raghu Angadi commented on HADOOP-4061:
--------------------------------------

I think there is too much code that is not really required, probably complicating this more than necessary.

- Storing references should be fine. This is Java; we should just check whether the DN has already been deleted.
- Alternately, in each iteration the monitor can:
     --- list all the decommissioned nodes, and
     --- just pick five of them randomly to check.

Just a suggestion.




[jira] Updated: (HADOOP-4061) Large number of decommission freezes the Namenode

Posted by "Tsz Wo (Nicholas), SZE (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tsz Wo (Nicholas), SZE updated HADOOP-4061:
-------------------------------------------

    Attachment: 4061_20081123.patch

bq. I propose to store last checked DatanodeDescriptor lastDN in DatanodeManager and get a datanodeMap.tail(lastDN, nonInclusive) on each iteration. The tail map can be used instead of the queue, since it has analogous methods getFirstEntry() and pollFirstEntry().

We can't poll elements from the tail map since the tail map is backed by the original map.  Polling elements from the tail map changes the original map.

We don't really need a tail map.  We need a cyclic iterator that starts an iteration from any point of the map and follows the map's ordering.  If the iterator hits the last entry of the map, it then continues from the first entry.

4061_20081123.patch: implement cyclic iterator
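For illustration, a minimal self-contained sketch of such a cyclic iterator over a NavigableMap (not the actual class added by the patch):

{noformat}
import java.util.Iterator;
import java.util.Map;
import java.util.NavigableMap;

/** Iterates over all entries of a NavigableMap, starting just after a given
 *  key and wrapping around to the first entry after the last one.
 *  (Handling of a null start key is omitted for brevity.) */
class CyclicIteration<K, V> implements Iterable<Map.Entry<K, V>> {
  private final NavigableMap<K, V> map;
  private final K start;

  CyclicIteration(NavigableMap<K, V> map, K start) {
    this.map = map;
    this.start = start;
  }

  public Iterator<Map.Entry<K, V>> iterator() {
    // Entries strictly after the start key, then the entries up to and
    // including the start key: one full pass over the map.
    final Iterator<Map.Entry<K, V>> tail =
        map.tailMap(start, false).entrySet().iterator();
    final Iterator<Map.Entry<K, V>> head =
        map.headMap(start, true).entrySet().iterator();
    return new Iterator<Map.Entry<K, V>>() {
      public boolean hasNext() { return tail.hasNext() || head.hasNext(); }
      public Map.Entry<K, V> next() {
        return tail.hasNext() ? tail.next() : head.next();
      }
      public void remove() { throw new UnsupportedOperationException(); }
    };
  }
}
{noformat}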



[jira] Commented: (HADOOP-4061) Large number of decommission freezes the Namenode

Posted by "Tsz Wo (Nicholas), SZE (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12648866#action_12648866 ] 

Tsz Wo (Nicholas), SZE commented on HADOOP-4061:
------------------------------------------------

Instead of checking all blocks on all decommissioning nodes every 5 minutes, we should check a limited number of blocks over a shorter period.  That would throttle decommission checking in the Namenode.  A drawback is that the decommission status update may be delayed.

> Deleting A from Dx is correct because that is how the NN handles excess replicas in general.

This is probably a good idea for improving performance.



[jira] Issue Comment Edited: (HADOOP-4061) Large number of decommission freezes the Namenode

Posted by "Raghu Angadi (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12649525#action_12649525 ] 

rangadi edited comment on HADOOP-4061 at 11/20/08 3:12 PM:
----------------------------------------------------------------

(edit : correct jira number)
> I don't think we should redesign the decommission feature here. Removing blocks would be a redesign.

hmm.. this is not done in the patch anyway. It does not look like a redesign in any way to me. It is fairly simple to me: if an excess replica is on a normal DN, it will be deleted (right?), so there is no reason to keep it when it happens to be on a decommissioned node. Alright, I noticed HADOOP-4701.

      was (Author: rangadi):
    > I don't think we should redesign decommission feature here. Removing blocks will be a redesign. 

hmm.. this is not done in the patch anyway. I does not look like redesign in any way to me. It is fairly simple to me : if excess replica is on a normal DN, it will be deleted (right?), then there is no reason to keep it if that happens to be a decommissioned node. Alright, I noticed HADOOP-4071.
  


[jira] Updated: (HADOOP-4061) Large number of decommission freezes the Namenode

Posted by "Tsz Wo (Nicholas), SZE (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tsz Wo (Nicholas), SZE updated HADOOP-4061:
-------------------------------------------

    Attachment: 4061_20081124c_0.18.patch

4061_20081124c_0.18.patch: for 0.18

The patch passed all tests locally.  I am going to commit this to 0.18 and above.



[jira] Updated: (HADOOP-4061) Large number of decommission freezes the Namenode

Posted by "Tsz Wo (Nicholas), SZE (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tsz Wo (Nicholas), SZE updated HADOOP-4061:
-------------------------------------------

    Hadoop Flags: [Incompatible change, Reviewed]  (was: [Incompatible change])
          Status: Patch Available  (was: Open)



[jira] Commented: (HADOOP-4061) Large number of decommission freezes the Namenode

Posted by "Raghu Angadi (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12649504#action_12649504 ] 

Raghu Angadi commented on HADOOP-4061:
--------------------------------------

> I would rather go with the approach, which counts down decommissioned blocks as they are replicated. Then there is no need to scan all blocks to verify the node is decommissioned, just check the counter.

The above is also useful and should help a lot here. Note that these unnecessary blocks have overhead in other operations too (not just when checking the decommissioned nodes). Essentially, there is no need to treat these separately; we just need to give preference to deleting blocks from decommissioned nodes when there is excess replication.
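For illustration, a hypothetical sketch of that preference (the method and its callers are invented, not the actual FSNamesystem replica-removal code):

{noformat}
// Hypothetical sketch: when picking which excess replica of a block to delete,
// prefer a copy sitting on a decommissioned or decommissioning node.
DatanodeDescriptor chooseExcessReplicaToDelete(List<DatanodeDescriptor> replicas) {
  for (DatanodeDescriptor d : replicas) {
    if (d.isDecommissioned() || d.isDecommissionInProgress()) {
      return d;
    }
  }
  // Otherwise fall back to whatever the existing policy is, e.g. the last replica.
  return replicas.get(replicas.size() - 1);
}
{noformat}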



[jira] Updated: (HADOOP-4061) Large number of decommission freezes the Namenode

Posted by "Tsz Wo (Nicholas), SZE (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tsz Wo (Nicholas), SZE updated HADOOP-4061:
-------------------------------------------

    Attachment: 4061_20081120.patch

4061_20081120.patch:

Thanks, Konstantin and Raghu!

@Konstantin
- This is not true.  dfs.namenode.decommission.interval is always in seconds, not minutes.  Only the default value is changed.  Could you check it again?

- For the code refactoring, I am going to do it in the next step.  As mentioned before, I am going to improve decommission performance.  For now, I will change the class name to DecommissionManager and have Monitor as an inner class.

- I would rather rename it to i for iterator and declare it inside the for-loop header.

Let me create a new issue for improving decommission performance and discuss it there.  I believe we need more thought.

@Raghu> the loop should count 5 decommissioned nodes

Yes, we should count nodes with decommission in progress.



[jira] Commented: (HADOOP-4061) Large number of decommission freezes the Namenode

Posted by "Raghu Angadi (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12650361#action_12650361 ] 

Raghu Angadi commented on HADOOP-4061:
--------------------------------------

A non-binding -1 from me. Moving extra code to a new file does not make it free (of maintenance) :-)



[jira] Updated: (HADOOP-4061) Large number of decommission freezes the Namenode

Posted by "Tsz Wo (Nicholas), SZE (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tsz Wo (Nicholas), SZE updated HADOOP-4061:
-------------------------------------------

    Attachment: 4061_20081120b.patch

4061_20081120b.patch: add only decommission-in-progress nodes to the DecommissionManager queue.



[jira] Updated: (HADOOP-4061) Large number of decommission freezes the Namenode

Posted by "Tsz Wo (Nicholas), SZE (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tsz Wo (Nicholas), SZE updated HADOOP-4061:
-------------------------------------------

    Attachment: 4061_20081124c.patch

> Is it efficient? I think it is the opposite: *every* next() operation is a tree search.
This is a good point.

> here is a patch that does pretty much the same thing and is 150+ lines smaller.
Your patch does pretty much the same for the default conf values, i.e. 30 seconds and 5 nodes per interval.  But what if we set them to 1 hour and 100 nodes?

Thanks for your comments, Raghu!

4061_20081124c.patch: use an Iterator instead of higherEntry(...) for linear time iteration.
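For illustration, a small self-contained example of the cost difference being discussed: stepping with higherEntry(...) does a tree search (O(log n)) per element, while an entry iterator walks the map in amortized constant time per step:

{noformat}
import java.util.Iterator;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

public class IterationCostSketch {
  public static void main(String[] args) {
    NavigableMap<Integer, String> map = new TreeMap<Integer, String>();
    for (int i = 0; i < 8; i++) {
      map.put(i, "node" + i);
    }

    // One tree search per step: what higherEntry-based stepping costs.
    for (Map.Entry<Integer, String> e = map.firstEntry(); e != null;
         e = map.higherEntry(e.getKey())) {
      System.out.println(e.getValue());
    }

    // Amortized O(1) per step: walk an entry iterator instead.
    for (Iterator<Map.Entry<Integer, String>> it = map.entrySet().iterator();
         it.hasNext();) {
      System.out.println(it.next().getValue());
    }
  }
}
{noformat}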





[jira] Updated: (HADOOP-4061) Large number of decommission freezes the Namenode

Posted by "Tsz Wo (Nicholas), SZE (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tsz Wo (Nicholas), SZE updated HADOOP-4061:
-------------------------------------------

      Resolution: Fixed
    Hadoop Flags: [Incompatible change, Reviewed]  (was: [Reviewed, Incompatible change])
          Status: Resolved  (was: Patch Available)

I just committed this.

> Large number of decommission freezes the Namenode
> -------------------------------------------------
>
>                 Key: HADOOP-4061
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4061
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>    Affects Versions: 0.17.2
>            Reporter: Koji Noguchi
>            Assignee: Tsz Wo (Nicholas), SZE
>         Attachments: 4061_20081119.patch, 4061_20081120.patch, 4061_20081120b.patch, 4061_20081123.patch, 4061_20081124.patch, 4061_20081124b.patch, 4061_20081124c.patch, 4061_20081124c_0.18.patch, HADOOP-4061.patch
>
>
> On 1900 nodes cluster, we tried decommissioning 400 nodes with 30k blocks each. Other 1500 nodes were almost empty.
> When decommission started, namenode's queue overflowed every 6 minutes.
> Looking at the cpu usage,  it showed that every 5 minutes org.apache.hadoop.dfs.FSNamesystem$DecommissionedMonitor thread was taking 100% of the CPU for 1 minute causing the queue to overflow.
> {noformat}
>   public synchronized void decommissionedDatanodeCheck() {
>     for (Iterator<DatanodeDescriptor> it = datanodeMap.values().iterator();
>          it.hasNext();) {
>       DatanodeDescriptor node = it.next();
>       checkDecommissionStateInternal(node);
>     }
>   }
> {noformat}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-4061) Large number of decommission freezes the Namenode

Posted by "Tsz Wo (Nicholas), SZE (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tsz Wo (Nicholas), SZE updated HADOOP-4061:
-------------------------------------------

    Fix Version/s: 0.20.0
                   0.19.1
                   0.18.3
     Hadoop Flags: [Incompatible change, Reviewed]  (was: [Reviewed, Incompatible change])

> Large number of decommission freezes the Namenode
> -------------------------------------------------
>
>                 Key: HADOOP-4061
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4061
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>    Affects Versions: 0.17.2
>            Reporter: Koji Noguchi
>            Assignee: Tsz Wo (Nicholas), SZE
>             Fix For: 0.18.3, 0.19.1, 0.20.0
>
>         Attachments: 4061_20081119.patch, 4061_20081120.patch, 4061_20081120b.patch, 4061_20081123.patch, 4061_20081124.patch, 4061_20081124b.patch, 4061_20081124c.patch, 4061_20081124c_0.18.patch, HADOOP-4061.patch
>
>
> On 1900 nodes cluster, we tried decommissioning 400 nodes with 30k blocks each. Other 1500 nodes were almost empty.
> When decommission started, namenode's queue overflowed every 6 minutes.
> Looking at the cpu usage,  it showed that every 5 minutes org.apache.hadoop.dfs.FSNamesystem$DecommissionedMonitor thread was taking 100% of the CPU for 1 minute causing the queue to overflow.
> {noformat}
>   public synchronized void decommissionedDatanodeCheck() {
>     for (Iterator<DatanodeDescriptor> it = datanodeMap.values().iterator();
>          it.hasNext();) {
>       DatanodeDescriptor node = it.next();
>       checkDecommissionStateInternal(node);
>     }
>   }
> {noformat}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-4061) Large number of decommission freezes the Namenode

Posted by "Konstantin Shvachko (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12628126#action_12628126 ] 

Konstantin Shvachko commented on HADOOP-4061:
---------------------------------------------

{{decommissionedDatanodeCheck()}} is very inefficient.
Every 5 minutes it scans all blocks belonging to decommissioning nodes and checks their replication. 
With 400 nodes on a 2000-node cluster that is one fifth of all blocks checked under a single lock.

> Large number of decommission freezes the Namenode
> -------------------------------------------------
>
>                 Key: HADOOP-4061
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4061
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>    Affects Versions: 0.17.2
>            Reporter: Koji Noguchi
>
> On 1900 nodes cluster, we tried decommissioning 400 nodes with 30k blocks each. Other 1500 nodes were almost empty.
> When decommission started, namenode's queue overflowed every 6 minutes.
> Looking at the cpu usage,  it showed that every 5 minutes org.apache.hadoop.dfs.FSNamesystem$DecommissionedMonitor thread was taking 100% of the CPU for 1 minute causing the queue to overflow.
> {noformat}
>   public synchronized void decommissionedDatanodeCheck() {
>     for (Iterator<DatanodeDescriptor> it = datanodeMap.values().iterator();
>          it.hasNext();) {
>       DatanodeDescriptor node = it.next();
>       checkDecommissionStateInternal(node);
>     }
>   }
> {noformat}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-4061) Large number of decommission freezes the Namenode

Posted by "Tsz Wo (Nicholas), SZE (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tsz Wo (Nicholas), SZE updated HADOOP-4061:
-------------------------------------------

    Attachment: 4061_20081119.patch

4061_20081119.patch: throttling decommission monitor.

I think it is better to work on performance improvements and benchmarking in separate issues.
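
For illustration only, a rough sketch of what the throttling means (made-up class and field names, not the patch code): each wake-up checks at most a fixed number of nodes under the lock and remembers where to resume, instead of scanning every datanode in one pass. The 5-node / 30-second values below are just the default-like numbers mentioned in the comments.

{noformat}
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical sketch of a throttled monitor, not the actual patch.
class ThrottledMonitorSketch implements Runnable {
  private final NavigableMap<String, Object> datanodeMap = new TreeMap<String, Object>();
  private final int numNodesPerCheck = 5;           // assumed default-like value
  private final long recheckIntervalMillis = 30000; // assumed default-like value
  private String resumeKey = null;                  // last key checked in the previous cycle

  public void run() {
    while (!Thread.currentThread().isInterrupted()) {
      synchronized (datanodeMap) {
        int checked = 0;
        Map.Entry<String, Object> e = (resumeKey == null)
            ? datanodeMap.firstEntry() : datanodeMap.higherEntry(resumeKey);
        while (checked < numNodesPerCheck) {
          if (e == null) {                  // reached the end, wrap around
            e = datanodeMap.firstEntry();
            if (e == null) break;           // map is empty
          }
          checkDecommissionState(e.getValue());
          resumeKey = e.getKey();
          checked++;
          e = datanodeMap.higherEntry(resumeKey);
        }
      }
      try {
        Thread.sleep(recheckIntervalMillis);
      } catch (InterruptedException ie) {
        return;
      }
    }
  }

  private void checkDecommissionState(Object node) {
    // Placeholder for the real per-node replication check.
  }
}
{noformat}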

> Large number of decommission freezes the Namenode
> -------------------------------------------------
>
>                 Key: HADOOP-4061
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4061
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>    Affects Versions: 0.17.2
>            Reporter: Koji Noguchi
>         Attachments: 4061_20081119.patch
>
>
> On 1900 nodes cluster, we tried decommissioning 400 nodes with 30k blocks each. Other 1500 nodes were almost empty.
> When decommission started, namenode's queue overflowed every 6 minutes.
> Looking at the cpu usage,  it showed that every 5 minutes org.apache.hadoop.dfs.FSNamesystem$DecommissionedMonitor thread was taking 100% of the CPU for 1 minute causing the queue to overflow.
> {noformat}
>   public synchronized void decommissionedDatanodeCheck() {
>     for (Iterator<DatanodeDescriptor> it = datanodeMap.values().iterator();
>          it.hasNext();) {
>       DatanodeDescriptor node = it.next();
>       checkDecommissionStateInternal(node);
>     }
>   }
> {noformat}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-4061) Large number of decommission freezes the Namenode

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12651051#action_12651051 ] 

Hudson commented on HADOOP-4061:
--------------------------------

Integrated in Hadoop-trunk #671 (See [http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/671/])
    . Throttle Datanode decommission monitoring in Namenode. (szetszwo)


> Large number of decommission freezes the Namenode
> -------------------------------------------------
>
>                 Key: HADOOP-4061
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4061
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>    Affects Versions: 0.17.2
>            Reporter: Koji Noguchi
>            Assignee: Tsz Wo (Nicholas), SZE
>             Fix For: 0.18.3, 0.19.1, 0.20.0
>
>         Attachments: 4061_20081119.patch, 4061_20081120.patch, 4061_20081120b.patch, 4061_20081123.patch, 4061_20081124.patch, 4061_20081124b.patch, 4061_20081124c.patch, 4061_20081124c_0.18.patch, HADOOP-4061.patch
>
>
> On 1900 nodes cluster, we tried decommissioning 400 nodes with 30k blocks each. Other 1500 nodes were almost empty.
> When decommission started, namenode's queue overflowed every 6 minutes.
> Looking at the cpu usage,  it showed that every 5 minutes org.apache.hadoop.dfs.FSNamesystem$DecommissionedMonitor thread was taking 100% of the CPU for 1 minute causing the queue to overflow.
> {noformat}
>   public synchronized void decommissionedDatanodeCheck() {
>     for (Iterator<DatanodeDescriptor> it = datanodeMap.values().iterator();
>          it.hasNext();) {
>       DatanodeDescriptor node = it.next();
>       checkDecommissionStateInternal(node);
>     }
>   }
> {noformat}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-4061) Large number of decommission freezes the Namenode

Posted by "Konstantin Shvachko (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12649468#action_12649468 ] 

Konstantin Shvachko commented on HADOOP-4061:
---------------------------------------------

- You changed the time units of {{dfs.namenode.decommission.interval}} from minutes to seconds. Is this going to be a problem for those who already use this config variable? If they set it to 5 (currently meaning minutes), it will run every 5 seconds after your patch.
- Do we want to introduce a {{DecommissionMonitor decommissionManager}} member in {{FSNamesystem}}? Then we would be able to move all decommissioning logic into {{DecommissionMonitor}} (or a manager), which is probably partly the goal of this patch.
-- {{FSNamesystem.checkDecommissionStateInternal()}} should be moved into {{DecommissionMonitor}};
-- same for {{startDecommission()}} and {{stopDecommission()}}.
- In {{isReplicationInProgress()}}, could you please rename {{decommissionBlocks}} to {{nodeBlocks}}? It has nothing to do with decommissioning and is confusing.

I think this throttling approach will solve the problem for now, but it is not ideal. Say, if you have 500,000 blocks per node rather than 30,000, then you will have to reconfigure the throttler to scan even fewer nodes.
Deleting already-decommissioned blocks, as Raghu proposes, is also not very good. Until the node is shut down its blocks can still be accessed for reads. We don't want to change that.
I would rather go with an approach that counts down a decommissioning node's blocks as they are replicated. Then there is no need to scan all blocks to verify the node is decommissioned; just check the counter. We can add the total block scan as a sanity check in stopDecommission(). The counter is also a good indicator of how much decommissioning progress has been made at any moment.
We should create a separate jira for these changes.
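
As a rough illustration of the counter idea (hypothetical class and method names, not a patch): each decommissioning node would keep a count of its blocks that are still under-replicated, decremented as replications complete, so the monitor only has to test the counter.

{noformat}
// Hypothetical sketch of the counter-based idea, not an actual patch.
class DecommissioningNodeProgress {
  private int pendingBlocks;  // blocks on this node still waiting to reach target replication

  DecommissioningNodeProgress(int blocksOnNodeAtStart) {
    this.pendingBlocks = blocksOnNodeAtStart;  // set once when decommission starts
  }

  // Called whenever a block that lived on this node reaches its target replication.
  synchronized void blockFullyReplicated() {
    if (pendingBlocks > 0) {
      pendingBlocks--;
    }
  }

  // The monitor only reads the counter; no per-block scan is needed.
  synchronized boolean isDecommissionComplete() {
    return pendingBlocks == 0;
  }

  // Doubles as a progress indicator, as suggested above.
  synchronized int blocksRemaining() {
    return pendingBlocks;
  }
}
{noformat}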

> Large number of decommission freezes the Namenode
> -------------------------------------------------
>
>                 Key: HADOOP-4061
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4061
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>    Affects Versions: 0.17.2
>            Reporter: Koji Noguchi
>            Assignee: Tsz Wo (Nicholas), SZE
>         Attachments: 4061_20081119.patch
>
>
> On 1900 nodes cluster, we tried decommissioning 400 nodes with 30k blocks each. Other 1500 nodes were almost empty.
> When decommission started, namenode's queue overflowed every 6 minutes.
> Looking at the cpu usage,  it showed that every 5 minutes org.apache.hadoop.dfs.FSNamesystem$DecommissionedMonitor thread was taking 100% of the CPU for 1 minute causing the queue to overflow.
> {noformat}
>   public synchronized void decommissionedDatanodeCheck() {
>     for (Iterator<DatanodeDescriptor> it = datanodeMap.values().iterator();
>          it.hasNext();) {
>       DatanodeDescriptor node = it.next();
>       checkDecommissionStateInternal(node);
>     }
>   }
> {noformat}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-4061) Large number of decommission freezes the Namenode

Posted by "Suresh Srinivas (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12650338#action_12650338 ] 

Suresh Srinivas commented on HADOOP-4061:
-----------------------------------------

Minor nits:
In CyclicIterable.java, the navigablemap != null check is not required:
{noformat}
hasnext = navigablemap != null && !next.equals(first);
{noformat}

In DecommissionManager.java, move the setting of firstkey into the return condition:

{noformat}
+          if (++count == numNodesPerCheck) {
>> add here          firstkey = entry.getKey();
+            return;
+          }
{noformat}


> Large number of decommission freezes the Namenode
> -------------------------------------------------
>
>                 Key: HADOOP-4061
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4061
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>    Affects Versions: 0.17.2
>            Reporter: Koji Noguchi
>            Assignee: Tsz Wo (Nicholas), SZE
>         Attachments: 4061_20081119.patch, 4061_20081120.patch, 4061_20081120b.patch, 4061_20081123.patch, 4061_20081124.patch
>
>
> On 1900 nodes cluster, we tried decommissioning 400 nodes with 30k blocks each. Other 1500 nodes were almost empty.
> When decommission started, namenode's queue overflowed every 6 minutes.
> Looking at the cpu usage,  it showed that every 5 minutes org.apache.hadoop.dfs.FSNamesystem$DecommissionedMonitor thread was taking 100% of the CPU for 1 minute causing the queue to overflow.
> {noformat}
>   public synchronized void decommissionedDatanodeCheck() {
>     for (Iterator<DatanodeDescriptor> it = datanodeMap.values().iterator();
>          it.hasNext();) {
>       DatanodeDescriptor node = it.next();
>       checkDecommissionStateInternal(node);
>     }
>   }
> {noformat}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.