Posted to common-dev@hadoop.apache.org by "Christian Kunz (JIRA)" <ji...@apache.org> on 2008/09/07 21:18:46 UTC

[jira] Created: (HADOOP-4103) Alert for missing blocks

Alert for missing blocks
------------------------

                 Key: HADOOP-4103
                 URL: https://issues.apache.org/jira/browse/HADOOP-4103
             Project: Hadoop Core
          Issue Type: New Feature
          Components: dfs
    Affects Versions: 0.17.2
            Reporter: Christian Kunz


A whole bunch of datanodes became dead because of some network problems resulting in heartbeat timeouts, although the datanodes were fine.

Many processes started to fail because of the corrupted filesystem.

In order to catch and diagnose such problems faster, the namenode should detect the corruption automatically and provide a way to alert operations. At the minimum it should show the fact of corruption on the GUI.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-4103) Alert for missing blocks

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12677852#action_12677852 ] 

Hadoop QA commented on HADOOP-4103:
-----------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12401076/HADOOP-4103.patch
  against trunk revision 748861.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 11 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    -1 findbugs.  The patch appears to introduce 2 new Findbugs warnings.

    +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    -1 core tests.  The patch failed core unit tests.

    -1 contrib tests.  The patch failed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/26/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/26/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/26/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/26/console

This message is automatically generated.

> Alert for missing blocks
> ------------------------
>
>                 Key: HADOOP-4103
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4103
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: dfs
>    Affects Versions: 0.17.2
>            Reporter: Christian Kunz
>            Assignee: Raghu Angadi
>             Fix For: 0.20.0
>
>         Attachments: HADOOP-4103.patch, HADOOP-4103.patch, HADOOP-4103.patch
>
>
> A whole bunch of datanodes became dead because of some network problems resulting in  heartbeat timeouts although datanodes were fine.
> Many processes started to fail because of the corrupted filesystem.
> In order to catch and diagnose such problems faster the namenode should detect the corruption automatically and provide a way to alert operations. At the minimum it should show the fact of corruption on the GUI.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-4103) Alert for missing blocks

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12678514#action_12678514 ] 

Hadoop QA commented on HADOOP-4103:
-----------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12401261/HADOOP-4103.patch
  against trunk revision 749318.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 11 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    -1 findbugs.  The patch appears to introduce 2 new Findbugs warnings.

    +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    +1 core tests.  The patch passed core unit tests.

    -1 contrib tests.  The patch failed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/40/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/40/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/40/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/40/console

This message is automatically generated.

> Alert for missing blocks
> ------------------------
>
>                 Key: HADOOP-4103
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4103
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: dfs
>    Affects Versions: 0.17.2
>            Reporter: Christian Kunz
>            Assignee: Raghu Angadi
>             Fix For: 0.20.0
>
>         Attachments: HADOOP-4103.patch, HADOOP-4103.patch, HADOOP-4103.patch, HADOOP-4103.patch
>
>
> A whole bunch of datanodes became dead because of some network problems resulting in  heartbeat timeouts although datanodes were fine.
> Many processes started to fail because of the corrupted filesystem.
> In order to catch and diagnose such problems faster the namenode should detect the corruption automatically and provide a way to alert operations. At the minimum it should show the fact of corruption on the GUI.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-4103) Alert for missing blocks

Posted by "Raghu Angadi (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raghu Angadi updated HADOOP-4103:
---------------------------------

    Fix Version/s: 0.20.0

> Alert for missing blocks
> ------------------------
>
>                 Key: HADOOP-4103
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4103
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: dfs
>    Affects Versions: 0.17.2
>            Reporter: Christian Kunz
>            Assignee: Raghu Angadi
>             Fix For: 0.20.0
>
>         Attachments: HADOOP-4103.patch, HADOOP-4103.patch, HADOOP-4103.patch
>
>
> A whole bunch of datanodes became dead because of some network problems resulting in  heartbeat timeouts although datanodes were fine.
> Many processes started to fail because of the corrupted filesystem.
> In order to catch and diagnose such problems faster the namenode should detect the corruption automatically and provide a way to alert operations. At the minimum it should show the fact of corruption on the GUI.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-4103) Alert for missing blocks

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12681736#action_12681736 ] 

Hudson commented on HADOOP-4103:
--------------------------------

Integrated in Hadoop-trunk #778 (See [http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/778/])
    

> Alert for missing blocks
> ------------------------
>
>                 Key: HADOOP-4103
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4103
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: dfs
>    Affects Versions: 0.17.2
>            Reporter: Christian Kunz
>            Assignee: Raghu Angadi
>             Fix For: 0.20.0
>
>         Attachments: HADOOP-4103-branch-20.patch, HADOOP-4103.patch, HADOOP-4103.patch, HADOOP-4103.patch, HADOOP-4103.patch
>
>
> A whole bunch of datanodes became dead because of some network problems resulting in  heartbeat timeouts although datanodes were fine.
> Many processes started to fail because of the corrupted filesystem.
> In order to catch and diagnose such problems faster the namenode should detect the corruption automatically and provide a way to alert operations. At the minimum it should show the fact of corruption on the GUI.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Issue Comment Edited: (HADOOP-4103) Alert for missing blocks

Posted by "Raghu Angadi (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12666819#action_12666819 ] 

rangadi edited comment on HADOOP-4103 at 1/23/09 5:21 PM:
---------------------------------------------------------------

(Edit : formatting only)

The scope of the fix is narrowed to the following:

* NameNode web UI shows a warning (probably in red) indicating whether there are any missing blocks.
    ** Will mostly add Simon stats for such a number.

* 'dfsadmin -metasave' can be used to find all the missing blocks.
     ** A later jira will enhance -metasave or add a different, more user-friendly command. Currently -metasave is mainly meant for developers.

For this to be a straightforward fix, I need to make one policy change: currently, if a block does not have any good replicas left, it is not included in the "neededReplications" list. I think this was done mainly as an "optimization", but a cluster should not have any blocks in this state, and even the name 'neededReplications' implies such blocks should be included. It would be better if I don't need to add another list that needs to be maintained.





      was (Author: rangadi):
    The scope of the fix is narrowed to the following :

# NameNode webui shows in (probably in red) indicating if there are any missing blocks.
     #will  mostly add simon stats for such a number.

# 'dfsadmin -metasave' can be used to find all the missing blocks
     ## later jira will enhance -metasave or have different command that is more user friendly. currently -metasave is mainly meant for developers.

For this to be a straight forward fix, I need to make one policy change: currently if a block does not have any good replicas left it is not included in "neededReplications" list. I think this was done mainly as an "optimization". But a cluster should not have any blocks this state. even 'neededReplications' name implies such blocks should be included. It would be better if I don't need to add another list that need to be maintained.




  
> Alert for missing blocks
> ------------------------
>
>                 Key: HADOOP-4103
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4103
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: dfs
>    Affects Versions: 0.17.2
>            Reporter: Christian Kunz
>            Assignee: Raghu Angadi
>
> A whole bunch of datanodes became dead because of some network problems resulting in  heartbeat timeouts although datanodes were fine.
> Many processes started to fail because of the corrupted filesystem.
> In order to catch and diagnose such problems faster the namenode should detect the corruption automatically and provide a way to alert operations. At the minimum it should show the fact of corruption on the GUI.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-4103) Alert for missing blocks

Posted by "Raghu Angadi (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raghu Angadi updated HADOOP-4103:
---------------------------------

    Attachment: HADOOP-4103.patch

Minor fix to a string in the unit test.

> Alert for missing blocks
> ------------------------
>
>                 Key: HADOOP-4103
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4103
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: dfs
>    Affects Versions: 0.17.2
>            Reporter: Christian Kunz
>            Assignee: Raghu Angadi
>             Fix For: 0.20.0
>
>         Attachments: HADOOP-4103.patch, HADOOP-4103.patch, HADOOP-4103.patch, HADOOP-4103.patch
>
>
> A whole bunch of datanodes became dead because of some network problems resulting in  heartbeat timeouts although datanodes were fine.
> Many processes started to fail because of the corrupted filesystem.
> In order to catch and diagnose such problems faster the namenode should detect the corruption automatically and provide a way to alert operations. At the minimum it should show the fact of corruption on the GUI.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-4103) Alert for missing blocks

Posted by "Raghu Angadi (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raghu Angadi updated HADOOP-4103:
---------------------------------

      Resolution: Fixed
    Release Note: Modified dfsadmin -report to report under replicated blocks, blocks with corrupt replicas, and missing blocks.  (was: Modified dfsadmin -report to count under replicated blocks, blocks with corrupt replicas, and missing blocks.)
    Hadoop Flags: [Incompatible change, Reviewed]  (was: [Reviewed, Incompatible change])
          Status: Resolved  (was: Patch Available)

I just committed this.

> Alert for missing blocks
> ------------------------
>
>                 Key: HADOOP-4103
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4103
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: dfs
>    Affects Versions: 0.17.2
>            Reporter: Christian Kunz
>            Assignee: Raghu Angadi
>             Fix For: 0.20.0
>
>         Attachments: HADOOP-4103-branch-20.patch, HADOOP-4103.patch, HADOOP-4103.patch, HADOOP-4103.patch, HADOOP-4103.patch
>
>
> A whole bunch of datanodes became dead because of some network problems resulting in  heartbeat timeouts although datanodes were fine.
> Many processes started to fail because of the corrupted filesystem.
> In order to catch and diagnose such problems faster the namenode should detect the corruption automatically and provide a way to alert operations. At the minimum it should show the fact of corruption on the GUI.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-4103) Alert for missing blocks

Posted by "Raghu Angadi (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12666819#action_12666819 ] 

Raghu Angadi commented on HADOOP-4103:
--------------------------------------

The scope of the fix is narrowed to the following:

# NameNode web UI shows a warning (probably in red) indicating whether there are any missing blocks.
     # Will mostly add Simon stats for such a number.

# 'dfsadmin -metasave' can be used to find all the missing blocks.
     ## A later jira will enhance -metasave or add a different, more user-friendly command. Currently -metasave is mainly meant for developers.

For this to be a straightforward fix, I need to make one policy change: currently, if a block does not have any good replicas left, it is not included in the "neededReplications" list. I think this was done mainly as an "optimization", but a cluster should not have any blocks in this state, and even the name 'neededReplications' implies such blocks should be included. It would be better if I don't need to add another list that needs to be maintained.
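A minimal sketch of the counting policy described above, assuming a hypothetical BlockInfo record with a live-replica count (this is an illustration only, not the actual FSNamesystem code):

{code:java}
// Illustrative sketch only: a block with no live replicas stays in the
// needed-replication queue and is counted as "missing".
import java.util.ArrayList;
import java.util.List;

class MissingBlockSketch {
  // Hypothetical stand-in for the real block record.
  static class BlockInfo {
    final long blockId;
    final int liveReplicas;
    BlockInfo(long blockId, int liveReplicas) {
      this.blockId = blockId;
      this.liveReplicas = liveReplicas;
    }
  }

  // Count blocks that have no good replicas left while scanning the queue.
  static int countMissing(List<BlockInfo> neededReplications) {
    int missing = 0;
    for (BlockInfo b : neededReplications) {
      if (b.liveReplicas == 0) {
        missing++;   // nothing left to replicate from
      }
    }
    return missing;
  }

  public static void main(String[] args) {
    List<BlockInfo> queue = new ArrayList<BlockInfo>();
    queue.add(new BlockInfo(1L, 2)); // under-replicated but recoverable
    queue.add(new BlockInfo(2L, 0)); // missing
    System.out.println("Missing blocks: " + countMissing(queue));
  }
}
{code}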





> Alert for missing blocks
> ------------------------
>
>                 Key: HADOOP-4103
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4103
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: dfs
>    Affects Versions: 0.17.2
>            Reporter: Christian Kunz
>            Assignee: Raghu Angadi
>
> A whole bunch of datanodes became dead because of some network problems resulting in  heartbeat timeouts although datanodes were fine.
> Many processes started to fail because of the corrupted filesystem.
> In order to catch and diagnose such problems faster the namenode should detect the corruption automatically and provide a way to alert operations. At the minimum it should show the fact of corruption on the GUI.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-4103) Alert for missing blocks

Posted by "Raghu Angadi (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raghu Angadi updated HADOOP-4103:
---------------------------------

    Attachment: HADOOP-4103.patch


The patch for missing block alerts. A user can monitor this in multiple ways:

   # 'bin/hdfs dfsadmin -report' reports this count.
   # A warning is displayed in red on the NameNode front page.
   # A new stat is added (e.g., for Simon).
        ** Also added a stat to report the size of the corrupt-replicas map.
  
Once the alert is noticed, an admin can run 'dfsadmin -metasave' to find out which specific blocks are missing. 'metasave' is improved a bit to list replica info for each block in the 'neededReplications' list, and the line for a missing block contains the word "MISSING".

This is a very non-intrusive change and thus fairly safe for backporting. There is no new state or data structure for the NN to track.
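For operations, a small monitoring sketch along these lines could watch the report; it assumes the report contains a line starting with "Missing blocks" as described above (the exact label and command path are assumptions and may differ per release):

{code:java}
// Hypothetical monitoring helper: run 'dfsadmin -report' and exit non-zero
// if any missing blocks are reported. Label and command path are assumptions.
import java.io.BufferedReader;
import java.io.InputStreamReader;

public class MissingBlockCheck {
  public static void main(String[] args) throws Exception {
    Process p = new ProcessBuilder("bin/hdfs", "dfsadmin", "-report").start();
    long missing = 0;
    BufferedReader r =
        new BufferedReader(new InputStreamReader(p.getInputStream()));
    String line;
    while ((line = r.readLine()) != null) {
      if (line.toLowerCase().startsWith("missing blocks")) {
        String digits = line.replaceAll("[^0-9]", "");
        if (!digits.isEmpty()) {
          missing = Long.parseLong(digits);
        }
      }
    }
    r.close();
    p.waitFor();
    if (missing > 0) {
      System.err.println("ALERT: " + missing + " missing blocks;"
          + " run 'dfsadmin -metasave' for details");
      System.exit(1);   // non-zero exit so cron/monitoring can alert
    }
  }
}
{code}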

> Alert for missing blocks
> ------------------------
>
>                 Key: HADOOP-4103
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4103
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: dfs
>    Affects Versions: 0.17.2
>            Reporter: Christian Kunz
>            Assignee: Raghu Angadi
>         Attachments: HADOOP-4103.patch
>
>
> A whole bunch of datanodes became dead because of some network problems resulting in  heartbeat timeouts although datanodes were fine.
> Many processes started to fail because of the corrupted filesystem.
> In order to catch and diagnose such problems faster the namenode should detect the corruption automatically and provide a way to alert operations. At the minimum it should show the fact of corruption on the GUI.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-4103) Alert for missing blocks

Posted by "Raghu Angadi (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raghu Angadi updated HADOOP-4103:
---------------------------------

    Attachment: HADOOP-4103.patch


Thanks Suresh.

The updated patch includes all the suggestions.

'dfsadmin -report' now prints 3 extra lines, one for each of "Under replicated blocks", "Blocks with corrupt replicas", and "Missing blocks". The last two counts should normally be zero. The first count should be low and should keep going down.

Regarding whether it should be treated as an "incompatible" change: I personally don't think so, but it does not matter either way.
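For illustration, on a healthy cluster the three extra lines might look like this (the counts are hypothetical; the labels are the ones listed above):

{noformat}
Under replicated blocks: 8
Blocks with corrupt replicas: 0
Missing blocks: 0
{noformat}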

> Alert for missing blocks
> ------------------------
>
>                 Key: HADOOP-4103
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4103
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: dfs
>    Affects Versions: 0.17.2
>            Reporter: Christian Kunz
>            Assignee: Raghu Angadi
>         Attachments: HADOOP-4103.patch, HADOOP-4103.patch
>
>
> A whole bunch of datanodes became dead because of some network problems resulting in  heartbeat timeouts although datanodes were fine.
> Many processes started to fail because of the corrupted filesystem.
> In order to catch and diagnose such problems faster the namenode should detect the corruption automatically and provide a way to alert operations. At the minimum it should show the fact of corruption on the GUI.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-4103) Alert for missing blocks

Posted by "Robert Chansler (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Chansler updated HADOOP-4103:
------------------------------------

    Release Note: Modified dfsadmin -report to count under replicated blocks, blocks with corrupt replicas, and missing blocks.
    Hadoop Flags: [Incompatible change, Reviewed]  (was: [Reviewed])

> Alert for missing blocks
> ------------------------
>
>                 Key: HADOOP-4103
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4103
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: dfs
>    Affects Versions: 0.17.2
>            Reporter: Christian Kunz
>            Assignee: Raghu Angadi
>             Fix For: 0.20.0
>
>         Attachments: HADOOP-4103.patch, HADOOP-4103.patch, HADOOP-4103.patch, HADOOP-4103.patch
>
>
> A whole bunch of datanodes became dead because of some network problems resulting in  heartbeat timeouts although datanodes were fine.
> Many processes started to fail because of the corrupted filesystem.
> In order to catch and diagnose such problems faster the namenode should detect the corruption automatically and provide a way to alert operations. At the minimum it should show the fact of corruption on the GUI.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-4103) Alert for missing blocks

Posted by "Suresh Srinivas (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12677222#action_12677222 ] 

Suresh Srinivas commented on HADOOP-4103:
-----------------------------------------

+1 for the patch

> Alert for missing blocks
> ------------------------
>
>                 Key: HADOOP-4103
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4103
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: dfs
>    Affects Versions: 0.17.2
>            Reporter: Christian Kunz
>            Assignee: Raghu Angadi
>         Attachments: HADOOP-4103.patch, HADOOP-4103.patch, HADOOP-4103.patch
>
>
> A whole bunch of datanodes became dead because of some network problems resulting in  heartbeat timeouts although datanodes were fine.
> Many processes started to fail because of the corrupted filesystem.
> In order to catch and diagnose such problems faster the namenode should detect the corruption automatically and provide a way to alert operations. At the minimum it should show the fact of corruption on the GUI.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-4103) Alert for missing blocks

Posted by "Raghu Angadi (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raghu Angadi updated HADOOP-4103:
---------------------------------

    Hadoop Flags: [Reviewed]
          Status: Patch Available  (was: Open)

I hope this gets marked for 0.20. It is pretty safe. Otherwise, I am pretty sure I will have to backport it again in the near future and duplicate the considerable effort associated with a new jira and a commit.
 

> Alert for missing blocks
> ------------------------
>
>                 Key: HADOOP-4103
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4103
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: dfs
>    Affects Versions: 0.17.2
>            Reporter: Christian Kunz
>            Assignee: Raghu Angadi
>         Attachments: HADOOP-4103.patch, HADOOP-4103.patch, HADOOP-4103.patch
>
>
> A whole bunch of datanodes became dead because of some network problems resulting in  heartbeat timeouts although datanodes were fine.
> Many processes started to fail because of the corrupted filesystem.
> In order to catch and diagnose such problems faster the namenode should detect the corruption automatically and provide a way to alert operations. At the minimum it should show the fact of corruption on the GUI.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-4103) Alert for missing blocks

Posted by "Raghu Angadi (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12678526#action_12678526 ] 

Raghu Angadi commented on HADOOP-4103:
--------------------------------------

The failed contrib test is a known issue: HADOOP-5068.

> Alert for missing blocks
> ------------------------
>
>                 Key: HADOOP-4103
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4103
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: dfs
>    Affects Versions: 0.17.2
>            Reporter: Christian Kunz
>            Assignee: Raghu Angadi
>             Fix For: 0.20.0
>
>         Attachments: HADOOP-4103.patch, HADOOP-4103.patch, HADOOP-4103.patch, HADOOP-4103.patch
>
>
> A whole bunch of datanodes became dead because of some network problems resulting in  heartbeat timeouts although datanodes were fine.
> Many processes started to fail because of the corrupted filesystem.
> In order to catch and diagnose such problems faster the namenode should detect the corruption automatically and provide a way to alert operations. At the minimum it should show the fact of corruption on the GUI.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-4103) Alert for missing blocks

Posted by "Raghu Angadi (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raghu Angadi updated HADOOP-4103:
---------------------------------

    Status: Patch Available  (was: Open)

> Alert for missing blocks
> ------------------------
>
>                 Key: HADOOP-4103
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4103
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: dfs
>    Affects Versions: 0.17.2
>            Reporter: Christian Kunz
>            Assignee: Raghu Angadi
>             Fix For: 0.20.0
>
>         Attachments: HADOOP-4103.patch, HADOOP-4103.patch, HADOOP-4103.patch, HADOOP-4103.patch
>
>
> A whole bunch of datanodes became dead because of some network problems resulting in  heartbeat timeouts although datanodes were fine.
> Many processes started to fail because of the corrupted filesystem.
> In order to catch and diagnose such problems faster the namenode should detect the corruption automatically and provide a way to alert operations. At the minimum it should show the fact of corruption on the GUI.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-4103) Alert for missing blocks

Posted by "Raghu Angadi (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12678201#action_12678201 ] 

Raghu Angadi commented on HADOOP-4103:
--------------------------------------

If there are no objections, I am planning to commit this to 0.20. 

This is a pretty useful feature for admins and is a pretty safe patch. Please let me know if there are concerns.


> Alert for missing blocks
> ------------------------
>
>                 Key: HADOOP-4103
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4103
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: dfs
>    Affects Versions: 0.17.2
>            Reporter: Christian Kunz
>            Assignee: Raghu Angadi
>             Fix For: 0.20.0
>
>         Attachments: HADOOP-4103.patch, HADOOP-4103.patch, HADOOP-4103.patch, HADOOP-4103.patch
>
>
> A whole bunch of datanodes became dead because of some network problems resulting in  heartbeat timeouts although datanodes were fine.
> Many processes started to fail because of the corrupted filesystem.
> In order to catch and diagnose such problems faster the namenode should detect the corruption automatically and provide a way to alert operations. At the minimum it should show the fact of corruption on the GUI.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Assigned: (HADOOP-4103) Alert for missing blocks

Posted by "Raghu Angadi (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raghu Angadi reassigned HADOOP-4103:
------------------------------------

    Assignee: Raghu Angadi

> Alert for missing blocks
> ------------------------
>
>                 Key: HADOOP-4103
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4103
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: dfs
>    Affects Versions: 0.17.2
>            Reporter: Christian Kunz
>            Assignee: Raghu Angadi
>
> A whole bunch of datanodes became dead because of some network problems resulting in  heartbeat timeouts although datanodes were fine.
> Many processes started to fail because of the corrupted filesystem.
> In order to catch and diagnose such problems faster the namenode should detect the corruption automatically and provide a way to alert operations. At the minimum it should show the fact of corruption on the GUI.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-4103) Alert for missing blocks

Posted by "Raghu Angadi (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raghu Angadi updated HADOOP-4103:
---------------------------------

    Attachment: HADOOP-4103-branch-20.patch

The patch for 0.20 is attached; the trunk patch conflicts with 0.20.

> Alert for missing blocks
> ------------------------
>
>                 Key: HADOOP-4103
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4103
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: dfs
>    Affects Versions: 0.17.2
>            Reporter: Christian Kunz
>            Assignee: Raghu Angadi
>             Fix For: 0.20.0
>
>         Attachments: HADOOP-4103-branch-20.patch, HADOOP-4103.patch, HADOOP-4103.patch, HADOOP-4103.patch, HADOOP-4103.patch
>
>
> A whole bunch of datanodes became dead because of some network problems resulting in  heartbeat timeouts although datanodes were fine.
> Many processes started to fail because of the corrupted filesystem.
> In order to catch and diagnose such problems faster the namenode should detect the corruption automatically and provide a way to alert operations. At the minimum it should show the fact of corruption on the GUI.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-4103) Alert for missing blocks

Posted by "Christian Kunz (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Christian Kunz updated HADOOP-4103:
-----------------------------------

    Description: 
A whole bunch of datanodes became dead because of some network problems resulting in  heartbeat timeouts although datanodes were fine.

Many processes started to fail because of the corrupted filesystem.

In order to catch and diagnose such problems faster the namenode should detect the corruption automatically and provide a way to alert operations. At the minimum it should show the fact of corruption on the GUI.

  was:
A whole bunch of datanodes became dead because of some network problems resulting in  heartbeat timeouts although datanodes were fine.

Many processes started to fail because of the corrupted filesystem.

In order to catch and diagnose such problems faster the namenode should detect the corruption automatically and provide a way to alert operations. At the minimum it should the fact of corruption on the GUI.


> Alert for missing blocks
> ------------------------
>
>                 Key: HADOOP-4103
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4103
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: dfs
>    Affects Versions: 0.17.2
>            Reporter: Christian Kunz
>
> A whole bunch of datanodes became dead because of some network problems resulting in  heartbeat timeouts although datanodes were fine.
> Many processes started to fail because of the corrupted filesystem.
> In order to catch and diagnose such problems faster the namenode should detect the corruption automatically and provide a way to alert operations. At the minimum it should show the fact of corruption on the GUI.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-4103) Alert for missing blocks

Posted by "Raghu Angadi (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raghu Angadi updated HADOOP-4103:
---------------------------------

    Attachment: HADOOP-4103.patch

Thanks Suresh.

The attached patch fixes both. The new stat for corrupt blocks is not required since it is already there; I didn't see that earlier.

> Alert for missing blocks
> ------------------------
>
>                 Key: HADOOP-4103
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4103
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: dfs
>    Affects Versions: 0.17.2
>            Reporter: Christian Kunz
>            Assignee: Raghu Angadi
>         Attachments: HADOOP-4103.patch, HADOOP-4103.patch, HADOOP-4103.patch
>
>
> A whole bunch of datanodes became dead because of some network problems resulting in  heartbeat timeouts although datanodes were fine.
> Many processes started to fail because of the corrupted filesystem.
> In order to catch and diagnose such problems faster the namenode should detect the corruption automatically and provide a way to alert operations. At the minimum it should show the fact of corruption on the GUI.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-4103) Alert for missing blocks

Posted by "Raghu Angadi (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raghu Angadi updated HADOOP-4103:
---------------------------------

    Status: Open  (was: Patch Available)

I forgot to run the tests again after the changes made to the patch based on the review.

> Alert for missing blocks
> ------------------------
>
>                 Key: HADOOP-4103
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4103
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: dfs
>    Affects Versions: 0.17.2
>            Reporter: Christian Kunz
>            Assignee: Raghu Angadi
>             Fix For: 0.20.0
>
>         Attachments: HADOOP-4103.patch, HADOOP-4103.patch, HADOOP-4103.patch
>
>
> A whole bunch of datanodes became dead because of some network problems resulting in  heartbeat timeouts although datanodes were fine.
> Many processes started to fail because of the corrupted filesystem.
> In order to catch and diagnose such problems faster the namenode should detect the corruption automatically and provide a way to alert operations. At the minimum it should show the fact of corruption on the GUI.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-4103) Alert for missing blocks

Posted by "Suresh Srinivas (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12677155#action_12677155 ] 

Suresh Srinivas commented on HADOOP-4103:
-----------------------------------------

Comments:
# {{DFSAdmin.java}}: please remove the space before {{:}} in the newly introduced output.
# {{NameNodeMetrics.numBlocksCorrupted}} exposes the same data as {{FSNamesystemMetrics.corruptReplicaBlocks}}. Not sure where the new metrics introduced by this patch should go.

> Alert for missing blocks
> ------------------------
>
>                 Key: HADOOP-4103
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4103
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: dfs
>    Affects Versions: 0.17.2
>            Reporter: Christian Kunz
>            Assignee: Raghu Angadi
>         Attachments: HADOOP-4103.patch, HADOOP-4103.patch
>
>
> A whole bunch of datanodes became dead because of some network problems resulting in  heartbeat timeouts although datanodes were fine.
> Many processes started to fail because of the corrupted filesystem.
> In order to catch and diagnose such problems faster the namenode should detect the corruption automatically and provide a way to alert operations. At the minimum it should show the fact of corruption on the GUI.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-4103) Alert for missing blocks

Posted by "Suresh Srinivas (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12676475#action_12676475 ] 

Suresh Srinivas commented on HADOOP-4103:
-----------------------------------------

1. The NamenodeProtocol.getStats() method documentation needs to be updated to describe the fourth stat that is being reported.
2. DFSAdmin.java - remove the space before {{:}} in {{"Missing Blocks (approx) : "}}. Additionally, is it a good idea to also print the number of corrupt blocks, pending replications, scheduled replications, and under replicated blocks in the report? Currently what is printed in the dfsadmin report is also printed in the cluster summary on the namenode web page; it may be a good idea to keep the two consistent.
3. FSNamesystem.java {{computeReplicationWork()}}: move the added code block that sets {{missingBlocksInCurIter}} and {{missingBlocksInPrevIter}} to zero above the comments preceding it (see the sketch after this list for the intent of those counters).

Would this change be incompatible because of the change in the output of the dfsadmin -report command?
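Regarding the counters in item 3, here is a minimal sketch of how a pair like {{missingBlocksInCurIter}} / {{missingBlocksInPrevIter}} is presumably meant to work (an assumption about the intent, not the actual FSNamesystem code): the count accumulated during the in-progress sweep is only published once that sweep completes.

{code:java}
// Illustrative two-counter pattern for a scan that completes over several calls:
// accumulate into the current counter, publish the last complete count.
class MissingBlockCounters {
  private long missingBlocksInCurIter = 0;   // sweep in progress
  private long missingBlocksInPrevIter = 0;  // last fully completed sweep

  // Called for each block examined during the sweep.
  void sawBlock(int liveReplicas) {
    if (liveReplicas == 0) {
      missingBlocksInCurIter++;
    }
  }

  // Called when a full sweep over the queue has finished.
  void sweepFinished() {
    missingBlocksInPrevIter = missingBlocksInCurIter;
    missingBlocksInCurIter = 0;
  }

  // Value reported, e.g. to 'dfsadmin -report' or the web UI.
  long getMissingBlocks() {
    return missingBlocksInPrevIter;
  }
}
{code}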


> Alert for missing blocks
> ------------------------
>
>                 Key: HADOOP-4103
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4103
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: dfs
>    Affects Versions: 0.17.2
>            Reporter: Christian Kunz
>            Assignee: Raghu Angadi
>         Attachments: HADOOP-4103.patch
>
>
> A whole bunch of datanodes became dead because of some network problems resulting in  heartbeat timeouts although datanodes were fine.
> Many processes started to fail because of the corrupted filesystem.
> In order to catch and diagnose such problems faster the namenode should detect the corruption automatically and provide a way to alert operations. At the minimum it should show the fact of corruption on the GUI.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-4103) Alert for missing blocks

Posted by "Bill Au (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12677630#action_12677630 ] 

Bill Au commented on HADOOP-4103:
---------------------------------

I think this feature is very useful and would like to see it for 0.20 too.

> Alert for missing blocks
> ------------------------
>
>                 Key: HADOOP-4103
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4103
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: dfs
>    Affects Versions: 0.17.2
>            Reporter: Christian Kunz
>            Assignee: Raghu Angadi
>         Attachments: HADOOP-4103.patch, HADOOP-4103.patch, HADOOP-4103.patch
>
>
> A whole bunch of datanodes became dead because of some network problems resulting in  heartbeat timeouts although datanodes were fine.
> Many processes started to fail because of the corrupted filesystem.
> In order to catch and diagnose such problems faster the namenode should detect the corruption automatically and provide a way to alert operations. At the minimum it should show the fact of corruption on the GUI.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-4103) Alert for missing blocks

Posted by "Raghu Angadi (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12650746#action_12650746 ] 

Raghu Angadi commented on HADOOP-4103:
--------------------------------------

I am thinking of implementing a background fsck on the NameNode. This will share/reuse most of the code with the current Fsck. The extra features will let an admin quickly check if there is something odd (e.g. the ability to list the last 100 or so blocks in an inconsistent state).

Based on this background check there could be further improvements over time, such as monitoring more alarms as well as reducing the latency of detection.

This feature will be optional. The scan period could be around a day.
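A minimal skeleton of the kind of optional, periodic background scan described here (the class name, wiring, and period handling are hypothetical and not part of any attached patch):

{code:java}
// Hypothetical skeleton of an optional background consistency scan.
public class BackgroundFsck implements Runnable {
  private final long scanIntervalMs;

  public BackgroundFsck(long scanIntervalMs) {  // e.g. roughly one day
    this.scanIntervalMs = scanIntervalMs;
  }

  public void run() {
    while (!Thread.currentThread().isInterrupted()) {
      try {
        scanNamespace();                        // would reuse the existing fsck traversal
        Thread.sleep(scanIntervalMs);
      } catch (InterruptedException ie) {
        Thread.currentThread().interrupt();     // shut down cleanly
      }
    }
  }

  private void scanNamespace() {
    // Walk the namespace and remember, say, the last 100 blocks found in an
    // inconsistent state so an admin can inspect them quickly.
  }

  public static void start(long intervalMs) {
    Thread t = new Thread(new BackgroundFsck(intervalMs), "BackgroundFsck");
    t.setDaemon(true);                          // optional feature; must not block shutdown
    t.start();
  }
}
{code}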



> Alert for missing blocks
> ------------------------
>
>                 Key: HADOOP-4103
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4103
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: dfs
>    Affects Versions: 0.17.2
>            Reporter: Christian Kunz
>            Assignee: Raghu Angadi
>
> A whole bunch of datanodes became dead because of some network problems resulting in  heartbeat timeouts although datanodes were fine.
> Many processes started to fail because of the corrupted filesystem.
> In order to catch and diagnose such problems faster the namenode should detect the corruption automatically and provide a way to alert operations. At the minimum it should show the fact of corruption on the GUI.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.