Posted to common-dev@hadoop.apache.org by "Brian Bockelman (JIRA)" <ji...@apache.org> on 2009/01/08 19:52:59 UTC

[jira] Created: (HADOOP-4995) Offline Namenode fsImage verification

Offline Namenode fsImage verification
-------------------------------------

                 Key: HADOOP-4995
                 URL: https://issues.apache.org/jira/browse/HADOOP-4995
             Project: Hadoop Core
          Issue Type: New Feature
            Reporter: Brian Bockelman


Currently, there is no way to verify that a copy of the fsImage is not corrupt.  I propose that we should have an offline tool that loads the fsImage into memory to see if it is usable.  This will allow us to automate backup testing to some extent.

One can start a namenode process on the fsImage to see if it can be loaded, but this is not easy to automate.

To use HDFS in production, it is highly desirable both to have checkpoints and to have some assurance that the checkpoints are valid!  No one wants to see the day when they reload from backup only to find that the fsImage in the backup wasn't usable.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-4995) Offline Namenode fsImage verification

Posted by "Brian Bockelman (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12662194#action_12662194 ] 

Brian Bockelman commented on HADOOP-4995:
-----------------------------------------

Re: Konstantin: I still consider the secondary name-node part of the "online system".

I want to be able to take a completely offline image - perhaps something we pulled off the tape - and make sure that it's at least valid enough that a namenode could load it into memory.  It'd be a way that we can do a "light audit" of our backup copies.

Currently, the best we can do is "try and pray" (and it's a manual process).



[jira] Commented: (HADOOP-4995) Offline Namenode fsImage verification

Posted by "dhruba borthakur (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12662413#action_12662413 ] 

dhruba borthakur commented on HADOOP-4995:
------------------------------------------

Cool. Sounds good to me.



[jira] Commented: (HADOOP-4995) Offline Namenode fsImage verification

Posted by "Konstantin Shvachko (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12662481#action_12662481 ] 

Konstantin Shvachko commented on HADOOP-4995:
---------------------------------------------

A {{namenode -checkimage}} startup option would be good to have. Glad we clarified it.



[jira] Commented: (HADOOP-4995) Offline Namenode fsImage verification

Posted by "dhruba borthakur (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12662248#action_12662248 ] 

dhruba borthakur commented on HADOOP-4995:
------------------------------------------

I agree with Konstantin. The best tool to verify that the image is good is the namenode itself. The only caveat is that running the namenode actually merges the fsimage and the edits log.

Perhaps there could be a way to start the namenode with a "-checkimage" or similar parameter. In that case, the namenode would just load both the fsimage and the edits and then exit.



[jira] Commented: (HADOOP-4995) Offline Namenode fsImage verification

Posted by "Konstantin Shvachko (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12662221#action_12662221 ] 

Konstantin Shvachko commented on HADOOP-4995:
---------------------------------------------

Are you going to use name-node methods to verify the fsimage, or are you planning to implement a completely independent tool to do that?
If the former, then you will probably need the same amount of memory as the name-node uses, and therefore you might just use the real name-node or the secondary one and do the "try and pray".
If the latter, then it will be hard to keep it in sync with the changing image layout and the name-node code. Suppose the tool has a bug (which might be just that the real image layout was not reflected in the tool code) and it reports the image as good or bad: how do you trust it? Who is going to verify the tool's correctness?
I am saying that there is no better tool for verifying image correctness than the name-node itself. And maybe the "try and pray" is the only approach you can really trust in the end. You can do it offline rather than online if required.
Am I missing your point?




[jira] Commented: (HADOOP-4995) Offline Namenode fsImage verification

Posted by "Raghu Angadi (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12662184#action_12662184 ] 

Raghu Angadi commented on HADOOP-4995:
--------------------------------------

Some time back I had a proposal to checksum the fsimage. There, each record (a few hundred bytes) is checksummed, rather than the whole file. This helps both with verification and with recovery from multiple copies: the image can be recovered as long as the copies are not all damaged at the same location.
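
To illustrate the per-record idea, here is a minimal sketch in Python. The framing (a length and a CRC32 in front of each payload) and all function names are hypothetical and chosen only for illustration; this is not the real fsimage layout. It also assumes that damage is confined to record payloads, so that the length headers stay readable and both copies contain the same number of records.

```python
import struct
import zlib
from typing import List, Optional

# Hypothetical framing, NOT the real fsimage layout:
# [4-byte big-endian payload length][4-byte CRC32 of payload][payload]
HEADER = struct.Struct(">II")


def encode_records(records: List[bytes]) -> bytes:
    """Serialize records, prefixing each with its length and CRC32."""
    out = bytearray()
    for payload in records:
        out += HEADER.pack(len(payload), zlib.crc32(payload))
        out += payload
    return bytes(out)


def decode_records(blob: bytes) -> List[Optional[bytes]]:
    """Return each payload, or None where a record fails its checksum."""
    results: List[Optional[bytes]] = []
    offset = 0
    while offset < len(blob):
        length, expected_crc = HEADER.unpack_from(blob, offset)
        offset += HEADER.size
        payload = blob[offset:offset + length]
        offset += length
        results.append(payload if zlib.crc32(payload) == expected_crc else None)
    return results


def recover(copy_a: bytes, copy_b: bytes) -> List[bytes]:
    """Merge two copies record by record, taking whichever one verifies."""
    merged = []
    pairs = zip(decode_records(copy_a), decode_records(copy_b))
    for index, (a, b) in enumerate(pairs):
        good = a if a is not None else b
        if good is None:
            raise ValueError(f"record {index} is damaged in both copies")
        merged.append(good)
    return merged
```

With a whole-file checksum, two copies that are each damaged somewhere are both simply "bad". With per-record checksums, `recover` can still rebuild the full image unless the same record is damaged in both copies.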



[jira] Commented: (HADOOP-4995) Offline Namenode fsImage verification

Posted by "Brian Bockelman (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12662411#action_12662411 ] 

Brian Bockelman commented on HADOOP-4995:
-----------------------------------------

Hey Konstantin, Dhruba,

I think we're approximately on the same page.  The best thing to verify image correctness is the namenode itself.  I believe Dhruba expressed this most succinctly: the "offline fsImage verification" could simply be a "-checkimage" flag where the namenode would load the fsImage / edits, then exit 0 if nothing bad happened and exit 1 if there was some error.

I wasn't proposing a completely separate tool to verify an image, for the reasons Konstantin pointed out: the only sane way to verify that the image is usable by the namenode is to use the namenode itself; it'd be impossible to keep two separate implementations in sync.
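
For the automation side, a backup-audit job would only need to look at the exit status. Below is a minimal sketch in Python, assuming a check command that follows the exit 0 / exit 1 convention described above. The "-checkimage" option is only a proposal in this thread, so the command line is passed in by the caller rather than hard-coded, and the function name and timeout are hypothetical.

```python
import subprocess
from typing import Sequence


def image_is_loadable(check_command: Sequence[str],
                      timeout_seconds: float = 3600.0) -> bool:
    """Run a verification command and report whether it exited with status 0.

    check_command is whatever invocation performs the load-and-exit check,
    for example a namenode started with the proposed (hypothetical)
    "-checkimage" option. Any non-zero exit status, a timeout, or a failure
    to launch the command at all counts as a failed audit.
    """
    try:
        result = subprocess.run(check_command, timeout=timeout_seconds)
    except (subprocess.TimeoutExpired, OSError):
        return False
    return result.returncode == 0
```

A nightly job could call `image_is_loadable` on each image restored from backup and raise an alert whenever it returns False, which replaces the manual "try and pray" step with something scriptable.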



[jira] Updated: (HADOOP-4995) Offline Namenode fsImage verification

Posted by "dhruba borthakur (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

dhruba borthakur updated HADOOP-4995:
-------------------------------------

    Component/s: dfs



[jira] Commented: (HADOOP-4995) Offline Namenode fsImage verification

Posted by "Konstantin Shvachko (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12662165#action_12662165 ] 

Konstantin Shvachko commented on HADOOP-4995:
---------------------------------------------

If you do periodic checkpoints using the secondary name-node, don't you already check the correctness of the namespace image that way?
