Posted to common-user@hadoop.apache.org by Espen Amble Kolstad <es...@trank.no> on 2006/12/08 21:59:39 UTC

Corrupt DFS edits-file

Hi,

I run hadoop-0.9-dev and my edits-file has become corrupt. When I try to
start the namenode I get the following error:
2006-12-08 20:38:57,431 ERROR dfs.NameNode -
java.io.FileNotFoundException: Parent path does not exist:
/user/trank/dotno/segments/20061208154235/parse_data/part-00000
        at
org.apache.hadoop.dfs.FSDirectory$INode.addNode(FSDirectory.java:186)
        at
org.apache.hadoop.dfs.FSDirectory.unprotectedMkdir(FSDirectory.java:714)
        at org.apache.hadoop.dfs.FSEditLog.loadFSEdits(FSEditLog.java:254)
        at org.apache.hadoop.dfs.FSImage.loadFSImage(FSImage.java:191)
        at
org.apache.hadoop.dfs.FSDirectory.loadFSImage(FSDirectory.java:320)
        at org.apache.hadoop.dfs.FSNamesystem.<init>(FSNamesystem.java:226)
        at org.apache.hadoop.dfs.NameNode.<init>(NameNode.java:142)
        at org.apache.hadoop.dfs.NameNode.<init>(NameNode.java:134)
        at org.apache.hadoop.dfs.NameNode.main(NameNode.java:585)

I've grepped through my edits file to see what's wrong. It seems the
edits file is missing an OP_MKDIR for
/user/trank/dotno/segments/20061208154235/parse_data.

Is there a tool for fixing an edits file, or for inserting the missing OP_MKDIR?

- Espen

Re: RE: Corrupt DFS edits-file

Posted by Albert Chern <al...@gmail.com>.
Hi Dhruba,

This happened some time ago so my memory's sketchy, but I'll do my
best to answer the questions:

> 1. Did you have more than one directory in dfs.name.dir?
> 2. Was this a new cluster or was it an existing cluster and was upgraded to
> 0.9.0 recently?

It happened after we restarted the DFS when upgrading from 0.7 to 0.8.
In the process, we also added multiple directories to dfs.name.dir.

> 3. Did any unnatural  Namenode restarts occur immediately before the problem
> started occurring?

Not sure about this one.

> 1. Will it help to make the fsimage/edit file ascii, so that it can be
> easily edited by hand?

That's a good idea, but I don't know if the effect on backwards
compatibility is worth it.  Editing these files is probably not
something that most people will do.  Maybe some sort of conversion
tool that goes from the binary to text and vice versa would be more
useful.
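
For illustration, here is a minimal round-trip sketch of such a tool (a
hypothetical toy, not a proposed Hadoop utility; the class name and usage are
invented, and a real converter would decode the actual opcodes into readable
text rather than dumping raw hex):

    // Toy binary<->text round trip, in the spirit of the xxd workflow
    // used elsewhere in this thread. All names invented for illustration.
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    public class EditsHexTool {
        public static void main(String[] args) throws IOException {
            if (args[0].equals("tohex")) {            // edits -> hex text
                byte[] data = Files.readAllBytes(Paths.get(args[1]));
                StringBuilder sb = new StringBuilder();
                for (byte b : data) sb.append(String.format("%02x", b));
                Files.write(Paths.get(args[2]), sb.toString().getBytes("US-ASCII"));
            } else {                                  // "tobin": hex text -> edits
                String hex = new String(Files.readAllBytes(Paths.get(args[1])), "US-ASCII").trim();
                byte[] out = new byte[hex.length() / 2];
                for (int i = 0; i < out.length; i++)
                    out[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
                Files.write(Paths.get(args[2]), out);
            }
        }
    }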

> 2. Does it make sense for HDFS to automatically create a directory
> equivalent to /lost+found? While EditLog processing, if the parent directory
> of a file does not exist, the file can go into /lost+found?

Yes.  At least this way people can start up their DFSs after corruption.

On 12/8/06, Dhruba Borthakur <dh...@yahoo-inc.com> wrote:
> Hi Albert and Espen,
>
> With an eye on debugging more on this issue, I have the following questions:
> 1. Did you have more than one directory in dfs.name.dir?
> 2. Was this a new cluster or was it an existing cluster and was upgraded to
> 0.9.0 recently?
> 3. Did any unnatural  Namenode restarts occur immediately before the problem
> started occurring?
>
> With an eye on making it easier to recover from such a corruption:
> 1. Will it help to make the fsimage/edit file ascii, so that it can be
> easily edited by hand?
> 2. Does it make sense for HDFS to automatically create a directory
> equivalent to /lost+found? While EditLog processing, if the parent directory
> of a file does not exist, the file can go into /lost+found?
>
> Thanks,
> dhruba
>
> -----Original Message-----
> From: Albert Chern [mailto:albert.chern@gmail.com]
> Sent: Friday, December 08, 2006 1:43 PM
> To: hadoop-user@lucene.apache.org
> Subject: Re: Corrupt DFS edits-file
>
> This happened to me too, but the problem was the OP_MKDIR instructions
> were in the wrong order.  That is, in the edits file the parent
> directory was created after the child.  Maybe you should check to see
> if that's the case.
>
> I fixed it by using vi in combination with xxd.  When you have the
> file open in vi, press escape and issue the command "%!xxd".  This
> will convert the binary file to hexadecimal.  Then you can search
> through and perform the necessary edits.  I don't remember what the
> bytes were, but it was something like opcode, length of path (in
> binary), path.  After you're done, issue the command "%!xxd -r" to
> revert it to binary.  Remember to back up your files when you do this!
>  I also had to kick off a trailing byte that got tagged on for some
> reason during the binary/hex conversion.
>
> Anyhow, this is a serious bug and could lead to data loss for a lot of
> people.  I think we should report it.
>
> On 12/8/06, Espen Amble Kolstad <es...@trank.no> wrote:
> > Hi,
> >
> > I run hadoop-0.9-dev and my edits-file has become corrupt. When I try to
> > start the namenode I get the following error:
> > 2006-12-08 20:38:57,431 ERROR dfs.NameNode -
> > java.io.FileNotFoundException: Parent path does not exist:
> > /user/trank/dotno/segments/20061208154235/parse_data/part-00000
> >         at
> > org.apache.hadoop.dfs.FSDirectory$INode.addNode(FSDirectory.java:186)
> >         at
> > org.apache.hadoop.dfs.FSDirectory.unprotectedMkdir(FSDirectory.java:714)
> >         at org.apache.hadoop.dfs.FSEditLog.loadFSEdits(FSEditLog.java:254)
> >         at org.apache.hadoop.dfs.FSImage.loadFSImage(FSImage.java:191)
> >         at
> > org.apache.hadoop.dfs.FSDirectory.loadFSImage(FSDirectory.java:320)
> >         at
> > org.apache.hadoop.dfs.FSNamesystem.<init>(FSNamesystem.java:226)
> >         at org.apache.hadoop.dfs.NameNode.<init>(NameNode.java:142)
> >         at org.apache.hadoop.dfs.NameNode.<init>(NameNode.java:134)
> >         at org.apache.hadoop.dfs.NameNode.main(NameNode.java:585)
> >
> > I've grep'ed through my edits-file, to see what's wrong. It seems the
> > edits-file is missing an OP_MKDIR for
> > /user/trank/dotno/segments/20061208154235/parse_data.
> >
> > Is there a tool for fixing an edits-file, or to put in an OP_MKDIR ?
> >
> > - Espen
> >
>
>

Re: Corrupt DFS edits-file

Posted by Konstantin Shvachko <sh...@yahoo-inc.com>.
Could you please add your comments to HADOOP-745?
http://issues.apache.org/jira/browse/HADOOP-745

It could be helpful for whoever is going to fix it.

Christian Kunz wrote:

>FYI: there is an open issue for this:
>HADOOP-745
>
>-Christian 
>
>-----Original Message-----
>From: Dhruba Borthakur [mailto:dhruba@yahoo-inc.com] 
>Sent: Friday, December 08, 2006 2:46 PM
>To: hadoop-user@lucene.apache.org
>Subject: RE: Corrupt DFS edits-file
>
>Hi Albert and Espen,
>
>With an eye on debugging more on this issue, I have the following questions:
>1. Did you have more than one directory in dfs.name.dir? 
>2. Was this a new cluster or was it an existing cluster and was upgraded to
>0.9.0 recently?
>3. Did any unnatural  Namenode restarts occur immediately before the problem
>started occurring?
>
>With an eye on making it easier to recover from such a corruption:
>1. Will it help to make the fsimage/edit file ascii, so that it can be
>easily edited by hand?
>2. Does it make sense for HDFS to automatically create a directory
>equivalent to /lost+found? While EditLog processing, if the parent directory
>of a file does not exist, the file can go into /lost+found?
>
>Thanks,
>dhruba
>
>-----Original Message-----
>From: Albert Chern [mailto:albert.chern@gmail.com]
>Sent: Friday, December 08, 2006 1:43 PM
>To: hadoop-user@lucene.apache.org
>Subject: Re: Corrupt DFS edits-file
>
>This happened to me too, but the problem was the OP_MKDIR instructions were
>in the wrong order.  That is, in the edits file the parent directory was
>created after the child.  Maybe you should check to see if that's the case.
>
>I fixed it by using vi in combination with xxd.  When you have the file open
>in vi, press escape and issue the command "%!xxd".  This will convert the
>binary file to hexadecimal.  Then you can search through and perform the
>necessary edits.  I don't remember what the bytes were, but it was something
>like opcode, length of path (in binary), path.  After you're done, issue the
>command "%!xxd -r" to revert it to binary.  Remember to back up your files
>when you do this!
> I also had to kick off a trailing byte that got tagged on for some reason
>during the binary/hex conversion.
>
>Anyhow, this is a serious bug and could lead to data loss for a lot of
>people.  I think we should report it.
>
>On 12/8/06, Espen Amble Kolstad <es...@trank.no> wrote:
>  
>
>>Hi,
>>
>>I run hadoop-0.9-dev and my edits-file has become corrupt. When I try 
>>to start the namenode I get the following error:
>>2006-12-08 20:38:57,431 ERROR dfs.NameNode -
>>java.io.FileNotFoundException: Parent path does not exist:
>>/user/trank/dotno/segments/20061208154235/parse_data/part-00000
>>        at
>>org.apache.hadoop.dfs.FSDirectory$INode.addNode(FSDirectory.java:186)
>>        at
>>org.apache.hadoop.dfs.FSDirectory.unprotectedMkdir(FSDirectory.java:714)
>>        at org.apache.hadoop.dfs.FSEditLog.loadFSEdits(FSEditLog.java:254)
>>        at org.apache.hadoop.dfs.FSImage.loadFSImage(FSImage.java:191)
>>        at
>>org.apache.hadoop.dfs.FSDirectory.loadFSImage(FSDirectory.java:320)
>>        at
>>org.apache.hadoop.dfs.FSNamesystem.<init>(FSNamesystem.java:226)
>>        at org.apache.hadoop.dfs.NameNode.<init>(NameNode.java:142)
>>        at org.apache.hadoop.dfs.NameNode.<init>(NameNode.java:134)
>>        at org.apache.hadoop.dfs.NameNode.main(NameNode.java:585)
>>
>>I've grep'ed through my edits-file, to see what's wrong. It seems the 
>>edits-file is missing an OP_MKDIR for 
>>/user/trank/dotno/segments/20061208154235/parse_data.
>>
>>Is there a tool for fixing an edits-file, or to put in an OP_MKDIR ?
>>
>>- Espen
>>


RE: Corrupt DFS edits-file

Posted by Christian Kunz <ck...@yahoo-inc.com>.
FYI: there is an open issue for this:
HADOOP-745

-Christian 

-----Original Message-----
From: Dhruba Borthakur [mailto:dhruba@yahoo-inc.com] 
Sent: Friday, December 08, 2006 2:46 PM
To: hadoop-user@lucene.apache.org
Subject: RE: Corrupt DFS edits-file

Hi Albert and Espen,

With an eye on debugging more on this issue, I have the following questions:
1. Did you have more than one directory in dfs.name.dir? 
2. Was this a new cluster or was it an existing cluster and was upgraded to
0.9.0 recently?
3. Did any unnatural  Namenode restarts occur immediately before the problem
started occurring?

With an eye on making it easier to recover from such a corruption:
1. Will it help to make the fsimage/edit file ascii, so that it can be
easily edited by hand?
2. Does it make sense for HDFS to automatically create a directory
equivalent to /lost+found? While EditLog processing, if the parent directory
of a file does not exist, the file can go into /lost+found?

Thanks,
dhruba

-----Original Message-----
From: Albert Chern [mailto:albert.chern@gmail.com]
Sent: Friday, December 08, 2006 1:43 PM
To: hadoop-user@lucene.apache.org
Subject: Re: Corrupt DFS edits-file

This happened to me too, but the problem was the OP_MKDIR instructions were
in the wrong order.  That is, in the edits file the parent directory was
created after the child.  Maybe you should check to see if that's the case.

I fixed it by using vi in combination with xxd.  When you have the file open
in vi, press escape and issue the command "%!xxd".  This will convert the
binary file to hexadecimal.  Then you can search through and perform the
necessary edits.  I don't remember what the bytes were, but it was something
like opcode, length of path (in binary), path.  After you're done, issue the
command "%!xxd -r" to revert it to binary.  Remember to back up your files
when you do this!
 I also had to kick off a trailing byte that got tagged on for some reason
during the binary/hex conversion.

Anyhow, this is a serious bug and could lead to data loss for a lot of
people.  I think we should report it.

On 12/8/06, Espen Amble Kolstad <es...@trank.no> wrote:
> Hi,
>
> I run hadoop-0.9-dev and my edits-file has become corrupt. When I try 
> to start the namenode I get the following error:
> 2006-12-08 20:38:57,431 ERROR dfs.NameNode -
> java.io.FileNotFoundException: Parent path does not exist:
> /user/trank/dotno/segments/20061208154235/parse_data/part-00000
>         at
> org.apache.hadoop.dfs.FSDirectory$INode.addNode(FSDirectory.java:186)
>         at
> org.apache.hadoop.dfs.FSDirectory.unprotectedMkdir(FSDirectory.java:714)
>         at org.apache.hadoop.dfs.FSEditLog.loadFSEdits(FSEditLog.java:254)
>         at org.apache.hadoop.dfs.FSImage.loadFSImage(FSImage.java:191)
>         at
> org.apache.hadoop.dfs.FSDirectory.loadFSImage(FSDirectory.java:320)
>         at
> org.apache.hadoop.dfs.FSNamesystem.<init>(FSNamesystem.java:226)
>         at org.apache.hadoop.dfs.NameNode.<init>(NameNode.java:142)
>         at org.apache.hadoop.dfs.NameNode.<init>(NameNode.java:134)
>         at org.apache.hadoop.dfs.NameNode.main(NameNode.java:585)
>
> I've grep'ed through my edits-file, to see what's wrong. It seems the 
> edits-file is missing an OP_MKDIR for 
> /user/trank/dotno/segments/20061208154235/parse_data.
>
> Is there a tool for fixing an edits-file, or to put in an OP_MKDIR ?
>
> - Espen
>



RE: Corrupt DFS edits-file

Posted by Dhruba Borthakur <dh...@yahoo-inc.com>.
Hi Albert and Espen,

With an eye to debugging this issue further, I have the following questions:
1. Did you have more than one directory in dfs.name.dir?
2. Was this a new cluster, or an existing cluster that was recently upgraded
to 0.9.0?
3. Did any unnatural NameNode restarts occur immediately before the problem
started occurring?

With an eye to making it easier to recover from such corruption:
1. Would it help to make the fsimage/edits file ASCII, so that it can be
easily edited by hand?
2. Does it make sense for HDFS to automatically create a directory
equivalent to /lost+found? During EditLog processing, if the parent directory
of a file does not exist, the file could go into /lost+found, as in the
sketch after this list.
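
A minimal standalone sketch of that idea (toy code, not actual Hadoop
internals; all names are invented for illustration):

    // Toy namespace that diverts files with missing parents into /lost+found
    // instead of aborting the whole edits replay. Expects absolute paths.
    import java.util.HashSet;
    import java.util.Set;

    public class LostFoundSketch {
        private final Set<String> dirs = new HashSet<String>();

        public LostFoundSketch() {
            dirs.add("/");           // the root always exists
            dirs.add("/lost+found"); // created up front, per the proposal
        }

        public void mkdir(String path) { dirs.add(path); }

        // Returns the path under which the file is actually recorded.
        public String addFile(String path) {
            String parent = path.substring(0, path.lastIndexOf('/'));
            if (parent.length() == 0) parent = "/";
            if (dirs.contains(parent)) return path;
            // Parent missing: salvage the file rather than fail the replay.
            return "/lost+found/" + path.substring(path.lastIndexOf('/') + 1);
        }
    }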

Thanks,
dhruba

-----Original Message-----
From: Albert Chern [mailto:albert.chern@gmail.com] 
Sent: Friday, December 08, 2006 1:43 PM
To: hadoop-user@lucene.apache.org
Subject: Re: Corrupt DFS edits-file

This happened to me too, but the problem was the OP_MKDIR instructions
were in the wrong order.  That is, in the edits file the parent
directory was created after the child.  Maybe you should check to see
if that's the case.

I fixed it by using vi in combination with xxd.  When you have the
file open in vi, press escape and issue the command "%!xxd".  This
will convert the binary file to hexadecimal.  Then you can search
through and perform the necessary edits.  I don't remember what the
bytes were, but it was something like opcode, length of path (in
binary), path.  After you're done, issue the command "%!xxd -r" to
revert it to binary.  Remember to back up your files when you do this!
 I also had to kick off a trailing byte that got tagged on for some
reason during the binary/hex conversion.

Anyhow, this is a serious bug and could lead to data loss for a lot of
people.  I think we should report it.

On 12/8/06, Espen Amble Kolstad <es...@trank.no> wrote:
> Hi,
>
> I run hadoop-0.9-dev and my edits-file has become corrupt. When I try to
> start the namenode I get the following error:
> 2006-12-08 20:38:57,431 ERROR dfs.NameNode -
> java.io.FileNotFoundException: Parent path does not exist:
> /user/trank/dotno/segments/20061208154235/parse_data/part-00000
>         at
> org.apache.hadoop.dfs.FSDirectory$INode.addNode(FSDirectory.java:186)
>         at
> org.apache.hadoop.dfs.FSDirectory.unprotectedMkdir(FSDirectory.java:714)
>         at org.apache.hadoop.dfs.FSEditLog.loadFSEdits(FSEditLog.java:254)
>         at org.apache.hadoop.dfs.FSImage.loadFSImage(FSImage.java:191)
>         at
> org.apache.hadoop.dfs.FSDirectory.loadFSImage(FSDirectory.java:320)
>         at
> org.apache.hadoop.dfs.FSNamesystem.<init>(FSNamesystem.java:226)
>         at org.apache.hadoop.dfs.NameNode.<init>(NameNode.java:142)
>         at org.apache.hadoop.dfs.NameNode.<init>(NameNode.java:134)
>         at org.apache.hadoop.dfs.NameNode.main(NameNode.java:585)
>
> I've grep'ed through my edits-file, to see what's wrong. It seems the
> edits-file is missing an OP_MKDIR for
> /user/trank/dotno/segments/20061208154235/parse_data.
>
> Is there a tool for fixing an edits-file, or to put in an OP_MKDIR ?
>
> - Espen
>


Re: Corrupt DFS edits-file

Posted by Albert Chern <al...@gmail.com>.
This happened to me too, but the problem was that the OP_MKDIR instructions
were in the wrong order.  That is, in the edits file the parent directory
was created after the child.  Maybe you should check whether that's the case.

I fixed it by using vi in combination with xxd.  When you have the file open
in vi, press Escape and issue the command "%!xxd".  This will convert the
binary file to hexadecimal.  Then you can search through and perform the
necessary edits.  I don't remember exactly what the bytes were, but it was
something like: opcode, length of path (in binary), path.  After you're
done, issue the command "%!xxd -r" to revert it to binary.  Remember to back
up your files before you do this!  I also had to remove a trailing byte that
got tacked on for some reason during the binary/hex conversion.
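
In outline, the procedure looks like this (using only the commands mentioned
above; treat it as a sketch and always work on a copy):

    $ cp edits edits.bak      # back up first
    $ vi edits
    :%!xxd                    # convert the buffer to a hex dump
      (search for the path and make the edits)
    :%!xxd -r                 # convert the hex dump back to binary
    :wq
    $ ls -l edits edits.bak   # compare sizes; watch for a stray trailing byte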

Anyhow, this is a serious bug and could lead to data loss for a lot of
people.  I think we should report it.

On 12/8/06, Espen Amble Kolstad <es...@trank.no> wrote:
> Hi,
>
> I run hadoop-0.9-dev and my edits-file has become corrupt. When I try to
> start the namenode I get the following error:
> 2006-12-08 20:38:57,431 ERROR dfs.NameNode -
> java.io.FileNotFoundException: Parent path does not exist:
> /user/trank/dotno/segments/20061208154235/parse_data/part-00000
>         at
> org.apache.hadoop.dfs.FSDirectory$INode.addNode(FSDirectory.java:186)
>         at
> org.apache.hadoop.dfs.FSDirectory.unprotectedMkdir(FSDirectory.java:714)
>         at org.apache.hadoop.dfs.FSEditLog.loadFSEdits(FSEditLog.java:254)
>         at org.apache.hadoop.dfs.FSImage.loadFSImage(FSImage.java:191)
>         at
> org.apache.hadoop.dfs.FSDirectory.loadFSImage(FSDirectory.java:320)
>         at org.apache.hadoop.dfs.FSNamesystem.<init>(FSNamesystem.java:226)
>         at org.apache.hadoop.dfs.NameNode.<init>(NameNode.java:142)
>         at org.apache.hadoop.dfs.NameNode.<init>(NameNode.java:134)
>         at org.apache.hadoop.dfs.NameNode.main(NameNode.java:585)
>
> I've grep'ed through my edits-file, to see what's wrong. It seems the
> edits-file is missing an OP_MKDIR for
> /user/trank/dotno/segments/20061208154235/parse_data.
>
> Is there a tool for fixing an edits-file, or to put in an OP_MKDIR ?
>
> - Espen
>

Re: Corrupt DFS edits-file

Posted by Andrzej Bialecki <ab...@getopt.org>.
Konstantin Shvachko wrote:
> In your case the log is trying to create a directory named
> /user/trank/dotno/segments/20061208154235/parse_data/part-00000
> which is wrong, since part-00000 is supposed to be a file.

Not so: parse_data is created using MapFileOutputFormat, which creates as
many part-xxxxx subdirectories (MapFiles) as there are reduce tasks, and
puts {data, index} files in them. So .../parse_data/part-00000 should indeed
be a directory, with a layout like this:
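
    parse_data/
        part-00000/    <- a MapFile, i.e. a directory
            data
            index
        part-00001/    <- one per reduce task
            data
            index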

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Urgent: Production Issues

Posted by Jagadeesh <ja...@gmail.com>.
Hi All,

I am running Hadoop 0.7.2 in a production environment and it has stored
~170GB of data. The deployment architecture I am using is described below.

I am using 4 nodes with 1.3TB of storage each, and the master node is not
used for storage. So I have 5 servers in total, of which 4 are running
Hadoop nodes. This setup worked fine for the last 20-25 days and there were
no issues. As mentioned earlier, the total storage has now gone up to
~170GB.

A couple of days back, I noticed an error where Hadoop was not accepting new
files: uploads always failed, but downloads still worked fine. I was getting
the exception "writing <filename>.crc failed". When I tried restarting the
service, I got the messages "jobtracker not available" and "tasktracker not
available". Then I had to kill all the processes on the master node as well
as on the client nodes to restart the service.

After that, everything worked fine for one more day, and now I keep getting
the message

failure closing block of file /user/root/.LICENSE.txt2233331.crc to node
node1:50010

Even if I restart the service, I get this message again after 10 minutes.

I read on the mailing list that this issue is resolved in 0.9.0, but I am a
bit skeptical about moving to 0.9.0 as I don't know whether I will end up
losing the files that are already stored. Kindly confirm this and I will
move to 0.9.0; also, please tell me the steps or precautions I should take
before moving to 0.9.0.

Thanks and Regards
Jugs


Re: Corrupt DFS edits-file

Posted by Konstantin Shvachko <sh...@yahoo-inc.com>.
Philippe,
Periodic checkpointing will bound the size of the edits file. It will not
grow as big as it does now, and even if it does get corrupted, a relatively
small amount of information will be affected, compared to the current state,
where one can lose weeks of data if the name-node is not restarted
periodically.

Another thing is that the name-node should fall into safe mode when an edit
log transaction fails, and wait until the administrator fixes the problem
and turns safe mode off.

Espen,
I once had a corrupted edits file. I don't remember what was corrupted, but
the behavior was similar: the name-node wouldn't start. I included some
custom code in FSImage.loadFSImage to deal with the inconsistency (one
possible shape of such a workaround is sketched below). Once the correct
image was created, I discarded the custom code.
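
(A hypothetical toy sketch of such a replay repair, not the actual discarded
patch: when an edit references a missing parent, create the missing
ancestors first instead of aborting.)

    // Toy replay that creates missing ancestor directories on OP_MKDIR.
    // Invented for illustration; not Hadoop's real data structures.
    import java.util.TreeSet;

    public class ReplayRepair {
        private final TreeSet<String> dirs = new TreeSet<String>();
        { dirs.add("/"); }

        // Apply OP_MKDIR, implicitly creating any missing ancestors.
        public void applyMkdir(String path) {
            int i = 0;
            while ((i = path.indexOf('/', i + 1)) != -1)
                dirs.add(path.substring(0, i)); // e.g. /a then /a/b for /a/b/c
            dirs.add(path);
        }
    }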
In your case the log is trying to create a directory named
/user/trank/dotno/segments/20061208154235/parse_data/part-00000
which is wrong, since part-00000 is supposed to be a file.
Have you already restored your image?

--Konstantin

Philippe Gassmann wrote:

>
> Espen Amble Kolstad wrote:
>
>> Hi,
>>
>> I run hadoop-0.9-dev and my edits-file has become corrupt. When I try to
>> start the namenode I get the following error:
>> 2006-12-08 20:38:57,431 ERROR dfs.NameNode -
>> java.io.FileNotFoundException: Parent path does not exist:
>> /user/trank/dotno/segments/20061208154235/parse_data/part-00000
>>         at
>> org.apache.hadoop.dfs.FSDirectory$INode.addNode(FSDirectory.java:186)
>>         at
>> org.apache.hadoop.dfs.FSDirectory.unprotectedMkdir(FSDirectory.java:714)
>>         at 
>> org.apache.hadoop.dfs.FSEditLog.loadFSEdits(FSEditLog.java:254)
>>         at org.apache.hadoop.dfs.FSImage.loadFSImage(FSImage.java:191)
>>         at
>> org.apache.hadoop.dfs.FSDirectory.loadFSImage(FSDirectory.java:320)
>>         at 
>> org.apache.hadoop.dfs.FSNamesystem.<init>(FSNamesystem.java:226)
>>         at org.apache.hadoop.dfs.NameNode.<init>(NameNode.java:142)
>>         at org.apache.hadoop.dfs.NameNode.<init>(NameNode.java:134)
>>         at org.apache.hadoop.dfs.NameNode.main(NameNode.java:585)
>>
>> I've grep'ed through my edits-file, to see what's wrong. It seems the
>> edits-file is missing an OP_MKDIR for
>> /user/trank/dotno/segments/20061208154235/parse_data.
>>
>> Is there a tool for fixing an edits-file, or to put in an OP_MKDIR ?
>>
>> - Espen
>
>
> Hi all,
>
> Some time ago, I had a similar issue 
> (http://issues.apache.org/jira/browse/HADOOP-760 that duplicates 
> http://issues.apache.org/jira/browse/HADOOP-227).
>
> My first thought about that was to do automatic checkpointing by 
> merging edits logs to the fsimage (as described in HADOOP-227).
>
> But this approach cannot be considered if edits logs are corrupted (= 
> non mergeable). So I believe we should think about another recovery 
> method.
>
> AFAIK, datanodes are only aware about blocks they are owning. I think 
> we could add a little bit more information with each blocks : the path 
> on the filesystem and the block number. If the namenode is totally 
> crashed (corrupted edit logs), the fs image could be quite easily 
> rebuilt by quierying all datanodes about their blocks.
>
> WDYT ?
>
> cheers,
> -- 
> Philippe.
>
>
>


Re: Corrupt DFS edits-file

Posted by Philippe Gassmann <ph...@anyware-tech.com>.
Espen Amble Kolstad wrote:
> Hi,
> 
> I run hadoop-0.9-dev and my edits-file has become corrupt. When I try to
> start the namenode I get the following error:
> 2006-12-08 20:38:57,431 ERROR dfs.NameNode -
> java.io.FileNotFoundException: Parent path does not exist:
> /user/trank/dotno/segments/20061208154235/parse_data/part-00000
>         at
> org.apache.hadoop.dfs.FSDirectory$INode.addNode(FSDirectory.java:186)
>         at
> org.apache.hadoop.dfs.FSDirectory.unprotectedMkdir(FSDirectory.java:714)
>         at org.apache.hadoop.dfs.FSEditLog.loadFSEdits(FSEditLog.java:254)
>         at org.apache.hadoop.dfs.FSImage.loadFSImage(FSImage.java:191)
>         at
> org.apache.hadoop.dfs.FSDirectory.loadFSImage(FSDirectory.java:320)
>         at org.apache.hadoop.dfs.FSNamesystem.<init>(FSNamesystem.java:226)
>         at org.apache.hadoop.dfs.NameNode.<init>(NameNode.java:142)
>         at org.apache.hadoop.dfs.NameNode.<init>(NameNode.java:134)
>         at org.apache.hadoop.dfs.NameNode.main(NameNode.java:585)
> 
> I've grep'ed through my edits-file, to see what's wrong. It seems the
> edits-file is missing an OP_MKDIR for
> /user/trank/dotno/segments/20061208154235/parse_data.
> 
> Is there a tool for fixing an edits-file, or to put in an OP_MKDIR ?
> 
> - Espen

Hi all,

Some time ago, I had a similar issue
(http://issues.apache.org/jira/browse/HADOOP-760, which duplicates
http://issues.apache.org/jira/browse/HADOOP-227).

My first thought was to do automatic checkpointing by merging the edits log
into the fsimage (as described in HADOOP-227).

But that approach does not help if the edits log itself is corrupted (i.e.,
non-mergeable), so I believe we should think about another recovery method.

AFAIK, datanodes are only aware of the blocks they own. I think we could
store a little more information with each block: the path of the file on the
filesystem and the block number. If the namenode is totally crashed
(corrupted edit logs), the fsimage could then be rebuilt fairly easily by
querying all datanodes about their blocks, along the lines of this sketch:
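
(A hypothetical toy sketch of the idea; the augmented per-block metadata and
all names here are invented, not an existing Hadoop API.)

    // Reassemble a file -> ordered-block-list map from augmented block
    // reports carrying (filePath, blockIndex, blockId). Invented types.
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;
    import java.util.TreeMap;

    public class NamespaceRebuild {
        static class BlockReport {
            final String filePath; final int blockIndex; final long blockId;
            BlockReport(String p, int i, long id) {
                filePath = p; blockIndex = i; blockId = id;
            }
        }

        // file path -> block ids, ordered by their index within the file
        public Map<String, List<Long>> rebuild(List<BlockReport> reports) {
            Map<String, TreeMap<Integer, Long>> byFile =
                new TreeMap<String, TreeMap<Integer, Long>>();
            for (BlockReport r : reports) {
                TreeMap<Integer, Long> blocks = byFile.get(r.filePath);
                if (blocks == null) {
                    blocks = new TreeMap<Integer, Long>();
                    byFile.put(r.filePath, blocks);
                }
                blocks.put(r.blockIndex, r.blockId);
            }
            Map<String, List<Long>> image = new TreeMap<String, List<Long>>();
            for (Map.Entry<String, TreeMap<Integer, Long>> e : byFile.entrySet())
                image.put(e.getKey(), new ArrayList<Long>(e.getValue().values()));
            return image;
        }
    }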

WDYT?

cheers,
--
Philippe.