
Posted to hdfs-dev@hadoop.apache.org by Gabi Kazav <Ga...@pursway.com> on 2011/09/22 10:48:21 UTC

corrupted edits log after power failure

Hi,

I had a power failure.
I have backups of the edits and fsimage files.

I back them up with:

curl -s http://nameNode:50070/getimage?getimage=1 > fsimage
curl -s http://nameNode:50070/getimage?getedits=1 > edits

When I try to start HDFS with the recovered files, I get an error about the edits file: "Error replaying edit log at offset 1921"

I also have an edits.new file; when I rename it to edits, I get: "ERROR org.apache.hadoop.hdfs.server.common.Storage: Error replaying edit log at offset 2494103"

What is the problem?!


And from now on, how can I do a backup that works?! :)

Thanks,
Gabi.




Gabi Kazav
IT Manager And Infrastructure Engineer
Gabi.Kazav@pursway.com | www.pursway.com
Mailing address PO Box 4125, Herzliya 46140
Address 8 Hamada St., Herzliya, IL | Tel +972 527 772457| Fax + 972 9 958 4736

Re: corrupted edits log after power failure

Posted by Steve Loughran <st...@apache.org>.

On 22/09/11 20:15, Brian Bockelman wrote:
> Hi Gabi,
>
> I'd be a bit scared of that backup strategy; what happens if the TCP connection gets cut suddenly during curl?  What happens if there's TCP corruption?  Such things have happened before.

Curl might work for long-haul backups, but I'd use HTTPS for its better
checksums, and have alternate in-cluster strategies, such as shared HA
filesystems.

>
> Personally, we have the SNN merge the edits every 15 minutes.  If it hasn't happened in 30 minutes, people get emailed.  If it doesn't happen in 45 minutes, people get paged.

That's a good technique for verifying the SNN is actually working.
Thinking it is working when it isn't is dangerous.

> In addition to writing out copies to a few disks and to NFS, we also have a versioned backup of the checkpoint.prev.
>
> The worst case scenario would be if the SNN corrupts the image and uploads the corrupt image (it's a theoretical situation so far...); this would be caught at the next merge, meaning we trash up to 30 minutes of work.  This would ruin someone's day, but not someone's week.
>
> The NN is a SPOF, and should be treated with an appropriate level of paranoia (and, because it is a SPOF, assume that it will fail anyway and make sure you can accept the consequences).

That is: test your handling of the outage on a regular basis.

Re: corrupted edits log after power failure

Posted by Brian Bockelman <bb...@cse.unl.edu>.

Hi Gabi,

I'd be a bit scared of that backup strategy; what happens if the TCP connection gets cut suddenly during curl?  What happens if there's TCP corruption?  Such things have happened before.
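
To make the truncation concern concrete, here is a minimal sketch of a download step that refuses to publish a short copy. It is not something from this thread: the names save_verified and expected_len are hypothetical, and it assumes the caller can learn the expected size (for example from a Content-Length header, which the NameNode may or may not send). A length check only catches a cut connection; it does not detect corrupted bytes, which would need a checksum from the server.

```python
import os
import tempfile


def save_verified(stream, expected_len, dest_path, chunk_size=64 * 1024):
    """Copy `stream` to `dest_path`, publishing it only if the size matches."""
    dest_dir = os.path.dirname(os.path.abspath(dest_path))
    # Write to a temporary file in the same directory so the final rename
    # stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir, prefix=".partial-")
    received = 0
    try:
        with os.fdopen(fd, "wb") as out:
            while True:
                chunk = stream.read(chunk_size)
                if not chunk:
                    break
                out.write(chunk)
                received += len(chunk)
        if received != expected_len:
            raise IOError("got %d bytes, expected %d" % (received, expected_len))
        # Replace the previous backup only after the size check has passed.
        os.replace(tmp_path, dest_path)
    except BaseException:
        # Leave any earlier good backup untouched and drop the partial file.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return received
```

With urllib.request.urlopen, stream would be the response object and expected_len would come from its Content-Length header, if the server provides one.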

Personally, we have the SNN merge the edits every 15 minutes.  If it hasn't happened in 30 minutes, people get emailed.  If it doesn't happen in 45 minutes, people get paged.
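
The escalation rule above could be checked by a small script run from cron. This is only a sketch: the 30 and 45 minute thresholds come from the paragraph above, while the function names and the idea of using the checkpoint file's modification time as the "last merge" timestamp are assumptions, not a description of the actual setup.

```python
import os
import time

# Thresholds in minutes, taken from the paragraph above.
EMAIL_AFTER_MIN = 30
PAGE_AFTER_MIN = 45


def alert_level(age_minutes):
    """Map the age of the last successful merge to an action."""
    if age_minutes >= PAGE_AFTER_MIN:
        return "page"
    if age_minutes >= EMAIL_AFTER_MIN:
        return "email"
    return "ok"


def checkpoint_age_minutes(image_path, now=None):
    """Age of the checkpoint file in minutes, judged by its mtime (an assumption)."""
    now = time.time() if now is None else now
    return (now - os.path.getmtime(image_path)) / 60.0
```

A cron job could call alert_level(checkpoint_age_minutes(path)) and hand the result to whatever mail or paging tool is in use.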

In addition to writing out copies to a few disks and to NFS, we also have a versioned backup of the checkpoint.prev.
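
One way to picture "copies to a few disks plus versions" is the sketch below. The helper name versioned_copy, the stamp format, and the retention count are all hypothetical; the thread does not say how the versioned backup is actually implemented.

```python
import os
import shutil


def versioned_copy(src, backup_dirs, stamp, keep=5):
    """Copy `src` into each directory as <name>.<stamp>, keeping the newest `keep`."""
    name = os.path.basename(src)
    written = []
    for d in backup_dirs:
        os.makedirs(d, exist_ok=True)
        dest = os.path.join(d, "%s.%s" % (name, stamp))
        shutil.copy2(src, dest)
        written.append(dest)
        # Stamps are assumed to sort lexicographically (e.g. 20110922T1048),
        # so the oldest versions come first. `keep` is assumed to be >= 1.
        versions = sorted(f for f in os.listdir(d) if f.startswith(name + "."))
        for old in versions[:-keep]:
            os.remove(os.path.join(d, old))
    return written
```

The directories could be local disks and an NFS mount; keeping several versions means a corrupt checkpoint does not overwrite the last good one.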

The worst case scenario would be if the SNN corrupts the image and uploads the corrupt image (it's a theoretical situation so far...); this would be caught at the next merge, meaning we trash up to 30 minutes of work.  This would ruin someone's day, but not someone's week.

The NN is a SPOF, and should be treated with an appropriate level of paranoia (and, because it is a SPOF, assume that it will fail anyway and make sure you can accept the consequences).

Brian


Re: corrupted edits log after power failure

Posted by Kihwal Lee <ki...@yahoo-inc.com>.

Does the backup process include syncing? On-drive write cache can also trick you.
For absolutely critical data, it is a good idea to use a controller with battery-backed write cache or a service/product that guarantees durability.
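
As an illustration of what "syncing" means for a backup script, here is a small sketch, assuming a POSIX system; the function name write_durably is hypothetical. It asks the OS to flush both the file and its directory entry, but as noted above, a drive with a volatile write cache can still acknowledge the write before the data is truly on stable storage.

```python
import os


def write_durably(path, data):
    """Write bytes to `path` and ask the OS to flush them to the device."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()              # push Python's buffer to the OS
        os.fsync(f.fileno())   # ask the OS to push its cache to the device
    # Also sync the containing directory so the new entry survives a crash.
    dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```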

Kihwal
