Posted to hdfs-user@hadoop.apache.org by Elmar Grote <el...@optivo.de> on 2013/03/01 09:36:47 UTC

fsimage.ckpt are not deleted - Exception in doCheckpoint

Hi,

we are writing our fsimage and edits files on the namenode and the secondary namenode, and additionally on an NFS share.

In these folders we found a lot of fsimage.ckpt_000000000........ files; the oldest is from 9 Aug 2012.
As far as I know, these files should be deleted after the secondary namenode creates the new fsimage file.
I looked in the log files of the namenode and the secondary namenode to see what happened at that time.

As an example, I searched for this file:
20. Feb 04:02 fsimage.ckpt_0000000000726216952

In the namenode log I found this:
2013-02-20 04:02:51,404 ERROR org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:hdfs (auth:SIMPLE) cause:java.io.IOException: Input/output error
2013-02-20 04:02:51,409 WARN org.mortbay.log: /getimage: java.io.IOException: GetImage failed. java.io.IOException: Input/output error

In the secondary namenode log, I think this is the relevant part:
2013-02-20 04:01:16,554 INFO org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: Image has not changed. Will not download image.
2013-02-20 04:01:16,554 INFO org.apache.hadoop.hdfs.server.namenode.TransferFsImage: Opening connection to http://s_namenode.domain.local:50070/getimage?getedit=1&startTxId=726172233&endTxId=726216952&storageInfo=-40:1814856193:1341996094997:CID-064c4e47-387d-454d-aa1e-27cec1e816e4
2013-02-20 04:01:16,750 INFO org.apache.hadoop.hdfs.server.namenode.TransferFsImage: Downloaded file edits_0000000000726172233-0000000000726216952 size 6881797 bytes.
2013-02-20 04:01:16,750 INFO org.apache.hadoop.hdfs.server.namenode.Checkpointer: Checkpointer about to load edits from 1 stream(s).
2013-02-20 04:01:16,750 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: Reading /var/lib/hdfs_namenode/meta/dfs/namesecondary/current/edits_0000000000726172233-0000000000726216952 expecting start txid #726172233
2013-02-20 04:01:16,987 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: Edits file /var/lib/hdfs_namenode/meta/dfs/namesecondary/current/edits_0000000000726172233-0000000000726216952 of size 6881797 edits # 44720 loaded in 0 seconds.
2013-02-20 04:01:18,023 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: Saving image file /var/lib/hdfs_namenode/meta/dfs/namesecondary/current/fsimage.ckpt_0000000000726216952 using no compression
2013-02-20 04:01:18,031 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: Saving image file /var/lib/hdfs_nfs_share/dfs/namesecondary/current/fsimage.ckpt_0000000000726216952 using no compression
2013-02-20 04:01:40,854 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: Image file of size 1211973003 saved in 22 seconds.
2013-02-20 04:01:50,762 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: Image file of size 1211973003 saved in 32 seconds.
2013-02-20 04:01:50,770 INFO org.apache.hadoop.hdfs.server.namenode.NNStorageRetentionManager: Going to retain 2 images with txid >= 726172232
2013-02-20 04:01:50,770 INFO org.apache.hadoop.hdfs.server.namenode.NNStorageRetentionManager: Purging old image FSImageFile(file=/var/lib/hdfs_namenode/meta/dfs/namesecondary/current/fsimage_0000000000726121750, cpktTxId=0000000000726121750)
2013-02-20 04:01:51,000 INFO org.apache.hadoop.hdfs.server.namenode.NNStorageRetentionManager: Purging old image FSImageFile(file=/var/lib/hdfs_nfs_share/dfs/namesecondary/current/fsimage_0000000000726121750, cpktTxId=0000000000726121750)
2013-02-20 04:01:51,379 INFO org.apache.hadoop.hdfs.server.namenode.FileJournalManager: Purging logs older than 725172233
2013-02-20 04:01:51,381 INFO org.apache.hadoop.hdfs.server.namenode.FileJournalManager: Purging logs older than 725172233
2013-02-20 04:01:51,400 INFO org.apache.hadoop.hdfs.server.namenode.TransferFsImage: Opening connection to http://s_namenode.domain.local:50070/getimage?putimage=1&txid=726216952&port=50090&storageInfo=-40:1814856193:1341996094997:CID-064c4e47-387d-454d-aa1e-27cec1e816e4
2013-02-20 04:02:51,411 ERROR org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: Exception in doCheckpoint
org.apache.hadoop.hdfs.server.namenode.TransferFsImage$HttpGetFailedException: Image transfer servlet at http://s_namenode.domain.local:50070/getimage?putimage=1&txid=726216952&port=50090&storageInfo=-40:1814856193:1341996094997:CID-064c4e47-387d-454d-aa1e-27cec1e816e4 failed with status code 410
Response message:
GetImage failed. java.io.IOException: Input/output error  at sun.nio.ch.FileChannelImpl.force0(Native Method)  at sun.nio.ch.FileChannelImpl.force(FileChannelImpl.java:358)  at org.apache.hadoop.hdfs.server.namenode.TransferFsImage.getFileClient(TransferFsImage.java:303)  at org.apache.hadoop.hdfs.server.namenode.TransferFsImage.downloadImageToStorage(TransferFsImage.java:75)  at org.apache.hadoop.hdfs.server.namenode.GetImageServlet$1.run(GetImageServlet.java:169)  at org.apache.hadoop.hdfs.server.namenode.GetImageServlet$1.run(GetImageServlet.java:111)  at java.security.AccessController.doPrivileged(Native Method)  at javax.security.auth.Subject.doAs(Subject.java:396)  at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232)  at org.apache.hadoop.hdfs.server.namenode.GetImageServlet.doGet(GetImageServlet.java:111)  at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)  at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)  at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511)  at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221)  at org.apache.hadoop.http.HttpServer$QuotingInputFilter.doFilter(HttpServer.java:947)  at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)  at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)  at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)  at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)  at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)  at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)  at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)  at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)  at org.mortbay.jetty.Server.handle(Server.java:326)  at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)  at org.mortbay.jetty.Htt
    at org.apache.hadoop.hdfs.server.namenode.TransferFsImage.getFileClient(TransferFsImage.java:216)
    at org.apache.hadoop.hdfs.server.namenode.TransferFsImage.uploadImageFromStorage(TransferFsImage.java:126)
    at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doCheckpoint(SecondaryNameNode.java:478)
    at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doWork(SecondaryNameNode.java:334)
    at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode$2.run(SecondaryNameNode.java:301)
    at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:438)
    at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.run(SecondaryNameNode.java:297)
    at java.lang.Thread.run(Thread.java:619)
2013-02-20 04:04:52,592 INFO org.apache.hadoop.hdfs.server.namenode.TransferFsImage: Opening connection to http://s_namenode.domain.local:50070/getimage?getimage=1&txid=726172232&storageInfo=-40:1814856193:1341996094997:CID-064c4e47-387d-454d-aa1e-27cec1e816e4
2013-02-20 04:05:36,976 INFO org.apache.hadoop.hdfs.server.namenode.TransferFsImage: Downloaded file fsimage.ckpt_0000000000726172232 size 1212016242 bytes.
2013-02-20 04:05:37,595 INFO org.apache.hadoop.hdfs.server.namenode.TransferFsImage: Skipping download of remote edit log [726172233,726216952] since it already is stored locally at /var/lib/hdfs_namenode/meta/dfs/namesecondary/current/edits_0000000000726172233-0000000000726216952
2013-02-20 04:05:37,595 INFO org.apache.hadoop.hdfs.server.namenode.TransferFsImage: Opening connection to http://s_namenode.domain.local:50070/getimage?getedit=1&startTxId=726216953&endTxId=726262269&storageInfo=-40:1814856193:1341996094997:CID-064c4e47-387d-454d-aa1e-27cec1e816e4
2013-02-20 04:05:38,339 INFO org.apache.hadoop.hdfs.server.namenode.TransferFsImage: Downloaded file edits_0000000000726216953-0000000000726262269 size 7013503 bytes.
2013-02-20 04:05:38,339 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: Loading image file /var/lib/hdfs_namenode/meta/dfs/namesecondary/current/fsimage_0000000000726172232 using no compression
2013-02-20 04:05:38,339 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: Number of files = 8795086
2013-02-20 04:06:13,678 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: Number of files under construction = 32
2013-02-20 04:06:13,679 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: Image file of size 1212016242 loaded in 35 seconds.
2013-02-20 04:06:13,679 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: Loaded image for txid 726172232 from /var/lib/hdfs_namenode/meta/dfs/namesecondary/current/fsimage_0000000000726172232
2013-02-20 04:06:13,679 INFO org.apache.hadoop.hdfs.server.namenode.Checkpointer: Checkpointer about to load edits from 2 stream(s).
2013-02-20 04:06:13,679 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: Reading /var/lib/hdfs_namenode/meta/dfs/namesecondary/current/edits_0000000000726172233-0000000000726216952 expecting start txid #726172233
2013-02-20 04:06:14,038 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: Edits file /var/lib/hdfs_namenode/meta/dfs/namesecondary/current/edits_0000000000726172233-0000000000726216952 of size 6881797 edits # 44720 loaded in 0 seconds.
2013-02-20 04:06:14,038 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: Reading /var/lib/hdfs_namenode/meta/dfs/namesecondary/current/edits_0000000000726216953-0000000000726262269 expecting start txid #726216953
2013-02-20 04:06:14,372 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: Edits file /var/lib/hdfs_namenode/meta/dfs/namesecondary/current/edits_0000000000726216953-0000000000726262269 of size 7013503 edits # 45317 loaded in 0 seconds.
2013-02-20 04:06:15,285 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: Saving image file /var/lib/hdfs_namenode/meta/dfs/namesecondary/current/fsimage.ckpt_0000000000726262269 using no compression
2013-02-20 04:06:15,289 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: Saving image file /var/lib/hdfs_nfs_share/dfs/namesecondary/current/fsimage.ckpt_0000000000726262269 using no compression
2013-02-20 04:06:38,530 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: Image file of size 1212107279 saved in 23 seconds.
2013-02-20 04:06:45,380 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: Image file of size 1212107279 saved in 30 seconds.
2013-02-20 04:06:45,406 INFO org.apache.hadoop.hdfs.server.namenode.NNStorageRetentionManager: Going to retain 2 images with txid >= 726216952
2013-02-20 04:06:45,406 INFO org.apache.hadoop.hdfs.server.namenode.NNStorageRetentionManager: Purging old image FSImageFile(file=/var/lib/hdfs_namenode/meta/dfs/namesecondary/current/fsimage_0000000000726172232, cpktTxId=0000000000726172232)
2013-02-20 04:06:45,646 INFO org.apache.hadoop.hdfs.server.namenode.NNStorageRetentionManager: Purging old image FSImageFile(file=/var/lib/hdfs_nfs_share/dfs/namesecondary/current/fsimage_0000000000726172232, cpktTxId=0000000000726172232)
2013-02-20 04:06:46,115 INFO org.apache.hadoop.hdfs.server.namenode.FileJournalManager: Purging logs older than 725216953
2013-02-20 04:06:46,118 INFO org.apache.hadoop.hdfs.server.namenode.FileJournalManager: Purging logs older than 725216953
2013-02-20 04:06:46,145 INFO org.apache.hadoop.hdfs.server.namenode.TransferFsImage: Opening connection to http://s_namenode.domain.local:50070/getimage?putimage=1&txid=726262269&port=50090&storageInfo=-40:1814856193:1341996094997:CID-064c4e47-387d-454d-aa1e-27cec1e816e4
2013-02-20 04:07:31,010 INFO org.apache.hadoop.hdfs.server.namenode.TransferFsImage: Uploaded image with txid 726262269 to namenode at s_namenode.domain.local:50070
2013-02-20 04:07:31,011 WARN org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: Checkpoint done. New Image Size: 1212107279
2013-02-20 04:07:31,013 INFO org.apache.hadoop.hdfs.server.namenode.NNStorageRetentionManager: Going to retain 2 images with txid >= 726216952
2013-02-20 04:07:31,013 INFO org.apache.hadoop.hdfs.server.namenode.FileJournalManager: Purging logs older than 725216953
2013-02-20 04:07:31,013 INFO org.apache.hadoop.hdfs.server.namenode.FileJournalManager: Purging logs older than 725216953

At that time we were copying all of our data from this cluster to a second cluster, so there was a lot of read activity on this cluster.
In our monitoring graphs I cannot see any peaks at the time this happened. Only the secondary namenode uses about 1 GB more memory than usual.

What happened here? Was the creation of the new fsimage successful? If so, why were the old fsimage.ckpt files not deleted?
Or did we lose some data?

Regards Elmar

Re: fsimage.ckpt are not deleted - Exception in doCheckpoint

Posted by Harsh J <ha...@cloudera.com>.
I don't think there's any data loss here. However, you may have been
affected by https://issues.apache.org/jira/browse/HDFS-4301, perhaps due to a
large fsimage size. You can work around it by increasing the default transfer
timeout via the property dfs.image.transfer.timeout (default 60000 ms, i.e. 1
minute) to 10 minutes or more, specified in milliseconds.
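
For example, a minimal hdfs-site.xml override might look like the sketch below; the 600000 ms (10 minute) value is only an illustration, pick a timeout that comfortably covers how long your image transfer actually takes:

  <property>
    <name>dfs.image.transfer.timeout</name>
    <!-- image transfer timeout in milliseconds; 600000 ms = 10 minutes (example value) -->
    <value>600000</value>
  </property>

Depending on your Hadoop version this property may need to be set on both the namenode and the secondary namenode, and the daemons typically need a restart for the change to take effect.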

-- 
Harsh J

Re: fsimage.ckpt are not deleted - Exception in doCheckpoint

Posted by Harsh J <ha...@cloudera.com>.
I don't think there's a data loss here. However, I think you may have been
affected by https://issues.apache.org/jira/browse/HDFS-4301 due to a large
fsimage size perhaps, which you can workaround by increasing the default
timeout via property dfs.image.transfer.timeout (default 60000ms, i.e. 1
minute) to > 10 minutes or so in ms value.

On Fri, Mar 1, 2013 at 2:06 PM, Elmar Grote <el...@optivo.de> wrote:

> **
> Hi,
>
> we are writing our fsimage and edits file on the namenode and secondary
> namenode and additional on a nfs share.
>
> In these folders we found a a lot of fsimage.ckpt_000000000........
> **. files, the oldest is from 9. Aug 2012.
> As far a i know these files should be deleted after the secondary
> namenodes creates the new fsimage file.
> I looked in our log files from the namenode and secondary namenode to see
> what happen at that time.
>
> As example i searched for this file:
> 20. Feb 04:02 fsimage.ckpt_**0000000000726216952
>
> In the namenode log i found this:
> 2013-02-20 04:02:51,404 ERROR org.apache.hadoop.security.**UserGroupInformation:
> PriviledgedActionException as:hdfs (auth:SIMPLE) cause:java.io.IOException:
> Input/output error
> 2013-02-20 04:02:51,409 WARN org.mortbay.log: /getimage:
> java.io.IOException: GetImage failed. java.io.IOException: Input/output
> error
>
> In the secondary namenode i think this is the relevant part:
> 2013-02-20 04:01:16,554 INFO org.apache.hadoop.hdfs.server.**namenode.SecondaryNameNode:
> Image has not changed. Will not download image.
> 2013-02-20 04:01:16,554 INFO org.apache.hadoop.hdfs.server.**namenode.TransferFsImage:
> Opening connection to http://s_namenode.domain.**
> local:50070/getimage?getedit=**1&startTxId=726172233&endTxId=**
> 726216952&storageInfo=-40:**1814856193:1341996094997:CID-**
> 064c4e47-387d-454d-aa1e-**27cec1e816e4<http://s_namenode.domain.local:50070/getimage?getedit=1&startTxId=726172233&endTxId=726216952&storageInfo=-40:1814856193:1341996094997:CID-064c4e47-387d-454d-aa1e-27cec1e816e4>
> 2013-02-20 04:01:16,750 INFO org.apache.hadoop.hdfs.server.**namenode.TransferFsImage:
> Downloaded file edits_0000000000726172233-**0000000000726216952 size
> 6881797 bytes.
> 2013-02-20 04:01:16,750 INFO org.apache.hadoop.hdfs.server.**namenode.Checkpointer:
> Checkpointer about to load edits from 1 stream(s).
> 2013-02-20 04:01:16,750 INFO org.apache.hadoop.hdfs.server.**namenode.FSImage:
> Reading /var/lib/hdfs_namenode/meta/**dfs/namesecondary/current/**
> edits_0000000000726172233-**0000000000726216952 expecting start txid
> #726172233
> 2013-02-20 04:01:16,987 INFO org.apache.hadoop.hdfs.server.**namenode.FSImage:
> Edits file /var/lib/hdfs_namenode/meta/**dfs/namesecondary/current/**
> edits_0000000000726172233-**0000000000726216952 of size 6881797 edits #
> 44720 loaded in 0 seconds.
> 2013-02-20 04:01:18,023 INFO org.apache.hadoop.hdfs.server.**namenode.FSImage:
> Saving image file /var/lib/hdfs_namenode/meta/**dfs/namesecondary/current/
> **fsimage.ckpt_**0000000000726216952 using no compression
> 2013-02-20 04:01:18,031 INFO org.apache.hadoop.hdfs.server.**namenode.FSImage:
> Saving image file /var/lib/hdfs_nfs_share/dfs/**
> namesecondary/current/fsimage.**ckpt_0000000000726216952 using no
> compression
> 2013-02-20 04:01:40,854 INFO org.apache.hadoop.hdfs.server.**namenode.FSImage:
> Image file of size 1211973003 saved in 22 seconds.
> 2013-02-20 04:01:50,762 INFO org.apache.hadoop.hdfs.server.**namenode.FSImage:
> Image file of size 1211973003 saved in 32 seconds.
> 2013-02-20 04:01:50,770 INFO org.apache.hadoop.hdfs.server.**namenode.**NNStorageRetentionManager:
> Going to retain 2 images with txid >= 726172232
> 2013-02-20 04:01:50,770 INFO org.apache.hadoop.hdfs.server.**namenode.**NNStorageRetentionManager:
> Purging old image FSImageFile(file=/var/lib/**hdfs_namenode/meta/dfs/**
> namesecondary/current/fsimage_**0000000000726121750,
> cpktTxId=0000000000726121750)
> 2013-02-20 04:01:51,000 INFO org.apache.hadoop.hdfs.server.**namenode.**NNStorageRetentionManager:
> Purging old image FSImageFile(file=/var/lib/**hdfs_nfs_share/dfs/**
> namesecondary/current/fsimage_**0000000000726121750,
> cpktTxId=0000000000726121750)
> 2013-02-20 04:01:51,379 INFO org.apache.hadoop.hdfs.server.**namenode.FileJournalManager:
> Purging logs older than 725172233
> 2013-02-20 04:01:51,381 INFO org.apache.hadoop.hdfs.server.**namenode.FileJournalManager:
> Purging logs older than 725172233
> 2013-02-20 04:01:51,400 INFO org.apache.hadoop.hdfs.server.**namenode.TransferFsImage:
> Opening connection to http://s_namenode.domain.**
> local:50070/getimage?putimage=**1&txid=726216952&port=50090&**
> storageInfo=-40:1814856193:**1341996094997:CID-064c4e47-**
> 387d-454d-aa1e-27cec1e816e4<http://s_namenode.domain.local:50070/getimage?putimage=1&txid=726216952&port=50090&storageInfo=-40:1814856193:1341996094997:CID-064c4e47-387d-454d-aa1e-27cec1e816e4>
> 2013-02-20 04:02:51,411 ERROR org.apache.hadoop.hdfs.server.**namenode.SecondaryNameNode:
> Exception in doCheckpoint
> org.apache.hadoop.hdfs.server.**namenode.TransferFsImage$**HttpGetFailedException:
> Image transfer servlet at http://s_namenode.domain.**
> local:50070/getimage?putimage=**1&txid=726216952&port=50090&**
> storageInfo=-40:1814856193:**1341996094997:CID-064c4e47-**
> 387d-454d-aa1e-27cec1e816e4<http://s_namenode.domain.local:50070/getimage?putimage=1&txid=726216952&port=50090&storageInfo=-40:1814856193:1341996094997:CID-064c4e47-387d-454d-aa1e-27cec1e816e4>failed with status code 410
> Response message:
> GetImage failed. java.io.IOException: Input/output error  at
> sun.nio.ch.FileChannelImpl.**force0(Native Method)  at
> sun.nio.ch.FileChannelImpl.**force(FileChannelImpl.java:**358)  at
> org.apache.hadoop.hdfs.server.**namenode.TransferFsImage.**
> getFileClient(TransferFsImage.**java:303)  at
> org.apache.hadoop.hdfs.server.**namenode.TransferFsImage.**
> downloadImageToStorage(**TransferFsImage.java:75)  at
> org.apache.hadoop.hdfs.server.**namenode.GetImageServlet$1.**run(GetImageServlet.java:169)
> at org.apache.hadoop.hdfs.server.**namenode.GetImageServlet$1.**run(GetImageServlet.java:111)
> at java.security.**AccessController.doPrivileged(**Native Method)  at
> javax.security.auth.Subject.**doAs(Subject.java:396)  at
> org.apache.hadoop.security.**UserGroupInformation.doAs(**
> UserGroupInformation.java:**1232)  at org.apache.hadoop.hdfs.server.**
> namenode.GetImageServlet.**doGet(GetImageServlet.java:**111)  at
> javax.servlet.http.**HttpServlet.service(**HttpServlet.java:707)  at
> javax.servlet.http.**HttpServlet.service(**HttpServlet.java:820)  at
> org.mortbay.jetty.servlet.**ServletHolder.handle(**ServletHolder.java:511)
> at org.mortbay.jetty.servlet.**ServletHandler$CachedChain.**
> doFilter(ServletHandler.java:**1221)  at org.apache.hadoop.http.**
> HttpServer$QuotingInputFilter.**doFilter(HttpServer.java:947)  at
> org.mortbay.jetty.servlet.**ServletHandler$CachedChain.**
> doFilter(ServletHandler.java:**1212)  at org.mortbay.jetty.servlet.**
> ServletHandler.handle(**ServletHandler.java:399)  at
> org.mortbay.jetty.security.**SecurityHandler.handle(**SecurityHandler.java:216)
> at org.mortbay.jetty.servlet.**SessionHandler.handle(**SessionHandler.java:182)
> at org.mortbay.jetty.handler.**ContextHandler.handle(**ContextHandler.java:766)
> at org.mortbay.jetty.webapp.**WebAppContext.handle(**WebAppContext.java:450)
> at org.mortbay.jetty.handler.**ContextHandlerCollection.**handle(**
> ContextHandlerCollection.java:**230)  at org.mortbay.jetty.handler.**
> HandlerWrapper.handle(**HandlerWrapper.java:152)  at
> org.mortbay.jetty.Server.**handle(Server.java:326)  at org.mortbay.jetty.*
> *HttpConnection.handleRequest(**HttpConnection.java:542)  at
> org.mortbay.jetty.Htt
>     at org.apache.hadoop.hdfs.server.**namenode.TransferFsImage.**
> getFileClient(TransferFsImage.**java:216)
>     at org.apache.hadoop.hdfs.server.**namenode.TransferFsImage.**
> uploadImageFromStorage(**TransferFsImage.java:126)
>     at org.apache.hadoop.hdfs.server.**namenode.SecondaryNameNode.**
> doCheckpoint(**SecondaryNameNode.java:478)
>     at org.apache.hadoop.hdfs.server.**namenode.SecondaryNameNode.**
> doWork(SecondaryNameNode.java:**334)
>     at org.apache.hadoop.hdfs.server.**namenode.SecondaryNameNode$2.**
> run(SecondaryNameNode.java:**301)
>     at org.apache.hadoop.security.**SecurityUtil.**doAsLoginUserOrFatal(**
> SecurityUtil.java:438)
>     at org.apache.hadoop.hdfs.server.**namenode.SecondaryNameNode.**
> run(SecondaryNameNode.java:**297)
>     at java.lang.Thread.run(Thread.**java:619)
> 2013-02-20 04:04:52,592 INFO org.apache.hadoop.hdfs.server.**namenode.TransferFsImage:
> Opening connection to http://s_namenode.domain.**
> local:50070/getimage?getimage=**1&txid=726172232&storageInfo=-**
> 40:1814856193:1341996094997:**CID-064c4e47-387d-454d-aa1e-**27cec1e816e4<http://s_namenode.domain.local:50070/getimage?getimage=1&txid=726172232&storageInfo=-40:1814856193:1341996094997:CID-064c4e47-387d-454d-aa1e-27cec1e816e4>
> 2013-02-20 04:05:36,976 INFO org.apache.hadoop.hdfs.server.**namenode.TransferFsImage:
> Downloaded file fsimage.ckpt_**0000000000726172232 size 1212016242 bytes.
> 2013-02-20 04:05:37,595 INFO org.apache.hadoop.hdfs.server.**namenode.TransferFsImage:
> Skipping download of remote edit log [726172233,726216952] since it already
> is stored locally at /var/lib/hdfs_namenode/meta/**
> dfs/namesecondary/current/**edits_0000000000726172233-**
> 0000000000726216952
> 2013-02-20 04:05:37,595 INFO org.apache.hadoop.hdfs.server.**namenode.TransferFsImage:
> Opening connection to http://s_namenode.domain.**
> local:50070/getimage?getedit=**1&startTxId=726216953&endTxId=**
> 726262269&storageInfo=-40:**1814856193:1341996094997:CID-**
> 064c4e47-387d-454d-aa1e-**27cec1e816e4<http://s_namenode.domain.local:50070/getimage?getedit=1&startTxId=726216953&endTxId=726262269&storageInfo=-40:1814856193:1341996094997:CID-064c4e47-387d-454d-aa1e-27cec1e816e4>
> 2013-02-20 04:05:38,339 INFO org.apache.hadoop.hdfs.server.**namenode.TransferFsImage:
> Downloaded file edits_0000000000726216953-**0000000000726262269 size
> 7013503 bytes.
> 2013-02-20 04:05:38,339 INFO org.apache.hadoop.hdfs.server.**namenode.FSImage:
> Loading image file /var/lib/hdfs_namenode/meta/**
> dfs/namesecondary/current/**fsimage_0000000000726172232 using no
> compression
> 2013-02-20 04:05:38,339 INFO org.apache.hadoop.hdfs.server.**namenode.FSImage:
> Number of files = 8795086
> 2013-02-20 04:06:13,678 INFO org.apache.hadoop.hdfs.server.**namenode.FSImage:
> Number of files under construction = 32
> 2013-02-20 04:06:13,679 INFO org.apache.hadoop.hdfs.server.**namenode.FSImage:
> Image file of size 1212016242 loaded in 35 seconds.
> 2013-02-20 04:06:13,679 INFO org.apache.hadoop.hdfs.server.**namenode.FSImage:
> Loaded image for txid 726172232 from /var/lib/hdfs_namenode/meta/**
> dfs/namesecondary/current/**fsimage_0000000000726172232
> 2013-02-20 04:06:13,679 INFO org.apache.hadoop.hdfs.server.**namenode.Checkpointer:
> Checkpointer about to load edits from 2 stream(s).
> 2013-02-20 04:06:13,679 INFO org.apache.hadoop.hdfs.server.**namenode.FSImage:
> Reading /var/lib/hdfs_namenode/meta/**dfs/namesecondary/current/**
> edits_0000000000726172233-**0000000000726216952 expecting start txid
> #726172233
> 2013-02-20 04:06:14,038 INFO org.apache.hadoop.hdfs.server.**namenode.FSImage:
> Edits file /var/lib/hdfs_namenode/meta/**dfs/namesecondary/current/**
> edits_0000000000726172233-**0000000000726216952 of size 6881797 edits #
> 44720 loaded in 0 seconds.
> 2013-02-20 04:06:14,038 INFO org.apache.hadoop.hdfs.server.**namenode.FSImage:
> Reading /var/lib/hdfs_namenode/meta/**dfs/namesecondary/current/**
> edits_0000000000726216953-**0000000000726262269 expecting start txid
> #726216953
> 2013-02-20 04:06:14,372 INFO org.apache.hadoop.hdfs.server.**namenode.FSImage:
> Edits file /var/lib/hdfs_namenode/meta/**dfs/namesecondary/current/**
> edits_0000000000726216953-**0000000000726262269 of size 7013503 edits #
> 45317 loaded in 0 seconds.
> 2013-02-20 04:06:15,285 INFO org.apache.hadoop.hdfs.server.**namenode.FSImage:
> Saving image file /var/lib/hdfs_namenode/meta/**dfs/namesecondary/current/
> **fsimage.ckpt_**0000000000726262269 using no compression
> 2013-02-20 04:06:15,289 INFO org.apache.hadoop.hdfs.server.**namenode.FSImage:
> Saving image file /var/lib/hdfs_nfs_share/dfs/**
> namesecondary/current/fsimage.**ckpt_0000000000726262269 using no
> compression
> 2013-02-20 04:06:38,530 INFO org.apache.hadoop.hdfs.server.**namenode.FSImage:
> Image file of size 1212107279 saved in 23 seconds.
> 2013-02-20 04:06:45,380 INFO org.apache.hadoop.hdfs.server.**namenode.FSImage:
> Image file of size 1212107279 saved in 30 seconds.
> 2013-02-20 04:06:45,406 INFO org.apache.hadoop.hdfs.server.**namenode.**NNStorageRetentionManager:
> Going to retain 2 images with txid >= 726216952
> 2013-02-20 04:06:45,406 INFO org.apache.hadoop.hdfs.server.**namenode.**NNStorageRetentionManager:
> Purging old image FSImageFile(file=/var/lib/**hdfs_namenode/meta/dfs/**
> namesecondary/current/fsimage_**0000000000726172232,
> cpktTxId=0000000000726172232)
> 2013-02-20 04:06:45,646 INFO org.apache.hadoop.hdfs.server.**namenode.**NNStorageRetentionManager:
> Purging old image FSImageFile(file=/var/lib/**hdfs_nfs_share/dfs/**
> namesecondary/current/fsimage_**0000000000726172232,
> cpktTxId=0000000000726172232)
> 2013-02-20 04:06:46,115 INFO org.apache.hadoop.hdfs.server.**namenode.FileJournalManager:
> Purging logs older than 725216953
> 2013-02-20 04:06:46,118 INFO org.apache.hadoop.hdfs.server.**namenode.FileJournalManager:
> Purging logs older than 725216953
> 2013-02-20 04:06:46,145 INFO org.apache.hadoop.hdfs.server.**namenode.TransferFsImage:
> Opening connection to http://s_namenode.domain.**
> local:50070/getimage?putimage=**1&txid=726262269&port=50090&**
> storageInfo=-40:1814856193:**1341996094997:CID-064c4e47-**
> 387d-454d-aa1e-27cec1e816e4<http://s_namenode.domain.local:50070/getimage?putimage=1&txid=726262269&port=50090&storageInfo=-40:1814856193:1341996094997:CID-064c4e47-387d-454d-aa1e-27cec1e816e4>
> 2013-02-20 04:07:31,010 INFO org.apache.hadoop.hdfs.server.**namenode.TransferFsImage:
> Uploaded image with txid 726262269 to namenode at
> s_namenode.domain.local:50070
> 2013-02-20 04:07:31,011 WARN org.apache.hadoop.hdfs.server.**namenode.SecondaryNameNode:
> Checkpoint done. New Image Size: 1212107279
> 2013-02-20 04:07:31,013 INFO org.apache.hadoop.hdfs.server.**namenode.**NNStorageRetentionManager:
> Going to retain 2 images with txid >= 726216952
> 2013-02-20 04:07:31,013 INFO org.apache.hadoop.hdfs.server.**namenode.FileJournalManager:
> Purging logs older than 725216953
> 2013-02-20 04:07:31,013 INFO org.apache.hadoop.hdfs.server.**namenode.FileJournalManager:
> Purging logs older than 725216953
>
> At that time we are copying our whole data from this cluster to a second
> cluster. So we are reading a lot in this cluster.
> In our monitoring graphs i can not see any peaks at the time this happens.
> Only the secondary namenode takes more memory than usual, about 1G more.
>
> What happen here? Was the creation of the new fsimage successful? If so
> why where the old fsimag.ckpt not deleted?
> Or did we lose some data?
>
> Regards Elmar
>



-- 
Harsh J

Re: fsimage.ckpt are not deleted - Exception in doCheckpoint

Posted by Harsh J <ha...@cloudera.com>.
I don't think there's a data loss here. However, I think you may have been
affected by https://issues.apache.org/jira/browse/HDFS-4301 due to a large
fsimage size perhaps, which you can workaround by increasing the default
timeout via property dfs.image.transfer.timeout (default 60000ms, i.e. 1
minute) to > 10 minutes or so in ms value.

On Fri, Mar 1, 2013 at 2:06 PM, Elmar Grote <el...@optivo.de> wrote:

> **
> Hi,
>
> we are writing our fsimage and edits file on the namenode and secondary
> namenode and additional on a nfs share.
>
> In these folders we found a a lot of fsimage.ckpt_000000000........
> **. files, the oldest is from 9. Aug 2012.
> As far a i know these files should be deleted after the secondary
> namenodes creates the new fsimage file.
> I looked in our log files from the namenode and secondary namenode to see
> what happen at that time.
>
> As example i searched for this file:
> 20. Feb 04:02 fsimage.ckpt_**0000000000726216952
>
> In the namenode log i found this:
> 2013-02-20 04:02:51,404 ERROR org.apache.hadoop.security.**UserGroupInformation:
> PriviledgedActionException as:hdfs (auth:SIMPLE) cause:java.io.IOException:
> Input/output error
> 2013-02-20 04:02:51,409 WARN org.mortbay.log: /getimage:
> java.io.IOException: GetImage failed. java.io.IOException: Input/output
> error
>
> In the secondary namenode i think this is the relevant part:
> 2013-02-20 04:01:16,554 INFO org.apache.hadoop.hdfs.server.**namenode.SecondaryNameNode:
> Image has not changed. Will not download image.
> 2013-02-20 04:01:16,554 INFO org.apache.hadoop.hdfs.server.**namenode.TransferFsImage:
> Opening connection to http://s_namenode.domain.**
> local:50070/getimage?getedit=**1&startTxId=726172233&endTxId=**
> 726216952&storageInfo=-40:**1814856193:1341996094997:CID-**
> 064c4e47-387d-454d-aa1e-**27cec1e816e4<http://s_namenode.domain.local:50070/getimage?getedit=1&startTxId=726172233&endTxId=726216952&storageInfo=-40:1814856193:1341996094997:CID-064c4e47-387d-454d-aa1e-27cec1e816e4>
> 2013-02-20 04:01:16,750 INFO org.apache.hadoop.hdfs.server.**namenode.TransferFsImage:
> Downloaded file edits_0000000000726172233-**0000000000726216952 size
> 6881797 bytes.
> 2013-02-20 04:01:16,750 INFO org.apache.hadoop.hdfs.server.**namenode.Checkpointer:
> Checkpointer about to load edits from 1 stream(s).
> 2013-02-20 04:01:16,750 INFO org.apache.hadoop.hdfs.server.**namenode.FSImage:
> Reading /var/lib/hdfs_namenode/meta/**dfs/namesecondary/current/**
> edits_0000000000726172233-**0000000000726216952 expecting start txid
> #726172233
> 2013-02-20 04:01:16,987 INFO org.apache.hadoop.hdfs.server.**namenode.FSImage:
> Edits file /var/lib/hdfs_namenode/meta/**dfs/namesecondary/current/**
> edits_0000000000726172233-**0000000000726216952 of size 6881797 edits #
> 44720 loaded in 0 seconds.
> 2013-02-20 04:01:18,023 INFO org.apache.hadoop.hdfs.server.**namenode.FSImage:
> Saving image file /var/lib/hdfs_namenode/meta/**dfs/namesecondary/current/
> **fsimage.ckpt_**0000000000726216952 using no compression
> 2013-02-20 04:01:18,031 INFO org.apache.hadoop.hdfs.server.**namenode.FSImage:
> Saving image file /var/lib/hdfs_nfs_share/dfs/**
> namesecondary/current/fsimage.**ckpt_0000000000726216952 using no
> compression
> 2013-02-20 04:01:40,854 INFO org.apache.hadoop.hdfs.server.**namenode.FSImage:
> Image file of size 1211973003 saved in 22 seconds.
> 2013-02-20 04:01:50,762 INFO org.apache.hadoop.hdfs.server.**namenode.FSImage:
> Image file of size 1211973003 saved in 32 seconds.
> 2013-02-20 04:01:50,770 INFO org.apache.hadoop.hdfs.server.**namenode.**NNStorageRetentionManager:
> Going to retain 2 images with txid >= 726172232
> 2013-02-20 04:01:50,770 INFO org.apache.hadoop.hdfs.server.**namenode.**NNStorageRetentionManager:
> Purging old image FSImageFile(file=/var/lib/**hdfs_namenode/meta/dfs/**
> namesecondary/current/fsimage_**0000000000726121750,
> cpktTxId=0000000000726121750)
> 2013-02-20 04:01:51,000 INFO org.apache.hadoop.hdfs.server.**namenode.**NNStorageRetentionManager:
> Purging old image FSImageFile(file=/var/lib/**hdfs_nfs_share/dfs/**
> namesecondary/current/fsimage_**0000000000726121750,
> cpktTxId=0000000000726121750)
> 2013-02-20 04:01:51,379 INFO org.apache.hadoop.hdfs.server.**namenode.FileJournalManager:
> Purging logs older than 725172233
> 2013-02-20 04:01:51,381 INFO org.apache.hadoop.hdfs.server.**namenode.FileJournalManager:
> Purging logs older than 725172233
> 2013-02-20 04:01:51,400 INFO org.apache.hadoop.hdfs.server.**namenode.TransferFsImage:
> Opening connection to http://s_namenode.domain.**
> local:50070/getimage?putimage=**1&txid=726216952&port=50090&**
> storageInfo=-40:1814856193:**1341996094997:CID-064c4e47-**
> 387d-454d-aa1e-27cec1e816e4<http://s_namenode.domain.local:50070/getimage?putimage=1&txid=726216952&port=50090&storageInfo=-40:1814856193:1341996094997:CID-064c4e47-387d-454d-aa1e-27cec1e816e4>
> 2013-02-20 04:02:51,411 ERROR org.apache.hadoop.hdfs.server.**namenode.SecondaryNameNode:
> Exception in doCheckpoint
> org.apache.hadoop.hdfs.server.**namenode.TransferFsImage$**HttpGetFailedException:
> Image transfer servlet at http://s_namenode.domain.**
> local:50070/getimage?putimage=**1&txid=726216952&port=50090&**
> storageInfo=-40:1814856193:**1341996094997:CID-064c4e47-**
> 387d-454d-aa1e-27cec1e816e4<http://s_namenode.domain.local:50070/getimage?putimage=1&txid=726216952&port=50090&storageInfo=-40:1814856193:1341996094997:CID-064c4e47-387d-454d-aa1e-27cec1e816e4>failed with status code 410
> Response message:
> GetImage failed. java.io.IOException: Input/output error  at
> sun.nio.ch.FileChannelImpl.**force0(Native Method)  at
> sun.nio.ch.FileChannelImpl.**force(FileChannelImpl.java:**358)  at
> org.apache.hadoop.hdfs.server.**namenode.TransferFsImage.**
> getFileClient(TransferFsImage.**java:303)  at
> org.apache.hadoop.hdfs.server.**namenode.TransferFsImage.**
> downloadImageToStorage(**TransferFsImage.java:75)  at
> org.apache.hadoop.hdfs.server.**namenode.GetImageServlet$1.**run(GetImageServlet.java:169)
> at org.apache.hadoop.hdfs.server.**namenode.GetImageServlet$1.**run(GetImageServlet.java:111)
> at java.security.**AccessController.doPrivileged(**Native Method)  at
> javax.security.auth.Subject.**doAs(Subject.java:396)  at
> org.apache.hadoop.security.**UserGroupInformation.doAs(**
> UserGroupInformation.java:**1232)  at org.apache.hadoop.hdfs.server.**
> namenode.GetImageServlet.**doGet(GetImageServlet.java:**111)  at
> javax.servlet.http.**HttpServlet.service(**HttpServlet.java:707)  at
> javax.servlet.http.**HttpServlet.service(**HttpServlet.java:820)  at
> org.mortbay.jetty.servlet.**ServletHolder.handle(**ServletHolder.java:511)
> at org.mortbay.jetty.servlet.**ServletHandler$CachedChain.**
> doFilter(ServletHandler.java:**1221)  at org.apache.hadoop.http.**
> HttpServer$QuotingInputFilter.**doFilter(HttpServer.java:947)  at
> org.mortbay.jetty.servlet.**ServletHandler$CachedChain.**
> doFilter(ServletHandler.java:**1212)  at org.mortbay.jetty.servlet.**
> ServletHandler.handle(**ServletHandler.java:399)  at
> org.mortbay.jetty.security.**SecurityHandler.handle(**SecurityHandler.java:216)
> at org.mortbay.jetty.servlet.**SessionHandler.handle(**SessionHandler.java:182)
> at org.mortbay.jetty.handler.**ContextHandler.handle(**ContextHandler.java:766)
> at org.mortbay.jetty.webapp.**WebAppContext.handle(**WebAppContext.java:450)
> at org.mortbay.jetty.handler.**ContextHandlerCollection.**handle(**
> ContextHandlerCollection.java:**230)  at org.mortbay.jetty.handler.**
> HandlerWrapper.handle(**HandlerWrapper.java:152)  at
> org.mortbay.jetty.Server.**handle(Server.java:326)  at org.mortbay.jetty.*
> *HttpConnection.handleRequest(**HttpConnection.java:542)  at
> org.mortbay.jetty.Htt
>     at org.apache.hadoop.hdfs.server.**namenode.TransferFsImage.**
> getFileClient(TransferFsImage.**java:216)
>     at org.apache.hadoop.hdfs.server.**namenode.TransferFsImage.**
> uploadImageFromStorage(**TransferFsImage.java:126)
>     at org.apache.hadoop.hdfs.server.**namenode.SecondaryNameNode.**
> doCheckpoint(**SecondaryNameNode.java:478)
>     at org.apache.hadoop.hdfs.server.**namenode.SecondaryNameNode.**
> doWork(SecondaryNameNode.java:**334)
>     at org.apache.hadoop.hdfs.server.**namenode.SecondaryNameNode$2.**
> run(SecondaryNameNode.java:**301)
>     at org.apache.hadoop.security.**SecurityUtil.**doAsLoginUserOrFatal(**
> SecurityUtil.java:438)
>     at org.apache.hadoop.hdfs.server.**namenode.SecondaryNameNode.**
> run(SecondaryNameNode.java:**297)
>     at java.lang.Thread.run(Thread.**java:619)
> 2013-02-20 04:04:52,592 INFO org.apache.hadoop.hdfs.server.**namenode.TransferFsImage:
> Opening connection to http://s_namenode.domain.**
> local:50070/getimage?getimage=**1&txid=726172232&storageInfo=-**
> 40:1814856193:1341996094997:**CID-064c4e47-387d-454d-aa1e-**27cec1e816e4<http://s_namenode.domain.local:50070/getimage?getimage=1&txid=726172232&storageInfo=-40:1814856193:1341996094997:CID-064c4e47-387d-454d-aa1e-27cec1e816e4>
> 2013-02-20 04:05:36,976 INFO org.apache.hadoop.hdfs.server.**namenode.TransferFsImage:
> Downloaded file fsimage.ckpt_0000000000726172232 size 1212016242 bytes.
> 2013-02-20 04:05:37,595 INFO org.apache.hadoop.hdfs.server.namenode.TransferFsImage: Skipping download of remote edit log [726172233,726216952] since it already is stored locally at /var/lib/hdfs_namenode/meta/dfs/namesecondary/current/edits_0000000000726172233-0000000000726216952
> 2013-02-20 04:05:37,595 INFO org.apache.hadoop.hdfs.server.namenode.TransferFsImage: Opening connection to http://s_namenode.domain.local:50070/getimage?getedit=1&startTxId=726216953&endTxId=726262269&storageInfo=-40:1814856193:1341996094997:CID-064c4e47-387d-454d-aa1e-27cec1e816e4
> 2013-02-20 04:05:38,339 INFO org.apache.hadoop.hdfs.server.namenode.TransferFsImage: Downloaded file edits_0000000000726216953-0000000000726262269 size 7013503 bytes.
> 2013-02-20 04:05:38,339 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: Loading image file /var/lib/hdfs_namenode/meta/dfs/namesecondary/current/fsimage_0000000000726172232 using no compression
> 2013-02-20 04:05:38,339 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: Number of files = 8795086
> 2013-02-20 04:06:13,678 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: Number of files under construction = 32
> 2013-02-20 04:06:13,679 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: Image file of size 1212016242 loaded in 35 seconds.
> 2013-02-20 04:06:13,679 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: Loaded image for txid 726172232 from /var/lib/hdfs_namenode/meta/dfs/namesecondary/current/fsimage_0000000000726172232
> 2013-02-20 04:06:13,679 INFO org.apache.hadoop.hdfs.server.namenode.Checkpointer: Checkpointer about to load edits from 2 stream(s).
> 2013-02-20 04:06:13,679 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: Reading /var/lib/hdfs_namenode/meta/dfs/namesecondary/current/edits_0000000000726172233-0000000000726216952 expecting start txid #726172233
> 2013-02-20 04:06:14,038 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: Edits file /var/lib/hdfs_namenode/meta/dfs/namesecondary/current/edits_0000000000726172233-0000000000726216952 of size 6881797 edits # 44720 loaded in 0 seconds.
> 2013-02-20 04:06:14,038 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: Reading /var/lib/hdfs_namenode/meta/dfs/namesecondary/current/edits_0000000000726216953-0000000000726262269 expecting start txid #726216953
> 2013-02-20 04:06:14,372 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: Edits file /var/lib/hdfs_namenode/meta/dfs/namesecondary/current/edits_0000000000726216953-0000000000726262269 of size 7013503 edits # 45317 loaded in 0 seconds.
> 2013-02-20 04:06:15,285 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: Saving image file /var/lib/hdfs_namenode/meta/dfs/namesecondary/current/fsimage.ckpt_0000000000726262269 using no compression
> 2013-02-20 04:06:15,289 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: Saving image file /var/lib/hdfs_nfs_share/dfs/namesecondary/current/fsimage.ckpt_0000000000726262269 using no compression
> 2013-02-20 04:06:38,530 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: Image file of size 1212107279 saved in 23 seconds.
> 2013-02-20 04:06:45,380 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: Image file of size 1212107279 saved in 30 seconds.
> 2013-02-20 04:06:45,406 INFO org.apache.hadoop.hdfs.server.namenode.NNStorageRetentionManager: Going to retain 2 images with txid >= 726216952
> 2013-02-20 04:06:45,406 INFO org.apache.hadoop.hdfs.server.namenode.NNStorageRetentionManager: Purging old image FSImageFile(file=/var/lib/hdfs_namenode/meta/dfs/namesecondary/current/fsimage_0000000000726172232, cpktTxId=0000000000726172232)
> 2013-02-20 04:06:45,646 INFO org.apache.hadoop.hdfs.server.namenode.NNStorageRetentionManager: Purging old image FSImageFile(file=/var/lib/hdfs_nfs_share/dfs/namesecondary/current/fsimage_0000000000726172232, cpktTxId=0000000000726172232)
> 2013-02-20 04:06:46,115 INFO org.apache.hadoop.hdfs.server.namenode.FileJournalManager: Purging logs older than 725216953
> 2013-02-20 04:06:46,118 INFO org.apache.hadoop.hdfs.server.namenode.FileJournalManager: Purging logs older than 725216953
> 2013-02-20 04:06:46,145 INFO org.apache.hadoop.hdfs.server.namenode.TransferFsImage: Opening connection to http://s_namenode.domain.local:50070/getimage?putimage=1&txid=726262269&port=50090&storageInfo=-40:1814856193:1341996094997:CID-064c4e47-387d-454d-aa1e-27cec1e816e4
> 2013-02-20 04:07:31,010 INFO org.apache.hadoop.hdfs.server.namenode.TransferFsImage: Uploaded image with txid 726262269 to namenode at s_namenode.domain.local:50070
> 2013-02-20 04:07:31,011 WARN org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: Checkpoint done. New Image Size: 1212107279
> 2013-02-20 04:07:31,013 INFO org.apache.hadoop.hdfs.server.namenode.NNStorageRetentionManager: Going to retain 2 images with txid >= 726216952
> 2013-02-20 04:07:31,013 INFO org.apache.hadoop.hdfs.server.namenode.FileJournalManager: Purging logs older than 725216953
> 2013-02-20 04:07:31,013 INFO org.apache.hadoop.hdfs.server.namenode.FileJournalManager: Purging logs older than 725216953
>
> At that time we were copying all of the data from this cluster to a second
> cluster, so there was a lot of read activity on this cluster.
> In our monitoring graphs I cannot see any peaks at the time this happened.
> Only the secondary namenode used more memory than usual, about 1 GB more.
>
> What happened here? Was the creation of the new fsimage successful? If so,
> why were the old fsimage.ckpt files not deleted?
> Or did we lose some data?
>
> Regards, Elmar
>



-- 
Harsh J

Re: fsimage.ckpt are not deleted - Exception in doCheckpoint

Posted by Harsh J <ha...@cloudera.com>.
I don't think there is any data loss here. However, I think you may have been
affected by https://issues.apache.org/jira/browse/HDFS-4301, perhaps because of
your large fsimage size. You can work around it by increasing the timeout via
the property dfs.image.transfer.timeout (default 60000 ms, i.e. 1 minute) to
10 minutes or more, given as a value in milliseconds.
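
For reference, a minimal hdfs-site.xml sketch of that workaround; this is only
an illustration, the 600000 value is simply roughly 10 minutes expressed in
milliseconds, and you should confirm the property name and default against the
hdfs-default.xml shipped with your release. It would probably need to be set on
both the namenode and the secondary namenode, followed by a restart of both:

<property>
  <name>dfs.image.transfer.timeout</name>
  <!-- Socket timeout for fsimage/edits transfers (getimage/putimage), in
       milliseconds. Example value of ~10 minutes instead of the 60000 ms
       (1 minute) default mentioned above. -->
  <value>600000</value>
</property>

Once the putimage upload no longer times out, the checkpoint should complete
and image retention should purge old images as usual. The stale fsimage.ckpt_*
files left behind by the failed attempts are not cleaned up for you, so once a
newer fsimage_<txid> exists in the same directory it should be safe to remove
them by hand.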




-- 
Harsh J