Posted to hdfs-dev@hadoop.apache.org by "Chris Nauroth (JIRA)" <ji...@apache.org> on 2013/05/09 23:41:16 UTC
[jira] [Created] (HDFS-4811) race condition between 2 namenodes in
standby that are trying to checkpoint with one another can delete or
corrupt a good fsimage
Chris Nauroth created HDFS-4811:
-----------------------------------
Summary: race condition between 2 namenodes in standby that are trying to checkpoint with one another can delete or corrupt a good fsimage
Key: HDFS-4811
URL: https://issues.apache.org/jira/browse/HDFS-4811
Project: Hadoop HDFS
Issue Type: Bug
Components: ha
Affects Versions: 3.0.0, 2.0.5-beta
Reporter: Chris Nauroth
The problem occurs when a namenode runs its own checkpoint in {{StandbyCheckpointer}} (thread 1) while concurrently receiving a checkpoint from a different namenode in {{GetImageServlet}} (thread 2). Thread 2 can finish writing the checkpoint file to the directory but get suspended before it renames the file to its final destination as an fsimage file. Thread 1 then wakes up and starts writing its own data to the same checkpoint file. When thread 2 resumes, it tries to rename the file that thread 1 still holds open for writing. Depending on the OS, the rename either moves thread 1's incomplete checkpoint into place as the fsimage, or it deletes the existing good fsimage, which then stays missing until thread 1 finishes writing and performs its own rename.
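To illustrate the kind of interleaving described above, here is a minimal Java sketch of one way such a race could be avoided. It is not the fix applied for this issue and does not use any Hadoop code: the class {{CheckpointSketch}}, the method {{saveImage}}, the lock {{IMAGE_LOCK}}, and the file names are all hypothetical. It assumes that giving each writer its own temporary file, and performing the write and the rename inside one critical section, prevents a second writer from touching a file that is about to be renamed.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Arrays;

public class CheckpointSketch {
    // Hypothetical lock: makes write-then-rename one indivisible step.
    private static final Object IMAGE_LOCK = new Object();

    /**
     * Writes data to a temp file that only this call knows about, then
     * renames it over the final image. No other writer can reopen the
     * temp file between the write and the rename.
     */
    static void saveImage(Path dir, String finalName, byte[] data) throws IOException {
        synchronized (IMAGE_LOCK) {
            // Unique name per call, created in the same directory as the target.
            Path tmp = Files.createTempFile(dir, finalName + ".", ".ckpt");
            try {
                Files.write(tmp, data);
                Files.move(tmp, dir.resolve(finalName), StandardCopyOption.REPLACE_EXISTING);
            } finally {
                // No-op after a successful move; cleans up if the write or move failed.
                Files.deleteIfExists(tmp);
            }
        }
    }

    // Stand-in for "thread 1" (local checkpoint) or "thread 2" (uploaded checkpoint).
    private static Thread writer(Path dir, byte[] data) {
        return new Thread(() -> {
            try {
                saveImage(dir, "fsimage", data);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        });
    }

    public static void main(String[] args) throws Exception {
        Path dir = Files.createTempDirectory("ckpt-sketch");
        byte[] local = new byte[1 << 20];
        byte[] remote = new byte[1 << 20];
        Arrays.fill(local, (byte) 'L');
        Arrays.fill(remote, (byte) 'R');

        Thread t1 = writer(dir, local);
        Thread t2 = writer(dir, remote);
        t1.start();
        t2.start();
        t1.join();
        t2.join();

        // Whichever writer ran last wins, but the image is never a partial file.
        byte[] result = Files.readAllBytes(dir.resolve("fsimage"));
        boolean intact = Arrays.equals(result, local) || Arrays.equals(result, remote);
        System.out.println("fsimage is one complete checkpoint: " + intact);
    }
}
```

In this sketch either measure alone (a per-writer temp file or the lock) would keep the two writers from sharing a file; both are shown only to make the idea explicit. Whether either approach is appropriate for the namenode code paths is outside what this sketch can show.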