Posted to common-issues@hadoop.apache.org by "Ravi Phulari (JIRA)" <ji...@apache.org> on 2009/10/13 23:33:31 UTC
[jira] Updated: (HADOOP-306) Safe mode and name node startup procedures
[ https://issues.apache.org/jira/browse/HADOOP-306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ravi Phulari updated HADOOP-306:
--------------------------------
Attachment: SafeMode.html
> Safe mode and name node startup procedures
> ------------------------------------------
>
> Key: HADOOP-306
> URL: https://issues.apache.org/jira/browse/HADOOP-306
> Project: Hadoop Common
> Issue Type: New Feature
> Affects Versions: 0.3.2
> Reporter: Konstantin Shvachko
> Assignee: Konstantin Shvachko
> Fix For: 0.7.0
>
> Attachments: SafeMode.html, SafeMode.patch, SafeModeEnum.patch
>
>
> This is a proposal to improve the DFS cluster startup process.
> The data node startup procedures were described and implemented in HADOOP-124.
> I'm trying to extend them to the name node here.
> The main idea is to introduce safe mode, which can be entered manually for administration
> purposes, or automatically when a configurable threshold of active data nodes is breached,
> or at startup when the node stays in safe mode until the minimal limit of active
> nodes is reached.
> These are high-level requirements intended to improve name node and cluster reliability.
> = The name node safe mode means that the name node is not changing the state of the
> file system. Meta data is read-only, and block replication / removal is not taking place.
> = In safe mode the name node accepts data node registrations and
> processes their block reports.
> = The name node always starts in safe mode and stays safe until the majority
> (a configurable parameter: safemode.threshold) of data nodes (or blocks?)
> is reported.
> = The name node can also fall into safe mode when the number of non-active
> (heartbeats stopped coming in) data nodes becomes critical.
> = The startup "silent period", when the name node is in safe mode and is
> not issuing any block requests to the data nodes, is initially set to a
> configurable value safemode.timeout.increment. By the end of the timeout
> the name node checks the safemode.threshold and decides whether to switch
> to normal mode or to stay in safe mode. If the normal mode criteria are not
> met, the silent period is extended by incrementing the safemode timeout.
> = The name node stays in safe mode no longer than a configurable value of
> safemode.timeout.max; when that limit is reached, it logs the missing data
> nodes and shuts itself down.
> = When the name node switches to normal mode it checks whether all required
> data nodes have actually registered, based on the list of active data storages
> from the last session. Then it logs missing nodes, if any, and starts
> replicating and/or deleting blocks as required.
> = A historical list of data storages (nodes) ever registered with the cluster is
> persistently stored in the image and log files. The list is used in two ways:
> a) at startup to verify whether all nodes have registered, and to report
> missing nodes;
> b) at runtime if a data node registers with a new storage id the
> name node verifies that no new blocks are reported from that storage,
> which would prevent us from accidentally connecting data nodes from a
> different cluster.
> = The name node should have an option to run in safe mode. Starting with
> that option would mean it never leaves safe mode.
> This is useful for testing the cluster.
> = Data nodes that cannot connect to the name node for a long (configurable) time
> should shut themselves down.
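
The startup "silent period" described above (safemode.threshold, safemode.timeout.increment, safemode.timeout.max) can be sketched roughly as follows. This is only an illustration of the proposal as worded, not the code in the attached patches: the class SafeModeSketch, the ReportSource interface, the Outcome enum and the method runStartup are hypothetical names, time is simulated rather than slept, and the "(or blocks?)" question is left open by treating the reported quantity as a plain fraction.

```java
// Hypothetical sketch of the proposed startup procedure; not Hadoop's actual implementation.
final class SafeModeSketch {

    /** Result of the startup procedure. */
    enum Outcome { NORMAL_MODE, SHUTDOWN }

    /** Assumed callback: fraction (0.0 to 1.0) of data nodes or blocks reported after elapsedMillis. */
    interface ReportSource {
        double reportedFraction(long elapsedMillis);
    }

    /**
     * Simulates the silent period: wait one increment, check the threshold,
     * and extend the timeout by another increment until timeoutMax is reached.
     */
    static Outcome runStartup(double threshold,
                              long timeoutIncrement,
                              long timeoutMax,
                              ReportSource source) {
        if (timeoutIncrement <= 0 || timeoutMax < timeoutIncrement) {
            throw new IllegalArgumentException("invalid timeout settings");
        }
        long timeout = timeoutIncrement;      // initial silent period
        while (true) {
            // A real name node would wait here until 'timeout' has elapsed,
            // accepting registrations and block reports in the meantime.
            if (source.reportedFraction(timeout) >= threshold) {
                return Outcome.NORMAL_MODE;   // threshold met: leave safe mode
            }
            if (timeout >= timeoutMax) {
                return Outcome.SHUTDOWN;      // limit reached: log missing nodes, shut down
            }
            // Threshold not met: extend the silent period, capped at timeoutMax.
            timeout = Math.min(timeout + timeoutIncrement, timeoutMax);
        }
    }
}
```

For example, with a threshold of 0.9, an increment of 30 seconds and a maximum of 5 minutes, a cluster whose reports pass 90% after 90 seconds would leave safe mode at the third check, while one stuck at 50% would shut down once the 5-minute limit is reached.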
--
This message is automatically generated by JIRA.