Posted to common-dev@hadoop.apache.org by "Konstantin Shvachko (JIRA)" <ji...@apache.org> on 2008/05/03 21:16:55 UTC

[jira] Commented: (HADOOP-3022) Fast Cluster Restart

    [ https://issues.apache.org/jira/browse/HADOOP-3022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12594019#action_12594019 ] 

Konstantin Shvachko commented on HADOOP-3022:
---------------------------------------------

h2. Estimates.
The name-node startup consists of 4 major steps:
# load fsimage
# load edits
# save new merged fsimage
# process block report

I am estimating the name-node startup time for an image containing 10 million objects.
The estimates are based on my experiments with a real cluster;
the numbers are scaled proportionally (by a factor of 4.3) to 10 mln objects.

Under the assumption that each file on average has 1.5 blocks, 10 mln objects
translate into 4 mln files and directories and 6 mln blocks.
Other cluster parameters are summarized in the following table.

|objects|10 mln|
|files & dirs| 4 mln|
|blocks| 6 mln|
|heap size| 3.275 GB|
|image size| 0.6 GB|
|image load time| 132 sec|
|image save time| 70 sec|
|edits size per day| 0.27 GB|
|edits load time| 311 sec|
|data-nodes| 500|
|blocks per node| 36,000|
|block report processing| 320 sec|
|total startup time| 14 min|

This leads to the total startup time of 14 minutes (132 + 311 + 70 + 320 = 833 sec), out of which
|load fsimage| 16%|
|load edits| 37%|
|save new fsimage| 8%|
|process block reports| 39%|

h2. Optimization.
# VM memory heap parameters play a key role in the startup process, as discussed in HADOOP-3248.
It is highly recommended to set the initial heap size -Xms close to the maximum heap size, because
all of that memory will be used by the name-node anyway (a configuration sketch is given after this list).
# Optimization of loading and saving is largely a matter of reducing object allocation.
In the case of loading, object allocations are unavoidable, so we need to make sure
there are no intermediate allocations of temporary objects.
In the case of saving, all object allocations should be eliminated, which is done in HADOOP-3248 (a sketch of the idea follows the list).
# During loading, for each file or directory the name-node performs a lookup in the
namespace tree starting from the root in order to find the object's parent directory and then to insert the object into it.
We can take advantage of the fact that a directory's children are not interleaved with the children of other directories in the image.
So we can find a parent once and then include all of its children without repeating the search (see the loading sketch after this list).
# Block reporting is the single most expensive part of the startup process.
According to my experiments, most of the processing time here goes into first adding all blocks to the needed-replication queue
and then removing them from it. At the beginning of startup all blocks are guaranteed to be under-replicated,
but most of them, and in the regular case all of them, will eventually be removed from that list.
So in a sense we dynamically create a huge temporary structure just to verify that all blocks have enough replicas.
In addition, in pre-HADOOP-2606 versions (before release 0.17) the replication monitor would start processing
those under-replicated blocks and try to assign nodes for copying them.
The structure works fine during regular operation, because then it contains only those blocks that are in fact under-replicated.
Block report processing goes 5 times faster if the addition to the needed-replications queue is removed (a sketch of this follows the list).
# We should also reduce the number of block lookups in the blocksMap. I counted 5 lookups just in addStoredBlock(), while
only one lookup is necessary, because the namespace is locked and there is only one thread that modifies it (see the last sketch after this list).
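
Regarding the heap settings in item 1, here is a minimal configuration sketch sized for the 3.275 GB heap estimated above. The variable names follow the stock conf/hadoop-env.sh template; the exact values are assumptions for illustration, not part of a patch.

{noformat}
# conf/hadoop-env.sh -- sketch only
# Maximum heap for the Hadoop daemons, in MB (translates into -Xmx).
export HADOOP_HEAPSIZE=3500

# Give the name-node its full heap up front (-Xms close to -Xmx), so the JVM
# does not repeatedly grow the heap while loading the image and edits.
export HADOOP_NAMENODE_OPTS="-Xms3500m $HADOOP_NAMENODE_OPTS"
{noformat}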
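
For the save-path allocations in item 2, a sketch of the kind of change involved: names are written through one reused scratch buffer instead of materializing a new array for every inode. The buffer and helper names here are illustrative, not the actual HADOOP-3248 code.

{code:java}
// Sketch only: reuse one scratch buffer for every inode name while saving.
private final byte[] nameBuffer = new byte[8 * 1024];   // shared across all entries

void saveName(DataOutputStream out, INode node) throws IOException {
  // copyNameInto() is a hypothetical helper that copies the node's name into
  // the caller-supplied buffer instead of allocating a new byte[] or String.
  int len = node.copyNameInto(nameBuffer);
  out.writeShort(len);
  out.write(nameBuffer, 0, len);
}
{code}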
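
For the image-loading lookup in item 3, a minimal sketch of reusing the previously resolved parent. The loader and directory calls (readNextEntry, getParent, addChild) are hypothetical stand-ins for the real FSImage/FSDirectory code.

{code:java}
// Sketch only: children of one directory arrive contiguously in the image,
// so the full root-to-leaf lookup is needed only when the parent path changes.
String cachedParentPath = null;
INodeDirectory cachedParent = null;

while (loader.hasMoreEntries()) {
  ImageEntry entry = loader.readNextEntry();       // one file or directory
  String parentPath = entry.getParentPath();

  if (!parentPath.equals(cachedParentPath)) {
    cachedParent = dir.getParent(parentPath);      // full lookup from the root
    cachedParentPath = parentPath;
  }
  cachedParent.addChild(entry.newINode());         // insert without re-searching
}
{code}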
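
For the block-report processing in item 4, a sketch of the proposed behaviour: while the name-node is still in startup safe mode, skip the needed-replications bookkeeping entirely and verify replication once, after the reports are in. The names below (isInStartupSafeMode, neededReplications.add, countLiveReplicas, getExpectedReplication) only approximate the name-node code and are used for illustration.

{code:java}
// Sketch only: avoid building the huge temporary under-replication queue
// while block reports from a cold start are being processed.
void registerStoredBlock(Block block, DatanodeDescriptor node) {
  BlockInfo storedBlock = blocksMap.addNode(block, node);   // record the replica

  // During startup every block begins as "under-replicated" and almost all of
  // them become fully replicated as the remaining reports arrive, so queueing
  // and later dequeueing them is pure overhead.
  if (!isInStartupSafeMode()) {
    int live = countLiveReplicas(storedBlock);
    if (live < getExpectedReplication(storedBlock)) {
      neededReplications.add(storedBlock);
    }
  }
  // Once safe mode is left, a single pass over blocksMap queues whatever is
  // still genuinely under-replicated.
}
{code}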
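
For the blocksMap lookups in item 5, a sketch of collapsing the repeated lookups into a single one: resolve the stored block once at the top of addStoredBlock() and hand the resolved object to the helpers instead of re-querying the map. The helper names are illustrative.

{code:java}
// Sketch only: one blocksMap lookup per reported block.
void addStoredBlock(Block reported, DatanodeDescriptor node) {
  // Single lookup; reusing the result is safe because the namespace lock is
  // held, so no other thread can modify the map underneath us.
  BlockInfo stored = blocksMap.getStoredBlock(reported);
  if (stored == null) {
    return;                             // block no longer belongs to any file
  }
  stored.addNode(node);                 // operate on the object we already have
  updateNeededReplications(stored);     // pass it down rather than look it up again
  checkExcessReplicas(stored, node);    // (helper names are illustrative)
}
{code}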

> Fast Cluster Restart
> --------------------
>
>                 Key: HADOOP-3022
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3022
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: dfs
>            Reporter: Robert Chansler
>            Assignee: Konstantin Shvachko
>             Fix For: 0.18.0
>
>
> This item introduces a discussion of how to reduce the time necessary to start a large cluster from tens of minutes to a handful of minutes.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.