Posted to common-user@hadoop.apache.org by Cagdas Gerede <ca...@gmail.com> on 2008/04/24 01:37:26 UTC

Please Help: Namenode Safemode

I have a Hadoop distributed file system with 3 datanodes, and each datanode
holds only 150 blocks. Still, it takes a little more than a minute for the
namenode to start and leave safe mode.

The steps for namenode startup, as far as I understand, are:
1) Each datanode sends a heartbeat to the namenode. The namenode tells the
datanode to send a block report, piggybacked on the heartbeat.
2) The datanode computes the block report.
3) The datanode sends it to the namenode.
4) The namenode processes the block report.
5) The namenode's safe mode monitor thread checks whether to exit, and the
namenode exits safe mode once the threshold is reached and the extension
time has passed.

Here are my numbers:
Step 1) Datanodes send heartbeats every 3 seconds.
Step 2) Computing the block report takes about 20 milliseconds, as shown in
the datanodes' logs.
Step 3) No idea. It depends on the size of the block report, but I suspect
it should not be more than a couple of seconds.
Step 4) No idea. It shouldn't be more than a couple of seconds.
Step 5) The thread checks every second. The extension value in my
configuration is 0, so there is no wait once the threshold is reached.

Given these numbers, can anybody explain where the one minute comes from?
Shouldn't this take 10-20 seconds?
Please help. I am very confused.
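
As a rough check of the numbers above, the per-step estimates can simply be
added up. This is only a sketch: the 2-second values for steps 3 and 4 are
the upper guesses from the list, not measurements, and the step names are
illustrative.

```python
# Rough startup budget built from the estimates listed above.
# The 2-second figures are upper guesses, not measured values.
budget_seconds = {
    "wait for next heartbeat": 3.0,
    "compute block report": 0.020,
    "send block report": 2.0,
    "process block report": 2.0,
    "safe mode monitor check": 1.0,
}


def total_budget(budget):
    """Sum the per-step estimates, in seconds."""
    return sum(budget.values())


if __name__ == "__main__":
    for step, seconds in budget_seconds.items():
        print(f"{step:28s} {seconds:6.2f} s")
    print(f"{'total':28s} {total_budget(budget_seconds):6.2f} s")
```

Even with generous guesses the total stays under 10 seconds, so these five
steps alone cannot account for a one-minute startup.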



-- 
------------
Best Regards, Cagdas Evren Gerede
Home Page: http://cagdasgerede.info

RE: Please Help: Namenode Safemode

Posted by dhruba Borthakur <dh...@yahoo-inc.com>.
Ok, cool. The random delay ensures that the Namenode does not have to
process a large number of simultaneous block reports. Otherwise the
situation becomes really bad when the Namenode restarts and all
Datanodes send their block reports at the same time. This gets worse
as the number of Datanodes grows.
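
The effect of the random delay can be illustrated with a small simulation.
This is a hypothetical model, not Hadoop code: it assumes each datanode
picks a uniformly random delay, uses the 60,000 ms window mentioned
elsewhere in this thread, and the cluster size of 1000 is made up.

```python
import random
from collections import Counter


def peak_reports_per_second(num_datanodes, max_delay_ms, seed=0):
    """Largest number of block reports that land in any one-second window."""
    rng = random.Random(seed)
    # Each datanode waits a uniformly random delay in [0, max_delay_ms] ms;
    # max_delay_ms = 0 models every datanode reporting immediately.
    arrival_second = [rng.randint(0, max_delay_ms) // 1000
                      for _ in range(num_datanodes)]
    return max(Counter(arrival_second).values())


if __name__ == "__main__":
    for delay_ms in (0, 60_000):
        peak = peak_reports_per_second(num_datanodes=1000, max_delay_ms=delay_ms)
        print(f"max delay {delay_ms:>6} ms -> peak {peak} reports in one second")
```

Without a delay, every report arrives in the same second. With the delay,
the reports are spread over about a minute and the per-second peak drops
sharply.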

 

-dhruba

 

________________________________

From: Cagdas Gerede [mailto:cagdas.gerede@gmail.com] 
Sent: Thursday, April 24, 2008 11:56 AM
To: dhruba Borthakur
Cc: core-user@hadoop.apache.org
Subject: Re: Please Help: Namenode Safemode

 

Hi Dhruba,
Thanks for your answer, but I think you missed one detail: as I
mentioned, the extension is already 0 in my configuration file.

After spending quite some time in the code, I found the reason. The
reason is dfs.blockreport.initialDelay.
If you do not set this in your config file, it is 60,000 by default.
Each datanode picks a random number between 0 and 60,000 and, when it
registers with the namenode, waits that many milliseconds before
sending its block report. As a result, the delay can be as much as
1 minute. If you want your namenode to start quicker, set
dfs.blockreport.initialDelay to a smaller number.

When I set it to 0, the namenode now starts up in 1-2 seconds.


-- 
------------
Best Regards, Cagdas Evren Gerede
Home Page: http://cagdasgerede.info 









Re: Please Help: Namenode Safemode

Posted by Cagdas Gerede <ca...@gmail.com>.
Hi Dhruba,
Thanks for your answer, but I think you missed one detail: as I mentioned,
the extension is already 0 in my configuration file.

After spending quite some time in the code, I found the reason. The reason
is dfs.blockreport.initialDelay.
If you do not set this in your config file, it is 60,000 by default. Each
datanode picks a random number between 0 and 60,000 and, when it registers
with the namenode, waits that many milliseconds before sending its block
report. As a result, the delay can be as much as 1 minute. If you want your
namenode to start quicker, set dfs.blockreport.initialDelay to a smaller
number.

When I set it to 0, the namenode now starts up in 1-2 seconds.
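
To see how much waiting this delay implies for a 3-datanode cluster, here
is a minimal sketch. It assumes the delays are independent and uniform on
0-60,000 ms, and that the namenode cannot leave safe mode until the last
datanode has reported. Both are simplifying assumptions, and the function
names are illustrative.

```python
import random


def expected_last_report_ms(num_datanodes, max_delay_ms):
    """Expected maximum of n independent uniform delays on [0, D]: D * n / (n + 1)."""
    return max_delay_ms * num_datanodes / (num_datanodes + 1)


def simulated_last_report_ms(num_datanodes, max_delay_ms, trials=20_000, seed=0):
    """Monte Carlo estimate of when the slowest datanode sends its report."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        # The namenode waits for the slowest datanode in each trial.
        total += max(rng.uniform(0, max_delay_ms) for _ in range(num_datanodes))
    return total / trials


if __name__ == "__main__":
    print("formula:   ", expected_last_report_ms(3, 60_000), "ms")
    print("simulation:", round(simulated_last_report_ms(3, 60_000)), "ms")
```

Under these assumptions the last block report arrives after about 45
seconds on average and up to 60 seconds in the worst case, which is in line
with the roughly one-minute startup described at the top of the thread.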


-- 
------------
Best Regards, Cagdas Evren Gerede
Home Page: http://cagdasgerede.info


On Wed, Apr 23, 2008 at 4:44 PM, dhruba Borthakur <dh...@yahoo-inc.com>
wrote:

> By default, the dfs.safemode.extension variable in hadoop-default.xml is
> set to 30 seconds. This means that once the Namenode has one replica of
> every block, it still waits for 30 more seconds before exiting Safemode.
>
>
>
> dhruba

RE: Please Help: Namenode Safemode

Posted by dhruba Borthakur <dh...@yahoo-inc.com>.
By default, the dfs.safemode.extension variable in hadoop-default.xml is
set to 30 seconds. This means that once the Namenode has one replica of
every block, it still waits for 30 more seconds before exiting Safemode.
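
The rule described above can be written as a small check. This is a
hypothetical model of the behaviour, not the Namenode's actual code. The
function and parameter names are made up for illustration, and times are in
seconds.

```python
def can_leave_safemode(now, threshold_reached_at, extension_seconds=30.0):
    """Return True once the extension has elapsed after the threshold was met.

    threshold_reached_at is None while the Namenode is still missing blocks.
    """
    if threshold_reached_at is None:
        return False
    return now - threshold_reached_at >= extension_seconds


if __name__ == "__main__":
    # Threshold met at t=5 s with the 30-second default extension.
    print(can_leave_safemode(now=20.0, threshold_reached_at=5.0))  # still waiting
    print(can_leave_safemode(now=35.0, threshold_reached_at=5.0))  # extension elapsed
    # With the extension set to 0 there is no extra wait.
    print(can_leave_safemode(now=5.0, threshold_reached_at=5.0, extension_seconds=0))
```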

 

dhruba

 
