Posted to common-user@hadoop.apache.org by Steve Kuo <ku...@gmail.com> on 2010/08/06 07:42:33 UTC

Best way to reduce a 8-node cluster in half and get hdfs to come out of safe mode

As part of our experimentation, the plan is to pull 4 slave nodes out of an
8-slave/1-master cluster.  With the replication factor set to 3, I thought
losing half of the cluster might be too much for hdfs to recover from.  Thus I
copied all relevant data out of hdfs to local disk and reconfigured the
cluster.

The 4 slave nodes started okay but hdfs never left safe mode.  The nn.log
has the following lines.  What is the best way to deal with this?  Shall I
restart the cluster with 8 nodes and then delete
/data/hadoop-hadoop/mapred/system?  Or shall I reformat hdfs?

Thanks.

2010-08-05 22:28:12,921 INFO
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit:
ugi=hadoop,hadoop       ip=/10.128.135.100      cmd=listStatus
src=/data/hadoop-hadoop/mapred/system   dst=null        perm=null
2010-08-05 22:28:12,923 INFO org.apache.hadoop.ipc.Server: IPC Server
handler 0 on 9000, call delete(/data/hadoop-hadoop/mapred/system, true) from
10.128.135.100:52368: error:
org.apache.hadoop.hdfs.server.namenode.SafeModeException: Cannot delete
/data/hadoop-hadoop/mapred/system. Name node is in safe mode.
The reported blocks 64 needs additional 3 blocks to reach the threshold
0.9990 of total blocks 68. Safe mode will be turned off automatically.
org.apache.hadoop.hdfs.server.namenode.SafeModeException: Cannot delete
/data/hadoop-hadoop/mapred/system. Name node is in safe mode.
The reported blocks 64 needs additional 3 blocks to reach the threshold
0.9990 of total blocks 68. Safe mode will be turned off automatically.
        at
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.deleteInternal(FSNamesystem.java:1741)
        at
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.delete(FSNamesystem.java:1721)
        at
org.apache.hadoop.hdfs.server.namenode.NameNode.delete(NameNode.java:565)
        at sun.reflect.GeneratedMethodAccessor13.invoke(Unknown Source)
        at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:512)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:968)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:964)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:962)
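
For reference, a few standard commands for checking on safe mode and the
missing blocks (a general sketch, not specific to this cluster; forcing safe
mode off will leave any blocks that lived only on the removed datanodes
permanently missing):

    # show whether the namenode is currently in safe mode
    hadoop dfsadmin -safemode get

    # block until the namenode leaves safe mode on its own
    hadoop dfsadmin -safemode wait

    # force the namenode out of safe mode (accepting the missing blocks)
    hadoop dfsadmin -safemode leave

    # then list which files have missing or under-replicated blocks
    hadoop fsck / -files -blocks -locations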

Re: fail to get parameters in new API

Posted by Lance Norskog <go...@gmail.com>.
Thank you from Newbie Central!

On Fri, Aug 6, 2010 at 10:49 PM, Harsh J <qw...@gmail.com> wrote:
> Use this to get the actual path in the New API:
> ((FileSplit) context.getInputSplit()).getPath()
>
> As explained in HADOOP-5973.
>
> On Sat, Aug 7, 2010 at 7:26 AM, Lance Norskog <go...@gmail.com> wrote:
>> I have the same request. My use case is that I want to do a database
>> join on three CSV files dumped from different tables in the DB. So, if I can
>> read the file name, I can deduce which table it is. The map knows the
>> field names from each table file, and maps each file row using the
>> database id as the key. The reducer receives the different sets of
>> fields for the same key and writes out the complete join.
>>
>> Is there any way to find at least the file name, even if not the complete URL?
>>
>> Lance
>>
>> On Fri, Aug 6, 2010 at 7:16 AM, Gang Luo <lg...@yahoo.com.cn> wrote:
>>> Hi all,
>>> I want to know which file (path) I am processing in each map task. In the old
>>> API, I can get it with JobConf.get("map.input.file"). When it comes to the new API,
>>> context.getConfiguration().get("map.input.file") returns null. Does that mean the
>>> parameter "map.input.file" does not exist in the new API? How about other job/task
>>> specific parameters? Is there any documentation talking about this?
>>>
>>> Thanks,
>>> -Gang
>>>
>>>
>>>
>>>
>>>
>>
>>
>>
>> --
>> Lance Norskog
>> goksron@gmail.com
>>
>
>
>
> --
> Harsh J
> www.harshj.com
>



-- 
Lance Norskog
goksron@gmail.com

Re: fail to get parameters in new API

Posted by Harsh J <qw...@gmail.com>.
Use this to get the actual path in the New API:
((FileSplit) context.getInputSplit()).getPath()

As explained in HADOOP-5973.
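
For context, a minimal sketch of that call inside a new-API Mapper (the class
name and the key/value choices here are made up for illustration):

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    public class PathAwareMapper extends Mapper<LongWritable, Text, Text, Text> {
        private String inputFile;

        @Override
        protected void setup(Context context) {
            // For plain files read through FileInputFormat, the split is a FileSplit.
            inputFile = ((FileSplit) context.getInputSplit()).getPath().toString();
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Every record can now be tagged with the file it came from.
            context.write(new Text(inputFile), value);
        }
    }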

On Sat, Aug 7, 2010 at 7:26 AM, Lance Norskog <go...@gmail.com> wrote:
> I have the same request. My use case is that I want to do a database
> join on three CSV files dumped from different tables in the DB. So, if I can
> read the file name, I can deduce which table it is. The map knows the
> field names from each table file, and maps each file row using the
> database id as the key. The reducer receives the different sets of
> fields for the same key and writes out the complete join.
>
> Is there any way to find at least the file name, even if not the complete URL?
>
> Lance
>
> On Fri, Aug 6, 2010 at 7:16 AM, Gang Luo <lg...@yahoo.com.cn> wrote:
>> Hi all,
>> I want to know which file (path) I am processing in each map task. In the old
>> API, I can get it with JobConf.get("map.input.file"). When it comes to the new API,
>> context.getConfiguration().get("map.input.file") returns null. Does that mean the
>> parameter "map.input.file" does not exist in the new API? How about other job/task
>> specific parameters? Is there any documentation talking about this?
>>
>> Thanks,
>> -Gang
>>
>>
>>
>>
>>
>
>
>
> --
> Lance Norskog
> goksron@gmail.com
>



-- 
Harsh J
www.harshj.com

Re: fail to get parameters in new API

Posted by Lance Norskog <go...@gmail.com>.
I have the same request. My use case is that I want to do a database
join on three CSV files dumped from different tables in the DB. So, if I can
read the file name, I can deduce which table it is. The map knows the
field names from each table file, and maps each file row using the
database id as the key. The reducer receives the different sets of
fields for the same key and writes out the complete join.

Is there any way to find at least the file name, even if not the complete URL?

Lance
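
A rough sketch of the map-side tagging this join needs, using the new-API
FileSplit path mentioned elsewhere in this thread; the CSV layout (join id in
the first column) and the table-name convention are assumptions:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    public class TaggingJoinMapper extends Mapper<LongWritable, Text, Text, Text> {
        private String table;   // deduced from the input file name

        @Override
        protected void setup(Context context) {
            // e.g. "customers.csv" -> "customers"; purely illustrative
            String fileName = ((FileSplit) context.getInputSplit()).getPath().getName();
            table = fileName.replaceFirst("\\.csv$", "");
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Assume the database id is the first CSV column.
            String[] fields = value.toString().split(",", 2);
            String dbId = fields[0];
            String rest = fields.length > 1 ? fields[1] : "";
            // Key by the join id and tag the value with its source table so
            // the reducer can tell the record sets apart.
            context.write(new Text(dbId), new Text(table + "\t" + rest));
        }
    }

The reducer then receives all tagged rows for one id together and can stitch
the fields from the three tables into the joined record.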

On Fri, Aug 6, 2010 at 7:16 AM, Gang Luo <lg...@yahoo.com.cn> wrote:
> Hi all,
> I want to know which file (path) I am processing in each map task. In the old
> API, I can get it with JobConf.get("map.input.file"). When it comes to the new API,
> context.getConfiguration().get("map.input.file") returns null. Does that mean the
> parameter "map.input.file" does not exist in the new API? How about other job/task
> specific parameters? Is there any documentation talking about this?
>
> Thanks,
> -Gang
>
>
>
>
>



-- 
Lance Norskog
goksron@gmail.com

Re: counter is not correct in new API

Posted by Amareshwari Sri Ramadasu <am...@yahoo-inc.com>.
How are you accessing the counter? You should access it through the enum org.apache.hadoop.mapreduce.TaskCounter.REDUCE_OUTPUT_RECORDS.
-Amareshwari
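
For illustration, a small driver-side helper that reads that counter after the
job finishes; it assumes a Hadoop release that ships
org.apache.hadoop.mapreduce.TaskCounter (older releases expose the same
counter through a different enum):

    import java.io.IOException;
    import org.apache.hadoop.mapreduce.Counter;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.TaskCounter;

    public class CounterCheck {
        // Call this after job.waitForCompletion(true) has returned.
        public static long reduceOutputRecords(Job job) throws IOException {
            Counter c = job.getCounters()
                           .findCounter(TaskCounter.REDUCE_OUTPUT_RECORDS);
            return c.getValue();
        }
    }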

On 8/8/10 2:12 AM, "Gang Luo" <lg...@yahoo.com.cn> wrote:

Hi all,
I am using the new API and find that the reduce output record counter shows 0.
My reducers actually output their results correctly. What can I do to correct the
counter?

Thanks,
-Gang





counter is not correct in new API

Posted by Gang Luo <lg...@yahoo.com.cn>.
Hi all,
I am using the new API and find that the reduce output record counter shows 0.
My reducers actually output their results correctly. What can I do to correct the
counter?

Thanks,
-Gang


      

fail to get parameters in new API

Posted by Gang Luo <lg...@yahoo.com.cn>.
Hi all,
I want to know which file (path) I am processing in each map task. In the old 
API, I can get it with JobConf.get("map.input.file"). When it comes to the new API,
context.getConfiguration().get("map.input.file") returns null. Does that mean the
parameter "map.input.file" does not exist in the new API? How about other job/task
specific parameters? Is there any documentation talking about this?

Thanks,
-Gang
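
For comparison, the old-API pattern the question refers to looks roughly like
this (the class name and output choices are illustrative):

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class OldApiMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, Text> {
        private String inputFile;

        @Override
        public void configure(JobConf job) {
            // Old API: the framework sets this per-task property.
            inputFile = job.get("map.input.file");
        }

        @Override
        public void map(LongWritable key, Text value,
                        OutputCollector<Text, Text> output, Reporter reporter)
                throws IOException {
            output.collect(new Text(inputFile), value);
        }
    }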



      

Re: Best way to reduce a 8-node cluster in half and get hdfs to come out of safe mode

Posted by Steve Kuo <ku...@gmail.com>.
Thanks Allen for your advice.

Re: Best way to reduce a 8-node cluster in half and get hdfs to come out of safe mode

Posted by Allen Wittenauer <aw...@linkedin.com>.
On Aug 6, 2010, at 8:35 AM, He Chen wrote:

> Way#3
> 
> 1) bring up all 8 dn and the nn
> 2) retire one of your 4 nodes:
>           kill the datanode process
>           hadoop dfsadmin -refreshNodes  (this should be done on nn)

No need to refresh nodes.  It only re-reads the dfs.hosts.* files.


> 3) repeat 2) three more times

Depending upon what the bandwidth param is, this should theoretically take a significantly longer time, since you need the grid to get back to healthy before each kill.

Re: Best way to reduce a 8-node cluster in half and get hdfs to come out of safe mode

Posted by He Chen <ai...@gmail.com>.
Way#3

1) bring up all 8 dn and the nn
2) retire one of your 4 nodes:
           kill the datanode process
           hadoop dfsadmin -refreshNodes  (this should be done on nn)
3) repeat 2) three more times (rough shell sketch below)
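
A rough shell sketch of one round of step 2), with placeholder host names and
assuming the stock start/stop scripts:

    # on the datanode being retired:
    $HADOOP_HOME/bin/hadoop-daemon.sh stop datanode

    # on the nn: wait until fsck no longer reports under-replicated or
    # missing blocks before retiring the next node
    hadoop fsck /

    # note: per the other reply, dfsadmin -refreshNodes only re-reads the
    # dfs.hosts.* files, so it is optional here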

On Fri, Aug 6, 2010 at 1:21 AM, Allen Wittenauer
<aw...@linkedin.com>wrote:

>
> On Aug 5, 2010, at 10:42 PM, Steve Kuo wrote:
>
> > As part of our experimentation, the plan is to pull 4 slave nodes out of an
> > 8-slave/1-master cluster. With the replication factor set to 3, I thought
> > losing half of the cluster might be too much for hdfs to recover from.  Thus I
> > copied all relevant data out of hdfs to local disk and reconfigured the
> > cluster.
>
> It depends.  If you have configured Hadoop to have a topology such that the
> 8 nodes were in 2 logical racks, then it would have worked just fine.  If
> you didn't have any topology configured, then each node is considered its
> own rack.  So pulling half of the grid down means you are likely losing a
> good chunk of all your blocks.
>
>
>
>
> >
> > The 4 slave nodes started okay but hdfs never left safe mode.  The nn.log
> > has the following lines.  What is the best way to deal with this?  Shall I
> > restart the cluster with 8 nodes and then delete
> > /data/hadoop-hadoop/mapred/system?  Or shall I reformat hdfs?
>
> Two ways to go:
>
> Way #1:
>
> 1) configure dfs.hosts
> 2) bring up all 8 nodes
> 3) configure dfs.hosts.exclude to include the 4 you don't want
> 4) dfsadmin -refreshNodes to start decommissioning the 4 you don't want
>
> Way #2:
>
> 1) configure a topology
> 2) bring up all 8 nodes
> 3) setrep all files +1
> 4) wait for nn to finish replication
> 5) pull 4 nodes
> 6) bring down nn
> 7) remove topology
> 8) bring nn up
> 9) setrep -1
>
>
>
>


-- 
Best Wishes!
顺送商祺!

--
Chen He
(402)613-9298
PhD. student of CSE Dept.
Research Assistant of Holland Computing Center
University of Nebraska-Lincoln
Lincoln NE 68588

Re: Best way to reduce a 8-node cluster in half and get hdfs to come out of safe mode

Posted by Allen Wittenauer <aw...@linkedin.com>.
On Aug 5, 2010, at 10:42 PM, Steve Kuo wrote:

> As part of our experimentation, the plan is to pull 4 slave nodes out of an
> 8-slave/1-master cluster. With the replication factor set to 3, I thought
> losing half of the cluster might be too much for hdfs to recover from.  Thus I
> copied all relevant data out of hdfs to local disk and reconfigured the
> cluster.

It depends.  If you have configured Hadoop to have a topology such that the 8 nodes were in 2 logical racks, then it would have worked just fine.  If you didn't have any topology configured, then each node is considered its own rack.  So pulling half of the grid down means you are likely losing a good chunk of all your blocks.




> 
> The 4 slave nodes started okay but hdfs never left safe mode.  The nn.log
> has the following lines.  What is the best way to deal with this?  Shall I
> restart the cluster with 8 nodes and then delete
> /data/hadoop-hadoop/mapred/system?  Or shall I reformat hdfs?

Two ways to go:

Way #1:

1) configure dfs.hosts
2) bring up all 8 nodes
3) configure dfs.hosts.exclude to include the 4 you don't want
4) dfsadmin -refreshNodes to start decommissioning the 4 you don't want (see the sketch at the end of this message)

Way #2:

1) configure a topology
2) bring up all 8 nodes
3) setrep all files +1
4) wait for nn to finish replication
5) pull 4 nodes
6) bring down nn
7) remove topology
8) bring nn up
9) setrep -1
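
A minimal sketch of Way #1 above, with made-up file paths; the property names
are the usual dfs.hosts / dfs.hosts.exclude entries in the namenode's
hdfs-site.xml:

    <!-- hdfs-site.xml on the namenode -->
    <property>
      <name>dfs.hosts</name>
      <value>/path/to/dfs.include</value>    <!-- lists all 8 datanodes -->
    </property>
    <property>
      <name>dfs.hosts.exclude</name>
      <value>/path/to/dfs.exclude</value>    <!-- starts out empty -->
    </property>

    # bring up all 8 datanodes, add the 4 to retire to dfs.exclude, then:
    hadoop dfsadmin -refreshNodes

    # each excluded node should show "Decommission in progress" and
    # eventually "Decommissioned" in:
    hadoop dfsadmin -report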