Posted to common-user@hadoop.apache.org by vikas <pv...@gmail.com> on 2008/04/21 12:36:12 UTC

Newbie quick questions :-)

Hi,

I'm new to Hadoop and aiming to develop a good amount of code with it. I have
some quick questions; it would be highly appreciated if someone could answer
them.

I was able to run Hadoop in a Cygwin environment and ran the examples both in
standalone mode and in a 2-node cluster.

1) How can I avoid having to give a password for SSH logins whenever the
DataNodes are started?

2) I've put a 1.5 GB file on my master node, where a DataNode is also
running. I want to see how load balancing can be done so that disk space is
utilized on the other DataNodes as well.

3) How can I add a new DataNode without stopping Hadoop?

4) Suppose I want to shut down one DataNode for maintenance. Is there any
way to inform Hadoop that this particular DataNode is going down, so it
makes sure the data on it is replicated elsewhere?

5) I was going through some videos on MapReduce and a few Yahoo tech talks.
In them, they mentioned that a Hadoop cluster has multiple cores -- what
does this mean?

  5.1) Can I have multiple instances of the NameNode running in a cluster,
apart from the secondary NameNode?

6) If I keep creating huge files, will they be balanced among all the
DataNodes? Or do I need to change the file creation location in the
application?

Looking forward to your kind response,

Thanking you,
-Vikas.

Re: Newbie quick questions :-)

Posted by Luca <ra...@yahoo.it>.
vikas wrote:
> Hi,
> 
> I'm new to Hadoop and aiming to develop a good amount of code with it. I have
> some quick questions; it would be highly appreciated if someone could answer
> them.
> 
> I was able to run Hadoop in a Cygwin environment and ran the examples both in
> standalone mode and in a 2-node cluster.
> 
> 1) How can I avoid having to give a password for SSH logins whenever the
> DataNodes are started?
> 

Create SSH keys (either user- or host-based) and pair the hosts with 
these keys. This is the first URL I found for "ssh public key 
authentication": http://sial.org/howto/openssh/publickey-auth/
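
As a rough sketch (the hostname is a placeholder, not from the thread), a
passphrase-less key setup from the master to a slave looks something like:

   # generate a key pair without a passphrase on the master
   ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa
   # authorize it locally (useful when the master also runs a DataNode)
   cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
   # copy it to each slave and authorize it there
   scp ~/.ssh/id_rsa.pub slave1:~/master_key.pub
   ssh slave1 'cat ~/master_key.pub >> ~/.ssh/authorized_keys'

After that, start-dfs.sh should reach the slaves without prompting.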

> 2) I've put a 1.5 GB file on my master node, where a DataNode is also
> running. I want to see how load balancing can be done so that disk space is
> utilized on the other DataNodes as well.
> 

Not sure how to answer this one. HDFS has knowledge of three 
entities: the node, the rack, and the rest of the cluster. In the default 
configuration, each block is replicated 3 times, once per entity. If 
you don't have racks, you might want to fine-tune the replication of 
files through the HDFS shell.
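
For example, something like this (a sketch; the path is a placeholder)
changes the replication factor of one file:

   # set replication to 3; -w waits until re-replication completes
   bin/hadoop dfs -setrep -w 3 /user/vikas/bigfile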

> 3) How can I add a new DataNode without stopping Hadoop?

Just add it to the slaves file and run start-dfs.sh. Nodes that are already 
running won't be touched.
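
Roughly (the hostname is a placeholder):

   # register the new node, then start whatever daemons aren't up yet
   echo newnode >> conf/slaves
   bin/start-dfs.sh

start-dfs.sh skips nodes whose daemons are already running, so in effect
this only starts the new DataNode.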

> 
> 4) Suppose I want to shut down one DataNode for maintenance. Is there any
> way to inform Hadoop that this particular DataNode is going down, so it
> makes sure the data on it is replicated elsewhere?
> 

Replication of blocks with a factor >= 2 should do the job. In the 
general case, the default replication factor is 3. You can check the 
replication factor through the HDFS shell.
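
One way to check it (a sketch; the path is a placeholder):

   # fsck reports the replication of each file it visits
   bin/hadoop fsck /user/vikas/bigfile -files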

> 5) I was going through some videos on MapReduce and a few Yahoo tech talks.
> In them, they mentioned that a Hadoop cluster has multiple cores -- what
> does this mean?
>

Are you talking about multi-core processors?

>   5.1) Can I have multiple instances of the NameNode running in a cluster,
> apart from the secondary NameNode?
> 

Not sure about this, but as far as I know there should be only one 
NameNode running.

> 6) If I keep creating huge files, will they be balanced among all the
> DataNodes? Or do I need to change the file creation location in the
> application?
> 

Files are divided into blocks, and the blocks are replicated. Huge files are 
simply composed of a larger set of blocks. In principle, you don't know 
where your blocks will end up, apart from the entities I mentioned 
before. And in principle, you shouldn't care where they end up, 
because Hadoop applications will take care of sending tasks to where the 
data resides.
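
If you want different defaults for new files, the replication factor and
block size can be set in conf/hadoop-site.xml. A minimal sketch (the values
shown are just the shipped defaults, not a recommendation):

   <property>
     <name>dfs.replication</name>
     <value>3</value>
   </property>
   <property>
     <name>dfs.block.size</name>
     <value>67108864</value>  <!-- 64 MB -->
   </property>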

Ciao,
Luca


Re: Newbie quick questions :-)

Posted by Allen Wittenauer <aw...@yahoo-inc.com>.


On 4/21/08 3:36 AM, "vikas" <pv...@gmail.com> wrote:

    Most of your questions have been answered by Luca, from what I can see,
so let me tackle the rest a bit...

> 4) Suppose I want to shut down one DataNode for maintenance. Is there any
> way to inform Hadoop that this particular DataNode is going down, so it
> makes sure the data on it is replicated elsewhere?

    You want to do datanode decommissioning.  See
http://wiki.apache.org/hadoop/FAQ#17 for details.
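
    In outline it's something like this (a sketch, assuming dfs.hosts.exclude
in your config points at conf/excludes; "badnode" is a placeholder hostname):

   # list the node to retire, then tell the NameNode to re-read the list
   echo badnode >> conf/excludes
   bin/hadoop dfsadmin -refreshNodes
   # watch the report until the node shows up as decommissioned
   bin/hadoop dfsadmin -report

The NameNode re-replicates the node's blocks before marking it
decommissioned, so the machine is safe to take down afterwards.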

> 5) I was going through some videos on MapReduce and a few Yahoo tech talks.
> In them, they mentioned that a Hadoop cluster has multiple cores -- what
> does this mean?

    I haven't watched the tech talks in ages, but we generally refer to
cores in a variety of ways.  There is the single physical box version--an
individual processor has more than one execution unit, thereby giving it a
degree of parallelism.  Then there is the complete grid count--an individual
grid can have lots and lots of processors with lots and lots of individual
cores on those processors... which works out to be a pretty good rough
estimate of how many individual Hadoop tasks can be run simultaneously.

>   5.1) Can I have multiple instances of the NameNode running in a cluster,
> apart from the secondary NameNode?

    No.  The NameNode is a single point of failure in the system.
 
> 6) If I keep creating huge files, will they be balanced among all the
> DataNodes? Or do I need to change the file creation location in the
> application?

    In addition to what Luca said, be aware that if you load a file from a
machine running a DataNode process, the data for that file will *always* get
loaded onto that machine.  This can cause your DataNodes to become extremely
unbalanced.  You are much better off doing data loads *off grid*, from
another machine.  Since you only need the Hadoop configuration and binaries
available (in other words, no Hadoop processes need be running), this
usually isn't too painful to do.
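
    The load itself is just the normal shell copy, run from the off-grid
machine (the paths are placeholders):

   # run from a box with the Hadoop config and binaries but no daemons
   bin/hadoop dfs -put /local/data/bigfile /user/vikas/bigfile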

    In 0.16.x, there is a rebalancer to help fix this situation, but I have
no practical experience with it yet, so I can't say whether or not it works.
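
    For reference, it's driven by a pair of scripts (the 10% threshold is
illustrative, not a recommendation):

   # move blocks until each node is within 10% of the cluster-wide average
   bin/start-balancer.sh -threshold 10
   # the balancer can be stopped safely at any point
   bin/stop-balancer.sh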