Posted to common-user@hadoop.apache.org by Foss User <fo...@gmail.com> on 2009/04/05 10:14:03 UTC

Newbie questions on Hadoop local directories?

I am trying to learn Hadoop, and a lot of questions come to my mind
as I work through it. I will be asking a few questions here from
time to time until I feel completely comfortable with it. Here are
some questions now:

1. Is it true that Hadoop should be installed in the same location on
all Linux machines? As far as I understand, it is necessary to
install it at the same path on all nodes only if I am going to use
bin/start-dfs.sh and bin/start-mapred.sh to start the datanodes and
task trackers on all slaves. Otherwise, it is not required. How
correct am I?

2. Say a slave goes down (due to network problems or a power cut) while
a word count job is running. When it comes up again, what do I need to
do? Are bin/hadoop-daemon.sh start datanode and bin/hadoop-daemon.sh
start tasktracker enough for recovery? Do I have to delete any
/tmp/hadoop-hadoop directories before starting? Is it guaranteed that,
on startup, any corrupt files in the tmp directory will be discarded
and everything will be restored to normal?

3. Say I have 1 master and 4 slaves, and I start a datanode on 2 slaves
and a tasktracker on the other two. I put files into HDFS, which means
the files would be stored on the first two datanodes. Then I run a
word count job, which means the word count tasks would run on the two
task trackers. How would the two task trackers now get the files to do
the word counting? In the documentation I read that tasks are run on
the nodes that have the data, but in this setup the datanodes and task
trackers are separate. So how will the word count job do its work?

Re: Newbie questions on Hadoop local directories?

Posted by Todd Lipcon <to...@cloudera.com>.
On Sun, Apr 5, 2009 at 1:14 AM, Foss User <fo...@gmail.com> wrote:

> I am trying to learn Hadoop, and a lot of questions come to my mind
> as I work through it. I will be asking a few questions here from
> time to time until I feel completely comfortable with it. Here are
> some questions now:
>
> 1. Is it true that Hadoop should be installed in the same location on
> all Linux machines? As far as I understand, it is necessary to
> install it at the same path on all nodes only if I am going to use
> bin/start-dfs.sh and bin/start-mapred.sh to start the datanodes and
> task trackers on all slaves. Otherwise, it is not required. How
> correct am I?


That's correct. To use those scripts, the "hadoop" script needs to be at
the same location on every node. The different machines could
theoretically have different hadoop-site.xml files, though, pointing
properties such as dfs.data.dir (and dfs.name.dir on the namenode) at
different locations. This makes management a bit trickier, but it is
useful if you have different disk setups on different machines.
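
For example, a slave whose disk is mounted somewhere unusual could carry
its own hadoop-site.xml overrides. A minimal sketch (the mount point and
paths below are made up for illustration):

    <!-- hadoop-site.xml on a slave whose disk is mounted at /mnt/bigdisk
         (hypothetical path, adjust to your layout) -->
    <property>
      <name>dfs.data.dir</name>
      <value>/mnt/bigdisk/hadoop/dfs/data</value>
    </property>
    <property>
      <name>mapred.local.dir</name>
      <value>/mnt/bigdisk/hadoop/mapred/local</value>
    </property>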


>
>
> 2. Say a slave goes down (due to network problems or a power cut) while
> a word count job is running. When it comes up again, what do I need to
> do? Are bin/hadoop-daemon.sh start datanode and bin/hadoop-daemon.sh
> start tasktracker enough for recovery? Do I have to delete any
> /tmp/hadoop-hadoop directories before starting? Is it guaranteed that,
> on startup, any corrupt files in the tmp directory will be discarded
> and everything will be restored to normal?
>

Yes - just starting the daemons should be enough. They'll clean up their
temporary files on their own.
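
Concretely, on the recovered slave that just means running the daemon
scripts you mentioned again, something like:

    # run on the slave after it comes back up
    cd /path/to/hadoop                       # wherever Hadoop is installed on that node (placeholder path)
    bin/hadoop-daemon.sh start datanode
    bin/hadoop-daemon.sh start tasktracker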


>
> 3. Say I have 1 master and 4 slaves, and I start a datanode on 2 slaves
> and a tasktracker on the other two. I put files into HDFS, which means
> the files would be stored on the first two datanodes. Then I run a
> word count job, which means the word count tasks would run on the two
> task trackers. How would the two task trackers now get the files to do
> the word counting? In the documentation I read that tasks are run on
> the nodes that have the data, but in this setup the datanodes and task
> trackers are separate. So how will the word count job do its work?
>

Hadoop will *try* to schedule tasks with data locality in mind, but if
that's impossible, it will read data off of remote nodes. Even when a task
runs data-local, it uses the same TCP-based protocol to fetch data from
the datanode (this is something that is currently being worked on). Data
locality is an optimization to avoid network IO, not a requirement.
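
If you want to see which datanodes actually hold the blocks of your input,
fsck can print the block locations (the HDFS path below is just an example):

    bin/hadoop fsck /user/hadoop/input -files -blocks -locations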

FYI, you shouldn't run with fewer than 3 datanodes with the default
configuration. This may be the source of some of your problems in other
messages you've sent recently. The default value for dfs.replication in
hadoop-default.xml is 3, meaning that HDFS will try to place each block on
3 machines. If there are only 2 datanodes up, all of your blocks will by
definition be under-replicated, and your cluster will be somewhat grumpy.
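
If you really do want to test with only two datanodes, one option is to
lower the replication factor in hadoop-site.xml (a sketch, assuming a
small test cluster; set it before writing your files):

    <property>
      <name>dfs.replication</name>
      <value>2</value>
    </property>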

-Todd