Posted to common-user@hadoop.apache.org by Sandy <sn...@gmail.com> on 2008/09/10 23:59:02 UTC

installing hadoop on a OS X cluster

I am starting an install of Hadoop on a new cluster. However, I am a little
confused about which set of instructions I should follow, having only installed
and played around with Hadoop on a single-node Ubuntu box with 2 cores (on a
single board) and 2 GB of RAM.

The new machine has 2 internal nodes, each with 4 cores. I would like Hadoop
to run in a distributed context over these 8 cores. One of my biggest
issues is the definition of the word "node". From the Hadoop wiki and
documentation, it seems that "node" means "machine", not a board. By
this definition, our cluster is really one "node". Is this correct?

If this is the case, then I shouldn't be using the "cluster setup"
instructions, located here:
http://hadoop.apache.org/core/docs/r0.17.2/cluster_setup.html

But this one:
http://hadoop.apache.org/core/docs/r0.17.2/quickstart.html

Which is what I've been doing. But which mode of operation should I use? I
don't think it should be standalone. Should it be pseudo-distributed? If so,
how can I guarantee that work will be spread over all 8 cores? What is
necessary in the hadoop-site.xml file?
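For reference, the pseudo-distributed example in the quickstart gives a
hadoop-site.xml along these lines (copied from my reading of the r0.17.2
docs, so the values may need adjusting for this machine):

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>localhost:9000</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>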

Here are the specs of the machine:
    - Mac Pro RAID Card (065-7214)
    - Two 3.0GHz Quad-Core Intel Xeon (8 cores total) (065-7534)
    - 16GB RAM (4 x 4GB) (065-7179)
    - 1TB 7200-rpm Serial ATA 3Gb/s (065-7544)
    - 1TB 7200-rpm Serial ATA 3Gb/s (065-7546)
    - 1TB 7200-rpm Serial ATA 3Gb/s (065-7193)
    - 1TB 7200-rpm Serial ATA 3Gb/s (065-7548)


Could someone please point me to the correct mode of operation and the
instructions to install things correctly on this machine? I found some
information on how to install on an OS X machine in the archives, but it is a
touch outdated and seems to be missing some things.

Thank you very much for your time.

-SM

Re: installing hadoop on a OS X cluster

Posted by Sandy <sn...@gmail.com>.
Thanks for the swift response.

I have 4 disk drives (please see the specs), so I'm not sure the disk will
still be a bottleneck. Would you agree?

I think we are dealing with data-intensive jobs: my input data can be a few
gigabytes in size (though in theory it could be larger). I understand that
this may seem small in comparison to what some people run. On my old machine,
one job took several hours to complete the reduce in the first MapReduce
phase before running out of memory (and this was after I increased the heap
size).

I'm trying to increase the max heap size (HADOOP_HEAPSIZE) in hadoop-env.sh
on this machine past 2000 MB, but Hadoop gives me errors. Is this normal? I'm
running hadoop-0.17.2. Is there anywhere else I need to specify a heap
increase?
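For reference, here is roughly what I have. In hadoop-env.sh:

# Maximum heap to use for the Hadoop daemons, in MB (the default is 1000)
export HADOOP_HEAPSIZE=2000

and, if I am reading hadoop-default.xml correctly, the heap for each spawned
task JVM is controlled separately, so I assume I also need something like
this in hadoop-site.xml:

<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx1024m</value>
  <description>Java options passed to each task tracker child
  process; -Xmx sets the max heap per task JVM.</description>
</property>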

Lastly, one more modification I think I will need to make is increasing the
maximum number of simultaneous map/reduce tasks to 8 (one per core). I made
that change in hadoop-site.xml by adding an additional property:

<property>
  <name>mapred.tasktracker.tasks.maximum</name>
  <value>8</value>
  <description>The maximum number of tasks that will be run
  simultaneously by a task tracker.</description>
</property>
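(As an aside: while digging through the defaults I got the impression that
newer releases split this into separate map and reduce limits. If that
already applies to 0.17.2, I assume the equivalent would be to set both of
the following instead, but I am not sure:)

<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>8</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>8</value>
</property>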

I don't see a mapred-default.xml file in the conf folder. I'm guessing this
was removed in later versions? Is there anywhere else I would need to specify
an increase in map and reduce tasks, aside from JobConf.setNumMapTasks and
JobConf.setNumReduceTasks?
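For what it's worth, here is a minimal sketch of how I am setting them in my
job driver (the class and job names are placeholders for my actual code, and
I have left out the input/output and mapper/reducer setup):

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class MyJobDriver {
    public static void main(String[] args) throws Exception {
        // The class argument lets Hadoop find the jar containing the job.
        JobConf conf = new JobConf(MyJobDriver.class);
        conf.setJobName("myjob");

        // As I understand it, the map count is only a hint to the framework,
        // while the reduce count is honored exactly.
        conf.setNumMapTasks(8);
        conf.setNumReduceTasks(8);

        // Input/output paths and mapper/reducer classes would be set here.
        JobClient.runJob(conf);
    }
}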

Thanks again for your time.

-SM
PS - I'm going to update the wiki with installation instructions for OS X as
soon as I get everything finished up :-)



On Wed, Sep 10, 2008 at 5:23 PM, Jim Twensky <ji...@gmail.com> wrote:

> Apparently you have one node with 2 processors, where each processor has 4
> cores. What do you want to use Hadoop for? If you have a single disk drive
> and multiple cores on one node, then a pseudo-distributed environment seems
> like the best approach to me, as long as you are not dealing with large
> amounts of data. If you have a single disk drive and a huge amount of data
> to process, then the disk drive might be a bottleneck for your applications.
> Hadoop is usually used for data-intensive applications, whereas your
> hardware seems more likely to be designed for CPU-intensive jobs,
> considering the 8 cores on a single node.
>
> Tim

Re: installing hadoop on a OS X cluster

Posted by Jim Twensky <ji...@gmail.com>.
Apparently you have one node with 2 processors, where each processor has 4
cores. What do you want to use Hadoop for? If you have a single disk drive
and multiple cores on one node, then a pseudo-distributed environment seems
like the best approach to me, as long as you are not dealing with large
amounts of data. If you have a single disk drive and a huge amount of data to
process, then the disk drive might be a bottleneck for your applications.
Hadoop is usually used for data-intensive applications, whereas your hardware
seems more likely to be designed for CPU-intensive jobs, considering the 8
cores on a single node.
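
If you do go pseudo-distributed, the sequence from the quickstart is roughly
the following (from memory, so please double-check it against the docs):

# Format a new HDFS filesystem (first time only)
bin/hadoop namenode -format

# Start the NameNode, DataNode, JobTracker and TaskTracker daemons
bin/start-all.sh

# ... run your jobs ...

# Stop the daemons when you are done
bin/stop-all.sh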

Tim
