Posted to common-user@hadoop.apache.org by Allen Wittenauer <aw...@yahoo-inc.com> on 2008/09/01 16:32:09 UTC

Re: Integrate HADOOP and Map/Reduce paradigm into HPC environment



On 8/18/08 11:33 AM, "Filippo Spiga" <sp...@gmail.com> wrote:
> Well, but I haven't understood how I should configure HOD to work in this
> manner.
>
> For HDFS I follow this sequence of steps:
> - conf/master contains only the master node of my cluster
> - conf/slaves contains all nodes
> - I start HDFS using bin/start-dfs.sh

    Right, fine...

> Potentially I would like all nodes to be usable for MapReduce.
> For HOD, which parameter should I set in contrib/hod/conf/hodrc? Should I
> change only the gridservice-hdfs section?

    I was hoping the HOD folks would answer this question for you, but they
are apparently sleeping. :)

    Anyway, yes, if you point gridservice-hdfs to a static HDFS,  it should
use that as the -default- HDFS. That doesn't prevent a user from using HOD
to create a custom HDFS as part of their job submission.
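
    For example, the gridservice-hdfs section of the hodrc would look
roughly like the sketch below. Treat it as a sketch only: the hostname and
ports are placeholders, and the option names are as I recall them from the
HOD user guide, so please verify them against the guide.

[gridservice-hdfs]
external  = True
host      = namenode.example.com
fs_port   = 54310
info_port = 50070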


Re: Integrate HADOOP and Map/Reduce paradigm into HPC environment

Posted by Hemanth Yamijala <yh...@yahoo-inc.com>.
Hemanth Yamijala wrote:
> Filippo Spiga wrote:
>> This procedure allows me to:
>> - use a persistent HDFS on the whole cluster, placing the namenode on the
>> frontend (always up and running) and the datanodes on the other nodes
>> - submit a lot of jobs to the resource manager transparently, without any
>> problem, and manage job priority/reservation with MAUI just as for other
>> classical HPC jobs
>> - execute the jobtracker and tasktracker services on the nodes chosen by
>> TORQUE (in particular, the first node selected becomes the jobtracker)
>> - store logs for different users in separate directories
>> - run only one job at a time (though multiple map/reduce jobs can probably
>> run together, because different jobs use different subsets of nodes)
>>
>> Probably HOD does what I can do with my raw script... it's possible that I
>> don't understand the user guide well...
>>
>>   
> Filippo, HOD indeed allows you to do all these things, and a little bit
> more. On the other hand, your script always executes the jobtracker on the
> first node, which also seems useful to me. It would be nice if you could
> still try HOD and see if it makes your life simpler in any way. :-)
>
Some things that HOD does automatically:
- Set up log directories differently for different users
- Port numbers need not be fixed; HOD detects free ports and provisions the
services to use them
- Depending on need, you can also deploy a custom tarball of hadoop, rather
than using a pre-installed version.

Also, since HOD is only a thin wrapper around the resource manager, all the
policies that you set up for Maui automatically apply to HOD-run clusters.
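
As a rough usage sketch (the cluster directory, node count, and example jar
path below are placeholders), allocating a cluster through HOD, running a
job against it and releasing it looks something like:

hod allocate -d ~/hod-clusters/test -n 4
hadoop --config ~/hod-clusters/test jar \
    /opt/hadoop-0.18.0/hadoop-0.18.0-examples.jar pi 2 100
hod deallocate -d ~/hod-clusters/test

HOD writes the client-side hadoop-site.xml for the provisioned cluster into
the cluster directory, which is why the plain hadoop command can simply be
pointed at it with --config.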

>> Sorry for my english :-P
>>
>> Regards
>>
>> 2008/9/2 Hemanth Yamijala <yh...@yahoo-inc.com>
>>
>>  
>>> Allen Wittenauer wrote:
>>>
>>>    
>>>> On 8/18/08 11:33 AM, "Filippo Spiga" <sp...@gmail.com> wrote:
>>>>
>>>>
>>>>      
>>>>> Well, but I haven't understood how I should configure HOD to work in
>>>>> this manner.
>>>>>
>>>>> For HDFS I follow this sequence of steps:
>>>>> - conf/master contains only the master node of my cluster
>>>>> - conf/slaves contains all nodes
>>>>> - I start HDFS using bin/start-dfs.sh
>>>>>
>>>>>
>>>>>         
>>>>    Right, fine...
>>>>
>>>>
>>>>
>>>>      
>>>>> Potentially I would like all nodes to be usable for MapReduce.
>>>>> For HOD, which parameter should I set in contrib/hod/conf/hodrc?
>>>>> Should I change only the gridservice-hdfs section?
>>>>>
>>>>>
>>>>>         
>>>>    I was hoping the HOD folks would answer this question for you, 
>>>> but they
>>>> are apparently sleeping. :)
>>>>
>>>>
>>>>
>>>>       
>>> Whoops! Sorry, I missed this.
>>>
>>>    
>>>>    Anyway, yes, if you point gridservice-hdfs to a static HDFS,  it 
>>>> should
>>>> use that as the -default- HDFS. That doesn't prevent a user from 
>>>> using HOD
>>>> to create a custom HDFS as part of their job submission.
>>>>
>>>>
>>>>
>>>>       
>>> Allen's answer is perfect. Please refer to
>>> http://hadoop.apache.org/core/docs/current/hod_user_guide.html#Using+an+external+HDFS 
>>>
>>> for more information about how to set up the gridservice-hdfs 
>>> section to
>>> use a static or
>>> external HDFS.
>>>
>>>
>>>
>>>     
>>
>>
>>   
>


Re: Integrate HADOOP and Map/Reduce paradigm into HPC environment

Posted by Hemanth Yamijala <yh...@yahoo-inc.com>.
Filippo Spiga wrote:
> This procedure allows me to:
> - use a persistent HDFS on the whole cluster, placing the namenode on the
> frontend (always up and running) and the datanodes on the other nodes
> - submit a lot of jobs to the resource manager transparently, without any
> problem, and manage job priority/reservation with MAUI just as for other
> classical HPC jobs
> - execute the jobtracker and tasktracker services on the nodes chosen by
> TORQUE (in particular, the first node selected becomes the jobtracker)
> - store logs for different users in separate directories
> - run only one job at a time (though multiple map/reduce jobs can probably
> run together, because different jobs use different subsets of nodes)
>
> Probably HOD does what I can do with my raw script... it's possible that I
> don't understand the user guide well...
>
>   
Filippo, HOD indeed allows you to do all these things, and a little bit
more. On the other hand, your script always executes the jobtracker on the
first node, which also seems useful to me. It would be nice if you could
still try HOD and see if it makes your life simpler in any way. :-)

> Sorry for my english :-P
>
> Regards
>
> 2008/9/2 Hemanth Yamijala <yh...@yahoo-inc.com>
>
>   
>> Allen Wittenauer wrote:
>>
>>     
>>> On 8/18/08 11:33 AM, "Filippo Spiga" <sp...@gmail.com> wrote:
>>>
>>>
>>>       
>>>> Well, but I haven't understood how I should configure HOD to work in
>>>> this manner.
>>>>
>>>> For HDFS I follow this sequence of steps:
>>>> - conf/master contains only the master node of my cluster
>>>> - conf/slaves contains all nodes
>>>> - I start HDFS using bin/start-dfs.sh
>>>>
>>>>
>>>>         
>>>    Right, fine...
>>>
>>>
>>>
>>>       
>>>> Potentially I would like all nodes to be usable for MapReduce.
>>>> For HOD, which parameter should I set in contrib/hod/conf/hodrc? Should I
>>>> change only the gridservice-hdfs section?
>>>>
>>>>
>>>>         
>>>    I was hoping the HOD folks would answer this question for you, but they
>>> are apparently sleeping. :)
>>>
>>>
>>>
>>>       
>> Whoops! Sorry, I missed this.
>>
>>     
>>>    Anyway, yes, if you point gridservice-hdfs to a static HDFS,  it should
>>> use that as the -default- HDFS. That doesn't prevent a user from using HOD
>>> to create a custom HDFS as part of their job submission.
>>>
>>>
>>>
>>>       
>> Allen's answer is perfect. Please refer to
>> http://hadoop.apache.org/core/docs/current/hod_user_guide.html#Using+an+external+HDFS
>> for more information about how to set up the gridservice-hdfs section to
>> use a static or
>> external HDFS.
>>
>>
>>
>>     
>
>
>   


Re: Integrate HADOOP and Map/Reduce paradigm into HPC environment

Posted by Filippo Spiga <sp...@gmail.com>.
Hi guys, sorry for my late reply. For my needs I have adopted this strategy
to run HADOOP on my HPC cluster.

I created a user "hadoop". This user runs the HDFS services on all nodes of
my cluster (the namenode on the frontend node and the datanodes on the
compute nodes). After that, I copied the hadoop "conf" folder into my home
directory and created a new folder for my logs.

I deleted the "master" and "slaves" files from the copied "conf" folder and
added the following line to the "hadoop-env.sh" file:
export HADOOP_LOG_DIR=~/HADOOP-LOGS
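
(Concretely, that per-user setup amounts to something like the following
sketch; the install path /opt/hadoop-0.18.0 is just what my cluster uses,
so adjust it as needed.)

# Copy the site-wide conf into my home and keep per-user logs
mkdir -p ~/HADOOP-CONF ~/HADOOP-LOGS
cp /opt/hadoop-0.18.0/conf/* ~/HADOOP-CONF/
# Drop the static master/slaves lists; the job script regenerates them
rm -f ~/HADOOP-CONF/master* ~/HADOOP-CONF/slaves
echo 'export HADOOP_LOG_DIR=~/HADOOP-LOGS' >> ~/HADOOP-CONF/hadoop-env.sh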

Also, I created a new file called "hadoop-site.xml.template" with the same
contents as the file hadoop-site.xml, except that this part

<property>
  <name>mapred.job.tracker</name>
  <value>scilx:54311</value>
  <description>The host and port that the MapReduce job tracker runs
  at.  If "local", then jobs are run in-process as a single map
  and reduce task.
  </description>
</property>

is changed in this way:

<property>
  <name>mapred.job.tracker</name>
  <value>XXXXXX:54311</value>
  <description>The host and port that the MapReduce job tracker runs
  at.  If "local", then jobs are run in-process as a single map
  and reduce task.
  </description>
</property>

Finally, I put it all together with a script like this:

#!/bin/bash
#PBS -V
#PBS -S /bin/bash
#PBS -o testjob.out
#PBS -e testjob.err
#PBS -N HADOOP
#PBS -l nodes=4:ppn=2
#PBS -l walltime=1:00:00
#PBS -q hadoop

echo "Selected nodes for HADOOP Map-Reduce job..."

# Build the slaves file from the nodes TORQUE assigned to this job
cat ${PBS_NODEFILE} | sed -e "s/ /\n/g" | uniq > ${PBS_O_WORKDIR}/HADOOP-CONF/slaves

# The first allocated node becomes the jobtracker (master)
head -n 1 ${PBS_O_WORKDIR}/HADOOP-CONF/slaves > ${PBS_O_WORKDIR}/HADOOP-CONF/master

export HADOOP_MASTER=`cat ${PBS_O_WORKDIR}/HADOOP-CONF/master`

# Substitute the jobtracker host into the per-job hadoop-site.xml
sed 's/XXXXXX/'${HADOOP_MASTER}'/g' \
    ${PBS_O_WORKDIR}/HADOOP-CONF/hadoop-site.xml.template \
    > ${PBS_O_WORKDIR}/HADOOP-CONF/hadoop-site.xml


cd $PBS_O_WORKDIR

source /usr/local/Modules/3.2.5/init/bash
module load JAVA-1.6

echo "Start @" `date`

# Start the per-job MapReduce services (jobtracker + tasktrackers)
/opt/hadoop-0.18.0/bin/start-mapred.sh --config ~/HADOOP-CONF

# Example workload: randomwriter followed by sort
/opt/hadoop-0.18.0/bin/hadoop --config ~/HADOOP-CONF jar \
    /opt/hadoop-0.18.0/hadoop-0.18.0-examples.jar randomwriter \
    -Dtest.randomwriter.maps_per_host=2 \
    -Dtest.randomwrite.bytes_per_map=268435456 rand

/opt/hadoop-0.18.0/bin/hadoop --config ~/HADOOP-CONF jar \
    /opt/hadoop-0.18.0/hadoop-0.18.0-examples.jar sort rand rand-sort

# /opt/hadoop-0.18.0/bin/hadoop --config ~/HADOOP-CONF dfs -ls

# Tear the MapReduce services down again when the job is done
/opt/hadoop-0.18.0/bin/stop-mapred.sh --config ~/HADOOP-CONF

echo "End @" `date`

exit 0
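
The job is then submitted like any other TORQUE job, for example (the
script file name is arbitrary, and the "hadoop" queue referenced by
"#PBS -q" must exist in your TORQUE/MAUI configuration):

qsub hadoop-mapred.pbs
qstat -u $USER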


This procedure allows me to:
- use a persistent HDFS on the whole cluster, placing the namenode on the
frontend (always up and running) and the datanodes on the other nodes
- submit a lot of jobs to the resource manager transparently, without any
problem, and manage job priority/reservation with MAUI just as for other
classical HPC jobs
- execute the jobtracker and tasktracker services on the nodes chosen by
TORQUE (in particular, the first node selected becomes the jobtracker)
- store logs for different users in separate directories
- run only one job at a time (though multiple map/reduce jobs can probably
run together, because different jobs use different subsets of nodes)

Probably HOD does what I can do with my raw script... it's possible that I
don't understand the user guide well...

Sorry for my english :-P

Regards

2008/9/2 Hemanth Yamijala <yh...@yahoo-inc.com>

> Allen Wittenauer wrote:
>
>>
>> On 8/18/08 11:33 AM, "Filippo Spiga" <sp...@gmail.com> wrote:
>>
>>
>>> Well, but I haven't understood how I should configure HOD to work in
>>> this manner.
>>>
>>> For HDFS I follow this sequence of steps:
>>> - conf/master contains only the master node of my cluster
>>> - conf/slaves contains all nodes
>>> - I start HDFS using bin/start-dfs.sh
>>>
>>>
>>
>>    Right, fine...
>>
>>
>>
>>> Potentially I would like all nodes to be usable for MapReduce.
>>> For HOD, which parameter should I set in contrib/hod/conf/hodrc? Should I
>>> change only the gridservice-hdfs section?
>>>
>>>
>>
>>    I was hoping the HOD folks would answer this question for you, but they
>> are apparently sleeping. :)
>>
>>
>>
> Whoops! Sorry, I missed this.
>
>>    Anyway, yes, if you point gridservice-hdfs to a static HDFS,  it should
>> use that as the -default- HDFS. That doesn't prevent a user from using HOD
>> to create a custom HDFS as part of their job submission.
>>
>>
>>
> Allen's answer is perfect. Please refer to
> http://hadoop.apache.org/core/docs/current/hod_user_guide.html#Using+an+external+HDFS
> for more information about how to set up the gridservice-hdfs section to
> use a static or
> external HDFS.
>
>
>


-- 
Filippo Spiga
Computational Physics and Complex Systems Laboratory (FISLAB) -
http://www.fislab.disco.unimib.it
Department of Informatics, Systems and Communication (DISCo) - University of
Milano-Bicocca
mobile: +393408387735
Skype: filippo.spiga

There is only one way to forget time: to use it.
-- Baudelaire, "Diari intimi"

Re: Integrate HADOOP and Map/Reduce paradigm into HPC environment

Posted by Hemanth Yamijala <yh...@yahoo-inc.com>.
Allen Wittenauer wrote:
>
> On 8/18/08 11:33 AM, "Filippo Spiga" <sp...@gmail.com> wrote:
>   
>> Well, but I haven't understood how I should configure HOD to work in this
>> manner.
>>
>> For HDFS I follow this sequence of steps:
>> - conf/master contains only the master node of my cluster
>> - conf/slaves contains all nodes
>> - I start HDFS using bin/start-dfs.sh
>>     
>
>     Right, fine...
>
>   
>> Potentially I would like all nodes to be usable for MapReduce.
>> For HOD, which parameter should I set in contrib/hod/conf/hodrc? Should I
>> change only the gridservice-hdfs section?
>>     
>
>     I was hoping the HOD folks would answer this question for you, but they
> are apparently sleeping. :)
>
>   
Whoops! Sorry, I missed this.
>     Anyway, yes, if you point gridservice-hdfs to a static HDFS,  it should
> use that as the -default- HDFS. That doesn't prevent a user from using HOD
> to create a custom HDFS as part of their job submission.
>
>   
Allen's answer is perfect. Please refer to 
http://hadoop.apache.org/core/docs/current/hod_user_guide.html#Using+an+external+HDFS
for more information about how to set up the gridservice-hdfs section to 
use a static or
external HDFS.