Posted to common-user@hadoop.apache.org by Merto Mertek <ma...@gmail.com> on 2011/09/23 16:59:38 UTC

Environment consideration for a research on scheduling

Hi,
in the first phase we are planning to establish a small cluster with a few
commodity computers (each with 1 GB of RAM and a 200 GB disk). The cluster
would run Ubuntu Server 10.10 and a Hadoop build from the 0.20.204 branch
(I had some issues with version 0.20.203 due to missing
libraries<http://hadoop-common.472056.n3.nabble.com/Development-enviroment-problems-eclipse-hadoop-0-20-203-td3186022.html#a3188567>).
Would you suggest any other version?
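
For context, the kind of minimal site configuration we have in mind is
roughly the following (the property names are the standard 0.20.x ones;
the hostname "master" and the values are just placeholders for our setup):

  <!-- conf/core-site.xml: where the namenode lives -->
  <property>
    <name>fs.default.name</name>
    <value>hdfs://master:9000</value>
  </property>

  <!-- conf/hdfs-site.xml: small cluster, so a lower replication factor -->
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>

  <!-- conf/mapred-site.xml: where the jobtracker lives -->
  <property>
    <name>mapred.job.tracker</name>
    <value>master:9001</value>
  </property>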

In the second phase we are planning to analyse, test and modify some of the
Hadoop schedulers.

Now I am interested in the best way to deploy Ubuntu and Hadoop to these few
machines. I was thinking of configuring the system in a local VM and then
converting it to each physical machine, but that is probably not the best
option. If you know of a better way, please share.

Thank you!

Re: Environment consideration for a research on scheduling

Posted by Merto Mertek <ma...@gmail.com>.
I agree, we will go the standard route. As you suggested, we will move step
by step to the full cluster deployment. After configuring the first node we
will replicate it with Clonezilla and then set up the remaining nodes one by one.
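
Roughly, the per-node adjustments we expect to make after cloning would be
along these lines (hostnames, addresses and the install path are just
placeholders for our setup):

  # on each cloned worker, give the node its own identity
  sudo hostname worker1
  echo worker1 | sudo tee /etc/hostname

  # make sure every node can resolve every other node (or use DNS), e.g.
  #   192.168.1.10  master
  #   192.168.1.11  worker1
  #   192.168.1.12  worker2

  # on the master, list the workers so start-dfs.sh / start-mapred.sh reach them
  printf 'worker1\nworker2\n' > /usr/local/hadoop/conf/slaves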

On the worker nodes I was thinking of running Ubuntu Server; the namenode
will run Ubuntu Desktop. I am interested in how to configure the environment
so that I am able to remotely monitor, analyse and configure the cluster. I
will run jobs from outside the local network via SSH to the namenode, but in
this situation I will not be able to access the web interfaces of the
jobtracker and tasktrackers. So I am wondering how to analyze them, and how
you configured your environment to be as practical as possible.
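
One idea I had is to tunnel the web UI ports over the same SSH connection to
the namenode, something like this (50030 and 50070 should be the default
jobtracker and namenode web ports, if I am not mistaken; user and host names
are placeholders):

  # forward the jobtracker and namenode web UIs through the namenode host
  ssh -L 50030:localhost:50030 -L 50070:localhost:50070 user@namenode.example.org
  # then browse http://localhost:50030/ and http://localhost:50070/ from home

  # or open a SOCKS proxy and point the browser at it, which also reaches the
  # tasktracker pages (port 50060) on the worker nodes
  ssh -D 1080 user@namenode.example.org

Is that also how others do it, or is there a nicer setup?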

For monitoring the cluster I saw that Ganglia is one option, but at this
stage of testing the job-history files will probably be enough.
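
If we stick with the job-history files, I assume we can also pull per-job
details from the command line with something like this (the output path is
just an example):

  # print counters and task/attempt details for a finished job, given its output dir
  bin/hadoop job -history /user/merto/wordcount-output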

On 23 September 2011 17:09, GOEKE, MATTHEW (AG/1000) <
matthew.goeke@monsanto.com> wrote:

> If you are starting from scratch with no prior Hadoop install experience I
> would configure stand-alone, migrate to pseudo distributed and then to fully
> distributed verifying functionality at each step by doing a simple word
> count run. Also, if you don't mind using the CDH distribution then SCM /
> their rpms will greatly simplify both the bin installs as well as the user
> creation.
>
> Your VM route will most likely work but I can imagine the amount of hiccups
> during migration from that to the real cluster will not make it worth your
> time.
>
> Matt
>
> -----Original Message-----
> From: Merto Mertek [mailto:masmertoz@gmail.com]
> Sent: Friday, September 23, 2011 10:00 AM
> To: common-user@hadoop.apache.org
> Subject: Environment consideration for a research on scheduling
>
> Hi,
> in the first phase we are planning to establish a small cluster with a few
> commodity computers (each with 1 GB of RAM and a 200 GB disk). The cluster
> would run Ubuntu Server 10.10 and a Hadoop build from the 0.20.204 branch
> (I had some issues with version 0.20.203 due to missing
> libraries<http://hadoop-common.472056.n3.nabble.com/Development-enviroment-problems-eclipse-hadoop-0-20-203-td3186022.html#a3188567>).
> Would you suggest any other version?
>
> In the second phase we are planning to analyse, test and modify some of the
> Hadoop schedulers.
>
> Now I am interested in the best way to deploy Ubuntu and Hadoop to these few
> machines. I was thinking of configuring the system in a local VM and then
> converting it to each physical machine, but that is probably not the best
> option. If you know of a better way, please share.
>
> Thank you!

Re: Environment consideration for a research on scheduling

Posted by Merto Mertek <ma...@gmail.com>.
The Desktop edition was chosen just to run the namenode and to monitor cluster
statistics. The worker nodes were chosen to run the Ubuntu Server edition
because we found this configuration in several research papers. One such
configuration can be found in the paper on the LATE scheduler (is any source
code for it available, or is it integrated into the new fair scheduler?).

Thanks for the suggested tools.

On 26 September 2011 11:41, Steve Loughran <st...@apache.org> wrote:

> On 23/09/11 16:09, GOEKE, MATTHEW (AG/1000) wrote:
>
>> If you are starting from scratch with no prior Hadoop install experience I
>> would configure stand-alone, migrate to pseudo distributed and then to fully
>> distributed verifying functionality at each step by doing a simple word
>> count run. Also, if you don't mind using the CDH distribution then SCM /
>> their rpms will greatly simplify both the bin installs as well as the user
>> creation.
>>
>> Your VM route will most likely work but I can imagine the amount of
>> hiccups during migration from that to the real cluster will not make it
>> worth your time.
>>
>> Matt
>>
>> -----Original Message-----
>> From: Merto Mertek [mailto:masmertoz@gmail.com]
>> Sent: Friday, September 23, 2011 10:00 AM
>> To: common-user@hadoop.apache.org
>> Subject: Environment consideration for a research on scheduling
>>
>> Hi,
>> in the first phase we are planning to establish a small cluster with a few
>> commodity computers (each with 1 GB of RAM and a 200 GB disk). The cluster
>> would run Ubuntu Server 10.10 and a Hadoop build from the 0.20.204 branch
>> (I had some issues with version 0.20.203 due to missing
>> libraries<http://hadoop-common.472056.n3.nabble.com/Development-enviroment-problems-eclipse-hadoop-0-20-203-td3186022.html#a3188567>).
>> Would you suggest any other version?
>>
>
> I wouldn't rush to put Ubuntu 10.x on; it makes a good desktop, but RHEL and
> CentOS are the platforms of choice on the server side.
>
>
>
>
>> In the second phase we are planning to analyse, test and modify some of the
>> Hadoop schedulers.
>>
>
> The main schedulers used by Y! and FB are fairly tuned for their workloads,
> and not apparently something you'd want to play with. There is at least one
> other scheduler in the contrib/ dir to play with.
>
> The other thing about scheduling is that you may have a faster development
> cycle if, instead of working on a real cluster, you simulate it at multiples
> of real time, using stats collected from your own workload by way of the
> gridmix2 tools. I've never done scheduling work, but I think there's some
> stuff there to do that; if not, it's a possible contribution.
>
> Be aware that the changes in 0.23+ will change resource scheduling; this
> may be a better place to do development with a plan to deploy in 2012. Oh,
> and get on the mapreduce lists, esp, the -dev list, to discuss issues
>
>
>
>  The information contained in this email may be subject to the export
>> control laws and regulations of the United States, potentially
>> including but not limited to the Export Administration Regulations (EAR)
>> and sanctions regulations issued by the U.S. Department of
>> Treasury, Office of Foreign Asset Controls (OFAC).  As a recipient of this
>> information you are obligated to comply with all
>> applicable U.S. export laws and regulations.
>>
>>
> I have no idea what that means but am not convinced that reading an email
> forces me to comply with a different country's rules
>

Re: Environment consideration for a research on scheduling

Posted by Steve Loughran <st...@apache.org>.
On 23/09/11 16:09, GOEKE, MATTHEW (AG/1000) wrote:
> If you are starting from scratch with no prior Hadoop install experience I would configure stand-alone, migrate to pseudo distributed and then to fully distributed verifying functionality at each step by doing a simple word count run. Also, if you don't mind using the CDH distribution then SCM / their rpms will greatly simplify both the bin installs as well as the user creation.
>
> Your VM route will most likely work but I can imagine the amount of hiccups during migration from that to the real cluster will not make it worth your time.
>
> Matt
>
> -----Original Message-----
> From: Merto Mertek [mailto:masmertoz@gmail.com]
> Sent: Friday, September 23, 2011 10:00 AM
> To: common-user@hadoop.apache.org
> Subject: Environment consideration for a research on scheduling
>
> Hi,
> in the first phase we are planning to establish a small cluster with a few
> commodity computers (each with 1 GB of RAM and a 200 GB disk). The cluster
> would run Ubuntu Server 10.10 and a Hadoop build from the 0.20.204 branch
> (I had some issues with version 0.20.203 due to missing
> libraries<http://hadoop-common.472056.n3.nabble.com/Development-enviroment-problems-eclipse-hadoop-0-20-203-td3186022.html#a3188567>).
> Would you suggest any other version?

I wouldn't rush to put Ubuntu 10.x on; it makes a good desktop, but RHEL
and CentOS are the platforms of choice on the server side.


>
> In the second phase we are planning to analyse, test and modify some of the
> Hadoop schedulers.

The main schedulers used by Y! and FB are fairly tuned for their 
workloads, and not apparently something you'd want to play with. There 
is at least one other scheduler in the contrib/ dir to play with.
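
If you do want to play with one of the contrib schedulers, wiring it in is
just a jar on the classpath plus a property in mapred-site.xml; for the fair
scheduler it is roughly this (the exact jar name depends on the release):

  # put the scheduler on the jobtracker's classpath
  cp contrib/fairscheduler/hadoop-*-fairscheduler.jar lib/

  <!-- mapred-site.xml -->
  <property>
    <name>mapred.jobtracker.taskScheduler</name>
    <value>org.apache.hadoop.mapred.FairScheduler</value>
  </property>

then restart the jobtracker. Much the same applies to the capacity scheduler.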

The other thing about scheduling is that you may have a faster 
development cycle if, instead of working on a real cluster, you simulate 
it at multiples of real time, using stats collected from your own 
workload by way of the gridmix2 tools. I've never done scheduling work, 
but I think there's some stuff there to do that; if not, it's a possible 
contribution.

Be aware that the changes in 0.23+ will change resource scheduling; this 
may be a better place to do development, with a plan to deploy in 2012. 
Oh, and get on the mapreduce lists, esp. the -dev list, to discuss issues.


> The information contained in this email may be subject to the export control laws and regulations of the United States, potentially
> including but not limited to the Export Administration Regulations (EAR) and sanctions regulations issued by the U.S. Department of
> Treasury, Office of Foreign Asset Controls (OFAC).  As a recipient of this information you are obligated to comply with all
> applicable U.S. export laws and regulations.
>

I have no idea what that means but am not convinced that reading an 
email forces me to comply with a different country's rules

RE: Environment consideration for a research on scheduling

Posted by "GOEKE, MATTHEW (AG/1000)" <ma...@monsanto.com>.
If you are starting from scratch with no prior Hadoop install experience I would configure stand-alone, migrate to pseudo-distributed and then to fully distributed, verifying functionality at each step by doing a simple word count run. Also, if you don't mind using the CDH distribution then SCM / their rpms will greatly simplify both the bin installs as well as the user creation.
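
The verification run at each stage can be as simple as the bundled word
count example (the jar name varies by release):

  # put some text into the filesystem and run the example job
  bin/hadoop fs -mkdir input
  bin/hadoop fs -put conf/*.xml input
  bin/hadoop jar hadoop-examples-*.jar wordcount input output
  bin/hadoop fs -cat 'output/part-*' | head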

Your VM route will most likely work but I can imagine the amount of hiccups during migration from that to the real cluster will not make it worth your time.

Matt 

-----Original Message-----
From: Merto Mertek [mailto:masmertoz@gmail.com] 
Sent: Friday, September 23, 2011 10:00 AM
To: common-user@hadoop.apache.org
Subject: Environment consideration for a research on scheduling

Hi,
in the first phase we are planning to establish a small cluster with a few
commodity computers (each with 1 GB of RAM and a 200 GB disk). The cluster
would run Ubuntu Server 10.10 and a Hadoop build from the 0.20.204 branch
(I had some issues with version 0.20.203 due to missing
libraries<http://hadoop-common.472056.n3.nabble.com/Development-enviroment-problems-eclipse-hadoop-0-20-203-td3186022.html#a3188567>).
Would you suggest any other version?

In the second phase we are planning to analyse, test and modify some of the
Hadoop schedulers.

Now I am interested in the best way to deploy Ubuntu and Hadoop to these few
machines. I was thinking of configuring the system in a local VM and then
converting it to each physical machine, but that is probably not the best
option. If you know of a better way, please share.

Thank you!
This e-mail message may contain privileged and/or confidential information, and is intended to be received only by persons entitled
to receive such information. If you have received this e-mail in error, please notify the sender immediately. Please delete it and
all attachments from any servers, hard drives or any other media. Other use of this e-mail by you is strictly prohibited.

All e-mails and attachments sent and received are subject to monitoring, reading and archival by Monsanto, including its
subsidiaries. The recipient of this e-mail is solely responsible for checking for the presence of "Viruses" or other "Malware".
Monsanto, along with its subsidiaries, accepts no liability for any damage caused by any such code transmitted by or accompanying
this e-mail or any attachment.


The information contained in this email may be subject to the export control laws and regulations of the United States, potentially
including but not limited to the Export Administration Regulations (EAR) and sanctions regulations issued by the U.S. Department of
Treasury, Office of Foreign Asset Controls (OFAC).  As a recipient of this information you are obligated to comply with all
applicable U.S. export laws and regulations.