Posted to common-user@hadoop.apache.org by "Periya.Data" <pe...@gmail.com> on 2011/11/29 21:28:53 UTC

choices for deploying a small hadoop cluster on EC2

Hi All,
        I am just beginning to learn how to deploy a small (3-node) cluster
on EC2. After some quick Googling, I see the following approaches:

   1. Use Whirr for quick deployment and tear-down. It uses CDH3. Does it
   have features for persistence (EBS)?
   2. CDH Cloud Scripts - these come with an EC2 AMI, again for temporary
   Hadoop clusters/POCs etc. Good stuff - I can persist data using EBS
   snapshots. But this uses CDH2.
   3. Install Hadoop and related tools like Hive manually on each cluster
   node on EC2 (or use an automation tool like Chef). I would prefer not to.
   4. The Hadoop distribution ships with EC2 scripts (under src/contrib),
   and several Hadoop EC2 AMIs are available. I have not studied them enough
   to know whether they are easy for a beginner like me.
   5. Anything else?

Options 1 and 2 look promising for a beginner. If any of you have thoughts
on this, I would like to hear them (what to keep in mind, what to take care
of, caveats, etc.). I want my data/config to persist (using EBS) so I can
continue from where I left off after a few days. Also, I want to have Hive
and Sqoop installed. Can this be done using 1 or 2, or will they have to be
installed manually after I set up the cluster?
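
For concreteness, here is the kind of Whirr setup I imagine for option 1 (a
rough, untested sketch: the property names come from the Whirr docs, and the
cluster name and instance size are placeholders I made up):

    # hadoop.properties -- minimal Whirr config for a 3-node cluster (sketch)
    whirr.cluster-name=pd-test-cluster
    whirr.provider=aws-ec2
    whirr.identity=${env:AWS_ACCESS_KEY_ID}
    whirr.credential=${env:AWS_SECRET_ACCESS_KEY}
    # one master (namenode+jobtracker), two workers (datanode+tasktracker)
    whirr.instance-templates=1 hadoop-namenode+hadoop-jobtracker,2 hadoop-datanode+hadoop-tasktracker
    whirr.hardware-id=m1.large

    # then, from the shell: launch the cluster, and tear it down when done
    whirr launch-cluster --config hadoop.properties
    whirr destroy-cluster --config hadoop.properties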

Thanks very much,

PD.

Re: choices for deploying a small hadoop cluster on EC2

Posted by "Periya.Data" <pe...@gmail.com>.
Thanks for all your help and replies. Though I am leaning towards option 1
or 2, I looked up BigTop, an Apache Incubator project, but could not find
enough information on its website. I have a few more questions, and I hope
they still apply to this mailing list.

1. Cos: Can you please point me to a link that talks about BigTop & EC2?

2. Regarding Whirr, can I just choose an Ubuntu EBS-backed AMI? Would that
be any different from choosing a normal Hadoop AMI and (later) trying to
attach and mount an EBS volume on that instance? (Rough commands for this
are sketched after question 3.)

3. John: I like your idea of using S3 to store input and output. But say I
start a Hadoop cluster, configure Sqoop and Hive, and run a job. Then, after
I get my output in S3, I either stop or terminate the cluster (since I do
not have EBS, I don't care which). Now, after a while, I want to bring up a
similar cluster and run Hive and Sqoop again for more experiments. In this
case, will I have to redo all my Sqoop settings, Hive table schemas, etc.?
Because I think that once I "stop" an instance I will lose the configs, and
when I restart a Hadoop AMI I will only have Hadoop nicely running in that
instance and nothing else.
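
For question 2, the attach-and-mount route I am picturing is roughly this (a
sketch using the standard EC2 API tools; the volume/instance IDs and device
name are placeholders):

    # attach an existing EBS volume to a running instance
    ec2-attach-volume vol-12345678 -i i-87654321 -d /dev/sdf

    # on the instance: format once, then mount it to hold persistent state
    sudo mkfs.ext3 /dev/sdf       # first use only -- this erases the volume
    sudo mkdir -p /mnt/ebs
    sudo mount /dev/sdf /mnt/ebs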

I ideally want everything to persist, even configs and newly installed
tools (Hive, Sqoop). Or should I create a custom Ubuntu AMI with Hadoop,
Sqoop, Hive, etc. "pre-cooked" into it? That is probably the ideal way to
proceed, even if it is a little painful. I think I really want an EBS-backed
instance, since it keeps its internal state across stop and restart.
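
If I go the "pre-cooked" route, I assume the baking step would be something
like this once Hadoop, Hive, and Sqoop are installed and configured on a
running EBS-backed instance (a sketch; the instance ID and image name are
placeholders):

    # snapshot the configured instance into a reusable private AMI
    ec2-create-image i-87654321 -n "hadoop-hive-sqoop-v1" \
        -d "Ubuntu + Hadoop + Hive + Sqoop, pre-configured"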

 Please let me know your opinions. This discussion is deviating a bit from
where it originally started.

A little Googling turned up similar posts:
https://forums.aws.amazon.com/message.jspa?messageID=131157


I know I could find these things out by trying them, but I want to lessen
the burden of the trial-and-error process.

Thanks very much,
PD.


On Tue, Nov 29, 2011 at 12:40 PM, Konstantin Boudnik <co...@apache.org> wrote:

> I'd suggest you use the bits produced by BigTop (cross-posting to the
> bigtop-dev@ list), which also come with Puppet recipes allowing for fully
> automated deployment and configuration. BigTop also uses the Jenkins EC2
> plugin for the deployment part, and it seems to work really well!
>
> Cos
>
> On Tue, Nov 29, 2011 at 12:28PM, Periya.Data wrote:
> > [...]
>

Re: choices for deploying a small hadoop cluster on EC2

Posted by Konstantin Boudnik <co...@apache.org>.
I'd suggest you use the bits produced by BigTop (cross-posting to the
bigtop-dev@ list), which also come with Puppet recipes allowing for fully
automated deployment and configuration. BigTop also uses the Jenkins EC2
plugin for the deployment part, and it seems to work really well!

Cos

On Tue, Nov 29, 2011 at 12:28PM, Periya.Data wrote:
> [...]

Re: choices for deploying a small hadoop cluster on EC2

Posted by Prashant Sharma <pr...@gmail.com>.
Yes, the Pallet library: https://github.com/pallet/pallet-hadoop-example


On Wed, Nov 30, 2011 at 1:58 AM, Periya.Data <pe...@gmail.com> wrote:

> [...]


Re: choices for deploying a small hadoop cluster on EC2

Posted by John Conwell <jo...@iamjohn.me>.
I'm a big fan of Whirr, though I don't think it supports EBS persistence. My
Hadoop deployment strategy has always been: store input data on S3, spin up
a Hadoop cluster with either Whirr or Elastic MapReduce, run the job, write
the output back to S3, and kill the cluster.
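
Concretely, the round trip looks something like this (a rough sketch; the
bucket names and job jar are placeholders, and the -D credential flags
assume the job's main class uses Hadoop's GenericOptionsParser/ToolRunner --
you can also put the keys in core-site.xml instead):

    # spin up a throwaway cluster, run the job against S3, tear it all down
    whirr launch-cluster --config hadoop.properties

    hadoop jar my-job.jar com.example.MyJob \
        -D fs.s3n.awsAccessKeyId=$AWS_ACCESS_KEY_ID \
        -D fs.s3n.awsSecretAccessKey=$AWS_SECRET_ACCESS_KEY \
        s3n://my-bucket/input s3n://my-bucket/output

    whirr destroy-cluster --config hadoop.properties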


On Tue, Nov 29, 2011 at 12:28 PM, Periya.Data <pe...@gmail.com> wrote:

> [...]



-- 

Thanks,
John C