Posted to user@mahout.apache.org by deneche abdelhakim <a_...@yahoo.fr> on 2010/01/11 05:03:31 UTC

Re : Good starting instance for AMI

I use the Cloudera distribution and it works just fine. It already includes Java and Hadoop.

http://archive.cloudera.com/docs/ec2.html

The default AMI uses Hadoop 0.18.3 but you can launch a special AMI with Hadoop 0.20 using the following command:

% hadoop-ec2 launch-cluster --env REPO=testing --env HADOOP_VERSION=0.20 \
  my-hadoop-cluster 10
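
For reference, a couple of companion commands from the same Cloudera script are handy once the cluster is up (subcommand names come from the linked Cloudera docs; treat the exact invocations as a hedged sketch):

% hadoop-ec2 login my-hadoop-cluster              # ssh to the master node
% hadoop-ec2 terminate-cluster my-hadoop-cluster  # shut everything down when finished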


--- On Mon, 11.1.10, Grant Ingersoll <gs...@apache.org> wrote:

> From: Grant Ingersoll <gs...@apache.org>
> Subject: Good starting instance for AMI
> To: mahout-user@lucene.apache.org
> Date: Monday, January 11, 2010, 0:16
> Anyone have recs on a good AMI to
> start with on EC2 to load with Mahout?  Preferably
> Linux and already has Java 1.6 installed.
> 
> Thanks,
> Grant


      

Re: Re : Good starting instance for AMI

Posted by Robin Anil <ro...@gmail.com>.
Since I don't have a personal Linux box these days, I code in Eclipse on
Windows, then fire up an instance, attach the EBS volume, and patch and test my code.
And yes, I have only tried a single node so far.


On Tue, Jan 12, 2010 at 8:55 AM, Liang Chenmin <li...@gmail.com>wrote:

> I first followed the tutorial about running Mahout on EMR; the command lines
> need some revision, though.
>
> On Mon, Jan 11, 2010 at 6:44 PM, deneche abdelhakim <a_deneche@yahoo.fr
> >wrote:
>
> > I used Cloudera's with Mahout to test the Decision Forest implementation.
> >
> > --- On Mon, 11.1.10, Grant Ingersoll <gs...@apache.org> wrote:
> >
> > > From: Grant Ingersoll <gs...@apache.org>
> > > Subject: Re: Re : Good starting instance for AMI
> > > To: mahout-user@lucene.apache.org
> > > Date: Monday, January 11, 2010, 20:51
> > > One quick question for all who
> > > responded:
> > > How many have tried Mahout with the setup they
> > > recommended?
> > >
> > > -Grant
> > >
> > > On Jan 11, 2010, at 10:43 AM, zaki rahaman wrote:
> > >
> > > > Some comments on Cloudera's Hadoop (CDH) and Elastic
> > > MapReduce (EMR).
> > > >
> > > > I have used both to get hadoop jobs up and running
> > > (although my EMR use has
> > > > mostly been limited to running batch Pig scripts
> > > weekly). Deciding on which
> > > > one to use really depends on what kind of job/data
> > > you're working with.
> > > >
> > > > EMR is most useful if you're already storing the
> > > dataset you're using on S3
> > > > and plan on running a one-off job. My understanding is
> > > that it's configured
> > > > to use jets3t to stream data from s3 rather than
> > > copying it to the cluster,
> > > > which is fine for a single pass over a small to medium
> > > sized dataset, but
> > > > obviously slower for multiple passes or larger
> > > datasets. The API is also
> > > > useful if you have a set workflow that you plan to run
> > > on a regular basis,
> > > > and I often prototype quick and dirty jobs on very
> > > small EMR clusters to
> > > > test how some things run in the wild (obviously not
> > > the most cost effective
> > > > solution, but I've found pseudo-distributed mode
> > > doesn't catch everything).
> > > >
> > > > CDH gives you greater control over the initial setup
> > > and configuration of
> > > > your cluster. From my understanding, it's not really
> > > an AMI. Rather, it's a
> > > > set of Python scripts that's been modified from the
> > > ec2 scripts from
> > > > hadoop/contrib with some nifty additions like being
> > > able to specify and set
> > > > up EBS volumes, proxy on the cluster, and some others.
> > > The scripts use the
> > > > boto Python module (a very useful Python module for
> > > working with EC2) to
> > > > make a request to EC2 to setup a specified sized
> > > cluster with whatever
> > > > vanilla AMI that's specified. It sets up the security
> > > groups and opens up
> > > > the relevant ports and it then passes the init script
> > > to each of the
> > > > instances once they've booted (same user-data file
> > > setup which is limited to
> > > > 16K I believe). The init script tells each node to
> > > download hadoop (from
> > > > Clouderas OS-specific repos) and any other
> > > user-specified packages and set
> > > > them up. The hadoop config xml is hardcoded into the
> > > init script (although
> > > > you can pass a modified config beforehand). The master
> > > is started first, and
> > > > then the slaves are started so that the slaves can be
> > > given info about what
> > > > NN and JT to connect to (the config uses the public
> > > DNS I believe to make
> > > > things easier to set up). You can use either 0.18.3
> > > (CDH) or 0.20 (CDH2)
> > > > when it comes to Hadoop versions, although I've had
> > > mixed results with the
> > > > latter.
> > > >
> > > > Personally, I'd still like some kind of facade or
> > > something similar to
> > > > further abstract things and make it easier for others
> > > to quickly set up
> > > > ad-hoc clusters for 'quick n dirty' jobs. I know other
> > > libraries like Crane
> > > > have been released recently, but given the language of
> > > choice (Clojure), I
> > > > haven't yet had a chance to really investigate.
> > > >
> > > > On Mon, Jan 11, 2010 at 2:56 AM, Ted Dunning <te...@gmail.com>
> > > wrote:
> > > >
> > > >> Just use several of these files.
> > > >>
> > > >> On Sun, Jan 10, 2010 at 10:44 PM, Liang Chenmin
> > > <liangchenmin04@gmail.com
> > > >>> wrote:
> > > >>
> > > >>> EMR requires S3 bucket, but S3 instance has a
> > > limit of file
> > > >>> size(5GB), so need some extra care here. Has
> > > any one encounter the file
> > > >>> size
> > > >>> problem on S3 also? I kind of think that it's
> > > unreasonable to have a  5G
> > > >>> size limit when we want to use the system to
> > > deal with large data set.
> > > >>>
> > > >>
> > > >>
> > > >>
> > > >> --
> > > >> Ted Dunning, CTO
> > > >> DeepDyve
> > > >>
> > > >
> > > >
> > > >
> > > > --
> > > > Zaki Rahaman
> > >
> > > --------------------------
> > > Grant Ingersoll
> > > http://www.lucidimagination.com/
> > >
> > > Search the Lucene ecosystem using Solr/Lucene:
> > http://www.lucidimagination.com/search
> > >
> > >
> >
> >
> >
> >
>
>
> --
> Chenmin Liang
> Language Technologies Institute, School of Computer Science
> Carnegie Mellon University
>

Re: Re : Good starting instance for AMI

Posted by Liang Chenmin <li...@gmail.com>.
I first followed the tutorial about running Mahout on EMR; the command lines
need some revision, though.

On Mon, Jan 11, 2010 at 6:44 PM, deneche abdelhakim <a_...@yahoo.fr>wrote:

> I used Cloudera's with Mahout to test the Decision Forest implementation.
>
> --- On Mon, 11.1.10, Grant Ingersoll <gs...@apache.org> wrote:
>
> > From: Grant Ingersoll <gs...@apache.org>
> > Subject: Re: Re : Good starting instance for AMI
> > To: mahout-user@lucene.apache.org
> > Date: Monday, January 11, 2010, 20:51
> > One quick question for all who
> > responded:
> > How many have tried Mahout with the setup they
> > recommended?
> >
> > -Grant
> >
> > On Jan 11, 2010, at 10:43 AM, zaki rahaman wrote:
> >
> > > Some comments on Cloudera's Hadoop (CDH) and Elastic
> > MapReduce (EMR).
> > >
> > > I have used both to get hadoop jobs up and running
> > (although my EMR use has
> > > mostly been limited to running batch Pig scripts
> > weekly). Deciding on which
> > > one to use really depends on what kind of job/data
> > you're working with.
> > >
> > > EMR is most useful if you're already storing the
> > dataset you're using on S3
> > > and plan on running a one-off job. My understanding is
> > that it's configured
> > > to use jets3t to stream data from s3 rather than
> > copying it to the cluster,
> > > which is fine for a single pass over a small to medium
> > sized dataset, but
> > > obviously slower for multiple passes or larger
> > datasets. The API is also
> > > useful if you have a set workflow that you plan to run
> > on a regular basis,
> > > and I often prototype quick and dirty jobs on very
> > small EMR clusters to
> > > test how some things run in the wild (obviously not
> > the most cost effective
> > > solution, but I've found pseudo-distributed mode
> > doesn't catch everything).
> > >
> > > CDH gives you greater control over the initial setup
> > and configuration of
> > > your cluster. From my understanding, it's not really
> > an AMI. Rather, it's a
> > > set of Python scripts that's been modified from the
> > ec2 scripts from
> > > hadoop/contrib with some nifty additions like being
> > able to specify and set
> > > up EBS volumes, proxy on the cluster, and some others.
> > The scripts use the
> > > boto Python module (a very useful Python module for
> > working with EC2) to
> > > make a request to EC2 to setup a specified sized
> > cluster with whatever
> > > vanilla AMI that's specified. It sets up the security
> > groups and opens up
> > > the relevant ports and it then passes the init script
> > to each of the
> > > instances once they've booted (same user-data file
> > setup which is limited to
> > > 16K I believe). The init script tells each node to
> > download hadoop (from
> > > Clouderas OS-specific repos) and any other
> > user-specified packages and set
> > > them up. The hadoop config xml is hardcoded into the
> > init script (although
> > > you can pass a modified config beforehand). The master
> > is started first, and
> > > then the slaves are started so that the slaves can be
> > given info about what
> > > NN and JT to connect to (the config uses the public
> > DNS I believe to make
> > > things easier to set up). You can use either 0.18.3
> > (CDH) or 0.20 (CDH2)
> > > when it comes to Hadoop versions, although I've had
> > mixed results with the
> > > latter.
> > >
> > > Personally, I'd still like some kind of facade or
> > something similar to
> > > further abstract things and make it easier for others
> > to quickly set up
> > > ad-hoc clusters for 'quick n dirty' jobs. I know other
> > libraries like Crane
> > > have been released recently, but given the language of
> > choice (Clojure), I
> > > haven't yet had a chance to really investigate.
> > >
> > > On Mon, Jan 11, 2010 at 2:56 AM, Ted Dunning <te...@gmail.com>
> > wrote:
> > >
> > >> Just use several of these files.
> > >>
> > >> On Sun, Jan 10, 2010 at 10:44 PM, Liang Chenmin
> > <liangchenmin04@gmail.com
> > >>> wrote:
> > >>
> > >>> EMR requires S3 bucket, but S3 instance has a
> > limit of file
> > >>> size(5GB), so need some extra care here. Has
> > any one encounter the file
> > >>> size
> > >>> problem on S3 also? I kind of think that it's
> > unreasonable to have a  5G
> > >>> size limit when we want to use the system to
> > deal with large data set.
> > >>>
> > >>
> > >>
> > >>
> > >> --
> > >> Ted Dunning, CTO
> > >> DeepDyve
> > >>
> > >
> > >
> > >
> > > --
> > > Zaki Rahaman
> >
> > --------------------------
> > Grant Ingersoll
> > http://www.lucidimagination.com/
> >
> > Search the Lucene ecosystem using Solr/Lucene:
> http://www.lucidimagination.com/search
> >
> >
>
>
>
>


-- 
Chenmin Liang
Language Technologies Institute, School of Computer Science
Carnegie Mellon University

Re: Re : Good starting instance for AMI

Posted by Ted Dunning <te...@gmail.com>.
I have only run Mahout on a single node or a fixed cluster in our data
center.

On Mon, Jan 11, 2010 at 11:51 AM, Grant Ingersoll <gs...@apache.org>wrote:

> How many have tried Mahout with the setup they recommended?




-- 
Ted Dunning, CTO
DeepDyve

Re: Re : Good starting instance for AMI

Posted by Grant Ingersoll <gs...@apache.org>.
On Jan 18, 2010, at 10:59 AM, Drew Farris wrote:

> On Mon, Jan 18, 2010 at 10:20 AM, Grant Ingersoll <gs...@apache.org> wrote:
> 
>>> 
>>> I wonder if the CDH2 ami's could be used as a starting point? Not sure
>>> if you're allowed to unbundle and modify public AMI's. It would
>>> certainly be more difficult to start from scratch.
>> 
>> I'd prefer to be dependent on the official Apache distro that we use.
>> 
> 
> Do you mean the distro of Hadoop, or something else? From what I
> understand the convenience that CDH2 provides is largely based on the
> launch/management scripts, I agree that it would make sense to replace
> the actual hadoop distro with something that we use.

I just want the exact version that is in our Maven POM.

> 
> It is pretty simple to create AMI's from scratch, but I was wondering
> about getting things set up to auto-launch the various parts of hadoop
> at boot time and get the configuration right so that they are bound
> into a single cluster etc. If those sorts of things are trivial or
> otherwise covered, no need to start from CDH2.

Yes, I'd like that too.


Re: Re : Good starting instance for AMI

Posted by Drew Farris <dr...@gmail.com>.
On Mon, Jan 18, 2010 at 10:20 AM, Grant Ingersoll <gs...@apache.org> wrote:

>>
>> I wonder if the CDH2 ami's could be used as a starting point? Not sure
>> if you're allowed to unbundle and modify public AMI's. It would
>> certainly be more difficult to start from scratch.
>
> I'd prefer to be dependent on the official Apache distro that we use.
>

Do you mean the distro of Hadoop, or something else? From what I
understand the convenience that CDH2 provides is largely based on the
launch/management scripts, I agree that it would make sense to replace
the actual hadoop distro with something that we use.

It is pretty simple to create AMI's from scratch, but I was wondering
about getting things set up to auto-launch the various parts of hadoop
at boot time and get the configuration right so that they are bound
into a single cluster etc. If those sorts of things are trivial or
otherwise covered, no need to start from CDH2.

Drew

Re: Re : Good starting instance for AMI

Posted by Grant Ingersoll <gs...@apache.org>.
On Jan 18, 2010, at 10:07 AM, Drew Farris wrote:

> Sounds great.
> 
> It might be handy to include with the AMI a local maven repo
> pre-populated with build dependencies to shorten the build time as
> well.

Running as I type...

> 
> I wonder if the CDH2 ami's could be used as a starting point? Not sure
> if you're allowed to unbundle and modify public AMI's. It would
> certainly be more difficult to start from scratch.

I'd prefer to be dependent on the official Apache distro that we use.

> 
> Amazon hosts some public datasets for free:
> http://aws.amazon.com/publicdatasets/
> Perhaps the mahout test data in vector form could be bundled up into a
> snapshot that could be re-used by anyone.

Yes!  I would welcome help on this.  I also wonder if we can talk to Amazon about hosting that data publicly so that we don't have to pay for it.  Either that or maybe we could ask the ASF for some small budget to do so.  

Any insight from those w/ more experience would be greatly appreciated.  I can talk to the Amazon contact who runs the Apache donation project.

-Grant

Re: Re : Good starting instance for AMI

Posted by Drew Farris <dr...@gmail.com>.
Sounds great.

It might be handy to include with the AMI a local maven repo
pre-populated with build dependencies to shorten the build time as
well.
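
A minimal sketch of how that pre-population might be baked into the image (the svn URL and the dependency:go-offline goal are assumptions here, not a prescription):

  svn co http://svn.apache.org/repos/asf/lucene/mahout/trunk mahout-trunk
  cd mahout-trunk
  mvn -q dependency:go-offline   # pull plugins and dependencies into ~/.m2
  mvn -q -DskipTests install     # a first build also warms the local repo

After that, a fresh checkout on a booted instance should build largely offline.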

I wonder if the CDH2 ami's could be used as a starting point? Not sure
if you're allowed to unbundle and modify public AMI's. It would
certainly be more difficult to start from scratch.

Amazon hosts some public datasets for free:
http://aws.amazon.com/publicdatasets/
Perhaps the mahout test data in vector form could be bundled up into a
snapshot that could be re-used by anyone.

On Mon, Jan 18, 2010 at 9:54 AM, Grant Ingersoll <gs...@apache.org> wrote:
> OK, thanks for all the advice.  I'm wondering if this makes sense:
>
> Create an AMI with:
> 1. Java 1.6
> 2. Maven
> 3. svn
> 4. Mahout's exact Hadoop version
> 5. A checkout of Mahout
>
> I want to be able to run the trunk version of Mahout with little upgrade pain, both on an individual node and in a cluster.
>
> Is this the shortest path?  I don't have much experience w/ creating AMIs, but I want my work to be reusable by the community (remember, committers can get credits from Amazon for testing Mahout)
>
> After that, I want to convert some of the public datasets to vector format and run some performance benchmarks.
>
> Thoughts?
>
> On Jan 11, 2010, at 10:43 PM, deneche abdelhakim wrote:
>
>> I'm using Cloudera's with a 5-node cluster (+ 1 master node) that runs Hadoop 0.20+. Hadoop is pre-installed and configured; all I have to do is wget the Mahout job files and the data from S3, and launch my job.
>>
>> --- On Tue, 12.1.10, deneche abdelhakim <a_...@yahoo.fr> wrote:
>>
>>> From: deneche abdelhakim <a_...@yahoo.fr>
>>> Subject: Re: Re : Good starting instance for AMI
>>> To: mahout-user@lucene.apache.org
>>> Date: Tuesday, January 12, 2010, 3:44
>>> I used Cloudera's with Mahout to test
>>> the Decision Forest implementation.
>>>
>>> --- On Mon, 11.1.10, Grant Ingersoll <gs...@apache.org> wrote:
>>>
>>>> From: Grant Ingersoll <gs...@apache.org>
>>>> Subject: Re: Re : Good starting instance for AMI
>>>> To: mahout-user@lucene.apache.org
>>>> Date: Monday, January 11, 2010, 20:51
>>>> One quick question for all who
>>>> responded:
>>>> How many have tried Mahout with the setup they
>>>> recommended?
>>>>
>>>> -Grant
>>>>
>>>> On Jan 11, 2010, at 10:43 AM, zaki rahaman wrote:
>>>>
>>>>> Some comments on Cloudera's Hadoop (CDH) and
>>> Elastic
>>>> MapReduce (EMR).
>>>>>
>>>>> I have used both to get hadoop jobs up and
>>> running
>>>> (although my EMR use has
>>>>> mostly been limited to running batch Pig scripts
>>>> weekly). Deciding on which
>>>>> one to use really depends on what kind of
>>> job/data
>>>> you're working with.
>>>>>
>>>>> EMR is most useful if you're already storing the
>>>> dataset you're using on S3
>>>>> and plan on running a one-off job. My
>>> understanding is
>>>> that it's configured
>>>>> to use jets3t to stream data from s3 rather than
>>>> copying it to the cluster,
>>>>> which is fine for a single pass over a small to
>>> medium
>>>> sized dataset, but
>>>>> obviously slower for multiple passes or larger
>>>> datasets. The API is also
>>>>> useful if you have a set workflow that you plan
>>> to run
>>>> on a regular basis,
>>>>> and I often prototype quick and dirty jobs on
>>> very
>>>> small EMR clusters to
>>>>> test how some things run in the wild (obviously
>>> not
>>>> the most cost effective
>>>>> solution, but I've found pseudo-distributed mode
>>>> doesn't catch everything).
>>>>>
>>>>> CDH gives you greater control over the initial
>>> setup
>>>> and configuration of
>>>>> your cluster. From my understanding, it's not
>>> really
>>>> an AMI. Rather, it's a
>>>>> set of Python scripts that's been modified from
>>> the
>>>> ec2 scripts from
>>>>> hadoop/contrib with some nifty additions like
>>> being
>>>> able to specify and set
>>>>> up EBS volumes, proxy on the cluster, and some
>>> others.
>>>> The scripts use the
>>>>> boto Python module (a very useful Python module
>>> for
>>>> working with EC2) to
>>>>> make a request to EC2 to setup a specified sized
>>>> cluster with whatever
>>>>> vanilla AMI that's specified. It sets up the
>>> security
>>>> groups and opens up
>>>>> the relevant ports and it then passes the init
>>> script
>>>> to each of the
>>>>> instances once they've booted (same user-data
>>> file
>>>> setup which is limited to
>>>>> 16K I believe). The init script tells each node
>>> to
>>>> download hadoop (from
>>>>> Clouderas OS-specific repos) and any other
>>>> user-specified packages and set
>>>>> them up. The hadoop config xml is hardcoded into
>>> the
>>>> init script (although
>>>>> you can pass a modified config beforehand). The
>>> master
>>>> is started first, and
>>>>> then the slaves are started so that the slaves
>>> can be
>>>> given info about what
>>>>> NN and JT to connect to (the config uses the
>>> public
>>>> DNS I believe to make
>>>>> things easier to set up). You can use either
>>> 0.18.3
>>>> (CDH) or 0.20 (CDH2)
>>>>> when it comes to Hadoop versions, although I've
>>> had
>>>> mixed results with the
>>>>> latter.
>>>>>
>>>>> Personally, I'd still like some kind of facade
>>> or
>>>> something similar to
>>>>> further abstract things and make it easier for
>>> others
>>>> to quickly set up
>>>>> ad-hoc clusters for 'quick n dirty' jobs. I know
>>> other
>>>> libraries like Crane
>>>>> have been released recently, but given the
>>> language of
>>>> choice (Clojure), I
>>>>> haven't yet had a chance to really investigate.
>>>>>
>>>>> On Mon, Jan 11, 2010 at 2:56 AM, Ted Dunning
>>> <te...@gmail.com>
>>>> wrote:
>>>>>
>>>>>> Just use several of these files.
>>>>>>
>>>>>> On Sun, Jan 10, 2010 at 10:44 PM, Liang
>>> Chenmin
>>>> <liangchenmin04@gmail.com
>>>>>>> wrote:
>>>>>>
>>>>>>> EMR requires S3 bucket, but S3 instance
>>> has a
>>>> limit of file
>>>>>>> size(5GB), so need some extra care here.
>>> Has
>>>> any one encounter the file
>>>>>>> size
>>>>>>> problem on S3 also? I kind of think that
>>> it's
>>>> unreasonable to have a  5G
>>>>>>> size limit when we want to use the system
>>> to
>>>> deal with large data set.
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Ted Dunning, CTO
>>>>>> DeepDyve
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Zaki Rahaman
>>>>
>>>> --------------------------
>>>> Grant Ingersoll
>>>> http://www.lucidimagination.com/
>>>>
>>>> Search the Lucene ecosystem using Solr/Lucene: http://www.lucidimagination.com/search
>>>>
>>>>
>>>
>>>
>>>
>>>
>>
>>
>>
>
>

Re: Re : Good starting instance for AMI

Posted by Grant Ingersoll <gs...@apache.org>.
On Jan 18, 2010, at 3:15 PM, Ted Dunning wrote:

> Is there an important difference between creating an existing AMI or using
> an existing AMI with a startup script that populates everything from S3?
> 
> Building an AMI takes a few hours of time and is a total pain in the butt.
> My eventual result was that I didn't need to do it at all.
> 
> I found that I had roughly three levels of variation in my production
> systems:
> 
> - the OS
> - the infrastructural components like java, hadoop and zookeeper
> - the application that I wanted to run
> 
> My initial thought was that the AMI should cover the first two aspects of
> variability.  But I also found that I wanted to change the version of the
> infrastructure stuff fairly often in development of the AMI and not
> infrequently in production.
> 
> For Mahout customers, I would imagine that there is a reasonable amount of
> variability in desired OS (Ubuntu versus Redhat versus Centos at least), JDK
> and Hadoop versions.  

I only see a need for two: the version in trunk and the one in latest release.

This is all well and good, but I have yet to see anyone say: here's the AMI, the download script and the instructions.  So I'm just going to go ahead with what I think is useful for my needs, document it, and put it up there for people to use or not.  If anything, it will be useful for me to do it since I've never set up a Hadoop cluster on EC2 before.

-Grant

Re: Re : Good starting instance for AMI

Posted by Sean Owen <sr...@gmail.com>.
+1 this is a smarter version of what I tried to put together too. A
semi-custom AMI would download components and configure via an /etc/rc
script. Quite nice.

Point taken about Hadoop and the usefulness of such a thing amongst
ourselves. Based on incomplete experience with running AMIs and a
Hadoop cluster, it's going to be no small feat to craft a series of
AMIs (or one configurable one) that will reliably come up, find their
workers, accept jobs, etc. It's not terrible, but it's the work of a week,
I'm guessing.

That would be pretty great, for the whole community, should you
succeed. You could probably make a nice paid AMI out of it!

On Mon, Jan 18, 2010 at 8:15 PM, Ted Dunning <te...@gmail.com> wrote:
> Is there an important difference between creating an existing AMI or using
> an existing AMI with a startup script that populates everything from S3?
>
> Building an AMI takes a few hours of time and is a total pain in the butt.
> My eventual result was that I didn't need to do it at all.
>
> I found that I had roughly three levels of variation in my production
> systems:
>
> - the OS
> - the infrastructural components like java, hadoop and zookeeper
> - the application that I wanted to run
>
> My initial thought was that the AMI should cover the first two aspects of
> variability.  But I also found that I wanted to change the version of the
> infrastructure stuff fairly often in development of the AMI and not
> infrequently in production.
>
> For Mahout customers, I would imagine that there is a reasonable amount of
> variability in desired OS (Ubuntu versus Redhat versus Centos at least), JDK
> and Hadoop versions.  We definitely can't afford the time to build AMI's for
> all options.
>
> My final answer for deepdyve was to use a standard alestic.com AMI.  That
> let me change the OS whenever I needed to and would let Mahout customers
> pick their preference.  These AMI's allow a 16K startup script which I used
> to handle infrastructure variation.  That worked very well for me and could
> be used for Mahout.
>
> The cost was a few 10's of seconds at boot time.  The benefit was vastly
> better debug and development cycle.  Somebody else handled the OS and I
> could test many variations of setup script very quickly.  This practice is
> very much in line with what RightScale does.
>
> Generally, I would avoid the full-custom AMI in favor of a few S3 hosted tar
> balls rooted at / that anybody can rain down on any Linux version they
> want.
>
> On Mon, Jan 18, 2010 at 6:54 AM, Grant Ingersoll <gs...@apache.org>wrote:
>
>> Create an AMI with:
>> 1. Java 1.6
>> 2. Maven
>> 3. svn
>> 4. Mahout's exact Hadoop version
>> 5. A checkout of Mahout
>>
>
>
>
> --
> Ted Dunning, CTO
> DeepDyve
>

Re: Re : Good starting instance for AMI

Posted by Ken Krugler <kk...@transpac.com>.
On Jan 18, 2010, at 12:15pm, Ted Dunning wrote:

> Is there an important difference between creating an existing AMI or  
> using
> an existing AMI with a startup script that populates everything from  
> S3?
>
> Building an AMI takes a few hours of time and is a total pain in the  
> butt.
> My eventual result was that I didn't need to do it at all.

[snip]

Leaving aside the pros/cons of having a pre-installed Hadoop, there  
were two things that I found non-trivial to handle via the init script:

1. Get LZO support installed.

Though I didn't dig into the various ways to do a scripted install.

2. Turn off noatime.

You can do it via the script, but it feels kind of odd to have to
re-mount disks, and either know about the set of volumes or do fancy
sed-fu to dynamically generate the list.

Maybe there's an easy way that I missed? Input welcome...
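
For what it's worth, here is a minimal sketch of the remount step, assuming ext3 volumes and that dropping atime everywhere is acceptable:

  # derive the mounted ext3 filesystems from /proc/mounts and remount each with noatime
  for mnt in $(awk '$3 == "ext3" {print $2}' /proc/mounts); do
    mount -o remount,noatime "$mnt"
  done

It still has the "discover the volumes" smell mentioned above, just with awk instead of sed.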

-- Ken


>
> I found that I had roughly three levels of variation in my production
> systems:
>
> - the OS
> - the infrastructural components like java, hadoop and zookeeper
> - the application that I wanted to run
>
> My initial thought was that the AMI should cover the first two  
> aspects of
> variability.  But I also found that I wanted to change the version  
> of the
> infrastructure stuff fairly often in development of the AMI and not
> infrequently in production.
>
> For Mahout customers, I would imagine that there is a reasonable  
> amount of
> variability in desired OS (Ubuntu versus Redhat versus Centos at  
> least), JDK
> and Hadoop versions.  We definitely can't afford the time to build  
> AMI's for
> all options.
>
> My final answer for deepdyve was to use a standard alestic.com AMI.   
> That
> let me change the OS whenever I needed to and would let Mahout  
> customers
> pick their preference.  These AMI's allow a 16K startup script which  
> I used
> to handle infrastructure variation.  That worked very well for me  
> and could
> be used for Mahout.
>
> The cost was a few 10's of seconds at boot time.  The benefit was  
> vastly
> better debug and development cycle.  Somebody else handled the OS  
> and I
> could test many variations of setup script very quickly.  This  
> practice is
> very much in line with what RightScale does.
>
> Generally, I would avoid the full-custom AMI in favor of a few S3  
> hosted tar
> balls rooted at / that anybody can rain down on any Linux version they
> want.
>
> On Mon, Jan 18, 2010 at 6:54 AM, Grant Ingersoll  
> <gs...@apache.org>wrote:
>
>> Create an AMI with:
>> 1. Java 1.6
>> 2. Maven
>> 3. svn
>> 4. Mahout's exact Hadoop version
>> 5. A checkout of Mahout
>>
>
>
>
> -- 
> Ted Dunning, CTO
> DeepDyve

--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g





Re: Re : Good starting instance for AMI

Posted by Ted Dunning <te...@gmail.com>.
Is there an important difference between creating an existing AMI or using
an existing AMI with a startup script that populates everything from S3?

Building an AMI takes a few hours of time and is a total pain in the butt.
My eventual result was that I didn't need to do it at all.

I found that I had roughly three levels of variation in my production
systems:

- the OS
- the infrastructural components like java, hadoop and zookeeper
- the application that I wanted to run

My initial thought was that the AMI should cover the first two aspects of
variability.  But I also found that I wanted to change the version of the
infrastructure stuff fairly often in development of the AMI and not
infrequently in production.

For Mahout customers, I would imagine that there is a reasonable amount of
variability in desired OS (Ubuntu versus Redhat versus Centos at least), JDK
and Hadoop versions.  We definitely can't afford the time to build AMI's for
all options.

My final answer for deepdyve was to use a standard alestic.com AMI.  That
let me change the OS whenever I needed to and would let Mahout customers
pick their preference.  These AMI's allow a 16K startup script which I used
to handle infrastructure variation.  That worked very well for me and could
be used for Mahout.

The cost was a few 10's of seconds at boot time.  The benefit was vastly
better debug and development cycle.  Somebody else handled the OS and I
could test many variations of setup script very quickly.  This practice is
very much in line with what RightScale does.

Generally, I would avoid the full-custom AMI in favor of a few S3 hosted tar
balls rooted at / that anybody can rain down on any Linux version they
want.
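
A minimal sketch of that startup-script style, assuming the tarballs have been staged on S3 (bucket and file names below are placeholders) and are rooted at /:

  #!/bin/bash
  # user-data script: pull infrastructure tarballs from S3 and unpack them at /
  BUCKET=http://my-infra-bucket.s3.amazonaws.com
  cd /
  for tarball in jdk-1.6.tar.gz hadoop-0.20.tar.gz zookeeper-3.2.tar.gz; do
    curl -s "$BUCKET/$tarball" | tar xz
  done
  # role-specific start-up (namenode vs. datanode, etc.) would follow here

Kept small, a script like this fits comfortably inside the 16K user-data limit mentioned above.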

On Mon, Jan 18, 2010 at 6:54 AM, Grant Ingersoll <gs...@apache.org>wrote:

> Create an AMI with:
> 1. Java 1.6
> 2. Maven
> 3. svn
> 4. Mahout's exact Hadoop version
> 5. A checkout of Mahout
>



-- 
Ted Dunning, CTO
DeepDyve

Re: Re : Good starting instance for AMI

Posted by Grant Ingersoll <gs...@apache.org>.
On Jan 18, 2010, at 10:31 AM, Sean Owen wrote:

> AFAIK AMIs are fixed. You make your instance as you like it, then run
> some special voodoo to save it off as an AMI. Later you can run the
> AMI, change it, build a new one, but that's a new one. Yeah anyone can
> do it.

Right, I just mostly want a way for others, presumably committers, to be able to edit the same image, so that we aren't duplicating efforts or spinning off a bunch of different AMI's that confuse people.  


> 
> I think this came up before and my only question is, what's the use
> case for this we're trying to answer? So far it sounds like a regular
> instance with a copy of a Mahout .jar. Is this meaningfully more
> useful for someone than simply providing the .jar? I can't exactly
> migrate from one Mahout AMI to another in any sense, when upgrades are
> provided -- AMIs aren't a mechanism for distributing a library.
> 
> We're also not talking about providing a ready-to-go Hadoop cluster.
> And shouldn't. This is something Elastic Mapreduce is already great
> for.
> 

Except EMR is on 0.18.3.  So, yes, I am interested in a ready-to-go Hadoop cluster along w/ a suite of data sets that we can use to benchmark Mahout trunk and make it easier for people to try out Mahout or even run in production.  So while I would agree they aren't a mechanism for distributing a library, they are very useful for getting people up and running very quickly.

At any rate, I think the bigger takeaway from your point is this doesn't have to be some officially supported thing and it isn't required of releases.

I mostly, right now, have a need to benchmark Mahout's clustering capabilities and thus need a Hadoop cluster.  Rather than do a one off like many others have done, I'd like to share my efforts w/ others so that we all, hopefully, benefit.  I can definitely say that if there was an AMI on it that was already preconfigured for me w/ Mahout trunk and Hadoop ready to go, I'd use it and I bet others would too.

So far, I have everything on an instance (mvn, svn, java, Mahout, etc.) except the Hadoop cluster stuff.  I've already run mvn install on Mahout.  In other words, it's pretty ready to go.


> Once upon a time I wrote an AMI that would fire up, automatically
> download data from a location, run recommendations, upload them, and
> quit. Pretty simple, pretty nice. *That* kind of thing I think is
> really useful. The AMI is like one big remote method invocation.

+1.  

> 
> On Mon, Jan 18, 2010 at 3:26 PM, Grant Ingersoll <gs...@apache.org> wrote:
>> 
>> On Jan 18, 2010, at 10:20 AM, Robin Anil wrote:
>> 
>>> Perfect! We can have two AMIs: Mahout trunk and the Mahout release version.
>> 
>> Cool.  I'll get my base AMI up (just as soon as I figure out the security stuff) and then we can coordinate.  Is it possible to have multiple people "manage" an AMI so that the Mahout committers can reasonably take on keeping them up to date?
>> 
>> -Grant

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem using Solr/Lucene: http://www.lucidimagination.com/search


Re: Re : Good starting instance for AMI

Posted by Sean Owen <sr...@gmail.com>.
AFAIK AMIs are fixed. You make your instance as you like it, then run
some special voodoo to save it off as an AMI. Later you can run the
AMI, change it, build a new one, but that's a new one. Yeah anyone can
do it.
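
For the record, the "special voodoo" for an S3-backed image is roughly the EC2 AMI tools; a hedged sketch, with the keys, account id and bucket as placeholders:

  ec2-bundle-vol -d /mnt -k pk.pem -c cert.pem -u <account-id> -p mahout-trunk
  ec2-upload-bundle -b my-bucket/mahout-trunk -m /mnt/mahout-trunk.manifest.xml \
      -a $AWS_ACCESS_KEY -s $AWS_SECRET_KEY
  ec2-register my-bucket/mahout-trunk/mahout-trunk.manifest.xml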

I think this came up before and my only question is, what's the use
case for this we're trying to answer? So far it sounds like a regular
instance with a copy of a Mahout .jar. Is this meaningfully more
useful for someone than simply providing the .jar? I can't exactly
migrate from one Mahout AMI to another in any sense, when upgrades are
provided -- AMIs aren't a mechanism for distributing a library.

We're also not talking about providing a ready-to-go Hadoop cluster.
And shouldn't. This is something Elastic Mapreduce is already great
for.

Once upon a time I wrote an AMI that would fire up, automatically
download data from a location, run recommendations, upload them, and
quit. Pretty simple, pretty nice. *That* kind of thing I think is
really useful. The AMI is like one big remote method invocation.

On Mon, Jan 18, 2010 at 3:26 PM, Grant Ingersoll <gs...@apache.org> wrote:
>
> On Jan 18, 2010, at 10:20 AM, Robin Anil wrote:
>
>> Perfect! We can have two AMIs: Mahout trunk and the Mahout release version.
>
> Cool.  I'll get my base AMI up (just as soon as I figure out the security stuff) and then we can coordinate.  Is it possible to have multiple people "manage" an AMI so that the Mahout committers can reasonably take on keeping them up to date?
>
> -Grant

Re: Re : Good starting instance for AMI

Posted by Grant Ingersoll <gs...@apache.org>.
On Jan 18, 2010, at 10:20 AM, Robin Anil wrote:

> Perfect! We can have two AMIs: Mahout trunk and the Mahout release version.

Cool.  I'll get my base AMI up (just as soon as I figure out the security stuff) and then we can coordinate.  Is it possible to have multiple people "manage" an AMI so that the Mahout committers can reasonably take on keeping them up to date?

-Grant

Re: Re : Good starting instance for AMI

Posted by Robin Anil <ro...@gmail.com>.
It would be great if we could bundle the LZO codec too.

We need some script that adds Hadoop slaves, so that a cluster can be run
easily (it needn't be an optimized configuration).
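
A rough sketch of what such a script might do against an already-running 0.20 cluster (paths assume a plain Apache Hadoop layout):

  # on the master: record the new slave
  echo new-slave-hostname >> $HADOOP_HOME/conf/slaves
  # on the new slave: join the running cluster
  $HADOOP_HOME/bin/hadoop-daemon.sh start datanode
  $HADOOP_HOME/bin/hadoop-daemon.sh start tasktracker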

One problem I see is that we may have to build for both i386 and x64 kernels (or we
won't be able to run small and large instances, respectively).
Robin

On Mon, Jan 18, 2010 at 8:50 PM, Robin Anil <ro...@gmail.com> wrote:

> Perfect! We can have two AMIs: Mahout trunk and the Mahout release version.
>
>
> On Mon, Jan 18, 2010 at 8:24 PM, Grant Ingersoll <gs...@apache.org>wrote:
>
>> OK, thanks for all the advice.  I'm wondering if this makes sense:
>>
>> Create an AMI with:
>> 1. Java 1.6
>> 2. Maven
>> 3. svn
>> 4. Mahout's exact Hadoop version
>> 5. A checkout of Mahout
>>
>> I want to be able to run the trunk version of Mahout with little upgrade
>> pain, both on an individual node and in a cluster.
>>
>> Is this the shortest path?  I don't have much experience w/ creating AMIs,
>> but I want my work to be reusable by the community (remember, committers can
>> get credits from Amazon for testing Mahout)
>>
>> After that, I want to convert some of the public datasets to vector format
>> and run some performance benchmarks.
>>
>> Thoughts?
>>
>> On Jan 11, 2010, at 10:43 PM, deneche abdelhakim wrote:
>>
>> > I'm using Cloudera's with a 5 nodes cluster (+ 1 master node) that runs
>> Hadoop 0.20+ . Hadoop is pre-installed and configured all I have to do is
>> wget the Mahout's job files and the data from S3, and launch my job.
>> >
>> > --- On Tue, 12.1.10, deneche abdelhakim <a_...@yahoo.fr> wrote:
>> >
>> >> From: deneche abdelhakim <a_...@yahoo.fr>
>> >> Subject: Re: Re : Good starting instance for AMI
>> >> To: mahout-user@lucene.apache.org
>> >> Date: Tuesday, January 12, 2010, 3:44
>> >> I used Cloudera's with Mahout to test
>> >> the Decision Forest implementation.
>> >>
>> >> --- On Mon, 11.1.10, Grant Ingersoll <gs...@apache.org> wrote:
>> >>
>> >>> From: Grant Ingersoll <gs...@apache.org>
>> >>> Subject: Re: Re : Good starting instance for AMI
>> >>> To: mahout-user@lucene.apache.org
>> >>> Date: Monday, January 11, 2010, 20:51
>> >>> One quick question for all who
>> >>> responded:
>> >>> How many have tried Mahout with the setup they
>> >>> recommended?
>> >>>
>> >>> -Grant
>> >>>
>> >>> On Jan 11, 2010, at 10:43 AM, zaki rahaman wrote:
>> >>>
>> >>>> Some comments on Cloudera's Hadoop (CDH) and
>> >> Elastic
>> >>> MapReduce (EMR).
>> >>>>
>> >>>> I have used both to get hadoop jobs up and
>> >> running
>> >>> (although my EMR use has
>> >>>> mostly been limited to running batch Pig scripts
>> >>> weekly). Deciding on which
>> >>>> one to use really depends on what kind of
>> >> job/data
>> >>> you're working with.
>> >>>>
>> >>>> EMR is most useful if you're already storing the
>> >>> dataset you're using on S3
>> >>>> and plan on running a one-off job. My
>> >> understanding is
>> >>> that it's configured
>> >>>> to use jets3t to stream data from s3 rather than
>> >>> copying it to the cluster,
>> >>>> which is fine for a single pass over a small to
>> >> medium
>> >>> sized dataset, but
>> >>>> obviously slower for multiple passes or larger
>> >>> datasets. The API is also
>> >>>> useful if you have a set workflow that you plan
>> >> to run
>> >>> on a regular basis,
>> >>>> and I often prototype quick and dirty jobs on
>> >> very
>> >>> small EMR clusters to
>> >>>> test how some things run in the wild (obviously
>> >> not
>> >>> the most cost effective
>> >>>> solution, but I've found pseudo-distributed mode
>> >>> doesn't catch everything).
>> >>>>
>> >>>> CDH gives you greater control over the initial
>> >> setup
>> >>> and configuration of
>> >>>> your cluster. From my understanding, it's not
>> >> really
>> >>> an AMI. Rather, it's a
>> >>>> set of Python scripts that's been modified from
>> >> the
>> >>> ec2 scripts from
>> >>>> hadoop/contrib with some nifty additions like
>> >> being
>> >>> able to specify and set
>> >>>> up EBS volumes, proxy on the cluster, and some
>> >> others.
>> >>> The scripts use the
>> >>>> boto Python module (a very useful Python module
>> >> for
>> >>> working with EC2) to
>> >>>> make a request to EC2 to setup a specified sized
>> >>> cluster with whatever
>> >>>> vanilla AMI that's specified. It sets up the
>> >> security
>> >>> groups and opens up
>> >>>> the relevant ports and it then passes the init
>> >> script
>> >>> to each of the
>> >>>> instances once they've booted (same user-data
>> >> file
>> >>> setup which is limited to
>> >>>> 16K I believe). The init script tells each node
>> >> to
>> >>> download hadoop (from
>> >>>> Clouderas OS-specific repos) and any other
>> >>> user-specified packages and set
>> >>>> them up. The hadoop config xml is hardcoded into
>> >> the
>> >>> init script (although
>> >>>> you can pass a modified config beforehand). The
>> >> master
>> >>> is started first, and
>> >>>> then the slaves are started so that the slaves
>> >> can be
>> >>> given info about what
>> >>>> NN and JT to connect to (the config uses the
>> >> public
>> >>> DNS I believe to make
>> >>>> things easier to set up). You can use either
>> >> 0.18.3
>> >>> (CDH) or 0.20 (CDH2)
>> >>>> when it comes to Hadoop versions, although I've
>> >> had
>> >>> mixed results with the
>> >>>> latter.
>> >>>>
>> >>>> Personally, I'd still like some kind of facade
>> >> or
>> >>> something similar to
>> >>>> further abstract things and make it easier for
>> >> others
>> >>> to quickly set up
>> >>>> ad-hoc clusters for 'quick n dirty' jobs. I know
>> >> other
>> >>> libraries like Crane
>> >>>> have been released recently, but given the
>> >> language of
>> >>> choice (Clojure), I
>> >>>> haven't yet had a chance to really investigate.
>> >>>>
>> >>>> On Mon, Jan 11, 2010 at 2:56 AM, Ted Dunning
>> >> <te...@gmail.com>
>> >>> wrote:
>> >>>>
>> >>>>> Just use several of these files.
>> >>>>>
>> >>>>> On Sun, Jan 10, 2010 at 10:44 PM, Liang
>> >> Chenmin
>> >>> <liangchenmin04@gmail.com
>> >>>>>> wrote:
>> >>>>>
>> >>>>>> EMR requires S3 bucket, but S3 instance
>> >> has a
>> >>> limit of file
>> >>>>>> size(5GB), so need some extra care here.
>> >> Has
>> >>> any one encounter the file
>> >>>>>> size
>> >>>>>> problem on S3 also? I kind of think that
>> >> it's
>> >>> unreasonable to have a  5G
>> >>>>>> size limit when we want to use the system
>> >> to
>> >>> deal with large data set.
>> >>>>>>
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>> --
>> >>>>> Ted Dunning, CTO
>> >>>>> DeepDyve
>> >>>>>
>> >>>>
>> >>>>
>> >>>>
>> >>>> --
>> >>>> Zaki Rahaman
>> >>>
>> >>> --------------------------
>> >>> Grant Ingersoll
>> >>> http://www.lucidimagination.com/
>> >>>
>> >>> Search the Lucene ecosystem using Solr/Lucene:
>> http://www.lucidimagination.com/search
>> >>>
>> >>>
>> >>
>> >>
>> >>
>> >>
>> >
>> >
>> >
>>
>>
>

Re: Re : Good starting instance for AMI

Posted by Robin Anil <ro...@gmail.com>.
Perfect! We can have two AMIs: Mahout trunk and the Mahout release version.


On Mon, Jan 18, 2010 at 8:24 PM, Grant Ingersoll <gs...@apache.org>wrote:

> OK, thanks for all the advice.  I'm wondering if this makes sense:
>
> Create an AMI with:
> 1. Java 1.6
> 2. Maven
> 3. svn
> 4. Mahout's exact Hadoop version
> 5. A checkout of Mahout
>
> I want to be able to run the trunk version of Mahout with little upgrade
> pain, both on an individual node and in a cluster.
>
> Is this the shortest path?  I don't have much experience w/ creating AMIs,
> but I want my work to be reusable by the community (remember, committers can
> get credits from Amazon for testing Mahout)
>
> After that, I want to convert some of the public datasets to vector format
> and run some performance benchmarks.
>
> Thoughts?
>
> On Jan 11, 2010, at 10:43 PM, deneche abdelhakim wrote:
>
> > I'm using Cloudera's with a 5 nodes cluster (+ 1 master node) that runs
> Hadoop 0.20+ . Hadoop is pre-installed and configured all I have to do is
> wget the Mahout's job files and the data from S3, and launch my job.
> >
> > --- On Tue, 12.1.10, deneche abdelhakim <a_...@yahoo.fr> wrote:
> >
> >> From: deneche abdelhakim <a_...@yahoo.fr>
> >> Subject: Re: Re : Good starting instance for AMI
> >> To: mahout-user@lucene.apache.org
> >> Date: Tuesday, January 12, 2010, 3:44
> >> I used Cloudera's with Mahout to test
> >> the Decision Forest implementation.
> >>
> >> --- On Mon, 11.1.10, Grant Ingersoll <gs...@apache.org> wrote:
> >>
> >>> From: Grant Ingersoll <gs...@apache.org>
> >>> Subject: Re: Re : Good starting instance for AMI
> >>> To: mahout-user@lucene.apache.org
> >>> Date: Monday, January 11, 2010, 20:51
> >>> One quick question for all who
> >>> responded:
> >>> How many have tried Mahout with the setup they
> >>> recommended?
> >>>
> >>> -Grant
> >>>
> >>> On Jan 11, 2010, at 10:43 AM, zaki rahaman wrote:
> >>>
> >>>> Some comments on Cloudera's Hadoop (CDH) and
> >> Elastic
> >>> MapReduce (EMR).
> >>>>
> >>>> I have used both to get hadoop jobs up and
> >> running
> >>> (although my EMR use has
> >>>> mostly been limited to running batch Pig scripts
> >>> weekly). Deciding on which
> >>>> one to use really depends on what kind of
> >> job/data
> >>> you're working with.
> >>>>
> >>>> EMR is most useful if you're already storing the
> >>> dataset you're using on S3
> >>>> and plan on running a one-off job. My
> >> understanding is
> >>> that it's configured
> >>>> to use jets3t to stream data from s3 rather than
> >>> copying it to the cluster,
> >>>> which is fine for a single pass over a small to
> >> medium
> >>> sized dataset, but
> >>>> obviously slower for multiple passes or larger
> >>> datasets. The API is also
> >>>> useful if you have a set workflow that you plan
> >> to run
> >>> on a regular basis,
> >>>> and I often prototype quick and dirty jobs on
> >> very
> >>> small EMR clusters to
> >>>> test how some things run in the wild (obviously
> >> not
> >>> the most cost effective
> >>>> solution, but I've found pseudo-distributed mode
> >>> doesn't catch everything).
> >>>>
> >>>> CDH gives you greater control over the initial
> >> setup
> >>> and configuration of
> >>>> your cluster. From my understanding, it's not
> >> really
> >>> an AMI. Rather, it's a
> >>>> set of Python scripts that's been modified from
> >> the
> >>> ec2 scripts from
> >>>> hadoop/contrib with some nifty additions like
> >> being
> >>> able to specify and set
> >>>> up EBS volumes, proxy on the cluster, and some
> >> others.
> >>> The scripts use the
> >>>> boto Python module (a very useful Python module
> >> for
> >>> working with EC2) to
> >>>> make a request to EC2 to setup a specified sized
> >>> cluster with whatever
> >>>> vanilla AMI that's specified. It sets up the
> >> security
> >>> groups and opens up
> >>>> the relevant ports and it then passes the init
> >> script
> >>> to each of the
> >>>> instances once they've booted (same user-data
> >> file
> >>> setup which is limited to
> >>>> 16K I believe). The init script tells each node
> >> to
> >>> download hadoop (from
> >>>> Clouderas OS-specific repos) and any other
> >>> user-specified packages and set
> >>>> them up. The hadoop config xml is hardcoded into
> >> the
> >>> init script (although
> >>>> you can pass a modified config beforehand). The
> >> master
> >>> is started first, and
> >>>> then the slaves are started so that the slaves
> >> can be
> >>> given info about what
> >>>> NN and JT to connect to (the config uses the
> >> public
> >>> DNS I believe to make
> >>>> things easier to set up). You can use either
> >> 0.18.3
> >>> (CDH) or 0.20 (CDH2)
> >>>> when it comes to Hadoop versions, although I've
> >> had
> >>> mixed results with the
> >>>> latter.
> >>>>
> >>>> Personally, I'd still like some kind of facade
> >> or
> >>> something similar to
> >>>> further abstract things and make it easier for
> >> others
> >>> to quickly set up
> >>>> ad-hoc clusters for 'quick n dirty' jobs. I know
> >> other
> >>> libraries like Crane
> >>>> have been released recently, but given the
> >> language of
> >>> choice (Clojure), I
> >>>> haven't yet had a chance to really investigate.
> >>>>
> >>>> On Mon, Jan 11, 2010 at 2:56 AM, Ted Dunning
> >> <te...@gmail.com>
> >>> wrote:
> >>>>
> >>>>> Just use several of these files.
> >>>>>
> >>>>> On Sun, Jan 10, 2010 at 10:44 PM, Liang
> >> Chenmin
> >>> <liangchenmin04@gmail.com
> >>>>>> wrote:
> >>>>>
> >>>>>> EMR requires S3 bucket, but S3 instance
> >> has a
> >>> limit of file
> >>>>>> size(5GB), so need some extra care here.
> >> Has
> >>> any one encounter the file
> >>>>>> size
> >>>>>> problem on S3 also? I kind of think that
> >> it's
> >>> unreasonable to have a  5G
> >>>>>> size limit when we want to use the system
> >> to
> >>> deal with large data set.
> >>>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>> --
> >>>>> Ted Dunning, CTO
> >>>>> DeepDyve
> >>>>>
> >>>>
> >>>>
> >>>>
> >>>> --
> >>>> Zaki Rahaman
> >>>
> >>> --------------------------
> >>> Grant Ingersoll
> >>> http://www.lucidimagination.com/
> >>>
> >>> Search the Lucene ecosystem using Solr/Lucene:
> http://www.lucidimagination.com/search
> >>>
> >>>
> >>
> >>
> >>
> >>
> >
> >
> >
>
>

Re: Re : Good starting instance for AMI

Posted by Olivier Grisel <ol...@ensta.org>.
2010/1/18 Grant Ingersoll <gs...@apache.org>:
> OK, thanks for all the advice.  I'm wondering if this makes sense:
>
> Create an AMI with:
> 1. Java 1.6
> 2. Maven
> 3. svn
> 4. Mahout's exact Hadoop version
> 5. A checkout of Mahout

I am running CDH2 with Hadoop, currently version 0.20.1+152-1~j
(using Cloudera's intrepid-testing apt repo on a regular Ubuntu
Karmic distro), on my 2 dev boxes (one is a 32-bit dual core and one is
a 64-bit quad core) in conf-pseudo (single-node cluster). I could
successfully run mahout-0.3-SNAPSHOT jobs (including against
hadoop-0.20.2-SNAPSHOT). I guess this would run exactly the same on a
real EC2 cluster set up with http://archive.cloudera.com/docs/ec2.html
.

> I want to be able to run the trunk version of Mahout with little upgrade pain, both on an individual node and in a cluster.
>
> Is this the shortest path?  I don't have much experience w/ creating AMIs, but I want my work to be reusable by the community (remember, committers can get credits from Amazon for testing Mahout)
>
> After that, I want to convert some of the public datasets to vector format and run some performance benchmarks.

I think we should host sample datasets that are known to be
vectorizable using Mahout utilities, either on S3 (using s3:// and not
s3n:// when individual files are larger than 5GB) or on a dedicated
EBS volume with a public snapshot.
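
For example, staging a large vector set could look roughly like this (bucket name and paths are placeholders; s3:// is the block store, which splits objects internally and so sidesteps the 5GB per-file limit):

  hadoop distcp hdfs://namenode:8020/user/mahout/vectors \
      s3://ACCESS_KEY:SECRET_KEY@mahout-datasets/vectors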

-- 
Olivier
http://twitter.com/ogrisel - http://code.oliviergrisel.name

Re: Re : Good starting instance for AMI

Posted by Grant Ingersoll <gs...@apache.org>.
OK, thanks for all the advice.  I'm wondering if this makes sense:

Create an AMI with:
1. Java 1.6
2. Maven
3. svn
4. Mahout's exact Hadoop version
5. A checkout of Mahout

I want to be able to run the trunk version of Mahout with little upgrade pain, both on an individual node and in a cluster.

Is this the shortest path?  I don't have much experience w/ creating AMIs, but I want my work to be reusable by the community (remember, committers can get credits from Amazon for testing Mahout)

After that, I want to convert some of the public datasets to vector format and run some performance benchmarks.

Thoughts?
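
A minimal provisioning sketch for such an image (package names, the svn URL and the Hadoop version are assumptions, and it is Debian/Ubuntu flavoured; adjust to taste and to the version pinned in the POM):

  #!/bin/bash
  # 1-3: Java 1.6, Maven, svn
  apt-get install -y sun-java6-jdk maven2 subversion
  # 4: the Hadoop release Mahout builds against
  curl -s http://archive.apache.org/dist/hadoop/core/hadoop-0.20.1/hadoop-0.20.1.tar.gz \
      | tar xz -C /usr/local
  # 5: a checkout of Mahout, built once so the local Maven repo is warm
  svn co http://svn.apache.org/repos/asf/lucene/mahout/trunk /usr/local/mahout-trunk
  cd /usr/local/mahout-trunk && mvn -q -DskipTests install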

On Jan 11, 2010, at 10:43 PM, deneche abdelhakim wrote:

> I'm using Cloudera's with a 5-node cluster (+ 1 master node) that runs Hadoop 0.20+. Hadoop is pre-installed and configured; all I have to do is wget the Mahout job files and the data from S3, and launch my job.
> 
> --- On Tue, 12.1.10, deneche abdelhakim <a_...@yahoo.fr> wrote:
> 
>> From: deneche abdelhakim <a_...@yahoo.fr>
>> Subject: Re: Re : Good starting instance for AMI
>> To: mahout-user@lucene.apache.org
>> Date: Tuesday, January 12, 2010, 3:44
>> I used Cloudera's with Mahout to test
>> the Decision Forest implementation.
>> 
>> --- On Mon, 11.1.10, Grant Ingersoll <gs...@apache.org> wrote:
>>
>>> From: Grant Ingersoll <gs...@apache.org>
>>> Subject: Re: Re : Good starting instance for AMI
>>> To: mahout-user@lucene.apache.org
>>> Date: Monday, January 11, 2010, 20:51
>>> One quick question for all who
>>> responded:
>>> How many have tried Mahout with the setup they
>>> recommended?
>>> 
>>> -Grant
>>> 
>>> On Jan 11, 2010, at 10:43 AM, zaki rahaman wrote:
>>> 
>>>> Some comments on Cloudera's Hadoop (CDH) and
>> Elastic
>>> MapReduce (EMR).
>>>> 
>>>> I have used both to get hadoop jobs up and
>> running
>>> (although my EMR use has
>>>> mostly been limited to running batch Pig scripts
>>> weekly). Deciding on which
>>>> one to use really depends on what kind of
>> job/data
>>> you're working with.
>>>> 
>>>> EMR is most useful if you're already storing the
>>> dataset you're using on S3
>>>> and plan on running a one-off job. My
>> understanding is
>>> that it's configured
>>>> to use jets3t to stream data from s3 rather than
>>> copying it to the cluster,
>>>> which is fine for a single pass over a small to
>> medium
>>> sized dataset, but
>>>> obviously slower for multiple passes or larger
>>> datasets. The API is also
>>>> useful if you have a set workflow that you plan
>> to run
>>> on a regular basis,
>>>> and I often prototype quick and dirty jobs on
>> very
>>> small EMR clusters to
>>>> test how some things run in the wild (obviously
>> not
>>> the most cost effective
>>>> solution, but I've found pseudo-distributed mode
>>> doesn't catch everything).
>>>> 
>>>> CDH gives you greater control over the initial
>> setup
>>> and configuration of
>>>> your cluster. From my understanding, it's not
>> really
>>> an AMI. Rather, it's a
>>>> set of Python scripts that's been modified from
>> the
>>> ec2 scripts from
>>>> hadoop/contrib with some nifty additions like
>> being
>>> able to specify and set
>>>> up EBS volumes, proxy on the cluster, and some
>> others.
>>> The scripts use the
>>>> boto Python module (a very useful Python module
>> for
>>> working with EC2) to
>>>> make a request to EC2 to setup a specified sized
>>> cluster with whatever
>>>> vanilla AMI that's specified. It sets up the
>> security
>>> groups and opens up
>>>> the relevant ports and it then passes the init
>> script
>>> to each of the
>>>> instances once they've booted (same user-data
>> file
>>> setup which is limited to
>>>> 16K I believe). The init script tells each node
>> to
>>> download hadoop (from
>>>> Clouderas OS-specific repos) and any other
>>> user-specified packages and set
>>>> them up. The hadoop config xml is hardcoded into
>> the
>>> init script (although
>>>> you can pass a modified config beforehand). The
>> master
>>> is started first, and
>>>> then the slaves are started so that the slaves
>> can be
>>> given info about what
>>>> NN and JT to connect to (the config uses the
>> public
>>> DNS I believe to make
>>>> things easier to set up). You can use either
>> 0.18.3
>>> (CDH) or 0.20 (CDH2)
>>>> when it comes to Hadoop versions, although I've
>> had
>>> mixed results with the
>>>> latter.
>>>> 
>>>> Personally, I'd still like some kind of facade
>> or
>>> something similar to
>>>> further abstract things and make it easier for
>> others
>>> to quickly set up
>>>> ad-hoc clusters for 'quick n dirty' jobs. I know
>> other
>>> libraries like Crane
>>>> have been released recently, but given the
>> language of
>>> choice (Clojure), I
>>>> haven't yet had a chance to really investigate.
>>>> 
>>>> On Mon, Jan 11, 2010 at 2:56 AM, Ted Dunning
>> <te...@gmail.com>
>>> wrote:
>>>> 
>>>>> Just use several of these files.
>>>>> 
>>>>> On Sun, Jan 10, 2010 at 10:44 PM, Liang
>> Chenmin
>>> <liangchenmin04@gmail.com
>>>>>> wrote:
>>>>> 
>>>>>> EMR requires S3 bucket, but S3 instance
>> has a
>>> limit of file
>>>>>> size(5GB), so need some extra care here.
>> Has
>>> any one encounter the file
>>>>>> size
>>>>>> problem on S3 also? I kind of think that
>> it's
>>> unreasonable to have a  5G
>>>>>> size limit when we want to use the system
>> to
>>> deal with large data set.
>>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> --
>>>>> Ted Dunning, CTO
>>>>> DeepDyve
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> -- 
>>>> Zaki Rahaman
>>> 
>>> --------------------------
>>> Grant Ingersoll
>>> http://www.lucidimagination.com/
>>> 
>>> Search the Lucene ecosystem using Solr/Lucene: http://www.lucidimagination.com/search
>>> 
>>> 
>> 
>> 
>> 
>> 
> 
> 
> 


Re: Re : Good starting instance for AMI

Posted by deneche abdelhakim <a_...@yahoo.fr>.
I'm using Cloudera's with a 5-node cluster (+ 1 master node) that runs Hadoop 0.20+. Hadoop is pre-installed and configured; all I have to do is wget the Mahout job files and the data from S3, and launch my job.
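
Concretely, the per-run steps look roughly like this (bucket, job file name and driver class are placeholders, not the exact ones used):

  # fetch the Mahout job file and pull the input data onto the cluster
  wget http://my-bucket.s3.amazonaws.com/mahout-examples-0.3-SNAPSHOT.job
  hadoop distcp s3n://my-bucket/input input
  # launch the job; DRIVER stands in for whichever Mahout driver class is being run
  hadoop jar mahout-examples-0.3-SNAPSHOT.job "$DRIVER" input output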

--- On Tue 12.1.10, deneche abdelhakim <a_...@yahoo.fr> wrote:

> From: deneche abdelhakim <a_...@yahoo.fr>
> Subject: Re: Re : Good starting instance for AMI
> To: mahout-user@lucene.apache.org
> Date: Tuesday 12 January 2010, 3:44
> I used Cloudera's with Mahout to test
> the Decision Forest implementation.
> 
> --- On Mon 11.1.10, Grant Ingersoll <gs...@apache.org>
> wrote:
> 
> > From: Grant Ingersoll <gs...@apache.org>
> > Subject: Re: Re : Good starting instance for AMI
> > To: mahout-user@lucene.apache.org
> > Date: Monday 11 January 2010, 20:51
> > One quick question for all who
> > responded:
> > How many have tried Mahout with the setup they
> > recommended?
> > 
> > -Grant

Re: Re : Good starting instance for AMI

Posted by deneche abdelhakim <a_...@yahoo.fr>.
I used Cloudera's with Mahout to test the Decision Forest implementation.

--- On Mon 11.1.10, Grant Ingersoll <gs...@apache.org> wrote:

> From: Grant Ingersoll <gs...@apache.org>
> Subject: Re: Re : Good starting instance for AMI
> To: mahout-user@lucene.apache.org
> Date: Monday 11 January 2010, 20:51
> One quick question for all who
> responded:
> How many have tried Mahout with the setup they
> recommended?
> 
> -Grant

Re: Re : Good starting instance for AMI

Posted by Grant Ingersoll <gs...@apache.org>.
One quick question for all who responded:
How many have tried Mahout with the setup they recommended?

-Grant

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem using Solr/Lucene: http://www.lucidimagination.com/search


Re: Re : Good starting instance for AMI

Posted by zaki rahaman <za...@gmail.com>.
Some comments on Cloudera's Hadoop (CDH) and Elastic MapReduce (EMR).

I have used both to get hadoop jobs up and running (although my EMR use has
mostly been limited to running batch Pig scripts weekly). Deciding on which
one to use really depends on what kind of job/data you're working with.

EMR is most useful if you're already storing the dataset you're using on S3
and plan on running a one-off job. My understanding is that it's configured
to use jets3t to stream data from s3 rather than copying it to the cluster,
which is fine for a single pass over a small to medium sized dataset, but
obviously slower for multiple passes or larger datasets. The API is also
useful if you have a set workflow that you plan to run on a regular basis,
and I often prototype quick and dirty jobs on very small EMR clusters to
test how some things run in the wild (obviously not the most cost effective
solution, but I've found pseudo-distributed mode doesn't catch everything).
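
In practice that just means the job's input and output paths can point
straight at the bucket via s3n:// URIs, something like the following (a rough
sketch with made-up names; the driver class and its arguments depend on which
Mahout job you run):

% hadoop jar mahout-examples.job <driver class> \
    s3n://my-bucket/input/ s3n://my-bucket/output/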

CDH gives you greater control over the initial setup and configuration of
your cluster. From my understanding, it's not really an AMI. Rather, it's a
set of Python scripts that's been modified from the ec2 scripts from
hadoop/contrib with some nifty additions like being able to specify and set
up EBS volumes, proxy on the cluster, and some others. The scripts use the
boto Python module (a very useful Python module for working with EC2) to
make a request to EC2 to set up a cluster of the specified size with whatever
vanilla AMI that's specified. It sets up the security groups and opens up
the relevant ports, and it then passes the init script to each of the
instances once they've booted (via the same user-data file mechanism, which I
believe is limited to 16K). The init script tells each node to download Hadoop (from
Cloudera's OS-specific repos) and any other user-specified packages and set
them up. The hadoop config xml is hardcoded into the init script (although
you can pass a modified config beforehand). The master is started first, and
then the slaves are started so that the slaves can be told which
NN (NameNode) and JT (JobTracker) to connect to (the config uses the public
things easier to set up). You can use either 0.18.3 (CDH) or 0.20 (CDH2)
when it comes to Hadoop versions, although I've had mixed results with the
latter.
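
To make that concrete, driving a cluster with those scripts looks roughly
like this (a sketch from memory of the Cloudera docs; the launch-cluster form
is the one shown earlier in this thread, so double-check the other
subcommands and the EBS/proxy options against the docs before relying on
them):

% hadoop-ec2 launch-cluster --env REPO=testing --env HADOOP_VERSION=0.20 \
    my-mahout-cluster 10
% hadoop-ec2 login my-mahout-cluster             # ssh to the master
% hadoop-ec2 terminate-cluster my-mahout-cluster # shut everything down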

Personally, I'd still like some kind of facade or something similar to
further abstract things and make it easier for others to quickly set up
ad-hoc clusters for 'quick n dirty' jobs. I know other libraries like Crane
have been released recently, but given the language of choice (Clojure), I
haven't yet had a chance to really investigate.

-- 
Zaki Rahaman

Re: Re : Good starting instance for AMI

Posted by Ted Dunning <te...@gmail.com>.
Just use several of these files.
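
In other words, split the dataset below the 5GB limit before uploading and
point the job at the whole directory; Hadoop happily takes a directory of
part files as input. A rough sketch (file names are made up, and s3cmd is
just one way to do the upload):

% split -b 4000m ratings.csv ratings-part-
% for f in ratings-part-*; do s3cmd put $f s3://my-bucket/input/$f; done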

On Sun, Jan 10, 2010 at 10:44 PM, Liang Chenmin <li...@gmail.com> wrote:

> EMR requires S3 bucket, but S3 instance has a limit of file
> size(5GB), so need some extra care here. Has any one encounter the file
> size
> problem on S3 also? I kind of think that it's unreasonable to have a  5G
> size limit when we want to use the system to deal with large data set.
>



-- 
Ted Dunning, CTO
DeepDyve

Re: Re : Good starting instance for AMI

Posted by Liang Chenmin <li...@gmail.com>.
I used EMR for our project, and it works. It took some time to set up
though. EMR requires an S3 bucket, but S3 objects have a file size limit
(5GB), so some extra care is needed here. Has anyone else encountered the
file size problem on S3? I kind of think it's unreasonable to have a 5GB
size limit when we want to use the system to deal with large data sets.
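
In case it saves someone else the trial and error, creating the job flow
from the command line looked roughly like this for us (a sketch from memory
of Amazon's Ruby client; the option names should be double-checked against
the current docs, and the bucket, jar and driver class are placeholders):

% elastic-mapreduce --create --name "mahout job" \
    --num-instances 4 --instance-type m1.small \
    --jar s3://my-bucket/mahout-examples.job \
    --arg <driver class> \
    --arg s3://my-bucket/input/ \
    --arg s3://my-bucket/output/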


-- 
Chenmin Liang
Language Technologies Institute, School of Computer Science
Carnegie Mellon University

Re: Re : Good starting instance for AMI

Posted by Ted Dunning <te...@gmail.com>.
This seems the easiest answer so far!

On Sun, Jan 10, 2010 at 8:03 PM, deneche abdelhakim <a_...@yahoo.fr> wrote:

>
> % hadoop-ec2 launch-cluster --env REPO=testing --env HADOOP_VERSION=0.20 \
>  my-hadoop-cluster 10




-- 
Ted Dunning, CTO
DeepDyve