Posted to user@mahout.apache.org by Grant Ingersoll <gs...@apache.org> on 2010/08/29 16:58:41 UTC

Clustering on Elastic Map Reduce

Has anyone successfully run any of the clustering algorithms on Amazon's Elastic Map Reduce?  If so, please share the steps.

Thanks,
Grant

Re: Clustering on Elastic Map Reduce

Posted by Sean Owen <sr...@gmail.com>.

I probably misunderstood the original problem. I had assumed the issue
was in getting the right .jar out to the workers. If it's just getting
stuff to the driver, yeah packaging properties files in the .jar file
should work.

On Mon, Sep 13, 2010 at 1:01 AM, Jake Mannix <ja...@gmail.com> wrote:
> Hmm?  Why would the workers need the driver.classes.props file?  It's what
> determines what MR job to run - once you're on a worker node, you're done
> with it, aren't you?  Or am I not following what the issue is...

Re: Clustering on Elastic Map Reduce

Posted by Jake Mannix <ja...@gmail.com>.

Hmm?  Why would the workers need the driver.classes.props file?  It's what
determines what MR job to run - once you're on a worker node, you're done
with it, aren't you?  Or am I not following what the issue is...

  -jake

On Sun, Sep 12, 2010 at 4:40 PM, Sean Owen <sr...@gmail.com> wrote:

> From the props file? My understanding is that it doesn't survive to the
> worker but does to the driver. Not quite so?
>
> On Sep 13, 2010 12:37 AM, "Jake Mannix" <ja...@gmail.com> wrote:
>
> On Sun, Sep 12, 2010 at 12:45 PM, Sean Owen <sr...@gmail.com> wrote:
>
> > 2 isn't how it is 'supposed...
> But where would the Driver get the values to put into the Configuration
> object?
>
>  -jake
>

Re: Clustering on Elastic Map Reduce

Posted by Sean Owen <sr...@gmail.com>.

From the props file? My understanding is that it doesn't survive to the
worker but does to the driver. Not quite so?

On Sep 13, 2010 12:37 AM, "Jake Mannix" <ja...@gmail.com> wrote:

On Sun, Sep 12, 2010 at 12:45 PM, Sean Owen <sr...@gmail.com> wrote:

> 2 isn't how it is 'supposed...
But where would the Driver get the values to put into the Configuration
object?

 -jake

Re: Clustering on Elastic Map Reduce

Posted by Jake Mannix <ja...@gmail.com>.

On Sun, Sep 12, 2010 at 12:45 PM, Sean Owen <sr...@gmail.com> wrote:

> 2 isn't how it is 'supposed' to work. The Configuration object is how you
> pass to the job any name - value pairs.
>
> The more right way is for the Driver to copy the properties entries into
> Configuration. Everything downstream can see that.
>
> I think we would do well to keep it simple here. There are already props
> files and two flavors of command line args in play for configuration.
>

But where would the Driver get the values to put into the Configuration
object?

  -jake

Re: Clustering on Elastic Map Reduce

Posted by Sean Owen <sr...@gmail.com>.

2 isn't how it is 'supposed' to work. The Configuration object is how you
pass to the job any name - value pairs.

The more right way is for the Driver to copy the properties entries into
Configuration. Everything downstream can see that.
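
A minimal sketch of that copy step, not Mahout's actual code. It assumes the driver has already loaded a java.util.Properties; the class name PropsToConfiguration and the helper copyInto are hypothetical. To stay runnable without Hadoop on the classpath it writes into any (name, value) sink; in a real driver the sink would presumably be Hadoop's Configuration.set(String, String), passed as conf::set.

```java
import java.util.Map;
import java.util.Properties;
import java.util.TreeMap;
import java.util.function.BiConsumer;

public class PropsToConfiguration {

    // Copies every string entry of props into the sink and returns how many
    // entries were copied. With Hadoop available, the sink would be conf::set.
    public static int copyInto(Properties props, BiConsumer<String, String> sink) {
        int copied = 0;
        // stringPropertyNames() also includes keys inherited from defaults.
        for (String name : props.stringPropertyNames()) {
            sink.accept(name, props.getProperty(name));
            copied++;
        }
        return copied;
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("example.key", "example value"); // hypothetical entry

        // A plain map stands in for the Hadoop Configuration object here.
        Map<String, String> standIn = new TreeMap<>();
        int copied = copyInto(props, standIn::put);
        System.out.println(copied + " entries copied: " + standIn);
    }
}
```

Once the values are in the job's Configuration, the mappers and reducers can read them back by name, so nothing on the worker side needs the props file itself.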

I think we would do well to keep it simple here. There are already props
files and two flavors of command line args in play for configuration.

Sean

On Sep 12, 2010 7:41 PM, "Ted Dunning" <te...@gmail.com> wrote:
> The reflection option sounds dangerous because it isn't clear that the
> classes will be loaded yet which would mean that they wouldn't be seen.
>
> Option 2 is, as you say, relatively simple.
>
> On Sun, Sep 12, 2010 at 10:21 AM, Grant Ingersoll <gsingers@apache.org> wrote:
>
>> My first thought is to create a JOB jar that contains the properties, but
>> the thought occurred to me that there might be a way to enhance the
>> classpath. Other thoughts:
>> 1. Instead of requiring driver.classes.props, we could just have an
>> Interface that each of those drivers implements that reports its short name
>> and description and then we just need to do some reflection at startup to
>> get all implementers of the interface.
>> 2. We create a "default.driver.classes.props" that is actually packaged
>> into the JOB jar. We first look for driver.classes.props then we look for
>> default.driver.classes.props, then we throw an exception.
>>
>> I guess my preference is #2, since that is the least code, still allows the
>> existing functionality to work and provides reasonable defaults w/o any
>> setup.
>>

Re: Clustering on Elastic Map Reduce

Posted by Grant Ingersoll <gs...@apache.org>.

On Sep 12, 2010, at 7:35 PM, Jake Mannix wrote:

> On Sun, Sep 12, 2010 at 12:23 PM, Grant Ingersoll <gs...@apache.org>wrote:
> 
>> 
>>> Option 2 is, as you say, relatively simple.
>> 
>> I have this working and will post/commit a patch.

https://issues.apache.org/jira/browse/MAHOUT-500 has the patch.  It's a pretty trivial change and I just use the existing driver.classes.props file (renaming it) so that we don't have to maintain two copies.

-Grant

Re: Clustering on Elastic Map Reduce

Posted by Jake Mannix <ja...@gmail.com>.

On Sun, Sep 12, 2010 at 12:23 PM, Grant Ingersoll <gs...@apache.org>wrote:

>
> > Option 2 is, as you say, relatively simple.
>
> I have this working and will post/commit a patch.


+1 on this - it was what I'd had in mind originally with the
driver.classes.props file.  In fact, it's the one and only .props file in the
conf directory which is "required", and is only accessible to users because
they can easily add their own driver classes which would be used by the
MahoutDriver by editing this file.  Having a default set of values either in
the .job file, or hardcoded into the MahoutDriver would make sure that file
isn't needed for the general use case.

  -jake

Re: Clustering on Elastic Map Reduce

Posted by Grant Ingersoll <gs...@apache.org>.

On Sep 12, 2010, at 2:40 PM, Ted Dunning wrote:

> The reflection option sounds dangerous because it isn't clear that the
> classes will be loaded yet which would mean that they wouldn't be seen.

Agreed.

> 
> Option 2 is, as you say, relatively simple.

I have this working and will post/commit a patch.

> 
> On Sun, Sep 12, 2010 at 10:21 AM, Grant Ingersoll <gs...@apache.org>wrote:
> 
>> My first thought is to create a JOB jar that contains the properties, but
>> the thought occurred to me that there might be a way to enhance the
>> classpath.  Other thoughts:
>> 1. Instead of requiring driver.classes.props, we could just have an
>> Interface that each of those drivers implements that reports its short name
>> and description and then we just need to do some reflection at startup to
>> get all implementers of the interface.
>> 2. We create a "default.driver.classes.props" that is actually packaged
>> into the JOB jar.  We first look for driver.classes.props then we look for
>> default.driver.classes.props, then we throw an exception.
>> 
>> I guess my preference is #2, since that is the least code, still allows the
>> existing functionality to work and provides reasonable defaults w/o any
>> setup.
>> 

--------------------------
Grant Ingersoll
http://lucenerevolution.org Apache Lucene/Solr Conference, Boston Oct 7-8

Re: Clustering on Elastic Map Reduce

Posted by Ted Dunning <te...@gmail.com>.

The reflection option sounds dangerous because it isn't clear that the
classes will be loaded yet which would mean that they wouldn't be seen.

Option 2 is, as you say, relatively simple.

On Sun, Sep 12, 2010 at 10:21 AM, Grant Ingersoll <gs...@apache.org>wrote:

> My first thought is to create a JOB jar that contains the properties, but
> the thought occurred to me that there might be a way to enhance the
> classpath.  Other thoughts:
> 1. Instead of requiring driver.classes.props, we could just have an
> Interface that each of those drivers implements that reports its short name
> and description and then we just need to do some reflection at startup to
> get all implementers of the interface.
> 2. We create a "default.driver.classes.props" that is actually packaged
> into the JOB jar.  We first look for driver.classes.props then we look for
> default.driver.classes.props, then we throw an exception.
>
> I guess my preference is #2, since that is the least code, still allows the
> existing functionality to work and provides reasonable defaults w/o any
> setup.
>

Re: Clustering on Elastic Map Reduce

Posted by Grant Ingersoll <gs...@apache.org>.

moving to dev@

So, I can run KMeansDriver directly on EMR, but what I actually want to do is run MahoutDriver on EMR.  The only sticking point is these lines:
<snip classname="MahoutDriver">
InputStream propsStream = Thread.currentThread()
    .getContextClassLoader()
    .getResourceAsStream("driver.classes.props");

mainClasses.load(propsStream);
</snip>

because the properties files are not on the classpath that EMR gets.

Anyone have suggestions on working around this?  

My first thought is to create a JOB jar that contains the properties, but the thought occurred to me that there might be a way to enhance the classpath.  Other thoughts:
1. Instead of requiring driver.classes.props, we could have an Interface that each of those drivers implements and that reports its short name and description; then we just need to do some reflection at startup to get all implementers of the interface.
2. We create a "default.driver.classes.props" that is actually packaged into the JOB jar.  We first look for driver.classes.props then we look for default.driver.classes.props, then we throw an exception.

I guess my preference is #2, since that is the least code, still allows the existing functionality to work and provides reasonable defaults w/o any setup.
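
A minimal sketch of the lookup order in #2, under stated assumptions: the class name DriverPropsLoader and the helper loadFirstAvailable are hypothetical, and this is not the patch itself. It tries each resource name in order on the context class loader and fails only when none of them is found.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.Properties;

public class DriverPropsLoader {

    // Returns the properties from the first resource name that the class
    // loader can find; throws IllegalStateException if none is present.
    public static Properties loadFirstAvailable(ClassLoader loader, String... names) {
        for (String name : names) {
            // getResourceAsStream returns null when the resource is missing;
            // try-with-resources tolerates a null resource.
            try (InputStream in = loader.getResourceAsStream(name)) {
                if (in != null) {
                    Properties props = new Properties();
                    props.load(in);
                    return props;
                }
            } catch (IOException e) {
                throw new UncheckedIOException("Could not read " + name, e);
            }
        }
        throw new IllegalStateException(
            "None of these resources is on the classpath: " + String.join(", ", names));
    }

    public static void main(String[] args) {
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        try {
            // User-supplied file first, then the default packaged in the job jar.
            Properties mainClasses = loadFirstAvailable(
                loader, "driver.classes.props", "default.driver.classes.props");
            System.out.println("Loaded " + mainClasses.size() + " driver entries");
        } catch (IllegalStateException e) {
            System.err.println(e.getMessage());
        }
    }
}
```

Checking the user's file first keeps the existing override behavior, while the packaged default means a bare job jar on EMR still has something to load.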

Thoughts?

-Grant

On Sep 12, 2010, at 8:07 AM, Grant Ingersoll wrote:

> 
> On Sep 12, 2010, at 7:42 AM, Grant Ingersoll wrote:
> 
>> 
>> On Sep 11, 2010, at 10:11 PM, Drew Farris wrote:
>> 
>>> I will write up notes on the EMR wiki page.
> 
> https://cwiki.apache.org/confluence/display/MAHOUT/Mahout+on+Elastic+MapReduce is updated to 0.4-SNAPSHOT.
> 
> -Grant
> 
>

Re: Clustering on Elastic Map Reduce

Posted by Grant Ingersoll <gs...@apache.org>.

On Sep 12, 2010, at 7:42 AM, Grant Ingersoll wrote:

> 
> On Sep 11, 2010, at 10:11 PM, Drew Farris wrote:
> 
>> I will write up notes on the EMR wiki page.

https://cwiki.apache.org/confluence/display/MAHOUT/Mahout+on+Elastic+MapReduce is updated to 0.4-SNAPSHOT.

-Grant

Re: Clustering on Elastic Map Reduce

Posted by Grant Ingersoll <gs...@apache.org>.

On Sep 11, 2010, at 10:11 PM, Drew Farris wrote:

> Congratulations!
> 
> What's the best way to send messages back to the caller of an EMR job,
> using stderr instead of the log framework here?

It probably makes sense to have any command line errors, etc. go to stderr instead of logging the exception, but this may just be a relic of EMR and the way it configures logging.  I will write up notes on the EMR wiki page.

I just committed one minor fix that should help, namely printing out the OptionException when it occurs as part of the generic print options method.



> 
> On Sat, Sep 11, 2010 at 9:32 PM, Grant Ingersoll <gs...@apache.org> wrote:
>> And indeed, running this via the Ruby CLI works as well.  Woo hoo!
>> 
>> -Grant
>> 
>> On Sep 11, 2010, at 9:01 PM, Grant Ingersoll wrote:
>> 
>>> 
>>> On Sep 11, 2010, at 8:02 PM, Grant Ingersoll wrote:
>>> 
>>>> I've made a little bit of progress here, but not much.  Here's what I ran:
>>>> 
>>>> elastic-mapreduce -j <JOB>  --jar s3n://news-vecs/mahout-core-0.4-SNAPSHOT.job  --main-class org.apache.mahout.clustering.kmeans.KMeansDriver --arg --input --arg s3n://news-vecs/part-out.vec --arg --clusters --arg s3n://news-vecs/kmeans/clusters/ --arg
>>> 
>>> 
>>>> --k
>>> 
>>> Ugh.  It's -k, not --k.
>>> 
>>> So, this bit of code could likely be more useful:
>>> } catch (IllegalArgumentException e) {
>>>      log.error(e.getMessage());
>>>      CommandLineUtil.printHelpWithGenericOptions(group);
>>>      return null;
>>>    }
>>> 
>>> Since, at least on EMR, the logs tend to get buried and it writes it out to syslog, not stderr or stdout.
>>> 
>>> I have it running now by logging into the EMR instance using SSH and then I also specifically uploaded my Vector file to HDFS by hand.  In other words, I'm not using the remote Ruby CLI just yet.
>>> 
>>> Progress.  Sigh.
>>> 
>>> -Grant
>> 
>> --------------------------
>> Grant Ingersoll
>> http://lucenerevolution.org Apache Lucene/Solr Conference, Boston Oct 7-8
>> 
>> 

--------------------------
Grant Ingersoll
http://lucenerevolution.org Apache Lucene/Solr Conference, Boston Oct 7-8

Re: Clustering on Elastic Map Reduce

Posted by Drew Farris <dr...@apache.org>.

Congratulations!

What's the best way to send messages back to the caller of an EMR job,
using stderr instead of the log framework here?

On Sat, Sep 11, 2010 at 9:32 PM, Grant Ingersoll <gs...@apache.org> wrote:
> And indeed, running this via the Ruby CLI works as well.  Woo hoo!
>
> -Grant
>
> On Sep 11, 2010, at 9:01 PM, Grant Ingersoll wrote:
>
>>
>> On Sep 11, 2010, at 8:02 PM, Grant Ingersoll wrote:
>>
>>> I've made a little bit of progress here, but not much.  Here's what I ran:
>>>
>>> elastic-mapreduce -j <JOB>  --jar s3n://news-vecs/mahout-core-0.4-SNAPSHOT.job  --main-class org.apache.mahout.clustering.kmeans.KMeansDriver --arg --input --arg s3n://news-vecs/part-out.vec --arg --clusters --arg s3n://news-vecs/kmeans/clusters/ --arg
>>
>>
>>> --k
>>
>> Ugh.  It's -k, not --k.
>>
>> So, this bit of code could likely be more useful:
>> } catch (IllegalArgumentException e) {
>>      log.error(e.getMessage());
>>      CommandLineUtil.printHelpWithGenericOptions(group);
>>      return null;
>>    }
>>
>> Since, at least on EMR, the logs tend to get buried and it writes it out to syslog, not stderr or stdout.
>>
>> I have it running now by logging into the EMR instance using SSH and then I also specifically uploaded my Vector file to HDFS by hand.  In other words, I'm not using the remote Ruby CLI just yet.
>>
>> Progress.  Sigh.
>>
>> -Grant
>
> --------------------------
> Grant Ingersoll
> http://lucenerevolution.org Apache Lucene/Solr Conference, Boston Oct 7-8
>
>

Re: Clustering on Elastic Map Reduce

Posted by Grant Ingersoll <gs...@apache.org>.

And indeed, running this via the Ruby CLI works as well.  Woo hoo!

-Grant

On Sep 11, 2010, at 9:01 PM, Grant Ingersoll wrote:

> 
> On Sep 11, 2010, at 8:02 PM, Grant Ingersoll wrote:
> 
>> I've made a little bit of progress here, but not much.  Here's what I ran:
>> 
>> elastic-mapreduce -j <JOB>  --jar s3n://news-vecs/mahout-core-0.4-SNAPSHOT.job  --main-class org.apache.mahout.clustering.kmeans.KMeansDriver --arg --input --arg s3n://news-vecs/part-out.vec --arg --clusters --arg s3n://news-vecs/kmeans/clusters/ --arg
> 
> 
>> --k
> 
> Ugh.  It's -k, not --k.  
> 
> So, this bit of code could likely be more useful:
> } catch (IllegalArgumentException e) {
>      log.error(e.getMessage());
>      CommandLineUtil.printHelpWithGenericOptions(group);
>      return null;
>    }
> 
> Since, at least on EMR, the logs tend to get buried and it writes it out to syslog, not stderr or stdout.
> 
> I have it running now by logging into the EMR instance using SSH and then I also specifically uploaded my Vector file to HDFS by hand.  In other words, I'm not using the remote Ruby CLI just yet.
> 
> Progress.  Sigh.
> 
> -Grant

--------------------------
Grant Ingersoll
http://lucenerevolution.org Apache Lucene/Solr Conference, Boston Oct 7-8

Re: Clustering on Elastic Map Reduce

Posted by Grant Ingersoll <gs...@apache.org>.

On Sep 11, 2010, at 8:02 PM, Grant Ingersoll wrote:

> I've made a little bit of progress here, but not much.  Here's what I ran:
> 
> elastic-mapreduce -j <JOB>  --jar s3n://news-vecs/mahout-core-0.4-SNAPSHOT.job  --main-class org.apache.mahout.clustering.kmeans.KMeansDriver --arg --input --arg s3n://news-vecs/part-out.vec --arg --clusters --arg s3n://news-vecs/kmeans/clusters/ --arg

> --k

Ugh.  It's -k, not --k.  

So, this bit of code could likely be more useful:
} catch (IllegalArgumentException e) {
      log.error(e.getMessage());
      CommandLineUtil.printHelpWithGenericOptions(group);
      return null;
    }

Since, at least on EMR, the logs tend to get buried: the message is written to syslog, not stderr or stdout.
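
A minimal sketch of what "more useful" could look like, assuming we simply print the message to stderr alongside the usage text. The class StderrOnBadArgs and the tiny parseK parser are hypothetical stand-ins for the real option parsing, not Mahout code. Exiting non-zero is also an assumption on my part.

```java
import java.io.PrintStream;

public class StderrOnBadArgs {

    // Hypothetical stand-in for the real option parser: finds the value of -k.
    static int parseK(String[] args) {
        for (int i = 0; i + 1 < args.length; i++) {
            if (args[i].equals("-k") || args[i].equals("--numClusters")) {
                return Integer.parseInt(args[i + 1]);
            }
        }
        throw new IllegalArgumentException("Missing required option -k / --numClusters");
    }

    // Returns k, or null after reporting the problem on the given stream.
    public static Integer parseOrReport(String[] args, PrintStream err) {
        try {
            return parseK(args);
        } catch (IllegalArgumentException e) {
            // Write to the stream the caller actually sees, not only to the log.
            err.println("Error: " + e.getMessage());
            err.println("usage: <command> [Generic Options] [Job-Specific Options]");
            return null;
        }
    }

    public static void main(String[] args) {
        Integer k = parseOrReport(args, System.err);
        if (k == null) {
            System.exit(1); // non-zero so the caller can tell the step failed
        }
        System.out.println("k = " + k);
    }
}
```

With something like this, a mistyped option such as --k would show up directly in the step's stderr instead of only in syslog.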

I have it running now by logging into the EMR instance using SSH and then I also specifically uploaded my Vector file to HDFS by hand.  In other words, I'm not using the remote Ruby CLI just yet.

Progress.  Sigh.

-Grant

Re: Clustering on Elastic Map Reduce

Posted by Grant Ingersoll <gs...@apache.org>.

I've made a little bit of progress here, but not much.  Here's what I ran:

elastic-mapreduce -j <JOB>  --jar s3n://news-vecs/mahout-core-0.4-SNAPSHOT.job  --main-class org.apache.mahout.clustering.kmeans.KMeansDriver --arg --input --arg s3n://news-vecs/part-out.vec --arg --clusters --arg s3n://news-vecs/kmeans/clusters/ --arg --k --arg 10 --arg --output --arg s3n://news-vecs/out/ --arg --distanceMeasure --arg  org.apache.mahout.common.distance.CosineDistanceMeasure --arg --convergenceDelta --arg 0.001 --arg --overwrite --arg --maxIter --arg 50 --arg --clustering -v --debug

In the controller log, I see:
2010-09-11T23:49:16.958Z INFO Fetching jar file.
2010-09-11T23:49:20.723Z INFO Working dir /mnt/var/lib/hadoop/steps/1
2010-09-11T23:49:20.723Z INFO Executing /usr/lib/jvm/java-6-sun/bin/java -cp /home/hadoop/conf:/usr/lib/jvm/java-6-sun/lib/tools.jar:/home/hadoop:/home/hadoop/hadoop-0.20-core.jar:/home/hadoop/hadoop-0.20-tools.jar:/home/hadoop/lib/*:/home/hadoop/lib/jetty-ext/* -Xmx1000m -Dhadoop.log.dir=/mnt/var/log/hadoop/steps/1 -Dhadoop.log.file=syslog -Dhadoop.home.dir=/home/hadoop -Dhadoop.id.str=hadoop -Dhadoop.root.logger=INFO,DRFA -Djava.io.tmpdir=/mnt/var/lib/hadoop/steps/1/tmp -Djava.library.path=/home/hadoop/lib/native/Linux-i386-32 org.apache.hadoop.util.RunJar /mnt/var/lib/hadoop/steps/1/mahout-core-0.4-SNAPSHOT.job org.apache.mahout.clustering.kmeans.KMeansDriver --input s3n://news-vecs/part-out.vec --clusters s3n://news-vecs/kmeans/clusters/ --k 10 --output s3n://news-vecs/out/ --distanceMeasure org.apache.mahout.common.distance.CosineDistanceMeasure --convergenceDelta 0.001 --overwrite --maxIter 50 --clustering
2010-09-11T23:49:23.302Z INFO Execution ended with ret val 0
2010-09-11T23:49:25.415Z INFO Step created jobs: 
2010-09-11T23:49:25.416Z INFO Step succeeded

But, then in stdout log I see:
<snip>
usage: <command> [Generic Options] [Job-Specific Options]
Generic Options:
 -archives <paths>             comma separated archives to be unarchived
                               on the compute machines.
 -conf <configuration file>    specify an application configuration file
 -D <property=value>           use value for given property
 -files <paths>                comma separated files to be copied to the
                               map reduce cluster
 -fs <local|namenode:port>     specify a namenode
 -jt <local|jobtracker:port>   specify a job tracker
 -libjars <paths>              comma separated jar files to include in the
                               classpath.
Job-Specific Options:                                                           
  --input (-i) input                           Path to job input directory.     
  --output (-o) output                         The directory pathname for       
                                               output.                          
  --distanceMeasure (-dm) distanceMeasure      The classname of the             
                                               DistanceMeasure. Default is      
                                               SquaredEuclidean                 
  --clusters (-c) clusters                     The input centroids, as Vectors. 
                                               Must be a SequenceFile of        
                                               Writable, Cluster/Canopy.  If k  
                                               is also specified, then a random 
                                               set of vectors will be selected  
                                               and written out to this path     
                                               first                            
  --numClusters (-k) k                         The k in k-Means.  If specified, 
                                               then a random selection of k     
                                               Vectors will be chosen as the    
                                               Centroid and written to the      
                                               clusters input path.             
  --convergenceDelta (-cd) convergenceDelta    The convergence delta value.     
                                               Default is 0.5                   
  --maxIter (-x) maxIter                       The maximum number of            
                                               iterations.                      
  --overwrite (-ow)                            If present, overwrite the output 
                                               directory before running job     
  --maxRed (-r) maxRed                         The number of reduce tasks.      
                                               Defaults to 2                    
  --clustering (-cl)                           If present, run clustering after 
                                               the iterations have taken place  
  --method (-xm) method                        The execution method to use:     
                                               sequential or mapreduce. Default 
                                               is mapreduce                     
  --help (-h)                                  Print out help                   
  --tempDir tempDir                            Intermediate output directory    
  --startPhase startPhase                      First phase to run               
  --endPhase endPhase                          Last phase to run                
</snip>

Which, of course, shows that it isn't getting the arguments.  Perhaps it's the s3n:// paths?  I'm going to try running from ssh.

-Grant



On Sep 2, 2010, at 1:04 PM, Drew Farris wrote:

> Were there specific issues you ran into? I suspect the documentation
> on the wiki is out of date.
> 
> Drew
> 
> On Sun, Aug 29, 2010 at 10:58 AM, Grant Ingersoll <gs...@apache.org> wrote:
>> Has anyone successfully run any of the clustering algorithms on Amazon's Elastic Map Reduce?  If so, please share steps please.
>> 
>> Thanks,
>> Grant

--------------------------
Grant Ingersoll
http://lucenerevolution.org Apache Lucene/Solr Conference, Boston Oct 7-8

Re: Clustering on Elastic Map Reduce

Posted by Grant Ingersoll <gs...@apache.org>.

On Sep 2, 2010, at 2:23 PM, Sebastian Schelter wrote:

> I've been using the Java SDK ( http://aws.amazon.com/sdkforjava/) to run
> recommender stuff on EMR. It's working really well and the coding was
> pretty straightforward. I could provide some sample code in case
> somebody wants to see that.

I was wondering if we couldn't just factor this into the bin/mahout script?  I think that would be pretty cool for our end users, giving them pretty much EMR via a very simple CLI.  Users would be able to go from local to EMR to own Hadoop cluster w/o changing much more than a few CLI parameters.

-Grant

Re: Clustering on Elastic Map Reduce

Posted by Sebastian Schelter <ss...@apache.org>.

I've been using the Java SDK ( http://aws.amazon.com/sdkforjava/) to run
recommender stuff on EMR. It's working really well and the coding was
pretty straightforward. I could provide some sample code in case
somebody wants to see that.

--sebastian

Am 02.09.2010 20:05, schrieb Grant Ingersoll:
> In talking w/ Jake, he said he launched EMR and then SSH'd in and ran that way.  I'm going to try that next, as the Ruby CLI and the Console haven't worked for me.  It's weird, it invokes the main(), but it doesn't seem to pass in the args.  I will update as I progress.  I'm trying to do some benchmarking of clustering on there.
>
> Might be a fun thing to debug at our Bay Area meetup...
>
> -Grant
>
> On Sep 2, 2010, at 1:37 PM, Drew Farris wrote:
>
>   
>> Jeff,
>>
>> I'm not sure we're talking about the same documentation on the wiki. I
>> was looking at the page:
>> https://cwiki.apache.org/confluence/display/MAHOUT/Mahout+on+Elastic+MapReduce
>>
>> You're referring to the following page, correct?
>> https://cwiki.apache.org/confluence/display/MAHOUT/Mahout+on+Amazon+EC2
>>
>> I've tried these since the change to 0.20.2 and they work for me too,
>> point taken about updating this to use the latest CDH. I haven't tried
>> running on Elastic MapReduce either.
>>
>> Drew
>>
>> On Thu, Sep 2, 2010 at 1:26 PM, Jeff Eastman <jd...@windwardsolutions.com> wrote:
>>     
>>>  The documentation on the wiki is about building an AMI for EC2 and is out
>>> of date since Cloudera has released a 0.20.2 AMI. Now that EMR supports our
>>> 0.20.2 as well the step of building an AMI should not be needed. But I've
>>> not tried clustering on EMR, only EC2, and the wiki instructions were
>>> accurate then.
>>>
>>> On 9/2/10 10:04 AM, Drew Farris wrote:
>>>       
>>>> Were there specific issues you ran into? I suspect the documentation
>>>> on the wiki is out of date.
>>>>
>>>> Drew
>>>>
>>>> On Sun, Aug 29, 2010 at 10:58 AM, Grant Ingersoll<gs...@apache.org>
>>>>  wrote:
>>>>         
>>>>> Has anyone successfully run any of the clustering algorithms on Amazon's
>>>>> Elastic Map Reduce?  If so, please share steps please.
>>>>>
>>>>> Thanks,
>>>>> Grant
>>>>>           
>>>
>>>       
> --------------------------
> Grant Ingersoll
> http://lucenerevolution.org Apache Lucene/Solr Conference, Boston Oct 7-8
>
>

Re: Clustering on Elastic Map Reduce

Posted by Grant Ingersoll <gs...@apache.org>.

In talking w/ Jake, he said he launched EMR and then SSH'd in and ran that way.  I'm going to try that next, as the Ruby CLI and the Console haven't worked for me.  It's weird, it invokes the main(), but it doesn't seem to pass in the args.  I will update as I progress.  I'm trying to do some benchmarking of clustering on there.

Might be a fun thing to debug at our Bay Area meetup...

-Grant

On Sep 2, 2010, at 1:37 PM, Drew Farris wrote:

> Jeff,
> 
> I'm not sure we're talking about the same documentation on the wiki. I
> was looking at the page:
> https://cwiki.apache.org/confluence/display/MAHOUT/Mahout+on+Elastic+MapReduce
> 
> You're referring to the following page, correct?
> https://cwiki.apache.org/confluence/display/MAHOUT/Mahout+on+Amazon+EC2
> 
> I've tried these since the change to 0.20.2 and they work for me too,
> point taken about updating this to use the latest CDH. I haven't tried
> running on Elastic MapReduce either.
> 
> Drew
> 
> On Thu, Sep 2, 2010 at 1:26 PM, Jeff Eastman <jd...@windwardsolutions.com> wrote:
>>  The documentation on the wiki is about building an AMI for EC2 and is out
>> of date since Cloudera has released a 0.20.2 AMI. Now that EMR supports our
>> 0.20.2 as well the step of building an AMI should not be needed. But I've
>> not tried clustering on EMR, only EC2, and the wiki instructions were
>> accurate then.
>> 
>> On 9/2/10 10:04 AM, Drew Farris wrote:
>>> 
>>> Were there specific issues you ran into? I suspect the documentation
>>> on the wiki is out of date.
>>> 
>>> Drew
>>> 
>>> On Sun, Aug 29, 2010 at 10:58 AM, Grant Ingersoll<gs...@apache.org>
>>>  wrote:
>>>> 
>>>> Has anyone successfully run any of the clustering algorithms on Amazon's
>>>> Elastic Map Reduce?  If so, please share steps please.
>>>> 
>>>> Thanks,
>>>> Grant
>> 
>> 

--------------------------
Grant Ingersoll
http://lucenerevolution.org Apache Lucene/Solr Conference, Boston Oct 7-8

Re: Clustering on Elastic Map Reduce

Posted by Jeff Eastman <jd...@windwardsolutions.com>.

  Yes Drew, you're right, I hadn't seen that EMR page. It looks more 
difficult to configure than building my own AMI was :).

On 9/2/10 10:37 AM, Drew Farris wrote:
> Jeff,
>
> I'm not sure we're talking about the same documentation on the wiki. I
> was looking at the page:
> https://cwiki.apache.org/confluence/display/MAHOUT/Mahout+on+Elastic+MapReduce
>
> You're referring to the following page, correct?
> https://cwiki.apache.org/confluence/display/MAHOUT/Mahout+on+Amazon+EC2
>
> I've tried these since the change to 0.20.2 and they work for me too,
> point taken about updating this to use the latest CDH. I haven't tried
> running on Elastic MapReduce either.
>
> Drew
>
> On Thu, Sep 2, 2010 at 1:26 PM, Jeff Eastman<jd...@windwardsolutions.com>  wrote:
>>   The documentation on the wiki is about building an AMI for EC2 and is out
>> of date since Cloudera has released a 0.20.2 AMI. Now that EMR supports our
>> 0.20.2 as well the step of building an AMI should not be needed. But I've
>> not tried clustering on EMR, only EC2, and the wiki instructions were
>> accurate then.
>>
>> On 9/2/10 10:04 AM, Drew Farris wrote:
>>> Were there specific issues you ran into? I suspect the documentation
>>> on the wiki is out of date.
>>>
>>> Drew
>>>
>>> On Sun, Aug 29, 2010 at 10:58 AM, Grant Ingersoll<gs...@apache.org>
>>>   wrote:
>>>> Has anyone successfully run any of the clustering algorithms on Amazon's
>>>> Elastic Map Reduce?  If so, please share steps please.
>>>>
>>>> Thanks,
>>>> Grant
>>

Re: Clustering on Elastic Map Reduce

Posted by Drew Farris <dr...@apache.org>.

Jeff,

I'm not sure we're talking about the same documentation on the wiki. I
was looking at the page:
https://cwiki.apache.org/confluence/display/MAHOUT/Mahout+on+Elastic+MapReduce

You're referring to the following page, correct?
https://cwiki.apache.org/confluence/display/MAHOUT/Mahout+on+Amazon+EC2

I've tried these since the change to 0.20.2 and they work for me too,
point taken about updating this to use the latest CDH. I haven't tried
running on Elastic MapReduce either.

Drew

On Thu, Sep 2, 2010 at 1:26 PM, Jeff Eastman <jd...@windwardsolutions.com> wrote:
>  The documentation on the wiki is about building an AMI for EC2 and is out
> of date since Cloudera has released a 0.20.2 AMI. Now that EMR supports our
> 0.20.2 as well the step of building an AMI should not be needed. But I've
> not tried clustering on EMR, only EC2, and the wiki instructions were
> accurate then.
>
> On 9/2/10 10:04 AM, Drew Farris wrote:
>>
>> Were there specific issues you ran into? I suspect the documentation
>> on the wiki is out of date.
>>
>> Drew
>>
>> On Sun, Aug 29, 2010 at 10:58 AM, Grant Ingersoll<gs...@apache.org>
>>  wrote:
>>>
>>> Has anyone successfully run any of the clustering algorithms on Amazon's
>>> Elastic Map Reduce?  If so, please share steps please.
>>>
>>> Thanks,
>>> Grant
>
>

Re: Clustering on Elastic Map Reduce

Posted by Jeff Eastman <jd...@windwardsolutions.com>.

  The documentation on the wiki is about building an AMI for EC2 and is 
out of date since Cloudera has released a 0.20.2 AMI. Now that EMR 
supports our 0.20.2 as well the step of building an AMI should not be 
needed. But I've not tried clustering on EMR, only EC2, and the wiki 
instructions were accurate then.

On 9/2/10 10:04 AM, Drew Farris wrote:
> Were there specific issues you ran into? I suspect the documentation
> on the wiki is out of date.
>
> Drew
>
> On Sun, Aug 29, 2010 at 10:58 AM, Grant Ingersoll<gs...@apache.org>  wrote:
>> Has anyone successfully run any of the clustering algorithms on Amazon's Elastic Map Reduce?  If so, please share steps please.
>>
>> Thanks,
>> Grant

Re: Clustering on Elastic Map Reduce

Posted by Grant Ingersoll <gs...@apache.org>.

On Sep 2, 2010, at 1:04 PM, Drew Farris wrote:

> Were there specific issues you ran into? I suspect the documentation
> on the wiki is out of date.

It definitely is.  I posted my steps earlier and will try to update w/ my latest soon.


> 
> Drew
> 
> On Sun, Aug 29, 2010 at 10:58 AM, Grant Ingersoll <gs...@apache.org> wrote:
>> Has anyone successfully run any of the clustering algorithms on Amazon's Elastic Map Reduce?  If so, please share steps please.
>> 
>> Thanks,
>> Grant

--------------------------
Grant Ingersoll
http://lucenerevolution.org Apache Lucene/Solr Conference, Boston Oct 7-8

Re: Clustering on Elastic Map Reduce

Posted by Drew Farris <dr...@apache.org>.

Were there specific issues you ran into? I suspect the documentation
on the wiki is out of date.

Drew

On Sun, Aug 29, 2010 at 10:58 AM, Grant Ingersoll <gs...@apache.org> wrote:
> Has anyone successfully run any of the clustering algorithms on Amazon's Elastic Map Reduce?  If so, please share steps please.
>
> Thanks,
> Grant