Posted to user@mahout.apache.org by Grant Ingersoll <gs...@apache.org> on 2010/08/29 16:58:41 UTC
Clustering on Elastic Map Reduce
Has anyone successfully run any of the clustering algorithms on Amazon's Elastic Map Reduce? If so, please share the steps.
Thanks,
Grant
Re: Clustering on Elastic Map Reduce
Posted by Sean Owen <sr...@gmail.com>.
I probably misunderstood the original problem. I had assumed the issue
was in getting the right .jar out to the workers. If it's just getting
stuff to the driver, yeah packaging properties files in the .jar file
should work.
On Mon, Sep 13, 2010 at 1:01 AM, Jake Mannix <ja...@gmail.com> wrote:
> Hmm? Why would the workers need the driver.classes.props file? It's what
> determines what MR job to run - once you're on a worker node, you're done
> with it, aren't you? Or am I not following what the issue is...
Re: Clustering on Elastic Map Reduce
Posted by Jake Mannix <ja...@gmail.com>.
Hmm? Why would the workers need the driver.classes.props file? It's what
determines what MR job to run - once you're on a worker node, you're done
with it, aren't you? Or am I not following what the issue is...
-jake
On Sun, Sep 12, 2010 at 4:40 PM, Sean Owen <sr...@gmail.com> wrote:
> From the props file? My understanding is that it doesn't survive to the
> worker but does to the driver. Not quite so?
>
> On Sep 13, 2010 12:37 AM, "Jake Mannix" <ja...@gmail.com> wrote:
>
> On Sun, Sep 12, 2010 at 12:45 PM, Sean Owen <sr...@gmail.com> wrote:
>
> > 2 isn't how it is 'supposed...
> But where would the Driver get the values to put into the Configuration
> object?
>
> -jake
>
Re: Clustering on Elastic Map Reduce
Posted by Sean Owen <sr...@gmail.com>.
From the props file? My understanding is that it doesn't survive to the
worker but does to the driver. Not quite so?
On Sep 13, 2010 12:37 AM, "Jake Mannix" <ja...@gmail.com> wrote:
On Sun, Sep 12, 2010 at 12:45 PM, Sean Owen <sr...@gmail.com> wrote:
> 2 isn't how it is 'supposed...
But where would the Driver get the values to put into the Configuration
object?
-jake
Re: Clustering on Elastic Map Reduce
Posted by Jake Mannix <ja...@gmail.com>.
On Sun, Sep 12, 2010 at 12:45 PM, Sean Owen <sr...@gmail.com> wrote:
> 2 isn't how it is 'supposed' to work. The Configuration object is how you
> pass to the job any name - value pairs.
>
> The more right way is for the Driver to copy the properties entries into
> Configuration. Everything downstream can see that.
>
> I think we would do well to keep it simple here. There are already props
> files and two flavors of command line args in play for configuration.
>
But where would the Driver get the values to put into the Configuration
object?
-jake
Re: Clustering on Elastic Map Reduce
Posted by Sean Owen <sr...@gmail.com>.
2 isn't how it is 'supposed' to work. The Configuration object is how you
pass to the job any name - value pairs.
The more right way is for the Driver to copy the properties entries into
Configuration. Everything downstream can see that.
I think we would do well to keep it simple here. There are already props
files and two flavors of command line args in play for configuration.
Sean
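A minimal sketch of the "Driver copies the properties entries into Configuration" idea described above. To keep it self-contained it does not depend on Hadoop: the target is passed in as a BiConsumer, and a plain Map stands in for the job Configuration. The class name PropsToConf, the method copyInto and the example key are hypothetical. With Hadoop on the classpath, conf::set on an org.apache.hadoop.conf.Configuration should fit the same slot.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Properties;
import java.util.function.BiConsumer;

final class PropsToConf {

    /** Copies every string-valued entry of props into the given sink. */
    static void copyInto(Properties props, BiConsumer<String, String> sink) {
        for (String name : props.stringPropertyNames()) {
            sink.accept(name, props.getProperty(name));
        }
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        // Hypothetical key, for illustration only.
        props.setProperty("example.distanceMeasure", "CosineDistanceMeasure");

        // Stand-in for the job Configuration; anything exposing a
        // (name, value) setter can be used as the sink.
        Map<String, String> conf = new LinkedHashMap<>();
        copyInto(props, conf::put);
        System.out.println(conf);
    }
}
```

Everything the driver copies this way travels with the job, so the workers never need to see the props file itself.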
On Sep 12, 2010 7:41 PM, "Ted Dunning" <te...@gmail.com> wrote:
> The reflection option sounds dangerous because it isn't clear that the
> classes will be loaded yet which would mean that they wouldn't be seen.
>
> Option 2 is, as you say, relatively simple.
>
> On Sun, Sep 12, 2010 at 10:21 AM, Grant Ingersoll <gsingers@apache.org> wrote:
>
>> My first thought is to create a JOB jar that contains the properties, but
>> the thought occurred to me that there might be a way to enhance the
>> classpath. Other thoughts:
>> 1. Instead of requiring driver.classes.props, we could just have an
>> Interface that each of those drivers implements that reports its short name
>> and description and then we just need to do some reflection at startup to
>> get all implementers of the interface.
>> 2. We create a "default.driver.classes.props" that is actually packaged
>> into the JOB jar. We first look for driver.classes.props then we look for
>> default.driver.classes.props, then we throw an exception.
>>
>> I guess my preference is #2, since that is the least code, still allows the
>> existing functionality to work and provides reasonable defaults w/o any
>> setup.
>>
Re: Clustering on Elastic Map Reduce
Posted by Grant Ingersoll <gs...@apache.org>.
On Sep 12, 2010, at 7:35 PM, Jake Mannix wrote:
> On Sun, Sep 12, 2010 at 12:23 PM, Grant Ingersoll <gs...@apache.org>wrote:
>
>>
>>> Option 2 is, as you say, relatively simple.
>>
>> I have this working and will post/commit a patch.
https://issues.apache.org/jira/browse/MAHOUT-500 has the patch. It's a pretty trivial change and I just use the existing driver.classes.props file (renaming it) so that we don't have to maintain two copies.
-Grant
Re: Clustering on Elastic Map Reduce
Posted by Jake Mannix <ja...@gmail.com>.
On Sun, Sep 12, 2010 at 12:23 PM, Grant Ingersoll <gs...@apache.org>wrote:
>
> > Option 2 is, as you say, relatively simple.
>
> I have this working and will post/commit a patch.
+1 on this - it was what I'd had in mind originally with the
driver.classes.props file. In fact, it's the one and only .props file in the
conf directory which is "required", and is only accessible to users because
they can easily add their own driver classes which would be used by the
MahoutDriver by editing this file. Having a default set of values either in
the .job file, or hardcoded into the MahoutDriver would make sure that file
isn't needed for the general use case.
-jake
Re: Clustering on Elastic Map Reduce
Posted by Grant Ingersoll <gs...@apache.org>.
On Sep 12, 2010, at 2:40 PM, Ted Dunning wrote:
> The reflection option sounds dangerous because it isn't clear that the
> classes will be loaded yet which would mean that they wouldn't be seen.
Agreed.
>
> Option 2 is, as you say, relatively simple.
I have this working and will post/commit a patch.
>
> On Sun, Sep 12, 2010 at 10:21 AM, Grant Ingersoll <gs...@apache.org>wrote:
>
>> My first thought is to create a JOB jar that contains the properties, but
>> the thought occurred to me that there might be a way to enhance the
>> classpath. Other thoughts:
>> 1. Instead of requiring driver.classes.props, we could just have an
>> Interface that each of those drivers implements that reports its short name
>> and description and then we just need to do some reflection at startup to
>> get all implementers of the interface.
>> 2. We create a "default.driver.classes.props" that is actually packaged
>> into the JOB jar. We first look for driver.classes.props then we look for
>> default.driver.classes.props, then we throw an exception.
>>
>> I guess my preference is #2, since that is the least code, still allows the
>> existing functionality to work and provides reasonable defaults w/o any
>> setup.
>>
--------------------------
Grant Ingersoll
http://lucenerevolution.org Apache Lucene/Solr Conference, Boston Oct 7-8
Re: Clustering on Elastic Map Reduce
Posted by Ted Dunning <te...@gmail.com>.
The reflection option sounds dangerous because it isn't clear that the
classes will be loaded yet which would mean that they wouldn't be seen.
Option 2 is, as you say, relatively simple.
On Sun, Sep 12, 2010 at 10:21 AM, Grant Ingersoll <gs...@apache.org>wrote:
> My first thought is to create a JOB jar that contains the properties, but
> the thought occurred to me that there might be a way to enhance the
> classpath. Other thoughts:
> 1. Instead of requiring driver.classes.props, we could just have an
> Interface that each of those drivers implements that reports its short name
> and description and then we just need to do some reflection at startup to
> get all implementers of the interface.
> 2. We create a "default.driver.classes.props" that is actually packaged
> into the JOB jar. We first look for driver.classes.props then we look for
> default.driver.classes.props, then we throw an exception.
>
> I guess my preference is #2, since that is the least code, still allows the
> existing functionality to work and provides reasonable defaults w/o any
> setup.
>
Re: Clustering on Elastic Map Reduce
Posted by Grant Ingersoll <gs...@apache.org>.
moving to dev@
So, I can run KMeansDriver directly on EMR, but one of the things I want to do is actually run MahoutDriver on EMR. The only sticking point is these lines:
<snip classname="MahoutDriver">
InputStream propsStream = Thread.currentThread()
    .getContextClassLoader()
    .getResourceAsStream("driver.classes.props");
mainClasses.load(propsStream);
</snip>
due to the fact that the properties files are not in the class path that EMR gets.
Anyone have suggestions on working around this?
My first thought is to create a JOB jar that contains the properties, but the thought occurred to me that there might be a way to enhance the classpath. Other thoughts:
1. Instead of requiring driver.classes.props, we could just have an Interface that each of those drivers implements that reports its short name and description and then we just need to do some reflection at startup to get all implementers of the interface.
2. We create a "default.driver.classes.props" that is actually packaged into the JOB jar. We first look for driver.classes.props then we look for default.driver.classes.props, then we throw an exception.
I guess my preference is #2, since that is the least code, still allows the existing functionality to work and provides reasonable defaults w/o any setup.
Thoughts?
-Grant
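A rough sketch of the lookup order proposed as option 2 in this message: try driver.classes.props first, then default.driver.classes.props, and fail if neither is on the classpath. The class name DriverPropsLoader and the method loadFirstAvailable are hypothetical and are not taken from MahoutDriver or from the MAHOUT-500 patch.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.Properties;

final class DriverPropsLoader {

    /**
     * Loads the first of the named resources that the class loader can find.
     * Throws IllegalStateException if none of them is present.
     */
    static Properties loadFirstAvailable(ClassLoader loader, String... resourceNames) {
        for (String name : resourceNames) {
            try (InputStream in = loader.getResourceAsStream(name)) {
                if (in == null) {
                    continue; // not on the classpath, try the next candidate
                }
                Properties props = new Properties();
                props.load(in);
                return props;
            } catch (IOException e) {
                throw new UncheckedIOException("Could not read " + name, e);
            }
        }
        throw new IllegalStateException(
                "None of these resources is on the classpath: "
                        + String.join(", ", resourceNames));
    }

    public static void main(String[] args) {
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        try {
            Properties mainClasses = loadFirstAvailable(
                    loader, "driver.classes.props", "default.driver.classes.props");
            System.out.println("Loaded " + mainClasses.size() + " driver entries");
        } catch (IllegalStateException e) {
            System.err.println(e.getMessage());
        }
    }
}
```

Checking the user-editable file first keeps the existing behaviour for anyone who has customised it, while the packaged default covers environments such as EMR where the conf directory is not on the classpath.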
On Sep 12, 2010, at 8:07 AM, Grant Ingersoll wrote:
>
> On Sep 12, 2010, at 7:42 AM, Grant Ingersoll wrote:
>
>>
>> On Sep 11, 2010, at 10:11 PM, Drew Farris wrote:
>>
>>> I will write up notes on the EMR wiki page.
>
> https://cwiki.apache.org/confluence/display/MAHOUT/Mahout+on+Elastic+MapReduce is updated to 0.4-SNAPSHOT.
>
> -Grant
>
>
Re: Clustering on Elastic Map Reduce
Posted by Grant Ingersoll <gs...@apache.org>.
On Sep 12, 2010, at 7:42 AM, Grant Ingersoll wrote:
>
> On Sep 11, 2010, at 10:11 PM, Drew Farris wrote:
>
>> I will write up notes on the EMR wiki page.
https://cwiki.apache.org/confluence/display/MAHOUT/Mahout+on+Elastic+MapReduce is updated to 0.4-SNAPSHOT.
-Grant
Re: Clustering on Elastic Map Reduce
Posted by Grant Ingersoll <gs...@apache.org>.
On Sep 11, 2010, at 10:11 PM, Drew Farris wrote:
> Congratulations!
>
> What's the best way to send messages back to the caller of an EMR job,
> using stderr instead of the log framework here?
It probably makes sense to have any command line errors, etc. go to stderr instead of logging the exception, but this may just be a relic of EMR and the way it configures logging. I will write up notes on the EMR wiki page.
I just committed one minor fix that should help, namely printing out the OptionException when it occurs as part of the generic print options method.
>
> On Sat, Sep 11, 2010 at 9:32 PM, Grant Ingersoll <gs...@apache.org> wrote:
>> And indeed, running this via the Ruby CLI works as well. Woo hoo!
>>
>> -Grant
>>
>> On Sep 11, 2010, at 9:01 PM, Grant Ingersoll wrote:
>>
>>>
>>> On Sep 11, 2010, at 8:02 PM, Grant Ingersoll wrote:
>>>
>>>> I've made a little bit of progress here, but not much. Here's what I ran:
>>>>
>>>> elastic-mapreduce -j <JOB> --jar s3n://news-vecs/mahout-core-0.4-SNAPSHOT.job --main-class org.apache.mahout.clustering.kmeans.KMeansDriver --arg --input --arg s3n://news-vecs/part-out.vec --arg --clusters --arg s3n://news-vecs/kmeans/clusters/ --arg
>>>
>>>
>>>> --k
>>>
>>> Ugh. It's -k, not --k.
>>>
>>> So, this bit of code could likely be more useful:
>>> } catch (IllegalArgumentException e) {
>>> log.error(e.getMessage());
>>> CommandLineUtil.printHelpWithGenericOptions(group);
>>> return null;
>>> }
>>>
>>> Since, at least on EMR, the logs tend to get buried and it writes it out to syslog, not stderr or stdout.
>>>
>>> I have it running now by logging into the EMR instance using SSH and then I also specifically uploaded my Vector file to HDFS by hand. In other words, I'm not using the remote Ruby CLI just yet.
>>>
>>> Progress. Sigh.
>>>
>>> -Grant
Re: Clustering on Elastic Map Reduce
Posted by Drew Farris <dr...@apache.org>.
Congratulations!
What's the best way to send messages back to the caller of an EMR job,
using stderr instead of the log framework here?
On Sat, Sep 11, 2010 at 9:32 PM, Grant Ingersoll <gs...@apache.org> wrote:
> And indeed, running this via the Ruby CLI works as well. Woo hoo!
>
> -Grant
>
> On Sep 11, 2010, at 9:01 PM, Grant Ingersoll wrote:
>
>>
>> On Sep 11, 2010, at 8:02 PM, Grant Ingersoll wrote:
>>
>>> I've made a little bit of progress here, but not much. Here's what I ran:
>>>
>>> elastic-mapreduce -j <JOB> --jar s3n://news-vecs/mahout-core-0.4-SNAPSHOT.job --main-class org.apache.mahout.clustering.kmeans.KMeansDriver --arg --input --arg s3n://news-vecs/part-out.vec --arg --clusters --arg s3n://news-vecs/kmeans/clusters/ --arg
>>
>>
>>> --k
>>
>> Ugh. It's -k, not --k.
>>
>> So, this bit of code could likely be more useful:
>> } catch (IllegalArgumentException e) {
>> log.error(e.getMessage());
>> CommandLineUtil.printHelpWithGenericOptions(group);
>> return null;
>> }
>>
>> Since, at least on EMR, the logs tend to get buried and it writes it out to syslog, not stderr or stdout.
>>
>> I have it running now by logging into the EMR instance using SSH and then I also specifically uploaded my Vector file to HDFS by hand. In other words, I'm not using the remote Ruby CLI just yet.
>>
>> Progress. Sigh.
>>
>> -Grant
Re: Clustering on Elastic Map Reduce
Posted by Grant Ingersoll <gs...@apache.org>.
And indeed, running this via the Ruby CLI works as well. Woo hoo!
-Grant
On Sep 11, 2010, at 9:01 PM, Grant Ingersoll wrote:
>
> On Sep 11, 2010, at 8:02 PM, Grant Ingersoll wrote:
>
>> I've made a little bit of progress here, but not much. Here's what I ran:
>>
>> elastic-mapreduce -j <JOB> --jar s3n://news-vecs/mahout-core-0.4-SNAPSHOT.job --main-class org.apache.mahout.clustering.kmeans.KMeansDriver --arg --input --arg s3n://news-vecs/part-out.vec --arg --clusters --arg s3n://news-vecs/kmeans/clusters/ --arg
>
>
>> --k
>
> Ugh. It's -k, not --k.
>
> So, this bit of code could likely be more useful:
> } catch (IllegalArgumentException e) {
> log.error(e.getMessage());
> CommandLineUtil.printHelpWithGenericOptions(group);
> return null;
> }
>
> Since, at least on EMR, the logs tend to get buried and it writes it out to syslog, not stderr or stdout.
>
> I have it running now by logging into the EMR instance using SSH and then I also specifically uploaded my Vector file to HDFS by hand. In other words, I'm not using the remote Ruby CLI just yet.
>
> Progress. Sigh.
>
> -Grant
Re: Clustering on Elastic Map Reduce
Posted by Grant Ingersoll <gs...@apache.org>.
On Sep 11, 2010, at 8:02 PM, Grant Ingersoll wrote:
> I've made a little bit of progress here, but not much. Here's what I ran:
>
> elastic-mapreduce -j <JOB> --jar s3n://news-vecs/mahout-core-0.4-SNAPSHOT.job --main-class org.apache.mahout.clustering.kmeans.KMeansDriver --arg --input --arg s3n://news-vecs/part-out.vec --arg --clusters --arg s3n://news-vecs/kmeans/clusters/ --arg
> --k
Ugh. It's -k, not --k.
So, this bit of code could likely be more useful:
} catch (IllegalArgumentException e) {
  log.error(e.getMessage());
  CommandLineUtil.printHelpWithGenericOptions(group);
  return null;
}
Since, at least on EMR, the logs tend to get buried and it writes it out to syslog, not stderr or stdout.
I have it running now by logging into the EMR instance using SSH and then I also specifically uploaded my Vector file to HDFS by hand. In other words, I'm not using the remote Ruby CLI just yet.
Progress. Sigh.
-Grant
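One way the catch block above could surface the problem to the caller is to write the message to stderr as well as to the log. The sketch below is only an illustration: parseK is a hypothetical stand-in for the real option parsing, and the class name StderrReporting is made up.

```java
final class StderrReporting {

    /** Hypothetical stand-in for option parsing: accepts "-k <int>" only. */
    static int parseK(String[] args) {
        if (args.length != 2 || !"-k".equals(args[0])) {
            throw new IllegalArgumentException("expected: -k <number of clusters>");
        }
        // NumberFormatException is a subclass of IllegalArgumentException.
        return Integer.parseInt(args[1]);
    }

    /** Returns the parsed k, or null after reporting the problem on stderr. */
    static Integer run(String[] args) {
        try {
            return parseK(args);
        } catch (IllegalArgumentException e) {
            // Keep whatever logging call is already there, and additionally
            // print the message where a command-line caller will see it.
            System.err.println("Error: " + e.getMessage());
            return null;
        }
    }

    public static void main(String[] args) {
        Integer k = run(new String[] {"--k", "10"});
        System.out.println(k == null ? "parse failed" : "k=" + k);
    }
}
```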
Re: Clustering on Elastic Map Reduce
Posted by Grant Ingersoll <gs...@apache.org>.
I've made a little bit of progress here, but not much. Here's what I ran:
elastic-mapreduce -j <JOB> --jar s3n://news-vecs/mahout-core-0.4-SNAPSHOT.job --main-class org.apache.mahout.clustering.kmeans.KMeansDriver --arg --input --arg s3n://news-vecs/part-out.vec --arg --clusters --arg s3n://news-vecs/kmeans/clusters/ --arg --k --arg 10 --arg --output --arg s3n://news-vecs/out/ --arg --distanceMeasure --arg org.apache.mahout.common.distance.CosineDistanceMeasure --arg --convergenceDelta --arg 0.001 --arg --overwrite --arg --maxIter --arg 50 --arg --clustering -v --debug
In the controller log, I see:
2010-09-11T23:49:16.958Z INFO Fetching jar file.
2010-09-11T23:49:20.723Z INFO Working dir /mnt/var/lib/hadoop/steps/1
2010-09-11T23:49:20.723Z INFO Executing /usr/lib/jvm/java-6-sun/bin/java -cp /home/hadoop/conf:/usr/lib/jvm/java-6-sun/lib/tools.jar:/home/hadoop:/home/hadoop/hadoop-0.20-core.jar:/home/hadoop/hadoop-0.20-tools.jar:/home/hadoop/lib/*:/home/hadoop/lib/jetty-ext/* -Xmx1000m -Dhadoop.log.dir=/mnt/var/log/hadoop/steps/1 -Dhadoop.log.file=syslog -Dhadoop.home.dir=/home/hadoop -Dhadoop.id.str=hadoop -Dhadoop.root.logger=INFO,DRFA -Djava.io.tmpdir=/mnt/var/lib/hadoop/steps/1/tmp -Djava.library.path=/home/hadoop/lib/native/Linux-i386-32 org.apache.hadoop.util.RunJar /mnt/var/lib/hadoop/steps/1/mahout-core-0.4-SNAPSHOT.job org.apache.mahout.clustering.kmeans.KMeansDriver --input s3n://news-vecs/part-out.vec --clusters s3n://news-vecs/kmeans/clusters/ --k 10 --output s3n://news-vecs/out/ --distanceMeasure org.apache.mahout.common.distance.CosineDistanceMeasure --convergenceDelta 0.001 --overwrite --maxIter 50 --clustering
2010-09-11T23:49:23.302Z INFO Execution ended with ret val 0
2010-09-11T23:49:25.415Z INFO Step created jobs:
2010-09-11T23:49:25.416Z INFO Step succeeded
But, then in stdout log I see:
<snip>
usage: <command> [Generic Options] [Job-Specific Options]
Generic Options:
  -archives <paths>              comma separated archives to be unarchived on the compute machines.
  -conf <configuration file>     specify an application configuration file
  -D <property=value>            use value for given property
  -files <paths>                 comma separated files to be copied to the map reduce cluster
  -fs <local|namenode:port>      specify a namenode
  -jt <local|jobtracker:port>    specify a job tracker
  -libjars <paths>               comma separated jar files to include in the classpath.
Job-Specific Options:
  --input (-i) input                           Path to job input directory.
  --output (-o) output                         The directory pathname for output.
  --distanceMeasure (-dm) distanceMeasure      The classname of the DistanceMeasure. Default is SquaredEuclidean
  --clusters (-c) clusters                     The input centroids, as Vectors. Must be a SequenceFile of Writable, Cluster/Canopy. If k is also specified, then a random set of vectors will be selected and written out to this path first
  --numClusters (-k) k                         The k in k-Means. If specified, then a random selection of k Vectors will be chosen as the Centroid and written to the clusters input path.
  --convergenceDelta (-cd) convergenceDelta    The convergence delta value. Default is 0.5
  --maxIter (-x) maxIter                       The maximum number of iterations.
  --overwrite (-ow)                            If present, overwrite the output directory before running job
  --maxRed (-r) maxRed                         The number of reduce tasks. Defaults to 2
  --clustering (-cl)                           If present, run clustering after the iterations have taken place
  --method (-xm) method                        The execution method to use: sequential or mapreduce. Default is mapreduce
  --help (-h)                                  Print out help
  --tempDir tempDir                            Intermediate output directory
  --startPhase startPhase                      First phase to run
  --endPhase endPhase                          Last phase to run
</snip>
Which, of course, shows that it isn't getting the arguments. Perhaps it's the s3n:// paths? I'm going to try running from ssh.
-Grant
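The command above passes each job argument to elastic-mapreduce behind its own --arg flag. A small sketch of building that list programmatically follows. The helper name EmrArgs.wrap is hypothetical, and the example uses the short option -k, which is what the usage text lists for --numClusters and what a later message in this thread settles on. Whether the CLI accepts every dash-prefixed value after --arg is not something this sketch checks.

```java
import java.util.ArrayList;
import java.util.List;

final class EmrArgs {

    /** Prefixes every job argument with "--arg", as in the command above. */
    static List<String> wrap(List<String> jobArgs) {
        List<String> out = new ArrayList<>();
        for (String arg : jobArgs) {
            out.add("--arg");
            out.add(arg);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> jobArgs = List.of(
                "--input", "s3n://news-vecs/part-out.vec",
                "-k", "10");
        // Prints: --arg --input --arg s3n://news-vecs/part-out.vec --arg -k --arg 10
        System.out.println(String.join(" ", wrap(jobArgs)));
    }
}
```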
On Sep 2, 2010, at 1:04 PM, Drew Farris wrote:
> Were there specific issues you ran into? I suspect the documentation
> on the wiki is out of date.
>
> Drew
>
> On Sun, Aug 29, 2010 at 10:58 AM, Grant Ingersoll <gs...@apache.org> wrote:
>> Has anyone successfully run any of the clustering algorithms on Amazon's Elastic Map Reduce? If so, please share steps please.
>>
>> Thanks,
>> Grant
Re: Clustering on Elastic Map Reduce
Posted by Grant Ingersoll <gs...@apache.org>.
On Sep 2, 2010, at 2:23 PM, Sebastian Schelter wrote:
> I've been using the Java SDK ( http://aws.amazon.com/sdkforjava/) to run
> recommender stuff on EMR, it's working really well and the coding was
> pretty straight forward. I could provide some sample code in case
> somebody wants to see that.
I was wondering if we couldn't just factor this into the bin/mahout script? I think that would be pretty cool for our end users, giving them pretty much EMR via a very simple CLI. Users would be able to go from local to EMR to own Hadoop cluster w/o changing much more than a few CLI parameters.
-Grant
Re: Clustering on Elastic Map Reduce
Posted by Sebastian Schelter <ss...@apache.org>.
I've been using the Java SDK ( http://aws.amazon.com/sdkforjava/) to run
recommender stuff on EMR, it's working really well and the coding was
pretty straightforward. I could provide some sample code in case
somebody wants to see that.
--sebastian
On 02.09.2010 20:05, Grant Ingersoll wrote:
> In talking w/ Jake, he said he launched EMR and then SSH'd in and ran that way. I'm going to try that next, as the Ruby CLI and the Console haven't worked for me. It's weird, it invokes the main(), but it doesn't seem to pass in the args. I will update as I progress. I'm trying to do some benchmarking of clustering on there.
>
> Might be a fun thing to debug at our Bay Area meetup...
>
> -Grant
>
> On Sep 2, 2010, at 1:37 PM, Drew Farris wrote:
>
>
>> Jeff,
>>
>> I'm not sure we're talking about the same documentation on the wiki. I
>> was looking at the page:
>> https://cwiki.apache.org/confluence/display/MAHOUT/Mahout+on+Elastic+MapReduce
>>
>> You're referring to the following page, correct?
>> https://cwiki.apache.org/confluence/display/MAHOUT/Mahout+on+Amazon+EC2
>>
>> I've tried these since the change to 0.20.2 and they work for me too,
>> point taken about updating this to use the latest CDH. I haven't tried
>> running on Elastic MapReduce either.
>>
>> Drew
>>
>> On Thu, Sep 2, 2010 at 1:26 PM, Jeff Eastman <jd...@windwardsolutions.com> wrote:
>>
>>> The documentation on the wiki is about building an AMI for EC2 and is out
>>> of date since Cloudera has released a 0.20.2 AMI. Now that EMR supports our
>>> 0.20.2 as well the step of building an AMI should not be needed. But I've
>>> not tried clustering on EMR, only EC2, and the wiki instructions were
>>> accurate then.
>>>
>>> On 9/2/10 10:04 AM, Drew Farris wrote:
>>>
>>>> Were there specific issues you ran into? I suspect the documentation
>>>> on the wiki is out of date.
>>>>
>>>> Drew
>>>>
>>>> On Sun, Aug 29, 2010 at 10:58 AM, Grant Ingersoll<gs...@apache.org>
>>>> wrote:
>>>>
>>>>> Has anyone successfully run any of the clustering algorithms on Amazon's
>>>>> Elastic Map Reduce? If so, please share steps please.
>>>>>
>>>>> Thanks,
>>>>> Grant
>>>>>
>>>
>>>
Re: Clustering on Elastic Map Reduce
Posted by Grant Ingersoll <gs...@apache.org>.
In talking w/ Jake, he said he launched EMR and then SSH'd in and ran that way. I'm going to try that next, as the Ruby CLI and the Console haven't worked for me. It's weird, it invokes the main(), but it doesn't seem to pass in the args. I will update as I progress. I'm trying to do some benchmarking of clustering on there.
Might be a fun thing to debug at our Bay Area meetup...
-Grant
On Sep 2, 2010, at 1:37 PM, Drew Farris wrote:
> Jeff,
>
> I'm not sure we're talking about the same documentation on the wiki. I
> was looking at the page:
> https://cwiki.apache.org/confluence/display/MAHOUT/Mahout+on+Elastic+MapReduce
>
> You're referring to the following page, correct?
> https://cwiki.apache.org/confluence/display/MAHOUT/Mahout+on+Amazon+EC2
>
> I've tried these since the change to 0.20.2 and they work for me too,
> point taken about updating this to use the latest CDH. I haven't tried
> running on Elastic MapReduce either.
>
> Drew
>
> On Thu, Sep 2, 2010 at 1:26 PM, Jeff Eastman <jd...@windwardsolutions.com> wrote:
>> The documentation on the wiki is about building an AMI for EC2 and is out
>> of date since Cloudera has released a 0.20.2 AMI. Now that EMR supports our
>> 0.20.2 as well the step of building an AMI should not be needed. But I've
>> not tried clustering on EMR, only EC2, and the wiki instructions were
>> accurate then.
>>
>> On 9/2/10 10:04 AM, Drew Farris wrote:
>>>
>>> Were there specific issues you ran into? I suspect the documentation
>>> on the wiki is out of date.
>>>
>>> Drew
>>>
>>> On Sun, Aug 29, 2010 at 10:58 AM, Grant Ingersoll<gs...@apache.org>
>>> wrote:
>>>>
>>>> Has anyone successfully run any of the clustering algorithms on Amazon's
>>>> Elastic Map Reduce? If so, please share steps please.
>>>>
>>>> Thanks,
>>>> Grant
>>
>>
Re: Clustering on Elastic Map Reduce
Posted by Jeff Eastman <jd...@windwardsolutions.com>.
Yes Drew, you're right, I hadn't seen that EMR page. It looks more
difficult to configure than building my own AMI was :).
On 9/2/10 10:37 AM, Drew Farris wrote:
> Jeff,
>
> I'm not sure we're talking about the same documentation on the wiki. I
> was looking at the page:
> https://cwiki.apache.org/confluence/display/MAHOUT/Mahout+on+Elastic+MapReduce
>
> You're referring to the following page, correct?
> https://cwiki.apache.org/confluence/display/MAHOUT/Mahout+on+Amazon+EC2
>
> I've tried these since the change to 0.20.2 and they work for me too,
> point taken about updating this to use the latest CDH. I haven't tried
> running on Elastic MapReduce either.
>
> Drew
>
> On Thu, Sep 2, 2010 at 1:26 PM, Jeff Eastman<jd...@windwardsolutions.com> wrote:
>> The documentation on the wiki is about building an AMI for EC2 and is out
>> of date since Cloudera has released a 0.20.2 AMI. Now that EMR supports our
>> 0.20.2 as well the step of building an AMI should not be needed. But I've
>> not tried clustering on EMR, only EC2, and the wiki instructions were
>> accurate then.
>>
>> On 9/2/10 10:04 AM, Drew Farris wrote:
>>> Were there specific issues you ran into? I suspect the documentation
>>> on the wiki is out of date.
>>>
>>> Drew
>>>
>>> On Sun, Aug 29, 2010 at 10:58 AM, Grant Ingersoll<gs...@apache.org>
>>> wrote:
>>>> Has anyone successfully run any of the clustering algorithms on Amazon's
>>>> Elastic Map Reduce? If so, please share steps please.
>>>>
>>>> Thanks,
>>>> Grant
>>
Re: Clustering on Elastic Map Reduce
Posted by Drew Farris <dr...@apache.org>.
Jeff,
I'm not sure we're talking about the same documentation on the wiki. I
was looking at the page:
https://cwiki.apache.org/confluence/display/MAHOUT/Mahout+on+Elastic+MapReduce
You're referring to the following page, correct?
https://cwiki.apache.org/confluence/display/MAHOUT/Mahout+on+Amazon+EC2
I've tried these since the change to 0.20.2 and they work for me too,
point taken about updating this to use the latest CDH. I haven't tried
running on Elastic MapReduce either.
Drew
On Thu, Sep 2, 2010 at 1:26 PM, Jeff Eastman <jd...@windwardsolutions.com> wrote:
> The documentation on the wiki is about building an AMI for EC2 and is out
> of date since Cloudera has released a 0.20.2 AMI. Now that EMR supports our
> 0.20.2 as well the step of building an AMI should not be needed. But I've
> not tried clustering on EMR, only EC2, and the wiki instructions were
> accurate then.
>
> On 9/2/10 10:04 AM, Drew Farris wrote:
>>
>> Were there specific issues you ran into? I suspect the documentation
>> on the wiki is out of date.
>>
>> Drew
>>
>> On Sun, Aug 29, 2010 at 10:58 AM, Grant Ingersoll<gs...@apache.org>
>> wrote:
>>>
>>> Has anyone successfully run any of the clustering algorithms on Amazon's
>>> Elastic Map Reduce? If so, please share steps please.
>>>
>>> Thanks,
>>> Grant
>
>
Re: Clustering on Elastic Map Reduce
Posted by Jeff Eastman <jd...@windwardsolutions.com>.
The documentation on the wiki is about building an AMI for EC2 and is
out of date since Cloudera has released a 0.20.2 AMI. Now that EMR
supports our 0.20.2 as well the step of building an AMI should not be
needed. But I've not tried clustering on EMR, only EC2, and the wiki
instructions were accurate then.
On 9/2/10 10:04 AM, Drew Farris wrote:
> Were there specific issues you ran into? I suspect the documentation
> on the wiki is out of date.
>
> Drew
>
> On Sun, Aug 29, 2010 at 10:58 AM, Grant Ingersoll<gs...@apache.org> wrote:
>> Has anyone successfully run any of the clustering algorithms on Amazon's Elastic Map Reduce? If so, please share steps please.
>>
>> Thanks,
>> Grant
Re: Clustering on Elastic Map Reduce
Posted by Grant Ingersoll <gs...@apache.org>.
On Sep 2, 2010, at 1:04 PM, Drew Farris wrote:
> Were there specific issues you ran into? I suspect the documentation
> on the wiki is out of date.
It definitely is. I posted my steps earlier and will try to update them with my latest soon.
>
> Drew
>
> On Sun, Aug 29, 2010 at 10:58 AM, Grant Ingersoll <gs...@apache.org> wrote:
>> Has anyone successfully run any of the clustering algorithms on Amazon's Elastic Map Reduce? If so, please share steps please.
>>
>> Thanks,
>> Grant
Re: Clustering on Elastic Map Reduce
Posted by Drew Farris <dr...@apache.org>.
Were there specific issues you ran into? I suspect the documentation
on the wiki is out of date.
Drew
On Sun, Aug 29, 2010 at 10:58 AM, Grant Ingersoll <gs...@apache.org> wrote:
> Has anyone successfully run any of the clustering algorithms on Amazon's Elastic Map Reduce? If so, please share steps please.
>
> Thanks,
> Grant