Posted to user@mahout.apache.org by Stephen Green <St...@sun.com> on 2009/04/14 16:15:49 UTC
Mahout on Elastic MapReduce
I told some folks here at work that I would give a talk on Mahout for
our reading group and decided that I would use it as an opportunity to
try Amazon's Elastic MapReduce (EMR).
I downloaded and untarred Hadoop 0.18.3, which is the version that
Amazon claims they have running so that I could try things out here.
I can start up Hadoop and successfully run a KMeans clustering job on
the synthetic control data using the instructions on the wiki and the
following command line:
bin/hadoop jar ~/Projects/EC2/mahout-0.1/examples/target/mahout-examples-0.1.job \
  org.apache.mahout.clustering.syntheticcontrol.kmeans.Job \
  input/testdata output org.apache.mahout.utils.EuclideanDistanceMeasure 80 55 0.5 10
I realize there's a shorter invocation, but I'm trying to figure out
what Amazon needs to run this, so I pulled the default arguments
from the KMeans job.
Now, on Amazon, you can specify a jar file that gets run with
"bin/hadoop jar", and you also specify the arguments that will be used
with that jar file.
The trick is that the input and output data need to be in S3 buckets
and you need to specify the locations with S3 native URIs. I used the
command line interface to EMR to create a job like so:
elastic-mapreduce -v --create --name KMeans --num-instances 1 \
--jar s3n://mahout-code/mahout-examples-0.1.job \
--main-class
org.apache.mahout.clustering.syntheticcontrol.kmeans.Job \
--arg s3n://mahout-input/testdata \
--arg s3n://mahout-output \
--arg org.apache.mahout.utils.EuclideanDistanceMeasure \
--arg 80 --arg 55 --arg 0.5 --arg 10
But this fails with the message "Steps completed with errors". It
turns out you can have the EMR infrastructure dump the logs for the
tasks, and looking at the stderr for step 1 I see:
java.lang.IllegalArgumentException: Wrong FS: s3n://mahout-output, expected: hdfs://domU-12-31-39-00-ED-51.compute-1.internal:9000
        at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:320)
        at org.apache.hadoop.dfs.DistributedFileSystem.checkPath(DistributedFileSystem.java:84)
        at org.apache.hadoop.dfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:140)
        at org.apache.hadoop.dfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:408)
        at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:667)
        at org.apache.mahout.clustering.syntheticcontrol.kmeans.Job.runJob(Job.java:77)
        at org.apache.mahout.clustering.syntheticcontrol.kmeans.Job.main(Job.java:44)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:155)
        at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
        at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)
If I replace the s3n URI for the output with just mahout-output, the
code appears to run without incident (at least the log output looks
like the log output from my local run). Unfortunately, the HDFS
instance into which the output is put disappears in a puff of smoke
when the job finishes running.
Now, I am by no means a Hadoop expert, but if the job can load the
data from an s3n input URI, then it presumably has the right classes
in there to do that (in fact, it looks like the jets3t jar is in the
.job file three times!). So it seems like the KMeans job from Mahout
should be happy to use an s3n output URI, but I'm clearly
misunderstanding something here.
One of the EMR samples is a Java DNA sequence matching application
(CloudBurst), which seems to work fine with an s3n URI for the
output. The setup for its output looks like the following:
Path oPath = new Path(outpath);
FileOutputFormat.setOutputPath(conf, oPath);
System.err.println(" Removing old results");
FileSystem.get(conf).delete(oPath);
where "conf" is of type org.apache.hadoop.mapred.JobConf. This is a
bit different from what happens in the KMeans job:
Path outPath = new Path(output);
client.setConf(conf);
FileSystem dfs = FileSystem.get(conf);
if (dfs.exists(outPath))
  dfs.delete(outPath, true);
Trying to use the CloudBurst idiom in the KMeans job produced no joy.
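As far as I can tell, the shape of the problem is that the filesystem handle checks every path against its own scheme. A toy model of that check (illustrative Python only, not the actual Hadoop classes; the real logic lives in Hadoop's FileSystem.checkPath):

```python
# Toy model of Hadoop's FileSystem.checkPath: a filesystem instance
# only accepts paths whose scheme matches its own. (Illustrative
# sketch, not the real Hadoop code.)
class ToyFileSystem:
    def __init__(self, scheme: str):
        self.scheme = scheme

    def check_path(self, uri: str) -> None:
        scheme = uri.split("://", 1)[0]
        if scheme != self.scheme:
            raise ValueError(f"Wrong FS: {uri}, expected: {self.scheme}")

# FileSystem.get(conf) hands back the cluster's default filesystem,
# which on EMR is HDFS, so an s3n:// output path fails its check:
hdfs = ToyFileSystem("hdfs")
try:
    hdfs.check_path("s3n://mahout-output")
except ValueError as e:
    print(e)  # Wrong FS: s3n://mahout-output, expected: hdfs

# Choosing the filesystem from the path's own URI (the
# FileSystem.get(uri, conf) form) avoids the mismatch:
ToyFileSystem("s3n").check_path("s3n://mahout-output")  # accepted
```

That at least matches the "Wrong FS" exception above, though it doesn't explain why CloudBurst's idiom works and the KMeans one doesn't.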
Any help would be greatly appreciated.
Steve Green
--
Stephen Green // Stephen.Green@sun.com
Principal Investigator \\ http://blogs.sun.com/searchguy
Aura Project // Voice: +1 781-442-0926
Sun Microsystems Labs \\ Fax: +1 781-442-1692
Re: Mahout on Elastic MapReduce
Posted by Tim Bass <ti...@gmail.com>.
FYI (cheerleading the great work of the teams):
KMeans Clustering Now Running on Elastic MapReduce
http://www.thecepblog.com/2009/04/19/kmeans-clustering-now-running-on-elastic-mapreduce/
Cheers.
On Sat, Apr 18, 2009 at 6:56 PM, Tim Bass <ti...@gmail.com> wrote:
> Also, given that there was no reply to my call for others who have
> mahout running in EC2, I think it is safe to blog
> that Stephen is the first. So, without objection, I will write a blog
> post about this soon.
>
> Congrats again!
>
> On Sat, Apr 18, 2009 at 6:49 PM, Stephen Green <St...@sun.com> wrote:
>> Happy to. I'll do it Monday.
>>
>> Stephen Green
>> -- stephen.green@sun.com
>> http://blogs.sun.com/searchguy
>>
>>
>> On Apr 18, 2009, at 1:33, Otis Gospodnetic <ot...@yahoo.com>
>> wrote:
>>
>>>
>>> Talking about records.... Stephen, would you mind putting your
>>> notes/commands/steps on the Mahout Wiki?
>>>
>>>
>>> Thank you,
>>> Otis
>>> --
>>> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>>>
>>>
>>>
>>> ----- Original Message ----
>>>>
>>>> From: Tim Bass <ti...@gmail.com>
>>>> To: mahout-user@lucene.apache.org
>>>> Sent: Thursday, April 16, 2009 6:27:56 PM
>>>> Subject: Re: Mahout on Elastic MapReduce
>>>>
>>>> For the record.....
>>>>
>>>> Are Stephen's heroic efforts the first instance of Mahout running on EMR?
>>>>
>>>> On Fri, Apr 17, 2009 at 3:08 AM, Stephen Green wrote:
>>>>>
>>>>> On Apr 16, 2009, at 4:04 PM, Jeff Eastman wrote:
>>>>>
>>>>>> Excellent news! Please do open a Jira and tell us what worked. I
>>>>>> suspect
>>>>>> all of the clustering jobs share the same behavior, and other jobs for
>>>>>> other
>>>>>> components may as well.
>>>>>
>>>>> Will do. I've just done a couple of searches for everywhere that
>>>>> FileSystem.get is called and I've fixed most of them (although come to
>>>>> think
>>>>> of it I did it in the 0.1 release tag. Crap! Things haven't changed
>>>>> that
>>>>> much, right?)
>>>>>
>>>>> I'm building and running the tests now, and I'll open a Jira with a
>>>>> patch
>>>>> once that's working.
>>>>>
>>>>>> Congratulations,
>>>>>
>>>>> As my father used to say "Brute force has an elegance all its own".
>>>>>
>>>>> Steve
>>>>> --
>>>>> Stephen Green // Stephen.Green@sun.com
>>>>> Principal Investigator \\ http://blogs.sun.com/searchguy
>>>>> Aura Project // Voice: +1 781-442-0926
>>>>> Sun Microsystems Labs \\ Fax: +1 781-442-1692
>>>>>
>>>>>
>>>>>
>>>>>
>>>
>>
>
Re: Mahout on Elastic MapReduce
Posted by Grant Ingersoll <gs...@apache.org>.
On May 7, 2009, at 11:42 AM, Stephen Green wrote:
>
> On May 7, 2009, at 2:12 PM, Tim Bass wrote:
>
>> Is it possible to get permissions to edit that page?
>
> I didn't know that you needed such a thing, and I can't figure out
> how to add permissions if there are any...
Tim, do you have a wiki account? You should just need to log in. We
have write-by-all set up.
>
>
>> Or would you
>> prefer I ping you privately when I follow in your shoes?
>
> Please, edit away. That's the wiki spirit!
+1
>
>>
>
>> I plan to follow your example step-by-step soon :-)
>
> Let me know if you have any trouble.
>
> Steve
> --
> Stephen Green // Stephen.Green@sun.com
> Principal Investigator \\ http://blogs.sun.com/searchguy
> Aura Project // Voice: +1 781-442-0926
> Sun Microsystems Labs \\ Fax: +1 781-442-1692
>
>
>
--------------------------
Grant Ingersoll
http://www.lucidimagination.com/
Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)
using Solr/Lucene:
http://www.lucidimagination.com/search
Re: Mahout on Elastic MapReduce
Posted by Stephen Green <St...@sun.com>.
On May 7, 2009, at 2:12 PM, Tim Bass wrote:
> Is it possible to get permissions to edit that page?
I didn't know that you needed such a thing, and I can't figure out how
to add permissions if there are any...
> Or would you
> prefer I ping you privately when I follow in your shoes?
Please, edit away. That's the wiki spirit!
>
> I plan to follow your example step-by-step soon :-)
Let me know if you have any trouble.
Steve
--
Stephen Green // Stephen.Green@sun.com
Principal Investigator \\ http://blogs.sun.com/searchguy
Aura Project // Voice: +1 781-442-0926
Sun Microsystems Labs \\ Fax: +1 781-442-1692
Re: Mahout on Elastic MapReduce
Posted by Tim Bass <ti...@gmail.com>.
Is it possible to get permissions to edit that page? Or would you
prefer I ping you privately when I follow in your shoes?
I plan to follow your example step-by-step soon :-)
On Fri, May 8, 2009 at 1:02 AM, Stephen Green <St...@sun.com> wrote:
>
> On May 7, 2009, at 1:59 PM, Tim Bass wrote:
>
>> (... I found the link later in the document :-)
>>
>> "Escape Key"
>
> Dude, that totally worked! I'll fix it anyways once the wiki comes back.
>
> Steve
> --
> Stephen Green // Stephen.Green@sun.com
> Principal Investigator \\ http://blogs.sun.com/searchguy
> Aura Project // Voice: +1 781-442-0926
> Sun Microsystems Labs \\ Fax: +1 781-442-1692
>
>
>
>
Re: Mahout on Elastic MapReduce
Posted by Stephen Green <St...@sun.com>.
On May 7, 2009, at 1:59 PM, Tim Bass wrote:
> (... I found the link later in the document :-)
>
> "Escape Key"
Dude, that totally worked! I'll fix it anyways once the wiki comes
back.
Steve
--
Stephen Green // Stephen.Green@sun.com
Principal Investigator \\ http://blogs.sun.com/searchguy
Aura Project // Voice: +1 781-442-0926
Sun Microsystems Labs \\ Fax: +1 781-442-1692
Re: Mahout on Elastic MapReduce
Posted by Tim Bass <ti...@gmail.com>.
(... I found the link later in the document :-)
"Escape Key"
On Fri, May 8, 2009 at 12:52 AM, Tim Bass <ti...@gmail.com> wrote:
> Hi Stephen,
>
> On the wiki, you say:
>
> "Download the Ruby command line client for EMR ... "
>
> Do you mind wrapping the text above to a link to that Ruby client on the wiki?
>
> Thanks!
>
>>>> Happy to. I'll do it Monday.
>>>
>>> But I didn't say *which* Monday, did I! Hah!
>>>
>>> Sorry for the delay. I added a page to the Wiki this afternoon:
>>>
>>>
>>> http://cwiki.apache.org/confluence/display/MAHOUT/Mahout+on+Elastic+MapReduce
>>>
>>> As I was writing, I re-tried the examples to make sure that the
>>> invocations on the page work, so I think everything's in order.
>>>
>>> If someone who has the permissions necessary to delete attachments could
>>> delete the extra graphics, I'd appreciate it.
>>>
>>>> On Apr 18, 2009, at 1:33, Otis Gospodnetic <ot...@yahoo.com>
>>>> wrote:
>>>>
>>>>>
>>>>> Talking about records.... Stephen, would you mind putting your
>>>>> notes/commands/steps on the Mahout Wiki?
>>>>>
>>>>>
>>>>> Thank you,
>>>>> Otis
>>>>> --
>>>
>>>
>>> Steve
>>> --
>>> Stephen Green // Stephen.Green@sun.com
>>> Principal Investigator \\ http://blogs.sun.com/searchguy
>>> Aura Project // Voice: +1 781-442-0926
>>> Sun Microsystems Labs \\ Fax: +1 781-442-1692
>>>
>>>
>>>
>>
>> --------------------------
>> Grant Ingersoll
>> http://www.lucidimagination.com/
>>
>> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using
>> Solr/Lucene:
>> http://www.lucidimagination.com/search
>>
>>
>
Re: Mahout on Elastic MapReduce
Posted by Tim Bass <ti...@gmail.com>.
Also, given that there was no reply to my call for others who have
mahout running in EC2, I think it is safe to blog
that Stephen is the first. So, without objection, I will write a blog
post about this soon.
Congrats again!
On Sat, Apr 18, 2009 at 6:49 PM, Stephen Green <St...@sun.com> wrote:
> Happy to. I'll do it Monday.
>
> Stephen Green
> -- stephen.green@sun.com
> http://blogs.sun.com/searchguy
>
>
> On Apr 18, 2009, at 1:33, Otis Gospodnetic <ot...@yahoo.com>
> wrote:
>
>>
>> Talking about records.... Stephen, would you mind putting your
>> notes/commands/steps on the Mahout Wiki?
>>
>>
>> Thank you,
>> Otis
>> --
>> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>>
>>
>>
>> ----- Original Message ----
>>>
>>> From: Tim Bass <ti...@gmail.com>
>>> To: mahout-user@lucene.apache.org
>>> Sent: Thursday, April 16, 2009 6:27:56 PM
>>> Subject: Re: Mahout on Elastic MapReduce
>>>
>>> For the record.....
>>>
>>> Are Stephen's heroic efforts the first instance of Mahout running on EMR?
>>>
>>> On Fri, Apr 17, 2009 at 3:08 AM, Stephen Green wrote:
>>>>
>>>> On Apr 16, 2009, at 4:04 PM, Jeff Eastman wrote:
>>>>
>>>>> Excellent news! Please do open a Jira and tell us what worked. I
>>>>> suspect
>>>>> all of the clustering jobs share the same behavior, and other jobs for
>>>>> other
>>>>> components may as well.
>>>>
>>>> Will do. I've just done a couple of searches for everywhere that
>>>> FileSystem.get is called and I've fixed most of them (although come to
>>>> think
>>>> of it I did it in the 0.1 release tag. Crap! Things haven't changed
>>>> that
>>>> much, right?)
>>>>
>>>> I'm building and running the tests now, and I'll open a Jira with a
>>>> patch
>>>> once that's working.
>>>>
>>>>> Congratulations,
>>>>
>>>> As my father used to say "Brute force has an elegance all its own".
>>>>
>>>> Steve
>>>> --
>>>> Stephen Green // Stephen.Green@sun.com
>>>> Principal Investigator \\ http://blogs.sun.com/searchguy
>>>> Aura Project // Voice: +1 781-442-0926
>>>> Sun Microsystems Labs \\ Fax: +1 781-442-1692
>>>>
>>>>
>>>>
>>>>
>>
>
Re: Mahout on Elastic MapReduce
Posted by Tim Bass <ti...@gmail.com>.
Hi Stephen,
On the wiki, you say:
"Download the Ruby command line client for EMR ... "
Do you mind wrapping the text above to a link to that Ruby client on the wiki?
Thanks!
>>> Happy to. I'll do it Monday.
>>
>> But I didn't say *which* Monday, did I! Hah!
>>
>> Sorry for the delay. I added a page to the Wiki this afternoon:
>>
>>
>> http://cwiki.apache.org/confluence/display/MAHOUT/Mahout+on+Elastic+MapReduce
>>
>> As I was writing, I re-tried the examples to make sure that the
>> invocations on the page work, so I think everything's in order.
>>
>> If someone who has the permissions necessary to delete attachments could
>> delete the extra graphics, I'd appreciate it.
>>
>>> On Apr 18, 2009, at 1:33, Otis Gospodnetic <ot...@yahoo.com>
>>> wrote:
>>>
>>>>
>>>> Talking about records.... Stephen, would you mind putting your
>>>> notes/commands/steps on the Mahout Wiki?
>>>>
>>>>
>>>> Thank you,
>>>> Otis
>>>> --
>>
>>
>> Steve
>> --
>> Stephen Green // Stephen.Green@sun.com
>> Principal Investigator \\ http://blogs.sun.com/searchguy
>> Aura Project // Voice: +1 781-442-0926
>> Sun Microsystems Labs \\ Fax: +1 781-442-1692
>>
>>
>>
>
> --------------------------
> Grant Ingersoll
> http://www.lucidimagination.com/
>
> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using
> Solr/Lucene:
> http://www.lucidimagination.com/search
>
>
Re: Mahout on Elastic MapReduce
Posted by Grant Ingersoll <gs...@apache.org>.
Very cool, Stephen! I'll try to delete the attachments sometime this
week.
On May 1, 2009, at 2:42 PM, Stephen Green wrote:
> On Apr 18, 2009, at 7:49 AM, Stephen Green wrote:
>
>> Happy to. I'll do it Monday.
>
> But I didn't say *which* Monday, did I! Hah!
>
> Sorry for the delay. I added a page to the Wiki this afternoon:
>
> http://cwiki.apache.org/confluence/display/MAHOUT/Mahout+on+Elastic+MapReduce
>
> As I was writing, I re-tried the examples to make sure that the
> invocations on the page work, so I think everything's in order.
>
> If someone who has the permissions necessary to delete attachments
> could delete the extra graphics, I'd appreciate it.
>
>> On Apr 18, 2009, at 1:33, Otis Gospodnetic <otis_gospodnetic@yahoo.com
>> > wrote:
>>
>>>
>>> Talking about records.... Stephen, would you mind putting your
>>> notes/commands/steps on the Mahout Wiki?
>>>
>>>
>>> Thank you,
>>> Otis
>>> --
>
>
> Steve
> --
> Stephen Green // Stephen.Green@sun.com
> Principal Investigator \\ http://blogs.sun.com/searchguy
> Aura Project // Voice: +1 781-442-0926
> Sun Microsystems Labs \\ Fax: +1 781-442-1692
>
>
>
--------------------------
Grant Ingersoll
http://www.lucidimagination.com/
Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)
using Solr/Lucene:
http://www.lucidimagination.com/search
Re: Mahout on Elastic MapReduce
Posted by Stephen Green <St...@sun.com>.
On Apr 18, 2009, at 7:49 AM, Stephen Green wrote:
> Happy to. I'll do it Monday.
But I didn't say *which* Monday, did I! Hah!
Sorry for the delay. I added a page to the Wiki this afternoon:
http://cwiki.apache.org/confluence/display/MAHOUT/Mahout+on+Elastic+MapReduce
As I was writing, I re-tried the examples to make sure that the
invocations on the page work, so I think everything's in order.
If someone who has the permissions necessary to delete attachments
could delete the extra graphics, I'd appreciate it.
> On Apr 18, 2009, at 1:33, Otis Gospodnetic
> <ot...@yahoo.com> wrote:
>
>>
>> Talking about records.... Stephen, would you mind putting your
>> notes/commands/steps on the Mahout Wiki?
>>
>>
>> Thank you,
>> Otis
>> --
Steve
--
Stephen Green // Stephen.Green@sun.com
Principal Investigator \\ http://blogs.sun.com/searchguy
Aura Project // Voice: +1 781-442-0926
Sun Microsystems Labs \\ Fax: +1 781-442-1692
Re: Mahout on Elastic MapReduce
Posted by Stephen Green <St...@sun.com>.
Happy to. I'll do it Monday.
Stephen Green
--
stephen.green@sun.com
http://blogs.sun.com/searchguy
On Apr 18, 2009, at 1:33, Otis Gospodnetic
<ot...@yahoo.com> wrote:
>
> Talking about records.... Stephen, would you mind putting your notes/
> commands/steps on the Mahout Wiki?
>
>
> Thank you,
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
>
>
> ----- Original Message ----
>> From: Tim Bass <ti...@gmail.com>
>> To: mahout-user@lucene.apache.org
>> Sent: Thursday, April 16, 2009 6:27:56 PM
>> Subject: Re: Mahout on Elastic MapReduce
>>
>> For the record.....
>>
>> Are Stephen's heroic efforts the first instance of Mahout running on
>> EMR?
>>
>> On Fri, Apr 17, 2009 at 3:08 AM, Stephen Green wrote:
>>>
>>> On Apr 16, 2009, at 4:04 PM, Jeff Eastman wrote:
>>>
>>>> Excellent news! Please do open a Jira and tell us what worked. I
>>>> suspect
>>>> all of the clustering jobs share the same behavior, and other
>>>> jobs for other
>>>> components may as well.
>>>
>>> Will do. I've just done a couple of searches for everywhere that
>>> FileSystem.get is called and I've fixed most of them (although
>>> come to think
>>> of it I did it in the 0.1 release tag. Crap! Things haven't
>>> changed that
>>> much, right?)
>>>
>>> I'm building and running the tests now, and I'll open a Jira with
>>> a patch
>>> once that's working.
>>>
>>>> Congratulations,
>>>
>>> As my father used to say "Brute force has an elegance all its own".
>>>
>>> Steve
>>> --
>>> Stephen Green // Stephen.Green@sun.com
>>> Principal Investigator \\ http://blogs.sun.com/searchguy
>>> Aura Project // Voice: +1 781-442-0926
>>> Sun Microsystems Labs \\ Fax: +1 781-442-1692
>>>
>>>
>>>
>>>
>
Re: Mahout on Elastic MapReduce
Posted by Otis Gospodnetic <ot...@yahoo.com>.
Talking about records.... Stephen, would you mind putting your notes/commands/steps on the Mahout Wiki?
Thank you,
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
----- Original Message ----
> From: Tim Bass <ti...@gmail.com>
> To: mahout-user@lucene.apache.org
> Sent: Thursday, April 16, 2009 6:27:56 PM
> Subject: Re: Mahout on Elastic MapReduce
>
> For the record.....
>
> Are Stephen's heroic efforts the first instance of Mahout running on EMR?
>
> On Fri, Apr 17, 2009 at 3:08 AM, Stephen Green wrote:
> >
> > On Apr 16, 2009, at 4:04 PM, Jeff Eastman wrote:
> >
> >> Excellent news! Please do open a Jira and tell us what worked. I suspect
> >> all of the clustering jobs share the same behavior, and other jobs for other
> >> components may as well.
> >
> > Will do. I've just done a couple of searches for everywhere that
> > FileSystem.get is called and I've fixed most of them (although come to think
> > of it I did it in the 0.1 release tag. Crap! Things haven't changed that
> > much, right?)
> >
> > I'm building and running the tests now, and I'll open a Jira with a patch
> > once that's working.
> >
> >> Congratulations,
> >
> > As my father used to say "Brute force has an elegance all its own".
> >
> > Steve
> > --
> > Stephen Green // Stephen.Green@sun.com
> > Principal Investigator \\ http://blogs.sun.com/searchguy
> > Aura Project // Voice: +1 781-442-0926
> > Sun Microsystems Labs \\ Fax: +1 781-442-1692
> >
> >
> >
> >
Re: Mahout on Elastic MapReduce
Posted by Stephen Green <St...@sun.com>.
On Apr 16, 2009, at 11:30 AM, Tim Bass wrote:
> This has been very interesting. Thank you for sharing this, Stephen.
No worries. Fortunately, it's one of those weeks where I'm enjoying
being down in the weeds :-) And there's been pretty steady progress
thanks to you guys and the folks at Amazon.
> Your posts on EMR and Mahout
> are motivating me to get off the golf course and badminton courts and
> re-energize my efforts with EC2 and EMR.
It does seem like it would be worthwhile to be able to easily launch a
short-term MR job that would (for example) recompute user-user or
item-item similarity using Taste.
Steve
--
Stephen Green // Stephen.Green@sun.com
Principal Investigator \\ http://blogs.sun.com/searchguy
Aura Project // Voice: +1 781-442-0926
Sun Microsystems Labs \\ Fax: +1 781-442-1692
Re: Mahout on Elastic MapReduce
Posted by Tim Bass <ti...@gmail.com>.
This has been very interesting. Thank you for sharing this, Stephen.
Your posts on EMR and Mahout
are motivating me to get off the golf course and badminton courts and
re-energize my efforts with EC2 and EMR.
When we advance this a bit further, it would be great to use Amazon
Simple Queue Service to move data to S3 from live systems.
I apologize for lurking and being lazy. I need to get off my "you
know what" and contribute more.
On Thu, Apr 16, 2009 at 10:13 PM, Stephen Green <St...@sun.com> wrote:
>
> On Apr 16, 2009, at 11:10 AM, Jeff Eastman wrote:
>
>> Hi Stephen,
>>
>> It looks to me like you are on the right track. The original kMeans code
>> and job patterns were written over a year ago, probably on a version of
>> Hadoop 10 or 11 IIRC. They have made significant changes to the file system
>> in the interim and nobody - except you - has tried to run kMeans on EMR.
>
> A man, alone... This would make a good movie!
>
>> It certainly seems like dfs.exists(outPath) should be false if you have no
>> outPath. You have a sharp machete and are making good progress breaking a
>> jungle trail to EMR.
>
> I'll see if I can get the s3n thing working locally and then return to spend
> a few more dimes at Amazon. Once I figure it out, I'll add an issue to Jira
> and see if I can work up a patch for mahout-dev.
>
>> If you'd like to chat on the phone or Skype, please contact me directly
>> (jeff at windwardsolutions dot com).
>
> Thanks!
>
> Steve
> --
> Stephen Green // Stephen.Green@sun.com
> Principal Investigator \\ http://blogs.sun.com/searchguy
> Aura Project // Voice: +1 781-442-0926
> Sun Microsystems Labs \\ Fax: +1 781-442-1692
>
>
>
>
Re: Mahout on Elastic MapReduce
Posted by Tim Bass <ti...@gmail.com>.
For the record.....
Are Stephen's heroic efforts the first instance of Mahout running on EMR?
On Fri, Apr 17, 2009 at 3:08 AM, Stephen Green <St...@sun.com> wrote:
>
> On Apr 16, 2009, at 4:04 PM, Jeff Eastman wrote:
>
>> Excellent news! Please do open a Jira and tell us what worked. I suspect
>> all of the clustering jobs share the same behavior, and other jobs for other
>> components may as well.
>
> Will do. I've just done a couple of searches for everywhere that
> FileSystem.get is called and I've fixed most of them (although come to think
> of it I did it in the 0.1 release tag. Crap! Things haven't changed that
> much, right?)
>
> I'm building and running the tests now, and I'll open a Jira with a patch
> once that's working.
>
>> Congratulations,
>
> As my father used to say "Brute force has an elegance all its own".
>
> Steve
> --
> Stephen Green // Stephen.Green@sun.com
> Principal Investigator \\ http://blogs.sun.com/searchguy
> Aura Project // Voice: +1 781-442-0926
> Sun Microsystems Labs \\ Fax: +1 781-442-1692
>
>
>
>
Re: Mahout on Elastic MapReduce
Posted by Stephen Green <St...@sun.com>.
On Apr 16, 2009, at 4:04 PM, Jeff Eastman wrote:
> Excellent news! Please do open a Jira and tell us what worked. I
> suspect all of the clustering jobs share the same behavior, and
> other jobs for other components may as well.
Will do. I've just done a couple of searches for everywhere that
FileSystem.get is called and I've fixed most of them (although come to
think of it I did it in the 0.1 release tag. Crap! Things haven't
changed that much, right?)
I'm building and running the tests now, and I'll open a Jira with a
patch once that's working.
> Congratulations,
As my father used to say "Brute force has an elegance all its own".
Steve
--
Stephen Green // Stephen.Green@sun.com
Principal Investigator \\ http://blogs.sun.com/searchguy
Aura Project // Voice: +1 781-442-0926
Sun Microsystems Labs \\ Fax: +1 781-442-1692
Re: Mahout on Elastic MapReduce
Posted by Jeff Eastman <jd...@windwardsolutions.com>.
Excellent news! Please do open a Jira and tell us what worked. I suspect
all of the clustering jobs share the same behavior, and other jobs for
other components may as well.
Congratulations,
Jeff
Stephen Green wrote:
> Good news. I just had a successful run on EMR of the KMeans
> clustering of the synthetic control data.
>
> Shall I add an issue to Jira for the problems that I encountered and
> see if I can get together a patch?
>
> Steve
Re: Mahout on Elastic MapReduce
Posted by Stephen Green <St...@sun.com>.
Good news. I just had a successful run on EMR of the KMeans
clustering of the synthetic control data.
Shall I add an issue to Jira for the problems that I encountered and
see if I can get together a patch?
Steve
--
Stephen Green // Stephen.Green@sun.com
Principal Investigator \\ http://blogs.sun.com/searchguy
Aura Project // Voice: +1 781-442-0926
Sun Microsystems Labs \\ Fax: +1 781-442-1692
Re: Mahout on Elastic MapReduce
Posted by Stephen Green <St...@sun.com>.
On Apr 16, 2009, at 10:27 AM, Stephen Green wrote:
> java.lang.NullPointerException
>         at org.apache.hadoop.fs.s3native.NativeS3FileSystem.delete(NativeS3FileSystem.java:310)
>         at org.apache.mahout.clustering.syntheticcontrol.kmeans.Job.runJob(Job.java:83)
>         at org.apache.mahout.clustering.syntheticcontrol.kmeans.Job.main(Job.java:45)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>         at java.lang.reflect.Method.invoke(Method.java:597)
>         at org.apache.hadoop.util.RunJar.main(RunJar.java:155)
>         at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
>         at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)
>
> (The line numbers in kmeans.Job are weird because I added logging.)
>
> If the Hadoop on EMR is really 0.18.3, then the null pointer here is
> the store in the NativeS3FileSystem. But there's another problem:
> I deleted the output path before I started the run, so the existence
> check should have failed and dfs.delete never should have been
> called. I added a bit of logging to the KMeans job and here's what
> it says about the output path:
>
OK, I figured this one out. I gave it the URI s3n://mahout-output/ as
the output directory. That's a problem because the URI parsing code
interprets mahout-output as a host name, which means that somewhere
around NativeS3FileSystem:319 pathToKey returns the key "". An empty
key apparently indicates the root directory, which is always supposed
to exist, so getFileStatus returns a non-null result and the exists
call answers true.
Using a subdirectory on the URI (s3n://mahout-output/kmeans) gets the
key kmeans, which moves things along a little farther.
This is a weird disconnect in the pathToKey code, I think.
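The key derivation can be sketched quickly (illustrative Python using urllib's parser rather than Hadoop's; path_to_key here is my stand-in for the method of the same name in NativeS3FileSystem):

```python
from urllib.parse import urlparse

def path_to_key(uri: str) -> str:
    """Stand-in for NativeS3FileSystem.pathToKey: the bucket is the
    URI authority, and the key is the path minus its leading '/'."""
    return urlparse(uri).path.lstrip("/")

# A bucket-only URI parses 'mahout-output' as the host, leaving an
# empty key -- which S3 treats as the bucket root, and the root
# "always exists", so the exists() check comes back true.
print(repr(path_to_key("s3n://mahout-output/")))        # ''
# Adding a subdirectory yields a real key, and things move along.
print(repr(path_to_key("s3n://mahout-output/kmeans")))  # 'kmeans'
```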
Steve
--
Stephen Green // Stephen.Green@sun.com
Principal Investigator \\ http://blogs.sun.com/searchguy
Aura Project // Voice: +1 781-442-0926
Sun Microsystems Labs \\ Fax: +1 781-442-1692
Re: Mahout on Elastic MapReduce
Posted by Stephen Green <St...@sun.com>.
On Apr 16, 2009, at 11:10 AM, Jeff Eastman wrote:
> Hi Stephen,
>
> It looks to me like you are on the right track. The original kMeans
> code and job patterns were written over a year ago, probably on a
> version of Hadoop 10 or 11 IIRC. They have made significant changes
> to the file system in the interim and nobody - except you - has
> tried to run kMeans on EMR.
A man, alone... This would make a good movie!
> It certainly seems like dfs.exists(outPath) should be false if you
> have no outPath. You have a sharp machete and are making good
> progress breaking a jungle trail to EMR.
I'll see if I can get the s3n thing working locally and then return to
spend a few more dimes at Amazon. Once I figure it out, I'll add an
issue to Jira and see if I can work up a patch for mahout-dev.
> If you'd like to chat on the phone or Skype, please contact me
> directly (jeff at windwardsolutions dot com).
Thanks!
Steve
--
Stephen Green // Stephen.Green@sun.com
Principal Investigator \\ http://blogs.sun.com/searchguy
Aura Project // Voice: +1 781-442-0926
Sun Microsystems Labs \\ Fax: +1 781-442-1692
Re: Mahout on Elastic MapReduce
Posted by Jeff Eastman <jd...@windwardsolutions.com>.
Hi Stephen,
It looks to me like you are on the right track. The original kMeans code
and job patterns were written over a year ago, probably on a version of
Hadoop 10 or 11 IIRC. They have made significant changes to the file
system in the interim and nobody - except you - has tried to run kMeans
on EMR.
The logic about using the incorrect file system method is sound, and
your fix seems like it should work. I don't expect the hadoop version
differences to impact you since kMeans has not been updated recently to
take advantage of hadoop improvements.
It certainly seems like dfs.exists(outPath) should be false if you have
no outPath. You have a sharp machete and are making good progress
breaking a jungle trail to EMR. If you'd like to chat on the phone or
Skype, please contact me directly (jeff at windwardsolutions dot com).
Jeff
Stephen Green wrote:
> A bit more progress. I asked about this problem on Amazon's EMR
> forums. Here's the thread:
>
> http://developer.amazonwebservices.com/connect/thread.jspa?threadID=30945
>
> The answer from Amazon was:
>
>> This appears to be an issue with Mahout. This exception is fairly
>> common and matches the pattern of "Wrong FS: s3n://*/, expected:
>> hdfs://*:9000". This occurs when you try and use an S3N path with
>> HDFS. Typically this occurs because the code asks for the wrong
>> FileSystem.
>>
>> This could happen because a developer used the wrong static method on
>> Hadoop's FileSystem class:
>>
>> http://hadoop.apache.org/core/docs/r0.18.3/api/org/apache/hadoop/fs/FileSystem.html
>>
>>
>> If you call FileSystem.get(Configuration conf) you'll get an instance
>> of the cluster's default file system, which in our case is HDFS.
>> Instead, if you have a URI and want a reference to the FileSystem
>> that URI points to, you should call the method FileSystem.get(URI
>> uri, Configuration conf).
>>
>
> He offered a solution that involved using DistCp to copy data from S3
> to HDFS and then back again, but since I have the Mahout source, I
> decided to pursue things a bit further. I went into the source and
> modified the places where the filesystem is fetched to do the following:
>
> FileSystem dfs = FileSystem.get(outPath.toUri(), conf);
>
> (there were 3 places that I changed it, but I expect there are more
> lying around.) This is the idiom used by the CloudBurst example on EMR.
>
> Making this change fixes the exception that I was getting, but I'm now
> getting a different exception:
>
> java.lang.NullPointerException
>     at org.apache.hadoop.fs.s3native.NativeS3FileSystem.delete(NativeS3FileSystem.java:310)
>     at org.apache.mahout.clustering.syntheticcontrol.kmeans.Job.runJob(Job.java:83)
>     at org.apache.mahout.clustering.syntheticcontrol.kmeans.Job.main(Job.java:45)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>     at java.lang.reflect.Method.invoke(Method.java:597)
>     at org.apache.hadoop.util.RunJar.main(RunJar.java:155)
>     at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
>     at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)
>
> (The line numbers in kmeans.Job are weird because I added logging.)
>
> If the Hadoop on EMR is really 0.18.3, then the null pointer here is
> the store in the NativeS3FileSystem. But there's another problem: I
> deleted the output path before I started the run, so the existence
> check should have failed and dfs.delete never should have been
> called. I added a bit of logging to the KMeans job and here's what it
> says about the output path:
>
> 2009-04-16 14:04:35,757 INFO org.apache.mahout.clustering.syntheticcontrol.kmeans.Job (main): dfs: class org.apache.hadoop.fs.s3native.NativeS3FileSystem
>
> So it got the right output file system type.
>
> 2009-04-16 14:04:35,758 INFO org.apache.mahout.clustering.syntheticcontrol.kmeans.Job (main): s3n://mahout-output/ exists: true
>
> Shouldn't dfs.exists(outPath) fail for a non-existent path? And
> didn't the store have to exist (i.e., be non-null) for it to figure
> this out? I guess this really is starting to verge into base hadoop
> territory.
>
> I'm rapidly getting to the point where I need to solve this one just
> to prove to myself that I can get it to run!
>
> Steve
Re: Mahout on Elastic MapReduce
Posted by Stephen Green <St...@sun.com>.
On Apr 16, 2009, at 10:27 AM, Stephen Green wrote:
> If the Hadoop on EMR is really 0.18.3, then the null pointer here is
> the store in the NativeS3FileSystem.
Perhaps it's not really 0.18.3. I decided to finally try a local run
with s3n URIs for the input and output and got a very similar error:
java.lang.NullPointerException
    at org.apache.hadoop.fs.s3native.NativeS3FileSystem.delete(NativeS3FileSystem.java:303)
    at org.apache.mahout.clustering.syntheticcontrol.kmeans.Job.runJob(Job.java:83)
    at org.apache.mahout.clustering.syntheticcontrol.kmeans.Job.main(Job.java:45)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:155)
    at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
    at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)
But at a different spot.
Steve
--
Stephen Green // Stephen.Green@sun.com
Principal Investigator \\ http://blogs.sun.com/searchguy
Aura Project // Voice: +1 781-442-0926
Sun Microsystems Labs \\ Fax: +1 781-442-1692
Re: Mahout on Elastic MapReduce
Posted by Stephen Green <St...@sun.com>.
A bit more progress. I asked about this problem on Amazon's EMR
forums. Here's the thread:
http://developer.amazonwebservices.com/connect/thread.jspa?threadID=30945
The answer from Amazon was:
> This appears to be an issue with Mahout. This exception is fairly
> common and matches the pattern of "Wrong FS: s3n://*/, expected:
> hdfs://*:9000". This occurs when you try and use an S3N path with
> HDFS. Typically this occurs because the code asks for the wrong
> FileSystem.
>
> This could happen because a developer used the wrong static method
> on Hadoop's FileSystem class:
>
> http://hadoop.apache.org/core/docs/r0.18.3/api/org/apache/hadoop/fs/FileSystem.html
>
> If you call FileSystem.get(Configuration conf) you'll get an
> instance of the cluster's default file system, which in our case is
> HDFS. Instead, if you have a URI and want a reference to the
> FileSystem that URI points to, you should call the method
> FileSystem.get(URI uri, Configuration conf).
>
He offered a solution that involved using DistCp to copy data from S3
to HDFS and then back again, but since I have the Mahout source, I
decided to pursue things a bit further. I went into the source and
modified the places where the filesystem is fetched to do the following:
FileSystem dfs = FileSystem.get(outPath.toUri(), conf);
(there were 3 places that I changed it, but I expect there are more
lying around.) This is the idiom used by the CloudBurst example on EMR.
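For anyone following along, here is a tiny self-contained sketch of the difference between the two lookups (this only mimics the dispatch behavior; it is not Hadoop's actual code, and the "master" hostname is made up):

```java
import java.net.URI;

// Toy model of Hadoop's two FileSystem.get overloads (not the real code).
// get(conf) ignores the path and returns the cluster default (HDFS on EMR);
// get(uri, conf) picks the filesystem by the URI's own scheme.
public class FsDispatch {
    // stand-in for fs.default.name in the cluster config (made-up host)
    static final URI DEFAULT_FS = URI.create("hdfs://master:9000");

    // like FileSystem.get(conf): always the default filesystem's scheme
    static String getScheme() {
        return DEFAULT_FS.getScheme();
    }

    // like FileSystem.get(uri, conf): the URI's scheme wins when present
    static String getScheme(URI uri) {
        return uri.getScheme() != null ? uri.getScheme() : getScheme();
    }

    public static void main(String[] args) {
        URI outPath = URI.create("s3n://mahout-output");
        System.out.println(getScheme());        // hdfs -> "Wrong FS" on an s3n path
        System.out.println(getScheme(outPath)); // s3n  -> what the fix selects
    }
}
```

The first call models why the original code blows up: the output path is s3n but the cluster default is HDFS, so checkPath() rejects it.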
Making this change fixes the exception that I was getting, but I'm now
getting a different exception:
java.lang.NullPointerException
    at org.apache.hadoop.fs.s3native.NativeS3FileSystem.delete(NativeS3FileSystem.java:310)
    at org.apache.mahout.clustering.syntheticcontrol.kmeans.Job.runJob(Job.java:83)
    at org.apache.mahout.clustering.syntheticcontrol.kmeans.Job.main(Job.java:45)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:155)
    at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
    at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)
(The line numbers in kmeans.Job are weird because I added logging.)
If the Hadoop on EMR is really 0.18.3, then the null pointer here is
the store in the NativeS3FileSystem. But there's another problem: I
deleted the output path before I started the run, so the existence
check should have failed and dfs.delete never should have been
called. I added a bit of logging to the KMeans job and here's what it
says about the output path:
2009-04-16 14:04:35,757 INFO org.apache.mahout.clustering.syntheticcontrol.kmeans.Job (main): dfs: class org.apache.hadoop.fs.s3native.NativeS3FileSystem
So it got the right output file system type.
2009-04-16 14:04:35,758 INFO org.apache.mahout.clustering.syntheticcontrol.kmeans.Job (main): s3n://mahout-output/ exists: true
Shouldn't dfs.exists(outPath) fail for a non-existent path? And
didn't the store have to exist (i.e., be non-null) for it to figure
this out? I guess this really is starting to verge into base hadoop
territory.
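One plausible explanation (an assumption on my part, not verified against the NativeS3FileSystem source): S3 has no real directories, so the bucket root tends to behave like a directory that always exists, which would make exists(s3n://mahout-output/) return true even after everything in the bucket has been deleted. A toy model of that behavior:

```java
import java.net.URI;

// Toy model of why exists() on an s3n bucket root can always be true.
// Assumption: simplified behavior for illustration, not Hadoop's real code.
public class BucketRootExists {
    // pretend key listing for the bucket: empty, i.e. everything was deleted
    static final java.util.Set<String> KEYS = java.util.Collections.emptySet();

    static boolean exists(String path) {
        String p = URI.create(path).getPath();
        // the bucket root is reported as an existing directory regardless
        if (p == null || p.isEmpty() || p.equals("/")) {
            return true;
        }
        // other paths exist only if some key matches
        return KEYS.contains(p.substring(1));
    }

    public static void main(String[] args) {
        System.out.println(exists("s3n://mahout-output/"));       // true
        System.out.println(exists("s3n://mahout-output/part-0")); // false
    }
}
```

If something like this is what the real filesystem does, the exists() check would pass and runJob would go on to call delete() on the bucket root, which is where the NPE shows up.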
I'm rapidly getting to the point where I need to solve this one just
to prove to myself that I can get it to run!
Steve
--
Stephen Green // Stephen.Green@sun.com
Principal Investigator \\ http://blogs.sun.com/searchguy
Aura Project // Voice: +1 781-442-0926
Sun Microsystems Labs \\ Fax: +1 781-442-1692
Re: Mahout on Elastic MapReduce
Posted by Stephen Green <St...@sun.com>.
On Apr 15, 2009, at 3:28 PM, Stephen Green wrote:
>
> On Apr 15, 2009, at 3:16 PM, Grant Ingersoll wrote:
>
>>
>> On Apr 15, 2009, at 2:54 PM, Stephen Green wrote:
>>
>>>
>>> On Apr 15, 2009, at 2:19 PM, Jeff Eastman wrote:
>>>
>>>> None of the clustering implementations hard code the filesystem.
>>>> The file names are constructed from the input and output filepath
>>>> arguments.
>>>
>>> This is what I saw as well.
>>>
>>>> Grant Ingersoll wrote:
>>>>> I seem to recall this being something you have to set in your
>>>>> Hadoop configuration. Or, let me double check that we aren't
>>>>> hard-coding the FS in our Job.
>>>
>>> I'm pretty sure that the configuration on EMR allows this, as the
>>> job has no problem reading from the s3n input data.
>>>
>>
>> Can you share your Hadoop configuration? i.e. the contents of the
>> conf directory? You might actually have some luck asking this
>> question on core-users over at Hadoop.
>
> Well, Amazon controls the configuration, but I could probably log
> into the master while it's running and pull it out.
Not sure it'll help, but I've attached the config file for the hadoop
that runs in EMR. I've redacted my AWS credentials, but they were
correct. I pulled the rest of the conf directory if you want to have
a look at that.
Steve
--
Stephen Green // Stephen.Green@sun.com
Principal Investigator \\ http://blogs.sun.com/searchguy
Aura Project // Voice: +1 781-442-0926
Sun Microsystems Labs \\ Fax: +1 781-442-1692
Re: Mahout on Elastic MapReduce
Posted by Stephen Green <St...@sun.com>.
On Apr 15, 2009, at 3:16 PM, Grant Ingersoll wrote:
>
> On Apr 15, 2009, at 2:54 PM, Stephen Green wrote:
>
>>
>> On Apr 15, 2009, at 2:19 PM, Jeff Eastman wrote:
>>
>>> None of the clustering implementations hard code the filesystem.
>>> The file names are constructed from the input and output filepath
>>> arguments.
>>
>> This is what I saw as well.
>>
>>> Grant Ingersoll wrote:
>>>> I seem to recall this being something you have to set in your
>>>> Hadoop configuration. Or, let me double check that we aren't
>>>> hard-coding the FS in our Job.
>>
>> I'm pretty sure that the configuration on EMR allows this, as the
>> job has no problem reading from the s3n input data.
>>
>
> Can you share your Hadoop configuration? i.e. the contents of the
> conf directory? You might actually have some luck asking this
> question on core-users over at Hadoop.
Well, Amazon controls the configuration, but I could probably log into
the master while it's running and pull it out.
Steve
--
Stephen Green // Stephen.Green@sun.com
Principal Investigator \\ http://blogs.sun.com/searchguy
Aura Project // Voice: +1 781-442-0926
Sun Microsystems Labs \\ Fax: +1 781-442-1692
Re: Mahout on Elastic MapReduce
Posted by Grant Ingersoll <gs...@apache.org>.
On Apr 15, 2009, at 2:54 PM, Stephen Green wrote:
>
> On Apr 15, 2009, at 2:19 PM, Jeff Eastman wrote:
>
>> None of the clustering implementations hard code the filesystem.
>> The file names are constructed from the input and output filepath
>> arguments.
>
> This is what I saw as well.
>
>> Grant Ingersoll wrote:
>>> I seem to recall this being something you have to set in your
>>> Hadoop configuration. Or, let me double check that we aren't hard-
>>> coding the FS in our Job.
>
> I'm pretty sure that the configuration on EMR allows this, as the
> job has no problem reading from the s3n input data.
>
Can you share your Hadoop configuration? i.e. the contents of the
conf directory? You might actually have some luck asking this
question on core-users over at Hadoop.
> I'm going to keep plugging!
>
> Steve
> --
> Stephen Green // Stephen.Green@sun.com
> Principal Investigator \\ http://blogs.sun.com/searchguy
> Aura Project // Voice: +1 781-442-0926
> Sun Microsystems Labs \\ Fax: +1 781-442-1692
>
>
>
--------------------------
Grant Ingersoll
http://www.lucidimagination.com/
Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)
using Solr/Lucene:
http://www.lucidimagination.com/search
Re: Mahout on Elastic MapReduce
Posted by Stephen Green <St...@sun.com>.
On Apr 15, 2009, at 2:19 PM, Jeff Eastman wrote:
> None of the clustering implementations hard code the filesystem. The
> file names are constructed from the input and output filepath
> arguments.
This is what I saw as well.
> Grant Ingersoll wrote:
>> I seem to recall this being something you have to set in your
>> Hadoop configuration. Or, let me double check that we aren't hard-
>> coding the FS in our Job.
I'm pretty sure that the configuration on EMR allows this, as the job
has no problem reading from the s3n input data.
I'm going to keep plugging!
Steve
--
Stephen Green // Stephen.Green@sun.com
Principal Investigator \\ http://blogs.sun.com/searchguy
Aura Project // Voice: +1 781-442-0926
Sun Microsystems Labs \\ Fax: +1 781-442-1692
Re: Mahout on Elastic MapReduce
Posted by Jeff Eastman <jd...@windwardsolutions.com>.
None of the clustering implementations hard code the filesystem. The
file names are constructed from the input and output filepath arguments.
Jeff
Grant Ingersoll wrote:
> I seem to recall this being something you have to set in your Hadoop
> configuration. Or, let me double check that we aren't hard-coding the
> FS in our Job.
>
> -Grant
>
> On Apr 15, 2009, at 1:27 PM, Stephen Green wrote:
>
>> On Apr 14, 2009, at 6:54 PM, Stephen Green wrote:
>>> On Apr 14, 2009, at 5:17 PM, Grant Ingersoll wrote:
>>>
>>>> I would be concerned about the fact that EMR is using 0.18 and
>>>> Mahout is on 0.19 (which of course raises another concern expressed
>>>> by Owen O'Malley to me at ApacheCon: No one uses 0.19)
>>>
>>> Well, I did run Mahout locally on a 0.18.3 install, but that was
>>> writing to and reading from HDFS. I can build a custom
>>> mahout-examples that has the 0.18.3 Hadoop jars (or perhaps no
>>> hadoop jar at all...) I'm guessing if EMR is on 0.18.3 and it gets
>>> popular, then you're going to have to deal with that problem.
>>
>>
>> More fun today. I checked out the mahout-0.1 release and rebuilt
>> mahout. I took the mahout-examples job, removed the hadoop jar, and
>> then tried to run the KMeans clustering against the synthetic control
>> data. This failed with the same exception that I was originally
>> getting yesterday:
>>
>> java.lang.IllegalArgumentException: Wrong FS: s3n://mahout-output/,
>> expected: hdfs://domU-12-31-38-01-C5-22.compute-1.internal:9000
>>     at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:320)
>>     at org.apache.hadoop.dfs.DistributedFileSystem.checkPath(DistributedFileSystem.java:84)
>>     at org.apache.hadoop.dfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:140)
>>     at org.apache.hadoop.dfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:408)
>>     at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:667)
>>     at org.apache.mahout.clustering.syntheticcontrol.kmeans.Job.runJob(Job.java:77)
>>     at org.apache.mahout.clustering.syntheticcontrol.kmeans.Job.main(Job.java:43)
>>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>     at java.lang.reflect.Method.invoke(Method.java:597)
>>     at org.apache.hadoop.util.RunJar.main(RunJar.java:155)
>>     at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
>>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
>>     at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)
>>
>> Steve
>> --
>> Stephen Green // Stephen.Green@sun.com
>> Principal Investigator \\ http://blogs.sun.com/searchguy
>> Aura Project // Voice: +1 781-442-0926
>> Sun Microsystems Labs \\ Fax: +1 781-442-1692
>>
>>
>>
>
> --------------------------
> Grant Ingersoll
> http://www.lucidimagination.com/
>
> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)
> using Solr/Lucene:
> http://www.lucidimagination.com/search
>
>
>
Re: Mahout on Elastic MapReduce
Posted by Grant Ingersoll <gs...@apache.org>.
I seem to recall this being something you have to set in your Hadoop
configuration. Or, let me double check that we aren't hard-coding the
FS in our Job.
-Grant
On Apr 15, 2009, at 1:27 PM, Stephen Green wrote:
> On Apr 14, 2009, at 6:54 PM, Stephen Green wrote:
>> On Apr 14, 2009, at 5:17 PM, Grant Ingersoll wrote:
>>
>>> I would be concerned about the fact that EMR is using 0.18 and
>>> Mahout is on 0.19 (which of course raises another concern
>>> expressed by Owen O'Malley to me at ApacheCon: No one uses 0.19)
>>
>> Well, I did run Mahout locally on a 0.18.3 install, but that was
>> writing to and reading from HDFS. I can build a custom mahout-
>> examples that has the 0.18.3 Hadoop jars (or perhaps no hadoop jar
>> at all...) I'm guessing if EMR is on 0.18.3 and it gets popular,
>> then you're going to have to deal with that problem.
>
>
> More fun today. I checked out the mahout-0.1 release and rebuilt
> mahout. I took the mahout-examples job, removed the hadoop jar, and
> then tried to run the KMeans clustering against the synthetic
> control data. This failed with the same exception that I was
> originally getting yesterday:
>
> java.lang.IllegalArgumentException: Wrong FS: s3n://mahout-output/,
> expected: hdfs://domU-12-31-38-01-C5-22.compute-1.internal:9000
>     at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:320)
>     at org.apache.hadoop.dfs.DistributedFileSystem.checkPath(DistributedFileSystem.java:84)
>     at org.apache.hadoop.dfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:140)
>     at org.apache.hadoop.dfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:408)
>     at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:667)
>     at org.apache.mahout.clustering.syntheticcontrol.kmeans.Job.runJob(Job.java:77)
>     at org.apache.mahout.clustering.syntheticcontrol.kmeans.Job.main(Job.java:43)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>     at java.lang.reflect.Method.invoke(Method.java:597)
>     at org.apache.hadoop.util.RunJar.main(RunJar.java:155)
>     at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
>     at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)
>
> Steve
> --
> Stephen Green // Stephen.Green@sun.com
> Principal Investigator \\ http://blogs.sun.com/searchguy
> Aura Project // Voice: +1 781-442-0926
> Sun Microsystems Labs \\ Fax: +1 781-442-1692
>
>
>
--------------------------
Grant Ingersoll
http://www.lucidimagination.com/
Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)
using Solr/Lucene:
http://www.lucidimagination.com/search
Re: Mahout on Elastic MapReduce
Posted by Stephen Green <St...@sun.com>.
On Apr 14, 2009, at 6:54 PM, Stephen Green wrote:
> On Apr 14, 2009, at 5:17 PM, Grant Ingersoll wrote:
>
>> I would be concerned about the fact that EMR is using 0.18 and
>> Mahout is on 0.19 (which of course raises another concern expressed
>> by Owen O'Malley to me at ApacheCon: No one uses 0.19)
>
> Well, I did run Mahout locally on a 0.18.3 install, but that was
> writing to and reading from HDFS. I can build a custom mahout-
> examples that has the 0.18.3 Hadoop jars (or perhaps no hadoop jar
> at all...) I'm guessing if EMR is on 0.18.3 and it gets popular,
> then you're going to have to deal with that problem.
More fun today. I checked out the mahout-0.1 release and rebuilt
mahout. I took the mahout-examples job, removed the hadoop jar, and
then tried to run the KMeans clustering against the synthetic control
data. This failed with the same exception that I was originally
getting yesterday:
java.lang.IllegalArgumentException: Wrong FS: s3n://mahout-output/,
expected: hdfs://domU-12-31-38-01-C5-22.compute-1.internal:9000
    at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:320)
    at org.apache.hadoop.dfs.DistributedFileSystem.checkPath(DistributedFileSystem.java:84)
    at org.apache.hadoop.dfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:140)
    at org.apache.hadoop.dfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:408)
    at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:667)
    at org.apache.mahout.clustering.syntheticcontrol.kmeans.Job.runJob(Job.java:77)
    at org.apache.mahout.clustering.syntheticcontrol.kmeans.Job.main(Job.java:43)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:155)
    at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
    at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)
Steve
--
Stephen Green // Stephen.Green@sun.com
Principal Investigator \\ http://blogs.sun.com/searchguy
Aura Project // Voice: +1 781-442-0926
Sun Microsystems Labs \\ Fax: +1 781-442-1692
Re: Mahout on Elastic MapReduce
Posted by Stephen Green <St...@sun.com>.
On Apr 14, 2009, at 5:17 PM, Grant Ingersoll wrote:
> I would be concerned about the fact that EMR is using 0.18 and
> Mahout is on 0.19 (which of course raises another concern expressed
> by Owen O'Malley to me at ApacheCon: No one uses 0.19)
Well, I did run Mahout locally on a 0.18.3 install, but that was
writing to and reading from HDFS. I can build a custom mahout-
examples that has the 0.18.3 Hadoop jars (or perhaps no hadoop jar at
all...) I'm guessing if EMR is on 0.18.3 and it gets popular, then
you're going to have to deal with that problem.
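For what it's worth, stripping a bundled hadoop jar out of the job file can be done without rebuilding Mahout; a sketch using java.util.zip (the file and entry names below are guesses, check with `jar tf` first):

```java
import java.io.*;
import java.util.zip.*;

// Copies a jar/zip, dropping entries whose names match a prefix.
// Used here to strip a bundled lib/hadoop-*.jar out of a .job file.
public class StripEntry {
    public static void strip(File in, File out, String prefix) throws IOException {
        try (ZipInputStream zin = new ZipInputStream(new FileInputStream(in));
             ZipOutputStream zout = new ZipOutputStream(new FileOutputStream(out))) {
            ZipEntry e;
            byte[] buf = new byte[8192];
            while ((e = zin.getNextEntry()) != null) {
                if (e.getName().startsWith(prefix)) {
                    continue; // drop the bundled hadoop jar
                }
                zout.putNextEntry(new ZipEntry(e.getName()));
                int n;
                while ((n = zin.read(buf)) > 0) {
                    zout.write(buf, 0, n);
                }
                zout.closeEntry();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // hypothetical names: adjust to the actual job file and jar entry
        File job = new File("mahout-examples-0.1.job");
        if (job.exists()) {
            strip(job, new File("mahout-examples-0.1-nohadoop.job"), "lib/hadoop-");
        }
    }
}
```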
> I'd say you should try reproducing the problem on the same version
> that Mahout uses.
That'll be a bit tricky in the EMR case as that's Amazon's business
(ask me about trying to get a 64bit Solaris AMI on Amazon's version of
Xen...)
>
> FWIW, any committer on the Mahout project can likely get credits to
> use AWS.
I'm happy to share my limited experience.
Also:
>> ----- Original Message ----
>>> From: Sean Owen <sr...@gmail.com>
>>> To: mahout-user@lucene.apache.org
>>> Sent: Tuesday, April 14, 2009 4:19:51 PM
>>> Subject: Re: Mahout on Elastic MapReduce
>>>
>>> This is a fairly uninformed observation, but: the error seems to be
>>> from Hadoop. It seems to say that it understands hdfs:, but not
>>> s3n:,
>>> and that makes sense to me. Do we expect Hadoop understands how to
>>> read from S3? I would expect not. (Though, you point to examples
>>> that
>>> seem to overcome this just fine?)
As Otis pointed out, Hadoop can handle S3 a couple of ways, and the
example that I've been working seems to be able to read the input data
from an s3n URI no problem.
>>> When I have integrated code with stuff stored on S3, I have always
>>> had
>>> to write extra glue code to copy from S3 to a local file system, do
>>> work, then copy back.
I think you do need to copy from S3 to HDFS, but I think that happens
automagically (? My Hadoop ignorance is starting to show!)
Steve
--
Stephen Green // Stephen.Green@sun.com
Principal Investigator \\ http://blogs.sun.com/searchguy
Aura Project // Voice: +1 781-442-0926
Sun Microsystems Labs \\ Fax: +1 781-442-1692
Re: Mahout on Elastic MapReduce
Posted by Grant Ingersoll <gs...@apache.org>.
I would be concerned about the fact that EMR is using 0.18 and Mahout
is on 0.19 (which of course raises another concern expressed by Owen
O'Malley to me at ApacheCon: No one uses 0.19)
I'd say you should try reproducing the problem on the same version
that Mahout uses.
FWIW, any committer on the Mahout project can likely get credits to
use AWS.
On Apr 14, 2009, at 5:08 PM, Otis Gospodnetic wrote:
>
> Hadoop should be able to read directly from S3, I believe: http://wiki.apache.org/hadoop/AmazonS3
>
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
>
>
> ----- Original Message ----
>> From: Sean Owen <sr...@gmail.com>
>> To: mahout-user@lucene.apache.org
>> Sent: Tuesday, April 14, 2009 4:19:51 PM
>> Subject: Re: Mahout on Elastic MapReduce
>>
>> This is a fairly uninformed observation, but: the error seems to be
>> from Hadoop. It seems to say that it understands hdfs:, but not s3n:,
>> and that makes sense to me. Do we expect Hadoop understands how to
>> read from S3? I would expect not. (Though, you point to examples that
>> seem to overcome this just fine?)
>>
>> When I have integrated code with stuff stored on S3, I have always
>> had
>> to write extra glue code to copy from S3 to a local file system, do
>> work, then copy back.
>>
>> On Tue, Apr 14, 2009 at 9:01 PM, Stephen Green wrote:
>>>
>>> On Apr 14, 2009, at 2:41 PM, Stephen Green wrote:
>>>
>>>>
>>>> On Apr 14, 2009, at 12:51 PM, Jeff Eastman wrote:
>>>>
>>>>> Hi Stephen,
>>>>>
>>>>> You are out on the bleeding edge with EMR.
>>>>
>>>> Yeah, but the view is lovely from here!
>>>>
>>>>> I've been able to run the kmeans example directly on a small EC2
>>>>> cluster
>>>>> that I started up myself (using the Hadoop src/contrib/ec2
>>>>> scripts). I have
>>>>> not yet tried EMR (just got an account yesterday), but I see
>>>>> that it
>>>>> requires you to have your data in S3 as opposed to HDFS.
>>>>>
>>>>> The job first runs the InputDriver to copy the raw test data
>>>>> into Mahout
>>>>> Vector external representation after deleting any pre-existing
>>>>> output files.
>>>>> It looks to me like the two delete() snippets you show are pretty
>>>>> equivalent. If you have no pre-existing output directory, the
>>>>> Mahout snippet
>>>>> won't attempt to delete it.
>>>>
>>>> I managed to figure that out :-) I'm pretty comfortable with the
>>>> ideas
>>>> behind MapReduce, but being confronted with my first Job is a bit
>>>> more
>>>> daunting than I expected.
>>>>
>>>>> I too am at a loss to explain what you are seeing. If you can
>>>>> post more
>>>>> results I can try to help you read the tea leaves...
>>>>
>>>> I noticed that the CloudBurst job just deleted the directory
>>>> without
>>>> checking for existence and so I tried the same thing with Mahout:
>>>>
>>>> java.lang.IllegalArgumentException: Wrong FS: s3n://mahout-output,
>>>> expected: hdfs://domU-12-31-38-00-6C-86.compute-1.internal:9000
>>>>     at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:320)
>>>>     at org.apache.hadoop.dfs.DistributedFileSystem.checkPath(DistributedFileSystem.java:84)
>>>>     at org.apache.hadoop.dfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:140)
>>>>     at org.apache.hadoop.dfs.DistributedFileSystem.delete(DistributedFileSystem.java:210)
>>>>     at org.apache.mahout.clustering.syntheticcontrol.kmeans.Job.runJob(Job.java:83)
>>>>     at org.apache.mahout.clustering.syntheticcontrol.kmeans.Job.main(Job.java:46)
>>>>
>>>> So no joy there.
>>>>
>>>> Should I see if I can isolate this as an s3n problem? I suppose
>>>> I could
>>>> try running the Hadoop job locally with it reading and writing
>>>> the data from
>>>> S3 and see if it suffers from the same problem. At least then I
>>>> could debug
>>>> inside Hadoop.
>>>>
>>>> Of course, I'm doing all this in Hadoop 0.18.3, and if it is an s3n
>>>> problem it might have been fixed already. That doesn't help much
>>>> running on
>>>> EMR, I guess.
>>>>
>>>> I'm also going to start a run on EMR that does away with the whole
>>>> exists/delete check and see if that works.
>>>
>>> Following up to myself (my wife will tell you that I talk to
>>> myself!) I
>>> removed a number of the exists/delete checks: in
>>> CanopyClusteringJob,
>>> CanopyDriver, KMeansDriver, and ClusterDriver. This allowed the
>>> jobs to
>>> progress, but they died the death a little later with the following
>>> exception (and a few more, I can send the whole log if you like):
>>>
>>> java.lang.IllegalArgumentException: Wrong FS:
>>> s3n://mahoutput/canopies/part-00000, expected:
>>> hdfs://domU-12-31-39-00-A5-44.compute-1.internal:9000
>>>     at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:320)
>>>     at org.apache.hadoop.dfs.DistributedFileSystem.checkPath(DistributedFileSystem.java:84)
>>>     at org.apache.hadoop.dfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:140)
>>>     at org.apache.hadoop.dfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:408)
>>>     at org.apache.hadoop.fs.FileSystem.getLength(FileSystem.java:695)
>>>     at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1420)
>>>     at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1415)
>>>     at org.apache.mahout.clustering.canopy.ClusterMapper.configure(ClusterMapper.java:69)
>>>     at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58)
>>>     at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:82)
>>>     at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:33)
>>>     at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58)
>>>     at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:82)
>>>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:223)
>>>     at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2198)
>>>
>>> Looking at the exception message there, I would almost swear that
>>> it thinks the whole s3n path is the name of a FS that it doesn't
>>> know about, but that might just be a bad message. This message
>>> repeats a few times (retrying failed mappers, I guess?) and then
>>> the job fails.
>>>
>>> One thing that occurred to me: the mahout examples job has the
>>> hadoop
>>> 0.19.1 core jar in it. Could I be seeing some kind of version
>>> skew between
>>> the hadoop in the job file and the one on EMR? Although it worked
>>> fine with
>>> a local 0.18.3, so maybe not.
>>>
>>> I'm going to see if I can get the stock Mahout to run with s3n
>>> inputs and
>>> outputs tomorrow and I'll let you all know how that goes.
>>>
>>> Steve
>>> --
>>> Stephen Green // Stephen.Green@sun.com
>>> Principal Investigator \\ http://blogs.sun.com/searchguy
>>> Aura Project // Voice: +1 781-442-0926
>>> Sun Microsystems Labs \\ Fax: +1 781-442-1692
>>>
>>>
>>>
>>>
>
--------------------------
Grant Ingersoll
http://www.lucidimagination.com/
Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)
using Solr/Lucene:
http://www.lucidimagination.com/search
Re: Mahout on Elastic MapReduce
Posted by Sean Owen <sr...@gmail.com>.
OK that is good evidence that it is not the 's3n' that is the problem. Seems
like it was expecting to write to a local filesystem from some earlier bit
of configuration?
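If that's right, the "expected: hdfs://...:9000" part of the exception is just the cluster's default filesystem leaking through. On 0.18.x that default comes from fs.default.name in hadoop-site.xml, something like (illustrative value copied from the exception message):

```xml
<property>
  <name>fs.default.name</name>
  <value>hdfs://domU-12-31-38-01-C5-22.compute-1.internal:9000</value>
</property>
```

FileSystem.get(conf) resolves to this value no matter what scheme the job's output path uses.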
Not sure this is helping,
Sean
On Apr 14, 2009 10:08 PM, "Otis Gospodnetic" <ot...@yahoo.com>
wrote:
Hadoop should be able to read directly from S3, I believe:
http://wiki.apache.org/hadoop/AmazonS3
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
Re: Mahout on Elastic MapReduce
Posted by Otis Gospodnetic <ot...@yahoo.com>.
Hadoop should be able to read directly from S3, I believe: http://wiki.apache.org/hadoop/AmazonS3
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
----- Original Message ----
> From: Sean Owen <sr...@gmail.com>
> To: mahout-user@lucene.apache.org
> Sent: Tuesday, April 14, 2009 4:19:51 PM
> Subject: Re: Mahout on Elastic MapReduce
>
> This is a fairly uninformed observation, but: the error seems to be
> from Hadoop. It seems to say that it understands hdfs:, but not s3n:,
> and that makes sense to me. Do we expect Hadoop understands how to
> read from S3? I would expect not. (Though, you point to examples that
> seem to overcome this just fine?)
>
> When I have integrated code with stuff stored on S3, I have always had
> to write extra glue code to copy from S3 to a local file system, do
> work, then copy back.
>
> On Tue, Apr 14, 2009 at 9:01 PM, Stephen Green wrote:
> >
> > On Apr 14, 2009, at 2:41 PM, Stephen Green wrote:
> >
> >>
> >> On Apr 14, 2009, at 12:51 PM, Jeff Eastman wrote:
> >>
> >>> Hi Stephen,
> >>>
> >>> You are out on the bleeding edge with EMR.
> >>
> >> Yeah, but the view is lovely from here!
> >>
> >>> I've been able to run the kmeans example directly on a small EC2 cluster
> >>> that I started up myself (using the Hadoop src/contrib/ec2 scripts). I have
> >>> not yet tried EMR (just got an account yesterday), but I see that it
> >>> requires you to have your data in S3 as opposed to HDFS.
> >>>
> >>> The job first runs the InputDriver to copy the raw test data into Mahout
> >>> Vector external representation after deleting any pre-existing output files.
> >>> It looks to me like the two delete() snippets you show are pretty
> >>> equivalent. If you have no pre-existing output directory, the Mahout snippet
> >>> won't attempt to delete it.
> >>
> >> I managed to figure that out :-) I'm pretty comfortable with the ideas
> >> behind MapReduce, but being confronted with my first Job is a bit more
> >> daunting than I expected.
> >>
> >>> I too am at a loss to explain what you are seeing. If you can post more
> >>> results I can try to help you read the tea leaves...
> >>
> >> I noticed that the CloudBurst job just deleted the directory without
> >> checking for existence and so I tried the same thing with Mahout:
> >>
> >> java.lang.IllegalArgumentException: Wrong FS: s3n://mahout-output,
> >> expected: hdfs://domU-12-31-38-00-6C-86.compute-1.internal:9000
> >> at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:320)
> >> at
> >>
> org.apache.hadoop.dfs.DistributedFileSystem.checkPath(DistributedFileSystem.java:84)
> >> at
> >>
> org.apache.hadoop.dfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:140)
> >> at
> >>
> org.apache.hadoop.dfs.DistributedFileSystem.delete(DistributedFileSystem.java:210)
> >> at
> >> org.apache.mahout.clustering.syntheticcontrol.kmeans.Job.runJob(Job.java:83)
> >> at
> >> org.apache.mahout.clustering.syntheticcontrol.kmeans.Job.main(Job.java:46)
> >>
> >> So no joy there.
> >>
> >> Should I see if I can isolate this as an s3n problem? I suppose I could
> >> try running the Hadoop job locally with it reading and writing the data from
> >> S3 and see if it suffers from the same problem. At least then I could debug
> >> inside Hadoop.
> >>
> >> Of course, I'm doing all this in Hadoop 0.18.3, and if it is an s3n
> >> problem it might have been fixed already. That doesn't help much running on
> >> EMR, I guess.
> >>
> >> I'm also going to start a run on EMR that does away with the whole
> >> exists/delete check and see if that works.
> >
> > Following up to myself (my wife will tell you that I talk to myself!) I
> > removed a number of the exists/delete checks: in CanopyClusteringJob,
> > CanopyDriver, KMeansDriver, and ClusterDriver. This allowed the jobs to
> > progress, but they died the death a little later with the following
> > exception (and a few more, I can send the whole log if you like):
> >
> > java.lang.IllegalArgumentException: Wrong FS:
> > s3n://mahoutput/canopies/part-00000, expected:
> > hdfs://domU-12-31-39-00-A5-44.compute-1.internal:9000
> > at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:320)
> > at
> >
> org.apache.hadoop.dfs.DistributedFileSystem.checkPath(DistributedFileSystem.java:84)
> > at
> >
> org.apache.hadoop.dfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:140)
> > at
> >
> org.apache.hadoop.dfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:408)
> > at org.apache.hadoop.fs.FileSystem.getLength(FileSystem.java:695)
> > at
> > org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1420)
> > at
> > org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1415)
> > at
> >
> org.apache.mahout.clustering.canopy.ClusterMapper.configure(ClusterMapper.java:69)
> > at
> > org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58)
> > at
> > org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:82)
> > at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:33)
> > at
> > org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58)
> > at
> > org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:82)
> > at org.apache.hadoop.mapred.MapTask.run(MapTask.java:223)
> > at
> > org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2198)
> >
> > Looking at the exception message there, I would almost swear that it thinks
> > the whole s3n path is the name of a FS that it doesn't know about, but that
> > might just be a bad message. This message repeats a few times (retrying
> > failed mappers, I guess?) and then the job fails.
> >
> > One thing that occurred to me: the mahout examples job has the hadoop
> > 0.19.1 core jar in it. Could I be seeing some kind of version skew between
> > the hadoop in the job file and the one on EMR? Although it worked fine with
> > a local 0.18.3, so maybe not.
> >
> > I'm going to see if I can get the stock Mahout to run with s3n inputs and
> > outputs tomorrow and I'll let you all know how that goes.
> >
> > Steve
> > --
> > Stephen Green // Stephen.Green@sun.com
> > Principal Investigator \\ http://blogs.sun.com/searchguy
> > Aura Project // Voice: +1 781-442-0926
> > Sun Microsystems Labs \\ Fax: +1 781-442-1692
> >
> >
> >
> >
Re: Mahout on Elastic MapReduce
Posted by Sean Owen <sr...@gmail.com>.
This is a fairly uninformed observation, but: the error seems to be
from Hadoop. It seems to say that it understands hdfs:, but not s3n:,
and that makes sense to me. Do we expect Hadoop understands how to
read from S3? I would expect not. (Though, you point to examples that
seem to overcome this just fine?)
When I have integrated code with stuff stored on S3, I have always had
to write extra glue code to copy from S3 to a local file system, do
work, then copy back.
On Tue, Apr 14, 2009 at 9:01 PM, Stephen Green <St...@sun.com> wrote:
>
> On Apr 14, 2009, at 2:41 PM, Stephen Green wrote:
>
>>
>> On Apr 14, 2009, at 12:51 PM, Jeff Eastman wrote:
>>
>>> Hi Stephen,
>>>
>>> You are out on the bleeding edge with EMR.
>>
>> Yeah, but the view is lovely from here!
>>
>>> I've been able to run the kmeans example directly on a small EC2 cluster
>>> that I started up myself (using the Hadoop src/contrib/ec2 scripts). I have
>>> not yet tried EMR (just got an account yesterday), but I see that it
>>> requires you to have your data in S3 as opposed to HDFS.
>>>
>>> The job first runs the InputDriver to copy the raw test data into Mahout
>>> Vector external representation after deleting any pre-existing output files.
>>> It looks to me like the two delete() snippets you show are pretty
>>> equivalent. If you have no pre-existing output directory, the Mahout snippet
>>> won't attempt to delete it.
>>
>> I managed to figure that out :-) I'm pretty comfortable with the ideas
>> behind MapReduce, but being confronted with my first Job is a bit more
>> daunting than I expected.
>>
>>> I too am at a loss to explain what you are seeing. If you can post more
>>> results I can try to help you read the tea leaves...
>>
>> I noticed that the CloudBurst job just deleted the directory without
>> checking for existence and so I tried the same thing with Mahout:
>>
>> java.lang.IllegalArgumentException: Wrong FS: s3n://mahout-output,
>> expected: hdfs://domU-12-31-38-00-6C-86.compute-1.internal:9000
>> at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:320)
>> at
>> org.apache.hadoop.dfs.DistributedFileSystem.checkPath(DistributedFileSystem.java:84)
>> at
>> org.apache.hadoop.dfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:140)
>> at
>> org.apache.hadoop.dfs.DistributedFileSystem.delete(DistributedFileSystem.java:210)
>> at
>> org.apache.mahout.clustering.syntheticcontrol.kmeans.Job.runJob(Job.java:83)
>> at
>> org.apache.mahout.clustering.syntheticcontrol.kmeans.Job.main(Job.java:46)
>>
>> So no joy there.
>>
>> Should I see if I can isolate this as an s3n problem? I suppose I could
>> try running the Hadoop job locally with it reading and writing the data from
>> S3 and see if it suffers from the same problem. At least then I could debug
>> inside Hadoop.
>>
>> Of course, I'm doing all this in Hadoop 0.18.3, and if it is an s3n
>> problem it might have been fixed already. That doesn't help much running on
>> EMR, I guess.
>>
>> I'm also going to start a run on EMR that does away with the whole
>> exists/delete check and see if that works.
>
> Following up to myself (my wife will tell you that I talk to myself!) I
> removed a number of the exists/delete checks: in CanopyClusteringJob,
> CanopyDriver, KMeansDriver, and ClusterDriver. This allowed the jobs to
> progress, but they died the death a little later with the following
> exception (and a few more, I can send the whole log if you like):
>
> java.lang.IllegalArgumentException: Wrong FS:
> s3n://mahoutput/canopies/part-00000, expected:
> hdfs://domU-12-31-39-00-A5-44.compute-1.internal:9000
> at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:320)
> at
> org.apache.hadoop.dfs.DistributedFileSystem.checkPath(DistributedFileSystem.java:84)
> at
> org.apache.hadoop.dfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:140)
> at
> org.apache.hadoop.dfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:408)
> at org.apache.hadoop.fs.FileSystem.getLength(FileSystem.java:695)
> at
> org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1420)
> at
> org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1415)
> at
> org.apache.mahout.clustering.canopy.ClusterMapper.configure(ClusterMapper.java:69)
> at
> org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58)
> at
> org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:82)
> at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:33)
> at
> org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58)
> at
> org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:82)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:223)
> at
> org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2198)
>
> Looking at the exception message there, I would almost swear that it thinks
> the whole s3n path is the name of a FS that it doesn't know about, but that
> might just be a bad message. This message repeats a few times (retrying
> failed mappers, I guess?) and then the job fails.
>
> One thing that occurred to me: the mahout examples job has the hadoop
> 0.19.1 core jar in it. Could I be seeing some kind of version skew between
> the hadoop in the job file and the one on EMR? Although it worked fine with
> a local 0.18.3, so maybe not.
>
> I'm going to see if I can get the stock Mahout to run with s3n inputs and
> outputs tomorrow and I'll let you all know how that goes.
>
> Steve
> --
> Stephen Green // Stephen.Green@sun.com
> Principal Investigator \\ http://blogs.sun.com/searchguy
> Aura Project // Voice: +1 781-442-0926
> Sun Microsystems Labs \\ Fax: +1 781-442-1692
>
>
>
>
Re: Mahout on Elastic MapReduce
Posted by Stephen Green <St...@sun.com>.
On Apr 14, 2009, at 2:41 PM, Stephen Green wrote:
>
> On Apr 14, 2009, at 12:51 PM, Jeff Eastman wrote:
>
>> Hi Stephen,
>>
>> You are out on the bleeding edge with EMR.
>
> Yeah, but the view is lovely from here!
>
>> I've been able to run the kmeans example directly on a small EC2
>> cluster that I started up myself (using the Hadoop src/contrib/ec2
>> scripts). I have not yet tried EMR (just got an account yesterday),
>> but I see that it requires you to have your data in S3 as opposed
>> to HDFS.
>>
>> The job first runs the InputDriver to copy the raw test data into
>> Mahout Vector external representation after deleting any pre-
>> existing output files. It looks to me like the two delete()
>> snippets you show are pretty equivalent. If you have no pre-
>> existing output directory, the Mahout snippet won't attempt to
>> delete it.
>
> I managed to figure that out :-) I'm pretty comfortable with the
> ideas behind MapReduce, but being confronted with my first Job is a
> bit more daunting than I expected.
>
>> I too am at a loss to explain what you are seeing. If you can post
>> more results I can try to help you read the tea leaves...
>
> I noticed that the CloudBurst job just deleted the directory without
> checking for existence and so I tried the same thing with Mahout:
>
> java.lang.IllegalArgumentException: Wrong FS: s3n://mahout-output,
> expected: hdfs://domU-12-31-38-00-6C-86.compute-1.internal:9000
>        at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:320)
>        at org.apache.hadoop.dfs.DistributedFileSystem.checkPath(DistributedFileSystem.java:84)
>        at org.apache.hadoop.dfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:140)
>        at org.apache.hadoop.dfs.DistributedFileSystem.delete(DistributedFileSystem.java:210)
>        at org.apache.mahout.clustering.syntheticcontrol.kmeans.Job.runJob(Job.java:83)
>        at org.apache.mahout.clustering.syntheticcontrol.kmeans.Job.main(Job.java:46)
>
> So no joy there.
>
> Should I see if I can isolate this as an s3n problem? I suppose I
> could try running the Hadoop job locally with it reading and writing
> the data from S3 and see if it suffers from the same problem. At
> least then I could debug inside Hadoop.
>
> Of course, I'm doing all this in Hadoop 0.18.3, and if it is an s3n
> problem it might have been fixed already. That doesn't help much
> running on EMR, I guess.
>
> I'm also going to start a run on EMR that does away with the whole
> exists/delete check and see if that works.
Following up to myself (my wife will tell you that I talk to myself!)
I removed a number of the exists/delete checks: in
CanopyClusteringJob, CanopyDriver, KMeansDriver, and ClusterDriver.
This allowed the jobs to progress, but they died the death a little
later with the following exception (and a few more, I can send the
whole log if you like):
java.lang.IllegalArgumentException: Wrong FS: s3n://mahoutput/canopies/part-00000,
expected: hdfs://domU-12-31-39-00-A5-44.compute-1.internal:9000
        at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:320)
        at org.apache.hadoop.dfs.DistributedFileSystem.checkPath(DistributedFileSystem.java:84)
        at org.apache.hadoop.dfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:140)
        at org.apache.hadoop.dfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:408)
        at org.apache.hadoop.fs.FileSystem.getLength(FileSystem.java:695)
        at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1420)
        at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1415)
        at org.apache.mahout.clustering.canopy.ClusterMapper.configure(ClusterMapper.java:69)
        at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58)
        at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:82)
        at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:33)
        at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58)
        at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:82)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:223)
        at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2198)
Looking at the exception message there, I would almost swear that it
thinks the whole s3n path is the name of a FS that it doesn't know
about, but that might just be a bad message. This message repeats a
few times (retrying failed mappers, I guess?) and then the job fails.
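Judging from the trace, the "Wrong FS" wording comes out of a URI comparison in org.apache.hadoop.fs.FileSystem.checkPath, not from a registry of known filesystems: the HDFS instance rejects any path whose scheme or authority differ from its own. A minimal, self-contained model of that kind of check (plain java.net.URI, not Hadoop code; method name and message wording follow the trace above):

```java
import java.net.URI;

public class WrongFsDemo {
    // Model of the check suggested by FileSystem.checkPath in the trace:
    // a filesystem rooted at hdfs://host:9000 rejects any path whose
    // scheme/authority differ from its own.
    static String checkPath(URI fsUri, URI path) {
        String scheme = path.getScheme();
        if (scheme == null) {
            return "ok"; // schemeless paths resolve against the default FS
        }
        if (scheme.equals(fsUri.getScheme())
                && String.valueOf(path.getAuthority())
                         .equals(String.valueOf(fsUri.getAuthority()))) {
            return "ok";
        }
        // The whole offending URI is echoed, which is why the message
        // reads as if s3n://... were the name of an unknown filesystem.
        return "Wrong FS: " + path + ", expected: " + fsUri;
    }

    public static void main(String[] args) {
        URI hdfs = URI.create("hdfs://domU-12-31-39-00-A5-44.compute-1.internal:9000");
        // s3n path handed to the HDFS instance: scheme mismatch, rejected.
        System.out.println(checkPath(hdfs, URI.create("s3n://mahoutput/canopies/part-00000")));
        // A bare path like "mahout-output" has no scheme and passes.
        System.out.println(checkPath(hdfs, URI.create("mahout-output")));
    }
}
```

That would also explain why replacing the s3n output URI with plain "mahout-output" let the job run: a schemeless path falls through to the default (HDFS) filesystem.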
One thing that occurred to me: the mahout examples job has the hadoop
0.19.1 core jar in it. Could I be seeing some kind of version skew
between the hadoop in the job file and the one on EMR? Although it
worked fine with a local 0.18.3, so maybe not.
I'm going to see if I can get the stock Mahout to run with s3n inputs
and outputs tomorrow and I'll let you all know how that goes.
Steve
--
Stephen Green // Stephen.Green@sun.com
Principal Investigator \\ http://blogs.sun.com/searchguy
Aura Project // Voice: +1 781-442-0926
Sun Microsystems Labs \\ Fax: +1 781-442-1692
Re: Mahout on Elastic MapReduce
Posted by Stephen Green <St...@sun.com>.
On Apr 14, 2009, at 12:51 PM, Jeff Eastman wrote:
> Hi Stephen,
>
> You are out on the bleeding edge with EMR.
Yeah, but the view is lovely from here!
> I've been able to run the kmeans example directly on a small EC2
> cluster that I started up myself (using the Hadoop src/contrib/ec2
> scripts). I have not yet tried EMR (just got an account yesterday),
> but I see that it requires you to have your data in S3 as opposed to
> HDFS.
>
> The job first runs the InputDriver to copy the raw test data into
> Mahout Vector external representation after deleting any pre-
> existing output files. It looks to me like the two delete() snippets
> you show are pretty equivalent. If you have no pre-existing output
> directory, the Mahout snippet won't attempt to delete it.
I managed to figure that out :-) I'm pretty comfortable with the
ideas behind MapReduce, but being confronted with my first Job is a
bit more daunting than I expected.
> I too am at a loss to explain what you are seeing. If you can post
> more results I can try to help you read the tea leaves...
I noticed that the CloudBurst job just deleted the directory without
checking for existence and so I tried the same thing with Mahout:
java.lang.IllegalArgumentException: Wrong FS: s3n://mahout-output,
expected: hdfs://domU-12-31-38-00-6C-86.compute-1.internal:9000
        at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:320)
        at org.apache.hadoop.dfs.DistributedFileSystem.checkPath(DistributedFileSystem.java:84)
        at org.apache.hadoop.dfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:140)
        at org.apache.hadoop.dfs.DistributedFileSystem.delete(DistributedFileSystem.java:210)
        at org.apache.mahout.clustering.syntheticcontrol.kmeans.Job.runJob(Job.java:83)
        at org.apache.mahout.clustering.syntheticcontrol.kmeans.Job.main(Job.java:46)
So no joy there.
Should I see if I can isolate this as an s3n problem? I suppose I
could try running the Hadoop job locally with it reading and writing
the data from S3 and see if it suffers from the same problem. At
least then I could debug inside Hadoop.
Of course, I'm doing all this in Hadoop 0.18.3, and if it is an s3n
problem it might have been fixed already. That doesn't help much
running on EMR, I guess.
I'm also going to start a run on EMR that does away with the whole
exists/delete check and see if that works.
Thanks for the help, and I'll let you know how I get on.
Steve
--
Stephen Green // Stephen.Green@sun.com
Principal Investigator \\ http://blogs.sun.com/searchguy
Aura Project // Voice: +1 781-442-0926
Sun Microsystems Labs \\ Fax: +1 781-442-1692
Re: Mahout on Elastic MapReduce
Posted by Jeff Eastman <jd...@windwardsolutions.com>.
Hi Stephen,
You are out on the bleeding edge with EMR. I've been able to run the
kmeans example directly on a small EC2 cluster that I started up myself
(using the Hadoop src/contrib/ec2 scripts). I have not yet tried EMR
(just got an account yesterday), but I see that it requires you to have
your data in S3 as opposed to HDFS.
The job first runs the InputDriver to copy the raw test data into Mahout
Vector external representation after deleting any pre-existing output
files. It looks to me like the two delete() snippets you show are pretty
equivalent. If you have no pre-existing output directory, the Mahout
snippet won't attempt to delete it.
I too am at a loss to explain what you are seeing. If you can post more
results I can try to help you read the tea leaves...
Jeff
Stephen Green wrote:
> I told some folks here at work that I would give a talk on Mahout for
> our reading group and decided that I would use it as an opportunity to
> try Amazon's Elastic MapReduce (EMR).
>
> I downloaded and untarred Hadoop 0.18.3, which is the version that
> Amazon claims they have running so that I could try things out here.
> I can start up Hadoop and successfully run a KMeans cluster on the
> synthetic control data using the instructions on the wiki and the
> following command line:
>
> bin/hadoop jar
> ~/Projects/EC2/mahout-0.1/examples/target/mahout-examples-0.1.job
> org.apache.mahout.clustering.syntheticcontrol.kmeans.Job
> input/testdata output org.apache.mahout.utils.EuclideanDistanceMeasure
> 80 55 0.5 10
>
> I realize there's a shorter invocation, but I'm trying to figure out
> what Amazon needs to run this, so I pulled the default arguments
> from the KMeans job.
>
> Now, on Amazon, you can specify a jar file that gets run with
> "bin/hadoop jar" and you also specify the arguments that will be used
> with that jar file.
>
> The trick is that the input and output data need to be in S3 buckets
> and you need to specify the locations with S3 native URIs. I used the
> command line interface to EMR to create a job like so:
>
> elastic-mapreduce -v --create --name KMeans --num-instances 1 \
> --jar s3n://mahout-code/mahout-examples-0.1.job \
> --main-class
> org.apache.mahout.clustering.syntheticcontrol.kmeans.Job \
> --arg s3n://mahout-input/testdata \
> --arg s3n://mahout-output \
> --arg org.apache.mahout.utils.EuclideanDistanceMeasure \
> --arg 80 --arg 55 --arg 0.5 --arg 10
>
> But this fails with the message: Steps completed with errors. Turns
> out you can have the EMR infrastructure dump the logs for the tasks
> and looking at the stderr for step 1 I see:
>
> java.lang.IllegalArgumentException: Wrong FS: s3n://mahout-output,
> expected: hdfs://domU-12-31-39-00-ED-51.compute-1.internal:9000
> at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:320)
> at
> org.apache.hadoop.dfs.DistributedFileSystem.checkPath(DistributedFileSystem.java:84)
>
> at
> org.apache.hadoop.dfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:140)
>
> at
> org.apache.hadoop.dfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:408)
>
> at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:667)
> at
> org.apache.mahout.clustering.syntheticcontrol.kmeans.Job.runJob(Job.java:77)
>
> at
> org.apache.mahout.clustering.syntheticcontrol.kmeans.Job.main(Job.java:44)
>
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:155)
> at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
> at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)
>
> If I replace the s3n URI for the output with just mahout-output the
> code appears to run without incident (at least the log output looks
> like the log output from my local run.) Unfortunately, the HDFS
> instance into which it's put disappears in a puff of smoke when the
> job finishes running.
>
> Now, I am by no means a Hadoop expert, but it seems like if it can
> load the data from an s3n input URI, then it probably has the right
> classes in there to do that (in fact, it looks like the jets3t jar is
> in the .job file three times!), so it seems like the KMeans job from
> mahout should be happy to use an s3n output URI, but I'm clearly
> misunderstanding something here.
>
> One of the EMR samples is a Java DNA sequence matching thing
> (CloudBurst), which seems to work fine with an s3n URI for the
> output. The setup for its output looks like the following:
>
> Path oPath = new Path(outpath);
> FileOutputFormat.setOutputPath(conf, oPath);
> System.err.println(" Removing old results");
> FileSystem.get(conf).delete(oPath);
>
> where "conf" is of type org.apache.hadoop.mapred.JobConf. This is a
> bit different than what happens in the KMeans job:
>
> Path outPath = new Path(output);
> client.setConf(conf);
> FileSystem dfs = FileSystem.get(conf);
> if (dfs.exists(outPath))
> dfs.delete(outPath, true);
>
> Trying to use the CloudBurst idiom in the KMeans job produced no joy.
> Any help would be greatly appreciated.
>
> Steve Green
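One observation on the two delete() snippets quoted above: both go through FileSystem.get(conf), which returns the cluster's default filesystem (HDFS on EMR), so an s3n:// output path fails the checkPath test no matter how the delete itself is phrased. The usual Hadoop remedy is to ask the path for its own filesystem, i.e. outPath.getFileSystem(conf), so that the s3n scheme selects the S3-native filesystem. A self-contained sketch of that dispatch rule, with java.net.URI standing in for Hadoop's Path and a String standing in for the FileSystem object (names illustrative, not Hadoop code):

```java
import java.net.URI;

public class FsDispatchDemo {
    // Illustrative stand-in for path-based filesystem lookup: choose the
    // filesystem by the path's own scheme, falling back to the default
    // (hdfs) only when the path has no scheme of its own.
    static String fileSystemFor(URI path, URI defaultFs) {
        String scheme = (path.getScheme() != null) ? path.getScheme()
                                                   : defaultFs.getScheme();
        return scheme; // a real implementation would return an FS client
    }

    public static void main(String[] args) {
        URI defaultFs = URI.create("hdfs://namenode:9000");
        // A FileSystem.get(conf)-style lookup always answers "hdfs" here;
        // dispatching on the output path answers "s3n" for an s3n URI.
        System.out.println(fileSystemFor(URI.create("s3n://mahout-output"), defaultFs)); // s3n
        System.out.println(fileSystemFor(URI.create("mahout-output"), defaultFs));       // hdfs
    }
}
```

Under that reading, the CloudBurst job succeeds on EMR not because of its delete idiom but because it resolves its filesystems from the paths it is given.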