Posted to user@mahout.apache.org by Stephen Green <St...@sun.com> on 2009/04/14 16:15:49 UTC

Mahout on Elastic MapReduce

I told some folks here at work that I would give a talk on Mahout for  
our reading group and decided that I would use it as an opportunity to  
try Amazon's Elastic MapReduce (EMR).

I downloaded and untarred Hadoop 0.18.3, which is the version that
Amazon claims to be running, so that I could try things out here.
I can start up Hadoop and successfully run KMeans clustering on the
synthetic control data using the instructions on the wiki and the
following command line:

bin/hadoop jar ~/Projects/EC2/mahout-0.1/examples/target/mahout-examples-0.1.job \
    org.apache.mahout.clustering.syntheticcontrol.kmeans.Job \
    input/testdata output org.apache.mahout.utils.EuclideanDistanceMeasure \
    80 55 0.5 10

I realize there's a shorter invocation, but I'm trying to figure out
what Amazon needs to run this, so I pulled the default arguments
from the KMeans job.
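For reference, the positional arguments appear to map as follows (my
reading of the 0.1 syntheticcontrol kmeans Job; treat the parameter
names as an assumption, not documentation):

    input/testdata                                    input directory
    output                                            output directory
    org.apache.mahout.utils.EuclideanDistanceMeasure  distance measure class
    80                                                t1 (canopy threshold)
    55                                                t2 (canopy threshold)
    0.5                                               convergence delta
    10                                                max iterations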

Now, on Amazon, you can specify a jar file that gets run with
"bin/hadoop jar" and you also specify the arguments that will be used
with that jar file.

The trick is that the input and output data need to be in S3 buckets  
and you need to specify the locations with S3 native URIs.  I used the  
command line interface to EMR to create a job like so:

elastic-mapreduce -v --create --name KMeans --num-instances 1 \
     --jar s3n://mahout-code/mahout-examples-0.1.job \
     --main-class  
org.apache.mahout.clustering.syntheticcontrol.kmeans.Job \
     --arg s3n://mahout-input/testdata \
     --arg s3n://mahout-output \
     --arg org.apache.mahout.utils.EuclideanDistanceMeasure \
     --arg 80 --arg 55 --arg 0.5 --arg 10

But this fails with the message "Steps completed with errors."  It
turns out you can have the EMR infrastructure dump the logs for the
tasks, and looking at the stderr for step 1 I see:

java.lang.IllegalArgumentException: Wrong FS: s3n://mahout-output, expected: hdfs://domU-12-31-39-00-ED-51.compute-1.internal:9000
        at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:320)
        at org.apache.hadoop.dfs.DistributedFileSystem.checkPath(DistributedFileSystem.java:84)
        at org.apache.hadoop.dfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:140)
        at org.apache.hadoop.dfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:408)
        at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:667)
        at org.apache.mahout.clustering.syntheticcontrol.kmeans.Job.runJob(Job.java:77)
        at org.apache.mahout.clustering.syntheticcontrol.kmeans.Job.main(Job.java:44)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:155)
        at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
        at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)

If I replace the s3n URI for the output with just mahout-output, the
code appears to run without incident (at least the log output looks
like the log output from my local run).  Unfortunately, the HDFS
instance into which the output is written disappears in a puff of
smoke when the job finishes running.

Now, I am by no means a Hadoop expert, but it seems like if it can  
load the data from an s3n input URI, then it probably has the right  
classes in there to do that (in fact, it looks like the jets3t jar is  
in the .job file three times!), so it seems like the KMeans job from  
mahout should be happy to use an s3n output URI, but I'm clearly  
misunderstanding something here.

One of the EMR samples is a Java DNA sequence matching application
(CloudBurst), which seems to work fine with an s3n URI for the
output.  The setup for its output looks like the following:

		Path oPath = new Path(outpath);
		FileOutputFormat.setOutputPath(conf, oPath);
		System.err.println("  Removing old results");
		FileSystem.get(conf).delete(oPath);

where "conf" is of type org.apache.hadoop.mapred.JobConf.  This is a  
bit different than what happens in the KMeans job:

     Path outPath = new Path(output);
     client.setConf(conf);
     FileSystem dfs = FileSystem.get(conf);
     if (dfs.exists(outPath))
       dfs.delete(outPath, true);

Trying to use the CloudBurst idiom in the KMeans job produced no joy.   
Any help would be greatly appreciated.
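One idea I haven't tried yet: resolving the FileSystem from the output
Path itself, so that an s3n URI selects the S3 implementation rather
than the default HDFS.  A sketch against the Hadoop 0.18 API
(Path.getFileSystem); since it's untested, treat it as a guess:

     // Resolve the FileSystem by the scheme of outPath (s3n, hdfs, ...)
     // instead of taking the configured default file system.
     Path outPath = new Path(output);
     FileSystem fs = outPath.getFileSystem(conf);
     if (fs.exists(outPath))
       fs.delete(outPath, true);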

Steve Green
-- 
Stephen Green                      //   Stephen.Green@sun.com
Principal Investigator             \\   http://blogs.sun.com/searchguy
Aura Project                       //   Voice: +1 781-442-0926
Sun Microsystems Labs              \\   Fax:   +1 781-442-1692




Re: Mahout on Elastic MapReduce

Posted by Tim Bass <ti...@gmail.com>.
FYI (cheerleading the great work of the teams):

KMeans Clustering Now Running on Elastic MapReduce

http://www.thecepblog.com/2009/04/19/kmeans-clustering-now-running-on-elastic-mapreduce/

Cheers.

On Sat, Apr 18, 2009 at 6:56 PM, Tim Bass <ti...@gmail.com> wrote:
> Also, given that there was no reply to my call for others who have
> mahout running in EC2, I think it is safe to blog
> that Stephen is the first.  So, without objection, I will write a blog
> post about this soon.
>
> Congrats again!
>
> On Sat, Apr 18, 2009 at 6:49 PM, Stephen Green <St...@sun.com> wrote:
>> Happy to. I'll do it Monday.
>>
>> Stephen Green
>> -- stephen.green@sun.com
>> http://blogs.sun.com/searchguy
>>
>>
>> On Apr 18, 2009, at 1:33, Otis Gospodnetic <ot...@yahoo.com>
>> wrote:
>>
>>>
>>> Talking about records.... Stephen, would you mind putting your
>>> notes/commands/steps on the Mahout Wiki?
>>>
>>>
>>> Thank you,
>>> Otis
>>> --
>>> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>>>
>>>
>>>
>>> ----- Original Message ----
>>>>
>>>> From: Tim Bass <ti...@gmail.com>
>>>> To: mahout-user@lucene.apache.org
>>>> Sent: Thursday, April 16, 2009 6:27:56 PM
>>>> Subject: Re: Mahout on Elastic MapReduce
>>>>
>>>> For the record.....
>>>>
>>>> Is Stephen's heroic efforts the first instance of Mahout running on EMR?
>>>>
>>>> On Fri, Apr 17, 2009 at 3:08 AM, Stephen Green wrote:
>>>>>
>>>>> On Apr 16, 2009, at 4:04 PM, Jeff Eastman wrote:
>>>>>
>>>>>> Excellent news! Please do open a Jira and tell us what worked. I
>>>>>> suspect
>>>>>> all of the clustering jobs share the same behavior, and other jobs for
>>>>>> other
>>>>>> components may as well.
>>>>>
>>>>> Will do.  I've just done a couple of searches for everywhere that
>>>>> FileSystem.get is called and I've fixed most of them (although come to
>>>>> think
>>>>> of it I did it in the 0.1 release tag.  Crap!  Things haven't changed
>>>>> that
>>>>> much, right?)
>>>>>
>>>>> I'm building and running the tests now, and I'll open a Jira with a
>>>>> patch
>>>>> once that's working.
>>>>>
>>>>>> Congratulations,
>>>>>
>>>>> As my father used to say "Brute force has an elegance all its own".
>>>>>
>>>>> Steve
>>>>> --
>>>>> Stephen Green                      //   Stephen.Green@sun.com
>>>>> Principal Investigator             \\   http://blogs.sun.com/searchguy
>>>>> Aura Project                       //   Voice: +1 781-442-0926
>>>>> Sun Microsystems Labs              \\   Fax:   +1 781-442-1692
>>>>>
>>>>>
>>>>>
>>>>>
>>>
>>
>

Re: Mahout on Elastic MapReduce

Posted by Grant Ingersoll <gs...@apache.org>.
On May 7, 2009, at 11:42 AM, Stephen Green wrote:

>
> On May 7, 2009, at 2:12 PM, Tim Bass wrote:
>
>> Is it possible to get permissions to edit that page?
>
> I didn't know that you needed such a thing, and I can't figure out  
> how to add permissions if there are any...

Tim, do you have a wiki account?  You should just need to log in.  We  
have write-by-all set up.

>
>
>> Or would you
>> prefer I ping you privately when I follow in your shoes?
>
> Please, edit away.  That's the wiki spirit!

+1

>
>>
>
>> I plan to follow your example step-by-step soon :-)
>
> Let me know if you have any trouble.
>
> Steve
> -- 
> Stephen Green                      //   Stephen.Green@sun.com
> Principal Investigator             \\   http://blogs.sun.com/searchguy
> Aura Project                       //   Voice: +1 781-442-0926
> Sun Microsystems Labs              \\   Fax:   +1 781-442-1692
>
>
>

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:
http://www.lucidimagination.com/search


Re: Mahout on Elastic MapReduce

Posted by Stephen Green <St...@sun.com>.
On May 7, 2009, at 2:12 PM, Tim Bass wrote:

> Is it possible to get permissions to edit that page?

I didn't know that you needed such a thing, and I can't figure out how  
to add permissions if there are any...

> Or would you
> prefer I ping you privately when I follow in your shoes?

Please, edit away.  That's the wiki spirit!
>

> I plan to follow your example step-by-step soon :-)

Let me know if you have any trouble.

Steve
-- 
Stephen Green                      //   Stephen.Green@sun.com
Principal Investigator             \\   http://blogs.sun.com/searchguy
Aura Project                       //   Voice: +1 781-442-0926
Sun Microsystems Labs              \\   Fax:   +1 781-442-1692




Re: Mahout on Elastic MapReduce

Posted by Tim Bass <ti...@gmail.com>.
Is it possible to get permissions to edit that page?  Or would you
prefer I ping you privately when I follow in your shoes?

I plan to follow your example step-by-step soon :-)

On Fri, May 8, 2009 at 1:02 AM, Stephen Green <St...@sun.com> wrote:
>
> On May 7, 2009, at 1:59 PM, Tim Bass wrote:
>
>> (... I found the link later in the document :-)
>>
>> "Escape Key"
>
> Dude, that totally worked!  I'll fix it anyways once the wiki comes back.
>
> Steve
> --
> Stephen Green                      //   Stephen.Green@sun.com
> Principal Investigator             \\   http://blogs.sun.com/searchguy
> Aura Project                       //   Voice: +1 781-442-0926
> Sun Microsystems Labs              \\   Fax:   +1 781-442-1692
>
>
>
>

Re: Mahout on Elastic MapReduce

Posted by Stephen Green <St...@sun.com>.
On May 7, 2009, at 1:59 PM, Tim Bass wrote:

> (... I found the link later in the document :-)
>
> "Escape Key"

Dude, that totally worked!  I'll fix it anyways once the wiki comes  
back.

Steve
-- 
Stephen Green                      //   Stephen.Green@sun.com
Principal Investigator             \\   http://blogs.sun.com/searchguy
Aura Project                       //   Voice: +1 781-442-0926
Sun Microsystems Labs              \\   Fax:   +1 781-442-1692




Re: Mahout on Elastic MapReduce

Posted by Tim Bass <ti...@gmail.com>.
(... I found the link later in the document :-)

"Escape Key"

On Fri, May 8, 2009 at 12:52 AM, Tim Bass <ti...@gmail.com> wrote:
> Hi Stephen,
>
> On the wiki, you say:
>
> "Download the Ruby command line client for EMR  ... "
>
> Do you mind wrapping the text above to a link to that Ruby client on the wiki?
>
> Thanks!
>
>>>> Happy to. I'll do it Monday.
>>>
>>> But I didn't say *which* Monday, though did I!  Hah!
>>>
>>> Sorry for the delay.  I added a page to the Wiki this afternoon:
>>>
>>>
>>> http://cwiki.apache.org/confluence/display/MAHOUT/Mahout+on+Elastic+MapReduce
>>>
>>> As I was writing, I re-tried the examples to make sure that the
>>> invocations on the page work, so I think everything's in order.
>>>
>>> If someone who has the permissions necessary to delete attachments could
>>> delete the extra graphics, I'd appreciate it.
>>>
>>>> On Apr 18, 2009, at 1:33, Otis Gospodnetic <ot...@yahoo.com>
>>>> wrote:
>>>>
>>>>>
>>>>> Talking about records.... Stephen, would you mind putting your
>>>>> notes/commands/steps on the Mahout Wiki?
>>>>>
>>>>>
>>>>> Thank you,
>>>>> Otis
>>>>> --
>>>
>>>
>>> Steve
>>> --
>>> Stephen Green                      //   Stephen.Green@sun.com
>>> Principal Investigator             \\   http://blogs.sun.com/searchguy
>>> Aura Project                       //   Voice: +1 781-442-0926
>>> Sun Microsystems Labs              \\   Fax:   +1 781-442-1692
>>>
>>>
>>>
>>
>> --------------------------
>> Grant Ingersoll
>> http://www.lucidimagination.com/
>>
>> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using
>> Solr/Lucene:
>> http://www.lucidimagination.com/search
>>
>>
>

Re: Mahout on Elastic MapReduce

Posted by Tim Bass <ti...@gmail.com>.
Also, given that there was no reply to my call for others who have
mahout running in EC2, I think it is safe to blog
that Stephen is the first.  So, without objection, I will write a blog
post about this soon.

Congrats again!

On Sat, Apr 18, 2009 at 6:49 PM, Stephen Green <St...@sun.com> wrote:
> Happy to. I'll do it Monday.
>
> Stephen Green
> -- stephen.green@sun.com
> http://blogs.sun.com/searchguy
>
>
> On Apr 18, 2009, at 1:33, Otis Gospodnetic <ot...@yahoo.com>
> wrote:
>
>>
>> Talking about records.... Stephen, would you mind putting your
>> notes/commands/steps on the Mahout Wiki?
>>
>>
>> Thank you,
>> Otis
>> --
>> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>>
>>
>>
>> ----- Original Message ----
>>>
>>> From: Tim Bass <ti...@gmail.com>
>>> To: mahout-user@lucene.apache.org
>>> Sent: Thursday, April 16, 2009 6:27:56 PM
>>> Subject: Re: Mahout on Elastic MapReduce
>>>
>>> For the record.....
>>>
>>> Is Stephen's heroic efforts the first instance of Mahout running on EMR?
>>>
>>> On Fri, Apr 17, 2009 at 3:08 AM, Stephen Green wrote:
>>>>
>>>> On Apr 16, 2009, at 4:04 PM, Jeff Eastman wrote:
>>>>
>>>>> Excellent news! Please do open a Jira and tell us what worked. I
>>>>> suspect
>>>>> all of the clustering jobs share the same behavior, and other jobs for
>>>>> other
>>>>> components may as well.
>>>>
>>>> Will do.  I've just done a couple of searches for everywhere that
>>>> FileSystem.get is called and I've fixed most of them (although come to
>>>> think
>>>> of it I did it in the 0.1 release tag.  Crap!  Things haven't changed
>>>> that
>>>> much, right?)
>>>>
>>>> I'm building and running the tests now, and I'll open a Jira with a
>>>> patch
>>>> once that's working.
>>>>
>>>>> Congratulations,
>>>>
>>>> As my father used to say "Brute force has an elegance all its own".
>>>>
>>>> Steve
>>>> --
>>>> Stephen Green                      //   Stephen.Green@sun.com
>>>> Principal Investigator             \\   http://blogs.sun.com/searchguy
>>>> Aura Project                       //   Voice: +1 781-442-0926
>>>> Sun Microsystems Labs              \\   Fax:   +1 781-442-1692
>>>>
>>>>
>>>>
>>>>
>>
>

Re: Mahout on Elastic MapReduce

Posted by Tim Bass <ti...@gmail.com>.
Hi Stephen,

On the wiki, you say:

"Download the Ruby command line client for EMR  ... "

Do you mind wrapping the text above in a link to that Ruby client on the wiki?

Thanks!

>>> Happy to. I'll do it Monday.
>>
>> But I didn't say *which* Monday, though did I!  Hah!
>>
>> Sorry for the delay.  I added a page to the Wiki this afternoon:
>>
>>
>> http://cwiki.apache.org/confluence/display/MAHOUT/Mahout+on+Elastic+MapReduce
>>
>> As I was writing, I re-tried the examples to make sure that the
>> invocations on the page work, so I think everything's in order.
>>
>> If someone who has the permissions necessary to delete attachments could
>> delete the extra graphics, I'd appreciate it.
>>
>>> On Apr 18, 2009, at 1:33, Otis Gospodnetic <ot...@yahoo.com>
>>> wrote:
>>>
>>>>
>>>> Talking about records.... Stephen, would you mind putting your
>>>> notes/commands/steps on the Mahout Wiki?
>>>>
>>>>
>>>> Thank you,
>>>> Otis
>>>> --
>>
>>
>> Steve
>> --
>> Stephen Green                      //   Stephen.Green@sun.com
>> Principal Investigator             \\   http://blogs.sun.com/searchguy
>> Aura Project                       //   Voice: +1 781-442-0926
>> Sun Microsystems Labs              \\   Fax:   +1 781-442-1692
>>
>>
>>
>
> --------------------------
> Grant Ingersoll
> http://www.lucidimagination.com/
>
> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using
> Solr/Lucene:
> http://www.lucidimagination.com/search
>
>

Re: Mahout on Elastic MapReduce

Posted by Grant Ingersoll <gs...@apache.org>.
Very cool, Stephen!  I'll try to delete the attachments sometime this  
week.

On May 1, 2009, at 2:42 PM, Stephen Green wrote:

> On Apr 18, 2009, at 7:49 AM, Stephen Green wrote:
>
>> Happy to. I'll do it Monday.
>
> But I didn't say *which* Monday, though did I!  Hah!
>
> Sorry for the delay.  I added a page to the Wiki this afternoon:
>
> http://cwiki.apache.org/confluence/display/MAHOUT/Mahout+on+Elastic+MapReduce
>
> As I was writing, I re-tried the examples to make sure that the  
> invocations on the page work, so I think everything's in order.
>
> If someone who has the permissions necessary to delete attachments  
> could delete the extra graphics, I'd appreciate it.
>
>> On Apr 18, 2009, at 1:33, Otis Gospodnetic <otis_gospodnetic@yahoo.com 
>> > wrote:
>>
>>>
>>> Talking about records.... Stephen, would you mind putting your  
>>> notes/commands/steps on the Mahout Wiki?
>>>
>>>
>>> Thank you,
>>> Otis
>>> --
>
>
> Steve
> -- 
> Stephen Green                      //   Stephen.Green@sun.com
> Principal Investigator             \\   http://blogs.sun.com/searchguy
> Aura Project                       //   Voice: +1 781-442-0926
> Sun Microsystems Labs              \\   Fax:   +1 781-442-1692
>
>
>

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:
http://www.lucidimagination.com/search


Re: Mahout on Elastic MapReduce

Posted by Stephen Green <St...@sun.com>.
On Apr 18, 2009, at 7:49 AM, Stephen Green wrote:

> Happy to. I'll do it Monday.

But I didn't say *which* Monday, now did I!  Hah!

Sorry for the delay.  I added a page to the Wiki this afternoon:

http://cwiki.apache.org/confluence/display/MAHOUT/Mahout+on+Elastic+MapReduce

As I was writing, I re-tried the examples to make sure that the  
invocations on the page work, so I think everything's in order.

If someone who has the permissions necessary to delete attachments  
could delete the extra graphics, I'd appreciate it.

> On Apr 18, 2009, at 1:33, Otis Gospodnetic  
> <ot...@yahoo.com> wrote:
>
>>
>> Talking about records.... Stephen, would you mind putting your  
>> notes/commands/steps on the Mahout Wiki?
>>
>>
>> Thank you,
>> Otis
>> --


Steve
-- 
Stephen Green                      //   Stephen.Green@sun.com
Principal Investigator             \\   http://blogs.sun.com/searchguy
Aura Project                       //   Voice: +1 781-442-0926
Sun Microsystems Labs              \\   Fax:   +1 781-442-1692




Re: Mahout on Elastic MapReduce

Posted by Stephen Green <St...@sun.com>.
Happy to. I'll do it Monday.

Stephen Green
--  
stephen.green@sun.com
http://blogs.sun.com/searchguy


On Apr 18, 2009, at 1:33, Otis Gospodnetic  
<ot...@yahoo.com> wrote:

>
> Talking about records.... Stephen, would you mind putting your notes/ 
> commands/steps on the Mahout Wiki?
>
>
> Thank you,
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
>
>
> ----- Original Message ----
>> From: Tim Bass <ti...@gmail.com>
>> To: mahout-user@lucene.apache.org
>> Sent: Thursday, April 16, 2009 6:27:56 PM
>> Subject: Re: Mahout on Elastic MapReduce
>>
>> For the record.....
>>
>> Is Stephen's heroic efforts the first instance of Mahout running on  
>> EMR?
>>
>> On Fri, Apr 17, 2009 at 3:08 AM, Stephen Green wrote:
>>>
>>> On Apr 16, 2009, at 4:04 PM, Jeff Eastman wrote:
>>>
>>>> Excellent news! Please do open a Jira and tell us what worked. I  
>>>> suspect
>>>> all of the clustering jobs share the same behavior, and other  
>>>> jobs for other
>>>> components may as well.
>>>
>>> Will do.  I've just done a couple of searches for everywhere that
>>> FileSystem.get is called and I've fixed most of them (although  
>>> come to think
>>> of it I did it in the 0.1 release tag.  Crap!  Things haven't  
>>> changed that
>>> much, right?)
>>>
>>> I'm building and running the tests now, and I'll open a Jira with  
>>> a patch
>>> once that's working.
>>>
>>>> Congratulations,
>>>
>>> As my father used to say "Brute force has an elegance all its own".
>>>
>>> Steve
>>> --
>>> Stephen Green                      //   Stephen.Green@sun.com
>>> Principal Investigator             \\   http://blogs.sun.com/searchguy
>>> Aura Project                       //   Voice: +1 781-442-0926
>>> Sun Microsystems Labs              \\   Fax:   +1 781-442-1692
>>>
>>>
>>>
>>>
>

Re: Mahout on Elastic MapReduce

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Talking about records.... Stephen, would you mind putting your notes/commands/steps on the Mahout Wiki?


Thank you,
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



----- Original Message ----
> From: Tim Bass <ti...@gmail.com>
> To: mahout-user@lucene.apache.org
> Sent: Thursday, April 16, 2009 6:27:56 PM
> Subject: Re: Mahout on Elastic MapReduce
> 
> For the record.....
> 
> Is Stephen's heroic efforts the first instance of Mahout running on EMR?
> 
> On Fri, Apr 17, 2009 at 3:08 AM, Stephen Green wrote:
> >
> > On Apr 16, 2009, at 4:04 PM, Jeff Eastman wrote:
> >
> >> Excellent news! Please do open a Jira and tell us what worked. I suspect
> >> all of the clustering jobs share the same behavior, and other jobs for other
> >> components may as well.
> >
> > Will do.  I've just done a couple of searches for everywhere that
> > FileSystem.get is called and I've fixed most of them (although come to think
> > of it I did it in the 0.1 release tag.  Crap!  Things haven't changed that
> > much, right?)
> >
> > I'm building and running the tests now, and I'll open a Jira with a patch
> > once that's working.
> >
> >> Congratulations,
> >
> > As my father used to say "Brute force has an elegance all its own".
> >
> > Steve
> > --
> > Stephen Green                      //   Stephen.Green@sun.com
> > Principal Investigator             \\   http://blogs.sun.com/searchguy
> > Aura Project                       //   Voice: +1 781-442-0926
> > Sun Microsystems Labs              \\   Fax:   +1 781-442-1692
> >
> >
> >
> >


Re: Mahout on Elastic MapReduce

Posted by Stephen Green <St...@sun.com>.
On Apr 16, 2009, at 11:30 AM, Tim Bass wrote:

> This has been very interesting.  Thank you for sharing this, Stephen.

No worries.  Fortunately, it's one of those weeks where I'm enjoying  
being down in the weeds :-) And there's been pretty steady progress  
thanks to you guys and the folks at Amazon.


> Your posts on EMR and Mahout
> are motivating me to get off the golf course and badminton courts and
> re-energize my efforts with EC2 and EMR.

It does seem like it would be worthwhile to be able to easily launch  
a short-term MR job that would (for example) recompute user-user or  
item-item similarity using Taste.

Steve
-- 
Stephen Green                      //   Stephen.Green@sun.com
Principal Investigator             \\   http://blogs.sun.com/searchguy
Aura Project                       //   Voice: +1 781-442-0926
Sun Microsystems Labs              \\   Fax:   +1 781-442-1692




Re: Mahout on Elastic MapReduce

Posted by Tim Bass <ti...@gmail.com>.
This has been very interesting.  Thank you for sharing this, Stephen.
Your posts on EMR and Mahout are motivating me to get off the golf
course and badminton courts and re-energize my efforts with EC2 and
EMR.

When we advance this a bit further, it would be great to use Amazon
Simple Message Service to move
data to S3 from live systems.

I apologize for lurking and being lazy.  I need to get off my "you
know what" and contribute more.

On Thu, Apr 16, 2009 at 10:13 PM, Stephen Green <St...@sun.com> wrote:
>
> On Apr 16, 2009, at 11:10 AM, Jeff Eastman wrote:
>
>> Hi Stephen,
>>
>> It looks to me like you are on the right track. The original kMeans code
>> and job patterns were written over a year ago, probably on a version of
>> Hadoop 10 or 11 IIRC. They have made significant changes to the file system
>> in the interim and nobody - except you - has tried to run kMeans on EMR.
>
> A man, alone... This would make a good movie!
>
>> It certainly seems like dfs.exists(outPath) should be false if you have no
>> outPath. You have a sharp machete and are making good progress breaking a
>> jungle trail to EMR.
>
> I'll see if I can get the s3n thing working locally and then return to spend
> a few more dimes at Amazon.  Once I figure it out, I'll add an issue to Jira
> and see if I can work up a patch for mahout-dev.
>
>> If you'd like to chat on the phone or Skype, please contact me directly
>> (jeff at windwardsolutions dot com).
>
> Thanks!
>
> Steve
> --
> Stephen Green                      //   Stephen.Green@sun.com
> Principal Investigator             \\   http://blogs.sun.com/searchguy
> Aura Project                       //   Voice: +1 781-442-0926
> Sun Microsystems Labs              \\   Fax:   +1 781-442-1692
>
>
>
>

Re: Mahout on Elastic MapReduce

Posted by Tim Bass <ti...@gmail.com>.
For the record.....

Are Stephen's heroic efforts the first instance of Mahout running on EMR?

On Fri, Apr 17, 2009 at 3:08 AM, Stephen Green <St...@sun.com> wrote:
>
> On Apr 16, 2009, at 4:04 PM, Jeff Eastman wrote:
>
>> Excellent news! Please do open a Jira and tell us what worked. I suspect
>> all of the clustering jobs share the same behavior, and other jobs for other
>> components may as well.
>
> Will do.  I've just done a couple of searches for everywhere that
> FileSystem.get is called and I've fixed most of them (although come to think
> of it I did it in the 0.1 release tag.  Crap!  Things haven't changed that
> much, right?)
>
> I'm building and running the tests now, and I'll open a Jira with a patch
> once that's working.
>
>> Congratulations,
>
> As my father used to say "Brute force has an elegance all its own".
>
> Steve
> --
> Stephen Green                      //   Stephen.Green@sun.com
> Principal Investigator             \\   http://blogs.sun.com/searchguy
> Aura Project                       //   Voice: +1 781-442-0926
> Sun Microsystems Labs              \\   Fax:   +1 781-442-1692
>
>
>
>

Re: Mahout on Elastic MapReduce

Posted by Stephen Green <St...@sun.com>.
On Apr 16, 2009, at 4:04 PM, Jeff Eastman wrote:

> Excellent news! Please do open a Jira and tell us what worked. I  
> suspect all of the clustering jobs share the same behavior, and  
> other jobs for other components may as well.

Will do.  I've just done a couple of searches for everywhere that  
FileSystem.get is called and I've fixed most of them (although come to  
think of it I did it in the 0.1 release tag.  Crap!  Things haven't  
changed that much, right?)

I'm building and running the tests now, and I'll open a Jira with a  
patch once that's working.

> Congratulations,

As my father used to say "Brute force has an elegance all its own".

Steve
-- 
Stephen Green                      //   Stephen.Green@sun.com
Principal Investigator             \\   http://blogs.sun.com/searchguy
Aura Project                       //   Voice: +1 781-442-0926
Sun Microsystems Labs              \\   Fax:   +1 781-442-1692




Re: Mahout on Elastic MapReduce

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
Excellent news! Please do open a Jira and tell us what worked. I suspect 
all of the clustering jobs share the same behavior, and other jobs for 
other components may as well.

Congratulations,
Jeff


Stephen Green wrote:
> Good news.  I just had a successful run on EMR of the KMeans 
> clustering of the synthetic control data.
>
> Shall I add an issue to Jira for the problems that I encountered and 
> see if I can get together a patch?
>
> Steve


Re: Mahout on Elastic MapReduce

Posted by Stephen Green <St...@sun.com>.
Good news.  I just had a successful run on EMR of the KMeans  
clustering of the synthetic control data.

Shall I add an issue to Jira for the problems that I encountered and  
see if I can get together a patch?

Steve
-- 
Stephen Green                      //   Stephen.Green@sun.com
Principal Investigator             \\   http://blogs.sun.com/searchguy
Aura Project                       //   Voice: +1 781-442-0926
Sun Microsystems Labs              \\   Fax:   +1 781-442-1692




Re: Mahout on Elastic MapReduce

Posted by Stephen Green <St...@sun.com>.
On Apr 16, 2009, at 10:27 AM, Stephen Green wrote:

> java.lang.NullPointerException
>        at org.apache.hadoop.fs.s3native.NativeS3FileSystem.delete(NativeS3FileSystem.java:310)
>        at org.apache.mahout.clustering.syntheticcontrol.kmeans.Job.runJob(Job.java:83)
>        at org.apache.mahout.clustering.syntheticcontrol.kmeans.Job.main(Job.java:45)
>        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>        at java.lang.reflect.Method.invoke(Method.java:597)
>        at org.apache.hadoop.util.RunJar.main(RunJar.java:155)
>        at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
>        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
>        at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)
>
> (The line numbers in kmeans.Job are weird because I added logging.)
>
> If the Hadoop on EMR is really 0.18.3, then the null pointer here is  
> the store in the NativeS3FileSystem.  But there's another problem:   
> I deleted the output path before I started the run, so the existence  
> check should have failed and dfs.delete never should have been  
> called.  I added a bit of logging to the KMeans job and here's what  
> it says about the output path:
>


OK, I figured this one out.  I gave it the URI s3n://mahout-output/
as the output directory.  This is a problem because the URI parsing
code interprets mahout-output as a host name in the URI, which means
that around NativeS3FileSystem:319 it gets the key "" from pathToKey.
That empty key apparently indicates the root directory, which is
always supposed to exist, and so getFileStatus returns a non-null
result and the exists call answers true.

Using a subdirectory on the URI (s3n://mahout-output/kmeans) gets the  
key kmeans, which moves things along a little farther.

This is a weird disconnect in the pathToKey code, I think.
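The host-versus-key split is easy to see with plain java.net.URI,
which is essentially what the s3n Path handling boils down to.  Here's
a small self-contained illustration (my own paraphrase of what
pathToKey does, not the Hadoop code itself):

```java
import java.net.URI;

public class S3nKeyDemo {
    // Paraphrase of s3n path handling: the bucket is the URI authority,
    // and the key is the URI path with its leading '/' stripped.  An
    // empty key is what NativeS3FileSystem treats as the bucket root.
    public static String bucket(String uri) {
        return URI.create(uri).getHost();
    }

    public static String key(String uri) {
        String path = URI.create(uri).getPath();
        return path.startsWith("/") ? path.substring(1) : path;
    }

    public static void main(String[] args) {
        // s3n://mahout-output/ -> key "", i.e. the root, which always
        // "exists" -- so the KMeans exists()/delete() logic misfires.
        System.out.println("'" + key("s3n://mahout-output/") + "'");
        // s3n://mahout-output/kmeans -> key "kmeans", a normal object.
        System.out.println("'" + key("s3n://mahout-output/kmeans") + "'");
    }
}
```

Which is why tacking a subdirectory onto the bucket URI is enough to
dodge the root-directory special case.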

Steve
--
Stephen Green                      //   Stephen.Green@sun.com
Principal Investigator             \\   http://blogs.sun.com/searchguy
Aura Project                       //   Voice: +1 781-442-0926
Sun Microsystems Labs              \\   Fax:   +1 781-442-1692




Re: Mahout on Elastic MapReduce

Posted by Stephen Green <St...@sun.com>.
On Apr 16, 2009, at 11:10 AM, Jeff Eastman wrote:

> Hi Stephen,
>
> It looks to me like you are on the right track. The original kMeans  
> code and job patterns were written over a year ago, probably on a  
> version of Hadoop 10 or 11 IIRC. They have made significant changes  
> to the file system in the interim and nobody - except you - has  
> tried to run kMeans on EMR.

A man, alone... This would make a good movie!

> It certainly seems like dfs.exists(outPath) should be false if you  
> have no outPath. You have a sharp machete and are making good  
> progress breaking a jungle trail to EMR.

I'll see if I can get the s3n thing working locally and then return to  
spend a few more dimes at Amazon.  Once I figure it out, I'll add an  
issue to Jira and see if I can work up a patch for mahout-dev.

> If you'd like to chat on the phone or Skype, please contact me  
> directly (jeff at windwardsolutions dot com).

Thanks!

Steve
-- 
Stephen Green                      //   Stephen.Green@sun.com
Principal Investigator             \\   http://blogs.sun.com/searchguy
Aura Project                       //   Voice: +1 781-442-0926
Sun Microsystems Labs              \\   Fax:   +1 781-442-1692




Re: Mahout on Elastic MapReduce

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
Hi Stephen,

It looks to me like you are on the right track. The original kMeans code 
and job patterns were written over a year ago, probably on a version of 
Hadoop 10 or 11 IIRC. They have made significant changes to the file 
system in the interim and nobody - except you - has tried to run kMeans 
on EMR.

The logic about using the incorrect file system method is sound, and 
your fix seems like it should work. I don't expect the hadoop version 
differences to impact you since kMeans has not been updated recently to 
take advantage of hadoop improvements.

It certainly seems like dfs.exists(outPath) should be false if you have 
no outPath. You have a sharp machete and are making good progress 
breaking a jungle trail to EMR. If you'd like to chat on the phone or 
Skype, please contact me directly (jeff at windwardsolutions dot com).

Jeff


Stephen Green wrote:
> A bit more progress.  I asked about this problem on Amazon's EMR 
> forums.  Here's the thread:
>
> http://developer.amazonwebservices.com/connect/thread.jspa?threadID=30945
>
> The answer from Amazon was:
>
>> This appears to be an issue with Mahout. This exception is fairly 
>> common and matches the pattern of "Wrong FS: s3n://*/, expected: 
>> hdfs://*:9000". This occurs when you try and use an S3N path with 
>> HDFS. Typically this occurs because the code asks for the wrong 
>> FileSystem.
>>
>> This could happen because a developer used the wrong static method on 
>> Hadoop's FileSystem class:
>>
>> http://hadoop.apache.org/core/docs/r0.18.3/api/org/apache/hadoop/fs/FileSystem.html 
>>
>>
>> If you call FileSystem.get(Configuration conf) you'll get an instance 
>> of the cluster's default file system, which in our case is HDFS. 
>> Instead, if you have a URI and want a reference to the FileSystem 
>> that URI points to, you should call the method FileSystem.get(URI 
>> uri, Configuration conf).
>>
>
> He offered a solution that involved using DistCp to copy data from S3 
> to HDFS and then back again, but since I have the Mahout source, I 
> decided to pursue things a bit further.  I went into the source and 
> modified the places where the filesystem is fetched to do the following:
>
>     FileSystem dfs = FileSystem.get(outPath.toUri(), conf);
>
> (there were 3 places that I changed it, but I expect there are more 
> lying around.)  This is the idiom used by the CloudBurst example on EMR.
>
> Making this change fixes the exception that I was getting, but I'm now 
> getting a different exception:
>
> java.lang.NullPointerException
>         at org.apache.hadoop.fs.s3native.NativeS3FileSystem.delete(NativeS3FileSystem.java:310)
>         at org.apache.mahout.clustering.syntheticcontrol.kmeans.Job.runJob(Job.java:83)
>         at org.apache.mahout.clustering.syntheticcontrol.kmeans.Job.main(Job.java:45)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>         at java.lang.reflect.Method.invoke(Method.java:597)
>         at org.apache.hadoop.util.RunJar.main(RunJar.java:155)
>         at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
>         at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)
>
> (The line numbers in kmeans.Job are weird because I added logging.)
>
> If the Hadoop on EMR is really 0.18.3, then the null pointer here is 
> the store in the NativeS3FileSystem.  But there's another problem:  I 
> deleted the output path before I started the run, so the existence 
> check should have failed and dfs.delete never should have been 
> called.  I added a bit of logging to the KMeans job and here's what it 
> says about the output path:
>
> 2009-04-16 14:04:35,757 INFO org.apache.mahout.clustering.syntheticcontrol.kmeans.Job (main): dfs: class org.apache.hadoop.fs.s3native.NativeS3FileSystem
>
> So it got the right output file system type.
>
> 2009-04-16 14:04:35,758 INFO org.apache.mahout.clustering.syntheticcontrol.kmeans.Job (main): s3n://mahout-output/ exists: true
>
> Shouldn't dfs.exists(outPath) fail for a non-existent path?  And 
> didn't the store have to exist (i.e., be non-null) for it to figure 
> this out?  I guess this really is starting to verge into base hadoop 
> territory.
>
> I'm rapidly getting to the point where I need to solve this one just 
> to prove to myself that I can get it to run!
>
> Steve


Re: Mahout on Elastic MapReduce

Posted by Stephen Green <St...@sun.com>.
On Apr 16, 2009, at 10:27 AM, Stephen Green wrote:

> If the Hadoop on EMR is really 0.18.3, then the null pointer here is  
> the store in the NativeS3FileSystem.

Perhaps it's not really 0.18.3.  I decided to finally try a local run  
with s3n URIs for the input and output and got a very similar error:

java.lang.NullPointerException
	at org.apache.hadoop.fs.s3native.NativeS3FileSystem.delete(NativeS3FileSystem.java:303)
	at org.apache.mahout.clustering.syntheticcontrol.kmeans.Job.runJob(Job.java:83)
	at org.apache.mahout.clustering.syntheticcontrol.kmeans.Job.main(Job.java:45)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at org.apache.hadoop.util.RunJar.main(RunJar.java:155)
	at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
	at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)

But at a different spot.

Steve
-- 
Stephen Green                      //   Stephen.Green@sun.com
Principal Investigator             \\   http://blogs.sun.com/searchguy
Aura Project                       //   Voice: +1 781-442-0926
Sun Microsystems Labs              \\   Fax:   +1 781-442-1692




Re: Mahout on Elastic MapReduce

Posted by Stephen Green <St...@sun.com>.
A bit more progress.  I asked about this problem on Amazon's EMR  
forums.  Here's the thread:

http://developer.amazonwebservices.com/connect/thread.jspa?threadID=30945

The answer from Amazon was:

> This appears to be an issue with Mahout. This exception is fairly  
> common and matches the pattern of "Wrong FS: s3n://*/, expected:  
> hdfs://*:9000". This occurs when you try and use an S3N path with  
> HDFS. Typically this occurs because the code asks for the wrong  
> FileSystem.
>
> This could happen because a developer used the wrong static method  
> on Hadoop's FileSystem class:
>
> http://hadoop.apache.org/core/docs/r0.18.3/api/org/apache/hadoop/fs/FileSystem.html
>
> If you call FileSystem.get(Configuration conf) you'll get an  
> instance of the cluster's default file system, which in our case is  
> HDFS. Instead, if you have a URI and want a reference to the  
> FileSystem that URI points to, you should call the method  
> FileSystem.get(URI uri, Configuration conf).
>

He offered a solution that involved using DistCp to copy data from S3  
to HDFS and then back again, but since I have the Mahout source, I  
decided to pursue things a bit further.  I went into the source and  
modified the places where the filesystem is fetched to do the following:

     FileSystem dfs = FileSystem.get(outPath.toUri(), conf);

(there were 3 places that I changed it, but I expect there are more  
lying around.)  This is the idiom used by the CloudBurst example on EMR.
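For what it's worth, the difference between the two get methods is
just scheme dispatch.  Here's a toy registry (the class and names are
mine, not Hadoop's) showing why the no-URI form always hands back the
cluster default while the URI form picks the filesystem the path
actually lives on:

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

public class FsLookupSketch {
    // Toy stand-in for Hadoop's filesystem registry; purely illustrative.
    static final Map<String, String> SCHEMES = new HashMap<>();
    static {
        SCHEMES.put("hdfs", "DistributedFileSystem");
        SCHEMES.put("s3n", "NativeS3FileSystem");
    }
    static final String DEFAULT_SCHEME = "hdfs"; // the cluster default on EMR

    // Analogue of FileSystem.get(Configuration): always the cluster default.
    static String get() {
        return SCHEMES.get(DEFAULT_SCHEME);
    }

    // Analogue of FileSystem.get(URI, Configuration): dispatch on the scheme.
    static String get(URI uri) {
        return SCHEMES.getOrDefault(uri.getScheme(), SCHEMES.get(DEFAULT_SCHEME));
    }

    public static void main(String[] args) {
        URI outPath = URI.create("s3n://mahout-output/kmeans");
        System.out.println(get());         // DistributedFileSystem -- wrong FS for outPath
        System.out.println(get(outPath));  // NativeS3FileSystem
    }
}
```

With the default lookup, the HDFS instance then chokes on the s3n path
in checkPath, which is exactly the "Wrong FS" exception above.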

Making this change fixes the exception that I was getting, but I'm now  
getting a different exception:

java.lang.NullPointerException
         at org.apache.hadoop.fs.s3native.NativeS3FileSystem.delete(NativeS3FileSystem.java:310)
         at org.apache.mahout.clustering.syntheticcontrol.kmeans.Job.runJob(Job.java:83)
         at org.apache.mahout.clustering.syntheticcontrol.kmeans.Job.main(Job.java:45)
         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
         at java.lang.reflect.Method.invoke(Method.java:597)
         at org.apache.hadoop.util.RunJar.main(RunJar.java:155)
         at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
         at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)

(The line numbers in kmeans.Job are weird because I added logging.)

If the Hadoop on EMR is really 0.18.3, then the null pointer here is  
the store in the NativeS3FileSystem.  But there's another problem:  I  
deleted the output path before I started the run, so the existence  
check should have failed and dfs.delete never should have been  
called.  I added a bit of logging to the KMeans job and here's what it  
says about the output path:

2009-04-16 14:04:35,757 INFO org.apache.mahout.clustering.syntheticcontrol.kmeans.Job (main): dfs: class org.apache.hadoop.fs.s3native.NativeS3FileSystem

So it got the right output file system type.

2009-04-16 14:04:35,758 INFO org.apache.mahout.clustering.syntheticcontrol.kmeans.Job (main): s3n://mahout-output/ exists: true

Shouldn't dfs.exists(outPath) fail for a non-existent path?  And  
didn't the store have to exist (i.e., be non-null) for it to figure  
this out?  I guess this really is starting to verge into base hadoop  
territory.

I'm rapidly getting to the point where I need to solve this one just  
to prove to myself that I can get it to run!

Steve
-- 
Stephen Green                      //   Stephen.Green@sun.com
Principal Investigator             \\   http://blogs.sun.com/searchguy
Aura Project                       //   Voice: +1 781-442-0926
Sun Microsystems Labs              \\   Fax:   +1 781-442-1692




Re: Mahout on Elastic MapReduce

Posted by Stephen Green <St...@sun.com>.
On Apr 15, 2009, at 3:28 PM, Stephen Green wrote:

>
> On Apr 15, 2009, at 3:16 PM, Grant Ingersoll wrote:
>
>>
>> On Apr 15, 2009, at 2:54 PM, Stephen Green wrote:
>>
>>>
>>> On Apr 15, 2009, at 2:19 PM, Jeff Eastman wrote:
>>>
>>>> None of the clustering implementations hard code the filesystem.  
>>>> The file names are constructed from the input and output filepath  
>>>> arguments.
>>>
>>> This is what I saw as well.
>>>
>>>> Grant Ingersoll wrote:
>>>>> I seem to recall this being something you have to set in your  
>>>>> Hadoop configuration.  Or, let me double check that we aren't  
>>>>> hard-coding the FS in our Job.
>>>
>>> I'm pretty sure that the configuration on EMR allows this, as the  
>>> job has no problem reading from the s3n input data.
>>>
>>
>> Can you share your Hadoop configuration?  i.e. the contents of the  
>> conf directory?  You might actually have some luck asking this  
>> question on core-users over at Hadoop.
>
> Well, Amazon controls the configuration, but I could probably log  
> into the master while it's running and pull it out.

Not sure it'll help, but I've attached the config file for the Hadoop  
that runs on EMR.  I've redacted my AWS credentials, but they were  
correct in the original.  I also pulled the rest of the conf  
directory, if you want to have a look at that.

Steve
-- 
Stephen Green                      //   Stephen.Green@sun.com
Principal Investigator             \\   http://blogs.sun.com/searchguy
Aura Project                       //   Voice: +1 781-442-0926
Sun Microsystems Labs              \\   Fax:   +1 781-442-1692



Re: Mahout on Elastic MapReduce

Posted by Stephen Green <St...@sun.com>.
On Apr 15, 2009, at 3:16 PM, Grant Ingersoll wrote:

>
> On Apr 15, 2009, at 2:54 PM, Stephen Green wrote:
>
>>
>> On Apr 15, 2009, at 2:19 PM, Jeff Eastman wrote:
>>
>>> None of the clustering implementations hard code the filesystem.  
>>> The file names are constructed from the input and output filepath  
>>> arguments.
>>
>> This is what I saw as well.
>>
>>> Grant Ingersoll wrote:
>>>> I seem to recall this being something you have to set in your  
>>>> Hadoop configuration.  Or, let me double check that we aren't  
>>>> hard-coding the FS in our Job.
>>
>> I'm pretty sure that the configuration on EMR allows this, as the  
>> job has no problem reading from the s3n input data.
>>
>
> Can you share your Hadoop configuration?  i.e. the contents of the  
> conf directory?  You might actually have some luck asking this  
> question on core-users over at Hadoop.

Well, Amazon controls the configuration, but I could probably log into  
the master while it's running and pull it out.

Steve
-- 
Stephen Green                      //   Stephen.Green@sun.com
Principal Investigator             \\   http://blogs.sun.com/searchguy
Aura Project                       //   Voice: +1 781-442-0926
Sun Microsystems Labs              \\   Fax:   +1 781-442-1692




Re: Mahout on Elastic MapReduce

Posted by Grant Ingersoll <gs...@apache.org>.
On Apr 15, 2009, at 2:54 PM, Stephen Green wrote:

>
> On Apr 15, 2009, at 2:19 PM, Jeff Eastman wrote:
>
>> None of the clustering implementations hard code the filesystem.  
>> The file names are constructed from the input and output filepath  
>> arguments.
>
> This is what I saw as well.
>
>> Grant Ingersoll wrote:
>>> I seem to recall this being something you have to set in your  
>>> Hadoop configuration.  Or, let me double check that we aren't hard- 
>>> coding the FS in our Job.
>
> I'm pretty sure that the configuration on EMR allows this, as the  
> job has no problem reading from the s3n input data.
>

Can you share your Hadoop configuration?  i.e. the contents of the  
conf directory?  You might actually have some luck asking this  
question on core-users over at Hadoop.

> I'm going to keep plugging!
>
> Steve
> -- 
> Stephen Green                      //   Stephen.Green@sun.com
> Principal Investigator             \\   http://blogs.sun.com/searchguy
> Aura Project                       //   Voice: +1 781-442-0926
> Sun Microsystems Labs              \\   Fax:   +1 781-442-1692
>
>
>

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:
http://www.lucidimagination.com/search


Re: Mahout on Elastic MapReduce

Posted by Stephen Green <St...@sun.com>.
On Apr 15, 2009, at 2:19 PM, Jeff Eastman wrote:

> None of the clustering implementations hard code the filesystem. The  
> file names are constructed from the input and output filepath  
> arguments.

This is what I saw as well.

> Grant Ingersoll wrote:
>> I seem to recall this being something you have to set in your  
>> Hadoop configuration.  Or, let me double check that we aren't hard- 
>> coding the FS in our Job.

I'm pretty sure that the configuration on EMR allows this, as the job  
has no problem reading from the s3n input data.

I'm going to keep plugging!

Steve
-- 
Stephen Green                      //   Stephen.Green@sun.com
Principal Investigator             \\   http://blogs.sun.com/searchguy
Aura Project                       //   Voice: +1 781-442-0926
Sun Microsystems Labs              \\   Fax:   +1 781-442-1692




Re: Mahout on Elastic MapReduce

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
None of the clustering implementations hard code the filesystem. The 
file names are constructed from the input and output filepath arguments.

Jeff

Grant Ingersoll wrote:
> I seem to recall this being something you have to set in your Hadoop 
> configuration.  Or, let me double check that we aren't hard-coding the 
> FS in our Job.
>
> -Grant
>
> On Apr 15, 2009, at 1:27 PM, Stephen Green wrote:
>
>> On Apr 14, 2009, at 6:54 PM, Stephen Green wrote:
>>> On Apr 14, 2009, at 5:17 PM, Grant Ingersoll wrote:
>>>
>>>> I would be concerned about the fact that EMR is using 0.18 and 
>>>> Mahout is on 0.19 (which of course raises another concern expressed 
>>>> by Owen O'Malley to me at ApacheCon: No one uses 0.19)
>>>
>>> Well, I did run Mahout locally on a 0.18.3 install, but that was 
>>> writing to and reading from HDFS.  I can build a custom 
>>> mahout-examples that has the 0.18.3 Hadoop jars (or perhaps no 
>>> hadoop jar at all...) I'm guessing if EMR is on 0.18.3 and it gets 
>>> popular, then you're going to have to deal with that problem.
>>
>>
>> More fun today.  I checked out the mahout-0.1 release and rebuilt 
>> mahout.  I took the mahout-examples job, removed the hadoop jar, and 
>> then tried to run the KMeans clustering against the synthetic control 
>> data.  This failed with the same exception that I was originally 
>> getting yesterday:
>>
>> java.lang.IllegalArgumentException: Wrong FS: s3n://mahout-output/, expected: hdfs://domU-12-31-38-01-C5-22.compute-1.internal:9000
>>        at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:320)
>>        at org.apache.hadoop.dfs.DistributedFileSystem.checkPath(DistributedFileSystem.java:84)
>>        at org.apache.hadoop.dfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:140)
>>        at org.apache.hadoop.dfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:408)
>>        at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:667)
>>        at org.apache.mahout.clustering.syntheticcontrol.kmeans.Job.runJob(Job.java:77)
>>        at org.apache.mahout.clustering.syntheticcontrol.kmeans.Job.main(Job.java:43)
>>        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>        at java.lang.reflect.Method.invoke(Method.java:597)
>>        at org.apache.hadoop.util.RunJar.main(RunJar.java:155)
>>        at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
>>        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
>>        at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)
>>
>> Steve
>> -- 
>> Stephen Green                      //   Stephen.Green@sun.com
>> Principal Investigator             \\   http://blogs.sun.com/searchguy
>> Aura Project                       //   Voice: +1 781-442-0926
>> Sun Microsystems Labs              \\   Fax:   +1 781-442-1692
>>
>>
>>
>
> --------------------------
> Grant Ingersoll
> http://www.lucidimagination.com/
>
> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) 
> using Solr/Lucene:
> http://www.lucidimagination.com/search
>
>
>


Re: Mahout on Elastic MapReduce

Posted by Grant Ingersoll <gs...@apache.org>.
I seem to recall this being something you have to set in your Hadoop  
configuration.  Or, let me double check that we aren't hard-coding the  
FS in our Job.

-Grant

On Apr 15, 2009, at 1:27 PM, Stephen Green wrote:

> On Apr 14, 2009, at 6:54 PM, Stephen Green wrote:
>> On Apr 14, 2009, at 5:17 PM, Grant Ingersoll wrote:
>>
>>> I would be concerned about the fact that EMR is using 0.18 and  
>>> Mahout is on 0.19 (which of course raises another concern  
>>> expressed by Owen O'Malley to me at ApacheCon: No one uses 0.19)
>>
>> Well, I did run Mahout locally on a 0.18.3 install, but that was  
>> writing to and reading from HDFS.  I can build a custom mahout- 
>> examples that has the 0.18.3 Hadoop jars (or perhaps no hadoop jar  
>> at all...) I'm guessing if EMR is on 0.18.3 and it gets popular,  
>> then you're going to have to deal with that problem.
>
>
> More fun today.  I checked out the mahout-0.1 release and rebuilt  
> mahout.  I took the mahout-examples job, removed the hadoop jar, and  
> then tried to run the KMeans clustering against the synthetic  
> control data.  This failed with the same exception that I was  
> originally getting yesterday:
>
> java.lang.IllegalArgumentException: Wrong FS: s3n://mahout-output/, expected: hdfs://domU-12-31-38-01-C5-22.compute-1.internal:9000
>        at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:320)
>        at org.apache.hadoop.dfs.DistributedFileSystem.checkPath(DistributedFileSystem.java:84)
>        at org.apache.hadoop.dfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:140)
>        at org.apache.hadoop.dfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:408)
>        at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:667)
>        at org.apache.mahout.clustering.syntheticcontrol.kmeans.Job.runJob(Job.java:77)
>        at org.apache.mahout.clustering.syntheticcontrol.kmeans.Job.main(Job.java:43)
>        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>        at java.lang.reflect.Method.invoke(Method.java:597)
>        at org.apache.hadoop.util.RunJar.main(RunJar.java:155)
>        at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
>        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
>        at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)
>
> Steve
> -- 
> Stephen Green                      //   Stephen.Green@sun.com
> Principal Investigator             \\   http://blogs.sun.com/searchguy
> Aura Project                       //   Voice: +1 781-442-0926
> Sun Microsystems Labs              \\   Fax:   +1 781-442-1692
>
>
>

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:
http://www.lucidimagination.com/search


Re: Mahout on Elastic MapReduce

Posted by Stephen Green <St...@sun.com>.
On Apr 14, 2009, at 6:54 PM, Stephen Green wrote:
> On Apr 14, 2009, at 5:17 PM, Grant Ingersoll wrote:
>
>> I would be concerned about the fact that EMR is using 0.18 and  
>> Mahout is on 0.19 (which of course raises another concern expressed  
>> by Owen O'Malley to me at ApacheCon: No one uses 0.19)
>
> Well, I did run Mahout locally on a 0.18.3 install, but that was  
> writing to and reading from HDFS.  I can build a custom mahout- 
> examples that has the 0.18.3 Hadoop jars (or perhaps no hadoop jar  
> at all...) I'm guessing if EMR is on 0.18.3 and it gets popular,  
> then you're going to have to deal with that problem.


More fun today.  I checked out the mahout-0.1 release and rebuilt  
mahout.  I took the mahout-examples job, removed the hadoop jar, and  
then tried to run the KMeans clustering against the synthetic control  
data.  This failed with the same exception that I was originally  
getting yesterday:

java.lang.IllegalArgumentException: Wrong FS: s3n://mahout-output/, expected: hdfs://domU-12-31-38-01-C5-22.compute-1.internal:9000
         at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:320)
         at org.apache.hadoop.dfs.DistributedFileSystem.checkPath(DistributedFileSystem.java:84)
         at org.apache.hadoop.dfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:140)
         at org.apache.hadoop.dfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:408)
         at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:667)
         at org.apache.mahout.clustering.syntheticcontrol.kmeans.Job.runJob(Job.java:77)
         at org.apache.mahout.clustering.syntheticcontrol.kmeans.Job.main(Job.java:43)
         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
         at java.lang.reflect.Method.invoke(Method.java:597)
         at org.apache.hadoop.util.RunJar.main(RunJar.java:155)
         at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
         at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)

Steve
-- 
Stephen Green                      //   Stephen.Green@sun.com
Principal Investigator             \\   http://blogs.sun.com/searchguy
Aura Project                       //   Voice: +1 781-442-0926
Sun Microsystems Labs              \\   Fax:   +1 781-442-1692




Re: Mahout on Elastic MapReduce

Posted by Stephen Green <St...@sun.com>.
On Apr 14, 2009, at 5:17 PM, Grant Ingersoll wrote:

> I would be concerned about the fact that EMR is using 0.18 and  
> Mahout is on 0.19 (which of course raises another concern expressed  
> by Owen O'Malley to me at ApacheCon: No one uses 0.19)

Well, I did run Mahout locally on a 0.18.3 install, but that was  
writing to and reading from HDFS.  I can build a custom mahout- 
examples that has the 0.18.3 Hadoop jars (or perhaps no hadoop jar at  
all...) I'm guessing if EMR is on 0.18.3 and it gets popular, then  
you're going to have to deal with that problem.

> I'd say you should try reproducing the problem on the same version  
> that Mahout uses.

That'll be a bit tricky in the EMR case as that's Amazon's business  
(ask me about trying to get a 64bit Solaris AMI on Amazon's version of  
Xen...)

>
> FWIW, any committer on the Mahout project can likely get credits to  
> use AWS.

I'm happy to share my limited experience.

Also:

>> ----- Original Message ----
>>> From: Sean Owen <sr...@gmail.com>
>>> To: mahout-user@lucene.apache.org
>>> Sent: Tuesday, April 14, 2009 4:19:51 PM
>>> Subject: Re: Mahout on Elastic MapReduce
>>>
>>> This is a fairly uninformed observation, but: the error seems to be
>>> from Hadoop. It seems to say that it understands hdfs:, but not  
>>> s3n:,
>>> and that makes sense to me. Do we expect Hadoop understands how to
>>> read from S3? I would expect not. (Though, you point to examples  
>>> that
>>> seem to overcome this just fine?)

As Otis pointed out, Hadoop can handle S3 a couple of ways, and the  
example that I've been working seems to be able to read the input data  
from an s3n URI no problem.

>>> When I have integrated code with stuff stored on S3, I have always  
>>> had
>>> to write extra glue code to copy from S3 to a local file system, do
>>> work, then copy back.

I think you do need to copy from S3 to HDFS, but I think that happens  
automagically (?  My Hadoop ignorance is starting to show!)

Steve
-- 
Stephen Green                      //   Stephen.Green@sun.com
Principal Investigator             \\   http://blogs.sun.com/searchguy
Aura Project                       //   Voice: +1 781-442-0926
Sun Microsystems Labs              \\   Fax:   +1 781-442-1692




Re: Mahout on Elastic MapReduce

Posted by Grant Ingersoll <gs...@apache.org>.
I would be concerned about the fact that EMR is using 0.18 and Mahout  
is on 0.19 (which of course raises another concern expressed by Owen  
O'Malley to me at ApacheCon: No one uses 0.19)

I'd say you should try reproducing the problem on the same version  
that Mahout uses.

FWIW, any committer on the Mahout project can likely get credits to  
use AWS.

On Apr 14, 2009, at 5:08 PM, Otis Gospodnetic wrote:

>
> Hadoop should be able to read directly from S3, I believe: http://wiki.apache.org/hadoop/AmazonS3
>
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
>
>
> ----- Original Message ----
>> From: Sean Owen <sr...@gmail.com>
>> To: mahout-user@lucene.apache.org
>> Sent: Tuesday, April 14, 2009 4:19:51 PM
>> Subject: Re: Mahout on Elastic MapReduce
>>
>> This is a fairly uninformed observation, but: the error seems to be
>> from Hadoop. It seems to say that it understands hdfs:, but not s3n:,
>> and that makes sense to me. Do we expect Hadoop understands how to
>> read from S3? I would expect not. (Though, you point to examples that
>> seem to overcome this just fine?)
>>
>> When I have integrated code with stuff stored on S3, I have always  
>> had
>> to write extra glue code to copy from S3 to a local file system, do
>> work, then copy back.
>>
>> On Tue, Apr 14, 2009 at 9:01 PM, Stephen Green wrote:
>>>
>>> On Apr 14, 2009, at 2:41 PM, Stephen Green wrote:
>>>
>>>>
>>>> On Apr 14, 2009, at 12:51 PM, Jeff Eastman wrote:
>>>>
>>>>> Hi Stephen,
>>>>>
>>>>> You are out on the bleeding edge with EMR.
>>>>
>>>> Yeah, but the view is lovely from here!
>>>>
>>>>> I've been able to run the kmeans example directly on a small EC2  
>>>>> cluster
>>>>> that I started up myself (using the Hadoop src/contrib/ec2  
>>>>> scripts). I have
>>>>> not yet tried EMR (just got an account yesterday), but I see  
>>>>> that it
>>>>> requires you to have your data in S3 as opposed to HDFS.
>>>>>
>>>>> The job first runs the InputDriver to copy the raw test data  
>>>>> into Mahout
>>>>> Vector external representation after deleting any pre-existing  
>>>>> output files.
>>>>> It looks to me like the two delete() snippets you show are pretty
>>>>> equivalent. If you have no pre-existing output directory, the  
>>>>> Mahout snippet
>>>>> won't attempt to delete it.
>>>>
>>>> I managed to figure that out :-)  I'm pretty comfortable with the  
>>>> ideas
>>>> behind MapReduce, but being confronted with my first Job is a bit  
>>>> more
>>>> daunting than I expected.
>>>>
>>>>> I too am at a loss to explain what you are seeing. If you can  
>>>>> post more
>>>>> results I can try to help you read the tea leaves...
>>>>
>>>> I noticed that the CloudBurst job just deleted the directory  
>>>> without
>>>> checking for existence and so I tried the same thing with Mahout:
>>>>
>>>> java.lang.IllegalArgumentException: Wrong FS: s3n://mahout-output, expected: hdfs://domU-12-31-38-00-6C-86.compute-1.internal:9000
>>>>      at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:320)
>>>>      at org.apache.hadoop.dfs.DistributedFileSystem.checkPath(DistributedFileSystem.java:84)
>>>>      at org.apache.hadoop.dfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:140)
>>>>      at org.apache.hadoop.dfs.DistributedFileSystem.delete(DistributedFileSystem.java:210)
>>>>      at org.apache.mahout.clustering.syntheticcontrol.kmeans.Job.runJob(Job.java:83)
>>>>      at org.apache.mahout.clustering.syntheticcontrol.kmeans.Job.main(Job.java:46)
>>>>
>>>> So no joy there.
>>>>
>>>> Should I see if I can isolate this as an s3n problem?  I suppose  
>>>> I could
>>>> try running the Hadoop job locally with it reading and writing  
>>>> the data from
>>>> S3 and see if it suffers from the same problem.  At least then I  
>>>> could debug
>>>> inside Hadoop.
>>>>
>>>> Of course, I'm doing all this in Hadoop 0.18.3, and if it is an s3n
>>>> problem it might have been fixed already.  That doesn't help much  
>>>> running on
>>>> EMR, I guess.
>>>>
>>>> I'm also going to start a run on EMR that does away with the whole
>>>> exists/delete check and see if that works.
>>>
>>> Following up to myself (my wife will tell you that I talk to  
>>> myself!)  I
>>> removed a number of the exists/delete checks:  in  
>>> CanopyClusteringJob,
>>> CanopyDriver, KMeansDriver, and ClusterDriver.  This allowed the  
>>> jobs to
>>> progress, but they died the death a little later with the following
>>> exception (and a few more, I can send the whole log if you like):
>>>
>>> java.lang.IllegalArgumentException: Wrong FS:
>>> s3n://mahoutput/canopies/part-00000, expected:
>>> hdfs://domU-12-31-39-00-A5-44.compute-1.internal:9000
>>>       at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:320)
>>>       at org.apache.hadoop.dfs.DistributedFileSystem.checkPath(DistributedFileSystem.java:84)
>>>       at org.apache.hadoop.dfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:140)
>>>       at org.apache.hadoop.dfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:408)
>>>       at org.apache.hadoop.fs.FileSystem.getLength(FileSystem.java:695)
>>>       at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1420)
>>>       at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1415)
>>>       at org.apache.mahout.clustering.canopy.ClusterMapper.configure(ClusterMapper.java:69)
>>>       at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58)
>>>       at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:82)
>>>       at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:33)
>>>       at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58)
>>>       at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:82)
>>>       at org.apache.hadoop.mapred.MapTask.run(MapTask.java:223)
>>>       at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2198)
>>>
>>> Looking at the exception message there, I would almost swear that  
>>> it thinks
>>> the whole s3n path is the name of a FS that it doesn't know about,  
>>> but that
>>> might just be a bad message.  This message repeats a few times  
>>> (retrying
>>> failed mappers, I guess?) and then the job fails.
>>>
>>> One thing that occurred to me:  the mahout examples job has the  
>>> hadoop
>>> 0.19.1 core jar in it.  Could I be seeing some kind of version  
>>> skew between
>>> the hadoop in the job file and the one on EMR?  Although it worked  
>>> fine with
>>> a local 0.18.3, so maybe not.
>>>
>>> I'm going to see if I can get the stock Mahout to run with s3n  
>>> inputs and
>>> outputs tomorrow and I'll let you all know how that goes.
>>>
>>> Steve
>>> --
>>> Stephen Green                      //   Stephen.Green@sun.com
>>> Principal Investigator             \\   http://blogs.sun.com/searchguy
>>> Aura Project                       //   Voice: +1 781-442-0926
>>> Sun Microsystems Labs              \\   Fax:   +1 781-442-1692
>>>
>>>
>>>
>>>
>

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:
http://www.lucidimagination.com/search


Re: Mahout on Elastic MapReduce

Posted by Sean Owen <sr...@gmail.com>.
OK that is good evidence that it is not the 's3n' that is the problem. Seems
like it was expecting to write to a local filesystem from some earlier bit
of configuration?

Not sure this is helping,
Sean

On Apr 14, 2009 10:08 PM, "Otis Gospodnetic" <ot...@yahoo.com>
wrote:


Hadoop should be able to read directly from S3, I believe:
http://wiki.apache.org/hadoop/AmazonS3

 Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


Re: Mahout on Elastic MapReduce

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Hadoop should be able to read directly from S3, I believe: http://wiki.apache.org/hadoop/AmazonS3

 Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
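That wiki page also notes that the s3n filesystem needs AWS credentials before it can read a bucket. A minimal configuration fragment, a sketch with the property names as documented on that page and the values elided:

```xml
<!-- Sketch only: s3n credential properties per the Hadoop AmazonS3
     wiki page; replace the placeholder values with real keys. -->
<property>
  <name>fs.s3n.awsAccessKeyId</name>
  <value>YOUR_ACCESS_KEY</value>
</property>
<property>
  <name>fs.s3n.awsSecretAccessKey</name>
  <value>YOUR_SECRET_KEY</value>
</property>
```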



----- Original Message ----
> From: Sean Owen <sr...@gmail.com>
> To: mahout-user@lucene.apache.org
> Sent: Tuesday, April 14, 2009 4:19:51 PM
> Subject: Re: Mahout on Elastic MapReduce
> 
> This is a fairly uninformed observation, but: the error seems to be
> from Hadoop. It seems to say that it understands hdfs:, but not s3n:,
> and that makes sense to me. Do we expect Hadoop understands how to
> read from S3? I would expect not. (Though, you point to examples that
> seem to overcome this just fine?)
> 
> When I have integrated code with stuff stored on S3, I have always had
> to write extra glue code to copy from S3 to a local file system, do
> work, then copy back.
> 
> On Tue, Apr 14, 2009 at 9:01 PM, Stephen Green wrote:
> >
> > On Apr 14, 2009, at 2:41 PM, Stephen Green wrote:
> >
> >>
> >> On Apr 14, 2009, at 12:51 PM, Jeff Eastman wrote:
> >>
> >>> Hi Stephen,
> >>>
> >>> You are out on the bleeding edge with EMR.
> >>
> >> Yeah, but the view is lovely from here!
> >>
> >>> I've been able to run the kmeans example directly on a small EC2 cluster
> >>> that I started up myself (using the Hadoop src/contrib/ec2 scripts). I have
> >>> not yet tried EMR (just got an account yesterday), but I see that it
> >>> requires you to have your data in S3 as opposed to HDFS.
> >>>
> >>> The job first runs the InputDriver to copy the raw test data into Mahout
> >>> Vector external representation after deleting any pre-existing output files.
> >>> It looks to me like the two delete() snippets you show are pretty
> >>> equivalent. If you have no pre-existing output directory, the Mahout snippet
> >>> won't attempt to delete it.
> >>
> >> I managed to figure that out :-)  I'm pretty comfortable with the ideas
> >> behind MapReduce, but being confronted with my first Job is a bit more
> >> daunting than I expected.
> >>
> >>> I too am at a loss to explain what you are seeing. If you can post more
> >>> results I can try to help you read the tea leaves...
> >>
> >> I noticed that the CloudBurst job just deleted the directory without
> >> checking for existence and so I tried the same thing with Mahout:
> >>
> >> java.lang.IllegalArgumentException: Wrong FS: s3n://mahout-output,
> >> expected: hdfs://domU-12-31-38-00-6C-86.compute-1.internal:9000
> >>       at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:320)
> >>       at org.apache.hadoop.dfs.DistributedFileSystem.checkPath(DistributedFileSystem.java:84)
> >>       at org.apache.hadoop.dfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:140)
> >>       at org.apache.hadoop.dfs.DistributedFileSystem.delete(DistributedFileSystem.java:210)
> >>       at org.apache.mahout.clustering.syntheticcontrol.kmeans.Job.runJob(Job.java:83)
> >>       at org.apache.mahout.clustering.syntheticcontrol.kmeans.Job.main(Job.java:46)
> >>
> >> So no joy there.
> >>
> >> Should I see if I can isolate this as an s3n problem?  I suppose I could
> >> try running the Hadoop job locally with it reading and writing the data from
> >> S3 and see if it suffers from the same problem.  At least then I could debug
> >> inside Hadoop.
> >>
> >> Of course, I'm doing all this in Hadoop 0.18.3, and if it is an s3n
> >> problem it might have been fixed already.  That doesn't help much running on
> >> EMR, I guess.
> >>
> >> I'm also going to start a run on EMR that does away with the whole
> >> exists/delete check and see if that works.
> >
> > Following up to myself (my wife will tell you that I talk to myself!)  I
> > removed a number of the exists/delete checks:  in CanopyClusteringJob,
> > CanopyDriver, KMeansDriver, and ClusterDriver.  This allowed the jobs to
> > progress, but they died the death a little later with the following
> > exception (and a few more, I can send the whole log if you like):
> >
> > java.lang.IllegalArgumentException: Wrong FS:
> > s3n://mahoutput/canopies/part-00000, expected:
> > hdfs://domU-12-31-39-00-A5-44.compute-1.internal:9000
> >        at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:320)
> >        at org.apache.hadoop.dfs.DistributedFileSystem.checkPath(DistributedFileSystem.java:84)
> >        at org.apache.hadoop.dfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:140)
> >        at org.apache.hadoop.dfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:408)
> >        at org.apache.hadoop.fs.FileSystem.getLength(FileSystem.java:695)
> >        at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1420)
> >        at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1415)
> >        at org.apache.mahout.clustering.canopy.ClusterMapper.configure(ClusterMapper.java:69)
> >        at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58)
> >        at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:82)
> >        at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:33)
> >        at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58)
> >        at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:82)
> >        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:223)
> >        at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2198)
> >
> > Looking at the exception message there, I would almost swear that it thinks
> > the whole s3n path is the name of a FS that it doesn't know about, but that
> > might just be a bad message.  This message repeats a few times (retrying
> > failed mappers, I guess?) and then the job fails.
> >
> > One thing that occurred to me:  the mahout examples job has the hadoop
> > 0.19.1 core jar in it.  Could I be seeing some kind of version skew between
> > the hadoop in the job file and the one on EMR?  Although it worked fine with
> > a local 0.18.3, so maybe not.
> >
> > I'm going to see if I can get the stock Mahout to run with s3n inputs and
> > outputs tomorrow and I'll let you all know how that goes.
> >
> > Steve
> > --
> > Stephen Green                      //   Stephen.Green@sun.com
> > Principal Investigator             \\   http://blogs.sun.com/searchguy
> > Aura Project                       //   Voice: +1 781-442-0926
> > Sun Microsystems Labs              \\   Fax:   +1 781-442-1692
> >
> >
> >
> >


Re: Mahout on Elastic MapReduce

Posted by Sean Owen <sr...@gmail.com>.
This is a fairly uninformed observation, but: the error seems to be
from Hadoop. It seems to say that it understands hdfs:, but not s3n:,
and that makes sense to me. Do we expect Hadoop understands how to
read from S3? I would expect not. (Though, you point to examples that
seem to overcome this just fine?)

When I have integrated code with stuff stored on S3, I have always had
to write extra glue code to copy from S3 to a local file system, do
work, then copy back.

On Tue, Apr 14, 2009 at 9:01 PM, Stephen Green <St...@sun.com> wrote:
>
> On Apr 14, 2009, at 2:41 PM, Stephen Green wrote:
>
>>
>> On Apr 14, 2009, at 12:51 PM, Jeff Eastman wrote:
>>
>>> Hi Stephen,
>>>
>>> You are out on the bleeding edge with EMR.
>>
>> Yeah, but the view is lovely from here!
>>
>>> I've been able to run the kmeans example directly on a small EC2 cluster
>>> that I started up myself (using the Hadoop src/contrib/ec2 scripts). I have
>>> not yet tried EMR (just got an account yesterday), but I see that it
>>> requires you to have your data in S3 as opposed to HDFS.
>>>
>>> The job first runs the InputDriver to copy the raw test data into Mahout
>>> Vector external representation after deleting any pre-existing output files.
>>> It looks to me like the two delete() snippets you show are pretty
>>> equivalent. If you have no pre-existing output directory, the Mahout snippet
>>> won't attempt to delete it.
>>
>> I managed to figure that out :-)  I'm pretty comfortable with the ideas
>> behind MapReduce, but being confronted with my first Job is a bit more
>> daunting than I expected.
>>
>>> I too am at a loss to explain what you are seeing. If you can post more
>>> results I can try to help you read the tea leaves...
>>
>> I noticed that the CloudBurst job just deleted the directory without
>> checking for existence and so I tried the same thing with Mahout:
>>
>> java.lang.IllegalArgumentException: Wrong FS: s3n://mahout-output,
>> expected: hdfs://domU-12-31-38-00-6C-86.compute-1.internal:9000
>>       at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:320)
>>       at
>> org.apache.hadoop.dfs.DistributedFileSystem.checkPath(DistributedFileSystem.java:84)
>>       at
>> org.apache.hadoop.dfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:140)
>>       at
>> org.apache.hadoop.dfs.DistributedFileSystem.delete(DistributedFileSystem.java:210)
>>       at
>> org.apache.mahout.clustering.syntheticcontrol.kmeans.Job.runJob(Job.java:83)
>>       at
>> org.apache.mahout.clustering.syntheticcontrol.kmeans.Job.main(Job.java:46)
>>
>> So no joy there.
>>
>> Should I see if I can isolate this as an s3n problem?  I suppose I could
>> try running the Hadoop job locally with it reading and writing the data from
>> S3 and see if it suffers from the same problem.  At least then I could debug
>> inside Hadoop.
>>
>> Of course, I'm doing all this in Hadoop 0.18.3, and if it is an s3n
>> problem it might have been fixed already.  That doesn't help much running on
>> EMR, I guess.
>>
>> I'm also going to start a run on EMR that does away with the whole
>> exists/delete check and see if that works.
>
> Following up to myself (my wife will tell you that I talk to myself!)  I
> removed a number of the exists/delete checks:  in CanopyClusteringJob,
> CanopyDriver, KMeansDriver, and ClusterDriver.  This allowed the jobs to
> progress, but they died the death a little later with the following
> exception (and a few more, I can send the whole log if you like):
>
> java.lang.IllegalArgumentException: Wrong FS:
> s3n://mahoutput/canopies/part-00000, expected:
> hdfs://domU-12-31-39-00-A5-44.compute-1.internal:9000
>        at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:320)
>        at
> org.apache.hadoop.dfs.DistributedFileSystem.checkPath(DistributedFileSystem.java:84)
>        at
> org.apache.hadoop.dfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:140)
>        at
> org.apache.hadoop.dfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:408)
>        at org.apache.hadoop.fs.FileSystem.getLength(FileSystem.java:695)
>        at
> org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1420)
>        at
> org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1415)
>        at
> org.apache.mahout.clustering.canopy.ClusterMapper.configure(ClusterMapper.java:69)
>        at
> org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58)
>        at
> org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:82)
>        at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:33)
>        at
> org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58)
>        at
> org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:82)
>        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:223)
>        at
> org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2198)
>
> Looking at the exception message there, I would almost swear that it thinks
> the whole s3n path is the name of a FS that it doesn't know about, but that
> might just be a bad message.  This message repeats a few times (retrying
> failed mappers, I guess?) and then the job fails.
>
> One thing that occurred to me:  the mahout examples job has the hadoop
> 0.19.1 core jar in it.  Could I be seeing some kind of version skew between
> the hadoop in the job file and the one on EMR?  Although it worked fine with
> a local 0.18.3, so maybe not.
>
> I'm going to see if I can get the stock Mahout to run with s3n inputs and
> outputs tomorrow and I'll let you all know how that goes.
>
> Steve
> --
> Stephen Green                      //   Stephen.Green@sun.com
> Principal Investigator             \\   http://blogs.sun.com/searchguy
> Aura Project                       //   Voice: +1 781-442-0926
> Sun Microsystems Labs              \\   Fax:   +1 781-442-1692
>
>
>
>

Re: Mahout on Elastic MapReduce

Posted by Stephen Green <St...@sun.com>.
On Apr 14, 2009, at 2:41 PM, Stephen Green wrote:

>
> On Apr 14, 2009, at 12:51 PM, Jeff Eastman wrote:
>
>> Hi Stephen,
>>
>> You are out on the bleeding edge with EMR.
>
> Yeah, but the view is lovely from here!
>
>> I've been able to run the kmeans example directly on a small EC2  
>> cluster that I started up myself (using the Hadoop src/contrib/ec2  
>> scripts). I have not yet tried EMR (just got an account yesterday),  
>> but I see that it requires you to have your data in S3 as opposed  
>> to HDFS.
>>
>> The job first runs the InputDriver to copy the raw test data into  
>> Mahout Vector external representation after deleting any pre- 
>> existing output files. It looks to me like the two delete()  
>> snippets you show are pretty equivalent. If you have no pre- 
>> existing output directory, the Mahout snippet won't attempt to  
>> delete it.
>
> I managed to figure that out :-)  I'm pretty comfortable with the  
> ideas behind MapReduce, but being confronted with my first Job is a  
> bit more daunting than I expected.
>
>> I too am at a loss to explain what you are seeing. If you can post  
>> more results I can try to help you read the tea leaves...
>
> I noticed that the CloudBurst job just deleted the directory without  
> checking for existence and so I tried the same thing with Mahout:
>
> java.lang.IllegalArgumentException: Wrong FS: s3n://mahout-output,
> expected: hdfs://domU-12-31-38-00-6C-86.compute-1.internal:9000
>        at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:320)
>        at org.apache.hadoop.dfs.DistributedFileSystem.checkPath(DistributedFileSystem.java:84)
>        at org.apache.hadoop.dfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:140)
>        at org.apache.hadoop.dfs.DistributedFileSystem.delete(DistributedFileSystem.java:210)
>        at org.apache.mahout.clustering.syntheticcontrol.kmeans.Job.runJob(Job.java:83)
>        at org.apache.mahout.clustering.syntheticcontrol.kmeans.Job.main(Job.java:46)
>
> So no joy there.
>
> Should I see if I can isolate this as an s3n problem?  I suppose I  
> could try running the Hadoop job locally with it reading and writing  
> the data from S3 and see if it suffers from the same problem.  At  
> least then I could debug inside Hadoop.
>
> Of course, I'm doing all this in Hadoop 0.18.3, and if it is an s3n  
> problem it might have been fixed already.  That doesn't help much  
> running on EMR, I guess.
>
> I'm also going to start a run on EMR that does away with the whole  
> exists/delete check and see if that works.

Following up to myself (my wife will tell you that I talk to myself!)   
I removed a number of the exists/delete checks:  in  
CanopyClusteringJob, CanopyDriver, KMeansDriver, and ClusterDriver.   
This allowed the jobs to progress, but they died the death a little  
later with the following exception (and a few more, I can send the  
whole log if you like):

java.lang.IllegalArgumentException: Wrong FS: s3n://mahoutput/canopies/part-00000, expected: hdfs://domU-12-31-39-00-A5-44.compute-1.internal:9000
         at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:320)
         at org.apache.hadoop.dfs.DistributedFileSystem.checkPath(DistributedFileSystem.java:84)
         at org.apache.hadoop.dfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:140)
         at org.apache.hadoop.dfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:408)
         at org.apache.hadoop.fs.FileSystem.getLength(FileSystem.java:695)
         at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1420)
         at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1415)
         at org.apache.mahout.clustering.canopy.ClusterMapper.configure(ClusterMapper.java:69)
         at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58)
         at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:82)
         at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:33)
         at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58)
         at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:82)
         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:223)
         at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2198)

Looking at the exception message there, I would almost swear that it  
thinks the whole s3n path is the name of a FS that it doesn't know  
about, but that might just be a bad message.  This message repeats a  
few times (retrying failed mappers, I guess?) and then the job fails.
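That reading of the message is close to what a scheme/authority check would produce: the path's scheme and authority get compared against the filesystem's own URI, and the full offending path is embedded in the message. A toy model of that comparison, using only java.net.URI and not Hadoop's actual code:

```java
import java.net.URI;

// Toy model (not Hadoop's real implementation) of a checkPath-style
// test: a filesystem accepts a path only when the path's scheme and
// authority match the filesystem's own URI. The "Wrong FS" message
// embeds the whole offending path, which is why it reads as if the
// full s3n path were the name of an unknown filesystem.
public class WrongFsCheck {

    // Returns null when the path is acceptable, otherwise the message
    // that would go into the IllegalArgumentException.
    public static String checkPath(URI fsUri, URI path) {
        String scheme = path.getScheme();
        if (scheme == null) {
            return null; // schemeless paths resolve against this filesystem
        }
        if (scheme.equals(fsUri.getScheme())) {
            String auth = path.getAuthority();
            if (auth == null || auth.equals(fsUri.getAuthority())) {
                return null;
            }
        }
        return "Wrong FS: " + path + ", expected: " + fsUri;
    }

    public static void main(String[] args) {
        URI hdfs = URI.create("hdfs://domU-12-31-39-00-A5-44.compute-1.internal:9000");
        URI s3nPath = URI.create("s3n://mahoutput/canopies/part-00000");
        // The default (HDFS) filesystem rejects the s3n path, producing a
        // message shaped like the one in the logs; an s3n filesystem for
        // that bucket accepts it.
        System.out.println(checkPath(hdfs, s3nPath));
        System.out.println(checkPath(URI.create("s3n://mahoutput"), s3nPath));
    }
}
```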

One thing that occurred to me:  the mahout examples job has the hadoop  
0.19.1 core jar in it.  Could I be seeing some kind of version skew  
between the hadoop in the job file and the one on EMR?  Although it  
worked fine with a local 0.18.3, so maybe not.

I'm going to see if I can get the stock Mahout to run with s3n inputs  
and outputs tomorrow and I'll let you all know how that goes.

Steve
-- 
Stephen Green                      //   Stephen.Green@sun.com
Principal Investigator             \\   http://blogs.sun.com/searchguy
Aura Project                       //   Voice: +1 781-442-0926
Sun Microsystems Labs              \\   Fax:   +1 781-442-1692




Re: Mahout on Elastic MapReduce

Posted by Stephen Green <St...@sun.com>.
On Apr 14, 2009, at 12:51 PM, Jeff Eastman wrote:

> Hi Stephen,
>
> You are out on the bleeding edge with EMR.

Yeah, but the view is lovely from here!

> I've been able to run the kmeans example directly on a small EC2  
> cluster that I started up myself (using the Hadoop src/contrib/ec2  
> scripts). I have not yet tried EMR (just got an account yesterday),  
> but I see that it requires you to have your data in S3 as opposed to  
> HDFS.
>
> The job first runs the InputDriver to copy the raw test data into  
> Mahout Vector external representation after deleting any pre- 
> existing output files. It looks to me like the two delete() snippets  
> you show are pretty equivalent. If you have no pre-existing output  
> directory, the Mahout snippet won't attempt to delete it.

I managed to figure that out :-)  I'm pretty comfortable with the  
ideas behind MapReduce, but being confronted with my first Job is a  
bit more daunting than I expected.

> I too am at a loss to explain what you are seeing. If you can post  
> more results I can try to help you read the tea leaves...

I noticed that the CloudBurst job just deleted the directory without  
checking for existence and so I tried the same thing with Mahout:

java.lang.IllegalArgumentException: Wrong FS: s3n://mahout-output, expected: hdfs://domU-12-31-38-00-6C-86.compute-1.internal:9000
         at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:320)
         at org.apache.hadoop.dfs.DistributedFileSystem.checkPath(DistributedFileSystem.java:84)
         at org.apache.hadoop.dfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:140)
         at org.apache.hadoop.dfs.DistributedFileSystem.delete(DistributedFileSystem.java:210)
         at org.apache.mahout.clustering.syntheticcontrol.kmeans.Job.runJob(Job.java:83)
         at org.apache.mahout.clustering.syntheticcontrol.kmeans.Job.main(Job.java:46)

So no joy there.

Should I see if I can isolate this as an s3n problem?  I suppose I  
could try running the Hadoop job locally with it reading and writing  
the data from S3 and see if it suffers from the same problem.  At  
least then I could debug inside Hadoop.

Of course, I'm doing all this in Hadoop 0.18.3, and if it is an s3n  
problem it might have been fixed already.  That doesn't help much  
running on EMR, I guess.

I'm also going to start a run on EMR that does away with the whole  
exists/delete check and see if that works.

Thanks for the help, and I'll let you know how I get on.

Steve
-- 
Stephen Green                      //   Stephen.Green@sun.com
Principal Investigator             \\   http://blogs.sun.com/searchguy
Aura Project                       //   Voice: +1 781-442-0926
Sun Microsystems Labs              \\   Fax:   +1 781-442-1692




Re: Mahout on Elastic MapReduce

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
Hi Stephen,

You are out on the bleeding edge with EMR. I've been able to run the 
kmeans example directly on a small EC2 cluster that I started up myself 
(using the Hadoop src/contrib/ec2 scripts). I have not yet tried EMR 
(just got an account yesterday), but I see that it requires you to have 
your data in S3 as opposed to HDFS.

The job first runs the InputDriver to copy the raw test data into Mahout 
Vector external representation after deleting any pre-existing output 
files. It looks to me like the two delete() snippets you show are pretty 
equivalent. If you have no pre-existing output directory, the Mahout 
snippet won't attempt to delete it.

I too am at a loss to explain what you are seeing. If you can post more 
results I can try to help you read the tea leaves...
Jeff
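One way to make the symptom concrete: on the 0.18-era API, FileSystem.get(conf) always returns the filesystem named by fs.default.name (HDFS on the cluster), while Path.getFileSystem(conf) dispatches on the path's own scheme, so an s3n:// URI resolves to the S3-native filesystem. A toy sketch of that difference, with a plain map standing in for Hadoop's real filesystem registry:

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

// Toy sketch, not Hadoop's real API: contrasts default-filesystem lookup
// (what FileSystem.get(conf) does) with scheme-based lookup (what
// Path.getFileSystem(conf) does). The registry entries are illustrative.
public class SchemeLookup {
    static final Map<String, String> REGISTRY = new HashMap<>();
    static {
        REGISTRY.put("hdfs", "DistributedFileSystem");
        REGISTRY.put("s3n", "NativeS3FileSystem");
    }

    // Mimics FileSystem.get(conf): ignores the path, uses fs.default.name.
    static String getDefault(String fsDefaultName) {
        return REGISTRY.get(URI.create(fsDefaultName).getScheme());
    }

    // Mimics path.getFileSystem(conf): dispatches on the path's scheme,
    // falling back to the default for schemeless paths.
    static String getForPath(String path, String fsDefaultName) {
        String scheme = URI.create(path).getScheme();
        return scheme == null ? getDefault(fsDefaultName) : REGISTRY.get(scheme);
    }

    public static void main(String[] args) {
        String def = "hdfs://domU-12-31-39-00-A5-44.compute-1.internal:9000";
        System.out.println(getDefault(def));
        System.out.println(getForPath("s3n://mahout-output", def));
    }
}
```

In the actual job code the analogous change would be something along the lines of `FileSystem fs = new Path(output).getFileSystem(conf)` in place of `FileSystem.get(conf)`; that is offered only as an untested sketch here.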

Stephen Green wrote:
> I told some folks here at work that I would give a talk on Mahout for 
> our reading group and decided that I would use it as an opportunity to 
> try Amazon's Elastic MapReduce (EMR).
>
> I downloaded and untarred Hadoop 0.18.3, which is the version that 
> Amazon claims they have running so that I could try things out here.   
> I can start up Hadoop and successfully run a KMeans cluster on the 
> synthetic control data using the instructions on the wiki and the 
> following command line:
>
> bin/hadoop jar 
> ~/Projects/EC2/mahout-0.1/examples/target/mahout-examples-0.1.job 
> org.apache.mahout.clustering.syntheticcontrol.kmeans.Job 
> input/testdata output org.apache.mahout.utils.EuclideanDistanceMeasure 
> 80 55 0.5 10
>
> I realize there's a shorter invocation, but I'm trying to figure out 
> what Amazon needs to run this, so I pulled the default arguments 
> from the KMeans job.
>
> Now, on Amazon, you can specify a jar file that gets run with 
> "bin/hadoop jar" and you also specify the arguments that will be used 
> with that jar file.
>
> The trick is that the input and output data need to be in S3 buckets 
> and you need to specify the locations with S3 native URIs.  I used the 
> command line interface to EMR to create a job like so:
>
> elastic-mapreduce -v --create --name KMeans --num-instances 1 \
>     --jar s3n://mahout-code/mahout-examples-0.1.job \
>     --main-class 
> org.apache.mahout.clustering.syntheticcontrol.kmeans.Job \
>     --arg s3n://mahout-input/testdata \
>     --arg s3n://mahout-output \
>     --arg org.apache.mahout.utils.EuclideanDistanceMeasure \
>     --arg 80 --arg 55 --arg 0.5 --arg 10
>
> But this fails with the message:  Steps completed with errors.  Turns 
> out you can have the EMR infrastructure dump the logs for the tasks 
> and looking at the stderr for step 1 I see:
>
> java.lang.IllegalArgumentException: Wrong FS: s3n://mahout-output, 
> expected: hdfs://domU-12-31-39-00-ED-51.compute-1.internal:9000
>         at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:320)
>         at org.apache.hadoop.dfs.DistributedFileSystem.checkPath(DistributedFileSystem.java:84)
>         at org.apache.hadoop.dfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:140)
>         at org.apache.hadoop.dfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:408)
>         at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:667)
>         at org.apache.mahout.clustering.syntheticcontrol.kmeans.Job.runJob(Job.java:77)
>         at org.apache.mahout.clustering.syntheticcontrol.kmeans.Job.main(Job.java:44)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>         at java.lang.reflect.Method.invoke(Method.java:597)
>         at org.apache.hadoop.util.RunJar.main(RunJar.java:155)
>         at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
>         at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)
>
> If I replace the s3n URI for the output with just mahout-output the 
> code appears to run without incident  (at least the log output looks 
> like the log output from my local run.)  Unfortunately, the HDFS 
> instance into which it's put disappears in a puff of smoke when the 
> job finishes running.
>
> Now, I am by no means a Hadoop expert, but it seems like if it can 
> load the data from an s3n input URI, then it probably has the right 
> classes in there to do that (in fact, it looks like the jets3t jar is 
> in the .job file three times!), so it seems like the KMeans job from 
> mahout should be happy to use an s3n output URI, but I'm clearly 
> misunderstanding something here.
>
> One of the EMR samples is a Java DNA sequence matching thing 
> (CloudBurst), which seems to work fine with an s3n URI for the 
> output.  The setup for its output looks like the following:
>
>         Path oPath = new Path(outpath);
>         FileOutputFormat.setOutputPath(conf, oPath);
>         System.err.println("  Removing old results");
>         FileSystem.get(conf).delete(oPath);
>
> where "conf" is of type org.apache.hadoop.mapred.JobConf.  This is a 
> bit different than what happens in the KMeans job:
>
>     Path outPath = new Path(output);
>     client.setConf(conf);
>     FileSystem dfs = FileSystem.get(conf);
>     if (dfs.exists(outPath))
>       dfs.delete(outPath, true);
>
> Trying to use the CloudBurst idiom in the KMeans job produced no joy.  
> Any help would be greatly appreciated.
>
> Steve Green