Posted to user@mahout.apache.org by Matt Spitz <ms...@meebo-inc.com> on 2010/10/26 22:53:59 UTC

Getting mahout to run on the DFS

I'm running mahout-0.3 (stable downloaded from the site), and I'm trying to
run lda as a hadoop job.  Specifically, from within mahout-0.3 (totally
clean extraction of the tarball), I'm running:
./bin/mahout lda -i myurls-seqdir-sparse/vectors -o myurls-lda -k 20 -v
50000 -w

myurls-seqdir-sparse exists on the DFS in my home directory (I can `hadoop
dfs -ls` it), and it's in the right format (I borrowed some lines from the
build-reuters.sh script).

I run the command with MAHOUT_HOME, HADOOP_HOME, and HADOOP_CONF_DIR set.
The mahout script confirms this by saying:
"running on hadoop, using HADOOP_HOME=/usr/lib/hadoop-0.20 and
HADOOP_CONF_DIR=/etc/hadoop-0.20"

But then I get the following:
10/10/26 13:17:33 ERROR driver.MahoutDriver: MahoutDriver failed with args:
[-i, myurls-seqdir-sparse/vectors, -o, myurls-lda, -k, 20, -v, 50000, -w,
null]
Input path does not exist:
file:/home/mspitz/mahoutplayground/mahout-0.3/myurls-seqdir-sparse/vectors

It's true, 'myurls-seqdir-sparse' doesn't exist locally, but shouldn't it be
looking on the DFS if I'm running it as a hadoop jar job?  If it helps, the
explicit command it's executing (from the script) is:
/usr/lib/hadoop-0.20/bin/hadoop jar
 /home/mspitz/mahoutplayground/mahout-0.3/mahout-examples-0.3.job
org.apache.mahout.driver.MahoutDriver lda -i myurls-seqdir-sparse/vectors -o
myurls-lda -k 20 -v 50000 -w

The HDFS works mighty fine and is configured properly (I run Pig jobs on it
all the time).   I just can't get the mahout job to run over hadoop.
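To illustrate what I suspect is happening: an unqualified Hadoop Path like
myurls-seqdir-sparse/vectors is resolved against whatever default filesystem the
client is configured with, so a driver that can't see the cluster conf on its
classpath silently binds it to file:/. Here's a rough sketch of those resolution
mechanics using plain java.net.URI (this is an analogy, not Hadoop's actual
code; the directories and namenode address are illustrative):

```java
import java.net.URI;

public class DefaultFsDemo {
    public static void main(String[] args) {
        // No cluster conf visible: the default filesystem falls back to the
        // local one, so a relative input path resolves under file:/
        URI local = URI.create("file:/home/mspitz/mahoutplayground/mahout-0.3/");
        System.out.println(local.resolve("myurls-seqdir-sparse/vectors"));
        // -> file:/home/mspitz/mahoutplayground/mahout-0.3/myurls-seqdir-sparse/vectors

        // Conf visible: the default filesystem is the namenode, and the same
        // relative path lands in the user's HDFS home directory
        URI hdfs = URI.create("hdfs://192.168.1.100:54310/user/mspitz/");
        System.out.println(hdfs.resolve("myurls-seqdir-sparse/vectors"));
        // -> hdfs://192.168.1.100:54310/user/mspitz/myurls-seqdir-sparse/vectors
    }
}
```

The "Input path does not exist: file:/home/mspitz/..." error above matches the
first case, which is why checking that the Hadoop conf actually ends up on the
client classpath seems like the first thing to verify.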

Any thoughts?

Thanks, folks!

-Matt

Re: Getting mahout to run on the DFS

Posted by Matt Spitz <ms...@meebo-inc.com>.
Gah, that's ridiculous.  I hadn't specified MAHOUT_HOME, which makes our
HADOOP_CLASSPATH outputs identical.

So, running locally (no HADOOP_HOME/HADOOP_CONF_DIR set), kmeans runs fine
with -xm mapreduce and -xm sequential.

Running on hadoop (using HADOOP_HOME/HADOOP_CONF_DIR), kmeans runs fine with
-xm sequential, but runs into the exception mentioned above with -xm
mapreduce.

There's gotta be something different about the way in which we browse our
filesystems on the DFS.  Or perhaps the permissions with which these things
are created?

Looks like the clusters/part-randomSeed is -rw-r--r--, which is the same as
all of the chunk-* files in reuters-out-seqdir.

I'm stumped.

-Matt


RE: Getting mahout to run on the DFS

Posted by Jeff Eastman <je...@Narus.com>.
Frustrating. We're both running CDH3, right? That's Hadoop 0.20.2. I added the echoes you suggested and my Classpath: output is empty. My Command: output is essentially the same as what you reported.


Re: Getting mahout to run on the DFS

Posted by Matt Spitz <ms...@meebo-inc.com>.
Blast!  I ran it as another user, and no dice.  Same error.

I guess my question for you was to figure out what your classpath was and
see if there was anything different.  bin/mahout is just a simple script,
and I was just adding a quick 'echo' to it.

What version of hadoop are you running?  I wonder if the "Path" class is
defined differently for different versions.

Thanks,
Matt


Re: Getting mahout to run on the DFS

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
On 10/29/10 10:09 AM, Jeff Eastman wrote:
> Ok, very interesting. I think you are onto the root cause. I can't work on this until the weekend but will investigate further then.
>
I tried creating another user on my CDH3 box and, for a minute, thought 
I could duplicate something like your problem. But it was a permission 
problem in examples/bin/work that resulted in 0 vectors being output 
from seq2sparse. That caused an array indexing error in the RandomSeed 
generator, but it went away when I made /work be 777. Even in that 
situation, I got the same error (of course) running kmeans -xm sequential.

You can modify bin/mahout to your heart's content. I hope you are having 
better luck than I am. Build-reuters works perfectly under both userIds.
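The 0-vectors failure mode is worth pinning down: sampling k seeds from an
empty input is exactly the shape of an indexing blowup. A toy sketch of that
failure (hypothetical code, not Mahout's actual RandomSeedGenerator):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class EmptySampleDemo {
    // Naive seed sampling: pick k random vectors from the input.  With an
    // empty input, nextInt(0) throws, which is the kind of error an empty or
    // unreadable seq2sparse output directory produces downstream.
    static List<double[]> pickSeeds(List<double[]> vectors, int k) {
        List<double[]> seeds = new ArrayList<>();
        Random r = new Random(42);
        for (int i = 0; i < k; i++) {
            seeds.add(vectors.get(r.nextInt(vectors.size())));
        }
        return seeds;
    }

    public static void main(String[] args) {
        try {
            pickSeeds(new ArrayList<double[]>(), 20);
        } catch (RuntimeException e) {
            System.out.println("empty input: " + e.getClass().getSimpleName());
        }
    }
}
```

So the permissions problem and the seed-generation error are consistent: fix
the directory permissions and the sampling has data to draw from again.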

RE: Getting mahout to run on the DFS

Posted by Jeff Eastman <je...@Narus.com>.
Ok, very interesting. I think you are onto the root cause. I can't work on this until the weekend but will investigate further then.


On Thu, Oct 28, 2010 at 5:04 PM, Jeff Eastman <je...@narus.com> wrote:

> Have you checked the Hadoop logs? The only way I know to get more output
> would be to add some printouts to the code.
>
> -----Original Message-----
> From: Matt Spitz [mailto:mspitz@meebo-inc.com]
> Sent: Thursday, October 28, 2010 11:00 AM
> To: user@mahout.apache.org
> Subject: Re: Getting mahout to run on the DFS
>
> Is there a way to get more/nicer output so we can track this down?
>
> On Thu, Oct 28, 2010 at 1:57 PM, Jeff Eastman <je...@narus.com> wrote:
>
> > I'm puzzled too. I just unzipped the same distribution and it ran without
> > issues on my CDH3 unicluster. I'm running as my own userId, not
> > hadoop-anything, on a 64-bit RHEL 5.3 VM on VMware.
> >
> > -----Original Message-----
> > From: Matt Spitz [mailto:mspitz@meebo-inc.com]
> > Sent: Thursday, October 28, 2010 10:12 AM
> > To: user@mahout.apache.org
> > Subject: Re: Getting mahout to run on the DFS
> >
> > OK, using
> >
> >
> https://repository.apache.org/content/repositories/orgapachemahout-004/org/apache/mahout/mahout-distribution/0.4/mahout-distribution-0.4.zip
> >
> > Same error on a clean unzip.  Running 64-bit Linux CentOS 5.3.
> >
> > Running the lda command over hadoop yields results as expected.
> >  Puzzling...
> >
> > -Matt
> >
> > On Thu, Oct 28, 2010 at 12:52 PM, Jeff Eastman <je...@narus.com>
> wrote:
> >
> > > Don't recognize that zip. Can you try with the latest 0.4 RC at
> > >
> https://repository.apache.org/content/repositories/orgapachemahout-004/?
> > I
> > > just ran that successfully on my CDH3 unicluster. What OS are you
> > running?
> > > I'm on 64-bit Linux EL-5.
> > >
> > > -----Original Message-----
> > > From: Matt Spitz [mailto:mspitz@meebo-inc.com]
> > > Sent: Thursday, October 28, 2010 9:45 AM
> > > To: user@mahout.apache.org
> > > Subject: Re: Getting mahout to run on the DFS
> > >
> > > I'm running mahout-distribution-0.4-20101027.194721-1.zip on CDH3.
> >  Running
> > > examples/bin/build-reuters.sh from a clean unzip results in the same
> > error.
> > >  I definitely have read/write access to the DFS, as  reuters-seqdir and
> > > reuters-seqdir-sparse have been created correctly.
> > >
> > > Running locally with a clean unzip is fine.  It's just the
> > > running-on-the-DFS part that breaks when we try to cluster.
> > >
> > > Thanks,
> > > Matt
> > >
> > > On Thu, Oct 28, 2010 at 12:23 PM, Jeff Eastman <je...@narus.com>
> > wrote:
> > >
> > > > Maybe I missed it, but are you running on trunk? Can you run
> > > > examples/bin/build-reuters.sh out of the box? I'm running that
> > > successfully
> > > > on a CDH3 cluster, logged in as myself.
> > > >
> > > > Jeff
> > > >
> > > > -----Original Message-----
> > > > From: Matt Spitz [mailto:mspitz@meebo-inc.com]
> > > > Sent: Thursday, October 28, 2010 9:01 AM
> > > > To: user@mahout.apache.org
> > > > Subject: Re: Getting mahout to run on the DFS
> > > >
> > > > Hm.  So, I'm running the Cloudera Hadoop distribution, and I'm
> > > > running as a hadoop user.
> > > >
> > > > Here's my script (with MAHOUT_HOME, HADOOP_HOME, and HADOOP_CONF_DIR
> > > > specified):
> > > > ./bin/mahout seqdirectory -i ../reuters-out -o reuters-seqdir -c UTF-8 -chunk 5
> > > >
> > > > ./bin/mahout seq2sparse -i reuters-seqdir/ -o reuters-seqdir-sparse
> > > > ./bin/mahout kmeans -i reuters-seqdir-sparse/tf-vectors/ -o reuters-kmeans -k 20 --clustering --maxIter 10 --clusters reuters-clusters
> > > >
> > > > ../reuters-out is a local directory (see
> > > > https://issues.apache.org/jira/browse/MAHOUT-535)
> > > >
> > > > After running the third command, I see a non-empty reuters-clusters
> > > > directory on the DFS, so presumably the initial clusters are getting
> > > > created.
> > > >
> > > > These commands run fine in local mode, but no dice running on the DFS.
> > > > I even copied the reuters-clusters directory from the DFS to my local
> > > > machine, hoping that mahout was looking there, but I still got the
> > > > same error(s):
> > > > 10/10/28 08:57:55 INFO mapred.JobClient: Task Id :
> > > > attempt_201008241139_108731_m_000000_2, Status : FAILED
> > > > java.lang.IllegalStateException: No clusters found. Check your -c path.
> > > >        at org.apache.mahout.clustering.kmeans.KMeansMapper.setup(KMeansMapper.java:61)
> > > >        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:142)
> > > >        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
> > > >        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
> > > >        at org.apache.hadoop.mapred.Child.main(Child.java:170)
> > > >
> > > > 10/10/28 08:58:05 INFO mapred.JobClient: Job complete: job_201008241139_108731
> > > > 10/10/28 08:58:05 INFO mapred.JobClient: Counters: 4
> > > > 10/10/28 08:58:05 INFO mapred.JobClient:   Job Counters
> > > > 10/10/28 08:58:05 INFO mapred.JobClient:     Rack-local map tasks=2
> > > > 10/10/28 08:58:05 INFO mapred.JobClient:     Launched map tasks=4
> > > > 10/10/28 08:58:05 INFO mapred.JobClient:     Data-local map tasks=2
> > > > 10/10/28 08:58:05 INFO mapred.JobClient:     Failed map tasks=1
> > > > Exception in thread "main" java.lang.InterruptedException: K-Means
> > > > Iteration failed processing reuters-clusters/part-randomSeed
> > > >        at org.apache.mahout.clustering.kmeans.KMeansDriver.runIteration(KMeansDriver.java:342)
> > > >        at org.apache.mahout.clustering.kmeans.KMeansDriver.buildClustersMR(KMeansDriver.java:289)
> > > >        at org.apache.mahout.clustering.kmeans.KMeansDriver.buildClusters(KMeansDriver.java:214)
> > > >        at org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:143)
> > > >        at org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:101)
> > > >        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> > > >        at org.apache.mahout.clustering.kmeans.KMeansDriver.main(KMeansDriver.java:54)
> > > >        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> > > >        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> > > >        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> > > >        at java.lang.reflect.Method.invoke(Method.java:597)
> > > >        at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
> > > >        at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
> > > >        at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:184)
> > > >        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> > > >        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> > > >        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> > > >        at java.lang.reflect.Method.invoke(Method.java:597)
> > > >        at org.apache.hadoop.util.RunJar.main(RunJar.java:186)
> > > >
> > > > As a sanity check:
> > > > [mspitz@wowzers mahout-distribution-0.4-SNAPSHOT]$ hadoop dfs -du reuters-clusters
> > > > Found 1 items
> > > > 8246        hdfs://192.168.1.100:54310/user/mspitz/reuters-clusters/part-randomSeed
> > > >
> > > > By the way, I really really appreciate the help.  Thank you so much.
> > > >
> > > > -Matt
> > > >
> > > > On Thu, Oct 28, 2010 at 10:01 AM, Jeff Eastman
> > > > <jd...@windwardsolutions.com>wrote:
> > > >
> > > > > With k-means, the initial clusters directory can either 1) contain
> > > > > some initial clusters you produced somehow (a common method is via
> > > > > Canopy) or 2) be empty. In the empty case, however, you also need to
> > > > > specify the number of initial clusters (-k) so that your input data
> > > > > can be sampled and the initial clusters put into the empty directory.
> > > > > Note that if you do 1) and also specify -k, your initial clusters
> > > > > will be overwritten by k sampled values from your input data.
> > > > >
> > > > >
> > > > > On 10/28/10 3:35 AM, pragnesh radadia wrote:
> > > > >
> > > > >> Are you using the Cloudera Hadoop distribution?
> > > > >> If yes, run kmeans as the hadoop or hdfs user to solve your
> > > > >> problem
> > > > >>
> > > > >>
> > > > >> On Thu, Oct 28, 2010 at 3:45 AM, Matt Spitz<ms...@meebo-inc.com>
> > > >  wrote:
> > > > >>
> > > > >>> Bug report created!  Thanks!
> > > > >>>
> > > > >>> One more random question: when running kmeans, there's a required
> > > > >>> -c (initial clusters) argument.  All the examples I've seen using
> > > > >>> kmeans (namely https://issues.apache.org/jira/browse/MAHOUT-390)
> > > > >>> specify a non-existent directory (presumably the algorithm would
> > > > >>> select some initial random clusters).
> > > > >>>
> > > > >>> But, when specifying some initial, nonexistent clusters directory,
> > > > >>> I get a bunch of:
> > > > >>> 10/10/27 15:14:57 INFO mapred.JobClient: Task Id :
> > > > >>> attempt_201008241139_107461_m_000002_2, Status : FAILED
> > > > >>> java.lang.IllegalStateException: No clusters found. Check your -c path.
> > > > >>>        at org.apache.mahout.clustering.kmeans.KMeansMapper.setup(KMeansMapper.java:61)
> > > > >>>        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:142)
> > > > >>>        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
> > > > >>>        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
> > > > >>>        at org.apache.hadoop.mapred.Child.main(Child.java:170)
> > > > >>>
> > > > >>> And the job eventually fails with:
> > > > >>> Exception in thread "main" java.lang.InterruptedException: K-Means
> > > > >>> Iteration failed processing reuters-clusters/part-randomSeed
> > > > >>>        at org.apache.mahout.clustering.kmeans.KMeansDriver.runIteration(KMeansDriver.java:342)
> > > > >>>        at org.apache.mahout.clustering.kmeans.KMeansDriver.buildClustersMR(KMeansDriver.java:289)
> > > > >>>        at org.apache.mahout.clustering.kmeans.KMeansDriver.buildClusters(KMeansDriver.java:214)
> > > > >>>        at org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:143)
> > > > >>>        at org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:101)
> > > > >>>        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> > > > >>>        at org.apache.mahout.clustering.kmeans.KMeansDriver.main(KMeansDriver.java:54)
> > > > >>>        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> > > > >>>        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> > > > >>>        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> > > > >>>        at java.lang.reflect.Method.invoke(Method.java:597)
> > > > >>>        at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
> > > > >>>        at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
> > > > >>>        at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:184)
> > > > >>>        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> > > > >>>        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> > > > >>>        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> > > > >>>        at java.lang.reflect.Method.invoke(Method.java:597)
> > > > >>>        at org.apache.hadoop.util.RunJar.main(RunJar.java:186)
> > > > >>>
> > > > >>> Any thoughts on this one?
> > > > >>>
> > > > >>> Cheers,
> > > > >>> Matt
> > > > >>>
> > > > >>> On Wed, Oct 27, 2010 at 5:39 PM, Ted Dunning <ted.dunning@gmail.com> wrote:
> > > > >>>
> > > > >>>> On Wed, Oct 27, 2010 at 2:04 PM, Matt Spitz <mspitz@meebo-inc.com> wrote:
> > > > >>>>
> > > > >>>>> Ted- Awesome!  Thanks!  Running with mahout-0.4 sorted things
> > > > >>>>> out rather nicely.
> > > > >>>>
> > > > >>>> There were lots of improvements there.
> > > > >>>>
> > > > >>>>> One thing that I find really weird is that 'mahout seqdirectory'
> > > > >>>>> always hits the local filesystem for input, even when running in
> > > > >>>>> Hadoop mode.  So, if I have 'myurls' on the DFS, but that path
> > > > >>>>> doesn't exist locally, seqdirectory creates an empty sequence
> > > > >>>>> file (with no error).  Is this expected?
> > > > >>>>
> > > > >>>> No.  That sounds like a bug.  Can you file a report here:
> > > > >>>> https://issues.apache.org/jira/browse/MAHOUT ?
> > > > >>>>
> > > > >>>>> Is there a nice way to create sequence files that isn't
> > > > >>>>> seqdirectory?  I'd like to do a little processing on the
> > > > >>>>> documents as they get sent to the sequence file without having
> > > > >>>>> to generate a second copy on the DFS.
> > > > >>>>
> > > > >>>> Sure.  Just snarf the code from the program in question and
> > > > >>>> massage it as you like.  The command line versions are handy, but
> > > > >>>> it is very common to need to customize.  At that point, the
> > > > >>>> command line programs serve as example code.  You don't have to
> > > > >>>> use them and they have no magic.
> > > > >>>>
> > > > >>>> If you think you have some improvements in generality, we can
> > > > >>>> push them back into the Mahout versions.
> > > > >>>>
> > > > >
> > > >
> > >
> >
>
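Jeff's point in the quoted thread above about the -c / -k contract can be
restated as a small sketch (hypothetical code, not Mahout's implementation):
with -k, the input is sampled and any existing seeds are overwritten; without
-k, the -c directory must already contain clusters.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class SeedSketch {
    // k == null models running without -k: the clusters dir must already hold
    // seeds.  k != null models -k: sample k inputs, overwriting existing seeds.
    static List<double[]> initialClusters(List<double[]> input, List<double[]> existing, Integer k) {
        if (k != null) {
            List<double[]> shuffled = new ArrayList<>(input);
            Collections.shuffle(shuffled, new Random(42));
            return new ArrayList<>(shuffled.subList(0, Math.min(k, shuffled.size())));
        }
        if (existing == null || existing.isEmpty()) {
            throw new IllegalStateException("No clusters found. Check your -c path.");
        }
        return existing;
    }

    public static void main(String[] args) {
        List<double[]> input = new ArrayList<>();
        for (int i = 0; i < 100; i++) input.add(new double[]{i});
        // -k 20 with an empty -c directory: 20 sampled seeds.
        System.out.println(initialClusters(input, new ArrayList<double[]>(), 20).size()); // 20
        // No -k and an empty/missing -c directory: the failure seen in this thread.
        try {
            initialClusters(input, new ArrayList<double[]>(), null);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Note that the failing runs in this thread do pass -k and part-randomSeed does
exist, so the error looks less like missing seeds and more like the mapper
resolving the seed path against the wrong filesystem.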

Re: Getting mahout to run on the DFS

Posted by Matt Spitz <ms...@meebo-inc.com>.
OK!  Further investigation!

Running with --method sequential works just fine, so my guess is that the
problem lies in the difference in how we set the clustersIn parameter in
clusterDataMR() versus clusterDataSeq().

According to this line:
10/10/29 09:26:20 INFO kmeans.KMeansDriver: Input:
examples/bin/work/reuters-out-seqdir-sparse/tfidf-vectors Clusters In:
examples/bin/work/clusters/part-randomSeed Out:
examples/bin/work/reuters-kmeans Distance:
org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure

CLUSTERS_IN_OPTION is 'examples/bin/work/clusters/part-randomSeed'
Both start with:
Path clustersIn = new
Path(getOption(DefaultOptionsCreator.CLUSTERS_IN_OPTION));

And they proceed as follows:

clusterDataSeq():
KMeansUtil.configureWithClusterInfo(clustersIn, clusters);
... check to see if clusters is empty ...
... proceed with clustering ...

clusterDataMR():
conf.set(KMeansConfigKeys.CLUSTER_PATH_KEY, clustersIn.toString());
-- in KMeansMapper.setup():
String clusterPath = conf.get(KMeansConfigKeys.CLUSTER_PATH_KEY);
KMeansUtil.configureWithClusterInfo(new Path(clusterPath), clusters);

I wonder if there might be something fishy with converting path to and from
a string?
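The round trip itself is easy to rule out in isolation; a relative path
survives toString() and re-parsing unchanged. What does differ between the
driver and the mapper is the base filesystem an unqualified path is resolved
against. A quick stdlib sketch of the distinction, with java.net.URI standing
in for Hadoop's Path (the hostnames here are illustrative):

```java
import java.net.URI;

public class PathRoundTrip {
    public static void main(String[] args) {
        // Round-tripping a relative path through a String loses nothing by itself...
        URI clustersIn = URI.create("examples/bin/work/clusters/part-randomSeed");
        URI roundTripped = URI.create(clustersIn.toString());
        System.out.println(clustersIn.equals(roundTripped)); // true

        // ...what changes between the driver and the mapper is the base the
        // path is resolved against: an unqualified path is interpreted
        // relative to whichever default filesystem the reading side sees.
        URI driverSide = URI.create("hdfs://namenode:54310/user/mspitz/").resolve(clustersIn);
        URI mapperSide = URI.create("file:/home/mspitz/").resolve(clustersIn);
        System.out.println(driverSide.equals(mapperSide)); // false
    }
}
```

So a more promising suspect than the String conversion is which configuration
(and therefore which default filesystem) the unqualified path is bound to on
each side.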

Can you modify bin/mahout to print the classpath and the command it runs?

echo "Classpath: $HADOOP_CLASSPATH"
echo "Command:   $HADOOP_HOME/bin/hadoop jar $MAHOUT_JOB $CLASS $@"
exec "$HADOOP_HOME/bin/hadoop" jar $MAHOUT_JOB $CLASS "$@"

Yields:
Classpath: /home/mspitz/mahoutplayground/mahout-distribution-0.4/conf:
Command:   /usr/lib/hadoop-0.20/bin/hadoop jar
/home/mspitz/mahoutplayground/mahout-distribution-0.4/mahout-examples-0.4-job.jar
org.apache.mahout.driver.MahoutDriver clusterdump -s
examples/bin/work/reuters-kmeans/clusters-10 -d
examples/bin/work/reuters-out-seqdir-sparse/dictionary.file-0 -dt
sequencefile -b 100 -n 20

Thanks,
Matt

> > > >        at
> > > org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:184)
> > > >        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> > > >        at
> > > >
> > > >
> > >
> >
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> > > >        at
> > > >
> > > >
> > >
> >
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> > > >        at java.lang.reflect.Method.invoke(Method.java:597)
> > > >        at org.apache.hadoop.util.RunJar.main(RunJar.java:186)
> > > >
> > > > As a sanity check:
> > > > [mspitz@wowzers mahout-distribution-0.4-SNAPSHOT]$ hadoop dfs -du
> > > > reuters-clusters
> > > > Found 1 items
> > > > 8246        hdfs://
> > > > 192.168.1.100:54310/user/mspitz/reuters-clusters/part-randomSeed
> > > >
> > > > By the way, I really really appreciate the help.  Thank you so much.
> > > >
> > > > -Matt
> > > >
> > > > On Thu, Oct 28, 2010 at 10:01 AM, Jeff Eastman
> > > > <jd...@windwardsolutions.com>wrote:
> > > >
> > > > > With k-means, the initial clusters directory can either 1) contain
> > > > > some initial clusters you produced somehow (a common method is via
> > > > > Canopy) or 2) be empty. In the empty case, however, you also need
> > > > > to specify the number of initial clusters (-k) so that your input
> > > > > data can be sampled and the initial clusters put into the empty
> > > > > directory. Note that if you do 1) and also specify -k, your initial
> > > > > clusters will be overwritten by k values sampled from your input
> > > > > data.
> > > > >
> > > > >
> > > > > On 10/28/10 3:35 AM, pragnesh radadia wrote:
> > > > >
> > > > >> Are you using the Cloudera hadoop distribution?
> > > > >> If yes, then run kmeans as the hadoop or hdfs user to solve your
> > > > >> problem.
> > > > >>
> > > > >>
> > > > >> On Thu, Oct 28, 2010 at 3:45 AM, Matt Spitz<ms...@meebo-inc.com>
> > > >  wrote:
> > > > >>
> > > > >>> Bug report created!  Thanks!
> > > > >>>
> > > > >>> One more random question: when running kmeans, there's a required
> > > > >>> -c (initial clusters) argument.  All the examples I've seen using
> > > > >>> kmeans (namely https://issues.apache.org/jira/browse/MAHOUT-390)
> > > > >>> specify a non-existent directory (presumably the algorithm would
> > > > >>> select some initial random clusters).
> > > > >>>
> > > > >>> But, when specifying some initial, nonexistent clusters directory,
> > > > >>> I get a bunch of:
> > > > >>> 10/10/27 15:14:57 INFO mapred.JobClient: Task Id :
> > > > >>> attempt_201008241139_107461_m_000002_2, Status : FAILED
> > > > >>> java.lang.IllegalStateException: No clusters found. Check your -c
> > > path.
> > > > >>>        at
> > > > >>>
> > > > >>>
> > > >
> > >
> >
> org.apache.mahout.clustering.kmeans.KMeansMapper.setup(KMeansMapper.java:61)
> > > > >>>        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:142)
> > > > >>>        at
> > > > org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
> > > > >>>        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
> > > > >>>        at org.apache.hadoop.mapred.Child.main(Child.java:170)
> > > > >>>
> > > > >>> And the job eventually fails with:
> > > > >>> Exception in thread "main" java.lang.InterruptedException:
> K-Means
> > > > >>> Iteration
> > > > >>> failed processing reuters-clusters/part-randomSeed
> > > > >>>        at
> > > > >>>
> > > > >>>
> > > >
> > >
> >
> org.apache.mahout.clustering.kmeans.KMeansDriver.runIteration(KMeansDriver.java:342)
> > > > >>>        at
> > > > >>>
> > > > >>>
> > > >
> > >
> >
> org.apache.mahout.clustering.kmeans.KMeansDriver.buildClustersMR(KMeansDriver.java:289)
> > > > >>>        at
> > > > >>>
> > > > >>>
> > > >
> > >
> >
> org.apache.mahout.clustering.kmeans.KMeansDriver.buildClusters(KMeansDriver.java:214)
> > > > >>>        at
> > > > >>>
> > > > >>>
> > > >
> > >
> >
> org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:143)
> > > > >>>        at
> > > > >>>
> > > > >>>
> > > >
> > >
> >
> org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:101)
> > > > >>>        at
> org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> > > > >>>        at
> > > > >>>
> > > > >>>
> > > >
> > >
> >
> org.apache.mahout.clustering.kmeans.KMeansDriver.main(KMeansDriver.java:54)
> > > > >>>        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native
> > Method)
> > > > >>>        at
> > > > >>>
> > > > >>>
> > > >
> > >
> >
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> > > > >>>        at
> > > > >>>
> > > > >>>
> > > >
> > >
> >
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> > > > >>>        at java.lang.reflect.Method.invoke(Method.java:597)
> > > > >>>        at
> > > > >>>
> > > > >>>
> > > >
> > >
> >
> org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
> > > > >>>        at
> > > > >>>
> org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
> > > > >>>        at
> > > > >>> org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:184)
> > > > >>>        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native
> > Method)
> > > > >>>        at
> > > > >>>
> > > > >>>
> > > >
> > >
> >
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> > > > >>>        at
> > > > >>>
> > > > >>>
> > > >
> > >
> >
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> > > > >>>        at java.lang.reflect.Method.invoke(Method.java:597)
> > > > >>>        at org.apache.hadoop.util.RunJar.main(RunJar.java:186)
> > > > >>>
> > > > >>> Any thoughts on this one?
> > > > >>>
> > > > >>> Cheers,
> > > > >>> Matt
> > > > >>>
> > > > >>> On Wed, Oct 27, 2010 at 5:39 PM, Ted Dunning<
> ted.dunning@gmail.com
> > >
> > > > >>>  wrote:
> > > > >>>
> > > > >>>  On Wed, Oct 27, 2010 at 2:04 PM, Matt Spitz<
> mspitz@meebo-inc.com>
> > > > >>>>  wrote:
> > > > >>>>
> > > > >>>>> Ted- Awesome!  Thanks!  Running with mahout-0.4 sorted things
> > > > >>>>> out rather nicely.
> > > > >>>>>
> > > > >>>>>  There were lots of improvements there.
> > > > >>>>
> > > > >>>>
> > > > >>>>> One thing that I find really weird is that 'mahout seqdirectory'
> > > > >>>>> always hits the local filesystem for input, even when running in
> > > > >>>>> Hadoop mode.  So, if I have 'myurls' on the DFS, but that path
> > > > >>>>> doesn't exist locally, seqdirectory creates an empty sequence
> > > > >>>>> file (with no error).  Is this expected?
> > > > >>>>>
> > > > >>>>>  No.  That sounds like a bug.  Can you file a report here:
> > > > >>>> https://issues.apache.org/jira/browse/MAHOUT ?
> > > > >>>>
> > > > >>>>
> > > > >>>>> Is there a nice way to create sequence files that isn't
> > > > >>>>> seqdirectory?  I'd like to do a little processing on the
> > > > >>>>> documents as they get sent to the sequence file without having
> > > > >>>>> to generate a second copy on the DFS.
> > > > >>>>>
> > > > >>>> Sure.  Just snarf the code from the program in question and
> > > > >>>> massage it as you like.  The command line versions are handy,
> > > > >>>> but it is very common to need to customize.  At that point, the
> > > > >>>> command line programs serve as example code.  You don't have to
> > > > >>>> use them and they have no magic.
> > > > >>>>
> > > > >>>> If you think you have some improvements in generality, we can
> > > > >>>> push them back into the Mahout versions.
> > > > >>>>
> > > > >>>>
> > > > >
> > > >
> > >
> >
>
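To make Jeff's point about -c and -k (quoted above) concrete, the two supported patterns look roughly like this. This is a sketch, not verbatim 0.4 syntax: the option names and the `reuters-*` paths are carried over from the commands quoted earlier in the thread, so check `./bin/mahout kmeans --help` for the exact flags in your build.

```shell
# Pattern 1: seed -c with real clusters (e.g. from canopy); do NOT also
# pass -k, or the canopy centroids get overwritten by k sampled points.
./bin/mahout canopy -i reuters-seqdir-sparse/tf-vectors -o reuters-canopy \
    -t1 500 -t2 250
./bin/mahout kmeans -i reuters-seqdir-sparse/tf-vectors \
    -c reuters-canopy/clusters-0 -o reuters-kmeans --maxIter 10

# Pattern 2: point -c at an empty directory and pass -k, so k initial
# centroids are sampled from the input and written there first.
./bin/mahout kmeans -i reuters-seqdir-sparse/tf-vectors \
    -c reuters-clusters -k 20 -o reuters-kmeans --maxIter 10
```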

RE: Getting mahout to run on the DFS

Posted by Jeff Eastman <je...@Narus.com>.
Have you checked the Hadoop logs? The only way I know to get more output would be to add some printouts to the code.
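For what it's worth, here is a self-contained sketch of that kind of log digging. The attempt directory below is mocked up so the grep pattern can be shown end to end; on a real cluster the per-attempt logs live under the TaskTracker's userlogs directory, and the exact path depends on your distribution.

```shell
# Mock a task-attempt log directory; point LOGDIR at your real
# TaskTracker userlogs path on an actual cluster.
LOGDIR=$(mktemp -d)
ATTEMPT="attempt_201008241139_108731_m_000000_2"
mkdir -p "$LOGDIR/$ATTEMPT"
printf '%s\n' \
  'java.lang.IllegalStateException: No clusters found. Check your -c path.' \
  > "$LOGDIR/$ATTEMPT/syslog"

# Pull the exception lines out of every attempt's logs:
grep -r "Exception" "$LOGDIR"
rm -rf "$LOGDIR"
```

The task-side stack trace in those per-attempt files is usually far more informative than the one-line FAILED status the JobClient prints.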

-----Original Message-----
From: Matt Spitz [mailto:mspitz@meebo-inc.com]
Sent: Thursday, October 28, 2010 11:00 AM
To: user@mahout.apache.org
Subject: Re: Getting mahout to run on the DFS

Is there a way to get more/nicer output so we can track this down?

On Thu, Oct 28, 2010 at 1:57 PM, Jeff Eastman <je...@narus.com> wrote:

> I'm puzzled too. I just unzipped the same distribution and it ran without
> issues on my CDH3 unicluster. I'm running as my own userId, not
> hadoop-anything, on a RHEL 5.3 64-bit VM on VMware.
>
> -----Original Message-----
> From: Matt Spitz [mailto:mspitz@meebo-inc.com]
> Sent: Thursday, October 28, 2010 10:12 AM
> To: user@mahout.apache.org
> Subject: Re: Getting mahout to run on the DFS
>
> OK, using
>
> https://repository.apache.org/content/repositories/orgapachemahout-004/org/apache/mahout/mahout-distribution/0.4/mahout-distribution-0.4.zip
>
> Same error on a clean unzip.  Running 64-bit Linux CentOS 5.3.
>
> Running the lda command over hadoop yields results as expected.
>  Puzzling...
>
> -Matt
>
> On Thu, Oct 28, 2010 at 12:52 PM, Jeff Eastman <je...@narus.com> wrote:
>
> > Don't recognize that zip. Can you try with the latest 0.4 RC at
> > https://repository.apache.org/content/repositories/orgapachemahout-004/?
> > I just ran that successfully on my CDH3 unicluster. What OS are you
> > running? I'm on 64-bit Linux EL-5.
> >
> > -----Original Message-----
> > From: Matt Spitz [mailto:mspitz@meebo-inc.com]
> > Sent: Thursday, October 28, 2010 9:45 AM
> > To: user@mahout.apache.org
> > Subject: Re: Getting mahout to run on the DFS
> >
> > I'm running mahout-distribution-0.4-20101027.194721-1.zip on CDH3.
> > Running examples/bin/build-reuters.sh from a clean unzip results in the
> > same error. I definitely have read/write access to the DFS, as
> > reuters-seqdir and reuters-seqdir-sparse have been created correctly.
> >
> > Running locally with a clean unzip is fine.  It's just the
> > running-on-the-DFS part that breaks when we try to cluster.
> >
> > Thanks,
> > Matt
> >
> > On Thu, Oct 28, 2010 at 12:23 PM, Jeff Eastman <je...@narus.com>
> wrote:
> >
> > > Maybe I missed it, but are you running on trunk? Can you run
> > > examples/bin/build-reuters.sh out of the box? I'm running that
> > > successfully on a CDH3 cluster logged-in as myself.
> > >
> > > Jeff
> > >
> > > -----Original Message-----
> > > From: Matt Spitz [mailto:mspitz@meebo-inc.com]
> > > Sent: Thursday, October 28, 2010 9:01 AM
> > > To: user@mahout.apache.org
> > > Subject: Re: Getting mahout to run on the DFS
> > >
> > > Hm.  So, I'm running the Cloudera hadoop distribution, and I'm
> > > running as a hadoop user.
> > >
> > > Here's my script (with MAHOUT_HOME, HADOOP_HOME, and HADOOP_CONF_DIR
> > > specified):
> > > ./bin/mahout seqdirectory -i ../reuters-out -o reuters-seqdir -c UTF-8
> > > -chunk 5
> > >
> > > ./bin/mahout seq2sparse -i reuters-seqdir/ -o reuters-seqdir-sparse
> > > ./bin/mahout kmeans -i reuters-seqdir-sparse/tf-vectors/ -o
> > > reuters-kmeans -k 20 --clustering --maxIter 10 --clusters
> > > reuters-clusters
> > >
> > > ../reuters-out is a local directory (see
> > > https://issues.apache.org/jira/browse/MAHOUT-535)
> > >
> > > After running the third command, I see a non-empty reuters-clusters
> > > directory on the DFS, so presumably the initial clusters are getting
> > > created.
> > >
> > > These commands run fine in local mode, but no dice running on the
> > > DFS.  I even copied the reuters-clusters directory from the DFS to my
> > > local machine, hoping that mahout was looking there, but I still got
> > > the same error(s):
> > > 10/10/28 08:57:55 INFO mapred.JobClient: Task Id :
> > > attempt_201008241139_108731_m_000000_2, Status : FAILED
> > > java.lang.IllegalStateException: No clusters found. Check your -c path.
> > >        at
> > >
> > >
> >
> org.apache.mahout.clustering.kmeans.KMeansMapper.setup(KMeansMapper.java:61)
> > >        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:142)
> > >        at
> org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
> > >        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
> > >        at org.apache.hadoop.mapred.Child.main(Child.java:170)
> > >
> > > 10/10/28 08:58:05 INFO mapred.JobClient: Job complete:
> > > job_201008241139_108731
> > > 10/10/28 08:58:05 INFO mapred.JobClient: Counters: 4
> > > 10/10/28 08:58:05 INFO mapred.JobClient:   Job Counters
> > > 10/10/28 08:58:05 INFO mapred.JobClient:     Rack-local map tasks=2
> > > 10/10/28 08:58:05 INFO mapred.JobClient:     Launched map tasks=4
> > > 10/10/28 08:58:05 INFO mapred.JobClient:     Data-local map tasks=2
> > > 10/10/28 08:58:05 INFO mapred.JobClient:     Failed map tasks=1
> > > Exception in thread "main" java.lang.InterruptedException: K-Means
> > > Iteration
> > > failed processing reuters-clusters/part-randomSeed
> > >        at
> > >
> > >
> >
> org.apache.mahout.clustering.kmeans.KMeansDriver.runIteration(KMeansDriver.java:342)
> > >        at
> > >
> > >
> >
> org.apache.mahout.clustering.kmeans.KMeansDriver.buildClustersMR(KMeansDriver.java:289)
> > >        at
> > >
> > >
> >
> org.apache.mahout.clustering.kmeans.KMeansDriver.buildClusters(KMeansDriver.java:214)
> > >        at
> > >
> >
> org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:143)
> > >        at
> > >
> >
> org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:101)
> > >        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> > >        at
> > >
> >
> org.apache.mahout.clustering.kmeans.KMeansDriver.main(KMeansDriver.java:54)
> > >        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> > >        at
> > >
> > >
> >
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> > >        at
> > >
> > >
> >
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> > >        at java.lang.reflect.Method.invoke(Method.java:597)
> > >        at
> > >
> > >
> >
> org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
> > >        at
> > > org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
> > >        at
> > org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:184)
> > >        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> > >        at
> > >
> > >
> >
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> > >        at
> > >
> > >
> >
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> > >        at java.lang.reflect.Method.invoke(Method.java:597)
> > >        at org.apache.hadoop.util.RunJar.main(RunJar.java:186)
> > >
> > > As a sanity check:
> > > [mspitz@wowzers mahout-distribution-0.4-SNAPSHOT]$ hadoop dfs -du
> > > reuters-clusters
> > > Found 1 items
> > > 8246        hdfs://
> > > 192.168.1.100:54310/user/mspitz/reuters-clusters/part-randomSeed
> > >
> > > By the way, I really really appreciate the help.  Thank you so much.
> > >
> > > -Matt
> > >
> > > On Thu, Oct 28, 2010 at 10:01 AM, Jeff Eastman
> > > <jd...@windwardsolutions.com>wrote:
> > >
> > > > With k-means, the initial clusters directory can either 1) contain
> > > > some initial clusters you produced somehow (a common method is via
> > > > Canopy) or 2) be empty. In the empty case, however, you also need
> > > > to specify the number of initial clusters (-k) so that your input
> > > > data can be sampled and the initial clusters put into the empty
> > > > directory. Note that if you do 1) and also specify -k, your initial
> > > > clusters will be overwritten by k values sampled from your input
> > > > data.
> > > >
> > > >
> > > > On 10/28/10 3:35 AM, pragnesh radadia wrote:
> > > >
> > > >> Are you using the Cloudera hadoop distribution?
> > > >> If yes, then run kmeans as the hadoop or hdfs user to solve your
> > > >> problem.
> > > >>
> > > >>
> > > >> On Thu, Oct 28, 2010 at 3:45 AM, Matt Spitz<ms...@meebo-inc.com>
> > >  wrote:
> > > >>
> > > >>> Bug report created!  Thanks!
> > > >>>
> > > >>> One more random question: when running kmeans, there's a required
> > > >>> -c (initial clusters) argument.  All the examples I've seen using
> > > >>> kmeans (namely https://issues.apache.org/jira/browse/MAHOUT-390)
> > > >>> specify a non-existent directory (presumably the algorithm would
> > > >>> select some initial random clusters).
> > > >>>
> > > >>> But, when specifying some initial, nonexistent clusters directory,
> > > >>> I get a bunch of:
> > > >>> 10/10/27 15:14:57 INFO mapred.JobClient: Task Id :
> > > >>> attempt_201008241139_107461_m_000002_2, Status : FAILED
> > > >>> java.lang.IllegalStateException: No clusters found. Check your -c
> > path.
> > > >>>        at
> > > >>>
> > > >>>
> > >
> >
> org.apache.mahout.clustering.kmeans.KMeansMapper.setup(KMeansMapper.java:61)
> > > >>>        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:142)
> > > >>>        at
> > > org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
> > > >>>        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
> > > >>>        at org.apache.hadoop.mapred.Child.main(Child.java:170)
> > > >>>
> > > >>> And the job eventually fails with:
> > > >>> Exception in thread "main" java.lang.InterruptedException: K-Means
> > > >>> Iteration
> > > >>> failed processing reuters-clusters/part-randomSeed
> > > >>>        at
> > > >>>
> > > >>>
> > >
> >
> org.apache.mahout.clustering.kmeans.KMeansDriver.runIteration(KMeansDriver.java:342)
> > > >>>        at
> > > >>>
> > > >>>
> > >
> >
> org.apache.mahout.clustering.kmeans.KMeansDriver.buildClustersMR(KMeansDriver.java:289)
> > > >>>        at
> > > >>>
> > > >>>
> > >
> >
> org.apache.mahout.clustering.kmeans.KMeansDriver.buildClusters(KMeansDriver.java:214)
> > > >>>        at
> > > >>>
> > > >>>
> > >
> >
> org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:143)
> > > >>>        at
> > > >>>
> > > >>>
> > >
> >
> org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:101)
> > > >>>        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> > > >>>        at
> > > >>>
> > > >>>
> > >
> >
> org.apache.mahout.clustering.kmeans.KMeansDriver.main(KMeansDriver.java:54)
> > > >>>        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native
> Method)
> > > >>>        at
> > > >>>
> > > >>>
> > >
> >
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> > > >>>        at
> > > >>>
> > > >>>
> > >
> >
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> > > >>>        at java.lang.reflect.Method.invoke(Method.java:597)
> > > >>>        at
> > > >>>
> > > >>>
> > >
> >
> org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
> > > >>>        at
> > > >>> org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
> > > >>>        at
> > > >>> org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:184)
> > > >>>        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native
> Method)
> > > >>>        at
> > > >>>
> > > >>>
> > >
> >
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> > > >>>        at
> > > >>>
> > > >>>
> > >
> >
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> > > >>>        at java.lang.reflect.Method.invoke(Method.java:597)
> > > >>>        at org.apache.hadoop.util.RunJar.main(RunJar.java:186)
> > > >>>
> > > >>> Any thoughts on this one?
> > > >>>
> > > >>> Cheers,
> > > >>> Matt
> > > >>>
> > > >>> On Wed, Oct 27, 2010 at 5:39 PM, Ted Dunning<ted.dunning@gmail.com
> >
> > > >>>  wrote:
> > > >>>
> > > >>>  On Wed, Oct 27, 2010 at 2:04 PM, Matt Spitz<ms...@meebo-inc.com>
> > > >>>>  wrote:
> > > >>>>
> > > >>>>> Ted- Awesome!  Thanks!  Running with mahout-0.4 sorted things
> > > >>>>> out rather nicely.
> > > >>>>>
> > > >>>>>  There were lots of improvements there.
> > > >>>>
> > > >>>>
> > > >>>>> One thing that I find really weird is that 'mahout seqdirectory'
> > > >>>>> always hits the local filesystem for input, even when running in
> > > >>>>> Hadoop mode.  So, if I have 'myurls' on the DFS, but that path
> > > >>>>> doesn't exist locally, seqdirectory creates an empty sequence
> > > >>>>> file (with no error).  Is this expected?
> > > >>>>>
> > > >>>>>  No.  That sounds like a bug.  Can you file a report here:
> > > >>>> https://issues.apache.org/jira/browse/MAHOUT ?
> > > >>>>
> > > >>>>
> > > >>>>> Is there a nice way to create sequence files that isn't
> > > >>>>> seqdirectory?  I'd like to do a little processing on the
> > > >>>>> documents as they get sent to the sequence file without having
> > > >>>>> to generate a second copy on the DFS.
> > > >>>>>
> > > >>>> Sure.  Just snarf the code from the program in question and
> > > >>>> massage it as you like.  The command line versions are handy,
> > > >>>> but it is very common to need to customize.  At that point, the
> > > >>>> command line programs serve as example code.  You don't have to
> > > >>>> use them and they have no magic.
> > > >>>>
> > > >>>> If you think you have some improvements in generality, we can
> > > >>>> push them back into the Mahout versions.
> > > >>>>
> > > >>>>
> > > >
> > >
> >
>

Re: Getting mahout to run on the DFS

Posted by Matt Spitz <ms...@meebo-inc.com>.
Is there a way to get more/nicer output so we can track this down?

On Thu, Oct 28, 2010 at 1:57 PM, Jeff Eastman <je...@narus.com> wrote:

> I'm puzzled too. I just unzipped the same distribution and it ran without
> issues on my CDH3 unicluster. I'm running as my own userId, not
> hadoop-anything, on a RHEL 5.3 64-bit VM on VMware.
>
> -----Original Message-----
> From: Matt Spitz [mailto:mspitz@meebo-inc.com]
> Sent: Thursday, October 28, 2010 10:12 AM
> To: user@mahout.apache.org
> Subject: Re: Getting mahout to run on the DFS
>
> OK, using
>
> https://repository.apache.org/content/repositories/orgapachemahout-004/org/apache/mahout/mahout-distribution/0.4/mahout-distribution-0.4.zip
>
> Same error on a clean unzip.  Running 64-bit Linux CentOS 5.3.
>
> Running the lda command over hadoop yields results as expected.
>  Puzzling...
>
> -Matt
>
> On Thu, Oct 28, 2010 at 12:52 PM, Jeff Eastman <je...@narus.com> wrote:
>
> > Don't recognize that zip. Can you try with the latest 0.4 RC at
> > https://repository.apache.org/content/repositories/orgapachemahout-004/?
> > I just ran that successfully on my CDH3 unicluster. What OS are you
> > running? I'm on 64-bit Linux EL-5.
> >
> > -----Original Message-----
> > From: Matt Spitz [mailto:mspitz@meebo-inc.com]
> > Sent: Thursday, October 28, 2010 9:45 AM
> > To: user@mahout.apache.org
> > Subject: Re: Getting mahout to run on the DFS
> >
> > I'm running mahout-distribution-0.4-20101027.194721-1.zip on CDH3.
> > Running examples/bin/build-reuters.sh from a clean unzip results in the
> > same error. I definitely have read/write access to the DFS, as
> > reuters-seqdir and reuters-seqdir-sparse have been created correctly.
> >
> > Running locally with a clean unzip is fine.  It's just the
> > running-on-the-DFS part that breaks when we try to cluster.
> >
> > Thanks,
> > Matt
> >
> > On Thu, Oct 28, 2010 at 12:23 PM, Jeff Eastman <je...@narus.com>
> wrote:
> >
> > > Maybe I missed it, but are you running on trunk? Can you run
> > > examples/bin/build-reuters.sh out of the box? I'm running that
> > > successfully on a CDH3 cluster logged-in as myself.
> > >
> > > Jeff
> > >
> > > -----Original Message-----
> > > From: Matt Spitz [mailto:mspitz@meebo-inc.com]
> > > Sent: Thursday, October 28, 2010 9:01 AM
> > > To: user@mahout.apache.org
> > > Subject: Re: Getting mahout to run on the DFS
> > >
> > > Hm.  So, I'm running the Cloudera hadoop distribution, and I'm
> > > running as a hadoop user.
> > >
> > > Here's my script (with MAHOUT_HOME, HADOOP_HOME, and HADOOP_CONF_DIR
> > > specified):
> > > ./bin/mahout seqdirectory -i ../reuters-out -o reuters-seqdir -c UTF-8
> > > -chunk 5
> > >
> > > ./bin/mahout seq2sparse -i reuters-seqdir/ -o reuters-seqdir-sparse
> > > ./bin/mahout kmeans -i reuters-seqdir-sparse/tf-vectors/ -o
> > > reuters-kmeans -k 20 --clustering --maxIter 10 --clusters
> > > reuters-clusters
> > >
> > > ../reuters-out is a local directory (see
> > > https://issues.apache.org/jira/browse/MAHOUT-535)
> > >
> > > After running the third command, I see a non-empty reuters-clusters
> > > directory on the DFS, so presumably the initial clusters are getting
> > > created.
> > >
> > > These commands run fine in local mode, but no dice running on the
> > > DFS.  I even copied the reuters-clusters directory from the DFS to my
> > > local machine, hoping that mahout was looking there, but I still got
> > > the same error(s):
> > > 10/10/28 08:57:55 INFO mapred.JobClient: Task Id :
> > > attempt_201008241139_108731_m_000000_2, Status : FAILED
> > > java.lang.IllegalStateException: No clusters found. Check your -c path.
> > >        at
> > >
> > >
> >
> org.apache.mahout.clustering.kmeans.KMeansMapper.setup(KMeansMapper.java:61)
> > >        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:142)
> > >        at
> org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
> > >        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
> > >        at org.apache.hadoop.mapred.Child.main(Child.java:170)
> > >
> > > 10/10/28 08:58:05 INFO mapred.JobClient: Job complete:
> > > job_201008241139_108731
> > > 10/10/28 08:58:05 INFO mapred.JobClient: Counters: 4
> > > 10/10/28 08:58:05 INFO mapred.JobClient:   Job Counters
> > > 10/10/28 08:58:05 INFO mapred.JobClient:     Rack-local map tasks=2
> > > 10/10/28 08:58:05 INFO mapred.JobClient:     Launched map tasks=4
> > > 10/10/28 08:58:05 INFO mapred.JobClient:     Data-local map tasks=2
> > > 10/10/28 08:58:05 INFO mapred.JobClient:     Failed map tasks=1
> > > Exception in thread "main" java.lang.InterruptedException: K-Means
> > > Iteration
> > > failed processing reuters-clusters/part-randomSeed
> > >        at
> > >
> > >
> >
> org.apache.mahout.clustering.kmeans.KMeansDriver.runIteration(KMeansDriver.java:342)
> > >        at
> > >
> > >
> >
> org.apache.mahout.clustering.kmeans.KMeansDriver.buildClustersMR(KMeansDriver.java:289)
> > >        at
> > >
> > >
> >
> org.apache.mahout.clustering.kmeans.KMeansDriver.buildClusters(KMeansDriver.java:214)
> > >        at
> > >
> >
> org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:143)
> > >        at
> > >
> >
> org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:101)
> > >        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> > >        at
> > >
> >
> org.apache.mahout.clustering.kmeans.KMeansDriver.main(KMeansDriver.java:54)
> > >        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> > >        at
> > >
> > >
> >
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> > >        at
> > >
> > >
> >
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> > >        at java.lang.reflect.Method.invoke(Method.java:597)
> > >        at
> > >
> > >
> >
> org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
> > >        at
> > > org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
> > >        at
> > org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:184)
> > >        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> > >        at
> > >
> > >
> >
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> > >        at
> > >
> > >
> >
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> > >        at java.lang.reflect.Method.invoke(Method.java:597)
> > >        at org.apache.hadoop.util.RunJar.main(RunJar.java:186)
> > >
> > > As a sanity check:
> > > [mspitz@wowzers mahout-distribution-0.4-SNAPSHOT]$ hadoop dfs -du
> > > reuters-clusters
> > > Found 1 items
> > > 8246        hdfs://
> > > 192.168.1.100:54310/user/mspitz/reuters-clusters/part-randomSeed
> > >
> > > By the way, I really really appreciate the help.  Thank you so much.
> > >
> > > -Matt
> > >
> > > On Thu, Oct 28, 2010 at 10:01 AM, Jeff Eastman
> > > <jd...@windwardsolutions.com>wrote:
> > >
> > > > With k-means, the initial clusters directory can either 1) contain
> > > > some initial clusters you produced somehow (a common method is via
> > > > Canopy) or 2) be empty. In the empty case, however, you also need
> > > > to specify the number of initial clusters (-k) so that your input
> > > > data can be sampled and the initial clusters put into the empty
> > > > directory. Note that if you do 1) and also specify -k, your initial
> > > > clusters will be overwritten by k values sampled from your input
> > > > data.
> > > >
> > > >
> > > > On 10/28/10 3:35 AM, pragnesh radadia wrote:
> > > >
> > > >> Are you using the Cloudera hadoop distribution?
> > > >> If yes, then run kmeans as the hadoop or hdfs user to solve your
> > > >> problem.
> > > >>
> > > >>
> > > >> On Thu, Oct 28, 2010 at 3:45 AM, Matt Spitz<ms...@meebo-inc.com>
> > >  wrote:
> > > >>
> > > >>> Bug report created!  Thanks!
> > > >>>
> > > >>> One more random question: when running kmeans, there's a required
> > > >>> -c (initial clusters) argument.  All the examples I've seen using
> > > >>> kmeans (namely https://issues.apache.org/jira/browse/MAHOUT-390)
> > > >>> specify a non-existent directory (presumably the algorithm would
> > > >>> select some initial random clusters).
> > > >>>
> > > >>> But, when specifying some initial, nonexistent clusters directory,
> > > >>> I get a bunch of:
> > > >>> 10/10/27 15:14:57 INFO mapred.JobClient: Task Id :
> > > >>> attempt_201008241139_107461_m_000002_2, Status : FAILED
> > > >>> java.lang.IllegalStateException: No clusters found. Check your -c
> > path.
> > > >>>        at
> > > >>>
> > > >>>
> > >
> >
> org.apache.mahout.clustering.kmeans.KMeansMapper.setup(KMeansMapper.java:61)
> > > >>>        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:142)
> > > >>>        at
> > > org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
> > > >>>        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
> > > >>>        at org.apache.hadoop.mapred.Child.main(Child.java:170)
> > > >>>
> > > >>> And the job eventually fails with:
> > > >>> Exception in thread "main" java.lang.InterruptedException: K-Means
> > > >>> Iteration
> > > >>> failed processing reuters-clusters/part-randomSeed
> > > >>>        at
> > > >>>
> > > >>>
> > >
> >
> org.apache.mahout.clustering.kmeans.KMeansDriver.runIteration(KMeansDriver.java:342)
> > > >>>        at
> > > >>>
> > > >>>
> > >
> >
> org.apache.mahout.clustering.kmeans.KMeansDriver.buildClustersMR(KMeansDriver.java:289)
> > > >>>        at
> > > >>>
> > > >>>
> > >
> >
> org.apache.mahout.clustering.kmeans.KMeansDriver.buildClusters(KMeansDriver.java:214)
> > > >>>        at
> > > >>>
> > > >>>
> > >
> >
> org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:143)
> > > >>>        at
> > > >>>
> > > >>>
> > >
> >
> org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:101)
> > > >>>        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> > > >>>        at
> > > >>>
> > > >>>
> > >
> >
> org.apache.mahout.clustering.kmeans.KMeansDriver.main(KMeansDriver.java:54)
> > > >>>        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native
> Method)
> > > >>>        at
> > > >>>
> > > >>>
> > >
> >
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> > > >>>        at
> > > >>>
> > > >>>
> > >
> >
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> > > >>>        at java.lang.reflect.Method.invoke(Method.java:597)
> > > >>>        at
> > > >>>
> > > >>>
> > >
> >
> org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
> > > >>>        at
> > > >>> org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
> > > >>>        at
> > > >>> org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:184)
> > > >>>        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native
> Method)
> > > >>>        at
> > > >>>
> > > >>>
> > >
> >
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> > > >>>        at
> > > >>>
> > > >>>
> > >
> >
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> > > >>>        at java.lang.reflect.Method.invoke(Method.java:597)
> > > >>>        at org.apache.hadoop.util.RunJar.main(RunJar.java:186)
> > > >>>
> > > >>> Any thoughts on this one?
> > > >>>
> > > >>> Cheers,
> > > >>> Matt
> > > >>>
> > > >>> On Wed, Oct 27, 2010 at 5:39 PM, Ted Dunning<ted.dunning@gmail.com
> >
> > > >>>  wrote:
> > > >>>
> > > >>>  On Wed, Oct 27, 2010 at 2:04 PM, Matt Spitz<ms...@meebo-inc.com>
> > > >>>>  wrote:
> > > >>>>
> > > >>>>  Ted- Awesome!  Thanks!  Running with mahout-0.4 sorted things out
> > > >>>>> rather
> > > >>>>> nicely.
> > > >>>>>
> > > >>>>>  There were lots of improvements there.
> > > >>>>
> > > >>>>
> > > >>>>  One thing that I find really weird is that 'mahout seqdirectory'
> > > always
> > > >>>>> hits
> > > >>>>> the local filesystem for input, even when running in Hadoop mode.
> > >  So,
> > > >>>>> if
> > > >>>>>
> > > >>>> I
> > > >>>>
> > > >>>>> have 'myurls' on the DFS, but that path doesn't exist locally,
> > > >>>>>
> > > >>>> seqdirectory
> > > >>>>
> > > >>>>> creates an empty sequence file (with no error).  Is this
> expected?
> > > >>>>>
> > > >>>>>  No.  That sounds like a bug.  Can you file a report here:
> > > >>>> https://issues.apache.org/jira/browse/MAHOUT ?
> > > >>>>
> > > >>>>
> > > >>>>  Is there a nice way to create sequence files that isn't
> > seqdirectory?
> > > >>>>>
> > > >>>>  I'd
> > > >>>>
> > > >>>>> like to do a little processing on the documents as they get sent
> to
> > > the
> > > >>>>> sequence file without having to generate a second copy on the
> DFS.
> > > >>>>>
> > > >>>>>  Sure.  Just snarf the code from the program in question and
> > massage
> > > it
> > > >>>> as
> > > >>>> you like.  The command line versions are handy,
> > > >>>> but it is very common to need to customize.  At that point, the
> > > command
> > > >>>> line
> > > >>>> programs serve as example code.  You don't
> > > >>>> have to use them and they have no magic.
> > > >>>>
> > > >>>> If you think you have some improvements in generality, we can push
> > > them
> > > >>>> back
> > > >>>> into the Mahout versions.
> > > >>>>
> > > >>>>
> > > >
> > >
> >
>
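
Jeff's -c/-k contract quoted above reduces to a small decision rule. A rough model of it (illustrative Python; the names and logic are a sketch for discussion, not Mahout's actual RandomSeedGenerator/KMeansMapper code):

```python
import random

# Illustrative model of the k-means seeding rule described above (NOT
# Mahout's implementation): passing -k samples k seeds from the input,
# overwriting anything already in the -c directory; without -k, the -c
# directory must already contain clusters or the mapper fails.
def seed_clusters(clusters_dir, k, input_points, rng=None):
    rng = rng or random.Random(42)
    if k is not None:
        # -k given: sample k input points as seeds, replacing -c contents
        return rng.sample(input_points, k)
    if not clusters_dir:
        # no -k and no pre-built clusters: nothing to iterate from
        raise ValueError("No clusters found. Check your -c path.")
    return list(clusters_dir)

points = [(0.0, 1.0), (2.0, 0.5), (1.5, 1.5), (3.0, 3.0)]
seeds = seed_clusters([], 2, points)              # empty -c + -k: sampled seeds
kept = seed_clusters([(9.0, 9.0)], None, points)  # pre-built seeds kept as-is
```

Under this model, the `IllegalStateException: No clusters found. Check your -c path.` seen in the thread corresponds to the no-seeds branch: a -c path the mappers cannot read, with no -k fallback taking effect.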

RE: Getting mahout to run on the DFS

Posted by Jeff Eastman <je...@Narus.com>.
I'm puzzled too. I just unzipped the same distribution and it ran without issues on my CDH3 unicluster. I'm running as my own userId, not hadoop-anything, on a RHEL 5.3 64-bit VM on VMware.

-----Original Message-----
From: Matt Spitz [mailto:mspitz@meebo-inc.com] 
Sent: Thursday, October 28, 2010 10:12 AM
To: user@mahout.apache.org
Subject: Re: Getting mahout to run on the DFS

OK, using
https://repository.apache.org/content/repositories/orgapachemahout-004/org/apache/mahout/mahout-distribution/0.4/mahout-distribution-0.4.zip

Same error on a clean unzip.  Running 64-bit Linux CentOS 5.3.

Running the lda command over hadoop yields results as expected.  Puzzling...

-Matt

On Thu, Oct 28, 2010 at 12:52 PM, Jeff Eastman <je...@narus.com> wrote:

> I don't recognize that zip. Can you try with the latest 0.4 RC at
> https://repository.apache.org/content/repositories/orgapachemahout-004/? I
> just ran that successfully on my CDH3 unicluster. What OS are you running?
> I'm on 64-bit Linux EL-5.
>
> -----Original Message-----
> From: Matt Spitz [mailto:mspitz@meebo-inc.com]
> Sent: Thursday, October 28, 2010 9:45 AM
> To: user@mahout.apache.org
> Subject: Re: Getting mahout to run on the DFS
>
> I'm running mahout-distribution-0.4-20101027.194721-1.zip on CDH3.  Running
> examples/bin/build-reuters.sh from a clean unzip results in the same error.
>  I definitely have read/write access to the DFS, as  reuters-seqdir and
> reuters-seqdir-sparse have been created correctly.
>
> Running locally with a clean unzip is fine.  It's just the
> running-on-the-DFS part that breaks when we try to cluster.
>
> Thanks,
> Matt
>
> On Thu, Oct 28, 2010 at 12:23 PM, Jeff Eastman <je...@narus.com> wrote:
>
> > Maybe I missed it, but are you running on trunk? Can you run
> > examples/bin/build-reuters.sh out of the box? I'm running that
> successfully
> > on a CDH3 cluster logged in as myself.
> >
> > Jeff
> >
> > -----Original Message-----
> > From: Matt Spitz [mailto:mspitz@meebo-inc.com]
> > Sent: Thursday, October 28, 2010 9:01 AM
> > To: user@mahout.apache.org
> > Subject: Re: Getting mahout to run on the DFS
> >
> > Hm.  So, I'm running the cloudera hadoop distribution, and I'm running as
> a
> > hadoop user.
> >
> > Here's my script (with MAHOUT_HOME, HADOOP_HOME, and HADOOP_CONF_DIR
> > specified):
> > ./bin/mahout seqdirectory -i ../reuters-out -o reuters-seqdir -c UTF-8
> > -chunk 5
> >
> > ./bin/mahout seq2sparse -i reuters-seqdir/ -o reuters-seqdir-sparse
> > ./bin/mahout kmeans -i reuters-seqdir-sparse/tf-vectors/ -o
> reuters-kmeans
> > -k 20 --clustering --maxIter 10 --clusters reuters-clusters
> >
> > ../reuters-out is a local directory (see
> > https://issues.apache.org/jira/browse/MAHOUT-535)
> >
> > After running the third command, I see a non-empty reuters-clusters
> > directory on the DFS, so presumably the initial clusters are getting
> > created.
> >
> > These commands run fine in local mode, but no dice running on the DFS.  I
> > even copied the reuters-clusters directory from the DFS to my local
> > machine,
> > hoping that mahout was looking there, but I still got the same error(s):
> > 10/10/28 08:57:55 INFO mapred.JobClient: Task Id :
> > attempt_201008241139_108731_m_000000_2, Status : FAILED
> > java.lang.IllegalStateException: No clusters found. Check your -c path.
> >        at
> >
> >
> org.apache.mahout.clustering.kmeans.KMeansMapper.setup(KMeansMapper.java:61)
> >        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:142)
> >        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
> >        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
> >        at org.apache.hadoop.mapred.Child.main(Child.java:170)
> >
> > 10/10/28 08:58:05 INFO mapred.JobClient: Job complete:
> > job_201008241139_108731
> > 10/10/28 08:58:05 INFO mapred.JobClient: Counters: 4
> > 10/10/28 08:58:05 INFO mapred.JobClient:   Job Counters
> > 10/10/28 08:58:05 INFO mapred.JobClient:     Rack-local map tasks=2
> > 10/10/28 08:58:05 INFO mapred.JobClient:     Launched map tasks=4
> > 10/10/28 08:58:05 INFO mapred.JobClient:     Data-local map tasks=2
> > 10/10/28 08:58:05 INFO mapred.JobClient:     Failed map tasks=1
> > Exception in thread "main" java.lang.InterruptedException: K-Means
> > Iteration
> > failed processing reuters-clusters/part-randomSeed
> >        at
> >
> >
> org.apache.mahout.clustering.kmeans.KMeansDriver.runIteration(KMeansDriver.java:342)
> >        at
> >
> >
> org.apache.mahout.clustering.kmeans.KMeansDriver.buildClustersMR(KMeansDriver.java:289)
> >        at
> >
> >
> org.apache.mahout.clustering.kmeans.KMeansDriver.buildClusters(KMeansDriver.java:214)
> >        at
> >
> org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:143)
> >        at
> >
> org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:101)
> >        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> >        at
> >
> org.apache.mahout.clustering.kmeans.KMeansDriver.main(KMeansDriver.java:54)
> >        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> >        at
> >
> >
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> >        at
> >
> >
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> >        at java.lang.reflect.Method.invoke(Method.java:597)
> >        at
> >
> >
> org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
> >        at
> > org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
> >        at
> org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:184)
> >        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> >        at
> >
> >
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> >        at
> >
> >
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> >        at java.lang.reflect.Method.invoke(Method.java:597)
> >        at org.apache.hadoop.util.RunJar.main(RunJar.java:186)
> >
> > As a sanity check:
> > [mspitz@wowzers mahout-distribution-0.4-SNAPSHOT]$ hadoop dfs -du
> > reuters-clusters
> > Found 1 items
> > 8246        hdfs://
> > 192.168.1.100:54310/user/mspitz/reuters-clusters/part-randomSeed
> >
> > By the way, I really really appreciate the help.  Thank you so much.
> >
> > -Matt
> >
> > On Thu, Oct 28, 2010 at 10:01 AM, Jeff Eastman
> > <jd...@windwardsolutions.com>wrote:
> >
> > > With k-means, the initial clusters directory can either 1) contain some
> > > initial clusters you produced somehow (a common method is via Canopy)
> or
> > 2)
> > > be empty. In the empty case, however, you also need to specify the
> number
> > of
> > > initial clusters (-k) so that your input data can be sampled and the
> > initial
> > > clusters put into the empty directory. Note that if you do 1) and also
> > > specify -k, your initial clusters will be overwritten by k sampled
> > > values from your input data.
> > >
> > >
> > > On 10/28/10 3:35 AM, pragnesh radadia wrote:
> > >
> > >> are you using the cloudera hadoop distribution?
> > >> if yes, then run kmeans as the hadoop or hdfs user to solve your problem
> > >>
> > >>
> > >> On Thu, Oct 28, 2010 at 3:45 AM, Matt Spitz<ms...@meebo-inc.com>
> >  wrote:
> > >>
> > >>> Bug report created!  Thanks!
> > >>>
> > >>> One more random question: when running kmeans, there's a required -c
> > >>> (initial clusters) argument.  All the examples I've seen using kmeans
> > >>> (namely https://issues.apache.org/jira/browse/MAHOUT-390) specify a
> > >>> non-existent directory (presumably the algorithm would select some
> > >>> initial
> > >>> random clusters).
> > >>>
> > >>> But, when specifying some initial, nonexistent clusters directory, I
> > get
> > >>> a
> > >>> bunch of:
> > >>> 10/10/27 15:14:57 INFO mapred.JobClient: Task Id :
> > >>> attempt_201008241139_107461_m_000002_2, Status : FAILED
> > >>> java.lang.IllegalStateException: No clusters found. Check your -c
> path.
> > >>>        at
> > >>>
> > >>>
> >
> org.apache.mahout.clustering.kmeans.KMeansMapper.setup(KMeansMapper.java:61)
> > >>>        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:142)
> > >>>        at
> > org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
> > >>>        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
> > >>>        at org.apache.hadoop.mapred.Child.main(Child.java:170)
> > >>>
> > >>> And the job eventually fails with:
> > >>> Exception in thread "main" java.lang.InterruptedException: K-Means
> > >>> Iteration
> > >>> failed processing reuters-clusters/part-randomSeed
> > >>>        at
> > >>>
> > >>>
> >
> org.apache.mahout.clustering.kmeans.KMeansDriver.runIteration(KMeansDriver.java:342)
> > >>>        at
> > >>>
> > >>>
> >
> org.apache.mahout.clustering.kmeans.KMeansDriver.buildClustersMR(KMeansDriver.java:289)
> > >>>        at
> > >>>
> > >>>
> >
> org.apache.mahout.clustering.kmeans.KMeansDriver.buildClusters(KMeansDriver.java:214)
> > >>>        at
> > >>>
> > >>>
> >
> org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:143)
> > >>>        at
> > >>>
> > >>>
> >
> org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:101)
> > >>>        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> > >>>        at
> > >>>
> > >>>
> >
> org.apache.mahout.clustering.kmeans.KMeansDriver.main(KMeansDriver.java:54)
> > >>>        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> > >>>        at
> > >>>
> > >>>
> >
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> > >>>        at
> > >>>
> > >>>
> >
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> > >>>        at java.lang.reflect.Method.invoke(Method.java:597)
> > >>>        at
> > >>>
> > >>>
> >
> org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
> > >>>        at
> > >>> org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
> > >>>        at
> > >>> org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:184)
> > >>>        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> > >>>        at
> > >>>
> > >>>
> >
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> > >>>        at
> > >>>
> > >>>
> >
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> > >>>        at java.lang.reflect.Method.invoke(Method.java:597)
> > >>>        at org.apache.hadoop.util.RunJar.main(RunJar.java:186)
> > >>>
> > >>> Any thoughts on this one?
> > >>>
> > >>> Cheers,
> > >>> Matt
> > >>>
> > >>> On Wed, Oct 27, 2010 at 5:39 PM, Ted Dunning<te...@gmail.com>
> > >>>  wrote:
> > >>>
> > >>>  On Wed, Oct 27, 2010 at 2:04 PM, Matt Spitz<ms...@meebo-inc.com>
> > >>>>  wrote:
> > >>>>
> > >>>>  Ted- Awesome!  Thanks!  Running with mahout-0.4 sorted things out
> > >>>>> rather
> > >>>>> nicely.
> > >>>>>
> > >>>>>  There were lots of improvements there.
> > >>>>
> > >>>>
> > >>>>  One thing that I find really weird is that 'mahout seqdirectory'
> > always
> > >>>>> hits
> > >>>>> the local filesystem for input, even when running in Hadoop mode.
> >  So,
> > >>>>> if
> > >>>>>
> > >>>> I
> > >>>>
> > >>>>> have 'myurls' on the DFS, but that path doesn't exist locally,
> > >>>>>
> > >>>> seqdirectory
> > >>>>
> > >>>>> creates an empty sequence file (with no error).  Is this expected?
> > >>>>>
> > >>>>>  No.  That sounds like a bug.  Can you file a report here:
> > >>>> https://issues.apache.org/jira/browse/MAHOUT ?
> > >>>>
> > >>>>
> > >>>>  Is there a nice way to create sequence files that isn't
> seqdirectory?
> > >>>>>
> > >>>>  I'd
> > >>>>
> > >>>>> like to do a little processing on the documents as they get sent to
> > the
> > >>>>> sequence file without having to generate a second copy on the DFS.
> > >>>>>
> > >>>>>  Sure.  Just snarf the code from the program in question and
> massage
> > it
> > >>>> as
> > >>>> you like.  The command line versions are handy,
> > >>>> but it is very common to need to customize.  At that point, the
> > command
> > >>>> line
> > >>>> programs serve as example code.  You don't
> > >>>> have to use them and they have no magic.
> > >>>>
> > >>>> If you think you have some improvements in generality, we can push
> > them
> > >>>> back
> > >>>> into the Mahout versions.
> > >>>>
> > >>>>
> > >
> >
>

Re: Getting mahout to run on the DFS

Posted by Matt Spitz <ms...@meebo-inc.com>.
OK, using
https://repository.apache.org/content/repositories/orgapachemahout-004/org/apache/mahout/mahout-distribution/0.4/mahout-distribution-0.4.zip

Same error on a clean unzip.  Running 64-bit Linux CentOS 5.3.

Running the lda command over hadoop yields results as expected.  Puzzling...
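
For what it's worth, the recurring local-vs-DFS confusion in this thread comes down to how a relative path is qualified against the client's default filesystem. A rough sketch of that resolution (illustrative Python, not Hadoop's actual org.apache.hadoop.fs.Path code):

```python
# Illustrative sketch of Hadoop-style path qualification (NOT the real
# Path/FileSystem code): a relative path is resolved against the working
# directory, then prefixed with the default filesystem URI.
def qualify(path, default_fs, working_dir):
    if "://" in path or path.startswith("file:"):
        return path                              # already fully qualified
    if not path.startswith("/"):
        path = working_dir.rstrip("/") + "/" + path
    return default_fs + path

# With the default filesystem left at the local default, a relative input
# resolves to file:/..., which is the "Input path does not exist" symptom.
local = qualify("reuters-clusters", "file:", "/home/mspitz")
dfs = qualify("reuters-clusters", "hdfs://192.168.1.100:54310", "/user/mspitz")
print(local)  # file:/home/mspitz/reuters-clusters
print(dfs)    # hdfs://192.168.1.100:54310/user/mspitz/reuters-clusters
```

So a job that reads its config from HADOOP_CONF_DIR qualifies relative paths against HDFS, while code that builds a bare Configuration falls back to file:/ paths.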

-Matt

On Thu, Oct 28, 2010 at 12:52 PM, Jeff Eastman <je...@narus.com> wrote:

> I don't recognize that zip. Can you try with the latest 0.4 RC at
> https://repository.apache.org/content/repositories/orgapachemahout-004/? I
> just ran that successfully on my CDH3 unicluster. What OS are you running?
> I'm on 64-bit Linux EL-5.
>
> -----Original Message-----
> From: Matt Spitz [mailto:mspitz@meebo-inc.com]
> Sent: Thursday, October 28, 2010 9:45 AM
> To: user@mahout.apache.org
> Subject: Re: Getting mahout to run on the DFS
>
> I'm running mahout-distribution-0.4-20101027.194721-1.zip on CDH3.  Running
> examples/bin/build-reuters.sh from a clean unzip results in the same error.
>  I definitely have read/write access to the DFS, as  reuters-seqdir and
> reuters-seqdir-sparse have been created correctly.
>
> Running locally with a clean unzip is fine.  It's just the
> running-on-the-DFS part that breaks when we try to cluster.
>
> Thanks,
> Matt
>
> On Thu, Oct 28, 2010 at 12:23 PM, Jeff Eastman <je...@narus.com> wrote:
>
> > Maybe I missed it, but are you running on trunk? Can you run
> > examples/bin/build-reuters.sh out of the box? I'm running that
> successfully
> > on a CDH3 cluster logged in as myself.
> >
> > Jeff
> >
> > -----Original Message-----
> > From: Matt Spitz [mailto:mspitz@meebo-inc.com]
> > Sent: Thursday, October 28, 2010 9:01 AM
> > To: user@mahout.apache.org
> > Subject: Re: Getting mahout to run on the DFS
> >
> > Hm.  So, I'm running the cloudera hadoop distribution, and I'm running as
> a
> > hadoop user.
> >
> > Here's my script (with MAHOUT_HOME, HADOOP_HOME, and HADOOP_CONF_DIR
> > specified):
> > ./bin/mahout seqdirectory -i ../reuters-out -o reuters-seqdir -c UTF-8
> > -chunk 5
> >
> > ./bin/mahout seq2sparse -i reuters-seqdir/ -o reuters-seqdir-sparse
> > ./bin/mahout kmeans -i reuters-seqdir-sparse/tf-vectors/ -o
> reuters-kmeans
> > -k 20 --clustering --maxIter 10 --clusters reuters-clusters
> >
> > ../reuters-out is a local directory (see
> > https://issues.apache.org/jira/browse/MAHOUT-535)
> >
> > After running the third command, I see a non-empty reuters-clusters
> > directory on the DFS, so presumably the initial clusters are getting
> > created.
> >
> > These commands run fine in local mode, but no dice running on the DFS.  I
> > even copied the reuters-clusters directory from the DFS to my local
> > machine,
> > hoping that mahout was looking there, but I still got the same error(s):
> > 10/10/28 08:57:55 INFO mapred.JobClient: Task Id :
> > attempt_201008241139_108731_m_000000_2, Status : FAILED
> > java.lang.IllegalStateException: No clusters found. Check your -c path.
> >        at
> >
> >
> org.apache.mahout.clustering.kmeans.KMeansMapper.setup(KMeansMapper.java:61)
> >        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:142)
> >        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
> >        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
> >        at org.apache.hadoop.mapred.Child.main(Child.java:170)
> >
> > 10/10/28 08:58:05 INFO mapred.JobClient: Job complete:
> > job_201008241139_108731
> > 10/10/28 08:58:05 INFO mapred.JobClient: Counters: 4
> > 10/10/28 08:58:05 INFO mapred.JobClient:   Job Counters
> > 10/10/28 08:58:05 INFO mapred.JobClient:     Rack-local map tasks=2
> > 10/10/28 08:58:05 INFO mapred.JobClient:     Launched map tasks=4
> > 10/10/28 08:58:05 INFO mapred.JobClient:     Data-local map tasks=2
> > 10/10/28 08:58:05 INFO mapred.JobClient:     Failed map tasks=1
> > Exception in thread "main" java.lang.InterruptedException: K-Means
> > Iteration
> > failed processing reuters-clusters/part-randomSeed
> >        at
> >
> >
> org.apache.mahout.clustering.kmeans.KMeansDriver.runIteration(KMeansDriver.java:342)
> >        at
> >
> >
> org.apache.mahout.clustering.kmeans.KMeansDriver.buildClustersMR(KMeansDriver.java:289)
> >        at
> >
> >
> org.apache.mahout.clustering.kmeans.KMeansDriver.buildClusters(KMeansDriver.java:214)
> >        at
> >
> org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:143)
> >        at
> >
> org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:101)
> >        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> >        at
> >
> org.apache.mahout.clustering.kmeans.KMeansDriver.main(KMeansDriver.java:54)
> >        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> >        at
> >
> >
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> >        at
> >
> >
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> >        at java.lang.reflect.Method.invoke(Method.java:597)
> >        at
> >
> >
> org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
> >        at
> > org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
> >        at
> org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:184)
> >        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> >        at
> >
> >
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> >        at
> >
> >
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> >        at java.lang.reflect.Method.invoke(Method.java:597)
> >        at org.apache.hadoop.util.RunJar.main(RunJar.java:186)
> >
> > As a sanity check:
> > [mspitz@wowzers mahout-distribution-0.4-SNAPSHOT]$ hadoop dfs -du
> > reuters-clusters
> > Found 1 items
> > 8246        hdfs://
> > 192.168.1.100:54310/user/mspitz/reuters-clusters/part-randomSeed
> >
> > By the way, I really really appreciate the help.  Thank you so much.
> >
> > -Matt
> >
> > On Thu, Oct 28, 2010 at 10:01 AM, Jeff Eastman
> > <jd...@windwardsolutions.com>wrote:
> >
> > > With k-means, the initial clusters directory can either 1) contain some
> > > initial clusters you produced somehow (a common method is via Canopy)
> or
> > 2)
> > > be empty. In the empty case, however, you also need to specify the
> number
> > of
> > > initial clusters (-k) so that your input data can be sampled and the
> > initial
> > > clusters put into the empty directory. Note that if you do 1) and also
> > > specify -k, your initial clusters will be overwritten by k sampled
> > > values from your input data.
> > >
> > >
> > > On 10/28/10 3:35 AM, pragnesh radadia wrote:
> > >
> > >> are you using the cloudera hadoop distribution?
> > >> if yes, then run kmeans as the hadoop or hdfs user to solve your problem
> > >>
> > >>
> > >> On Thu, Oct 28, 2010 at 3:45 AM, Matt Spitz<ms...@meebo-inc.com>
> >  wrote:
> > >>
> > >>> Bug report created!  Thanks!
> > >>>
> > >>> One more random question: when running kmeans, there's a required -c
> > >>> (initial clusters) argument.  All the examples I've seen using kmeans
> > >>> (namely https://issues.apache.org/jira/browse/MAHOUT-390) specify a
> > >>> non-existent directory (presumably the algorithm would select some
> > >>> initial
> > >>> random clusters).
> > >>>
> > >>> But, when specifying some initial, nonexistent clusters directory, I
> > get
> > >>> a
> > >>> bunch of:
> > >>> 10/10/27 15:14:57 INFO mapred.JobClient: Task Id :
> > >>> attempt_201008241139_107461_m_000002_2, Status : FAILED
> > >>> java.lang.IllegalStateException: No clusters found. Check your -c
> path.
> > >>>        at
> > >>>
> > >>>
> >
> org.apache.mahout.clustering.kmeans.KMeansMapper.setup(KMeansMapper.java:61)
> > >>>        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:142)
> > >>>        at
> > org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
> > >>>        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
> > >>>        at org.apache.hadoop.mapred.Child.main(Child.java:170)
> > >>>
> > >>> And the job eventually fails with:
> > >>> Exception in thread "main" java.lang.InterruptedException: K-Means
> > >>> Iteration
> > >>> failed processing reuters-clusters/part-randomSeed
> > >>>        at
> > >>>
> > >>>
> >
> org.apache.mahout.clustering.kmeans.KMeansDriver.runIteration(KMeansDriver.java:342)
> > >>>        at
> > >>>
> > >>>
> >
> org.apache.mahout.clustering.kmeans.KMeansDriver.buildClustersMR(KMeansDriver.java:289)
> > >>>        at
> > >>>
> > >>>
> >
> org.apache.mahout.clustering.kmeans.KMeansDriver.buildClusters(KMeansDriver.java:214)
> > >>>        at
> > >>>
> > >>>
> >
> org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:143)
> > >>>        at
> > >>>
> > >>>
> >
> org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:101)
> > >>>        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> > >>>        at
> > >>>
> > >>>
> >
> org.apache.mahout.clustering.kmeans.KMeansDriver.main(KMeansDriver.java:54)
> > >>>        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> > >>>        at
> > >>>
> > >>>
> >
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> > >>>        at
> > >>>
> > >>>
> >
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> > >>>        at java.lang.reflect.Method.invoke(Method.java:597)
> > >>>        at
> > >>>
> > >>>
> >
> org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
> > >>>        at
> > >>> org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
> > >>>        at
> > >>> org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:184)
> > >>>        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> > >>>        at
> > >>>
> > >>>
> >
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> > >>>        at
> > >>>
> > >>>
> >
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> > >>>        at java.lang.reflect.Method.invoke(Method.java:597)
> > >>>        at org.apache.hadoop.util.RunJar.main(RunJar.java:186)
> > >>>
> > >>> Any thoughts on this one?
> > >>>
> > >>> Cheers,
> > >>> Matt
> > >>>
> > >>> On Wed, Oct 27, 2010 at 5:39 PM, Ted Dunning<te...@gmail.com>
> > >>>  wrote:
> > >>>
> > >>>  On Wed, Oct 27, 2010 at 2:04 PM, Matt Spitz<ms...@meebo-inc.com>
> > >>>>  wrote:
> > >>>>
> > >>>>  Ted- Awesome!  Thanks!  Running with mahout-0.4 sorted things out
> > >>>>> rather
> > >>>>> nicely.
> > >>>>>
> > >>>>>  There were lots of improvements there.
> > >>>>
> > >>>>
> > >>>>  One thing that I find really weird is that 'mahout seqdirectory'
> > always
> > >>>>> hits
> > >>>>> the local filesystem for input, even when running in Hadoop mode.
> >  So,
> > >>>>> if
> > >>>>>
> > >>>> I
> > >>>>
> > >>>>> have 'myurls' on the DFS, but that path doesn't exist locally,
> > >>>>>
> > >>>> seqdirectory
> > >>>>
> > >>>>> creates an empty sequence file (with no error).  Is this expected?
> > >>>>>
> > >>>>>  No.  That sounds like a bug.  Can you file a report here:
> > >>>> https://issues.apache.org/jira/browse/MAHOUT ?
> > >>>>
> > >>>>
> > >>>>  Is there a nice way to create sequence files that isn't
> seqdirectory?
> > >>>>>
> > >>>>  I'd
> > >>>>
> > >>>>> like to do a little processing on the documents as they get sent to
> > the
> > >>>>> sequence file without having to generate a second copy on the DFS.
> > >>>>>
> > >>>>>  Sure.  Just snarf the code from the program in question and
> massage
> > it
> > >>>> as
> > >>>> you like.  The command line versions are handy,
> > >>>> but it is very common to need to customize.  At that point, the
> > command
> > >>>> line
> > >>>> programs serve as example code.  You don't
> > >>>> have to use them and they have no magic.
> > >>>>
> > >>>> If you think you have some improvements in generality, we can push
> > them
> > >>>> back
> > >>>> into the Mahout versions.
> > >>>>
> > >>>>
> > >
> >
>

RE: Getting mahout to run on the DFS

Posted by Jeff Eastman <je...@Narus.com>.
I don't recognize that zip. Can you try with the latest 0.4 RC at https://repository.apache.org/content/repositories/orgapachemahout-004/? I just ran that successfully on my CDH3 unicluster. What OS are you running? I'm on 64-bit Linux EL-5.

-----Original Message-----
From: Matt Spitz [mailto:mspitz@meebo-inc.com] 
Sent: Thursday, October 28, 2010 9:45 AM
To: user@mahout.apache.org
Subject: Re: Getting mahout to run on the DFS

I'm running mahout-distribution-0.4-20101027.194721-1.zip on CDH3.  Running
examples/bin/build-reuters.sh from a clean unzip results in the same error.
 I definitely have read/write access to the DFS, as  reuters-seqdir and
reuters-seqdir-sparse have been created correctly.

Running locally with a clean unzip is fine.  It's just the
running-on-the-DFS part that breaks when we try to cluster.

Thanks,
Matt

On Thu, Oct 28, 2010 at 12:23 PM, Jeff Eastman <je...@narus.com> wrote:

> Maybe I missed it, but are you running on trunk? Can you run
> examples/bin/build-reuters.sh out of the box? I'm running that successfully
> on a CDH3 cluster logged in as myself.
>
> Jeff
>
> -----Original Message-----
> From: Matt Spitz [mailto:mspitz@meebo-inc.com]
> Sent: Thursday, October 28, 2010 9:01 AM
> To: user@mahout.apache.org
> Subject: Re: Getting mahout to run on the DFS
>
> Hm.  So, I'm running the cloudera hadoop distribution, and I'm running as a
> hadoop user.
>
> Here's my script (with MAHOUT_HOME, HADOOP_HOME, and HADOOP_CONF_DIR
> specified):
> ./bin/mahout seqdirectory -i ../reuters-out -o reuters-seqdir -c UTF-8
> -chunk 5
>
> ./bin/mahout seq2sparse -i reuters-seqdir/ -o reuters-seqdir-sparse
> ./bin/mahout kmeans -i reuters-seqdir-sparse/tf-vectors/ -o reuters-kmeans
> -k 20 --clustering --maxIter 10 --clusters reuters-clusters
>
> ../reuters-out is a local directory (see
> https://issues.apache.org/jira/browse/MAHOUT-535)
>
> After running the third command, I see a non-empty reuters-clusters
> directory on the DFS, so presumably the initial clusters are getting
> created.
>
> These commands run fine in local mode, but no dice running on the DFS.  I
> even copied the reuters-clusters directory from the DFS to my local
> machine,
> hoping that mahout was looking there, but I still got the same error(s):
> 10/10/28 08:57:55 INFO mapred.JobClient: Task Id :
> attempt_201008241139_108731_m_000000_2, Status : FAILED
> java.lang.IllegalStateException: No clusters found. Check your -c path.
>        at
>
> org.apache.mahout.clustering.kmeans.KMeansMapper.setup(KMeansMapper.java:61)
>        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:142)
>        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
>        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
>        at org.apache.hadoop.mapred.Child.main(Child.java:170)
>
> 10/10/28 08:58:05 INFO mapred.JobClient: Job complete:
> job_201008241139_108731
> 10/10/28 08:58:05 INFO mapred.JobClient: Counters: 4
> 10/10/28 08:58:05 INFO mapred.JobClient:   Job Counters
> 10/10/28 08:58:05 INFO mapred.JobClient:     Rack-local map tasks=2
> 10/10/28 08:58:05 INFO mapred.JobClient:     Launched map tasks=4
> 10/10/28 08:58:05 INFO mapred.JobClient:     Data-local map tasks=2
> 10/10/28 08:58:05 INFO mapred.JobClient:     Failed map tasks=1
> Exception in thread "main" java.lang.InterruptedException: K-Means
> Iteration
> failed processing reuters-clusters/part-randomSeed
>        at
>
> org.apache.mahout.clustering.kmeans.KMeansDriver.runIteration(KMeansDriver.java:342)
>        at
>
> org.apache.mahout.clustering.kmeans.KMeansDriver.buildClustersMR(KMeansDriver.java:289)
>        at
>
> org.apache.mahout.clustering.kmeans.KMeansDriver.buildClusters(KMeansDriver.java:214)
>        at
> org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:143)
>        at
> org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:101)
>        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>        at
> org.apache.mahout.clustering.kmeans.KMeansDriver.main(KMeansDriver.java:54)
>        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>        at
>
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>        at
>
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>        at java.lang.reflect.Method.invoke(Method.java:597)
>        at
>
> org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
>        at
> org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
>        at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:184)
>        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>        at
>
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>        at
>
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>        at java.lang.reflect.Method.invoke(Method.java:597)
>        at org.apache.hadoop.util.RunJar.main(RunJar.java:186)
>
> As a sanity check:
> [mspitz@wowzers mahout-distribution-0.4-SNAPSHOT]$ hadoop dfs -du
> reuters-clusters
> Found 1 items
> 8246        hdfs://
> 192.168.1.100:54310/user/mspitz/reuters-clusters/part-randomSeed
>
> By the way, I really really appreciate the help.  Thank you so much.
>
> -Matt
>
> On Thu, Oct 28, 2010 at 10:01 AM, Jeff Eastman
> <jd...@windwardsolutions.com>wrote:
>
> > With k-means, the initial clusters directory can either 1) contain some
> > initial clusters you produced somehow (a common method is via Canopy) or
> 2)
> > be empty. In the empty case, however, you also need to specify the number
> of
> > initial clusters (-k) so that your input data can be sampled and the
> initial
> > clusters put into the empty directory. Note that if you do 1) and also
> > specify -k, your initial clusters will be overwritten by k sampled
> > values from your input data.
> >
> >
> > On 10/28/10 3:35 AM, pragnesh radadia wrote:
> >
> >> are you using the Cloudera Hadoop distribution?
> >> if yes, then run kmeans as the hadoop or hdfs user to solve your problem
> >>
> >>
> >> On Thu, Oct 28, 2010 at 3:45 AM, Matt Spitz<ms...@meebo-inc.com>
>  wrote:
> >>
> >>> Bug report created!  Thanks!
> >>>
> >>> One more random question: when running kmeans, there's a required -c
> >>> (initial clusters) argument.  All the examples I've seen using kmeans
> >>> (namely https://issues.apache.org/jira/browse/MAHOUT-390) specify a
> >>> non-existent directory (presumably the algorithm would select some
> >>> initial
> >>> random clusters).
> >>>
> >>> But, when specifying some initial, nonexistent clusters directory, I
> get
> >>> a
> >>> bunch of:
> >>> 10/10/27 15:14:57 INFO mapred.JobClient: Task Id :
> >>> attempt_201008241139_107461_m_000002_2, Status : FAILED
> >>> java.lang.IllegalStateException: No clusters found. Check your -c path.
> >>>        at
> >>>
> >>>
> org.apache.mahout.clustering.kmeans.KMeansMapper.setup(KMeansMapper.java:61)
> >>>        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:142)
> >>>        at
> org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
> >>>        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
> >>>        at org.apache.hadoop.mapred.Child.main(Child.java:170)
> >>>
> >>> And the job eventually fails with:
> >>> Exception in thread "main" java.lang.InterruptedException: K-Means
> >>> Iteration
> >>> failed processing reuters-clusters/part-randomSeed
> >>>        at
> >>>
> >>>
> org.apache.mahout.clustering.kmeans.KMeansDriver.runIteration(KMeansDriver.java:342)
> >>>        at
> >>>
> >>>
> org.apache.mahout.clustering.kmeans.KMeansDriver.buildClustersMR(KMeansDriver.java:289)
> >>>        at
> >>>
> >>>
> org.apache.mahout.clustering.kmeans.KMeansDriver.buildClusters(KMeansDriver.java:214)
> >>>        at
> >>>
> >>>
> org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:143)
> >>>        at
> >>>
> >>>
> org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:101)
> >>>        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> >>>        at
> >>>
> >>>
> org.apache.mahout.clustering.kmeans.KMeansDriver.main(KMeansDriver.java:54)
> >>>        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> >>>        at
> >>>
> >>>
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> >>>        at
> >>>
> >>>
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> >>>        at java.lang.reflect.Method.invoke(Method.java:597)
> >>>        at
> >>>
> >>>
> org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
> >>>        at
> >>> org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
> >>>        at
> >>> org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:184)
> >>>        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> >>>        at
> >>>
> >>>
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> >>>        at
> >>>
> >>>
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> >>>        at java.lang.reflect.Method.invoke(Method.java:597)
> >>>        at org.apache.hadoop.util.RunJar.main(RunJar.java:186)
> >>>
> >>> Any thoughts on this one?
> >>>
> >>> Cheers,
> >>> Matt
> >>>
> >>> On Wed, Oct 27, 2010 at 5:39 PM, Ted Dunning<te...@gmail.com>
> >>>  wrote:
> >>>
> >>>  On Wed, Oct 27, 2010 at 2:04 PM, Matt Spitz<ms...@meebo-inc.com>
> >>>>  wrote:
> >>>>
> >>>>  Ted- Awesome!  Thanks!  Running with mahout-0.4 sorted things out
> >>>>> rather
> >>>>> nicely.
> >>>>>
> >>>>>  There were lots of improvements there.
> >>>>
> >>>>
> >>>>  One thing that I find really weird is that 'mahout seqdirectory'
> always
> >>>>> hits
> >>>>> the local filesystem for input, even when running in Hadoop mode.
>  So,
> >>>>> if
> >>>>>
> >>>> I
> >>>>
> >>>>> have 'myurls' on the DFS, but that path doesn't exist locally,
> >>>>>
> >>>> seqdirectory
> >>>>
> >>>>> creates an empty sequence file (with no error).  Is this expected?
> >>>>>
> >>>>>  No.  That sounds like a bug.  Can you file a report here:
> >>>> https://issues.apache.org/jira/browse/MAHOUT ?
> >>>>
> >>>>
> >>>>  Is there a nice way to create sequence files that isn't seqdirectory?
> >>>>>
> >>>>  I'd
> >>>>
> >>>>> like to do a little processing on the documents as they get sent to
> the
> >>>>> sequence file without having to generate a second copy on the DFS.
> >>>>>
> >>>>>  Sure.  Just snarf the code from the program in question and massage
> it
> >>>> as
> >>>> you like.  The command line versions are handy,
> >>>> but it is very common to need to customize.  At that point, the
> command
> >>>> line
> >>>> programs serve as example code.  You don't
> >>>> have to use them and they have no magic.
> >>>>
> >>>> If you think you have some improvements in generality, we can push
> them
> >>>> back
> >>>> into the Mahout versions.
> >>>>
> >>>>
> >
>
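
Ted's suggestion above — adapt the seqdirectory code and preprocess documents on the way into the sequence file — amounts to the following sketch. This is illustrative Python, not Mahout's Java tool: the real seqdirectory writes Hadoop SequenceFiles, and the `preprocess` hook and key naming here are assumptions for illustration only.

```python
import os

def directory_to_records(root, preprocess=lambda text: text):
    """Walk a directory and yield (key, value) records the way
    'mahout seqdirectory' does conceptually: key = file path relative
    to the root, value = the (optionally preprocessed) file contents.
    Illustrative sketch only -- not the Hadoop SequenceFile format."""
    for dirpath, _, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8") as f:
                text = f.read()
            key = "/" + os.path.relpath(path, root)
            yield key, preprocess(text)
```

A custom preprocess step (lowercasing, stripping markup, and so on) then runs exactly once, with no second copy of the corpus on the DFS.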

Re: Getting mahout to run on the DFS

Posted by Matt Spitz <ms...@meebo-inc.com>.
I'm running mahout-distribution-0.4-20101027.194721-1.zip on CDH3.  Running
examples/bin/build-reuters.sh from a clean unzip results in the same error.
 I definitely have read/write access to the DFS, as reuters-seqdir and
reuters-seqdir-sparse have been created correctly.

Running locally with a clean unzip is fine.  It's just the
running-on-the-DFS part that breaks when we try to cluster.

Thanks,
Matt

On Thu, Oct 28, 2010 at 12:23 PM, Jeff Eastman <je...@narus.com> wrote:

> Maybe I missed it, but are you running on trunk? Can you run
> examples/bin/build-reuters.sh out of the box? I'm running that successfully
> on a CDH3 cluster logged in as myself.
>
> Jeff
>
> -----Original Message-----
> From: Matt Spitz [mailto:mspitz@meebo-inc.com]
> Sent: Thursday, October 28, 2010 9:01 AM
> To: user@mahout.apache.org
> Subject: Re: Getting mahout to run on the DFS
>
> Hm.  So, I'm running the Cloudera Hadoop distribution, and I'm running as a
> hadoop user.
>
> Here's my script (with MAHOUT_HOME, HADOOP_HOME, and HADOOP_CONF_DIR
> specified):
> ./bin/mahout seqdirectory -i ../reuters-out -o reuters-seqdir -c UTF-8
> -chunk 5
>
> ./bin/mahout seq2sparse -i reuters-seqdir/ -o reuters-seqdir-sparse
> ./bin/mahout kmeans -i reuters-seqdir-sparse/tf-vectors/ -o reuters-kmeans
> -k 20 --clustering --maxIter 10 --clusters reuters-clusters
>
> ../reuters-out is a local directory (see
> https://issues.apache.org/jira/browse/MAHOUT-535)
>
> After running the third command, I see a non-empty reuters-clusters
> directory on the DFS, so presumably the initial clusters are getting
> created.
>
> These commands run fine in local mode, but no dice running on the DFS.  I
> even copied the reuters-clusters directory from the DFS to my local
> machine,
> hoping that mahout was looking there, but I still got the same error(s):
> 10/10/28 08:57:55 INFO mapred.JobClient: Task Id :
> attempt_201008241139_108731_m_000000_2, Status : FAILED
> java.lang.IllegalStateException: No clusters found. Check your -c path.
>        at
>
> org.apache.mahout.clustering.kmeans.KMeansMapper.setup(KMeansMapper.java:61)
>        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:142)
>        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
>        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
>        at org.apache.hadoop.mapred.Child.main(Child.java:170)
>
> 10/10/28 08:58:05 INFO mapred.JobClient: Job complete:
> job_201008241139_108731
> 10/10/28 08:58:05 INFO mapred.JobClient: Counters: 4
> 10/10/28 08:58:05 INFO mapred.JobClient:   Job Counters
> 10/10/28 08:58:05 INFO mapred.JobClient:     Rack-local map tasks=2
> 10/10/28 08:58:05 INFO mapred.JobClient:     Launched map tasks=4
> 10/10/28 08:58:05 INFO mapred.JobClient:     Data-local map tasks=2
> 10/10/28 08:58:05 INFO mapred.JobClient:     Failed map tasks=1
> Exception in thread "main" java.lang.InterruptedException: K-Means
> Iteration
> failed processing reuters-clusters/part-randomSeed
>        at
>
> org.apache.mahout.clustering.kmeans.KMeansDriver.runIteration(KMeansDriver.java:342)
>        at
>
> org.apache.mahout.clustering.kmeans.KMeansDriver.buildClustersMR(KMeansDriver.java:289)
>        at
>
> org.apache.mahout.clustering.kmeans.KMeansDriver.buildClusters(KMeansDriver.java:214)
>        at
> org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:143)
>        at
> org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:101)
>        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>        at
> org.apache.mahout.clustering.kmeans.KMeansDriver.main(KMeansDriver.java:54)
>        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>        at
>
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>        at
>
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>        at java.lang.reflect.Method.invoke(Method.java:597)
>        at
>
> org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
>        at
> org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
>        at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:184)
>        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>        at
>
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>        at
>
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>        at java.lang.reflect.Method.invoke(Method.java:597)
>        at org.apache.hadoop.util.RunJar.main(RunJar.java:186)
>
> As a sanity check:
> [mspitz@wowzers mahout-distribution-0.4-SNAPSHOT]$ hadoop dfs -du
> reuters-clusters
> Found 1 items
> 8246        hdfs://
> 192.168.1.100:54310/user/mspitz/reuters-clusters/part-randomSeed
>
> By the way, I really really appreciate the help.  Thank you so much.
>
> -Matt
>
> On Thu, Oct 28, 2010 at 10:01 AM, Jeff Eastman
> <jd...@windwardsolutions.com>wrote:
>
> > With k-means, the initial clusters directory can either 1) contain some
> > initial clusters you produced somehow (a common method is via Canopy) or
> 2)
> > be empty. In the empty case, however, you also need to specify the number
> of
> > initial clusters (-k) so that your input data can be sampled and the
> initial
> > clusters put into the empty directory. Note that if you do 1) and also
> > specify -k, your initial clusters will be overwritten by k sampled
> > values from your input data.
> >
> >
> > On 10/28/10 3:35 AM, pragnesh radadia wrote:
> >
> >> are you using the Cloudera Hadoop distribution?
> >> if yes, then run kmeans as the hadoop or hdfs user to solve your problem
> >>
> >>
> >> On Thu, Oct 28, 2010 at 3:45 AM, Matt Spitz<ms...@meebo-inc.com>
>  wrote:
> >>
> >>> Bug report created!  Thanks!
> >>>
> >>> One more random question: when running kmeans, there's a required -c
> >>> (initial clusters) argument.  All the examples I've seen using kmeans
> >>> (namely https://issues.apache.org/jira/browse/MAHOUT-390) specify a
> >>> non-existent directory (presumably the algorithm would select some
> >>> initial
> >>> random clusters).
> >>>
> >>> But, when specifying some initial, nonexistent clusters directory, I
> get
> >>> a
> >>> bunch of:
> >>> 10/10/27 15:14:57 INFO mapred.JobClient: Task Id :
> >>> attempt_201008241139_107461_m_000002_2, Status : FAILED
> >>> java.lang.IllegalStateException: No clusters found. Check your -c path.
> >>>        at
> >>>
> >>>
> org.apache.mahout.clustering.kmeans.KMeansMapper.setup(KMeansMapper.java:61)
> >>>        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:142)
> >>>        at
> org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
> >>>        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
> >>>        at org.apache.hadoop.mapred.Child.main(Child.java:170)
> >>>
> >>> And the job eventually fails with:
> >>> Exception in thread "main" java.lang.InterruptedException: K-Means
> >>> Iteration
> >>> failed processing reuters-clusters/part-randomSeed
> >>>        at
> >>>
> >>>
> org.apache.mahout.clustering.kmeans.KMeansDriver.runIteration(KMeansDriver.java:342)
> >>>        at
> >>>
> >>>
> org.apache.mahout.clustering.kmeans.KMeansDriver.buildClustersMR(KMeansDriver.java:289)
> >>>        at
> >>>
> >>>
> org.apache.mahout.clustering.kmeans.KMeansDriver.buildClusters(KMeansDriver.java:214)
> >>>        at
> >>>
> >>>
> org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:143)
> >>>        at
> >>>
> >>>
> org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:101)
> >>>        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> >>>        at
> >>>
> >>>
> org.apache.mahout.clustering.kmeans.KMeansDriver.main(KMeansDriver.java:54)
> >>>        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> >>>        at
> >>>
> >>>
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> >>>        at
> >>>
> >>>
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> >>>        at java.lang.reflect.Method.invoke(Method.java:597)
> >>>        at
> >>>
> >>>
> org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
> >>>        at
> >>> org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
> >>>        at
> >>> org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:184)
> >>>        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> >>>        at
> >>>
> >>>
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> >>>        at
> >>>
> >>>
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> >>>        at java.lang.reflect.Method.invoke(Method.java:597)
> >>>        at org.apache.hadoop.util.RunJar.main(RunJar.java:186)
> >>>
> >>> Any thoughts on this one?
> >>>
> >>> Cheers,
> >>> Matt
> >>>
> >>> On Wed, Oct 27, 2010 at 5:39 PM, Ted Dunning<te...@gmail.com>
> >>>  wrote:
> >>>
> >>>  On Wed, Oct 27, 2010 at 2:04 PM, Matt Spitz<ms...@meebo-inc.com>
> >>>>  wrote:
> >>>>
> >>>>  Ted- Awesome!  Thanks!  Running with mahout-0.4 sorted things out
> >>>>> rather
> >>>>> nicely.
> >>>>>
> >>>>>  There were lots of improvements there.
> >>>>
> >>>>
> >>>>  One thing that I find really weird is that 'mahout seqdirectory'
> always
> >>>>> hits
> >>>>> the local filesystem for input, even when running in Hadoop mode.
>  So,
> >>>>> if
> >>>>>
> >>>> I
> >>>>
> >>>>> have 'myurls' on the DFS, but that path doesn't exist locally,
> >>>>>
> >>>> seqdirectory
> >>>>
> >>>>> creates an empty sequence file (with no error).  Is this expected?
> >>>>>
> >>>>>  No.  That sounds like a bug.  Can you file a report here:
> >>>> https://issues.apache.org/jira/browse/MAHOUT ?
> >>>>
> >>>>
> >>>>  Is there a nice way to create sequence files that isn't seqdirectory?
> >>>>>
> >>>>  I'd
> >>>>
> >>>>> like to do a little processing on the documents as they get sent to
> the
> >>>>> sequence file without having to generate a second copy on the DFS.
> >>>>>
> >>>>>  Sure.  Just snarf the code from the program in question and massage
> it
> >>>> as
> >>>> you like.  The command line versions are handy,
> >>>> but it is very common to need to customize.  At that point, the
> command
> >>>> line
> >>>> programs serve as example code.  You don't
> >>>> have to use them and they have no magic.
> >>>>
> >>>> If you think you have some improvements in generality, we can push
> them
> >>>> back
> >>>> into the Mahout versions.
> >>>>
> >>>>
> >
>
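
The -k seeding Jeff describes above — sample k vectors from the input and write them into the empty clusters directory (the part-randomSeed file) — is essentially reservoir sampling. A minimal Python sketch of the idea; this is not Mahout's actual RandomSeedGenerator, just the sampling logic it embodies:

```python
import random

def sample_initial_clusters(vectors, k, seed=None):
    # Keep a uniform random sample of k items from a stream of vectors
    # (classic reservoir sampling), so the initial cluster centers are
    # an unbiased sample of the input. If fewer than k vectors exist,
    # all of them are returned.
    rng = random.Random(seed)
    reservoir = []
    for i, v in enumerate(vectors):
        if len(reservoir) < k:
            reservoir.append(v)
        else:
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = v
    return reservoir
```

This also illustrates the overwrite behavior noted above: passing -k alongside an existing -c directory replaces whatever was there with a fresh sample like this one.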

RE: Getting mahout to run on the DFS

Posted by Jeff Eastman <je...@Narus.com>.
Maybe I missed it, but are you running on trunk? Can you run examples/bin/build-reuters.sh out of the box? I'm running that successfully on a CDH3 cluster logged in as myself.

Jeff

-----Original Message-----
From: Matt Spitz [mailto:mspitz@meebo-inc.com] 
Sent: Thursday, October 28, 2010 9:01 AM
To: user@mahout.apache.org
Subject: Re: Getting mahout to run on the DFS

Hm.  So, I'm running the Cloudera Hadoop distribution, and I'm running as a
hadoop user.

Here's my script (with MAHOUT_HOME, HADOOP_HOME, and HADOOP_CONF_DIR
specified):
./bin/mahout seqdirectory -i ../reuters-out -o reuters-seqdir -c UTF-8
-chunk 5

./bin/mahout seq2sparse -i reuters-seqdir/ -o reuters-seqdir-sparse
./bin/mahout kmeans -i reuters-seqdir-sparse/tf-vectors/ -o reuters-kmeans
-k 20 --clustering --maxIter 10 --clusters reuters-clusters

../reuters-out is a local directory (see
https://issues.apache.org/jira/browse/MAHOUT-535)

After running the third command, I see a non-empty reuters-clusters
directory on the DFS, so presumably the initial clusters are getting
created.

These commands run fine in local mode, but no dice running on the DFS.  I
even copied the reuters-clusters directory from the DFS to my local machine,
hoping that mahout was looking there, but I still got the same error(s):
10/10/28 08:57:55 INFO mapred.JobClient: Task Id :
attempt_201008241139_108731_m_000000_2, Status : FAILED
java.lang.IllegalStateException: No clusters found. Check your -c path.
        at
org.apache.mahout.clustering.kmeans.KMeansMapper.setup(KMeansMapper.java:61)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:142)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
        at org.apache.hadoop.mapred.Child.main(Child.java:170)

10/10/28 08:58:05 INFO mapred.JobClient: Job complete:
job_201008241139_108731
10/10/28 08:58:05 INFO mapred.JobClient: Counters: 4
10/10/28 08:58:05 INFO mapred.JobClient:   Job Counters
10/10/28 08:58:05 INFO mapred.JobClient:     Rack-local map tasks=2
10/10/28 08:58:05 INFO mapred.JobClient:     Launched map tasks=4
10/10/28 08:58:05 INFO mapred.JobClient:     Data-local map tasks=2
10/10/28 08:58:05 INFO mapred.JobClient:     Failed map tasks=1
Exception in thread "main" java.lang.InterruptedException: K-Means Iteration
failed processing reuters-clusters/part-randomSeed
        at
org.apache.mahout.clustering.kmeans.KMeansDriver.runIteration(KMeansDriver.java:342)
        at
org.apache.mahout.clustering.kmeans.KMeansDriver.buildClustersMR(KMeansDriver.java:289)
        at
org.apache.mahout.clustering.kmeans.KMeansDriver.buildClusters(KMeansDriver.java:214)
        at
org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:143)
        at
org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:101)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at
org.apache.mahout.clustering.kmeans.KMeansDriver.main(KMeansDriver.java:54)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at
org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
        at
org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
        at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:184)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:186)

As a sanity check:
[mspitz@wowzers mahout-distribution-0.4-SNAPSHOT]$ hadoop dfs -du
reuters-clusters
Found 1 items
8246        hdfs://
192.168.1.100:54310/user/mspitz/reuters-clusters/part-randomSeed

By the way, I really really appreciate the help.  Thank you so much.

-Matt

On Thu, Oct 28, 2010 at 10:01 AM, Jeff Eastman
<jd...@windwardsolutions.com>wrote:

> With k-means, the initial clusters directory can either 1) contain some
> initial clusters you produced somehow (a common method is via Canopy) or 2)
> be empty. In the empty case, however, you also need to specify the number of
> initial clusters (-k) so that your input data can be sampled and the initial
> clusters put into the empty directory. Note that if you do 1) and also
> specify -k, your initial clusters will be overwritten by k sampled
> values from your input data.
>
>
> On 10/28/10 3:35 AM, pragnesh radadia wrote:
>
>> are you using the Cloudera Hadoop distribution?
>> if yes, then run kmeans as the hadoop or hdfs user to solve your problem
>>
>>
>> On Thu, Oct 28, 2010 at 3:45 AM, Matt Spitz<ms...@meebo-inc.com>  wrote:
>>
>>> Bug report created!  Thanks!
>>>
>>> One more random question: when running kmeans, there's a required -c
>>> (initial clusters) argument.  All the examples I've seen using kmeans
>>> (namely https://issues.apache.org/jira/browse/MAHOUT-390) specify a
>>> non-existent directory (presumably the algorithm would select some
>>> initial
>>> random clusters).
>>>
>>> But, when specifying some initial, nonexistent clusters directory, I get
>>> a
>>> bunch of:
>>> 10/10/27 15:14:57 INFO mapred.JobClient: Task Id :
>>> attempt_201008241139_107461_m_000002_2, Status : FAILED
>>> java.lang.IllegalStateException: No clusters found. Check your -c path.
>>>        at
>>>
>>> org.apache.mahout.clustering.kmeans.KMeansMapper.setup(KMeansMapper.java:61)
>>>        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:142)
>>>        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
>>>        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
>>>        at org.apache.hadoop.mapred.Child.main(Child.java:170)
>>>
>>> And the job eventually fails with:
>>> Exception in thread "main" java.lang.InterruptedException: K-Means
>>> Iteration
>>> failed processing reuters-clusters/part-randomSeed
>>>        at
>>>
>>> org.apache.mahout.clustering.kmeans.KMeansDriver.runIteration(KMeansDriver.java:342)
>>>        at
>>>
>>> org.apache.mahout.clustering.kmeans.KMeansDriver.buildClustersMR(KMeansDriver.java:289)
>>>        at
>>>
>>> org.apache.mahout.clustering.kmeans.KMeansDriver.buildClusters(KMeansDriver.java:214)
>>>        at
>>>
>>> org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:143)
>>>        at
>>>
>>> org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:101)
>>>        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>>        at
>>>
>>> org.apache.mahout.clustering.kmeans.KMeansDriver.main(KMeansDriver.java:54)
>>>        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>        at
>>>
>>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>>        at
>>>
>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>        at java.lang.reflect.Method.invoke(Method.java:597)
>>>        at
>>>
>>> org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
>>>        at
>>> org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
>>>        at
>>> org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:184)
>>>        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>        at
>>>
>>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>>        at
>>>
>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>        at java.lang.reflect.Method.invoke(Method.java:597)
>>>        at org.apache.hadoop.util.RunJar.main(RunJar.java:186)
>>>
>>> Any thoughts on this one?
>>>
>>> Cheers,
>>> Matt
>>>
>>> On Wed, Oct 27, 2010 at 5:39 PM, Ted Dunning<te...@gmail.com>
>>>  wrote:
>>>
>>>  On Wed, Oct 27, 2010 at 2:04 PM, Matt Spitz<ms...@meebo-inc.com>
>>>>  wrote:
>>>>
>>>>  Ted- Awesome!  Thanks!  Running with mahout-0.4 sorted things out
>>>>> rather
>>>>> nicely.
>>>>>
>>>>>  There were lots of improvements there.
>>>>
>>>>
>>>>  One thing that I find really weird is that 'mahout seqdirectory' always
>>>>> hits
>>>>> the local filesystem for input, even when running in Hadoop mode.  So,
>>>>> if
>>>>>
>>>> I
>>>>
>>>>> have 'myurls' on the DFS, but that path doesn't exist locally,
>>>>>
>>>> seqdirectory
>>>>
>>>>> creates an empty sequence file (with no error).  Is this expected?
>>>>>
>>>>>  No.  That sounds like a bug.  Can you file a report here:
>>>> https://issues.apache.org/jira/browse/MAHOUT ?
>>>>
>>>>
>>>>  Is there a nice way to create sequence files that isn't seqdirectory?
>>>>>
>>>>  I'd
>>>>
>>>>> like to do a little processing on the documents as they get sent to the
>>>>> sequence file without having to generate a second copy on the DFS.
>>>>>
>>>>>  Sure.  Just snarf the code from the program in question and massage it
>>>> as
>>>> you like.  The command line versions are handy,
>>>> but it is very common to need to customize.  At that point, the command
>>>> line
>>>> programs serve as example code.  You don't
>>>> have to use them and they have no magic.
>>>>
>>>> If you think you have some improvements in generality, we can push them
>>>> back
>>>> into the Mahout versions.
>>>>
>>>>
>
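
The recurring symptom in this thread — a relative input path resolving under file:/home/... instead of hdfs://... — comes down to how Hadoop qualifies paths against the configured default filesystem. A rough sketch of that resolution (illustrative only, not Hadoop's actual Path.makeQualified; the host and directories are invented):

```python
from urllib.parse import urlparse

def qualify(path, default_fs, working_dir):
    """Fully qualify a path the way Hadoop conceptually does:
    a path that already has a scheme is left alone; a relative path
    is resolved against the working directory and then prefixed with
    the default filesystem URI. When HADOOP_CONF_DIR is not on the
    classpath, the default filesystem falls back to 'file://', so
    relative paths quietly become local ones."""
    if urlparse(path).scheme:      # already qualified, e.g. hdfs://host/...
        return path
    if not path.startswith("/"):   # relative: resolve against working dir
        path = working_dir.rstrip("/") + "/" + path
    return default_fs.rstrip("/") + path

print(qualify("reuters-clusters", "hdfs://namenode:54310", "/user/mspitz"))
# -> hdfs://namenode:54310/user/mspitz/reuters-clusters
```

With default_fs of "file://" the same call yields file:/home/.../reuters-clusters, which matches the "Input path does not exist: file:/..." error earlier in the thread.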

Re: Getting mahout to run on the DFS

Posted by Matt Spitz <ms...@meebo-inc.com>.
Hm.  So, I'm running the Cloudera Hadoop distribution, and I'm running as a
hadoop user.

Here's my script (with MAHOUT_HOME, HADOOP_HOME, and HADOOP_CONF_DIR
specified):
./bin/mahout seqdirectory -i ../reuters-out -o reuters-seqdir -c UTF-8
-chunk 5

./bin/mahout seq2sparse -i reuters-seqdir/ -o reuters-seqdir-sparse
./bin/mahout kmeans -i reuters-seqdir-sparse/tf-vectors/ -o reuters-kmeans
-k 20 --clustering --maxIter 10 --clusters reuters-clusters

../reuters-out is a local directory (see
https://issues.apache.org/jira/browse/MAHOUT-535)

After running the third command, I see a non-empty reuters-clusters
directory on the DFS, so presumably the initial clusters are getting
created.

These commands run fine in local mode, but no dice running on the DFS.  I
even copied the reuters-clusters directory from the DFS to my local machine,
hoping that mahout was looking there, but I still got the same error(s):
10/10/28 08:57:55 INFO mapred.JobClient: Task Id :
attempt_201008241139_108731_m_000000_2, Status : FAILED
java.lang.IllegalStateException: No clusters found. Check your -c path.
        at
org.apache.mahout.clustering.kmeans.KMeansMapper.setup(KMeansMapper.java:61)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:142)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
        at org.apache.hadoop.mapred.Child.main(Child.java:170)

10/10/28 08:58:05 INFO mapred.JobClient: Job complete:
job_201008241139_108731
10/10/28 08:58:05 INFO mapred.JobClient: Counters: 4
10/10/28 08:58:05 INFO mapred.JobClient:   Job Counters
10/10/28 08:58:05 INFO mapred.JobClient:     Rack-local map tasks=2
10/10/28 08:58:05 INFO mapred.JobClient:     Launched map tasks=4
10/10/28 08:58:05 INFO mapred.JobClient:     Data-local map tasks=2
10/10/28 08:58:05 INFO mapred.JobClient:     Failed map tasks=1
Exception in thread "main" java.lang.InterruptedException: K-Means Iteration
failed processing reuters-clusters/part-randomSeed
        at
org.apache.mahout.clustering.kmeans.KMeansDriver.runIteration(KMeansDriver.java:342)
        at
org.apache.mahout.clustering.kmeans.KMeansDriver.buildClustersMR(KMeansDriver.java:289)
        at
org.apache.mahout.clustering.kmeans.KMeansDriver.buildClusters(KMeansDriver.java:214)
        at
org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:143)
        at
org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:101)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at
org.apache.mahout.clustering.kmeans.KMeansDriver.main(KMeansDriver.java:54)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at
org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
        at
org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
        at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:184)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:186)

As a sanity check:
[mspitz@wowzers mahout-distribution-0.4-SNAPSHOT]$ hadoop dfs -du
reuters-clusters
Found 1 items
8246        hdfs://192.168.1.100:54310/user/mspitz/reuters-clusters/part-randomSeed
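
The failure pattern above (seed clusters written by the driver, then invisible to the
mappers) is consistent with the driver and some other component resolving the same
relative path against different filesystems. A rough Python sketch of that resolution
logic, with the host, port, and home directory taken from the du output above; Hadoop's
actual FileSystem/Path qualification code is the authoritative version:

```python
from urllib.parse import urlparse

def qualify(path, default_fs, working_dir):
    """Resolve a path roughly the way Hadoop qualifies a Path: an explicit
    scheme wins; otherwise the path is made absolute against the working
    directory and prefixed with the default filesystem URI."""
    if urlparse(path).scheme:      # already fully qualified, e.g. hdfs://...
        return path
    if not path.startswith("/"):   # relative: resolve against the working dir
        path = working_dir + "/" + path
    return default_fs + path

# With fs.default.name picked up from HADOOP_CONF_DIR, a bare relative
# path lands on HDFS:
print(qualify("reuters-clusters/part-randomSeed",
              "hdfs://192.168.1.100:54310", "/user/mspitz"))
# A component that falls back to the local filesystem resolves the same
# string to a file:/ path that does not exist:
print(qualify("reuters-clusters/part-randomSeed",
              "file:", "/home/mspitz"))
```

Passing fully qualified hdfs:// paths on the command line sidesteps the ambiguity
while the underlying cause is being tracked down.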

By the way, I really really appreciate the help.  Thank you so much.

-Matt

On Thu, Oct 28, 2010 at 10:01 AM, Jeff Eastman
<jd...@windwardsolutions.com>wrote:

> With k-means, the initial clusters directory can either 1) contain some
> initial clusters you produced somehow (a common method is via Canopy) or 2)
> be empty. In the empty case, however, you also need to specify the number of
> initial clusters (-k) so that your input data can be sampled and the initial
> clusters put into the empty directory. Note that if you do 1) and also
> specify -k that your initial clusters will be overwritten by k sampled
> values from your input data.
>
>
> On 10/28/10 3:35 AM, pragnesh radadia wrote:
>
>> Are you using the Cloudera Hadoop distribution?
>> If so, run kmeans as the hadoop or hdfs user to solve your problem.
>>
>>
>> On Thu, Oct 28, 2010 at 3:45 AM, Matt Spitz<ms...@meebo-inc.com>  wrote:
>>
>>> Bug report created!  Thanks!
>>>
>>> One more random question: when running kmeans, there's a required -c
>>> (initial clusters) argument.  All the examples I've seen using kmeans
>>> (namely https://issues.apache.org/jira/browse/MAHOUT-390) specify a
>>> non-existent directory (presumably the algorithm would select some
>>> initial
>>> random clusters).
>>>
>>> But, when specifying some initial, nonexistent clusters directory, I get
>>> a
>>> bunch of:
>>> 10/10/27 15:14:57 INFO mapred.JobClient: Task Id :
>>> attempt_201008241139_107461_m_000002_2, Status : FAILED
>>> java.lang.IllegalStateException: No clusters found. Check your -c path.
>>>        at
>>>
>>> org.apache.mahout.clustering.kmeans.KMeansMapper.setup(KMeansMapper.java:61)
>>>        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:142)
>>>        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
>>>        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
>>>        at org.apache.hadoop.mapred.Child.main(Child.java:170)
>>>
>>> And the job eventually fails with:
>>> Exception in thread "main" java.lang.InterruptedException: K-Means
>>> Iteration
>>> failed processing reuters-clusters/part-randomSeed
>>>        at
>>>
>>> org.apache.mahout.clustering.kmeans.KMeansDriver.runIteration(KMeansDriver.java:342)
>>>        at
>>>
>>> org.apache.mahout.clustering.kmeans.KMeansDriver.buildClustersMR(KMeansDriver.java:289)
>>>        at
>>>
>>> org.apache.mahout.clustering.kmeans.KMeansDriver.buildClusters(KMeansDriver.java:214)
>>>        at
>>>
>>> org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:143)
>>>        at
>>>
>>> org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:101)
>>>        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>>        at
>>>
>>> org.apache.mahout.clustering.kmeans.KMeansDriver.main(KMeansDriver.java:54)
>>>        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>        at
>>>
>>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>>        at
>>>
>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>        at java.lang.reflect.Method.invoke(Method.java:597)
>>>        at
>>>
>>> org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
>>>        at
>>> org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
>>>        at
>>> org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:184)
>>>        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>        at
>>>
>>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>>        at
>>>
>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>        at java.lang.reflect.Method.invoke(Method.java:597)
>>>        at org.apache.hadoop.util.RunJar.main(RunJar.java:186)
>>>
>>> Any thoughts on this one?
>>>
>>> Cheers,
>>> Matt
>>>
>>> On Wed, Oct 27, 2010 at 5:39 PM, Ted Dunning<te...@gmail.com>
>>>  wrote:
>>>
>>>  On Wed, Oct 27, 2010 at 2:04 PM, Matt Spitz<ms...@meebo-inc.com>
>>>>  wrote:
>>>>
>>>>  Ted- Awesome!  Thanks!  Running with mahout-0.4 sorted things out
>>>>> rather
>>>>> nicely.
>>>>>
>>>>>  There were lots of improvements there.
>>>>
>>>>
>>>>  One thing that I find really weird is that 'mahout seqdirectory' always
>>>>> hits
>>>>> the local filesystem for input, even when running in Hadoop mode.  So,
>>>>> if
>>>>>
>>>> I
>>>>
>>>>> have 'myurls' on the DFS, but that path doesn't exist locally,
>>>>>
>>>> seqdirectory
>>>>
>>>>> creates an empty sequence file (with no error).  Is this expected?
>>>>>
>>>>>  No.  That sounds like a bug.  Can you file a report here:
>>>> https://issues.apache.org/jira/browse/MAHOUT ?
>>>>
>>>>
>>>>  Is there a nice way to create sequence files that isn't seqdirectory?
>>>>>
>>>>  I'd
>>>>
>>>>> like to do a little processing on the documents as they get sent to the
>>>>> sequence file without having to generate a second copy on the DFS.
>>>>>
>>>>>  Sure.  Just snarf the code from the program in question and massage it
>>>> as
>>>> you like.  The command line versions are handy,
>>>> but it is very common to need to customize.  At that point, the command
>>>> line
>>>> programs serve as example code.  You don't
>>>> have to use them and they have no magic.
>>>>
>>>> If you think you have some improvements in generality, we can push them
>>>> back
>>>> into the Mahout versions.
>>>>
>>>>
>

Re: Getting mahout to run on the DFS

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
With k-means, the initial clusters directory can either 1) contain some 
initial clusters you produced somehow (a common method is via Canopy) or 
2) be empty. In the empty case, however, you also need to specify the 
number of initial clusters (-k) so that your input data can be sampled 
and the initial clusters put into the empty directory. Note that if you 
do 1) and also specify -k that your initial clusters will be overwritten 
by k sampled values from your input data.
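
The sampling step described above (Mahout's RandomSeedGenerator) amounts to picking
k input vectors at random and writing them into the -c directory as the starting
clusters. A minimal Python stand-in, purely to illustrate the behaviour; the real
code writes Cluster objects to a SequenceFile named part-randomSeed:

```python
import random

def random_seed(vectors, k, rng=None):
    """Pick k input vectors at random to serve as the initial cluster
    centers, as happens when -k is given and the -c directory is empty."""
    rng = rng or random.Random(42)
    if k > len(vectors):
        raise ValueError("fewer input vectors than requested clusters")
    return rng.sample(vectors, k)

vectors = [(float(i), float(i % 3)) for i in range(100)]
seeds = random_seed(vectors, 20)
print(len(seeds))  # → 20, one seed cluster per -k
```

Note the corollary Jeff points out: because sampling runs whenever -k is present,
any clusters already sitting in -c are overwritten.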

On 10/28/10 3:35 AM, pragnesh radadia wrote:
> Are you using the Cloudera Hadoop distribution?
> If so, run kmeans as the hadoop or hdfs user to solve your problem.
>
>
> On Thu, Oct 28, 2010 at 3:45 AM, Matt Spitz<ms...@meebo-inc.com>  wrote:
>> Bug report created!  Thanks!
>>
>> One more random question: when running kmeans, there's a required -c
>> (initial clusters) argument.  All the examples I've seen using kmeans
>> (namely https://issues.apache.org/jira/browse/MAHOUT-390) specify a
>> non-existent directory (presumably the algorithm would select some initial
>> random clusters).
>>
>> But, when specifying some initial, nonexistent clusters directory, I get a
>> bunch of:
>> 10/10/27 15:14:57 INFO mapred.JobClient: Task Id :
>> attempt_201008241139_107461_m_000002_2, Status : FAILED
>> java.lang.IllegalStateException: No clusters found. Check your -c path.
>>         at
>> org.apache.mahout.clustering.kmeans.KMeansMapper.setup(KMeansMapper.java:61)
>>         at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:142)
>>         at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
>>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
>>         at org.apache.hadoop.mapred.Child.main(Child.java:170)
>>
>> And the job eventually fails with:
>> Exception in thread "main" java.lang.InterruptedException: K-Means Iteration
>> failed processing reuters-clusters/part-randomSeed
>>         at
>> org.apache.mahout.clustering.kmeans.KMeansDriver.runIteration(KMeansDriver.java:342)
>>         at
>> org.apache.mahout.clustering.kmeans.KMeansDriver.buildClustersMR(KMeansDriver.java:289)
>>         at
>> org.apache.mahout.clustering.kmeans.KMeansDriver.buildClusters(KMeansDriver.java:214)
>>         at
>> org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:143)
>>         at
>> org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:101)
>>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>         at
>> org.apache.mahout.clustering.kmeans.KMeansDriver.main(KMeansDriver.java:54)
>>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>         at
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>         at
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>         at java.lang.reflect.Method.invoke(Method.java:597)
>>         at
>> org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
>>         at
>> org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
>>         at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:184)
>>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>         at
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>         at
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>         at java.lang.reflect.Method.invoke(Method.java:597)
>>         at org.apache.hadoop.util.RunJar.main(RunJar.java:186)
>>
>> Any thoughts on this one?
>>
>> Cheers,
>> Matt
>>
>> On Wed, Oct 27, 2010 at 5:39 PM, Ted Dunning<te...@gmail.com>  wrote:
>>
>>> On Wed, Oct 27, 2010 at 2:04 PM, Matt Spitz<ms...@meebo-inc.com>  wrote:
>>>
>>>> Ted- Awesome!  Thanks!  Running with mahout-0.4 sorted things out rather
>>>> nicely.
>>>>
>>> There were lots of improvements there.
>>>
>>>
>>>> One thing that I find really weird is that 'mahout seqdirectory' always
>>>> hits
>>>> the local filesystem for input, even when running in Hadoop mode.  So, if
>>> I
>>>> have 'myurls' on the DFS, but that path doesn't exist locally,
>>> seqdirectory
>>>> creates an empty sequence file (with no error).  Is this expected?
>>>>
>>> No.  That sounds like a bug.  Can you file a report here:
>>> https://issues.apache.org/jira/browse/MAHOUT ?
>>>
>>>
>>>> Is there a nice way to create sequence files that isn't seqdirectory?
>>>   I'd
>>>> like to do a little processing on the documents as they get sent to the
>>>> sequence file without having to generate a second copy on the DFS.
>>>>
>>> Sure.  Just snarf the code from the program in question and massage it as
>>> you like.  The command line versions are handy,
>>> but it is very common to need to customize.  At that point, the command
>>> line
>>> programs serve as example code.  You don't
>>> have to use them and they have no magic.
>>>
>>> If you think you have some improvements in generality, we can push them
>>> back
>>> into the Mahout versions.
>>>


Re: Getting mahout to run on the DFS

Posted by pragnesh radadia <pr...@gmail.com>.
Are you using the Cloudera Hadoop distribution?
If so, run kmeans as the hadoop or hdfs user to solve your problem.


On Thu, Oct 28, 2010 at 3:45 AM, Matt Spitz <ms...@meebo-inc.com> wrote:
> Bug report created!  Thanks!
>
> One more random question: when running kmeans, there's a required -c
> (initial clusters) argument.  All the examples I've seen using kmeans
> (namely https://issues.apache.org/jira/browse/MAHOUT-390) specify a
> non-existent directory (presumably the algorithm would select some initial
> random clusters).
>
> But, when specifying some initial, nonexistent clusters directory, I get a
> bunch of:
> 10/10/27 15:14:57 INFO mapred.JobClient: Task Id :
> attempt_201008241139_107461_m_000002_2, Status : FAILED
> java.lang.IllegalStateException: No clusters found. Check your -c path.
>        at
> org.apache.mahout.clustering.kmeans.KMeansMapper.setup(KMeansMapper.java:61)
>        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:142)
>        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
>        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
>        at org.apache.hadoop.mapred.Child.main(Child.java:170)
>
> And the job eventually fails with:
> Exception in thread "main" java.lang.InterruptedException: K-Means Iteration
> failed processing reuters-clusters/part-randomSeed
>        at
> org.apache.mahout.clustering.kmeans.KMeansDriver.runIteration(KMeansDriver.java:342)
>        at
> org.apache.mahout.clustering.kmeans.KMeansDriver.buildClustersMR(KMeansDriver.java:289)
>        at
> org.apache.mahout.clustering.kmeans.KMeansDriver.buildClusters(KMeansDriver.java:214)
>        at
> org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:143)
>        at
> org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:101)
>        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>        at
> org.apache.mahout.clustering.kmeans.KMeansDriver.main(KMeansDriver.java:54)
>        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>        at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>        at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>        at java.lang.reflect.Method.invoke(Method.java:597)
>        at
> org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
>        at
> org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
>        at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:184)
>        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>        at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>        at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>        at java.lang.reflect.Method.invoke(Method.java:597)
>        at org.apache.hadoop.util.RunJar.main(RunJar.java:186)
>
> Any thoughts on this one?
>
> Cheers,
> Matt
>
> On Wed, Oct 27, 2010 at 5:39 PM, Ted Dunning <te...@gmail.com> wrote:
>
>> On Wed, Oct 27, 2010 at 2:04 PM, Matt Spitz <ms...@meebo-inc.com> wrote:
>>
>> > Ted- Awesome!  Thanks!  Running with mahout-0.4 sorted things out rather
>> > nicely.
>> >
>>
>> There were lots of improvements there.
>>
>>
>> > One thing that I find really weird is that 'mahout seqdirectory' always
>> > hits
>> > the local filesystem for input, even when running in Hadoop mode.  So, if
>> I
>> > have 'myurls' on the DFS, but that path doesn't exist locally,
>> seqdirectory
>> > creates an empty sequence file (with no error).  Is this expected?
>> >
>>
>> No.  That sounds like a bug.  Can you file a report here:
>> https://issues.apache.org/jira/browse/MAHOUT ?
>>
>>
>> > Is there a nice way to create sequence files that isn't seqdirectory?
>>  I'd
>> > like to do a little processing on the documents as they get sent to the
>> > sequence file without having to generate a second copy on the DFS.
>> >
>>
>> Sure.  Just snarf the code from the program in question and massage it as
>> you like.  The command line versions are handy,
>> but it is very common to need to customize.  At that point, the command
>> line
>> programs serve as example code.  You don't
>> have to use them and they have no magic.
>>
>> If you think you have some improvements in generality, we can push them
>> back
>> into the Mahout versions.
>>
>

Re: Getting mahout to run on the DFS

Posted by Matt Spitz <ms...@meebo-inc.com>.
If it helps, here's the output:

[mspitz@wowzers mahout-0.4]$ MAHOUT_HOME=. HADOOP_HOME=/usr/lib/hadoop-0.20
HADOOP_CONF_DIR=/etc/hadoop-0.20/conf ./bin/mahout kmeans -i
reuters-seqdir-sparse/tf-vectors/ -o reuters-kmeans -k 20 --maxIter 10
--clusters reuters-clusters
Running on hadoop, using HADOOP_HOME=/usr/lib/hadoop-0.20
HADOOP_CONF_DIR=/etc/hadoop-0.20/conf
10/10/27 15:26:01 WARN driver.MahoutDriver: No kmeans.props found on
classpath, will use command-line arguments only
10/10/27 15:26:01 INFO common.AbstractJob: Command line arguments:
{--clusters=reuters-clusters, --convergenceDelta=0.5,
--distanceMeasure=org.apache.mahout.common.distance.SquaredEuclid
eanDistanceMeasure, --endPhase=2147483647,
--input=reuters-seqdir-sparse/tf-vectors/, --maxIter=10, --method=mapreduce,
--numClusters=20, --output=reuters-kmeans, --startPhase=0, --tempDir=temp}
*10/10/27 15:26:03 INFO common.HadoopUtil: Deleting reuters-clusters*
10/10/27 15:26:04 WARN util.NativeCodeLoader: Unable to load native-hadoop
library for your platform... using builtin-java classes where applicable
10/10/27 15:26:04 INFO compress.CodecPool: Got brand-new compressor
*10/10/27 15:26:11 INFO kmeans.RandomSeedGenerator: Wrote 20 vectors to
reuters-clusters/part-randomSeed*
10/10/27 15:26:12 INFO kmeans.KMeansDriver: Input:
reuters-seqdir-sparse/tf-vectors Clusters In:
reuters-clusters/part-randomSeed Out: reuters-kmeans Distance:
org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure
10/10/27 15:26:12 INFO kmeans.KMeansDriver: convergence: 0.5 max Iterations:
10 num Reduce Tasks: org.apache.mahout.math.VectorWritable Input Vectors: {}
10/10/27 15:26:12 INFO kmeans.KMeansDriver: K-Means Iteration 1
10/10/27 15:26:12 INFO common.HadoopUtil: Deleting reuters-kmeans/clusters-1
10/10/27 15:26:15 INFO input.FileInputFormat: Total input paths to process :
3
10/10/27 15:26:16 INFO mapred.JobClient: Running job:
job_201008241139_107488
10/10/27 15:26:17 INFO mapred.JobClient:  map 0% reduce 0%
10/10/27 15:26:30 INFO mapred.JobClient: Task Id :
attempt_201008241139_107488_m_000002_0, Status : FAILED
java.lang.IllegalStateException: No clusters found. Check your -c path.
        at
org.apache.mahout.clustering.kmeans.KMeansMapper.setup(KMeansMapper.java:61)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:142)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
        at org.apache.hadoop.mapred.Child.main(Child.java:170)

...

It looks like it's writing the random clusters (as expected), but then it
can't read them.

[mspitz@wowzers mahout-0.4]$ hadoop dfs -du reuters-clusters
Found 1 items
7970        hdfs://192.168.1.100:54310/user/mspitz/reuters-clusters/part-randomSeed


On Wed, Oct 27, 2010 at 6:15 PM, Matt Spitz <ms...@meebo-inc.com> wrote:

> Bug report created!  Thanks!
>
> One more random question: when running kmeans, there's a required -c
> (initial clusters) argument.  All the examples I've seen using kmeans
> (namely https://issues.apache.org/jira/browse/MAHOUT-390) specify a
> non-existent directory (presumably the algorithm would select some initial
> random clusters).
>
> But, when specifying some initial, nonexistent clusters directory, I get a
> bunch of:
> 10/10/27 15:14:57 INFO mapred.JobClient: Task Id :
> attempt_201008241139_107461_m_000002_2, Status : FAILED
> java.lang.IllegalStateException: No clusters found. Check your -c path.
>         at
> org.apache.mahout.clustering.kmeans.KMeansMapper.setup(KMeansMapper.java:61)
>         at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:142)
>         at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
>         at org.apache.hadoop.mapred.Child.main(Child.java:170)
>
> And the job eventually fails with:
> Exception in thread "main" java.lang.InterruptedException: K-Means
> Iteration failed processing reuters-clusters/part-randomSeed
>         at
> org.apache.mahout.clustering.kmeans.KMeansDriver.runIteration(KMeansDriver.java:342)
>         at
> org.apache.mahout.clustering.kmeans.KMeansDriver.buildClustersMR(KMeansDriver.java:289)
>         at
> org.apache.mahout.clustering.kmeans.KMeansDriver.buildClusters(KMeansDriver.java:214)
>         at
> org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:143)
>         at
> org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:101)
>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>         at
> org.apache.mahout.clustering.kmeans.KMeansDriver.main(KMeansDriver.java:54)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>         at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>         at java.lang.reflect.Method.invoke(Method.java:597)
>         at
> org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
>         at
> org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
>         at
> org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:184)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>         at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>         at java.lang.reflect.Method.invoke(Method.java:597)
>         at org.apache.hadoop.util.RunJar.main(RunJar.java:186)
>
> Any thoughts on this one?
>
> Cheers,
> Matt
>
> On Wed, Oct 27, 2010 at 5:39 PM, Ted Dunning <te...@gmail.com>wrote:
>
>> On Wed, Oct 27, 2010 at 2:04 PM, Matt Spitz <ms...@meebo-inc.com> wrote:
>>
>> > Ted- Awesome!  Thanks!  Running with mahout-0.4 sorted things out rather
>> > nicely.
>> >
>>
>> There were lots of improvements there.
>>
>>
>> > One thing that I find really weird is that 'mahout seqdirectory' always
>> > hits
>> > the local filesystem for input, even when running in Hadoop mode.  So,
>> if I
>> > have 'myurls' on the DFS, but that path doesn't exist locally,
>> seqdirectory
>> > creates an empty sequence file (with no error).  Is this expected?
>> >
>>
>> No.  That sounds like a bug.  Can you file a report here:
>> https://issues.apache.org/jira/browse/MAHOUT ?
>>
>>
>> > Is there a nice way to create sequence files that isn't seqdirectory?
>>  I'd
>> > like to do a little processing on the documents as they get sent to the
>> > sequence file without having to generate a second copy on the DFS.
>> >
>>
>> Sure.  Just snarf the code from the program in question and massage it as
>> you like.  The command line versions are handy,
>> but it is very common to need to customize.  At that point, the command
>> line
>> programs serve as example code.  You don't
>> have to use them and they have no magic.
>>
>> If you think you have some improvements in generality, we can push them
>> back
>> into the Mahout versions.
>>
>
>

Re: Getting mahout to run on the DFS

Posted by Matt Spitz <ms...@meebo-inc.com>.
Bug report created!  Thanks!

One more random question: when running kmeans, there's a required -c
(initial clusters) argument.  All the examples I've seen using kmeans
(namely https://issues.apache.org/jira/browse/MAHOUT-390) specify a
non-existent directory (presumably the algorithm would select some initial
random clusters).

But, when specifying some initial, nonexistent clusters directory, I get a
bunch of:
10/10/27 15:14:57 INFO mapred.JobClient: Task Id :
attempt_201008241139_107461_m_000002_2, Status : FAILED
java.lang.IllegalStateException: No clusters found. Check your -c path.
        at
org.apache.mahout.clustering.kmeans.KMeansMapper.setup(KMeansMapper.java:61)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:142)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
        at org.apache.hadoop.mapred.Child.main(Child.java:170)

And the job eventually fails with:
Exception in thread "main" java.lang.InterruptedException: K-Means Iteration
failed processing reuters-clusters/part-randomSeed
        at
org.apache.mahout.clustering.kmeans.KMeansDriver.runIteration(KMeansDriver.java:342)
        at
org.apache.mahout.clustering.kmeans.KMeansDriver.buildClustersMR(KMeansDriver.java:289)
        at
org.apache.mahout.clustering.kmeans.KMeansDriver.buildClusters(KMeansDriver.java:214)
        at
org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:143)
        at
org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:101)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at
org.apache.mahout.clustering.kmeans.KMeansDriver.main(KMeansDriver.java:54)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at
org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
        at
org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
        at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:184)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:186)

Any thoughts on this one?

Cheers,
Matt

On Wed, Oct 27, 2010 at 5:39 PM, Ted Dunning <te...@gmail.com> wrote:

> On Wed, Oct 27, 2010 at 2:04 PM, Matt Spitz <ms...@meebo-inc.com> wrote:
>
> > Ted- Awesome!  Thanks!  Running with mahout-0.4 sorted things out rather
> > nicely.
> >
>
> There were lots of improvements there.
>
>
> > One thing that I find really weird is that 'mahout seqdirectory' always
> > hits
> > the local filesystem for input, even when running in Hadoop mode.  So, if
> I
> > have 'myurls' on the DFS, but that path doesn't exist locally,
> seqdirectory
> > creates an empty sequence file (with no error).  Is this expected?
> >
>
> No.  That sounds like a bug.  Can you file a report here:
> https://issues.apache.org/jira/browse/MAHOUT ?
>
>
> > Is there a nice way to create sequence files that isn't seqdirectory?
>  I'd
> > like to do a little processing on the documents as they get sent to the
> > sequence file without having to generate a second copy on the DFS.
> >
>
> Sure.  Just snarf the code from the program in question and massage it as
> you like.  The command line versions are handy,
> but it is very common to need to customize.  At that point, the command
> line
> programs serve as example code.  You don't
> have to use them and they have no magic.
>
> If you think you have some improvements in generality, we can push them
> back
> into the Mahout versions.
>

Re: Getting mahout to run on the DFS

Posted by Ted Dunning <te...@gmail.com>.
On Wed, Oct 27, 2010 at 2:04 PM, Matt Spitz <ms...@meebo-inc.com> wrote:

> Ted- Awesome!  Thanks!  Running with mahout-0.4 sorted things out rather
> nicely.
>

There were lots of improvements there.


> One thing that I find really weird is that 'mahout seqdirectory' always
> hits
> the local filesystem for input, even when running in Hadoop mode.  So, if I
> have 'myurls' on the DFS, but that path doesn't exist locally, seqdirectory
> creates an empty sequence file (with no error).  Is this expected?
>

No.  That sounds like a bug.  Can you file a report here:
https://issues.apache.org/jira/browse/MAHOUT ?


> Is there a nice way to create sequence files that isn't seqdirectory?  I'd
> like to do a little processing on the documents as they get sent to the
> sequence file without having to generate a second copy on the DFS.
>

Sure.  Just snarf the code from the program in question and massage it as
you like.  The command line versions are handy,
but it is very common to need to customize.  At that point, the command line
programs serve as example code.  You don't
have to use them and they have no magic.

If you think you have some improvements in generality, we can push them back
into the Mahout versions.
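
For the specific case Matt raises (transforming documents on their way into a
sequence file), the seqdirectory pattern reduces to mapping each document to a
(key, value) record with a processing hook in between. A hypothetical sketch of
that shape in Python; a real version would emit the pairs through Hadoop's
SequenceFile.Writer with Text keys and values, as the seqdirectory tool does:

```python
def documents_to_records(docs, transform):
    """Turn (name, body) documents into (key, value) records, applying
    `transform` in-stream so no second copy of the corpus is needed."""
    for name, body in docs:
        yield "/" + name, transform(body)  # seqdirectory keys on the file path

docs = [("doc1.txt", "Hello World"), ("doc2.txt", "Mahout on the DFS")]
records = list(documents_to_records(docs, str.lower))
print(records)
# → [('/doc1.txt', 'hello world'), ('/doc2.txt', 'mahout on the dfs')]
```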

Re: Getting mahout to run on the DFS

Posted by Matt Spitz <ms...@meebo-inc.com>.
Ted- Awesome!  Thanks!  Running with mahout-0.4 sorted things out rather
nicely.

One thing that I find really weird is that 'mahout seqdirectory' always hits
the local filesystem for input, even when running in Hadoop mode.  So, if I
have 'myurls' on the DFS, but that path doesn't exist locally, seqdirectory
creates an empty sequence file (with no error).  Is this expected?

Is there a nice way to create sequence files that isn't seqdirectory?  I'd
like to do a little processing on the documents as they get sent to the
sequence file without having to generate a second copy on the DFS.

Thanks, folks,
Matt

On Tue, Oct 26, 2010 at 11:12 PM, Ted Dunning <te...@gmail.com> wrote:

> Matt,
>
> I can't help you with your problem directly, but there have been several
> fixes that are likely to bear on the LDA code
> since 0.3 was released.  In fact, we are in the process right now of
> releasing 0.4.
>
> These fixes have to do with improving the handling of command line
> arguments
> and basic correctness of vector
> math.  My guess is that you will be happier with a more recent edition.
>
> The easiest way to get there is check out the trunk version and compile
> that.  If you are a brave soul you can look
> at the email archives and snag the release candidate that is being voted on
> right now.
>
> On Tue, Oct 26, 2010 at 1:53 PM, Matt Spitz <ms...@meebo-inc.com> wrote:
>
> > I'm running mahout-0.3 (stable downloaded from the site), and I'm trying
> to
> > run lda as a hadoop job.  Specifically, from within mahout-0.3 (totally
> > clean extraction of the tarball), I'm running:
> > ./bin/mahout lda -i myurls-seqdir-sparse/vectors -o myurls-lda -k 20 -v
> > 50000 -w
> >
> > myurls-seqdir-sparse exists on the dfs in my home directory (I can
> `hadoop
> > dfs -ls` it), and it's in the right format (I borrowed some lines from
> the
> > build-reuters.sh script)
> >
> > I run the command with MAHOUT_HOME, HADOOP_HOME, and HADOOP_CONF_DIR set.
> >  The mahout script confirms this by saying:
> > "running on hadoop, using HADOOP_HOME=/usr/lib/hadoop-0.20 and
> > HADOOP_CONF_DIR=/etc/hadoop-0.20"
> >
> > But then I get the following:
> > 10/10/26 13:17:33 ERROR driver.MahoutDriver: MahoutDriver failed with
> args:
> > [-i, myurls-seqdir-sparse/vectors, -o, myurls-lda, -k, 20, -v, 50000, -w,
> > null]
> > Input path does not exist:
> >
> file:/home/mspitz/mahoutplayground/mahout-0.3/myurls-seqdir-sparse/vectors
> >
> > It's true, 'myurls-seqdir-sparse' doesn't exist locally, but shouldn't it
> > be
> > looking on the DFS if I'm running it as a hadoop jar job?  If it helps,
> the
> > explicit command it's executing (from the script) is:
> > /usr/lib/hadoop-0.20/bin/hadoop jar
> >  /home/mspitz/mahoutplayground/mahout-0.3/mahout-examples-0.3.job
> > org.apache.mahout.driver.MahoutDriver lda -i myurls-seqdir-sparse/vectors
> > -o
> > myurls-lda -k 20 -v 50000 -w
> >
> > The HDFS works mighty fine and is configured properly (I run Pig jobs on
> it
> > all the time).   I just can't get the mahout job to run over hadoop.
> >
> > Any thoughts?
> >
> > Thanks, folks!
> >
> > -Matt
> >
>

Re: Getting mahout to run on the DFS

Posted by Ted Dunning <te...@gmail.com>.
Matt,

I can't help you with your problem directly, but there have been several
fixes that are likely to bear on the LDA code
since 0.3 was released.  In fact, we are in the process right now of
releasing 0.4.

These fixes have to do with improving the handling of command line arguments
and basic correctness of vector
math.  My guess is that you will be happier with a more recent edition.

The easiest way to get there is to check out the trunk version and compile
that.  If you are a brave soul you can look
at the email archives and snag the release candidate that is being voted on
right now.
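The trunk checkout and build Ted suggests can be sketched as follows. The repository URL and Maven goals are assumptions based on Mahout's Subversion/Maven setup at the time, not something stated in this thread:

```shell
# Check out the development trunk and build it locally.
svn checkout https://svn.apache.org/repos/asf/mahout/trunk mahout-trunk
cd mahout-trunk

# Skip the (long-running) test suite for a faster first build; the
# resulting job jar for the examples module lands under examples/target/.
mvn -DskipTests clean install
```

Point MAHOUT_HOME at the checkout afterwards so `bin/mahout` picks up the freshly built jars.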

On Tue, Oct 26, 2010 at 1:53 PM, Matt Spitz <ms...@meebo-inc.com> wrote:

> I'm running mahout-0.3 (stable downloaded from the site), and I'm trying to
> run lda as a hadoop job.  Specifically, from within mahout-0.3 (totally
> clean extraction of the tarball), I'm running:
> ./bin/mahout lda -i myurls-seqdir-sparse/vectors -o myurls-lda -k 20 -v
> 50000 -w
>
> myurls-seqdir-sparse exists on the dfs in my home directory (I can `hadoop
> dfs -ls` it), and it's in the right format (I borrowed some lines from the
> build-reuters.sh script).
>
> I run the command with MAHOUT_HOME, HADOOP_HOME, and HADOOP_CONF_DIR set.
>  The mahout script confirms this by saying:
> "running on hadoop, using HADOOP_HOME=/usr/lib/hadoop-0.20 and
> HADOOP_CONF_DIR=/etc/hadoop-0.20"
>
> But then I get the following:
> 10/10/26 13:17:33 ERROR driver.MahoutDriver: MahoutDriver failed with args:
> [-i, myurls-seqdir-sparse/vectors, -o, myurls-lda, -k, 20, -v, 50000, -w,
> null]
> Input path does not exist:
> file:/home/mspitz/mahoutplayground/mahout-0.3/myurls-seqdir-sparse/vectors
>
> It's true, 'myurls-seqdir-sparse' doesn't exist locally, but shouldn't it
> be
> looking on the DFS if I'm running it as a hadoop jar job?  If it helps, the
> explicit command it's executing (from the script) is:
> /usr/lib/hadoop-0.20/bin/hadoop jar
>  /home/mspitz/mahoutplayground/mahout-0.3/mahout-examples-0.3.job
> org.apache.mahout.driver.MahoutDriver lda -i myurls-seqdir-sparse/vectors
> -o
> myurls-lda -k 20 -v 50000 -w
>
> The HDFS works mighty fine and is configured properly (I run Pig jobs on it
> all the time).   I just can't get the mahout job to run over hadoop.
>
> Any thoughts?
>
> Thanks, folks!
>
> -Matt
>