Posted to user@mahout.apache.org by Ahmad Ammari <am...@gmail.com> on 2011/11/16 10:57:44 UTC

Exceptions when running kmeans from the mahout launcher

Hello,

When trying to run the kmeans program from the mahout launcher (bin/mahout
kmeans) to cluster the reuters dataset on a single-node Hadoop cluster
(HDFS) on my Ubuntu laptop, I got an IndexOutOfBoundsException. The input
directory (reuters-vectors/tfidf-vectors/) resides on HDFS. I have checked,
and it does exist.

Here is what I ran:

bin/mahout kmeans -i reuters-vectors/tfidf-vectors/ -c
reuters-initial-clusters -o reuters-kmeans-clusters -dm
org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure -cd 1.0
-k 20 -x 20 -cl

And here is what I get:

Running on hadoop, using HADOOP_HOME=/usr/local/hadoop
No HADOOP_CONF_DIR set, using /usr/local/hadoop/src/conf
11/11/15 14:37:05 INFO common.AbstractJob: Command line arguments: {--clustering=null, --clusters=reuters-initial-clusters, --convergenceDelta=1.0, --distanceMeasure=org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure, --endPhase=2147483647, --input=/user/admin2/reuters-vectors/tfidf-vectors/, --maxIter=20, --method=mapreduce, --numClusters=20, --output=reuters-kmeans-clusters, --startPhase=0, --tempDir=temp}
11/11/15 14:37:05 INFO common.HadoopUtil: Deleting reuters-initial-clusters
11/11/15 14:37:06 INFO util.NativeCodeLoader: Loaded the native-hadoop library
11/11/15 14:37:06 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
11/11/15 14:37:06 INFO compress.CodecPool: Got brand-new compressor
Exception in thread "main" java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
    at java.util.ArrayList.RangeCheck(ArrayList.java:547)
    at java.util.ArrayList.get(ArrayList.java:322)
    at org.apache.mahout.clustering.kmeans.RandomSeedGenerator.buildRandom(RandomSeedGenerator.java:108)
    at org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:101)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.mahout.clustering.kmeans.KMeansDriver.main(KMeansDriver.java:58)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
    at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
    at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:187)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:156)

The previous mahout launcher programs, which create the sequence files
(seqdirectory) and the vectors (seq2sparse) from the reuters directory of
text files, worked fine for me. I have checked my HDFS file system, and the
input directory reuters-vectors/tfidf-vectors exists there.

I then tried copying the input directory (reuters-vectors/tfidf-vectors)
from HDFS to the local file system and ran the mahout launcher script
again. Now I am getting a FileNotFoundException instead of the
IndexOutOfBoundsException:

Running on hadoop, using HADOOP_HOME=/usr/local/hadoop
No HADOOP_CONF_DIR set, using /usr/local/hadoop/conf
11/11/15 19:19:42 INFO common.AbstractJob: Command line arguments: {--clustering=null, --clusters=reuters-initial-clusters, --convergenceDelta=1.0, --distanceMeasure=org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure, --endPhase=2147483647, --input=examples/reuters-vectors/tfidf-vectors/, --maxIter=20, --method=mapreduce, --numClusters=20, --output=reuters-kmeans-clusters, --startPhase=0, --tempDir=temp}
11/11/15 19:19:42 INFO common.HadoopUtil: Deleting reuters-initial-clusters
Exception in thread "main" java.io.FileNotFoundException: File does not exist: examples/reuters-vectors/tfidf-vectors
    at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:457)
    at org.apache.mahout.clustering.kmeans.RandomSeedGenerator.buildRandom(RandomSeedGenerator.java:67)
    at org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:96)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.mahout.clustering.kmeans.KMeansDriver.main(KMeansDriver.java:54)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
    at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
    at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:184)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:156)

What could be wrong? Why isn't mahout finding the input directory/files?
And where should the input directory be: on HDFS or on the local file
system?

Many thanks in advance,
Ahmad

RE: Exceptions when running kmeans from the mahout launcher

Posted by Jeff Eastman <je...@Narus.com>.

Usually we see this error when the expected input vectors are not present. Often it is a configuration issue. Verify your paths exist and that there are vectors where you expect them.
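
One way to run that check is sketched below. It is only a sketch under some assumptions: the input path and the user name admin2 are copied from the log in the original message, and resolve_hdfs_path is a hypothetical helper written for this example, not a Hadoop or Mahout command. It mimics how a relative path is resolved against the HDFS home directory (the log shows reuters-vectors/tfidf-vectors/ becoming /user/admin2/reuters-vectors/tfidf-vectors/), then lists the directory with hadoop fs -ls if the hadoop client is on the PATH. Note also that the second stack trace goes through DistributedFileSystem.getFileStatus, so in that run the path was looked up on HDFS and not on the local disk.

```shell
#!/bin/sh
# Sketch only: INPUT and HDFS_USER are taken from the log in this thread.
# resolve_hdfs_path is a hypothetical helper, not part of Hadoop or Mahout.

INPUT="reuters-vectors/tfidf-vectors"
HDFS_USER="admin2"

# Mimic how a relative path is resolved against the HDFS home directory
# (/user/<name>); absolute paths are returned unchanged.
resolve_hdfs_path() {
    case "$1" in
        /*) printf '%s\n' "$1" ;;
        *)  printf '/user/%s/%s\n' "$2" "$1" ;;
    esac
}

echo "kmeans will look for: $(resolve_hdfs_path "$INPUT" "$HDFS_USER")"

# Only run the real check where the hadoop client is installed.
if command -v hadoop >/dev/null 2>&1; then
    # List what is actually in the input directory on HDFS. The vector
    # files written by seq2sparse should be listed with a non-zero size.
    hadoop fs -ls "$INPUT"
fi
```

If the listing is empty, or shows only zero-length files, that would be consistent with the "Index: 0, Size: 0" error, since there would be no vectors from which to pick the initial clusters.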

Exceptions when running kmeans from the mahout launcher

Posted by Ahmad Ammari <am...@gmail.com>.

What could be wrong?

Why isn't mahout finding the input directory/files?

Where should the input directory be: on HDFS or on the local file system?

How come the other mahout launcher programs, which build the sequence
files and the vectors, were both able to read their input directories, but
the kmeans program is not?

Many thanks in advance,
Ahmad