Posted to user@mahout.apache.org by Marshall Chen <ia...@gmail.com> on 2010/08/02 18:19:03 UTC

problem when using k-means on sythetic contral data

Hi all,

      I ran into a problem when running k-means clustering on the synthetic
control data on a Hadoop cluster. My process is as follows:
1) convert the text file to a sequence file
     mahout seqdirectory -i /user/hadoop/synthetic -o /user/hadoop/synthetic_seqfile -c UTF-8
2) create vectors from the sequence file
     mahout seq2sparse -i /user/hadoop/synthetic_seqfile --norm 2 -o /user/hadoop/synthetic_vector --minDF 5 --maxDFPercent 90
3) cluster
     mahout kmeans -i /user/hadoop/synthetic_vector/vectors/part-00000 --k 6 -o /user/hadoop/synthetic_output -c /user/hadoop/synthetic_output/clusters

  Running the cluster step produces this error:
10/08/02 23:53:17 INFO util.NativeCodeLoader: Loaded the native-hadoop library
10/08/02 23:53:17 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
10/08/02 23:53:17 INFO compress.CodecPool: Got brand-new compressor
10/08/02 23:53:17 ERROR driver.MahoutDriver: MahoutDriver failed with args: [-i, /user/hadoop/synthetic_vector/tfidf/vectors/part-00000, --k, 6, -o, /user/hadoop/synthetic_output, -c, /user/hadoop/synthetic_output/clusters, null]
Index: 0, Size: 0
Exception in thread "main" java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
        at java.util.ArrayList.RangeCheck(ArrayList.java:547)
        at java.util.ArrayList.get(ArrayList.java:322)
        at org.apache.mahout.clustering.kmeans.RandomSeedGenerator.buildRandom(RandomSeedGenerator.java:113)
        at org.apache.mahout.clustering.kmeans.KMeansDriver.main(KMeansDriver.java:164)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
        at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
        at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:172)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:156)

    Could anyone point out what the problem is, or offer any suggestions?
Thanks

Bests,
Marshall

Re: problem when using k-means on sythetic contral data

Posted by Ted Dunning <te...@gmail.com>.
Ahh.... thanks for being brave enough to ask.

A JIRA is a bug ticket.  See http://issues.apache.org/jira/browse/MAHOUT

Filing a complete statement of the problem there will really help with
documenting the problem.  Also, if you can develop a patch that helps
fix the problem, you can attach it there and others can help refine it.

On Thu, Aug 19, 2010 at 5:33 AM, rmx <ru...@hotmail.com> wrote:

>
> Sorry about my ignorance, Ted, but what is a JIRA and how am I supposed to
> submit one?
> thanks
> Rui
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/problem-when-using-k-means-on-sythetic-contral-data-tp1016421p1223775.html
> Sent from the Mahout User List mailing list archive at Nabble.com.
>

Re: problem when using k-means on sythetic contral data

Posted by rmx <ru...@hotmail.com>.
Sorry about my ignorance, Ted, but what is a JIRA and how am I supposed to
submit one?
thanks
Rui

Re: problem when using k-means on sythetic contral data

Posted by Ted Dunning <te...@gmail.com>.
Can you file a JIRA for this?

On Wed, Aug 18, 2010 at 6:41 AM, rmx <ru...@hotmail.com> wrote:

> If you get the path wrong, Mahout doesn't return any error. Instead it
> produces an empty file and then explodes when you run the kmeans driver.
>

Re: problem when using k-means on sythetic contral data

Posted by rmx <ru...@hotmail.com>.
Hi. 

I had that problem in the past. It seems that the index was 0 because the
vector file was empty (around 78 KB).
In my case it happened because I had made a mistake in the input path
during the first step.
If you get the path wrong, Mahout doesn't return any error. Instead it
produces an empty file and then explodes when you run the kmeans driver.

However, after I fixed it, the driver exploded with a similar error, but this
time with index 1.
It seems that kmeans doesn't like my sparse vector file. I wonder if this is
because my initial file contains numeric values rather than text.

Please check the path, confirm that the files created are not empty, run the
driver, and post the results here.
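
One rough way to do that check from the shell is sketched below. It assumes the
paths from the original post and the usual `hadoop fs -ls` listing layout, in
which the fifth column is the file size in bytes. The helper names
`total_bytes` and `check_output` are hypothetical, not part of Hadoop or Mahout.

```shell
#!/bin/sh
# Hypothetical helper: sum the size column (5th field) of an
# `hadoop fs -ls` listing read from stdin. Only regular-file lines
# (those starting with "-") are counted; directory lines and the
# "Found N items" header are skipped.
total_bytes() {
    awk '/^-/ { sum += $5 } END { print sum + 0 }'
}

# Hypothetical helper: list one path and report whether anything was
# written there. Assumes `hadoop` is on the PATH.
check_output() {
    bytes=$(hadoop fs -ls "$1" | total_bytes)
    if [ "$bytes" -gt 0 ]; then
        echo "OK: $1 ($bytes bytes)"
    else
        echo "EMPTY: $1" >&2
        return 1
    fi
}

# Usage after each step (paths as in the original post):
#   check_output /user/hadoop/synthetic_seqfile
#   check_output /user/hadoop/synthetic_vector/vectors
```

This only catches the case where nothing at all was written. A sequence file
that holds a header but no records still has a non-zero size, so a
suspiciously small file is a warning sign rather than proof that the step
worked.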

Good luck
-- 
View this message in context: http://lucene.472066.n3.nabble.com/problem-when-using-k-means-on-sythetic-contral-data-tp1016421p1205880.html
Sent from the Mahout User List mailing list archive at Nabble.com.