
Posted to user@mahout.apache.org by Divya <di...@k2associates.com.sg> on 2010/11/03 10:44:34 UTC

Kmeans Clustering error with XML input

Hi,

 

These are the steps I am following for k-means clustering, using one chunk of the Wikipedia dump as input:

 

Convert XML into sequence format 

$ bin/mahout seqdirectory -i D:/MahoutResult/wikipedia/input -o D:/MahoutResult/wikipedia/sequencefiles -chunk 30 -c UTF-8

 

Convert Sequence format to Vector format 

$ bin/mahout seqdirectory -i D:/Downloads/Mahout/j-mahout/apache-mahout-examples/wikipedia/enwiki-20070527-pages-articles1.xml -o D:/MahoutResult/wikipedia/sequencefiles -chunk 100 -c UTF-8

 

Cluster data 

$ bin/mahout kmeans -i D:/MahoutResult/wikipedia/seq2sparse/tfidf-vectors -o D:/MahoutResult/wikipedia/kmeans -c D:/MahoutResult/wikipedia/kmeans -k 10 -x 20 -ow -cl

 

 

Whenever I run k-means clustering with an XML file as input, I get the following error:

 

Running on hadoop, using HADOOP_HOME=C:\cygwin\home\Divya\hadoop-0.20.2
HADOOP_CONF_DIR=C:\cygwin\home\Divya\hadoop-0.20.2\conf
10/11/03 17:35:53 INFO common.AbstractJob: Command line arguments: {--clustering=null, --clusters=D:/MahoutResult/wikipedia/kmeans, --convergenceDelta=0.5, --distanceMeasure=org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure, --endPhase=2147483647, --input=D:/MahoutResult/wikipedia/seq2sparse/tfidf-vectors, --maxIter=20, --method=mapreduce, --numClusters=10, --output=D:/MahoutResult/wikipedia/kmeans, --overwrite=null, --startPhase=0, --tempDir=temp}
10/11/03 17:35:55 INFO common.HadoopUtil: Deleting D:/MahoutResult/wikipedia/kmeans
10/11/03 17:35:56 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
10/11/03 17:35:56 INFO compress.CodecPool: Got brand-new compressor

Exception in thread "main" java.lang.IndexOutOfBoundsException: Index: 1, Size: 1
        at java.util.ArrayList.RangeCheck(ArrayList.java:547)
        at java.util.ArrayList.get(ArrayList.java:322)
        at org.apache.mahout.clustering.kmeans.RandomSeedGenerator.buildRandom(RandomSeedGenerator.java:107)
        at org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:96)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.mahout.clustering.kmeans.KMeansDriver.main(KMeansDriver.java:54)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
        at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
        at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:184)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:156)

 

 

 

Am I not supposed to use an XML file as input?

 

 

Regards,

Divya

Re: RE: Kmeans Clustering error with XML input

Posted by Matt Spitz <ms...@meebo-inc.com>.

Divya,

'seqdirectory' creates a document for every file in the directory you pass
in.  If there's just one file, there's just one document, and that's not
very interesting.
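
To illustrate why too few documents produces this kind of failure, here is a minimal Python sketch of choosing k random seed documents. It is not Mahout's RandomSeedGenerator code; the function name `pick_seeds` and the explicit check are assumptions made for illustration only.

```python
import random


def pick_seeds(documents, k, rng=None):
    """Pick k distinct documents to act as initial cluster centres.

    Hypothetical helper: it only mimics the idea of random seeding and
    reports the problem up front instead of failing on a list index.
    """
    rng = rng or random.Random(0)
    if len(documents) < k:
        raise ValueError(
            f"cannot pick {k} seeds from {len(documents)} document(s)"
        )
    return rng.sample(documents, k)
```

With a single document and k = 10 the check fails immediately, which matches the situation of one XML file being treated as one document.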

You basically have two options:
1) Parse the XML file once and break it into thousands of little files (one
per document, however you define it)
2) Write a new 'seqdirectory' that creates a sequence file based on parsed
XML input.  This actually isn't too difficult, as the seqdirectory code is
pretty straightforward (thanks to whoever did that!).
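
Option 1 could be sketched roughly as follows in Python. This is only a sketch under assumptions: a document is taken to be one <page> element, and the function names and the `page-NNNNNN.xml` file naming are hypothetical. It uses the standard library's streaming parser so the whole dump does not have to fit in memory.

```python
import os
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an XML namespace prefix such as '{http://...}page'."""
    return tag.rsplit("}", 1)[-1]


def split_pages(xml_path, out_dir):
    """Write each <page> element of xml_path to its own file in out_dir.

    Returns the number of files written.
    """
    os.makedirs(out_dir, exist_ok=True)
    count = 0
    # iterparse streams the input, so the dump is not loaded all at once.
    for _event, elem in ET.iterparse(xml_path, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        count += 1
        out_path = os.path.join(out_dir, f"page-{count:06d}.xml")
        with open(out_path, "wb") as out:
            out.write(ET.tostring(elem, encoding="utf-8"))
        elem.clear()  # drop this page's children to keep memory use low
    return count
```

The output directory can then be passed to seqdirectory as the input directory, so that each file becomes one document.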

-Matt

On Mon, Nov 8, 2010 at 1:39 AM, Divya <di...@k2associates.com.sg> wrote:

> Hi Matt,
>
> I have an XML input file like the Wikipedia XML, and I am trying to find
> similar documents using k-means clustering.
>
> But if I pass the whole XML file (64 MB) as input to k-means clustering,
> I get an error.
>
> According to your short answer, if I have thousands of documents in one
> XML file, I should split it into thousands of files.
>
> Is there any other way I can find similar documents?
>
> Regards,
>
> Divya
>
>
>
> *From:* Matt Spitz [mailto:mspitz@meebo-inc.com]
> *Sent:* Thursday, November 04, 2010 8:46 PM
> *To:* Divya
> *Cc:* user@mahout.apache.org
>
> *Subject:* Re: RE: Kmeans Clustering error with XML input
>
>
>
> Divya-
>
>
>
> A document is what the clustering algorithm operates on.  It finds
> similarities among the documents and places similar documents into clusters.
>  The 'seqdirectory' command expects you to have a single document in every
> file in the input directory.  What do you expect to happen with your
> Wikipedia clustering?  What are you trying to do?
>
>
>
> Short answer: yes, split the XML file by the <page> tags, putting each
> <page> element in its own separate file.
>
>
>
> -Matt
>
>
>
> On Wed, Nov 3, 2010 at 10:26 PM, Divya <di...@k2associates.com.sg> wrote:
>
> Hi Matt,
> I have split my file into 10 chunks of 10 MB each.
> I am still getting the error.
> Do you mean that I should split the XML file at each <page> </page>
> element, as in the Wikipedia example?
>
> I didn't understand what "one file = one document" means.
>
> Regards,
> Divya
>
>
>
>
>
> $ bin/mahout kmeans -i D:/MahoutResult/wikipedia/seq2sparse/tfidf-vectors -o D:/MahoutResult/wikipedia/Kmeans -dm org.apache.mahout.common.distance.CosineDistanceMeasure -c D:/MahoutResult/wikipedia/Kmeans -k 10 -x 20 -ow -cl
>
> Running on hadoop, using HADOOP_HOME=C:\cygwin\home\Divya\hadoop-0.20.2
> HADOOP_CONF_DIR=C:\cygwin\home\Divya\hadoop-0.20.2\conf
> 10/11/04 10:21:21 INFO common.AbstractJob: Command line arguments: {--clustering=null, --clusters=D:/MahoutResult/wikipedia/Kmeans, --convergenceDelta=0.5, --distanceMeasure=org.apache.mahout.common.distance.CosineDistanceMeasure, --endPhase=2147483647, --input=D:/MahoutResult/wikipedia/seq2sparse/tfidf-vectors, --maxIter=20, --method=mapreduce, --numClusters=10, --output=D:/MahoutResult/wikipedia/Kmeans, --overwrite=null, --startPhase=0, --tempDir=temp}
> 10/11/04 10:21:22 INFO common.HadoopUtil: Deleting D:/MahoutResult/wikipedia/Kmeans
> 10/11/04 10:21:22 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> 10/11/04 10:21:22 INFO compress.CodecPool: Got brand-new compressor
> Exception in thread "main" java.lang.IndexOutOfBoundsException: Index: 5, Size: 5
>        at java.util.ArrayList.RangeCheck(ArrayList.java:547)
>        at java.util.ArrayList.get(ArrayList.java:322)
>        at org.apache.mahout.clustering.kmeans.RandomSeedGenerator.buildRandom(RandomSeedGenerator.java:107)
>        at org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:96)
>        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>        at org.apache.mahout.clustering.kmeans.KMeansDriver.main(KMeansDriver.java:54)
>        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>        at java.lang.reflect.Method.invoke(Method.java:597)
>        at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
>        at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
>        at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:184)
>        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>        at java.lang.reflect.Method.invoke(Method.java:597)
>        at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
>
> -----Original Message-----
> From: Matt Spitz [mailto:mspitz@meebo-inc.com]
>
> Sent: Thursday, November 04, 2010 9:44 AM
> To: user@mahout.apache.org
> Cc: dev@mahout.apache.org
>
> Subject: Re: RE: Kmeans Clustering error with XML input
>
> Yes. One file = one document.
>
> Break the file into meaningful documents, one per file, and you should be
> golden.  The algorithm will then cluster these documents.
>
> ---
> Sent while mobile. Please forgive brevity and typos.
> On Nov 3, 2010 9:37 PM, "Divya" <di...@k2associates.com.sg> wrote:
> > Hi,
> >
> > My XML input file is just 64 MB, i.e. I am using one of the chunks of the
> > Wikipedia example.
> > Do I still need to break up this XML to get rid of the error below?
> >
> >
> > Thanks in advance
> > Regards,
> > Divya
> >
> > -----Original Message-----
> > From: Matt Spitz [mailto:mspitz@meebo-inc.com]
> > Sent: Wednesday, November 03, 2010 8:54 PM
> > To: user@mahout.apache.org
> > Cc: dev@mahout.apache.org
> > Subject: Re: Kmeans Clustering error with XML input
> >
> > Divya-
> >
> > Are you using just one input file? As far as I understand, seqdirectory
> > creates one document per file in your input directory. When you try to
> > cluster 1 document into 10 clusters, you get an IndexOutOfBoundsException
> > when generating the random input clusters. Which is just as well, because
> > your output won't be very interesting, anyway.
> >
> > Break the XML into at least 10 documents, and you should have better luck.
> >
> > -Matt
> >

RE: RE: Kmeans Clustering error with XML input

Posted by Divya <di...@k2associates.com.sg>.

Hi Matt,

I have an XML input file like the Wikipedia XML, and I am trying to find
similar documents using k-means clustering.

But if I pass the whole XML file (64 MB) as input to k-means clustering,
I get an error.

According to your short answer, if I have thousands of documents in one
XML file, I should split it into thousands of files.

Is there any other way I can find similar documents?

 

 

 

Regards,

Divya 

 


Re: RE: Kmeans Clustering error with XML input

Posted by Matt Spitz <ms...@meebo-inc.com>.

Divya-

A document is what the clustering algorithm operates on.  It finds
similarities among the documents and places similar documents into clusters.
 The 'seqdirectory' command expects you to have a single document in every
file in the input directory.  What do you expect to happen with your
Wikipedia clustering?  What are you trying to do?

Short answer: yes, split the XML file by the <page> tags, putting each
<page> element in its own separate file.
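
Before splitting, it can help to see what the resulting documents would be. The sketch below lists the title of every <page> element in a dump; it assumes each page carries a <title> child, as in the Wikipedia export format, and the name `iter_page_titles` is hypothetical.

```python
import xml.etree.ElementTree as ET


def iter_page_titles(xml_path):
    """Yield the <title> text of every <page> element, in file order.

    Yields None for a page that has no <title> child.
    """
    for _event, elem in ET.iterparse(xml_path, events=("end",)):
        # Compare on the local tag name so a default namespace is ignored.
        if elem.tag.rsplit("}", 1)[-1] != "page":
            continue
        title = None
        for child in elem:
            if child.tag.rsplit("}", 1)[-1] == "title":
                title = child.text
                break
        yield title
        elem.clear()  # release the page's children once it has been read
```

Counting the yielded titles gives the number of documents the clustering would see, which should be at least the value passed to -k.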

-Matt


RE: RE: Kmeans Clustering error with XML input

Posted by Divya <di...@k2associates.com.sg>.

Hi Matt,
I have split my file into 10 chunks of 10 MB each.
I am still getting the error.
Do you mean that I should split the XML file into individual documents (in the
Wikipedia example, one per <page> ... </page> element)?

I didn't understand what "one file = one document" means.

Regards,
Divya 




$ bin/mahout kmeans -i D:/MahoutResult/wikipedia/seq2sparse/tfidf-vectors -o D:/MahoutResult/wikipedia/Kmeans -dm org.apache.mahout.common.distance.CosineDistanceMeasure -c D:/MahoutResult/wikipedia/Kmeans -k 10 -x 20 -ow -cl
Running on hadoop, using HADOOP_HOME=C:\cygwin\home\Divya\hadoop-0.20.2
HADOOP_CONF_DIR=C:\cygwin\home\Divya\hadoop-0.20.2\conf
10/11/04 10:21:21 INFO common.AbstractJob: Command line arguments: {--clustering=null, --clusters=D:/MahoutResult/wikipedia/Kmeans, --convergenceDelta=0.5, --distanceMeasure=org.apache.mahout.common.distance.CosineDistanceMeasure, --endPhase=2147483647, --input=D:/MahoutResult/wikipedia/seq2sparse/tfidf-vectors, --maxIter=20, --method=mapreduce, --numClusters=10, --output=D:/MahoutResult/wikipedia/Kmeans, --overwrite=null, --startPhase=0, --tempDir=temp}
10/11/04 10:21:22 INFO common.HadoopUtil: Deleting D:/MahoutResult/wikipedia/Kmeans
10/11/04 10:21:22 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
10/11/04 10:21:22 INFO compress.CodecPool: Got brand-new compressor
Exception in thread "main" java.lang.IndexOutOfBoundsException: Index: 5, Size: 5
        at java.util.ArrayList.RangeCheck(ArrayList.java:547)
        at java.util.ArrayList.get(ArrayList.java:322)
        at org.apache.mahout.clustering.kmeans.RandomSeedGenerator.buildRandom(RandomSeedGenerator.java:107)
        at org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:96)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.mahout.clustering.kmeans.KMeansDriver.main(KMeansDriver.java:54)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
        at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
        at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:184)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:156)


Re: RE: Kmeans Clustering error with XML input

Posted by Matt Spitz <ms...@meebo-inc.com>.

Yes. One file = one document.

Break the file into meaningful documents, one per file, and you should be
golden.  The algorithm will then cluster these documents.
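
One way to follow this advice for a Wikipedia chunk is to write every <page>
element to its own file before running seqdirectory. The Python sketch below
is an illustration only: the function names, the output file naming and the
choice to keep just the title and text of each page are assumptions, not
something taken from this thread or from Mahout.

```python
import os
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an XML namespace such as '{http://...}page' down to 'page'."""
    return tag.rsplit("}", 1)[-1]


def split_pages(xml_path, out_dir):
    """Write each <page> element to its own file and return the page count.

    Hypothetical helper: one output file per page, named page-NNNNNN.txt,
    holding the page title, a blank line, and the page text.
    """
    os.makedirs(out_dir, exist_ok=True)
    count = 0
    # iterparse streams the input, so the whole dump is never parsed at once.
    for _event, elem in ET.iterparse(xml_path, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        title, text = "", ""
        for child in elem.iter():
            name = local_name(child.tag)
            if name == "title":
                title = child.text or ""
            elif name == "text":
                text = child.text or ""
        out_path = os.path.join(out_dir, "page-%06d.txt" % count)
        with open(out_path, "w", encoding="utf-8") as out:
            out.write(title + "\n\n" + text)
        count += 1
        elem.clear()  # drop the children of the page that was just written
    return count
```

Calling split_pages on the XML chunk with an empty output directory, and then
pointing the -i option of seqdirectory at that directory, should give one
document per page, assuming seqdirectory behaves as described in this thread.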

---
Sent while mobile. Please forgive brevity and typos.

RE: Kmeans Clustering error with XML input

Posted by Divya <di...@k2associates.com.sg>.

Hi,

My XML input file is just 64 MB, i.e. I am using one of the chunks of the
Wikipedia example.
Do I still need to break up this XML file to get rid of this error?


Thanks in advance  
Regards,
Divya 


Re: Kmeans Clustering error with XML input

Posted by Matt Spitz <ms...@meebo-inc.com>.

Divya-

Are you using just one input file?  As far as I understand, seqdirectory
creates one document per file in your input directory.  When you try to
cluster 1 document into 10 clusters, you get an IndexOutOfBoundsException
when generating the random input clusters.  Which is just as well, because
your output won't be very interesting, anyway.

Break the XML into at least 10 documents, and you should have better luck.

-Matt
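
The failure mode described above can be sketched in a few lines of Python.
This is a simplified illustration, not Mahout's RandomSeedGenerator:
pick_seeds and count_input_files are hypothetical helpers, and the
one-file-per-document behaviour of seqdirectory is taken from the explanation
above rather than verified independently.

```python
import os
import random


def pick_seeds(documents, k):
    """Pick k distinct documents to act as initial cluster centres.

    Fails with a clear message when there are fewer documents than
    clusters, instead of running off the end of the list.
    """
    if k > len(documents):
        raise ValueError(
            "cannot pick %d seeds from %d document(s)" % (k, len(documents))
        )
    return random.sample(documents, k)


def count_input_files(input_dir):
    """Count regular files under input_dir (recursively).

    Under the one-file-per-document assumption this is the number of
    documents available for clustering.
    """
    return sum(len(files) for _root, _dirs, files in os.walk(input_dir))
```

A quick pre-flight check along these lines, comparing count_input_files for
the input directory with the value passed to -k, would flag the problem
before the kmeans job is started.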
