Posted to user@mahout.apache.org by vs <vi...@gmail.com> on 2011/04/22 20:24:30 UTC

kmeans on space-delimited input data,

Mahout Users,

  I have seen posts attempting to answer the problem I have at hand, but
I would like to hear from those who have been successful in
resolving this issue.

(1) Input data: A space-delimited symmetric matrix of 500x500 double values.
The entire matrix is in a single file, say 'raw-data.txt'.
     Example:
                  1 0.8 0.9 ....
                  0.8 1 0.7 ....   
                  0.3 0.5 1 ....

(2) Data format conversion:
   
    (a)  Convert 'raw-data.txt' into a sequence format representation using
the command:
            
./mahout seqdirectory -i ~/temp/kmeans-input-dir/raw-dir/ -o
~/temp/kmeans-input-dir/seq-dir -c ascii

           
            
~/temp/kmeans-input-dir/seq-dir> ls -a
            .  ..  chunk-0  .chunk-0.crc


    (b) Convert sequence data into vector format:
           
./mahout seq2sparse -i ~/temp/kmeans-input-dir/seq-dir/ -o
~/temp/kmeans-input-dir/vec-dir 


           
~/temp/kmeans-input-dir/vec-dir> ls -aR

.  ..  df-count  dictionary.file-0  .dictionary.file-0.crc  frequency.file-0 
.frequency.file-0.crc  tfidf-vectors  tf-vectors  tokenized-documents 
wordcount

./df-count:
.  ..  part-r-00000  .part-r-00000.crc

./tfidf-vectors:
.  ..  part-r-00000  .part-r-00000.crc

./tf-vectors:
.  ..  part-r-00000  .part-r-00000.crc

./tokenized-documents:
.  ..  part-m-00000  .part-m-00000.crc

./wordcount:
.  ..  part-r-00000  .part-r-00000.crc



(3) Run kmeans on the vector data

   
 ./mahout kmeans -c ~/temp/kmeans-input-dir/clusters/ -i
~/temp/kmeans-input-dir/vec-dir/tfidf-vectors/ -o
~/temp/kmeans-input-dir/kmeans-output -x 10 -k 5 -ow

            

11/04/22 13:11:35 INFO common.AbstractJob: Command line arguments:
{--clusters=~/temp/kmeans-input-dir/test-data-1-again/clusters,
--convergenceDelta=0.5,
--distanceMeasure=org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure,
--endPhase=2147483647,
--input=~/temp/kmeans-input-dir/test-data-1-again/vec-dir/tfidf-vectors/,
--maxIter=10, --method=mapreduce, --numClusters=5,
--output=~/temp/kmeans-input-dir/test-data-1-again/kmeans-output,
--overwrite=null, --startPhase=0, --tempDir=temp}
11/04/22 13:11:35 INFO util.NativeCodeLoader: Loaded the native-hadoop library
11/04/22 13:11:35 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
11/04/22 13:11:35 INFO compress.CodecPool: Got brand-new compressor
Exception in thread "main" java.lang.IndexOutOfBoundsException: Index: 1, Size: 1
        at java.util.ArrayList.RangeCheck(ArrayList.java:547)
        at java.util.ArrayList.get(ArrayList.java:322)
        at org.apache.mahout.clustering.kmeans.RandomSeedGenerator.buildRandom(RandomSeedGenerator.java:107)
        at org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:96)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.mahout.clustering.kmeans.KMeansDriver.main(KMeansDriver.java:54)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAcce
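
A note on reading the trace: the message "Index: 1, Size: 1" means a list
holding a single element was asked for the element at index 1. The sketch
below reproduces that class of failure in plain Java. The names
SeedSelectionSketch and firstK are hypothetical, and this is not Mahout's
RandomSeedGenerator code. Whether the tfidf-vectors input here really
held fewer vectors than the 5 requested clusters is an assumption that the
trace alone does not confirm.

```java
import java.util.ArrayList;
import java.util.List;

public class SeedSelectionSketch {
    // Hypothetical helper (not Mahout code): take the first k vectors as seeds.
    public static List<double[]> firstK(List<double[]> vectors, int k) {
        List<double[]> seeds = new ArrayList<>();
        for (int i = 0; i < k; i++) {
            // ArrayList.get throws IndexOutOfBoundsException once i >= vectors.size()
            seeds.add(vectors.get(i));
        }
        return seeds;
    }

    public static void main(String[] args) {
        List<double[]> vectors = new ArrayList<>();
        vectors.add(new double[] {1.0, 0.8, 0.9}); // only one input vector

        try {
            firstK(vectors, 5); // ask for more seeds than there are vectors
        } catch (IndexOutOfBoundsException e) {
            System.out.println("caught: " + e.getClass().getSimpleName());
        }
    }
}
```

If that reading applies, counting the records in the vector file before
running kmeans would show whether there are at least as many vectors as
clusters requested.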

Thoughts and comments on the above procedure are highly appreciated.

Thanks,

-----
vs
--
View this message in context: http://lucene.472066.n3.nabble.com/kmeans-on-space-delimited-input-data-tp2852337p2852337.html
Sent from the Mahout User List mailing list archive at Nabble.com.

Re: kmeans on space-delimited input data,

Posted by Vincent Xue <xu...@gmail.com>.

Hello vs,

I am also a beginner Mahout user, and I think the problem may be with
your initial step of converting the txt matrix to a sequence file. I had a
similar task: converting a tab-delimited matrix into a sequence file of
<IntWritable,VectorWritable> for SVD computations.

What I did was write some custom Java code using the Hadoop and Mahout
APIs to convert my text file to a SequenceFile. I used a Map/Reduce
implementation, but there must be an easier way.
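
As a rough illustration of the parsing half of such a conversion, the
sketch below reads delimited text into one array of doubles per row, keyed
by row index. It uses only the Java standard library. MatrixRowParser and
parseRows are hypothetical names, the sample values are made up, and the
step that would wrap each row in a Mahout vector and append it to a Hadoop
SequenceFile is deliberately left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class MatrixRowParser {
    /** Parses whitespace-delimited text into one double[] per non-empty line. */
    public static List<double[]> parseRows(String text) {
        List<double[]> rows = new ArrayList<>();
        for (String line : text.split("\\R")) {      // split on any line break
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue;                             // skip blank lines
            }
            String[] fields = trimmed.split("\\s+"); // spaces or tabs
            double[] row = new double[fields.length];
            for (int j = 0; j < fields.length; j++) {
                row[j] = Double.parseDouble(fields[j]);
            }
            rows.add(row);
        }
        return rows;
    }

    public static void main(String[] args) {
        // Made-up 3x3 sample standing in for the 500x500 matrix.
        String sample = "1 0.8 0.9\n0.8 1 0.7\n0.9 0.7 1\n";
        List<double[]> rows = parseRows(sample);
        for (int i = 0; i < rows.size(); i++) {
            // In a Hadoop/Mahout job, (i, rows.get(i)) would become one key/value record.
            System.out.println(i + " -> " + Arrays.toString(rows.get(i)));
        }
    }
}
```

Splitting on runs of whitespace lets the same code handle space-delimited
and tab-delimited files. The point of keeping one record per row is that
each matrix row then becomes its own vector rather than the whole file
being treated as a single item.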

In your case, it seems that kmeans takes in a sequence file of
<Writable,Canopy> or <Writable,Cluster>.  I can include more details if you
would like but I am also interested to see if there is an easier way.

Vincent
