Posted to user@mahout.apache.org by "Videnova, Svetlana" <sv...@logica.com> on 2012/08/02 10:57:53 UTC

clustering with kmeans, java app

Hello,

I’m writing a Java app to cluster my data with k-means.

These are the steps:

1)

LuceneDemo: creates the index and the vectors using the lucene.vector lib. Input is the path of my .txt file; output is the index files (segments_1, segments.gen, .fdt, .fdx, .fnm, .frq, .nrm, .prx, .tii, .tis, .tvd, .tvx and, most important, the .tvf file that Mahout will use) and vectors that look like this:

SEQ__org.apache.hadoop.io.Text_org.apache.hadoop.io.Text______t€ðàó^æVG²RŸ˜Õ_________Ž__P(0):{15:1.4650986194610596,14:0.9997141361236572,11:0.9997141361236572,10:0.9997141361236572,9:0.9997141361236572,8:1.4650986194610596,7:1.4650986194610596,6:1.4650986194610596,5:0.9997141361236572,4:1.4650986194610596,2:3.1613736152648926,1:1.4650986194610596,0:0.9997141361236572}_________Ž__P(1):{15:1.4650986194610596,14:0.9997141361236572,11:0.9997141361236572,10:0.9997141361236572,9:0.9997141361236572,8:1.4650986194610596,7:1.4650986194610596,6:1.4650986194610596,5:0.9997141361236572,4:1.4650986194610596,2:3.1613736152648926,1:1.4650986194610596,0:0.9997141361236572}_________Ž__P(2):{ [… and others]

Can anyone please confirm that this output format looks correct? If not, what should the vectors generated by lucene.vector look like?

This is part of the code:

    /* Create vectors: read each document's term-frequency vector
       from the Lucene index and accumulate it into a map. */
    Map vectorMap = new TreeMap();
    IndexReader reader = IndexReader.open(index);
    int numDoc = reader.maxDoc();
    for (int i = 0; i < numDoc; i++) {
        // "content" must match the field name used when indexing
        TermFreqVector termFreqVector = reader.getTermFreqVector(i, "content");
        addTermFreqToMap(vectorMap, termFreqVector);
    }
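For what it's worth, the "SEQ__org.apache.hadoop.io.Text_org.apache.hadoop.io.Text" prefix above is the SequenceFile header, and it names the key and value classes; Mahout's clustering drivers expect org.apache.mahout.math.VectorWritable values, so seeing Text for both is worth double-checking. One way to inspect the file is a plain SequenceFile reader; a minimal sketch, assuming the Hadoop 1.x API and a hypothetical path to the vector file:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.util.ReflectionUtils;

    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.getLocal(conf);
    Path path = new Path("F:/MAHOUT/vectors/part-00000"); // hypothetical path
    SequenceFile.Reader seq = new SequenceFile.Reader(fs, path, conf);
    // The header records the key/value classes the writer actually used
    System.out.println("key class:   " + seq.getKeyClassName());
    System.out.println("value class: " + seq.getValueClassName());
    Writable key = (Writable) ReflectionUtils.newInstance(seq.getKeyClass(), conf);
    Writable value = (Writable) ReflectionUtils.newInstance(seq.getValueClass(), conf);
    while (seq.next(key, value)) {
        System.out.println(key + " => " + value);
    }
    seq.close();

Mahout also ships a seqdumper utility that prints the same information from the command line.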




2)


MainClass: creates the clusters with Mahout. Input is the path of the vectors generated in step 1 (see above); output should be the clusters. For the moment it does not create any clusters, because of this error:
Exception in thread "main" java.io.FileNotFoundException: File file:/F:/MAHOUT/TesMahout/clusters/tf-vectors/wordcount/data does not exist.
      at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:361)
      at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:245)
      at org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:63)
      at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:241)
      at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:885)
      at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:779)
      at org.apache.hadoop.mapreduce.Job.submit(Job.java:432)
      at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:447)
      at org.apache.mahout.vectorizer.tfidf.TFIDFConverter.startDFCounting(TFIDFConverter.java:368)
      at org.apache.mahout.vectorizer.tfidf.TFIDFConverter.calculateDF(TFIDFConverter.java:198)
      at main.MainClass.main(MainClass.java:144)


Can anyone please help me solve this exception? I can’t understand why the data file cannot be found… I’m using the Hadoop and Mahout libs on Windows (and I’m an admin, so it should not be a rights problem).


This is part of the code:

    // Compute document frequencies over the document vectors
    Pair<Long[], List<Path>> calculate = TFIDFConverter.calculateDF(
            new Path(outputDir, DictionaryVectorizer.DOCUMENT_VECTOR_OUTPUT_FOLDER),
            new Path(outputDir, DictionaryVectorizer.DOCUMENT_VECTOR_OUTPUT_FOLDER),
            conf, chunkSize);

    // Turn the TF vectors into TF-IDF vectors
    TFIDFConverter.processTfIdf(
            new Path(outputDir, DictionaryVectorizer.DOCUMENT_VECTOR_OUTPUT_FOLDER),
            new Path(outputDir), conf, calculate, minDf, maxDFPercent,
            norm, true, sequentialAccessOutput, false, reduceTasks);

    Path vectorFolder = new Path("output");
    Path canopyCentroids = new Path(outputDir, "canopy-centroids");
    Path clusterOutput = new Path(outputDir, "clusters");

    // Canopy pass to produce the initial centroids for k-means
    CanopyDriver.run(vectorFolder, canopyCentroids,
            new EuclideanDistanceMeasure(), 250, 120, false, 3, false);

    // k-means, seeded with the canopy centroids
    KMeansDriver.run(conf, vectorFolder, new Path(canopyCentroids, "clusters-0"),
            clusterOutput, new TanimotoDistanceMeasure(), 0.01, 20, true, 3, false);
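One thing I notice: calculateDF gets the same DOCUMENT_VECTOR_OUTPUT_FOLDER path as both input and output, which may be related to the odd tf-vectors/wordcount path in the stack trace. To narrow this down I could check that the input folder exists before calling calculateDF; a sketch with the standard Hadoop FileSystem API, reusing the variables above:

    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    FileSystem fs = FileSystem.get(conf);
    Path tfVectors = new Path(outputDir, DictionaryVectorizer.DOCUMENT_VECTOR_OUTPUT_FOLDER);
    if (!fs.exists(tfVectors)) {
        System.err.println("missing input: " + tfVectors);
    } else {
        // Expect a SequenceFile part (e.g. a file named "data" or "part-*") in here
        for (FileStatus status : fs.listStatus(tfVectors)) {
            System.out.println(status.getPath());
        }
    }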


Thank you for your time




Regards



RE: clustering with kmeans, java app

Posted by "Videnova, Svetlana" <sv...@logica.com>.
Hi,

Yes, I'm using the Mahout and Hadoop libs on Windows.
My cluster output is not written to HDFS but to the LOCAL filesystem.
Thanks to Cygwin I am able to run the Unix commands needed to run Mahout on Windows.
I changed the paths for Windows as well.

I didn't test whether wordcount works; since I am only using the Mahout libs, I have not tried to run the examples.
I was not following any tutorial, but I found this, which may help you: http://blogs.msdn.com/b/avkashchauhan/archive/2012/03/06/running-apache-mahout-at-hadoop-on-windows-azure-www-hadooponazure-com.aspx
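In case it helps, this is roughly how I keep Hadoop on the local filesystem from Java (a minimal sketch; the property names are the Hadoop 1.x ones, adjust for your version):

    import org.apache.hadoop.conf.Configuration;

    Configuration conf = new Configuration();
    conf.set("fs.default.name", "file:///");  // use the local filesystem, no HDFS
    conf.set("mapred.job.tracker", "local");  // run jobs in-process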



Cheers


-----Original Message-----
From: Yuval Feinstein [mailto:yuvalf@citypath.com]
Sent: Tuesday, 7 August 2012 08:16
To: user@mahout.apache.org
Subject: Re: clustering with kmeans, java app

I spent a week trying to get Hadoop to work on Windows 7, and then gave up.
Did you manage to run Hadoop on Windows? Do the Hadoop tests (e.g. wordcount) work?
http://en.wikisource.org/wiki/User:Fkorning/Code/Hadoop-on-Cygwin has lots of details about this.
Some of the possible problems are Cygwin paths (!= Linux paths), HDFS/local filesystem confusion, your Hadoop user (!= your own user, permissions-wise), or other things listed at the link above.
Good luck,
Yuval



Re: clustering with kmeans, java app

Posted by Yuval Feinstein <yu...@citypath.com>.
I spent a week trying to get Hadoop to work on Windows 7, and then gave up.
Did you manage to run Hadoop on Windows? Do the Hadoop tests (e.g. wordcount) work?
http://en.wikisource.org/wiki/User:Fkorning/Code/Hadoop-on-Cygwin has lots of details about this.
Some of the possible problems are Cygwin paths (!= Linux paths), HDFS/local filesystem confusion, your Hadoop user (!= your own user, permissions-wise), or other things listed at the link above.
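If you want a quick way to tell whether Hadoop's filesystem layer can see your Windows paths at all, a tiny write/read probe may help (just a sketch; the F:/ path is an example and the property name is the Hadoop 1.x one):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    Configuration conf = new Configuration();
    conf.set("fs.default.name", "file:///");           // stay on the local filesystem
    FileSystem fs = FileSystem.get(conf);
    Path probe = new Path("F:/MAHOUT/smoke-test.txt");  // example path
    FSDataOutputStream out = fs.create(probe, true);    // overwrite if present
    out.writeUTF("hello");
    out.close();
    System.out.println("exists after write: " + fs.exists(probe));

If that fails, the problem is below Mahout, in the Hadoop/Cygwin setup.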
Good luck,
Yuval
