Posted to user@mahout.apache.org by Kerwin <ke...@gmail.com> on 2012/04/12 07:53:45 UTC

Help with TFIDF vectors and Mahout clustering

Hi,

I am new to Mahout and am using the NewsKMeansClustering class from the
book Mahout in Action to cluster the Reuters news collection with Mahout
0.5. I am running this in Eclipse on Windows XP. I believe that no
additional Hadoop configuration is required when running in local mode on
Windows (but I'm not sure what the Hadoop Configuration() in the code
picks up).
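
For what it's worth, my understanding is that with no Hadoop config files
on the classpath, a bare new Configuration() defaults to the local job
runner and the local file system anyway. A minimal sketch of forcing that
explicitly, using the stock Hadoop 0.20 property names (which is what I
believe Mahout 0.5 ships against):

    // Sketch: pin Hadoop to local (non-distributed) mode explicitly.
    // With no core-site.xml/mapred-site.xml on the classpath, these
    // are the defaults anyway.
    Configuration conf = new Configuration();
    conf.set("fs.default.name", "file:///");
    conf.set("mapred.job.tracker", "local");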

The output of the k-means clustering does not show any meaningful result:
0 belongs to cluster 1.0: []
0 belongs to cluster 1.0: []
0 belongs to cluster 1.0: []
0 belongs to cluster 1.0: []
0 belongs to cluster 1.0: []
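
These lines come from the print loop at the end of the example, which
reads the clusteredPoints output. Roughly what I am running (a sketch,
using the fs and conf from the code below; the part file name is what I
see on disk and may differ). Note that WeightedVectorWritable prints as
"weight: vector", so "1.0: []" is a point with weight 1.0 and an empty
vector:

    // Sketch of the print loop over the k-means clusteredPoints output:
    // key is the cluster id (IntWritable), value the clustered point
    // (WeightedVectorWritable); "[]" means the point's vector is empty.
    SequenceFile.Reader reader = new SequenceFile.Reader(fs,
        new Path("F:/MAHOUT/Source/reuters-vectors/clusters/"
            + "clusteredPoints/part-m-00000"), conf);
    IntWritable key = new IntWritable();
    WeightedVectorWritable value = new WeightedVectorWritable();
    while (reader.next(key, value)) {
      System.out.println(key.toString() + " belongs to cluster "
          + value.toString());
    }
    reader.close();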

On checking the individual folders that are generated, I found that the
TFIDF vectors are not being produced properly: I get very few sequential
access sparse vectors, and most of them are empty. All the other
intermediate folders, such as tokenized-documents, tf-vectors, and
df-count, contain meaningful results. This also explains why changing the
thresholds of the clustering algorithm makes no difference to the output.
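
Here is roughly how I inspected the tf-idf output (a sketch; the vectors
are keyed by document id as Text with a VectorWritable value, and the
part file name may differ on your run):

    // Sketch: count the non-zero terms per document in tfidf-vectors.
    Path tfidfPart = new Path(
        "F:/MAHOUT/Source/reuters-vectors/tfidf-vectors/part-r-00000");
    SequenceFile.Reader r = new SequenceFile.Reader(fs, tfidfPart, conf);
    Text docId = new Text();
    VectorWritable vec = new VectorWritable();
    while (r.next(docId, vec)) {
      System.out.println(docId + ": "
          + vec.get().getNumNondefaultElements() + " terms");
    }
    r.close();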

I have used only a small subset of the documents from the news
collection, but the same thing happens with the entire collection.

Here is the code. I have deliberately set minSupport and minDf to 1 to
try to get at least some results.

    int minSupport = 1;
    int minDf = 1;
    int maxDFPercent = 95;
    int maxNGramSize = 2;
    int minLLRValue = 50;
    int reduceTasks = 1;
    int chunkSize = 200;
    int norm = 2;
    boolean sequentialAccessOutput = true;

    String inputDir = "F:/MAHOUT/Source/reuters-seqfilestest";

    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    String outputDir = "F:/MAHOUT/Source/reuters-vectors";
    HadoopUtil.delete(conf, new Path(outputDir));

    // Tokenize the input SequenceFiles into tokenized-documents.
    Path tokenizedPath = new Path(outputDir,
        DocumentProcessor.TOKENIZED_DOCUMENT_OUTPUT_FOLDER);
    MyAnalyzer analyzer = new MyAnalyzer();
    DocumentProcessor.tokenizeDocuments(new Path(inputDir),
        analyzer.getClass().asSubclass(Analyzer.class), tokenizedPath, conf);

    // Build the dictionary and the term-frequency (tf) vectors.
    DictionaryVectorizer.createTermFrequencyVectors(tokenizedPath,
        new Path(outputDir), conf, minSupport, maxNGramSize, minLLRValue,
        2, true, reduceTasks, chunkSize, sequentialAccessOutput, false);

    // Convert the tf vectors into tf-idf vectors.
    TFIDFConverter.processTfIdf(
        new Path(outputDir, DictionaryVectorizer.DOCUMENT_VECTOR_OUTPUT_FOLDER),
        new Path(outputDir), conf, chunkSize, minDf,
        maxDFPercent, norm, true, sequentialAccessOutput, false, reduceTasks);

    Path vectorsFolder = new Path(outputDir, "tfidf-vectors");
    Path canopyCentroids = new Path(outputDir, "canopy-centroids");
    Path clusterOutput = new Path(outputDir, "clusters");

    // Generate initial centroids with canopy clustering, then run
    // k-means on the tf-idf vectors, also writing the clustered points.
    CanopyDriver.run(vectorsFolder, canopyCentroids,
        new EuclideanDistanceMeasure(), 250, 120, false, false);
    KMeansDriver.run(conf, vectorsFolder,
        new Path(canopyCentroids, "clusters-0"), clusterOutput,
        new TanimotoDistanceMeasure(), 0.01, 20, true, false);

Could you please let me know how to proceed in order to get the TFIDF
vectors populated?

Thanks a lot.

Re: Help with TFIDF vectors and Mahout clustering

Posted by Kerwin <ke...@gmail.com>.
Thank you, Lance, for your reply.
I did have cygwin/bin on the PATH, which contains chmod.exe, but I still
do not get the TFIDF vectors: the vectors file is just 1 KB and contains
no vectors. The program does not report any error while running in
Eclipse, and I have tried various things there.
I have since moved to Cygwin, and now I do get the TFIDF vectors when
running in local mode, using the built-in commands such as seqdirectory,
seq2sparse, and kmeans (roughly the sequence sketched below).
I still need to figure out what goes wrong under Eclipse. Please let me
know if there is anything else I can check.
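
For reference, this is roughly the sequence I now run from Cygwin (from
memory, so the flag spellings may not be exact, and the directory names
here are placeholders):

    bin/mahout seqdirectory -i reuters-out -o reuters-seqfiles -c UTF-8
    bin/mahout seq2sparse -i reuters-seqfiles -o reuters-vectors -ow
    bin/mahout kmeans -i reuters-vectors/tfidf-vectors -c reuters-initial
        -o reuters-kmeans -k 20 -x 20 -cd 0.01 -ow -cl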

On Thu, Apr 12, 2012 at 1:13 AM, Lance Norskog <go...@gmail.com> wrote:

> Did you take care of the 'chmod' problem? Hadoop insists on calling the
> 'chmod' program to change the mode of some files or directories it
> makes. If the chmod program is not in the path, Hadoop jobs fail. So
> you need a binary 'chmod' or 'chmod.exe' in the execution path when
> you run Hadoop inside Eclipse.
>
> > [quoted original message snipped]
>
> --
> Lance Norskog
> goksron@gmail.com

Re: Help with TFIDF vectors and Mahout clustering

Posted by Lance Norskog <go...@gmail.com>.
Did you take care of the 'chmod' problem? Hadoop insists on calling the
'chmod' program to change the mode of some files or directories it
makes. If the chmod program is not in the path, Hadoop jobs fail. So
you need a binary 'chmod' or 'chmod.exe' in the execution path when
you run Hadoop inside Eclipse.
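
One quick way to check is to run a scratch class with the very same
Eclipse launch configuration (just a sketch; note that Eclipse inherits
its PATH from whatever launched it, so a PATH set only in your Cygwin
shell profile is not visible to a run started from the Eclipse GUI):

    import java.io.IOException;

    // Sketch: confirm the Eclipse launch environment can find chmod.
    // If exec() throws IOException here, Hadoop's chmod calls fail too.
    public class ChmodCheck {
      public static void main(String[] args)
          throws IOException, InterruptedException {
        System.out.println("PATH = " + System.getenv("PATH"));
        // Resolves 'chmod' against the PATH, just as Hadoop does.
        Process p = Runtime.getRuntime()
            .exec(new String[] {"chmod", "--help"});
        System.out.println("chmod exit code: " + p.waitFor());
      }
    }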

On Wed, Apr 11, 2012 at 10:53 PM, Kerwin <ke...@gmail.com> wrote:
> [quoted original message snipped]



-- 
Lance Norskog
goksron@gmail.com