Posted to user@mahout.apache.org by "Scott C. Cote" <sc...@gmail.com> on 2013/12/27 20:56:35 UTC

Mahout In Action - NewsKMeansClustering sample not generating clusters

Hello Mahout Trainers and Gurus:

I am plowing through the sample code from Mahout in Action and have been
trying to run the NewsKMeansClustering example using the Reuters dataset.
I found Alex Ott's blog

http://alexott.blogspot.co.uk/2012/07/getting-started-with-examples-from.html

and downloaded the updated examples for Mahout 0.7. I took the exploded zip
and modified the pom.xml so that it referenced Mahout 0.8 instead of 0.7
(i.e., bumping the org.apache.mahout dependency versions, such as
mahout-core, from 0.7 to 0.8).

Of course, there are compile errors (expected), but the only "seemingly"
significant problems are in the helper class called MyAnalyzer.

NOTE: I am NOT complaining about the fact that the samples don't compile
properly in 0.8. If my efforts to make it work result in sharable code,
then I have helped (or the person who helps me has helped).


I need help with potentially two different parts: revising MyAnalyzer
(steps 1 and 2) and/or sidestepping it (step 3).

Steps Taken (total of 3 steps):

Step 1. Performed the SGML-to-text conversion of the Reuters data and then
converted the text to sequence files (the exact commands are below).
Step 2. Attempted to run the NewsKMeansClustering Java program with
MyAnalyzer, after attempting to modify MyAnalyzer to fit into the Mahout
0.8 world.
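
For step 1, I ran roughly the following, per the book's instructions (the
paths are my local ones):

> mvn -e -q exec:java -Dexec.mainClass="org.apache.lucene.benchmark.utils.ExtractReuters" -Dexec.args="reuters-sgm/ reuters-out/"
> mahout seqdirectory -c UTF-8 -i reuters-out -o reuters-seqfiles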

When I try to run the program, the sample blows up with this message:

> 2013-12-27 12:59:29.870 java[86219:1203] Unable to load realm info from SCDynamicStore
> SLF4J: Class path contains multiple SLF4J bindings.
> SLF4J: Found binding in [jar:file:/Users/scottccote/.m2/repository/org/slf4j/slf4j-jcl/1.7.5/slf4j-jcl-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in [jar:file:/Users/scottccote/.m2/repository/org/slf4j/slf4j-log4j12/1.5.11/slf4j-log4j12-1.5.11.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
> SLF4J: Actual binding is of type [org.slf4j.impl.JCLLoggerFactory]
> 2013-12-27 12:59:30 NativeCodeLoader [WARN] Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> 2013-12-27 12:59:30 JobClient [WARN] Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
> 2013-12-27 12:59:30 LocalJobRunner [WARN] job_local_0001
> java.lang.NullPointerException
>     at org.apache.lucene.analysis.util.CharacterUtils$Java5CharacterUtils.fill(CharacterUtils.java:209)
>     at org.apache.lucene.analysis.util.CharTokenizer.incrementToken(CharTokenizer.java:135)
>     at org.apache.mahout.vectorizer.document.SequenceFileTokenizerMapper.map(SequenceFileTokenizerMapper.java:49)
>     at org.apache.mahout.vectorizer.document.SequenceFileTokenizerMapper.map(SequenceFileTokenizerMapper.java:38)
>     at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
>     at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
>     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:214)
> Exception in thread "main" java.lang.IllegalStateException: Job failed!
>     at org.apache.mahout.vectorizer.DocumentProcessor.tokenizeDocuments(DocumentProcessor.java:95)
>     at mia.clustering.ch09.NewsKMeansClustering.main(NewsKMeansClustering.java:53)


Here is the source code of my revised MyAnalyzer. I tried to stay as true
as possible to the form of the original MyAnalyzer, but I'm sure I
misunderstood something in this class when I ported it to the new Lucene
Analyzer API...

> public class MyAnalyzer extends Analyzer {
>
>     private final Pattern alphabets = Pattern.compile("[a-z]+");
>
>     /*
>      * (non-Javadoc)
>      * @see org.apache.lucene.analysis.Analyzer#createComponents(java.lang.String, java.io.Reader)
>      */
>     @Override
>     protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
>         final Tokenizer source = new StandardTokenizer(Version.LUCENE_CURRENT, reader);
>         TokenStream result = new StandardFilter(Version.LUCENE_CURRENT, source);
>         result = new LowerCaseFilter(Version.LUCENE_CURRENT, result);
>         result = new StopFilter(Version.LUCENE_CURRENT, result, StandardAnalyzer.STOP_WORDS_SET);
>         CharTermAttribute termAtt = result.addAttribute(CharTermAttribute.class);
>         StringBuilder buf = new StringBuilder();
>
>         try {
>             result.reset();
>             while (result.incrementToken()) {
>                 if (termAtt.length() < 3)
>                     continue;
>                 String word = new String(termAtt.buffer(), 0, termAtt.length());
>                 Matcher m = alphabets.matcher(word);
>                 if (m.matches()) {
>                     buf.append(word).append(" ");
>                 }
>             }
>         } catch (IOException e) {
>             e.printStackTrace();
>         }
>
>         TokenStream ts = new WhitespaceTokenizer(Version.LUCENE_CURRENT, new StringReader(buf.toString()));
>         return new TokenStreamComponents(source, ts);
>     }
> }
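
For what it's worth, here is my current theory plus an untested sketch of
what I suspect the port should look like. Since Mahout's mapper reuses the
analyzer across documents, I believe only the declared source tokenizer
receives the new Reader each time, while my detached WhitespaceTokenizer
keeps its stale (already closed) StringReader, which would explain the
NullPointerException inside CharTokenizer.incrementToken(). If that's
right, createComponents() should only wire up a lazy filter chain and
never drain it eagerly, so the length and [a-z]+ checks would move into a
TokenFilter like this:

> public class MyAnalyzer extends Analyzer {
>
>     private static final Pattern ALPHABETS = Pattern.compile("[a-z]+");
>
>     @Override
>     protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
>         final Tokenizer source = new StandardTokenizer(Version.LUCENE_CURRENT, reader);
>         TokenStream result = new StandardFilter(Version.LUCENE_CURRENT, source);
>         result = new LowerCaseFilter(Version.LUCENE_CURRENT, result);
>         result = new StopFilter(Version.LUCENE_CURRENT, result, StandardAnalyzer.STOP_WORDS_SET);
>         // Lazy filter: drop tokens shorter than 3 chars or containing
>         // non-alphabetic characters. Nothing is consumed here; the mapper
>         // drives the stream after reset().
>         result = new TokenFilter(result) {
>             private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
>             @Override
>             public boolean incrementToken() throws IOException {
>                 while (input.incrementToken()) {
>                     if (termAtt.length() >= 3 && ALPHABETS.matcher(termAtt).matches()) {
>                         return true;
>                     }
>                 }
>                 return false;
>             }
>         };
>         return new TokenStreamComponents(source, result);
>     }
> }

Does that look like the right direction, or is there a stock filter
combination I should be using instead?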


Step 3. Since I wasn't progressing with MyAnalyzer, I commented out the
MyAnalyzer reference inside NewsKMeansClustering and replaced it with:

> // MyAnalyzer analyzer = new MyAnalyzer();
> System.out.println("tokenizing the documents");
> StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_CURRENT, StandardAnalyzer.STOP_WORDS_SET);
That gets me past the problem mentioned in step 2, all the way to
calculating the k-means clusters based on the canopy data.
Unfortunately, no clusters are generated by the canopy process. I
confirmed this by navigating to the folder titled:

> newsClusters/canopy-centroids/clusters-0-final

And issued the command

> mahout seqdumper -i part-r-00000

To see the result

> Input Path: part-r-00000
> Key class: class org.apache.hadoop.io.Text Value Class: class org.apache.mahout.clustering.iterator.ClusterWritable
> Count: 0

So what do I need to do in order for the sample to generate clusters?
NOTE:  I was able to generate clusters using the manual process (command
line methods).
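
P.S. One more diagnostic I plan to run: confirming that the vectorizer
actually produced non-empty TF-IDF vectors, since an empty vectors
directory would also explain zero canopies. Below is the quick check I put
together; the vectors path is only my guess at where the sample writes
them, so adjust as needed.

> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
> import org.apache.hadoop.io.SequenceFile;
> import org.apache.hadoop.io.Text;
> import org.apache.mahout.math.VectorWritable;
>
> public class VectorCount {
>     public static void main(String[] args) throws Exception {
>         Configuration conf = new Configuration();
>         FileSystem fs = FileSystem.get(conf);
>         // My guess at where the sample writes the TF-IDF vectors.
>         Path vectors = new Path("newsClusters/tfidf-vectors/part-r-00000");
>         SequenceFile.Reader reader = new SequenceFile.Reader(fs, vectors, conf);
>         Text key = new Text();
>         VectorWritable value = new VectorWritable();
>         int count = 0;
>         while (reader.next(key, value)) {
>             if (count < 5) { // peek at the first few: doc id and non-zero term count
>                 System.out.println(key + " -> " + value.get().getNumNondefaultElements() + " terms");
>             }
>             count++;
>         }
>         reader.close();
>         System.out.println("total vectors: " + count);
>     }
> }

If "total vectors" comes back 0 (or every document shows 0 terms), then
the tokenization/vectorization stage is the real culprit and the canopy
step is just doing nothing with empty input.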