Posted to user@mahout.apache.org by Pat Ferrel <pa...@occamsmachete.com> on 2012/03/12 23:54:06 UTC

Deploy a custom lucene analyzer

I'd like to deploy a simple custom analyzer that puts together some 
lucene filters as outlined in Mahout in Action.

    package com.mydomain.analyzer;

    import java.io.Reader;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.LengthFilter;
    import org.apache.lucene.analysis.LowerCaseFilter;
    import org.apache.lucene.analysis.PorterStemFilter;
    import org.apache.lucene.analysis.StopFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.analysis.standard.StandardTokenizer;
    import org.apache.lucene.util.Version;

    public class LuceneStemmingAnalyzer extends Analyzer {

        @SuppressWarnings("deprecation")
        @Override
        public TokenStream tokenStream(String fieldName, Reader reader) {
            // yep it's working so no need to blather
            // System.out.println("Tokenizing using the custom LuceneStemmingAnalyzer");
            TokenStream result = new StandardTokenizer(
                    Version.LUCENE_CURRENT, reader);
            result = new LowerCaseFilter(result);
            // keep only tokens between 3 and 50 characters long
            result = new LengthFilter(result, 3, 50);
            result = new StopFilter(true, result,
                    StandardAnalyzer.STOP_WORDS_SET);
            result = new PorterStemFilter(result);
            return result;
        }
    }

So the class name is com.mydomain.analyzer.LuceneStemmingAnalyzer, which is 
what gets passed to seq2sparse. When I build a jar and put it with the other 
dependency jars, seq2sparse works locally:

    bin/mahout seq2sparse -i wp-seqfiles -o wp-vectors -ow \
        -a com.mydomain.analyzer.LuceneStemmingAnalyzer -chunk 100 \
        -wt tfidf -s 2 -md 3 -x 95 -ng 2 -ml 50 -seq -n 2
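For what it's worth, here is roughly how I'm staging the jar locally. The 
directory and jar names below are stand-ins (created on the fly so the 
commands actually run), not any official Mahout layout:

```shell
# Stand-in for $MAHOUT_HOME/lib and for the analyzer jar I built;
# in reality these would be the real paths.
MAHOUT_LIB=$(mktemp -d)
touch mydomain-analyzer.jar

# Drop the analyzer jar next to Mahout's other dependency jars.
cp mydomain-analyzer.jar "$MAHOUT_LIB"/

# Make it visible to the client JVM as well. (Whether this also reaches
# the task JVMs on a cluster is exactly what I'm unsure about.)
export HADOOP_CLASSPATH="$MAHOUT_LIB/mydomain-analyzer.jar:$HADOOP_CLASSPATH"

echo "$HADOOP_CLASSPATH" | grep -q mydomain-analyzer.jar && echo staged
```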

My question is: what is the recommended way to deploy this analyzer to a 
Mahout/Hadoop cluster? I assume there is no need in this case to create a 
job jar, since we pass the class name in explicitly? Note that the class 
extends the Lucene abstract class Analyzer, which lives in the same jar as 
WhitespaceAnalyzer, the default for seq2sparse.
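If I understand it correctly, seq2sparse just instantiates whatever class 
name -a names via reflection, so both my jar and the Lucene jar have to be 
visible to every JVM that resolves the name. A stdlib-only sketch of that 
failure mode (the second class name is mine and is deliberately absent here; 
no Mahout or Lucene code is used):

```java
// Sketch: reflection-based loading succeeds or fails purely on classpath
// visibility, which is how a class name given on the command line gets
// resolved. JDK only; the analyzer name below is just an example string.
public class ReflectionLoadDemo {
    public static void main(String[] args) {
        lookup("java.util.ArrayList");                          // on every classpath
        lookup("com.mydomain.analyzer.LuceneStemmingAnalyzer"); // only if my jar ships
    }

    static void lookup(String className) {
        try {
            Class<?> c = Class.forName(className);
            System.out.println("loaded: " + c.getName());
        } catch (ClassNotFoundException e) {
            // the same root cause as a NoClassDefFoundError on the cluster
            System.out.println("not on classpath: " + className);
        }
    }
}
```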

Thanks.

BTW, deploying the same way as I did locally, but on the cluster, gives me 
a NoClassDefFoundError for Analyzer. So not my class, but the abstract 
Lucene class that I extended.