Posted to user@mahout.apache.org by Pat Ferrel <pa...@occamsmachete.com> on 2012/03/12 23:54:06 UTC
Deploy a custom lucene analyzer
I'd like to deploy a simple custom analyzer that chains together some
Lucene filters, as outlined in Mahout in Action.
package com.mydomain.analyzer;

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LengthFilter;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.PorterStemFilter;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

public class LuceneStemmingAnalyzer extends Analyzer {
    @SuppressWarnings("deprecation")
    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        // Tokenize, lower-case, drop tokens shorter than 3 or longer
        // than 50 chars, remove stop words, then Porter-stem.
        TokenStream result = new StandardTokenizer(
                Version.LUCENE_CURRENT, reader);
        result = new LowerCaseFilter(result);
        result = new LengthFilter(result, 3, 50);
        result = new StopFilter(true, result,
                StandardAnalyzer.STOP_WORDS_SET);
        result = new PorterStemFilter(result);
        return result;
    }
}
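For quick local verification before running seq2sparse, a small driver like the one below can print the tokens the analyzer emits. This is a sketch only: it assumes a Lucene 3.x jar on the classpath, and the class name AnalyzerSmokeTest is hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

import com.mydomain.analyzer.LuceneStemmingAnalyzer;

// Hypothetical smoke test: feed a string through the analyzer
// and print each surviving token to stdout.
public class AnalyzerSmokeTest {
    public static void main(String[] args) throws IOException {
        LuceneStemmingAnalyzer analyzer = new LuceneStemmingAnalyzer();
        TokenStream ts = analyzer.tokenStream("text",
                new StringReader("The quick brown foxes were jumping"));
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        while (ts.incrementToken()) {
            System.out.println(term.toString());
        }
        ts.close();
    }
}
```

Stop words, the length filter, and stemming should all be visible in the output (e.g. "the" dropped, plurals and participles stemmed).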
So the class name, as passed to seq2sparse, is
com.mydomain.analyzer.LuceneStemmingAnalyzer. When I build a jar and put
it with the other dependency jars, seq2sparse works locally, thus:
bin/mahout seq2sparse -i wp-seqfiles -o wp-vectors -ow -a
com.mydomain.analyzer.LuceneStemmingAnalyzer -chunk 100 -wt tfidf
-s 2 -md 3 -x 95 -ng 2 -ml 50 -seq -n 2
My question is: what is the recommended way to deploy this analyzer to a
Mahout/Hadoop cluster? I assume there is no need to create a job jar in
this case, since the class name is passed in explicitly? Note that it
extends the Lucene abstract class Analyzer, which lives in the same jar
as WhitespaceAnalyzer, the default for seq2sparse.
Thanks.
BTW, deploying on the cluster in the same way as I did locally gives me
a NoClassDefFoundError for Analyzer. So the missing class is not mine,
but the abstract class that I extended.
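One option I'm considering, if a job jar turns out to be needed after all, is bundling the Lucene dependency inside the jar using Hadoop's lib/ convention (compiled classes at the jar root, dependency jars under lib/). A rough sketch; the paths, jar names, and Lucene version below are hypothetical:

```shell
# Sketch: assemble a Hadoop job jar containing the analyzer classes
# plus the lucene-core jar the Analyzer base class comes from.
mkdir -p build/lib
cp lucene-core-3.1.0.jar build/lib/   # dependencies go in lib/
cp -r classes/com build/              # compiled com.mydomain.analyzer classes
(cd build && jar cf ../stemming-analyzer-job.jar .)
```

Hadoop unpacks lib/ onto the task classpath when such a jar is submitted, which would explain why the bare analyzer jar works locally (where lucene-core is already on the classpath) but not on the cluster.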