Posted to user@mahout.apache.org by Abbas <ab...@gmail.com> on 2012/03/07 10:24:32 UTC
Re: Clustering techniques, tips and tricks
Hi Bogdan,
This is in reply to your previous post where you asked about having
word-stoppers in Mahout.
I was recently fighting with the same thing and found a solution that
worked perfectly well. Here is what to do:
1. Create your own (customized) Lucene Analyzer by extending the Analyzer class
and overriding its tokenStream method.
2. Create a jar file containing your custom analyzer. Make sure your Lucene
jar file is referenced in the jar's MANIFEST.MF.
3. Place the jar in mahout/examples/target/dependency. If you get a
ClassNotFoundException in the next step, you may also want to put the two jar
files (your analyzer jar and the Lucene jar) in hadoop/lib/. You can also try
adding the jar files to the HADOOP_CLASSPATH and CLASSPATH environment
variables.
4. Then run your seq2sparse command, passing the class name of your custom
analyzer in the -a parameter.
5. Run your k-means command as you normally would.
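
To give an idea of what the analyzer from step 1 has to do, here is a rough
sketch of the filtering itself. Note that this is plain Java and not Lucene
code: the class name StopWordSketch, the filter method and the stop-word list
are all made up for illustration. In a real analyzer the same logic would sit
in the TokenStream returned by tokenStream, and the exact Lucene classes to use
depend on your Lucene version.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Plain-Java illustration only; no Lucene or Mahout classes are used here.
class StopWordSketch {

    // Hypothetical stop-word list; replace it with the words you want removed.
    static final Set<String> STOP_WORDS =
            new HashSet<String>(Arrays.asList("a", "an", "and", "the", "of", "in"));

    // Lowercases the text, splits it on non-letter characters,
    // and keeps only the tokens that are not stop words.
    static List<String> filter(String text) {
        List<String> kept = new ArrayList<String>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // prints [quick, brown, fox, lazy, dog]
        System.out.println(filter("The quick brown fox and the lazy dog"));
    }
}
```

Whatever tokens survive this kind of filtering are what seq2sparse would turn
into vector terms, so the stop-word list directly shapes the features that
k-means sees.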
Hope this helps.
If you need the complete code for the custom analyzer, let me know.
Thanks
Abbas