Posted to user@mahout.apache.org by Abbas <ab...@gmail.com> on 2012/03/07 10:24:32 UTC

Re: Clustering techniques, tips and tricks

Hi Bogdan,

This is in reply to your previous post, where you asked about using
word-stoppers (stop-word removal) in Mahout.

I was recently fighting with the same thing and found a solution that worked
perfectly well for me. Here is what you should do:
1. Create your own (customized) Lucene Analyzer by extending the Analyzer
class and overriding the tokenStream method.

2. Create a jar file containing your custom analyzer. Make sure your Lucene
jar file is listed in the jar's MANIFEST.MF.

3. Place the jar in mahout/examples/target/dependency. If you get a
ClassNotFoundException in the next step, you may also want to put the two jar
files (your analyzer jar and the Lucene jar) in hadoop/lib/. You can also try
adding the jar files to the HADOOP_CLASSPATH and CLASSPATH environment
variables.

4. Then run your seq2sparse command, giving the class name of your custom
analyzer in the -a parameter.

5. Run your k-means command as you normally would.
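
For step 2, here is a rough sketch of what the manifest entry could look like,
assuming that "in the MANIFEST.MF" means the standard Class-Path attribute.
It uses only the JDK's java.util.jar classes. The class name ManifestSketch
and the jar name lucene-core-3.4.0.jar are made up for illustration, so use
the name of whatever Lucene jar your Mahout build actually ships with.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.jar.Attributes;
import java.util.jar.Manifest;

public class ManifestSketch {

    // Builds a manifest whose Class-Path attribute names the Lucene jar.
    // The jar name passed in is an assumption, not a fixed value.
    static Manifest buildManifest(String luceneJarName) {
        Manifest manifest = new Manifest();
        Attributes main = manifest.getMainAttributes();
        // Manifest-Version has to be set, otherwise nothing is written out.
        main.put(Attributes.Name.MANIFEST_VERSION, "1.0");
        main.put(Attributes.Name.CLASS_PATH, luceneJarName);
        return manifest;
    }

    // Renders the manifest as the text that would end up in MANIFEST.MF.
    static String render(Manifest manifest) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        manifest.write(out);
        return out.toString("UTF-8");
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical jar name, for illustration only.
        System.out.print(render(buildManifest("lucene-core-3.4.0.jar")));
    }
}
```

The printed text can be saved to a file and handed to the jar tool when
packaging the analyzer. Several jars can go into Class-Path, separated by
spaces.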

Hope this helps.

If you need the complete code for the custom analyzer, let me know.
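
To give an idea of what such an analyzer does, here is a minimal sketch of
the filtering logic in plain Java, with no Lucene dependency. This is not the
analyzer itself: in a real one, the overridden tokenStream method would
typically chain a Lucene tokenizer with filters such as LowerCaseFilter and
StopFilter. The class name StopWordSketch and the short stop-word list are
made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class StopWordSketch {

    // Hypothetical stop-word list; a real one would be much longer.
    static final Set<String> STOP_WORDS = new HashSet<String>(
            Arrays.asList("a", "an", "and", "the", "of", "in", "is"));

    // Lower-cases the text, splits it on runs of non-letter characters,
    // and drops empty strings and stop words.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<String>();
        for (String raw : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (!raw.isEmpty() && !STOP_WORDS.contains(raw)) {
                tokens.add(raw);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [clustering, documents, mahout]
        System.out.println(tokenize("The clustering of documents in Mahout"));
    }
}
```

Only the tokens that survive this kind of filtering would reach the vectors
built by seq2sparse, which is why the stop words no longer show up in the
clusters.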

Thanks
Abbas