Posted to user@mahout.apache.org by Viral Parikh <vi...@gmail.com> on 2014/12/04 00:40:55 UTC

A Few Questions About Using Mahout for Text Clustering

Hi Mahout Users!



First, this community is great, and I appreciate all the Q&A back and
forth!



I am currently working on text clustering, using Mahout and its
clustering algorithms (kmeans, krunner, canopy, etc.).



If anyone has worked on a similar project, please let me know. I have two
questions:



1. To choose the optimal k, I am running krunner across my vectorized
dataset. To pick the right k, I am trying to understand how my
observations spread across the clusters and to minimize cluster 1 (which
appears to be the catch-all bucket; can anyone confirm?). However, I am
observing that the final count varies depending on k. See below (please
ignore the blank cells).



Any idea why the final count varies depending on the chosen k?



 [image: Inline image 1]
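
For reference, the per-k tally described in question 1 could be sketched
roughly as below. This is only an illustration, not Mahout code: it
assumes the cluster assignments for each k have already been exported as
(point id, cluster id) pairs, and that "final count" means the total
number of observations assigned across all clusters. The function names
and the toy data are hypothetical.

```python
from collections import Counter


def cluster_sizes(assignments):
    """Count observations per cluster from (point_id, cluster_id) pairs."""
    return Counter(cluster_id for _, cluster_id in assignments)


def summarize(runs, expected_total):
    """Summarize each k: total assigned, singleton clusters, largest cluster.

    `runs` maps k to a list of (point_id, cluster_id) pairs (assumed format).
    """
    report = {}
    for k, assignments in runs.items():
        sizes = cluster_sizes(assignments)
        total = sum(sizes.values())
        report[k] = {
            "total": total,
            # False means some observations were dropped or duplicated.
            "matches_input": total == expected_total,
            # Clusters holding exactly one observation.
            "singletons": sorted(c for c, n in sizes.items() if n == 1),
            "largest": max(sizes.values()) if sizes else 0,
        }
    return report


# Hypothetical toy data: four documents, clustered with k=2 and k=3.
runs = {
    2: [("d1", 0), ("d2", 0), ("d3", 1), ("d4", 1)],
    3: [("d1", 0), ("d2", 0), ("d3", 1)],  # "d4" is missing from this run
}
report = summarize(runs, expected_total=4)
for k in sorted(report):
    print(k, report[k])
```

If the total for some k does not match the number of input vectors, the
difference points at observations that were never assigned (or were
counted twice) in that run, rather than at the clustering itself.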



2. I also noticed that some of my clusters contain just one observation
(n=1), which doesn’t make sense to me. Is there a way to avoid this,
perhaps a particular parameter I can tweak?



Thank you and looking forward to your reply.





Cheers,

Viral