Posted to user@mahout.apache.org by Viral Parikh <vi...@gmail.com> on 2014/12/04 00:40:55 UTC
A few questions about using Mahout for text clustering
Hi Mahout Users!
First, this community is great, and I appreciate all the Q&A back and
forth!
I am currently working on text clustering, using Mahout and its
clustering algorithms (kmeans, krunner, canopy, etc.).
If anyone has worked on a similar project, please let me know. I have two
questions:
1. To choose the optimal k, I am running krunner across my vectorized
dataset for several values of k. For each k, I look at how my observations
are spread across the clusters and try to minimize the size of cluster 1,
which looks like the catch-all bucket (can anyone confirm?). However, the
final count of observations varies depending on k. See below (please
ignore the blank cells).
Any idea why the final count varies depending on the chosen k?
[image: Inline image 1]
2. I also noticed that some of my clusters contain just one observation
(n=1). That doesn't make sense to me. Is there a way to avoid this, or a
particular parameter I can tweak?
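To make the two checks above concrete, here is a minimal Python sketch of
how the per-cluster counts, the total count, and the n=1 clusters could be
tallied for one run. It assumes the cluster assignments have already been
exported as (point id, cluster id) pairs; the function name
summarize_assignments and the sample data are hypothetical and are not part
of Mahout or of the real output.

```python
from collections import Counter


def summarize_assignments(assignments):
    """Summarize (point_id, cluster_id) pairs from a single clustering run.

    Hypothetical helper: the input format is an assumption, not Mahout output.
    """
    # Number of points assigned to each cluster id.
    sizes = Counter(cluster_id for _, cluster_id in assignments)
    # Total number of assigned points; compare this across different k.
    total = sum(sizes.values())
    # Cluster ids that hold exactly one observation.
    singletons = sorted(c for c, n in sizes.items() if n == 1)
    # Share of all points held by the largest cluster (a catch-all indicator).
    largest_share = max(sizes.values(), default=0) / total if total else 0.0
    return {
        "sizes": dict(sizes),
        "total": total,
        "singletons": singletons,
        "largest_share": largest_share,
    }


# Made-up toy data for illustration only.
sample = [("d1", 0), ("d2", 0), ("d3", 0), ("d4", 1), ("d5", 2), ("d6", 2)]
summary = summarize_assignments(sample)
print(summary)
```

Running this once per value of k and comparing the "total" field against
the number of input vectors would show whether points are being dropped or
counted more than once for some k.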
Thank you, and I look forward to your replies.
Cheers,
Viral