Posted to user@mahout.apache.org by Viral Parikh <Vi...@match.com> on 2014/12/04 20:22:33 UTC

Mahout used for Text Clustering

Hi Mahout Users!
I am currently working on text clustering, using Mahout's clustering algorithms (k-means, LDA, canopy, etc.).
I have the following questions:
1. Why does Mahout produce clusters that contain only one observation?
2. Is cluster 1 always a catch-all cluster?
3. When I change k in k-means and run clusterdump, the total number of observations changes as k changes. Why is that? Am I missing something?
4. Does normalization (when creating the vectors) lead to better clustering results, especially for unstructured data? In my case it's text data.

Thank you in advance for your help!

Cheers,
V