Posted to user@mahout.apache.org by Yang <te...@gmail.com> on 2014/10/21 23:13:27 UTC

mahout kmeans gives a random result for short documents

We are trying to run kmeans on some product titles
so that we can cluster similar products together,
like "nike flex sneaker size 9" vs "nike flex sneaker size 8".
It works fine for most titles,
but it turns out that a lot of the titles are very short (particularly
after filtering stopwords),
so I got many 1-word or 2-word titles,
and somehow these got lumped together into one huge cluster
whose members have no similarity to each other at all.
I followed some specific examples in this cluster, and
it seems that the algorithm is indeed doing what it's supposed to do.
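
To illustrate how this lumping can happen, here is a minimal sketch in Python. It assumes raw term-count vectors and Euclidean distance; the vectorization and distance measure used in the actual run are not stated here, and the titles, cluster names and helper functions are all made up for the example. Under those assumptions, a 1-word title ends up closer to the centroid of unrelated short titles (which sits near the origin) than to the centroid of the cluster it shares a word with.

```python
import math


def vectorize(title, vocab):
    """Raw term-count vector over a fixed vocabulary (no TF-IDF, no normalization)."""
    tokens = title.split()
    return [float(tokens.count(term)) for term in vocab]


def euclidean(a, b):
    """Plain Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def centroid(vectors):
    """Component-wise mean of a list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


# Hypothetical titles, invented for illustration only.
long_titles = ["nike flex sneaker size 9", "nike flex sneaker size 8"]
short_titles = ["lamp", "socks", "kettle", "rug"]  # unrelated 1-word titles
probe = "nike"  # a title reduced to a single word after stopword filtering

vocab = sorted(
    {term for title in long_titles + short_titles + [probe] for term in title.split()}
)

sneaker_centroid = centroid([vectorize(t, vocab) for t in long_titles])
catch_all_centroid = centroid([vectorize(t, vocab) for t in short_titles])
probe_vec = vectorize(probe, vocab)

d_sneaker = euclidean(probe_vec, sneaker_centroid)
d_catch_all = euclidean(probe_vec, catch_all_centroid)

print("distance to sneaker centroid:   %.3f" % d_sneaker)
print("distance to catch-all centroid: %.3f" % d_catch_all)
```

In this toy setup the probe shares a word with the sneaker cluster and none with the catch-all cluster, yet the catch-all centroid is the nearer one, simply because both the probe and that centroid have very few, very small non-zero components.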


Does anybody have similar experience clustering particularly short "documents"?
More generally, are there any tricks to force the members to "jump" out and join another
cluster? (I do see other smaller clusters, with matching words.)


Thanks
Yang