Posted to user@mahout.apache.org by Frank Scholten <fr...@frankscholten.nl> on 2011/11/08 23:56:11 UTC

Cluster labeling

Hi all,

Sometimes my cluster labels are terms that hardly occur in the
combined text of the documents of a cluster. I would expect to see a
label of a term that occurs very frequently across documents of the
cluster.

For example, suppose there is a cluster of tweets about Mahout. You
would see a lot of occurrences of 'Apache Mahout' in every document.
Maybe a few documents have the term 'License' in them. You could end
up with a 'License' label instead of 'Apache Mahout'.

I think this happens because Mahout sorts the cluster centroid by TF-IDF
weight in descending order and takes the corresponding terms as labels.
So 'License' is chosen because it has a high TF-IDF weight, even though
it occurs in only a few documents of the cluster.
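
To make the effect concrete, here is a small Python sketch. It is not
Mahout's code: the toy term counts, the log(N/df) IDF formula and the
function names are all assumptions made up for illustration. It only
shows how ranking centroid entries by TF-IDF weight can pick a rare
term, while ranking by in-cluster document frequency picks the common
one.

```python
import math
from collections import Counter

# Hypothetical toy corpus: one dict of raw term counts per tweet.
# Every tweet mentions 'mahout', so its corpus-wide IDF is zero.
docs = [
    {"mahout": 2, "clustering": 1},
    {"mahout": 1, "kmeans": 1},
    {"mahout": 3, "license": 2},
    {"mahout": 1, "clustering": 1},
    {"mahout": 1, "hadoop": 1},
    {"mahout": 1, "hadoop": 1, "clustering": 1},
]
cluster_ids = [0, 1, 2, 3]  # assumed cluster membership


def tfidf_centroid(docs, cluster_ids):
    """Mean TF-IDF vector of the cluster, with IDF = log(N / df) (assumed)."""
    n_docs = len(docs)
    # Iterating a dict yields each term once, so this counts documents.
    df = Counter(term for doc in docs for term in doc)
    idf = {term: math.log(n_docs / df[term]) for term in df}
    centroid = Counter()
    for i in cluster_ids:
        for term, tf in docs[i].items():
            centroid[term] += tf * idf[term] / len(cluster_ids)
    return centroid


def cluster_doc_freq(docs, cluster_ids):
    """Number of documents in the cluster that contain each term."""
    return Counter(term for i in cluster_ids for term in docs[i])


centroid = tfidf_centroid(docs, cluster_ids)
doc_freq = cluster_doc_freq(docs, cluster_ids)

label_by_tfidf = max(centroid, key=centroid.get)
label_by_doc_freq = max(doc_freq, key=doc_freq.get)

print("label by centroid TF-IDF weight:", label_by_tfidf)
print("label by cluster document frequency:", label_by_doc_freq)
```

With this made-up data, 'license' occurs in one of the four cluster
documents yet gets the largest centroid weight, while 'mahout' occurs
in all four and gets weight zero. Whether Mahout's own weighting
behaves the same way on real data is exactly what I am asking about.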

Thoughts?

Cheers,

Frank

Fwd: Cluster labeling

Posted by Frank Scholten <fr...@frankscholten.nl>.
Forwarding this to dev.
