Posted to user@mahout.apache.org by David Noel <da...@gmail.com> on 2014/05/24 06:39:10 UTC

Similarity Measures for Text Document Clustering

I found an interesting paper that I thought someone here might find helpful.

http://www.milanmirkovic.com/wp-content/uploads/2012/10/pg049_Similarity_Measures_for_Text_Document_Clustering.pdf

ABSTRACT: ... A wide variety of distance functions and similarity
measures have been used for clustering, such as squared Euclidean
distance, cosine similarity, and relative entropy. In this paper, we
compare and analyze the effectiveness of these measures in partitional
clustering for text document datasets. Our experiments utilize the
standard K-means algorithm and we report results on seven text
document datasets and five distance/similarity measures that have been
most commonly used in text clustering.

TL;DR: For text documents, favor Cosine, Jaccard/Tanimoto, or Pearson
over Euclidean distance measures.
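
To make the comparison concrete, here is a small, self-contained Java
sketch (mine, not from the paper; the toy term-count vectors are
invented for illustration) that computes cosine and Tanimoto (extended
Jaccard) similarity on sparse term-frequency maps, with Euclidean
distance on the raw counts shown for contrast:

import java.util.HashMap;
import java.util.Map;

public class SimilarityDemo {
  public static void main(String[] args) {
    // Toy term-frequency vectors for two short documents (invented data).
    Map<String, Double> d1 = new HashMap<>();
    d1.put("cluster", 3.0); d1.put("text", 2.0); d1.put("kmeans", 1.0);
    Map<String, Double> d2 = new HashMap<>();
    d2.put("cluster", 1.0); d2.put("text", 4.0); d2.put("document", 2.0);

    // Dot product over shared terms, plus the squared norm of each vector.
    double dot = 0, n1 = 0, n2 = 0;
    for (Map.Entry<String, Double> e : d1.entrySet()) {
      n1 += e.getValue() * e.getValue();
      Double v = d2.get(e.getKey());
      if (v != null) dot += e.getValue() * v;
    }
    for (double v : d2.values()) n2 += v * v;

    // Cosine: dot / (||d1|| * ||d2||); invariant to document length.
    double cosine = dot / (Math.sqrt(n1) * Math.sqrt(n2));
    // Tanimoto (extended Jaccard): dot / (||d1||^2 + ||d2||^2 - dot).
    double tanimoto = dot / (n1 + n2 - dot);
    // Euclidean distance on the raw counts, for contrast.
    double euclidean = Math.sqrt(n1 + n2 - 2 * dot);

    System.out.println("cosine    = " + cosine);
    System.out.println("tanimoto  = " + tanimoto);
    System.out.println("euclidean = " + euclidean);
  }
}

On raw counts like these, Euclidean distance is dominated by document
length, which is the main reason the length-invariant measures tend to
work better for text.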

Re: Similarity Measures for Text Document Clustering

Posted by Ted Dunning <te...@gmail.com>.
Note that cosine distance *is* Euclidean with the addition of document
length normalization.
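
To spell that out: if documents x and y are scaled to unit length, then

  ||x - y||^2 = ||x||^2 + ||y||^2 - 2 (x . y) = 2 (1 - cos(x, y)),

so squared Euclidean distance on length-normalized vectors is a strictly
decreasing function of cosine similarity, and the two rank neighbors
identically. For K-means the centroids have to be re-normalized as well
for the equivalence to carry through, which is the "document and
centroid normalization" point in the follow-up below.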

Re: Similarity Measures for Text Document Clustering

Posted by Ted Dunning <te...@gmail.com>.
I just read this paper and it is very nicely written up.  There are a few
unfortunate omissions:

1) cosine is equivalent to Euclidean with the addition of document and
centroid normalization.

2) the entropy measure given appears to be an ad hoc partial derivation of
mutual information, but this is not mentioned, nor are the differences
examined.

3) the tf-idf measure used relies on straight (raw) tf.  It is usually
better to use log(tf) or sqrt(tf).  This is not examined.

4) the same number of clusters as target categories is used.  Commonly,
clustering is used as a feature for classification and there is no
rationale in that case for the number of clusters to be the same as the
number of target categories.

5) if (4) is accepted, then mutual information is immediately better than
the entropy measure shown, since it normalizes away the number of
clusters (see the sketch below).
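
To make (2) and (5) concrete, here is a minimal Java sketch (mine, not
from the paper) that computes mutual information between cluster
assignments and class labels from a contingency table, plus the
normalized variant (NMI); the counts are made-up illustration data:

public class NmiDemo {
  static double log2(double x) { return Math.log(x) / Math.log(2); }

  public static void main(String[] args) {
    // counts[i][j] = number of documents in cluster i with true label j
    // (invented numbers, purely for illustration).
    double[][] counts = { {40, 5, 5}, {3, 30, 7}, {2, 4, 44} };

    double n = 0;
    double[] rowSum = new double[counts.length];
    double[] colSum = new double[counts[0].length];
    for (int i = 0; i < counts.length; i++)
      for (int j = 0; j < counts[0].length; j++) {
        rowSum[i] += counts[i][j];
        colSum[j] += counts[i][j];
        n += counts[i][j];
      }

    // Mutual information between clusters and labels, and the two
    // marginal entropies.
    double mi = 0, hClusters = 0, hLabels = 0;
    for (int i = 0; i < counts.length; i++)
      for (int j = 0; j < counts[0].length; j++)
        if (counts[i][j] > 0) {
          double pij = counts[i][j] / n;
          mi += pij * log2(pij / ((rowSum[i] / n) * (colSum[j] / n)));
        }
    for (double r : rowSum) if (r > 0) hClusters -= (r / n) * log2(r / n);
    for (double c : colSum) if (c > 0) hLabels -= (c / n) * log2(c / n);

    // Normalizing MI by the entropies bounds the score at 1 and keeps it
    // comparable when the number of clusters differs from the number of
    // target categories.
    double nmi = mi / Math.sqrt(hClusters * hLabels);
    System.out.println("MI  = " + mi);
    System.out.println("NMI = " + nmi);
  }
}

Because the normalization divides out the entropies of both partitions,
the score stays comparable even when the number of clusters is not tied
to the number of target categories.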
