Posted to user@mahout.apache.org by David Noel <da...@gmail.com> on 2014/05/12 15:29:33 UTC

Clustering raw articles vs clustering (Stanford's) NER output

I've spent a few weeks tuning Mahout to cluster news articles and have
had decent results. Decent, but still not perfect. In trying to think
of ways to improve my results I had the idea of running Mahout on
output from Stanford's Named Entity Recognizer (NER) instead of the
articles themselves, and seeing how that compared. Has anyone tried
this? Did it generate more cohesive clusters?

Re: Clustering raw articles vs clustering (Stanford's) NER output

Posted by Ted Dunning <te...@gmail.com>.

Clustering with higher-level data available for the distance computation is a fine thing.

The tuning will be very different, but the results can be very good when the named entity recognizer gets a good hit. Since named entities tend to be relatively rare, they get high IDF scores, and other terms recede a bit as a result of normalization.
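
To make the IDF point concrete, here is a minimal sketch in plain Python (not Mahout code). The three-document corpus is made up, and the weighting is the textbook tf * log(N / df) followed by L2 normalization, which is an assumption for illustration; Mahout's own vectorizer uses a different, smoothed formula. In a corpus this small any rare word scores high, so the entity names stand in for the rare terms.

```python
import math
from collections import Counter

# Hypothetical toy corpus (assumption: not from the original thread).
docs = [
    "the president said the economy will grow".split(),
    "the senator said the budget will shrink".split(),
    "the president said Merkel will visit Berlin".split(),
]

def idf(term, docs):
    # Textbook IDF: log(N / df). Assumed here for clarity only.
    df = sum(1 for d in docs if term in d)
    return math.log(len(docs) / df)

def tfidf_unit_vector(doc, docs):
    # Weight each term by tf * idf, then scale the vector to unit L2 length.
    tf = Counter(doc)
    weights = {t: tf[t] * idf(t, docs) for t in tf}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else weights

vec = tfidf_unit_vector(docs[2], docs)
for term, weight in sorted(vec.items(), key=lambda kv: -kv[1]):
    print(f"{term:10s} {weight:.3f}")
```

Terms that occur in every document ("the", "said", "will") get zero weight, "president" gets a small one, and the rare terms carry most of the unit vector, so they dominate any cosine or Euclidean distance computed on it.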
