Posted to user@mahout.apache.org by Chris Harrington <ch...@heystaks.com> on 2013/02/04 19:57:26 UTC

Does something like an "explain" feature exist in Mahout for clustering.

I was wondering if there was an explain feature in Mahout, something that gives the reason why it did what it did, shows the values of the various features it used to evaluate and choose the result, etc.

I ask because some wildly different texts are being clustered together. For example, it clustered these two together, and I'd like to be able to figure out why:

Text 1: "Iron Butterfly Bassist Lee Dorman Dies at 70"

Text 2: "The BEST Memes Of 2012 2012 was a landmark year for memes -- and we could say that due to the Ikea Monkey alone -- but it's not always easy…"

Re: Does something like an "explain" feature exist in Mahout for clustering.

Posted by Chris Harrington <ch...@heystaks.com>.
I found this on Stack Overflow, which helped a lot: http://stackoverflow.com/questions/5805225/interpreting-output-from-mahout-clusterdumper

Since I was able to get a map of file names to clusters using the link above, I was able to build something to output various interesting things, such as a map of categories to clusters (since the test data was labeled) and the percentage of each category's docs that ended up in each cluster (e.g. 23% of category B ended up in cluster 2).
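
A minimal sketch of that per-category breakdown in Python, not the code actually used here. It assumes the file-name-to-cluster map and the file-name-to-category labels have already been loaded into two dictionaries; the function name and the toy data are hypothetical.

```python
from collections import Counter, defaultdict


def category_cluster_percentages(doc_to_cluster, doc_to_category):
    """Return {category: {cluster_id: percent of that category's docs}}."""
    # Count how many documents of each category landed in each cluster.
    counts = defaultdict(Counter)
    for doc, cluster in doc_to_cluster.items():
        counts[doc_to_category[doc]][cluster] += 1

    # Convert the counts to percentages of the category's total.
    percentages = {}
    for category, per_cluster in counts.items():
        total = sum(per_cluster.values())
        percentages[category] = {
            cluster: 100.0 * n / total for cluster, n in per_cluster.items()
        }
    return percentages


# Hypothetical toy data: file name -> cluster id, and file name -> label.
doc_to_cluster = {
    "a1.txt": 2, "a2.txt": 2,
    "b1.txt": 2, "b2.txt": 4, "b3.txt": 4, "b4.txt": 4,
}
doc_to_category = {
    "a1.txt": "A", "a2.txt": "A",
    "b1.txt": "B", "b2.txt": "B", "b3.txt": "B", "b4.txt": "B",
}

breakdown = category_cluster_percentages(doc_to_cluster, doc_to_category)
for category, per_cluster in sorted(breakdown.items()):
    for cluster, pct in sorted(per_cluster.items()):
        print(f"{pct:.2f}% of category {category} ended up in cluster {cluster}")
```

Sorting each category's clusters by percentage would put the small, suspicious fractions at one end of the list.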

Then, using this same info, I created a directory structure with one directory per category, each holding one text file per cluster containing the content of the documents from that category that were clustered into that cluster.

Then, for each category, I looked at where the small fractions of its documents went (e.g. 0.85% of category B ended up in cluster 4) and checked the text of those docs against the top 50 keywords from the clusterdumper utility. This showed that at least one top keyword was matching, which was causing the strange clustering I was seeing.
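
The keyword check can be sketched roughly as follows. This is an illustration under assumptions, not the code used above: the function name, the top-term list, and the document text are all hypothetical, and the regex tokenizer is a simple stand-in for whatever analyzer produced the real vectors, so stemmed or otherwise normalized terms may not line up exactly.

```python
import re


def matching_top_terms(text, top_terms):
    """Return the cluster's top terms that occur in text, in ranked order."""
    # Lowercase and split into simple word tokens (stand-in tokenizer).
    tokens = set(re.findall(r"[a-z0-9']+", text.lower()))
    return [term for term in top_terms if term in tokens]


# Hypothetical top terms for one cluster; real ones would be read from
# the clusterdumper output for that cluster.
top_terms = ["election", "new", "vote"]

# Hypothetical document that landed in this cluster unexpectedly.
outlier_text = "Local band releases new album"

print(matching_top_terms(outlier_text, top_terms))
```

An empty result for a document would suggest that the tokenization differs from the one used for clustering, or that the match comes from terms outside the top list.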

Hopefully the above will be of help to someone else.



On 5 Feb 2013, at 18:43, Chris Harrington wrote:

> I'm currently using KMeans with canopy and Cosine as the distance measure. The data I'm using has been somewhat curated into categories, so I expected the documents to cluster alongside the other documents in their respective categories. Some of them fall nicely into the clusters I'd expect, but others are like the examples I gave in the first mail. I suspect some of the oddities are due to noise in the data, of which there is a considerable amount (e.g. documents with only 2 words).


Re: Does something like an "explain" feature exist in Mahout for clustering.

Posted by Chris Harrington <ch...@heystaks.com>.
I'm currently using KMeans with canopy and Cosine as the distance measure. The data I'm using has been somewhat curated into categories, so I expected the documents to cluster alongside the other documents in their respective categories. Some of them fall nicely into the clusters I'd expect, but others are like the examples I gave in the first mail. I suspect some of the oddities are due to noise in the data, of which there is a considerable amount (e.g. documents with only 2 words).
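
One way to see why very short documents could behave like noise under a cosine measure is the small sketch below. It is only an illustration: it uses raw term counts rather than the weighted vectors a real run would use, takes cosine distance as 1 minus cosine similarity, and all names and texts are made up.

```python
import math
from collections import Counter


def term_counts(text):
    """Build a raw term-count vector from whitespace-separated words."""
    return Counter(text.lower().split())


def cosine_distance(a, b):
    """Cosine distance (1 - cosine similarity) between two sparse vectors."""
    dot = sum(weight * b.get(term, 0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # treat an empty document as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


# Two 2-word documents sharing one word: distance is about 0.5.
short_pair = cosine_distance(
    term_counts("best memes"),
    term_counts("memes monkey"),
)

# Two 8-word documents sharing the same single word: distance is about 0.875.
long_pair = cosine_distance(
    term_counts("best memes of the year ranked by readers"),
    term_counts("memes about monkey in furniture store went viral"),
)

print(short_pair, long_pair)
```

With only two words per document, a single shared term already accounts for half of each vector, so such documents can look much closer to each other than longer documents that share the same term.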

 
On 4 Feb 2013, at 22:28, Jeff Eastman wrote:

> That's a really good question. Mahout does not have an "explain" feature; however, you can use the ClusterDumper to print out the cluster centers and vectors clustered within each cluster. Output is pretty verbose and, with large text vectors being truncated, might not be that useful. You might need to write something to do this. Look at the cluster evaluator tests for some hints.
> 
> Which algorithm were you using?


Re: Does something like an "explain" feature exist in Mahout for clustering.

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
That's a really good question. Mahout does not have an "explain" 
feature; however, you can use the ClusterDumper to print out the cluster 
centers and vectors clustered within each cluster. Output is pretty 
verbose and, with large text vectors being truncated, might not be that 
useful. You might need to write something to do this. Look at the 
cluster evaluator tests for some hints.

Which algorithm were you using?

On 2/4/13 1:57 PM, Chris Harrington wrote:
> I was wondering if there was an explain feature in Mahout, something that gives the reason why it did what it did, shows the values of the various features it used to evaluate and choose the result, etc.
>
> I ask because some wildly different texts are being clustered together. For example, it clustered these two together, and I'd like to be able to figure out why:
>
> Text 1: "Iron Butterfly Bassist Lee Dorman Dies at 70"
>
> Text 2: "The BEST Memes Of 2012 2012 was a landmark year for memes -- and we could say that due to the Ikea Monkey alone -- but it's not always easy…"
>


Re: Does something like an "explain" feature exist in Mahout for clustering.

Posted by Steven Bourke <st...@ucd.ie>.

Sent from phone

On 4 Feb 2013, at 18:57, Chris Harrington <ch...@heystaks.com> wrote:

> I was wondering if there was an explain feature in Mahout, something that gives the reason why it did what it did, shows the values of the various features it used to evaluate and choose the result, etc.
> 
> I ask because some wildly different texts are being clustered together. For example, it clustered these two together, and I'd like to be able to figure out why:
> 
> Text 1: "Iron Butterfly Bassist Lee Dorman Dies at 70"
> 
> Text 2: "The BEST Memes Of 2012 2012 was a landmark year for memes -- and we could say that due to the Ikea Monkey alone -- but it's not always easy…"