Posted to dev@mahout.apache.org by "Shashikant Kore (JIRA)" <ji...@apache.org> on 2009/08/13 12:37:14 UTC

[jira] Created: (MAHOUT-163) Get (better) cluster labels using Log Likelihood Ratio

Get (better) cluster labels using Log Likelihood Ratio
------------------------------------------------------

                 Key: MAHOUT-163
                 URL: https://issues.apache.org/jira/browse/MAHOUT-163
             Project: Mahout
          Issue Type: Improvement
            Reporter: Shashikant Kore
         Attachments: mahout-cluster-labels-llr.patch

Log Likelihood Ratio (LLR) is a better technique for identifying cluster labels than taking the top features of the centroid vector. LLR finds terms/phrases that are common in the cluster but rare outside it.



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Re: [jira] Commented: (MAHOUT-163) Get (better) cluster labels using Log Likelihood Ratio

Posted by Ted Dunning <te...@gmail.com>.

Yes.

:-)

I should contribute a patch, actually.

On Thu, Aug 13, 2009 at 7:48 AM, Grant Ingersoll (JIRA) <ji...@apache.org> wrote:

> Does it make sense to have some generic LLR capabilities, that can be
> utilized in multiple places?
>



-- 
Ted Dunning, CTO
DeepDyve

[jira] Updated: (MAHOUT-163) Get (better) cluster labels using Log Likelihood Ratio

Posted by "Shashikant Kore (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shashikant Kore updated MAHOUT-163:
-----------------------------------

    Attachment: MAHOUT-163-17sep.patch

Updated patch. It now uses a bitset to find the in-cluster document frequency (DF) instead of deleting the out-of-cluster documents from the index, so a failure partway through will not affect the index.
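
A minimal sketch of the bitset idea, in plain Java rather than against a Lucene index. The class and method names (InClusterDf, clusterBits, inClusterDf) and the int[] posting list are hypothetical stand-ins for illustration, not the code in the patch. The point is only that the index is read and never modified.

```java
import java.util.BitSet;

// Hypothetical illustration: count a term's in-cluster document frequency
// by testing each posting against a bitset of cluster members.
final class InClusterDf {

    // Marks the ids of the documents that belong to the cluster.
    static BitSet clusterBits(int[] clusterDocIds, int maxDoc) {
        BitSet bits = new BitSet(maxDoc);
        for (int id : clusterDocIds) {
            bits.set(id);
        }
        return bits;
    }

    // Counts how many documents in a term's posting list are cluster members.
    static int inClusterDf(int[] postings, BitSet cluster) {
        int df = 0;
        for (int docId : postings) {
            if (cluster.get(docId)) {
                df++;
            }
        }
        return df;
    }

    public static void main(String[] args) {
        BitSet cluster = clusterBits(new int[] {1, 4, 7}, 10);
        int[] postings = {0, 1, 2, 7, 9};   // documents containing the term
        int inDf = inClusterDf(postings, cluster);
        int outDf = postings.length - inDf; // the rest are outside the cluster
        System.out.println("inDF=" + inDf + " outDF=" + outDf);
    }
}
```

Because nothing is deleted or undeleted, an exception in the middle of the loop leaves the underlying data exactly as it was.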

> Get (better) cluster labels using Log Likelihood Ratio
> ------------------------------------------------------
>
>                 Key: MAHOUT-163
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-163
>             Project: Mahout
>          Issue Type: Improvement
>            Reporter: Shashikant Kore
>            Assignee: Grant Ingersoll
>             Fix For: 0.2
>
>         Attachments: MAHOUT-163-17sep.patch, MAHOUT-163.patch, MAHOUT-163.patch, mahout-163.patch, mahout-cluster-labels-llr.patch
>
>
> Log Likelihood Ratio (LLR) is a better technique to identify cluster labels instead of the top features of the centroid vector. LLR finds terms/phrases which are common in the cluster but rare outside. 


[jira] Updated: (MAHOUT-163) Get (better) cluster labels using Log Likelihood Ratio

Posted by "Shashikant Kore (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shashikant Kore updated MAHOUT-163:
-----------------------------------

    Attachment: mahout-163.patch

Revised patch.

> Get (better) cluster labels using Log Likelihood Ratio
> ------------------------------------------------------
>
>                 Key: MAHOUT-163
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-163
>             Project: Mahout
>          Issue Type: Improvement
>            Reporter: Shashikant Kore
>             Fix For: 0.2
>
>         Attachments: mahout-163.patch, mahout-cluster-labels-llr.patch
>
>
> Log Likelihood Ratio (LLR) is a better technique to identify cluster labels instead of the top features of the centroid vector. LLR finds terms/phrases which are common in the cluster but rare outside. 


[jira] Updated: (MAHOUT-163) Get (better) cluster labels using Log Likelihood Ratio

Posted by "Shashikant Kore (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shashikant Kore updated MAHOUT-163:
-----------------------------------

    Attachment: mahout-cluster-labels-llr.patch

Patch for getting top labels with LLR.

> Get (better) cluster labels using Log Likelihood Ratio
> ------------------------------------------------------
>
>                 Key: MAHOUT-163
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-163
>             Project: Mahout
>          Issue Type: Improvement
>            Reporter: Shashikant Kore
>         Attachments: mahout-cluster-labels-llr.patch
>
>
> Log Likelihood Ratio (LLR) is a better technique to identify cluster labels instead of the top features of the centroid vector. LLR finds terms/phrases which are common in the cluster but rare outside. 


[jira] Commented: (MAHOUT-163) Get (better) cluster labels using Log Likelihood Ratio

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12742842#action_12742842 ] 

Grant Ingersoll commented on MAHOUT-163:
----------------------------------------

I only briefly scanned the patch, but I've seen Ted mention LLR several times now in relation to several different things.  Does it make sense to have some generic LLR capabilities, that can be utilized in multiple places?

> Get (better) cluster labels using Log Likelihood Ratio
> ------------------------------------------------------
>
>                 Key: MAHOUT-163
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-163
>             Project: Mahout
>          Issue Type: Improvement
>            Reporter: Shashikant Kore
>         Attachments: mahout-cluster-labels-llr.patch
>
>
> Log Likelihood Ratio (LLR) is a better technique to identify cluster labels instead of the top features of the centroid vector. LLR finds terms/phrases which are common in the cluster but rare outside. 


[jira] Updated: (MAHOUT-163) Get (better) cluster labels using Log Likelihood Ratio

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll updated MAHOUT-163:
-----------------------------------

    Fix Version/s:     (was: 0.2)
                   0.3

Moving to 0.3. I'd like to see this be a little bit more generic in terms of where the original vectors come from.  In other words, I wonder if we can have a version that doesn't assume Lucene?  Thus, I don't want to rush this in.

> Get (better) cluster labels using Log Likelihood Ratio
> ------------------------------------------------------
>
>                 Key: MAHOUT-163
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-163
>             Project: Mahout
>          Issue Type: Improvement
>            Reporter: Shashikant Kore
>            Assignee: Grant Ingersoll
>             Fix For: 0.3
>
>         Attachments: MAHOUT-163-17sep.patch, MAHOUT-163.patch, MAHOUT-163.patch, mahout-163.patch, mahout-cluster-labels-llr.patch
>
>
> Log Likelihood Ratio (LLR) is a better technique to identify cluster labels instead of the top features of the centroid vector. LLR finds terms/phrases which are common in the cluster but rare outside. 


[jira] Commented: (MAHOUT-163) Get (better) cluster labels using Log Likelihood Ratio

Posted by "Shashikant Kore (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795570#action_12795570 ] 

Shashikant Kore commented on MAHOUT-163:
----------------------------------------

Grant, 

Yes, it should have been a configurable number.

If the corpus is big (tens of thousands of documents or more, which is the size I was working with), clusters that small are most likely formed by outliers. Ignoring them doesn't have any impact on quality.


> Get (better) cluster labels using Log Likelihood Ratio
> ------------------------------------------------------
>
>                 Key: MAHOUT-163
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-163
>             Project: Mahout
>          Issue Type: Improvement
>            Reporter: Shashikant Kore
>            Assignee: Grant Ingersoll
>             Fix For: 0.3
>
>         Attachments: MAHOUT-163-17sep.patch, MAHOUT-163.patch, MAHOUT-163.patch, MAHOUT-163.patch, MAHOUT-163.patch, mahout-163.patch, mahout-cluster-labels-llr.patch
>
>
> Log Likelihood Ratio (LLR) is a better technique to identify cluster labels instead of the top features of the centroid vector. LLR finds terms/phrases which are common in the cluster but rare outside. 


[jira] Commented: (MAHOUT-163) Get (better) cluster labels using Log Likelihood Ratio

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795360#action_12795360 ] 

Grant Ingersoll commented on MAHOUT-163:
----------------------------------------

Committed revision 894684.

> Get (better) cluster labels using Log Likelihood Ratio
> ------------------------------------------------------
>
>                 Key: MAHOUT-163
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-163
>             Project: Mahout
>          Issue Type: Improvement
>            Reporter: Shashikant Kore
>            Assignee: Grant Ingersoll
>             Fix For: 0.3
>
>         Attachments: MAHOUT-163-17sep.patch, MAHOUT-163.patch, MAHOUT-163.patch, MAHOUT-163.patch, MAHOUT-163.patch, mahout-163.patch, mahout-cluster-labels-llr.patch
>
>
> Log Likelihood Ratio (LLR) is a better technique to identify cluster labels instead of the top features of the centroid vector. LLR finds terms/phrases which are common in the cluster but rare outside. 


[jira] Commented: (MAHOUT-163) Get (better) cluster labels using Log Likelihood Ratio

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795349#action_12795349 ] 

Grant Ingersoll commented on MAHOUT-163:
----------------------------------------

Shashi,

I'm looking at this again and was wondering why it skips the labels when there are fewer than 50 ids.  Seems like this could be configurable, or are you saying that it only performs well w/ at least 50?

> Get (better) cluster labels using Log Likelihood Ratio
> ------------------------------------------------------
>
>                 Key: MAHOUT-163
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-163
>             Project: Mahout
>          Issue Type: Improvement
>            Reporter: Shashikant Kore
>            Assignee: Grant Ingersoll
>             Fix For: 0.3
>
>         Attachments: MAHOUT-163-17sep.patch, MAHOUT-163.patch, MAHOUT-163.patch, mahout-163.patch, mahout-cluster-labels-llr.patch
>
>
> Log Likelihood Ratio (LLR) is a better technique to identify cluster labels instead of the top features of the centroid vector. LLR finds terms/phrases which are common in the cluster but rare outside. 


[jira] Commented: (MAHOUT-163) Get (better) cluster labels using Log Likelihood Ratio

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12752861#action_12752861 ] 

Grant Ingersoll commented on MAHOUT-163:
----------------------------------------

Shashi, I'm having trouble applying the patch.  Can you generate it again against trunk?

> Get (better) cluster labels using Log Likelihood Ratio
> ------------------------------------------------------
>
>                 Key: MAHOUT-163
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-163
>             Project: Mahout
>          Issue Type: Improvement
>            Reporter: Shashikant Kore
>            Assignee: Grant Ingersoll
>             Fix For: 0.2
>
>         Attachments: mahout-163.patch, mahout-cluster-labels-llr.patch
>
>
> Log Likelihood Ratio (LLR) is a better technique to identify cluster labels instead of the top features of the centroid vector. LLR finds terms/phrases which are common in the cluster but rare outside. 


[jira] Commented: (MAHOUT-163) Get (better) cluster labels using Log Likelihood Ratio

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12754590#action_12754590 ] 

Grant Ingersoll commented on MAHOUT-163:
----------------------------------------

Hmm, deleting the out of cluster docs from the index seems pretty harsh for a class that is just supposed to print out labels, even if we do undelete them.  If there were to be an error between those two events, that could screw up the index.  We should probably generate a DocIdSet of the docs out of the cluster and then use that in conjunction with a FilterIndexReader to skip, etc. those docs that are not in clusters.


> Get (better) cluster labels using Log Likelihood Ratio
> ------------------------------------------------------
>
>                 Key: MAHOUT-163
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-163
>             Project: Mahout
>          Issue Type: Improvement
>            Reporter: Shashikant Kore
>            Assignee: Grant Ingersoll
>             Fix For: 0.2
>
>         Attachments: MAHOUT-163.patch, mahout-163.patch, mahout-cluster-labels-llr.patch
>
>
> Log Likelihood Ratio (LLR) is a better technique to identify cluster labels instead of the top features of the centroid vector. LLR finds terms/phrases which are common in the cluster but rare outside. 


[jira] Resolved: (MAHOUT-163) Get (better) cluster labels using Log Likelihood Ratio

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen resolved MAHOUT-163.
------------------------------

    Resolution: Fixed

It sounds like this was committed, so resolving (?)

> Get (better) cluster labels using Log Likelihood Ratio
> ------------------------------------------------------
>
>                 Key: MAHOUT-163
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-163
>             Project: Mahout
>          Issue Type: Improvement
>            Reporter: Shashikant Kore
>            Assignee: Grant Ingersoll
>             Fix For: 0.3
>
>         Attachments: MAHOUT-163-17sep.patch, MAHOUT-163.patch, MAHOUT-163.patch, MAHOUT-163.patch, MAHOUT-163.patch, mahout-163.patch, mahout-cluster-labels-llr.patch
>
>
> Log Likelihood Ratio (LLR) is a better technique to identify cluster labels instead of the top features of the centroid vector. LLR finds terms/phrases which are common in the cluster but rare outside. 


[jira] Updated: (MAHOUT-163) Get (better) cluster labels using Log Likelihood Ratio

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll updated MAHOUT-163:
-----------------------------------

    Fix Version/s: 0.2

> Get (better) cluster labels using Log Likelihood Ratio
> ------------------------------------------------------
>
>                 Key: MAHOUT-163
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-163
>             Project: Mahout
>          Issue Type: Improvement
>            Reporter: Shashikant Kore
>             Fix For: 0.2
>
>         Attachments: mahout-cluster-labels-llr.patch
>
>
> Log Likelihood Ratio (LLR) is a better technique to identify cluster labels instead of the top features of the centroid vector. LLR finds terms/phrases which are common in the cluster but rare outside. 


[jira] Updated: (MAHOUT-163) Get (better) cluster labels using Log Likelihood Ratio

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll updated MAHOUT-163:
-----------------------------------

    Attachment: MAHOUT-163.patch

Updates some of the Lucene code a wee bit.

> Get (better) cluster labels using Log Likelihood Ratio
> ------------------------------------------------------
>
>                 Key: MAHOUT-163
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-163
>             Project: Mahout
>          Issue Type: Improvement
>            Reporter: Shashikant Kore
>            Assignee: Grant Ingersoll
>             Fix For: 0.2
>
>         Attachments: MAHOUT-163.patch, MAHOUT-163.patch, mahout-163.patch, mahout-cluster-labels-llr.patch
>
>
> Log Likelihood Ratio (LLR) is a better technique to identify cluster labels instead of the top features of the centroid vector. LLR finds terms/phrases which are common in the cluster but rare outside. 


[jira] Commented: (MAHOUT-163) Get (better) cluster labels using Log Likelihood Ratio

Posted by "Ted Dunning (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12743359#action_12743359 ] 

Ted Dunning commented on MAHOUT-163:
------------------------------------

Here are some code comments:

This isn't a property, so it shouldn't be called get*.  How about scoreDocumentFrequencies instead?
{noformat}  
private double getLLRFromDF(int inDF, int outDF, int clusterSize, int corpusSize) {  
+    int k11 = inDF;
+    int k12 = clusterSize - inDF;
+    int k21 = outDF; 
+    int k22 = corpusSize - clusterSize - outDF;
+    
+    double llr = getLLR(k11, k12, k21, k22);
+
+    return llr;
+  }
{noformat}
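
As a worked illustration of the 2x2 table built above, here is a small sketch with concrete numbers. The class name ContingencySketch, the cells helper, and the bounds check are hypothetical additions for illustration; only the four cell formulas come from the quoted patch.

```java
// Hypothetical illustration of the contingency table used for LLR:
//                 term present   term absent
//   in cluster        k11            k12
//   out of cluster    k21            k22
final class ContingencySketch {

    static long[] cells(int inDf, int outDf, int clusterSize, int corpusSize) {
        long k11 = inDf;                              // cluster docs with the term
        long k12 = clusterSize - inDf;                // cluster docs without it
        long k21 = outDf;                             // other docs with the term
        long k22 = corpusSize - clusterSize - outDf;  // other docs without it
        if (k11 < 0 || k12 < 0 || k21 < 0 || k22 < 0) {
            throw new IllegalArgumentException("inconsistent document frequencies");
        }
        return new long[] {k11, k12, k21, k22};
    }

    public static void main(String[] args) {
        // 1000 documents, 100 in the cluster; the term occurs in 40 cluster
        // documents and in 10 documents outside the cluster.
        long[] k = cells(40, 10, 100, 1000);
        System.out.println(k[0] + " " + k[1] + " " + k[2] + " " + k[3]);
    }
}
```

The four cells always sum to the corpus size, which is a cheap sanity check on the inputs.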

This next has a spelling error (missing P).  Also, entropy isn't a property so this isn't a getter.  Since it is a pure function, entropy is a fine name (no verb needed).  a,b,c,d should have names (I suggest k1 .. k4).
{noformat}
+  private double getEntroy(int a, int b, int c, int d) {
+    int[] elements = {a, b, c, d};
+    return getEntropy(elements);
+  }
+  
+  private double getEntropy(int a, int b) {
+    int[] elements = {a, b};
+    return getEntropy(elements);
+  }
{noformat}
In terms of generality, it might be preferable to use doubles instead of ints here.  Again, this isn't a property, so it shouldn't be a getter. 
{noformat}
+  private double getEntropy(int[] elements) {
+    int sum = 0;    
{noformat}
The variable i has the connotation of an index.  It can also be a bit hard to read.  Pick something else.
{noformat}
+    for (int i : elements) {
+      sum += i;
+    }
{noformat}
Don't use a float constant for a double value.
{noformat}
+    double result = 0.0f;
+    for (int i : elements) {
+      result += getElementEntropy(i, sum);
+    }
{noformat}
Why the use of zero here?  Unary minus is just fine.
{noformat}

+    return 0.0d - result;
+  }
+  
{noformat}
This is a trivial function used in only one place and not visible to tests.  Why is it separated out?
{noformat}
+  private double getElementEntropy(int a, int sum) {
+    int zeroFlag = (a == 0 ? 1: 0);
{noformat}
Use a cast here, not *1f.  Cast to double, not float.
{noformat}
+    double result = a * Math.log((a+zeroFlag)*1.0f/sum);
+    return result;
+  }
{noformat}
Actually, I would recommend rewriting the entropy loop this way:
{noformat}
private double getEntropy(int[] elements) {
    double sum = 0;
    for (int x : elements) {
      sum += x;
    }
    double result = 0.0;
    for (int x : elements) {
        if (x > 0) {
            result += x * Math.log(x / sum);
        } else if (x < 0) {
            throw new IllegalArgumentException("Should not have negative count for entropy computation: (" + x + ")");
        }
    }
    return -result;
}
{noformat}
This is fine except for the get* name.  LLR should have a reference or an explanation in the javadoc.
{noformat}
+  private double getLLR(int k11, int k12, int k21, int k22) {    
+    double rowEntropy = getEntropy(k11, k12) + getEntropy(k21, k22);
+    double columnEntropy = getEntropy(k11, k21) + getEntropy(k12, k22);
+    double matrixEntropy = getEntroy(k11, k12, k21, k22);
+    double result = 2 * (matrixEntropy - rowEntropy - columnEntropy); 
+    return result;    
+  }
+
{noformat}  
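
Putting the review comments together, the computation could be sketched as below. This is not the committed code: the class name LlrSketch and the method names entropy and logLikelihoodRatio are assumptions chosen to follow the naming advice above (no get* prefix, doubles instead of ints, unary minus). The formula is the one from the quoted patch, where each "entropy" is the unnormalized sum -sum(x * ln(x / total)); with that definition the expression works out to 2N times the mutual information between rows and columns, so it is zero for an independent table.

```java
// Hypothetical sketch of the LLR score for a 2x2 contingency table.
final class LlrSketch {

    // Unnormalized entropy: -sum(x * ln(x / total)). Zero counts contribute nothing.
    static double entropy(double... counts) {
        double sum = 0.0;
        for (double x : counts) {
            if (x < 0) {
                throw new IllegalArgumentException(
                    "Should not have negative count for entropy computation: (" + x + ")");
            }
            sum += x;
        }
        double result = 0.0;
        for (double x : counts) {
            if (x > 0) {
                result += x * Math.log(x / sum);
            }
        }
        return -result;
    }

    // Same combination of entropies as in the patch under review.
    static double logLikelihoodRatio(double k11, double k12, double k21, double k22) {
        double rowEntropy = entropy(k11, k12) + entropy(k21, k22);
        double columnEntropy = entropy(k11, k21) + entropy(k12, k22);
        double matrixEntropy = entropy(k11, k12, k21, k22);
        return 2.0 * (matrixEntropy - rowEntropy - columnEntropy);
    }

    public static void main(String[] args) {
        // Term spread evenly: score is (numerically) zero.
        System.out.println(logLikelihoodRatio(10, 10, 10, 10));
        // Term concentrated in the cluster: score is large.
        System.out.println(logLikelihoodRatio(40, 60, 10, 890));
    }
}
```

A term that is common inside the cluster and rare outside it scores high, which is what makes the score usable for ranking candidate labels.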


> Get (better) cluster labels using Log Likelihood Ratio
> ------------------------------------------------------
>
>                 Key: MAHOUT-163
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-163
>             Project: Mahout
>          Issue Type: Improvement
>            Reporter: Shashikant Kore
>             Fix For: 0.2
>
>         Attachments: mahout-cluster-labels-llr.patch
>
>
> Log Likelihood Ratio (LLR) is a better technique to identify cluster labels instead of the top features of the centroid vector. LLR finds terms/phrases which are common in the cluster but rare outside. 


[jira] Assigned: (MAHOUT-163) Get (better) cluster labels using Log Likelihood Ratio

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll reassigned MAHOUT-163:
--------------------------------------

    Assignee: Grant Ingersoll

> Get (better) cluster labels using Log Likelihood Ratio
> ------------------------------------------------------
>
>                 Key: MAHOUT-163
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-163
>             Project: Mahout
>          Issue Type: Improvement
>            Reporter: Shashikant Kore
>            Assignee: Grant Ingersoll
>             Fix For: 0.2
>
>         Attachments: mahout-163.patch, mahout-cluster-labels-llr.patch
>
>
> Log Likelihood Ratio (LLR) is a better technique to identify cluster labels instead of the top features of the centroid vector. LLR finds terms/phrases which are common in the cluster but rare outside. 


[jira] Updated: (MAHOUT-163) Get (better) cluster labels using Log Likelihood Ratio

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll updated MAHOUT-163:
-----------------------------------

    Attachment: MAHOUT-163.patch

A little more cleanup.

> Get (better) cluster labels using Log Likelihood Ratio
> ------------------------------------------------------
>
>                 Key: MAHOUT-163
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-163
>             Project: Mahout
>          Issue Type: Improvement
>            Reporter: Shashikant Kore
>            Assignee: Grant Ingersoll
>             Fix For: 0.3
>
>         Attachments: MAHOUT-163-17sep.patch, MAHOUT-163.patch, MAHOUT-163.patch, MAHOUT-163.patch, MAHOUT-163.patch, mahout-163.patch, mahout-cluster-labels-llr.patch
>
>
> Log Likelihood Ratio (LLR) is a better technique to identify cluster labels instead of the top features of the centroid vector. LLR finds terms/phrases which are common in the cluster but rare outside. 


[jira] Updated: (MAHOUT-163) Get (better) cluster labels using Log Likelihood Ratio

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll updated MAHOUT-163:
-----------------------------------

    Attachment: MAHOUT-163.patch

Cleans up some issues, adds license headers.  Gives more parameters for controlling the number of labels output and the minimum number of ids in a cluster.
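
To illustrate what the two parameters control, here is a rough sketch. The names TopLabelsSketch, topLabels, minIds and maxLabels are hypothetical, as is the Map of precomputed LLR scores; the actual option names and data structures in the patch may differ.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration: skip clusters that are too small, otherwise
// return the highest-scoring terms as labels.
final class TopLabelsSketch {

    static List<String> topLabels(Map<String, Double> llrByTerm,
                                  int clusterSize, int minIds, int maxLabels) {
        if (clusterSize < minIds) {
            return Collections.emptyList(); // too few ids: no labels for this cluster
        }
        List<Map.Entry<String, Double>> entries = new ArrayList<>(llrByTerm.entrySet());
        // Highest LLR first; the order of ties is unspecified.
        entries.sort(Map.Entry.<String, Double>comparingByValue().reversed());
        List<String> labels = new ArrayList<>();
        for (Map.Entry<String, Double> entry : entries) {
            if (labels.size() == maxLabels) {
                break;
            }
            labels.add(entry.getKey());
        }
        return labels;
    }

    public static void main(String[] args) {
        Map<String, Double> scores = new HashMap<>();
        scores.put("lucene", 12.5);
        scores.put("index", 3.0);
        scores.put("the", 0.1);
        System.out.println(topLabels(scores, 60, 50, 2)); // cluster is large enough
        System.out.println(topLabels(scores, 10, 50, 2)); // cluster is skipped
    }
}
```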

> Get (better) cluster labels using Log Likelihood Ratio
> ------------------------------------------------------
>
>                 Key: MAHOUT-163
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-163
>             Project: Mahout
>          Issue Type: Improvement
>            Reporter: Shashikant Kore
>            Assignee: Grant Ingersoll
>             Fix For: 0.3
>
>         Attachments: MAHOUT-163-17sep.patch, MAHOUT-163.patch, MAHOUT-163.patch, MAHOUT-163.patch, mahout-163.patch, mahout-cluster-labels-llr.patch
>
>
> Log Likelihood Ratio (LLR) is a better technique to identify cluster labels instead of the top features of the centroid vector. LLR finds terms/phrases which are common in the cluster but rare outside. 


[jira] Commented: (MAHOUT-163) Get (better) cluster labels using Log Likelihood Ratio

Posted by "Shashikant Kore (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12743093#action_12743093 ] 

Shashikant Kore commented on MAHOUT-163:
----------------------------------------

I'm not sure if this implementation is generic enough. Ted can take a look at it and decide. 



> Get (better) cluster labels using Log Likelihood Ratio
> ------------------------------------------------------
>
>                 Key: MAHOUT-163
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-163
>             Project: Mahout
>          Issue Type: Improvement
>            Reporter: Shashikant Kore
>             Fix For: 0.2
>
>         Attachments: mahout-cluster-labels-llr.patch
>
>
> Log Likelihood Ratio (LLR) is a better technique to identify cluster labels instead of the top features of the centroid vector. LLR finds terms/phrases which are common in the cluster but rare outside. 


[jira] Updated: (MAHOUT-163) Get (better) cluster labels using Log Likelihood Ratio

Posted by "Shashikant Kore (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shashikant Kore updated MAHOUT-163:
-----------------------------------

    Attachment: MAHOUT-163.patch

Revised patch updated to trunk.

> Get (better) cluster labels using Log Likelihood Ratio
> ------------------------------------------------------
>
>                 Key: MAHOUT-163
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-163
>             Project: Mahout
>          Issue Type: Improvement
>            Reporter: Shashikant Kore
>            Assignee: Grant Ingersoll
>             Fix For: 0.2
>
>         Attachments: MAHOUT-163.patch, mahout-163.patch, mahout-cluster-labels-llr.patch
>
>
> Log Likelihood Ratio (LLR) is a better technique to identify cluster labels instead of the top features of the centroid vector. LLR finds terms/phrases which are common in the cluster but rare outside. 


[jira] Commented: (MAHOUT-163) Get (better) cluster labels using Log Likelihood Ratio

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795604#action_12795604 ] 

Grant Ingersoll commented on MAHOUT-163:
----------------------------------------

Yep, I committed a change w/ those things being configurable.

> Get (better) cluster labels using Log Likelihood Ratio
> ------------------------------------------------------
>
>                 Key: MAHOUT-163
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-163
>             Project: Mahout
>          Issue Type: Improvement
>            Reporter: Shashikant Kore
>            Assignee: Grant Ingersoll
>             Fix For: 0.3
>
>         Attachments: MAHOUT-163-17sep.patch, MAHOUT-163.patch, MAHOUT-163.patch, MAHOUT-163.patch, MAHOUT-163.patch, mahout-163.patch, mahout-cluster-labels-llr.patch
>
>
> Log Likelihood Ratio (LLR) is a better technique to identify cluster labels instead of the top features of the centroid vector. LLR finds terms/phrases which are common in the cluster but rare outside. 
