Posted to dev@mahout.apache.org by "Lance Norskog (Created) (JIRA)" <ji...@apache.org> on 2012/01/05 07:48:39 UTC

[jira] [Created] (MAHOUT-941) Strip quoted text from emails and add statistics to ConfusionMatrix

Strip quoted text from emails and add statistics to ConfusionMatrix
-------------------------------------------------------------------

                 Key: MAHOUT-941
                 URL: https://issues.apache.org/jira/browse/MAHOUT-941
             Project: Mahout
          Issue Type: Improvement
          Components: Classification
            Reporter: Lance Norskog
            Priority: Minor


This patch does 2 things:
# Add a feature to org.apache.mahout.text.SequenceFilesFromMailArchives that removes quoted text from email bodies. This is important because it avoids spamming the term dictionary with repeated text, especially in long email threads.
** The feature defaults to true. Add "--quoted" to the command line to keep the quoted lines.

# Adds some dubious overall measurements to the ConfusionMatrix. 
** Kappa - a standard measurement. 
*** How different is this confusion matrix from chance agreement? 0.0 means no better than random assignment, 1.0 means perfect agreement.
*** I think this is an "unweighted" kappa. 
** "Success" - a homegrown formula attempting to represent the correctness of each box. Probably bogus. 
*** The standard deviation shows the distance between the success of each producer->consumer box.


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (MAHOUT-941) Improve ConfusionMatrix statistics

Posted by "Lance Norskog (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lance Norskog updated MAHOUT-941:
---------------------------------

    Description: 
This patch adds more statistics to the ConfusionMatrix and RequestAnalyzer.
# Add Kappa measure - a standard measure comparing a sample vs. random assignment.
# Add mean & standard deviation of "Reliability" (User Accuracy) - assists in identifying consistent misassignment against "good" and "bad" labels.



  was:
This patch adds more statistics to the ConfusionMatrix and RequestAnalyzer.
# Add Kappa measure - a standard measure comparing a sample vs. random assignment.
# Add mean & standard deviation of individual labels - assists in identifying consistent misassignment vs. high and low quality labels.

Also, the SGD solver saves its model periodically to /tmp/news-groups-number. This patch moves those captures to the model/ output directory. (These intermediate models are interesting for tracking SGD incremental development.)


    
> Improve ConfusionMatrix statistics
> ----------------------------------
>
>                 Key: MAHOUT-941
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-941
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Classification
>            Reporter: Lance Norskog
>            Assignee: Robin Anil
>            Priority: Minor
>             Fix For: 0.8
>
>         Attachments: Bayes.zip, MAHOUT-941.patch, MAHOUT-941.patch, MAHOUT-941.patch, SGD.zip


        

[jira] [Commented] (MAHOUT-941) Strip quoted text from emails and add statistics to ConfusionMatrix

Posted by "Lance Norskog (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13180227#comment-13180227 ] 

Lance Norskog commented on MAHOUT-941:
--------------------------------------

Suggestion: leave out 'Success' if you commit. It is not a finished product. I was unable to cleanly remove it from the patch.

Removing the quoted text was a serious win: SGD worked much better without quoted text and, oddly, without subjects. See the attached files Bayes.zip and SGD.zip for test runs. I worked against a sample of the Apache email archives; it's on the net somewhere but I can't find the link just now.
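
The quoted-text stripping discussed in this issue can be sketched roughly as below. The thread does not show the rule the patch actually uses, so this assumes the common convention that quoted reply lines start with ">". The function name strip_quoted_lines and the keep_quoted parameter are hypothetical; keep_quoted only mirrors the "--quoted" option from the issue description.

```python
def strip_quoted_lines(body, keep_quoted=False):
    """Drop lines of an email body that look like quoted reply text.

    Assumption for this sketch: a quoted line is any line whose first
    non-blank character is '>'. The real patch may use a different rule.
    """
    if keep_quoted:
        # Analogous to passing "--quoted": leave the body untouched.
        return body
    kept = [
        line
        for line in body.splitlines()
        if not line.lstrip().startswith(">")
    ]
    return "\n".join(kept)


message = "Thanks, that fixed it.\n> Did you try the patch?\n>> Which patch?\nSee you."
print(strip_quoted_lines(message))
```

Only the two unquoted lines survive, so terms from earlier messages in a long thread are not counted again for every reply.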

                
> Strip quoted text from emails and add statistics to ConfusionMatrix
> -------------------------------------------------------------------
>
>                 Key: MAHOUT-941
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-941
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Classification
>            Reporter: Lance Norskog
>            Priority: Minor
>         Attachments: MAHOUT-941.patch


        

[jira] [Updated] (MAHOUT-941) Strip quoted text from emails and add statistics to ConfusionMatrix

Posted by "Lance Norskog (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lance Norskog updated MAHOUT-941:
---------------------------------

    Attachment: MAHOUT-941.patch
    
> Strip quoted text from emails and add statistics to ConfusionMatrix
> -------------------------------------------------------------------
>
>                 Key: MAHOUT-941
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-941
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Classification
>            Reporter: Lance Norskog
>            Priority: Minor
>         Attachments: MAHOUT-941.patch


        

[jira] [Updated] (MAHOUT-941) Improve ConfusionMatrix statistics

Posted by "Lance Norskog (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lance Norskog updated MAHOUT-941:
---------------------------------

    Attachment: MAHOUT-941.patch

Remove email processing additions, moved to MAHOUT-939.

Enhance statistics to assist tuning classifiers. Add CSV output for graphing incremental SGD models.
                
> Improve ConfusionMatrix statistics
> ----------------------------------
>
>                 Key: MAHOUT-941
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-941
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Classification
>            Reporter: Lance Norskog
>            Assignee: Grant Ingersoll
>            Priority: Minor
>             Fix For: 0.7
>
>         Attachments: Bayes.zip, MAHOUT-941.patch, MAHOUT-941.patch, SGD.zip


        

[jira] [Commented] (MAHOUT-941) Improve ConfusionMatrix statistics

Posted by "Robin Anil (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13290148#comment-13290148 ] 

Robin Anil commented on MAHOUT-941:
-----------------------------------

Lance, can you send the patch in?
                
> Improve ConfusionMatrix statistics
> ----------------------------------
>
>                 Key: MAHOUT-941
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-941
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Classification
>            Reporter: Lance Norskog
>            Assignee: Robin Anil
>            Priority: Minor
>             Fix For: 0.8
>
>         Attachments: Bayes.zip, MAHOUT-941.patch, MAHOUT-941.patch, SGD.zip


        

[jira] [Assigned] (MAHOUT-941) Improve ConfusionMatrix statistics

Posted by "Robin Anil (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robin Anil reassigned MAHOUT-941:
---------------------------------

    Assignee: Robin Anil  (was: Grant Ingersoll)
    
> Improve ConfusionMatrix statistics
> ----------------------------------
>
>                 Key: MAHOUT-941
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-941
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Classification
>            Reporter: Lance Norskog
>            Assignee: Robin Anil
>            Priority: Minor
>             Fix For: 0.8
>
>         Attachments: Bayes.zip, MAHOUT-941.patch, MAHOUT-941.patch, SGD.zip


        

[jira] [Updated] (MAHOUT-941) Improve ConfusionMatrix statistics

Posted by "Lance Norskog (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lance Norskog updated MAHOUT-941:
---------------------------------

    Summary: Improve ConfusionMatrix statistics  (was: Strip quoted text from emails and add statistics to ConfusionMatrix)


Rename this to focus on Confusion Matrix stats.
Stripper for quoted lines is added to [MAHOUT-939], removed from this patch. 
                
> Improve ConfusionMatrix statistics
> ----------------------------------
>
>                 Key: MAHOUT-941
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-941
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Classification
>            Reporter: Lance Norskog
>            Assignee: Grant Ingersoll
>            Priority: Minor
>             Fix For: 0.6
>
>         Attachments: Bayes.zip, MAHOUT-941.patch, SGD.zip


        

[jira] [Updated] (MAHOUT-941) Strip quoted text from emails and add statistics to ConfusionMatrix

Posted by "Grant Ingersoll (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll updated MAHOUT-941:
-----------------------------------

    Fix Version/s: 0.6
    
> Strip quoted text from emails and add statistics to ConfusionMatrix
> -------------------------------------------------------------------
>
>                 Key: MAHOUT-941
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-941
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Classification
>            Reporter: Lance Norskog
>            Priority: Minor
>             Fix For: 0.6
>
>         Attachments: Bayes.zip, MAHOUT-941.patch, SGD.zip


        

[jira] [Updated] (MAHOUT-941) Strip quoted text from emails and add statistics to ConfusionMatrix

Posted by "Lance Norskog (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lance Norskog updated MAHOUT-941:
---------------------------------

    Attachment: Bayes.zip
                SGD.zip

These files contain the final output of 8 runs, one for each combination of:
Bayes vs. SGD
quoted text in bodies vs. stripped
subject line vs. no subject line

                
> Strip quoted text from emails and add statistics to ConfusionMatrix
> -------------------------------------------------------------------
>
>                 Key: MAHOUT-941
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-941
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Classification
>            Reporter: Lance Norskog
>            Priority: Minor
>         Attachments: Bayes.zip, MAHOUT-941.patch, SGD.zip


        

[jira] [Updated] (MAHOUT-941) Improve ConfusionMatrix statistics

Posted by "Lance Norskog (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lance Norskog updated MAHOUT-941:
---------------------------------

    Description: 
This patch adds more statistics to the ConfusionMatrix and RequestAnalyzer.
# Add Kappa measure - a standard measure comparing a sample vs. random assignment.
# Add mean & standard deviation of individual labels - assists in identifying consistent misassignment vs. high and low quality labels.

Also, the SGD solver saves its model periodically to /tmp/news-groups-number. This patch moves those captures to the model/ output directory. (These intermediate models are interesting for tracking SGD incremental development.)


  was:
This patch does 2 things:
# Add a feature to org.apache.mahout.text.SequenceFilesFromMailArchives that removes quoted text from email bodies. This is important because it avoids spamming the term dictionary with repeated text, especially in long email threads.
** The feature defaults to true. Add "--quoted" to the command line to keep the quoted lines.

# Adds some dubious overall measurements to the ConfusionMatrix. 
** Kappa - a standard measurement. 
*** How different is this confusion matrix from chance agreement? 0.0 means no better than random assignment, 1.0 means perfect agreement.
*** I think this is an "unweighted" kappa. 
** "Success" - a homegrown formula attempting to represent the correctness of each box. Probably bogus. 
*** The standard deviation shows the distance between the success of each producer->consumer box.


    
> Improve ConfusionMatrix statistics
> ----------------------------------
>
>                 Key: MAHOUT-941
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-941
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Classification
>            Reporter: Lance Norskog
>            Assignee: Grant Ingersoll
>            Priority: Minor
>             Fix For: 0.7
>
>         Attachments: Bayes.zip, MAHOUT-941.patch, SGD.zip


        

[jira] [Comment Edited] (MAHOUT-941) Improve ConfusionMatrix statistics

Posted by "Lance Norskog (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13290816#comment-13290816 ] 

Lance Norskog edited comment on MAHOUT-941 at 6/7/12 8:16 PM:
--------------------------------------------------------------

Prints Accuracy and Reliability stats, plus standard deviation of reliability.

Accuracy = "Producer Accuracy", includes unclassified results.
Reliability = "User Accuracy", does not include unclassified results.
                
      was (Author: lancenorskog):
    Prints and Accuracy and Reliability stats, plus standard deviation of reliability.

Accuracy = "Producer Accuracy", includes unclassified results.
Reliability = "User Accuracy", does not include unclassified results.
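
The distinction between the two terms in this comment can be sketched with the usual producer's/user's accuracy definitions: producer's accuracy divides each diagonal cell by its row total (all instances that truly carry the label), while user's accuracy divides it by its column total (all instances assigned the label). This is an illustration under assumptions, not the patch's code. It assumes rows are actual labels and columns are assigned labels, it does not model "unclassified" results, and the choice of a population standard deviation is also an assumption. The function name per_label_stats is hypothetical.

```python
import statistics


def per_label_stats(matrix):
    """Return (producer_accuracy, user_accuracy) lists, one entry per label.

    Assumption for this sketch: matrix[i][j] counts instances whose
    actual label is i and whose assigned label is j.
    """
    n = len(matrix)
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(matrix[i][j] for i in range(n)) for j in range(n)]
    # Producer's accuracy: correct / all instances that truly have the label.
    producer = [
        matrix[i][i] / row_totals[i] if row_totals[i] else 0.0 for i in range(n)
    ]
    # User's accuracy: correct / all instances assigned the label.
    user = [
        matrix[i][i] / col_totals[i] if col_totals[i] else 0.0 for i in range(n)
    ]
    return producer, user


# Small hypothetical two-label matrix.
producer, user = per_label_stats([[8, 2], [1, 9]])
mean_reliability = statistics.mean(user)
std_reliability = statistics.pstdev(user)  # population std dev (assumption)
print(producer, user, mean_reliability, std_reliability)
```

A large standard deviation across the per-label values suggests that a few labels are assigned much less reliably than the rest, which is the kind of consistent misassignment the issue description is trying to surface.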
                  
> Improve ConfusionMatrix statistics
> ----------------------------------
>
>                 Key: MAHOUT-941
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-941
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Classification
>            Reporter: Lance Norskog
>            Assignee: Robin Anil
>            Priority: Minor
>             Fix For: 0.8
>
>         Attachments: Bayes.zip, MAHOUT-941.patch, MAHOUT-941.patch, MAHOUT-941.patch, SGD.zip


        

[jira] [Updated] (MAHOUT-941) Improve ConfusionMatrix statistics

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll updated MAHOUT-941:
-----------------------------------

    Fix Version/s:     (was: 0.7)
                   0.8
    
> Improve ConfusionMatrix statistics
> ----------------------------------
>
>                 Key: MAHOUT-941
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-941
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Classification
>            Reporter: Lance Norskog
>            Assignee: Grant Ingersoll
>            Priority: Minor
>             Fix For: 0.8
>
>         Attachments: Bayes.zip, MAHOUT-941.patch, MAHOUT-941.patch, SGD.zip


        

[jira] [Commented] (MAHOUT-941) Improve ConfusionMatrix statistics

Posted by "Lance Norskog (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13291658#comment-13291658 ] 

Lance Norskog commented on MAHOUT-941:
--------------------------------------

This is the output of classify-20newsgroups.sh. "Accuracy" is 90.4%. "Reliability" is 85.8%. The standard deviation of "reliability" is 0.22. "Kappa" is 0.88; it compares "accuracy" against "random classification". I do not know whether Kappa includes "unclassified" in its formula, or assumes all instances are classified to known labels. Or perhaps it should be calculated both ways?

{quote}

Summary
-------------------------------------------------------
Correctly Classified Instances          :       6788	   90.4102%
Incorrectly Classified Instances        :        720	    9.5898%
Total Classified Instances              :       7508

=======================================================
Confusion Matrix
-------------------------------------------------------
a    	b    	c    	d    	e    	f    	g    	h    	i    	j    	k    	l    	m    	n    	o    	p    	q    	r    	s    	t    	<--Classified as
296  	0    	0    	0    	0    	0    	0    	0    	0    	0    	0    	0    	0    	0    	1    	8    	0    	2    	7    	3    	 |  317   	a     = alt.atheism
1    	327  	4    	20   	6    	14   	2    	1    	0    	0    	0    	1    	5    	3    	1    	0    	1    	0    	0    	0    	 |  386   	b     = comp.graphics
0    	27   	217  	76   	21   	17   	5    	0    	0    	0    	0    	4    	8    	1    	1    	0    	0    	0    	1    	3    	 |  381   	c     = comp.os.ms-windows.misc
0    	10   	1    	315  	23   	3    	9    	2    	0    	0    	0    	0    	8    	0    	0    	0    	0    	0    	0    	0    	 |  371   	d     = comp.sys.ibm.pc.hardware
0    	5    	1    	9    	348  	0    	5    	1    	0    	0    	0    	0    	4    	0    	0    	0    	0    	0    	1    	1    	 |  375   	e     = comp.sys.mac.hardware
0    	23   	2    	7    	1    	328  	1    	0    	0    	0    	0    	1    	0    	1    	1    	0    	0    	0    	0    	0    	 |  365   	f     = comp.windows.x
0    	5    	0    	19   	11   	0    	337  	8    	2    	1    	4    	4    	5    	0    	3    	0    	0    	0    	0    	1    	 |  400   	g     = misc.forsale
0    	0    	0    	3    	3    	1    	8    	402  	2    	1    	0    	0    	3    	1    	0    	0    	0    	0    	0    	3    	 |  427   	h     = rec.autos
0    	0    	0    	0    	0    	1    	7    	5    	368  	0    	0    	0    	0    	1    	0    	0    	0    	1    	0    	1    	 |  384   	i     = rec.motorcycles
1    	0    	0    	0    	0    	0    	1    	1    	0    	379  	7    	0    	0    	1    	0    	0    	0    	0    	0    	0    	 |  390   	j     = rec.sport.baseball
0    	0    	0    	1    	2    	0    	0    	1    	0    	4    	387  	0    	0    	0    	0    	1    	0    	0    	0    	2    	 |  398   	k     = rec.sport.hockey
0    	3    	0    	1    	3    	2    	0    	0    	0    	0    	0    	393  	2    	0    	0    	0    	1    	3    	1    	2    	 |  411   	l     = sci.crypt
0    	5    	0    	12   	10   	0    	5    	1    	1    	0    	0    	1    	328  	0    	2    	0    	0    	2    	1    	0    	 |  368   	m     = sci.electronics
1    	5    	1    	3    	1    	1    	1    	0    	0    	0    	0    	0    	2    	377  	4    	0    	0    	0    	1    	4    	 |  401   	n     = sci.med
0    	5    	0    	0    	1    	1    	1    	0    	0    	1    	0    	2    	0    	1    	389  	0    	0    	0    	2    	2    	 |  405   	o     = sci.space
4    	2    	0    	1    	2    	0    	0    	1    	0    	1    	1    	0    	0    	1    	0    	397  	2    	2    	5    	1    	 |  420   	p     = soc.religion.christian
1    	1    	0    	0    	0    	0    	1    	0    	0    	0    	0    	0    	0    	0    	0    	4    	359  	0    	0    	1    	 |  367   	q     = talk.politics.mideast
0    	0    	0    	0    	0    	0    	0    	0    	1    	1    	0    	0    	1    	0    	0    	0    	0    	360  	0    	8    	 |  371   	r     = talk.politics.guns
26   	1    	0    	1    	0    	0    	1    	1    	1    	1    	0    	0    	1    	0    	2    	18   	1    	4    	197  	7    	 |  262   	s     = talk.religion.misc
0    	0    	0    	0    	1    	0    	0    	1    	0    	2    	0    	2    	0    	0    	3    	0    	3    	10   	3    	284  	 |  309   	t     = talk.politics.misc

=======================================================
Statistics
-------------------------------------------------------
Kappa                                       0.8759
Accuracy                                   90.4102%
Reliability                                85.8359%
Reliability (standard deviation)            0.2183
{quote}

                
> Improve ConfusionMatrix statistics
> ----------------------------------
>
>                 Key: MAHOUT-941
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-941
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Classification
>            Reporter: Lance Norskog
>            Assignee: Robin Anil
>            Priority: Minor
>             Fix For: 0.8
>
>         Attachments: Bayes.zip, MAHOUT-941.patch, MAHOUT-941.patch, MAHOUT-941.patch, SGD.zip
>
>
> This patch adds more statistics to the ConfusionMatrix and RequestAnalyzer.
> # Add Kappa measure - a standard measure comparing a sample vs. random assignment.
> # Add mean & standard deviation of "Reliability" (User Accuracy) - assists in identifying consistent misassignment against "good" and "bad" labels.


        

[jira] [Commented] (MAHOUT-941) Strip quoted text from emails and add statistics to ConfusionMatrix

Posted by "Grant Ingersoll (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13181960#comment-13181960 ] 

Grant Ingersoll commented on MAHOUT-941:
----------------------------------------

Or just rename this issue so that it covers only the stats piece.
                
> Strip quoted text from emails and add statistics to ConfusionMatrix
> -------------------------------------------------------------------
>
>                 Key: MAHOUT-941
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-941
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Classification
>            Reporter: Lance Norskog
>            Assignee: Grant Ingersoll
>            Priority: Minor
>             Fix For: 0.6
>
>         Attachments: Bayes.zip, MAHOUT-941.patch, SGD.zip
>
>
> This patch does 2 things:
> # Add a feature to org.apache.mahout.text.SequenceFilesFromMailArchives that removes quoted text from email bodies. This is important because it avoids spamming the term dictionary with repeated text, especially in long email threads.
> ** The feature defaults to true. Add "--quoted" to the command line to keep the quoted lines.
> # Add some dubious overall measurements to the ConfusionMatrix. 
> ** Kappa - a standard measurement. 
> *** How different is this confusion matrix from random numbers? 0.0 is the same, 1.0 is completely different.
> *** I think this is an "unweighted" kappa. 
> ** "Success" - a homegrown formula attempting to represent the correctness of each box. Probably bogus. 
> *** The standard deviation shows the distance between the success of each producer->consumer box.
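
As a rough illustration of the first item (this is not the code in the patch), quoted-text stripping can be sketched as dropping every body line whose first non-blank character is '>'. The class and method names below are hypothetical, and a real implementation would probably also need to handle attribution lines such as "On ... wrote:".

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; not part of SequenceFilesFromMailArchives.
final class QuoteStripper {
  private QuoteStripper() {}

  // Returns the body with quoted lines (first non-blank character is '>') removed.
  static String stripQuoted(String body) {
    List<String> kept = new ArrayList<>();
    for (String line : body.split("\n", -1)) {
      if (!line.trim().startsWith(">")) {
        kept.add(line);
      }
    }
    return String.join("\n", kept);
  }
}
```

For example, stripQuoted("Hi\n> old text\n>> older\nBye") keeps only the "Hi" and "Bye" lines.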


        

[jira] [Commented] (MAHOUT-941) Improve ConfusionMatrix statistics

Posted by "Lance Norskog (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13289928#comment-13289928 ] 

Lance Norskog commented on MAHOUT-941:
--------------------------------------

1) Grrrr... correct is supposed to be a running sum (an accumulator).
{code}
 correct = confusionMatrix[labelId][labelId];
{code}
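
A minimal sketch of the intended accumulation, assuming the line above sits inside a loop over label indices. The surrounding loop is not shown in this comment, so the class and method here are hypothetical. Assigning with = keeps only the last label's diagonal count, whereas += sums the whole diagonal:

```java
// Hypothetical sketch; not the actual ConfusionMatrix code.
final class AccuracySketch {
  private AccuracySketch() {}

  // Overall accuracy: sum of the diagonal divided by the sum of all cells.
  static double overallAccuracy(int[][] confusionMatrix) {
    long correct = 0;
    long total = 0;
    for (int labelId = 0; labelId < confusionMatrix.length; labelId++) {
      correct += confusionMatrix[labelId][labelId]; // accumulate, do not assign
      for (int count : confusionMatrix[labelId]) {
        total += count;
      }
    }
    return total == 0 ? 0.0 : (double) correct / total;
  }
}
```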

2) This is printed out wrong. The "accuracy" up above is "producer's accuracy". This code calculates that and "user's accuracy", or "reliability". These are different. The printout should show both accuracies. Possibly also the mean of the two. 

Imagine classification as the code throwing balls of different sizes to robot arms, each programmed to grab one size. If no arm grabs the ball, that's 'unclassified'. Producer's accuracy is from the thrower's point of view; user's accuracy is from the robot arms' point of view. The counts differ because 'unclassified' is part of the producer's 'wrong' count, while the user's counts ignore it.

[http://spatial-analyst.net/ILWIS/htm/ilwismen/confusion_matrix.htm]
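
The distinction can be sketched as follows. This is a hedged illustration, not the patch's code: it assumes rows are actual labels, columns are assigned labels, and one extra last column counts 'unclassified' items. All names are hypothetical.

```java
// Hypothetical sketch; rows are actual labels, columns are assigned labels,
// and the extra last column counts items that no label claimed.
final class ProducerUserAccuracy {
  private ProducerUserAccuracy() {}

  // Producer's accuracy: correct / all items that truly belong to the label
  // (the row total, which includes the 'unclassified' column).
  static double producerAccuracy(int[][] m, int label) {
    long rowTotal = 0;
    for (int count : m[label]) {
      rowTotal += count;
    }
    return rowTotal == 0 ? 0.0 : (double) m[label][label] / rowTotal;
  }

  // User's accuracy (reliability): correct / all items assigned to the label
  // (the column total, which never sees 'unclassified' items).
  static double userAccuracy(int[][] m, int label) {
    long columnTotal = 0;
    for (int[] row : m) {
      columnTotal += row[label];
    }
    return columnTotal == 0 ? 0.0 : (double) m[label][label] / columnTotal;
  }
}
```

For the two-label matrix {{8, 1, 1}, {3, 6, 1}}, label 0 has producer's accuracy 8/10 and user's accuracy 8/11.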


                
> Improve ConfusionMatrix statistics
> ----------------------------------
>
>                 Key: MAHOUT-941
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-941
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Classification
>            Reporter: Lance Norskog
>            Assignee: Robin Anil
>            Priority: Minor
>             Fix For: 0.8
>
>         Attachments: Bayes.zip, MAHOUT-941.patch, MAHOUT-941.patch, SGD.zip
>
>
> This patch adds more statistics to the ConfusionMatrix and RequestAnalyzer.
> # Add Kappa measure - a standard measure comparing a sample v.s. random assignment.
> # Add mean & standard deviation of individual labels - assist in identifying consistent mal-assignment v.s. high and low quality labels.
> Also, the SGD solver saves its model periodically to /tmp/news-groups-number. This patch moves those captures to the model/ output directory. (These intermediate models are interesting for tracking SGD incremental development.)


        

[jira] [Commented] (MAHOUT-941) Strip quoted text from emails and add statistics to ConfusionMatrix

Posted by "Grant Ingersoll (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13181916#comment-13181916 ] 

Grant Ingersoll commented on MAHOUT-941:
----------------------------------------

Lance, can you separate out the stats piece into a different issue? I'll fold the quoted stuff in with MAHOUT-939, and then we can deal with the stats elsewhere.
                

        

[jira] [Assigned] (MAHOUT-941) Strip quoted text from emails and add statistics to ConfusionMatrix

Posted by "Grant Ingersoll (Assigned) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll reassigned MAHOUT-941:
--------------------------------------

    Assignee: Grant Ingersoll
    

        

[jira] [Commented] (MAHOUT-941) Improve ConfusionMatrix statistics

Posted by "Robin Anil (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13289439#comment-13289439 ] 

Robin Anil commented on MAHOUT-941:
-----------------------------------

 Complementary Results: 
=======================================================
Summary
-------------------------------------------------------
Correctly Classified Instances          :      68210	   97.9058%
Incorrectly Classified Instances        :       1459	    2.0942%
Total Classified Instances              :      69669

=======================================================
Confusion Matrix
-------------------------------------------------------
a    	b    	<--Classified as
27615	756  	 |  28371 	a     = commons.apache.org
703  	40595	 |  41298 	b     = cocoon.apache.org

=======================================================
Statistics
-------------------------------------------------------
Kappa                                   :    -1.1483
Accuracy                                :    0.6522
Consistency (stdev of accuracy)         :    0.5052


I am seeing this. Why is accuracy 0.65 when it's actually 0.979 (97.9058% in the summary above)? Can you fix this issue?
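
For reference, a sketch of unweighted Cohen's kappa, kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the agreement expected by chance. The class name is hypothetical and this is not the patch's code. Kappa lies between -1 and 1, so -1.1483 cannot be right; for the 2x2 matrix above this formula gives roughly 0.957.

```java
// Hypothetical sketch of unweighted Cohen's kappa; not the patch's code.
// Assumes a square, non-empty matrix whose chance agreement is below 1.
final class KappaSketch {
  private KappaSketch() {}

  static double kappa(int[][] m) {
    int n = m.length;
    double total = 0;
    double diagonal = 0;
    double[] rowSums = new double[n];
    double[] colSums = new double[n];
    for (int i = 0; i < n; i++) {
      for (int j = 0; j < n; j++) {
        total += m[i][j];
        rowSums[i] += m[i][j];
        colSums[j] += m[i][j];
      }
      diagonal += m[i][i];
    }
    double observed = diagonal / total; // p_o: observed agreement
    double expected = 0;                // p_e: agreement expected by chance
    for (int i = 0; i < n; i++) {
      expected += (rowSums[i] / total) * (colSums[i] / total);
    }
    return (observed - expected) / (1 - expected);
  }
}
```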
                

        

[jira] [Updated] (MAHOUT-941) Improve ConfusionMatrix statistics

Posted by "Lance Norskog (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lance Norskog updated MAHOUT-941:
---------------------------------

    Attachment: MAHOUT-941.patch

Prints Accuracy and Reliability stats, plus the standard deviation of reliability.

Accuracy = "Producer Accuracy", includes unclassified results.
Reliability = "User Accuracy", does not include unclassified results.
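
One way to compute the summary figures, as a sketch: take the per-label reliability values and report their mean and standard deviation. Whether the patch uses the population or the sample standard deviation is not stated here; the population form is assumed below, and the class name is hypothetical.

```java
// Hypothetical sketch; assumes the population standard deviation
// and a non-empty array of per-label reliability values.
final class ReliabilityStats {
  private ReliabilityStats() {}

  static double mean(double[] values) {
    double sum = 0;
    for (double v : values) {
      sum += v;
    }
    return sum / values.length;
  }

  static double populationStdDev(double[] values) {
    double mean = mean(values);
    double squaredError = 0;
    for (double v : values) {
      squaredError += (v - mean) * (v - mean);
    }
    return Math.sqrt(squaredError / values.length);
  }
}
```

A large standard deviation relative to the mean suggests that a few labels are much less reliable than the rest, which is the pattern the description calls consistent mal-assignment.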
                

        

[jira] [Updated] (MAHOUT-941) Improve ConfusionMatrix statistics

Posted by "Grant Ingersoll (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll updated MAHOUT-941:
-----------------------------------

    Fix Version/s:     (was: 0.6)
                   0.7
    