Posted to dev@mahout.apache.org by "Lance Norskog (Created) (JIRA)" <ji...@apache.org> on 2012/01/05 07:48:39 UTC
[jira] [Created] (MAHOUT-941) Strip quoted text from emails and add
statistics to ConfusionMatrix
Strip quoted text from emails and add statistics to ConfusionMatrix
-------------------------------------------------------------------
Key: MAHOUT-941
URL: https://issues.apache.org/jira/browse/MAHOUT-941
Project: Mahout
Issue Type: Improvement
Components: Classification
Reporter: Lance Norskog
Priority: Minor
This patch does 2 things:
# Adds a feature to org.apache.mahout.text.SequenceFilesFromMailArchives that removes quoted text from email bodies. This is important because it avoids spamming the term dictionary with repeated text, especially in long email threads.
** The feature defaults to true. Add "--quoted" to the command line to keep the quoted lines.
# Adds some dubious overall measurements to the ConfusionMatrix.
** Kappa - a standard measurement.
*** How different is this confusion matrix from random numbers? 0.0 is the same, 1.0 is completely different.
*** I think this is an "unweighted" kappa.
** "Success" - a homegrown formula attempting to represent the correctness of each box. Probably bogus.
*** The standard deviation shows the distance between the success of each producer->consumer box.
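The quoted-text stripping in the first item can be sketched roughly as follows. This is not the code in the patch: it assumes a quoted line is one whose first non-blank character is '>', the usual mail-client convention, and the names QuotedLineStripper, isQuoted and strip are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of Mahout: drops quoted lines from an email body. */
final class QuotedLineStripper {

  private QuotedLineStripper() {}

  /** Assumption: a quoted line is one whose first non-blank character is '>'. */
  static boolean isQuoted(String line) {
    return line.trim().startsWith(">");
  }

  /** Returns the body with quoted lines removed; all other lines keep their order. */
  static String strip(String body) {
    List<String> kept = new ArrayList<>();
    for (String line : body.split("\n", -1)) {
      if (!isQuoted(line)) {
        kept.add(line);
      }
    }
    return String.join("\n", kept);
  }
}
```

With this rule a reply such as "I agree.\n> old text\nThanks" keeps only the first and last lines, so text repeated down a long thread is counted once rather than once per reply.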
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (MAHOUT-941) Improve ConfusionMatrix statistics
Posted by "Lance Norskog (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/MAHOUT-941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Lance Norskog updated MAHOUT-941:
---------------------------------
Description:
This patch adds more statistics to the ConfusionMatrix and RequestAnalyzer.
# Add Kappa measure - a standard measure comparing a sample vs. random assignment.
# Add mean & standard deviation of "Reliability" (User Accuracy) - assists in identifying consistent mal-assignment against "good" and "bad" labels.
was:
This patch adds more statistics to the ConfusionMatrix and RequestAnalyzer.
# Add Kappa measure - a standard measure comparing a sample vs. random assignment.
# Add mean & standard deviation of individual labels - assists in identifying consistent mal-assignment vs. high and low quality labels.
Also, the SGD solver saves its model periodically to /tmp/news-groups-number. This patch moves those captures to the model/ output directory. (These intermediate models are interesting for tracking SGD incremental development.)
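For reference, the textbook unweighted (Cohen's) kappa can be sketched as below. This is an illustration under assumptions, not the patch's implementation: the class name KappaSketch is hypothetical, the matrix is assumed to be square with rows as actual labels and columns as predicted labels, and no "unclassified" bucket is handled.

```java
/** Hypothetical sketch of unweighted Cohen's kappa; not the patch's implementation. */
final class KappaSketch {

  private KappaSketch() {}

  /** matrix[actual][predicted] holds counts; the matrix must be square. */
  static double kappa(int[][] matrix) {
    int n = matrix.length;
    double total = 0.0;
    double agree = 0.0; // sum of the diagonal
    double[] rowSum = new double[n];
    double[] colSum = new double[n];
    for (int i = 0; i < n; i++) {
      for (int j = 0; j < n; j++) {
        total += matrix[i][j];
        rowSum[i] += matrix[i][j];
        colSum[j] += matrix[i][j];
      }
      agree += matrix[i][i];
    }
    double observed = agree / total;
    double expected = 0.0; // agreement expected from the row and column totals alone
    for (int i = 0; i < n; i++) {
      expected += (rowSum[i] / total) * (colSum[i] / total);
    }
    // Note: the result is NaN when expected == 1.0 (everything in a single cell).
    return (observed - expected) / (1.0 - expected);
  }
}
```

For the matrix {{45, 5}, {10, 40}} the observed agreement is 0.85, the expected agreement is 0.50, and kappa comes out at about 0.70. A matrix that matches chance gives 0.0 and a purely diagonal matrix gives 1.0.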
> Improve ConfusionMatrix statistics
> ----------------------------------
>
> Key: MAHOUT-941
> URL: https://issues.apache.org/jira/browse/MAHOUT-941
> Project: Mahout
> Issue Type: Improvement
> Components: Classification
> Reporter: Lance Norskog
> Assignee: Robin Anil
> Priority: Minor
> Fix For: 0.8
>
> Attachments: Bayes.zip, MAHOUT-941.patch, MAHOUT-941.patch, MAHOUT-941.patch, SGD.zip
[jira] [Commented] (MAHOUT-941) Strip quoted text from emails and
add statistics to ConfusionMatrix
Posted by "Lance Norskog (Commented) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/MAHOUT-941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13180227#comment-13180227 ]
Lance Norskog commented on MAHOUT-941:
--------------------------------------
Suggestion: leave out 'Success' if you commit. It is not a finished product. I was unable to cleanly remove it from the patch.
Removing the quoted text was a serious win: SGD worked much better without quoted text and subjects, oddly. See the attached zip files Bayes.zip and SGD.zip for test runs. I worked against a sample of the Apache email archives; it's on the net somewhere but I can't find the link just now.
> Strip quoted text from emails and add statistics to ConfusionMatrix
> -------------------------------------------------------------------
>
> Key: MAHOUT-941
> URL: https://issues.apache.org/jira/browse/MAHOUT-941
> Project: Mahout
> Issue Type: Improvement
> Components: Classification
> Reporter: Lance Norskog
> Priority: Minor
> Attachments: MAHOUT-941.patch
[jira] [Updated] (MAHOUT-941) Strip quoted text from emails and add
statistics to ConfusionMatrix
Posted by "Lance Norskog (Updated) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/MAHOUT-941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Lance Norskog updated MAHOUT-941:
---------------------------------
Attachment: MAHOUT-941.patch
> Strip quoted text from emails and add statistics to ConfusionMatrix
> -------------------------------------------------------------------
>
> Key: MAHOUT-941
> URL: https://issues.apache.org/jira/browse/MAHOUT-941
> Project: Mahout
> Issue Type: Improvement
> Components: Classification
> Reporter: Lance Norskog
> Priority: Minor
> Attachments: MAHOUT-941.patch
[jira] [Updated] (MAHOUT-941) Improve ConfusionMatrix statistics
Posted by "Lance Norskog (Updated) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/MAHOUT-941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Lance Norskog updated MAHOUT-941:
---------------------------------
Attachment: MAHOUT-941.patch
Remove email processing additions, moved to MAHOUT-939.
Enhance statistics to assist tuning classifiers. Add CSV output for graphing incremental SGD models.
> Improve ConfusionMatrix statistics
> ----------------------------------
>
> Key: MAHOUT-941
> URL: https://issues.apache.org/jira/browse/MAHOUT-941
> Project: Mahout
> Issue Type: Improvement
> Components: Classification
> Reporter: Lance Norskog
> Assignee: Grant Ingersoll
> Priority: Minor
> Fix For: 0.7
>
> Attachments: Bayes.zip, MAHOUT-941.patch, MAHOUT-941.patch, SGD.zip
[jira] [Commented] (MAHOUT-941) Improve ConfusionMatrix statistics
Posted by "Robin Anil (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/MAHOUT-941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13290148#comment-13290148 ]
Robin Anil commented on MAHOUT-941:
-----------------------------------
Lance, can you send the patch in?
> Improve ConfusionMatrix statistics
> ----------------------------------
>
> Key: MAHOUT-941
> URL: https://issues.apache.org/jira/browse/MAHOUT-941
> Project: Mahout
> Issue Type: Improvement
> Components: Classification
> Reporter: Lance Norskog
> Assignee: Robin Anil
> Priority: Minor
> Fix For: 0.8
>
> Attachments: Bayes.zip, MAHOUT-941.patch, MAHOUT-941.patch, SGD.zip
[jira] [Assigned] (MAHOUT-941) Improve ConfusionMatrix statistics
Posted by "Robin Anil (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/MAHOUT-941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Robin Anil reassigned MAHOUT-941:
---------------------------------
Assignee: Robin Anil (was: Grant Ingersoll)
> Improve ConfusionMatrix statistics
> ----------------------------------
>
> Key: MAHOUT-941
> URL: https://issues.apache.org/jira/browse/MAHOUT-941
> Project: Mahout
> Issue Type: Improvement
> Components: Classification
> Reporter: Lance Norskog
> Assignee: Robin Anil
> Priority: Minor
> Fix For: 0.8
>
> Attachments: Bayes.zip, MAHOUT-941.patch, MAHOUT-941.patch, SGD.zip
[jira] [Updated] (MAHOUT-941) Improve ConfusionMatrix statistics
Posted by "Lance Norskog (Updated) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/MAHOUT-941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Lance Norskog updated MAHOUT-941:
---------------------------------
Summary: Improve ConfusionMatrix statistics (was: Strip quoted text from emails and add statistics to ConfusionMatrix)
Rename this to focus on Confusion Matrix stats.
Stripper for quoted lines is added to [MAHOUT-939], removed from this patch.
> Improve ConfusionMatrix statistics
> ----------------------------------
>
> Key: MAHOUT-941
> URL: https://issues.apache.org/jira/browse/MAHOUT-941
> Project: Mahout
> Issue Type: Improvement
> Components: Classification
> Reporter: Lance Norskog
> Assignee: Grant Ingersoll
> Priority: Minor
> Fix For: 0.6
>
> Attachments: Bayes.zip, MAHOUT-941.patch, SGD.zip
[jira] [Updated] (MAHOUT-941) Strip quoted text from emails and add
statistics to ConfusionMatrix
Posted by "Grant Ingersoll (Updated) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/MAHOUT-941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Grant Ingersoll updated MAHOUT-941:
-----------------------------------
Fix Version/s: 0.6
> Strip quoted text from emails and add statistics to ConfusionMatrix
> -------------------------------------------------------------------
>
> Key: MAHOUT-941
> URL: https://issues.apache.org/jira/browse/MAHOUT-941
> Project: Mahout
> Issue Type: Improvement
> Components: Classification
> Reporter: Lance Norskog
> Priority: Minor
> Fix For: 0.6
>
> Attachments: Bayes.zip, MAHOUT-941.patch, SGD.zip
[jira] [Updated] (MAHOUT-941) Strip quoted text from emails and add
statistics to ConfusionMatrix
Posted by "Lance Norskog (Updated) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/MAHOUT-941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Lance Norskog updated MAHOUT-941:
---------------------------------
Attachment: Bayes.zip, SGD.zip
These files contain the final output of 8 runs, one for each combination of:
bayes vs. sgd
quoted text in bodies vs. stripped
subject line vs. no subject line
> Strip quoted text from emails and add statistics to ConfusionMatrix
> -------------------------------------------------------------------
>
> Key: MAHOUT-941
> URL: https://issues.apache.org/jira/browse/MAHOUT-941
> Project: Mahout
> Issue Type: Improvement
> Components: Classification
> Reporter: Lance Norskog
> Priority: Minor
> Attachments: Bayes.zip, MAHOUT-941.patch, SGD.zip
[jira] [Updated] (MAHOUT-941) Improve ConfusionMatrix statistics
Posted by "Lance Norskog (Updated) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/MAHOUT-941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Lance Norskog updated MAHOUT-941:
---------------------------------
Description:
This patch adds more statistics to the ConfusionMatrix and RequestAnalyzer.
# Add Kappa measure - a standard measure comparing a sample vs. random assignment.
# Add mean & standard deviation of individual labels - assists in identifying consistent mal-assignment vs. high and low quality labels.
Also, the SGD solver saves its model periodically to /tmp/news-groups-number. This patch moves those captures to the model/ output directory. (These intermediate models are interesting for tracking SGD incremental development.)
was:
This patch does 2 things:
# Adds a feature to org.apache.mahout.text.SequenceFilesFromMailArchives that removes quoted text from email bodies. This is important because it avoids spamming the term dictionary with repeated text, especially in long email threads.
** The feature defaults to true. Add "--quoted" to the command line to keep the quoted lines.
# Adds some dubious overall measurements to the ConfusionMatrix.
** Kappa - a standard measurement.
*** How different is this confusion matrix from random numbers? 0.0 is the same, 1.0 is completely different.
*** I think this is an "unweighted" kappa.
** "Success" - a homegrown formula attempting to represent the correctness of each box. Probably bogus.
*** The standard deviation shows the distance between the success of each producer->consumer box.
> Improve ConfusionMatrix statistics
> ----------------------------------
>
> Key: MAHOUT-941
> URL: https://issues.apache.org/jira/browse/MAHOUT-941
> Project: Mahout
> Issue Type: Improvement
> Components: Classification
> Reporter: Lance Norskog
> Assignee: Grant Ingersoll
> Priority: Minor
> Fix For: 0.7
>
> Attachments: Bayes.zip, MAHOUT-941.patch, SGD.zip
[jira] [Comment Edited] (MAHOUT-941) Improve ConfusionMatrix
statistics
Posted by "Lance Norskog (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/MAHOUT-941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13290816#comment-13290816 ]
Lance Norskog edited comment on MAHOUT-941 at 6/7/12 8:16 PM:
--------------------------------------------------------------
Prints Accuracy and Reliability stats, plus standard deviation of reliability.
Accuracy = "Producer Accuracy", includes unclassified results.
Reliability = "User Accuracy", does not include unclassified results.
was (Author: lancenorskog):
Prints and Accuracy and Reliability stats, plus standard deviation of reliability.
Accuracy = "Producer Accuracy", includes unclassified results.
Reliability = "User Accuracy", does not include unclassified results.
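The mean and standard deviation of "Reliability" could be computed along these lines. This is a sketch under assumptions rather than the patch's code: the class name ReliabilitySketch is hypothetical, the matrix is assumed to be square with matrix[actual][predicted] counts (so unclassified items simply do not appear), a label that was never predicted is given reliability 0, and the deviation is the population form.

```java
/** Hypothetical sketch, not the patch's code: per-label reliability with mean and spread. */
final class ReliabilitySketch {

  private ReliabilitySketch() {}

  /** matrix[actual][predicted]; reliability of label j = correct(j) / all items predicted as j. */
  static double[] perLabelReliability(int[][] matrix) {
    int n = matrix.length;
    double[] reliability = new double[n];
    for (int j = 0; j < n; j++) {
      double predicted = 0.0;
      for (int i = 0; i < n; i++) {
        predicted += matrix[i][j];
      }
      // Assumption: a label that was never predicted gets reliability 0.
      reliability[j] = predicted == 0.0 ? 0.0 : matrix[j][j] / predicted;
    }
    return reliability;
  }

  static double mean(double[] values) {
    double sum = 0.0;
    for (double v : values) {
      sum += v;
    }
    return sum / values.length;
  }

  /** Population standard deviation (divides by n); an assumption, the patch may differ. */
  static double stdDev(double[] values) {
    double m = mean(values);
    double squares = 0.0;
    for (double v : values) {
      squares += (v - m) * (v - m);
    }
    return Math.sqrt(squares / values.length);
  }
}
```

For the matrix {{8, 2}, {0, 10}} the per-label reliabilities are 1.0 and 10/12, giving a mean of about 0.917 and a standard deviation of about 0.083. A large deviation would point at a few labels being much less reliable than the rest.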
> Improve ConfusionMatrix statistics
> ----------------------------------
>
> Key: MAHOUT-941
> URL: https://issues.apache.org/jira/browse/MAHOUT-941
> Project: Mahout
> Issue Type: Improvement
> Components: Classification
> Reporter: Lance Norskog
> Assignee: Robin Anil
> Priority: Minor
> Fix For: 0.8
>
> Attachments: Bayes.zip, MAHOUT-941.patch, MAHOUT-941.patch, MAHOUT-941.patch, SGD.zip
[jira] [Updated] (MAHOUT-941) Improve ConfusionMatrix statistics
Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/MAHOUT-941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Grant Ingersoll updated MAHOUT-941:
-----------------------------------
Fix Version/s: 0.8 (was: 0.7)
> Improve ConfusionMatrix statistics
> ----------------------------------
>
> Key: MAHOUT-941
> URL: https://issues.apache.org/jira/browse/MAHOUT-941
> Project: Mahout
> Issue Type: Improvement
> Components: Classification
> Reporter: Lance Norskog
> Assignee: Grant Ingersoll
> Priority: Minor
> Fix For: 0.8
>
> Attachments: Bayes.zip, MAHOUT-941.patch, MAHOUT-941.patch, SGD.zip
[jira] [Commented] (MAHOUT-941) Improve ConfusionMatrix statistics
Posted by "Lance Norskog (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/MAHOUT-941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13291658#comment-13291658 ]
Lance Norskog commented on MAHOUT-941:
--------------------------------------
This is the output of classify-20newsgroups.sh. "Accuracy" is 90.4%. "Reliability" is 85.8%. The standard deviation of "reliability" is 0.22. "Kappa" is 0.88; it relates "accuracy" to "random classification". I do not know whether Kappa includes "unclassified" in its formula or assumes everything is classified to a known label. Perhaps it should be calculated both ways?
{quote}
Summary
-------------------------------------------------------
Correctly Classified Instances          :       6788       90.4102%
Incorrectly Classified Instances        :        720        9.5898%
Total Classified Instances              :       7508
=======================================================
Confusion Matrix
-------------------------------------------------------
  a   b   c   d   e   f   g   h   i   j   k   l   m   n   o   p   q   r   s   t  <--Classified as
296   0   0   0   0   0   0   0   0   0   0   0   0   0   1   8   0   2   7   3 | 317  a = alt.atheism
  1 327   4  20   6  14   2   1   0   0   0   1   5   3   1   0   1   0   0   0 | 386  b = comp.graphics
  0  27 217  76  21  17   5   0   0   0   0   4   8   1   1   0   0   0   1   3 | 381  c = comp.os.ms-windows.misc
  0  10   1 315  23   3   9   2   0   0   0   0   8   0   0   0   0   0   0   0 | 371  d = comp.sys.ibm.pc.hardware
  0   5   1   9 348   0   5   1   0   0   0   0   4   0   0   0   0   0   1   1 | 375  e = comp.sys.mac.hardware
  0  23   2   7   1 328   1   0   0   0   0   1   0   1   1   0   0   0   0   0 | 365  f = comp.windows.x
  0   5   0  19  11   0 337   8   2   1   4   4   5   0   3   0   0   0   0   1 | 400  g = misc.forsale
  0   0   0   3   3   1   8 402   2   1   0   0   3   1   0   0   0   0   0   3 | 427  h = rec.autos
  0   0   0   0   0   1   7   5 368   0   0   0   0   1   0   0   0   1   0   1 | 384  i = rec.motorcycles
  1   0   0   0   0   0   1   1   0 379   7   0   0   1   0   0   0   0   0   0 | 390  j = rec.sport.baseball
  0   0   0   1   2   0   0   1   0   4 387   0   0   0   0   1   0   0   0   2 | 398  k = rec.sport.hockey
  0   3   0   1   3   2   0   0   0   0   0 393   2   0   0   0   1   3   1   2 | 411  l = sci.crypt
  0   5   0  12  10   0   5   1   1   0   0   1 328   0   2   0   0   2   1   0 | 368  m = sci.electronics
  1   5   1   3   1   1   1   0   0   0   0   0   2 377   4   0   0   0   1   4 | 401  n = sci.med
  0   5   0   0   1   1   1   0   0   1   0   2   0   1 389   0   0   0   2   2 | 405  o = sci.space
  4   2   0   1   2   0   0   1   0   1   1   0   0   1   0 397   2   2   5   1 | 420  p = soc.religion.christian
  1   1   0   0   0   0   1   0   0   0   0   0   0   0   0   4 359   0   0   1 | 367  q = talk.politics.mideast
  0   0   0   0   0   0   0   0   1   1   0   0   1   0   0   0   0 360   0   8 | 371  r = talk.politics.guns
 26   1   0   1   0   0   1   1   1   1   0   0   1   0   2  18   1   4 197   7 | 262  s = talk.religion.misc
  0   0   0   0   1   0   0   1   0   2   0   2   0   0   3   0   3  10   3 284 | 309  t = talk.politics.misc
=======================================================
Statistics
-------------------------------------------------------
Kappa                                       0.8759
Accuracy                                   90.4102%
Reliability                                85.8359%
Reliability (standard deviation)            0.2183
{quote}
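As a quick sanity check on the Summary block, the two percentages follow directly from the counts it prints. The snippet below only redoes that arithmetic; the class name SummaryCheck and the helper percent are hypothetical and have nothing to do with how Mahout formats its output.

```java
import java.util.Locale;

/** Hypothetical check of the Summary percentages; not Mahout code. */
final class SummaryCheck {

  private SummaryCheck() {}

  /** Formats part/total as a percentage with four decimals, independent of the default locale. */
  static String percent(long part, long total) {
    return String.format(Locale.ROOT, "%.4f%%", 100.0 * part / total);
  }

  public static void main(String[] args) {
    long correct = 6788;
    long incorrect = 720;
    long total = correct + incorrect; // 7508, the "Total Classified Instances" line
    System.out.println(percent(correct, total));   // prints 90.4102%
    System.out.println(percent(incorrect, total)); // prints 9.5898%
  }
}
```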
> Improve ConfusionMatrix statistics
> ----------------------------------
>
> Key: MAHOUT-941
> URL: https://issues.apache.org/jira/browse/MAHOUT-941
> Project: Mahout
> Issue Type: Improvement
> Components: Classification
> Reporter: Lance Norskog
> Assignee: Robin Anil
> Priority: Minor
> Fix For: 0.8
>
> Attachments: Bayes.zip, MAHOUT-941.patch, MAHOUT-941.patch, MAHOUT-941.patch, SGD.zip
[jira] [Commented] (MAHOUT-941) Strip quoted text from emails and
add statistics to ConfusionMatrix
Posted by "Grant Ingersoll (Commented) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/MAHOUT-941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13181960#comment-13181960 ]
Grant Ingersoll commented on MAHOUT-941:
----------------------------------------
Or just rename this issue to cover only the stats piece.
> Strip quoted text from emails and add statistics to ConfusionMatrix
> -------------------------------------------------------------------
>
> Key: MAHOUT-941
> URL: https://issues.apache.org/jira/browse/MAHOUT-941
> Project: Mahout
> Issue Type: Improvement
> Components: Classification
> Reporter: Lance Norskog
> Assignee: Grant Ingersoll
> Priority: Minor
> Fix For: 0.6
>
> Attachments: Bayes.zip, MAHOUT-941.patch, SGD.zip
[jira] [Commented] (MAHOUT-941) Improve ConfusionMatrix statistics
Posted by "Lance Norskog (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/MAHOUT-941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13289928#comment-13289928 ]
Lance Norskog commented on MAHOUT-941:
--------------------------------------
1) Grrrr... correct is supposed to be a running sum (an accumulator).
{code}
correct = confusionMatrix[labelId][labelId];
{code}
2) This is printed out wrong. The "accuracy" up above is "producer's accuracy". This code calculates that and "user's accuracy", or "reliability". These are different. The printout should show both accuracies, and possibly also the mean of the two.
Imagine classification as the code throwing balls of different sizes to robot arms, each programmed to grab one size. If none grab the ball, that's 'unclassified'. Producer's accuracy is from the thrower's point of view; user's accuracy is from the robot arms' points of view. They are different counts because 'unclassified' is part of the producer's 'wrong' count, while it is ignored by the user's counts.
[http://spatial-analyst.net/ILWIS/htm/ilwismen/confusion_matrix.htm]
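The two measures can be sketched as follows. This is only an illustration under stated assumptions: rows are the actual label, columns are the assigned label, and one extra trailing column holds 'unclassified' instances. The class name, method names, and sample counts are hypothetical and are not the code in the patch.

```java
// Hypothetical sketch; layout and names are assumptions, not Mahout's API.
final class AccuracySketch {

  // Producer's accuracy: of everything that truly has this label,
  // how much was assigned this label? Unclassified counts as wrong.
  static double producersAccuracy(int[][] counts, int label) {
    long rowTotal = 0;
    for (int cell : counts[label]) {
      rowTotal += cell; // includes the trailing 'unclassified' column
    }
    return rowTotal == 0 ? 0.0 : (double) counts[label][label] / rowTotal;
  }

  // User's accuracy (reliability): of everything assigned this label,
  // how much truly has it? Unclassified never lands in a label's column.
  static double usersAccuracy(int[][] counts, int label) {
    long columnTotal = 0;
    for (int[] row : counts) {
      columnTotal += row[label];
    }
    return columnTotal == 0 ? 0.0 : (double) counts[label][label] / columnTotal;
  }

  public static void main(String[] args) {
    // Illustrative counts: two labels plus an 'unclassified' column.
    int[][] counts = {
      {8, 1, 1}, // actual a: 8 as a, 1 as b, 1 unclassified
      {3, 5, 2}, // actual b: 3 as a, 5 as b, 2 unclassified
    };
    for (int label = 0; label < counts.length; label++) {
      System.out.printf("label %d: producer's=%.3f user's=%.3f%n",
          label, producersAccuracy(counts, label), usersAccuracy(counts, label));
    }
  }
}
```

With the 'unclassified' column present, a label's row total and column total differ, which is why the two accuracies need not agree.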
> Improve ConfusionMatrix statistics
> ----------------------------------
>
> Key: MAHOUT-941
> URL: https://issues.apache.org/jira/browse/MAHOUT-941
> Project: Mahout
> Issue Type: Improvement
> Components: Classification
> Reporter: Lance Norskog
> Assignee: Robin Anil
> Priority: Minor
> Fix For: 0.8
>
> Attachments: Bayes.zip, MAHOUT-941.patch, MAHOUT-941.patch, SGD.zip
>
>
> This patch adds more statistics to the ConfusionMatrix and RequestAnalyzer.
> # Add Kappa measure - a standard measure comparing a sample vs. random assignment.
> # Add mean & standard deviation of individual labels - helps identify consistent misassignment vs. high- and low-quality labels.
> Also, the SGD solver saves its model periodically to /tmp/news-groups-number. This patch moves those captures to the model/ output directory. (These intermediate models are interesting for tracking SGD incremental development.)
[jira] [Commented] (MAHOUT-941) Strip quoted text from emails and
add statistics to ConfusionMatrix
Posted by "Grant Ingersoll (Commented) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/MAHOUT-941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13181916#comment-13181916 ]
Grant Ingersoll commented on MAHOUT-941:
----------------------------------------
Lance, can you separate out the stats piece into a different issue? I'll fold the quoted stuff in with MAHOUT-939, and then we can deal with the stats in other places.
> Strip quoted text from emails and add statistics to ConfusionMatrix
> -------------------------------------------------------------------
>
> Key: MAHOUT-941
> URL: https://issues.apache.org/jira/browse/MAHOUT-941
> Project: Mahout
> Issue Type: Improvement
> Components: Classification
> Reporter: Lance Norskog
> Assignee: Grant Ingersoll
> Priority: Minor
> Fix For: 0.6
>
> Attachments: Bayes.zip, MAHOUT-941.patch, SGD.zip
[jira] [Assigned] (MAHOUT-941) Strip quoted text from emails and
add statistics to ConfusionMatrix
Posted by "Grant Ingersoll (Assigned) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/MAHOUT-941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Grant Ingersoll reassigned MAHOUT-941:
--------------------------------------
Assignee: Grant Ingersoll
> Strip quoted text from emails and add statistics to ConfusionMatrix
> -------------------------------------------------------------------
>
> Key: MAHOUT-941
> URL: https://issues.apache.org/jira/browse/MAHOUT-941
> Project: Mahout
> Issue Type: Improvement
> Components: Classification
> Reporter: Lance Norskog
> Assignee: Grant Ingersoll
> Priority: Minor
> Fix For: 0.6
>
> Attachments: Bayes.zip, MAHOUT-941.patch, SGD.zip
[jira] [Commented] (MAHOUT-941) Improve ConfusionMatrix statistics
Posted by "Robin Anil (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/MAHOUT-941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13289439#comment-13289439 ]
Robin Anil commented on MAHOUT-941:
-----------------------------------
Complementary Results:
=======================================================
Summary
-------------------------------------------------------
Correctly Classified Instances : 68210 97.9058%
Incorrectly Classified Instances : 1459 2.0942%
Total Classified Instances : 69669
=======================================================
Confusion Matrix
-------------------------------------------------------
a        b        <--Classified as
27615    756      |  28371    a = commons.apache.org
703      40595    |  41298    b = cocoon.apache.org
=======================================================
Statistics
-------------------------------------------------------
Kappa : -1.1483
Accuracy : 0.6522
Consistency (stdev of accuracy) : 0.5052
I am seeing this. Why is accuracy 0.65 when it's actually 0.979 (97.9058% in the summary above)? Can you fix this issue?
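As a cross-check on the numbers above: unweighted Cohen's kappa is (p_o - p_e) / (1 - p_e), where p_o is the observed agreement (diagonal sum over total) and p_e is the agreement expected by chance from the row and column totals. Kappa cannot fall below -1, so the reported -1.1483 is out of range. The sketch below is not the code from the patch; the class and method names are hypothetical, and it assumes a square matrix with no 'unclassified' column. Fed the 2x2 matrix above, it should give an overall accuracy matching the summary and a kappa near 0.96.

```java
// Hypothetical sketch of unweighted Cohen's kappa; not Mahout's implementation.
final class KappaSketch {

  // Observed agreement p_o: fraction of instances on the diagonal.
  static double observedAgreement(long[][] m) {
    long diagonal = 0;
    long total = 0;
    for (int i = 0; i < m.length; i++) {
      for (int j = 0; j < m[i].length; j++) {
        total += m[i][j];
        if (i == j) {
          diagonal += m[i][j];
        }
      }
    }
    return (double) diagonal / total;
  }

  // Unweighted kappa: (p_o - p_e) / (1 - p_e).
  // Undefined (division by zero) when chance agreement p_e equals 1.
  static double kappa(long[][] m) {
    int n = m.length;
    long total = 0;
    long[] rowSum = new long[n];
    long[] colSum = new long[n];
    for (int i = 0; i < n; i++) {
      for (int j = 0; j < n; j++) {
        rowSum[i] += m[i][j];
        colSum[j] += m[i][j];
        total += m[i][j];
      }
    }
    double expected = 0.0; // p_e: chance agreement from the marginals
    for (int i = 0; i < n; i++) {
      expected += ((double) rowSum[i] / total) * ((double) colSum[i] / total);
    }
    double observed = observedAgreement(m);
    return (observed - expected) / (1.0 - expected);
  }

  public static void main(String[] args) {
    // Counts copied from the confusion matrix in this message.
    long[][] m = {{27615, 756}, {703, 40595}};
    System.out.printf("accuracy=%.4f kappa=%.4f%n", observedAgreement(m), kappa(m));
  }
}
```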
> Improve ConfusionMatrix statistics
> ----------------------------------
>
> Key: MAHOUT-941
> URL: https://issues.apache.org/jira/browse/MAHOUT-941
> Project: Mahout
> Issue Type: Improvement
> Components: Classification
> Reporter: Lance Norskog
> Assignee: Robin Anil
> Priority: Minor
> Fix For: 0.8
>
> Attachments: Bayes.zip, MAHOUT-941.patch, MAHOUT-941.patch, SGD.zip
[jira] [Updated] (MAHOUT-941) Improve ConfusionMatrix statistics
Posted by "Lance Norskog (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/MAHOUT-941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Lance Norskog updated MAHOUT-941:
---------------------------------
Attachment: MAHOUT-941.patch
Prints Accuracy and Reliability stats, plus the standard deviation of reliability.
Accuracy = "Producer Accuracy", includes unclassified results.
Reliability = "User Accuracy", does not include unclassified results.
> Improve ConfusionMatrix statistics
> ----------------------------------
>
> Key: MAHOUT-941
> URL: https://issues.apache.org/jira/browse/MAHOUT-941
> Project: Mahout
> Issue Type: Improvement
> Components: Classification
> Reporter: Lance Norskog
> Assignee: Robin Anil
> Priority: Minor
> Fix For: 0.8
>
> Attachments: Bayes.zip, MAHOUT-941.patch, MAHOUT-941.patch, MAHOUT-941.patch, SGD.zip
[jira] [Updated] (MAHOUT-941) Improve ConfusionMatrix statistics
Posted by "Grant Ingersoll (Updated) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/MAHOUT-941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Grant Ingersoll updated MAHOUT-941:
-----------------------------------
Fix Version/s: 0.7 (was: 0.6)
> Improve ConfusionMatrix statistics
> ----------------------------------
>
> Key: MAHOUT-941
> URL: https://issues.apache.org/jira/browse/MAHOUT-941
> Project: Mahout
> Issue Type: Improvement
> Components: Classification
> Reporter: Lance Norskog
> Assignee: Grant Ingersoll
> Priority: Minor
> Fix For: 0.7
>
> Attachments: Bayes.zip, MAHOUT-941.patch, SGD.zip