Posted to dev@spamassassin.apache.org by bu...@bugzilla.spamassassin.org on 2006/09/08 15:34:24 UTC
[Bug 5096] New: replace some mass-check spam corpora with spamtrap data
http://issues.apache.org/SpamAssassin/show_bug.cgi?id=5096
Summary: replace some mass-check spam corpora with spamtrap data
Product: Spamassassin
Version: SVN Trunk (Latest Devel Version)
Platform: Other
OS/Version: other
Status: NEW
Severity: normal
Priority: P5
Component: Score Generation
AssignedTo: dev@spamassassin.apache.org
ReportedBy: jm@jmason.org
This is an issue that I want to get into BZ so I don't forget it.
I think some of our mass-check corpora are no longer receiving representative
spam feeds.
Due to spam volume, many of us no longer accept all the spam that is sent to
our MXes -- the "easy" spam is being rejected during the SMTP conversation, and
therefore never makes it into our corpus. For example, my MX is now using
SBL+XBL during the SMTP conversation, rejecting about 40% of the incoming spam
to jmason.org. It looks like other mass-checkers are doing something similar,
based on the network rule hit-rates on one corpus compared to another:
http://ruleqa.spamassassin.org/20060902-r439560-n/RCVD_IN_XBL/detail#DETAILS_all_mass_check_date_rev_20060902_r439560_n
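The per-corpus comparison described above can be illustrated with a minimal sketch. It assumes the mass-check results have already been parsed into (corpus name, set of rule names hit) pairs, one per message; the function name `rule_hit_rates`, the corpus labels and the sample data are all hypothetical and are not taken from the ruleqa output.

```python
from collections import defaultdict


def rule_hit_rates(results, rule):
    """Return {corpus: fraction of messages hitting `rule`}.

    `results` is an iterable of (corpus_name, set_of_rule_names) pairs,
    one pair per scanned message. This input shape is an assumption made
    for illustration, not the mass-check log format itself.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for corpus, rules in results:
        totals[corpus] += 1
        if rule in rules:
            hits[corpus] += 1
    return {corpus: hits[corpus] / totals[corpus] for corpus in totals}


# Made-up sample: corpus_a accepts everything, corpus_b rejects most
# blocklisted senders at SMTP time, so few XBL hits reach its corpus.
sample = [
    ("corpus_a", {"RCVD_IN_XBL", "BAYES_99"}),
    ("corpus_a", {"RCVD_IN_XBL"}),
    ("corpus_a", {"BAYES_99"}),
    ("corpus_a", {"RCVD_IN_XBL"}),
    ("corpus_b", {"BAYES_99"}),
    ("corpus_b", {"RCVD_IN_XBL"}),
    ("corpus_b", set()),
    ("corpus_b", {"BAYES_99"}),
]

rates = rule_hit_rates(sample, "RCVD_IN_XBL")
for corpus, rate in sorted(rates.items()):
    print(f"{corpus}: {rate:.0%}")
```

A large gap between corpora on a network rule such as RCVD_IN_XBL would be a hint, not proof, that one of them is filtered before delivery.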
This is a problem, since the score generation process relies on having
a "representative" selection of spam and ham, and if half of the "easy" spam
is not in the corpus, that's not happening.
I suggest that we should stop mass-checks of 'problematic' corpora and replace
them with (reliable, carefully vetted, bounce-filtered) spamtrap data.
I also suggest that these spamtraps be set up with some kind of limited
SpamAssassin ruleset, so that they can record "live" network rule results
on the trapped mails.
Theo noted --
> FWIW, my personal mail and my spamtraps have no filtering other than SA.
> I can create new/share some of my current spamtrap addresses if people
> want to "spread them around" more than I have (which isn't a lot).
(ps: theo, is /home/corpus/SA/corpus/ham/hamtrap/2006/09/01/8fc9d5de19
a ham? could you verify? it hits XBL in the mass-check results above)
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
[Bug 5096] replace some mass-check spam corpora with spamtrap data
Posted by bu...@bugzilla.spamassassin.org.
http://issues.apache.org/SpamAssassin/show_bug.cgi?id=5096
jm@jmason.org changed:
What            |Removed   |Added
----------------------------------------------------------------------------
Status          |NEW       |RESOLVED
Resolution      |          |FIXED
------- Additional Comments From jm@jmason.org 2006-12-13 04:37 -------
this is fixed; a good portion of my corpus comes from trustworthy traps nowadays.
(the "jm" part, not "bb-jm" btw)
[Bug 5096] replace some mass-check spam corpora with spamtrap data
Posted by bu...@bugzilla.spamassassin.org.
http://issues.apache.org/SpamAssassin/show_bug.cgi?id=5096
jm@jmason.org changed:
What            |Removed   |Added
----------------------------------------------------------------------------
Target Milestone|Undefined |3.2.0
[Bug 5096] replace some mass-check spam corpora with spamtrap data
Posted by bu...@bugzilla.spamassassin.org.
http://issues.apache.org/SpamAssassin/show_bug.cgi?id=5096
------- Additional Comments From felicity@apache.org 2006-09-08 18:11 -------
> (ps: theo, is /home/corpus/SA/corpus/ham/hamtrap/2006/09/01/8fc9d5de19
> a ham? could you verify? it hits XBL in the mass-check results above)
yes, that's ham.
btw: this sounds related to bug 4912.