Posted to dev@spamassassin.apache.org by bu...@bugzilla.spamassassin.org on 2006/09/08 15:34:24 UTC

[Bug 5096] New: replace some mass-check spam corpora with spamtrap data

http://issues.apache.org/SpamAssassin/show_bug.cgi?id=5096

           Summary: replace some mass-check spam corpora with spamtrap data
           Product: Spamassassin
           Version: SVN Trunk (Latest Devel Version)
          Platform: Other
        OS/Version: other
            Status: NEW
          Severity: normal
          Priority: P5
         Component: Score Generation
        AssignedTo: dev@spamassassin.apache.org
        ReportedBy: jm@jmason.org


This is an issue that I want to get into BZ so I don't forget it.

I think some of our mass-check corpora are no longer receiving representative
spam feeds.

Due to spam volume, many of us no longer accept all the spam that is sent to
our MXes -- the "easy" spam is being rejected during the SMTP conversation, and
therefore never makes it into our corpus. For example, my MX is now using
SBL+XBL during the SMTP conversation, rejecting about 40% of the incoming spam
to jmason.org.  It looks like other mass-checkers are doing something similar,
based on the network rule hit-rates on one corpus compared to another:

http://ruleqa.spamassassin.org/20060902-r439560-n/RCVD_IN_XBL/detail#DETAILS_all_mass_check_date_rev_20060902_r439560_n

This is a problem, since the score generation process relies on having
a "representative" selection of spam and ham, and if half of the "easy" spam
is not in the corpus, the selection is no longer representative.
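
As a rough illustration of the cross-corpus comparison described above, the
following Python sketch computes the hit-rate of one network rule per corpus
and flags corpora whose rate is far below the highest one seen. It is only a
sketch under stated assumptions: the (corpus, rules) tuples, the corpus names
"trap" and "filtered", the rule name OTHER_RULE, and the 0.5 threshold are all
made up for the example, and it does not parse real mass-check logs.

```python
from collections import defaultdict


def rule_hit_rates(results, rule):
    """Return {corpus: fraction of that corpus's messages that hit `rule`}."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for corpus, rules in results:
        totals[corpus] += 1
        if rule in rules:
            hits[corpus] += 1
    return {corpus: hits[corpus] / totals[corpus] for corpus in totals}


def suspect_corpora(rates, ratio=0.5):
    """List corpora whose hit-rate is below `ratio` times the highest rate.

    The threshold is an arbitrary illustrative choice, not a project rule.
    """
    best = max(rates.values(), default=0.0)
    return sorted(c for c, r in rates.items() if r < ratio * best)


# Hypothetical sample data: one (corpus name, set of rules hit) per message.
results = [
    ("trap", {"RCVD_IN_XBL", "OTHER_RULE"}),
    ("trap", {"RCVD_IN_XBL"}),
    ("trap", {"OTHER_RULE"}),
    ("trap", {"RCVD_IN_XBL"}),
    ("filtered", {"OTHER_RULE"}),
    ("filtered", {"OTHER_RULE"}),
    ("filtered", {"RCVD_IN_XBL"}),
    ("filtered", set()),
]

rates = rule_hit_rates(results, "RCVD_IN_XBL")
print(rates)
print(suspect_corpora(rates))
```

A corpus flagged this way is only a candidate for closer inspection; a low
hit-rate could also reflect a genuinely different spam mix rather than
SMTP-time rejection.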

I suggest that we should stop mass-checks of 'problematic' corpora and replace
them with (reliable, carefully vetted, bounce-filtered) spamtrap data.
I also suggest that these spamtraps be set up with some kind of limited
SpamAssassin ruleset, so that they can record "live" network rule results
on the trapped mails.

Theo noted -- 
> FWIW, my personal mail and my spamtraps have no filtering other than SA.
> I can create new/share some of my current spamtrap addresses if people
> want to "spread them around" more than I have (which isn't a lot).

(ps: theo, is  /home/corpus/SA/corpus/ham/hamtrap/2006/09/01/8fc9d5de19
a ham?  could you verify?  it hits XBL in the mass-check results above)



------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

[Bug 5096] replace some mass-check spam corpora with spamtrap data

Posted by bu...@bugzilla.spamassassin.org.
http://issues.apache.org/SpamAssassin/show_bug.cgi?id=5096


jm@jmason.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED




------- Additional Comments From jm@jmason.org  2006-12-13 04:37 -------
this is fixed; a good portion of my corpus comes from trustworthy traps nowadays.
(the "jm" part, not "bb-jm" btw)




[Bug 5096] replace some mass-check spam corpora with spamtrap data

Posted by bu...@bugzilla.spamassassin.org.
http://issues.apache.org/SpamAssassin/show_bug.cgi?id=5096


jm@jmason.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|Undefined                   |3.2.0







[Bug 5096] replace some mass-check spam corpora with spamtrap data

Posted by bu...@bugzilla.spamassassin.org.
http://issues.apache.org/SpamAssassin/show_bug.cgi?id=5096





------- Additional Comments From felicity@apache.org  2006-09-08 18:11 -------
> (ps: theo, is  /home/corpus/SA/corpus/ham/hamtrap/2006/09/01/8fc9d5de19
> a ham?  could you verify?  it hits XBL in the mass-check results above)

yes, that's ham.


btw: this sounds related to bug 4912.


