Posted to dev@spamassassin.apache.org by bu...@bugzilla.spamassassin.org on 2005/07/20 20:58:25 UTC

[Bug 4493] New: add pre-tokenize text munge to learner

http://bugzilla.spamassassin.org/show_bug.cgi?id=4493

           Summary: add pre-tokenize text munge to learner
           Product: Spamassassin
           Version: 3.0.4
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: enhancement
          Priority: P2
         Component: Learner
        AssignedTo: dev@spamassassin.apache.org
        ReportedBy: dharris@drh.net


I set up SpamAssassin with a site-wide Bayes database. Users report their own 
spam and, once an administrator approves it, that spam is used to train the 
SpamAssassin Bayes database.

Because users report spam into a global Bayes database, I want the learner to 
ignore my users' e-mail addresses. Otherwise, if one user happens to report 
lots of spam, Bayes would learn that their address itself indicates spam, 
which I don't want.

I have already excluded the To, Cc, and Bcc headers using the 
bayes_ignore_header config option. However, e-mail addresses also show up in 
my Received header, as in the following, and can show up in other places too.

Received: from w3.drh.net ([64.21.76.5])
          (envelope-sender <dh...@drh.net>)
          by secondary.scan1.myactv.net (qmail-ldap-1.03) with SMTP
          for <te...@mail.myactv.net>; 20 Jul 2005 18:12:32 -0000

So I created a patch that applies the regular expression below to any text 
before it is tokenized by Bayes, to wipe out the username:

s/[a-z0-9][a-z0-9\_\.-]{1,48}\@(myactv.net|mail.myactv.net|mss1.myactv.net)/MYACTVREPLACEDUSERNAME\@myactv.net/gi;
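
For illustration only, the effect of the substitution above can be sketched in 
Python (the patch itself is Perl). The function name and the sample addresses 
are hypothetical, and unlike the quoted expression this sketch escapes the 
dots in the domain names, which is an assumption about the intent rather than 
what the patch does.

```python
import re

# Domains taken from the regular expression quoted in the report.
LOCAL_DOMAINS = ("myactv.net", "mail.myactv.net", "mss1.myactv.net")

# Same shape as the quoted Perl pattern: one leading alphanumeric, then
# 1-48 further local-part characters, then "@" and one of the local domains.
# Assumption: dots in the domains are escaped here (the original does not).
USER_ADDRESS = re.compile(
    r"[a-z0-9][a-z0-9_.-]{1,48}@(?:"
    + "|".join(re.escape(domain) for domain in LOCAL_DOMAINS)
    + r")",
    re.IGNORECASE,
)

PLACEHOLDER = "MYACTVREPLACEDUSERNAME@myactv.net"


def mask_local_addresses(text: str) -> str:
    """Replace every local user address with one fixed placeholder address."""
    return USER_ADDRESS.sub(PLACEHOLDER, text)


# Hypothetical sample line, modelled on the Received header above.
print(mask_local_addresses("for <alice@mail.myactv.net>; 20 Jul 2005"))
```

Every local address collapses to the same token, so the learner sees one 
shared placeholder instead of a per-user address.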

Because I have multiple MX servers, I also used this regular expression to 
solve the problem described here http://wiki.apache.org/spamassassin/BayesBitMe

s/scan\d.myactv.net/scan1.myactv.net/g;

A configurable way to rewrite text before tokenization would be appreciated.
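
As a rough sketch of what "configurable" could mean here, the two rewrites 
above can be expressed as an ordered table of rules applied to the text before 
it reaches the tokenizer. This is Python for illustration, not SpamAssassin 
code: the rule table, compile_rules() and munge() are hypothetical names, and 
the patterns escape dots, which the quoted Perl does not.

```python
import re
from typing import List, Pattern, Tuple

# Hypothetical rule table: (pattern, replacement, regex flags).
# Rules are applied in order, each to the output of the previous one.
REWRITE_RULES: List[Tuple[str, str, int]] = [
    # Collapse local user addresses to one placeholder (case-insensitive).
    (r"[a-z0-9][a-z0-9_.-]{1,48}@(?:mail\.|mss1\.)?myactv\.net",
     "MYACTVREPLACEDUSERNAME@myactv.net",
     re.IGNORECASE),
    # Make every scanN host look like scan1, so multiple MX servers
    # produce identical tokens.
    (r"scan\d\.myactv\.net", "scan1.myactv.net", 0),
]


def compile_rules(rules: List[Tuple[str, str, int]]) -> List[Tuple[Pattern[str], str]]:
    """Compile the rule table once, up front."""
    return [(re.compile(pattern, flags), repl) for pattern, repl, flags in rules]


def munge(text: str, compiled: List[Tuple[Pattern[str], str]]) -> str:
    """Apply each rewrite rule in order and return the rewritten text."""
    for pattern, repl in compiled:
        text = pattern.sub(repl, text)
    return text


COMPILED_RULES = compile_rules(REWRITE_RULES)

# Hypothetical input modelled on the Received header quoted earlier.
print(munge("by secondary.scan3.myactv.net for <alice@mail.myactv.net>",
            COMPILED_RULES))
```

Keeping the rules as data rather than code is the point of the request: a site 
could add or remove rewrites without patching the learner.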

Also note that crm114 (http://crm114.sourceforge.net/) has a feature to do this 
same thing.

Here is my patch to add this feature manually:
http://www.davideous.com/qmail/Mail-SpamAssassin-3.0.4-antietam-bayes-customizations-040719-just-rewrite.patch



------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

[Bug 4493] add pre-tokenize text munge to learner

Posted by bu...@bugzilla.spamassassin.org.
http://issues.apache.org/SpamAssassin/show_bug.cgi?id=4493


felicity@apache.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|Undefined                   |3.2.0




------- Additional Comments From felicity@apache.org  2006-12-31 12:48 -------
It seems like a plugin call in Bayes::tokenize() would solve this.  Then people
could filter out whatever tokens they don't want, or add in new tokens, or whatever.




[Bug 4493] RFE: add pre-tokenize text munge to learner

Posted by bu...@bugzilla.spamassassin.org.
http://issues.apache.org/SpamAssassin/show_bug.cgi?id=4493


jm@jmason.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Priority|P2                          |P5
            Summary|add pre-tokenize text munge |RFE: add pre-tokenize text
                   |to learner                  |munge to learner




------- Additional Comments From jm@jmason.org  2007-01-14 07:00 -------
seems unlikely to happen in 3.2.0 without a patch




[Bug 4493] RFE: add pre-tokenize text munge to learner

Posted by bu...@bugzilla.spamassassin.org.
http://issues.apache.org/SpamAssassin/show_bug.cgi?id=4493


jm@jmason.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|3.2.0                       |3.3.0




------- Additional Comments From jm@jmason.org  2007-02-21 12:05 -------
pushing out to 3.3.0, since I don't think it's a 3.2.0 blocker. shout (or change
the milestone) if you disagree....


