Posted to dev@spamassassin.apache.org by bu...@bugzilla.spamassassin.org on 2008/03/21 15:52:32 UTC

[Bug 5861] New: Bayes problem (too common tokens etc)

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=5861

           Summary: Bayes problem (too common tokens etc)
           Product: Spamassassin
           Version: 3.2.4
          Platform: Other
        OS/Version: All
            Status: NEW
          Severity: normal
          Priority: P3
         Component: Learner
        AssignedTo: dev@spamassassin.apache.org
        ReportedBy: hege@hege.li


Created an attachment (id=4277)
 --> (https://issues.apache.org/SpamAssassin/attachment.cgi?id=4277)
Debug output of message

Hi, please see the attached debug output of bayes and message.

I have a really hard time learning these kinds of messages into Bayes. The
problem is that when you get mostly ham from gmail.com or some other common
place, you can't really learn it well.

I have already done:

bayes_ignore_header Received  (too many gmail tokens)
bayes_ignore_header DKIM-Signature  (too many gmail tokens)
bayes_ignore_header DomainKey-Signature  (too many gmail tokens)

And still I get a mere BAYES_50, there are too many gmail tokens left!

How about some option like "bayes_ignore_token /gmail/"? Is there anything
coming up in 3.3.0 that might help the cause?

Another funny thing, I'm not sure why my amavis mail_id (L0YXx-simPYV) is
learned as a token? What good would it do, since it's always random?

Is there any learning on the attachment filenames? I don't see any tokens.


-- 
Configure bugmail: https://issues.apache.org/SpamAssassin/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.

[Bug 5861] Bayes problem (too common tokens etc)

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=5861


Justin Mason <jm...@jmason.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Priority|P3                          |P5





[Bug 5861] Bayes problem (too common tokens etc)

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=5861





--- Comment #12 from Henrik Krohns <he...@hege.li>  2009-03-31 13:37:00 PST ---

I stumbled on a curious FN. It seems wonky that a header or two can dominate
the whole scoring. Just food for thought; hopefully some year I'll have time to
do deeper tests.

Headers:

X-Proofpoint-Virus-Version: vendor=fsecure
 engine=1.12.7400:2.4.4,1.2.40,4.0.166
 definitions=2009-03-28_05:2009-03-27,2009-03-28,2009-03-27 signatures=0
X-Proofpoint-Spam-Details: rule=notspam policy=default score=26 spamscore=26
 ipscore=0 phishscore=99 bulkscore=0 adultscore=0 classifier=spam adjust=0
 reason=
 mlx engine=5.0.0-0811170000 definitions=main-0903280069

Parse:

dbg: bayes: tok_get_all: token count: 147
dbg: bayes: token HX-Proofpoint-Virus-Version:1.12.7400 => 0.00190587422573309
dbg: bayes: token HX-Proofpoint-Virus-Version:fsecure => 0.00434363906304506
dbg: bayes: token HX-Proofpoint-Virus-Version:vendor => 0.00458230337259314
dbg: bayes: token HX-Proofpoint-Virus-Version:definitions => 0.00458230337259314
dbg: bayes: token HX-Proofpoint-Virus-Version:engine => 0.00458230337259314
dbg: bayes: token HX-Proofpoint-Virus-Version:signatures => 0.00458230337259314
dbg: bayes: token HX-Proofpoint-Virus-Version:sk:2.4.4,1 => 0.00458803535339196
dbg: bayes: token HX-Proofpoint-Virus-Version:sk:2009-03 => 0.00560260989702483
dbg: bayes: token Hx-mimeole:Exchange => 0.00569605506742619
dbg: bayes: token Hx-mimeole:Produced => 0.00589653854308459
dbg: bayes: token Hx-mimeole:Microsoft => 0.00589653854308459
dbg: bayes: token Hx-mimeole:V6.5 => 0.00700724277522037
dbg: bayes: token HContent-class:content-classes => 0.00840337534766064
dbg: bayes: token HContent-class:urn => 0.00840353709745381
dbg: bayes: token HContent-class:message => 0.00856128997205224
dbg: bayes: token HX-Proofpoint-Spam-Details:sk:5.0.0-0 => 0.0107477511672061
dbg: bayes: token D*live.com => 0.98880239141895
dbg: bayes: token HX-Proofpoint-Spam-Details:mlx => 0.0132332664149436
dbg: bayes: token HX-Proofpoint-Spam-Details:adultscore => 0.0132332664149436
dbg: bayes: token HX-Proofpoint-Spam-Details:spam => 0.0134194375404317
dbg: bayes: token HX-Proofpoint-Spam-Details:ipscore => 0.0134194375404317
dbg: bayes: token HX-Proofpoint-Spam-Details:phishscore => 0.0134194375404317
dbg: bayes: token HX-Proofpoint-Spam-Details:spamscore => 0.0134194375404317
dbg: bayes: token HX-Proofpoint-Spam-Details:bulkscore => 0.0137087277138293
dbg: bayes: token sk:helpdes => 0.015616392737519
dbg: bayes: token HX-Proofpoint-Spam-Details:rule => 0.0158325080420237
dbg: bayes: token HX-Proofpoint-Spam-Details:notspam => 0.0158599391972407
dbg: bayes: token HX-Proofpoint-Spam-Details:score => 0.0171270007084152
dbg: bayes: token HX-Proofpoint-Spam-Details:definitions => 0.0171270007084152
dbg: bayes: token HX-Proofpoint-Spam-Details:adjust => 0.0171270007084152
dbg: bayes: token HX-Proofpoint-Spam-Details:policy => 0.0171270007084152
dbg: bayes: token HX-Proofpoint-Spam-Details:reason => 0.0171270007084152
dbg: bayes: token HX-Proofpoint-Spam-Details:engine => 0.0171270007084152
dbg: bayes: token HX-Proofpoint-Spam-Details:default => 0.0171566355845046
dbg: bayes: token HX-Proofpoint-Spam-Details:classifier => 0.0173669864520215
dbg: bayes: token mailbox => 0.0262678473925646
dbg: bayes: token Username => 0.0444462667192503
dbg: bayes: token Unit => 0.107186995611258
dbg: bayes: token increase => 0.88997938686673
dbg: bayes: token username => 0.110283938874452
dbg: bayes: token increased => 0.869317730990728
dbg: bayes: token size => 0.84902152925483
dbg: bayes: token message => 0.151844260253207
dbg: bayes: score = 1.93606242149258e-12



[Bug 5861] Bayes problem (too common tokens etc)

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=5861





--- Comment #13 from Matt Kettler <mk...@verizon.net>  2009-03-31 18:37:51 PST ---
Are the X-Proofpoint-* headers in all of your mail (i.e., added by an upstream
server)?



[Bug 5861] Bayes problem (too common tokens etc)

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=5861

Justin Mason <jm...@jmason.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
              Group|security                    |
          Component|Security                    |Libraries
         AssignedTo|security@spamassassin.apach |dev@spamassassin.apache.org
                   |e.org                       |

--- Comment #19 from Justin Mason <jm...@jmason.org> 2010-01-27 03:16:31 UTC ---
reassigning, too


[Bug 5861] Bayes problem (too common tokens etc)

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=5861





--- Comment #16 from Henrik Krohns <he...@hege.li>  2009-03-31 21:40:11 PST ---

Yes, you have a general point, but it has little relevance to the problem.

I wonder what would be the best way to fix it. Select a few of the highest- and
lowest-scoring tokens from a single header? I guess some validation runs would
be needed.
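
To make the idea of capping tokens per header concrete, here is a rough
Python sketch (not SpamAssassin code). It assumes header tokens look like
"H<header>:<value>", as in the debug output earlier in this thread; the
function names and the keep_each_end parameter are made up for illustration.
Whether such a cap helps accuracy would still need a validation run.

```python
from collections import defaultdict


def header_of(token):
    """Return the header name for tokens shaped like 'H<name>:<value>', else None.

    Assumption: any token starting with 'H' and containing ':' is a header token.
    """
    if token.startswith("H") and ":" in token:
        return token[1:].split(":", 1)[0]
    return None


def cap_header_tokens(token_probs, keep_each_end=2):
    """Keep only the lowest and highest scoring tokens of each header.

    Tokens that do not come from a header are passed through untouched.
    """
    by_header = defaultdict(list)
    result = {}
    for token, prob in token_probs.items():
        name = header_of(token)
        if name is None:
            result[token] = prob
        else:
            by_header[name].append((prob, token))

    for items in by_header.values():
        items.sort()  # ascending by probability
        if len(items) > 2 * keep_each_end:
            items = items[:keep_each_end] + items[-keep_each_end:]
        for prob, token in items:
            result[token] = prob
    return result


# Probabilities rounded from the debug output in comment #12.
sample = {
    "HX-Proofpoint-Spam-Details:mlx": 0.0132,
    "HX-Proofpoint-Spam-Details:spam": 0.0134,
    "HX-Proofpoint-Spam-Details:rule": 0.0158,
    "HX-Proofpoint-Spam-Details:score": 0.0171,
    "HX-Proofpoint-Spam-Details:classifier": 0.0174,
    "mailbox": 0.0263,
    "D*live.com": 0.9888,
}
capped = cap_header_tokens(sample, keep_each_end=1)
```

With keep_each_end=1, only the "mlx" and "classifier" tokens survive from the
X-Proofpoint-Spam-Details header, while "mailbox" and "D*live.com" are kept.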



[Bug 5861] Bayes problem (too common tokens etc)

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=5861





--- Comment #15 from Matt Kettler <mk...@verizon.net>  2009-03-31 21:27:53 PST ---
The question was relevant because a header that is in all of your mail,
and has a lot of unchanging text in it, should have a local ignore.

I do see your point on capping the number of tokens per header.



[Bug 5861] Bayes problem (too common tokens etc)

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=5861





--- Comment #17 from Justin Mason <jm...@jmason.org>  2009-04-01 01:44:55 PST ---
(In reply to comment #14)
> I don't see how it's relevant, but no. It's from some US uni.
> 
> The point is that there probably should be some limit on how many tokens to get
> from a header. If I learn that as spam, all ham mail containing those headers
> will be strongly biased to spam (an uneducated, but logical guess).

I think you're overestimating its effects on the chi-square probability
combining algorithm; actually, there's a good chance those values won't skew it
much, assuming there are stronger tokens found elsewhere.
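
For readers unfamiliar with chi-square combining, the following Python sketch
shows the Fisher-style scheme popularized by Gary Robinson. It is only an
illustration of the general technique; SpamAssassin's own combiner may differ
in details, and the function names here are invented. Probabilities must lie
strictly between 0 and 1.

```python
import math


def chi2q(x2, dof):
    """Upper-tail probability of a chi-square variable; dof must be even."""
    m = x2 / 2.0
    term = math.exp(-m)  # underflows to 0.0 for very large m, which is harmless here
    total = term
    for i in range(1, dof // 2):
        term *= m / i
        total += term
    return min(total, 1.0)


def combine(probs):
    """Combine per-token spam probabilities into one score in [0, 1]."""
    n = len(probs)
    if n == 0:
        return 0.5
    ln_prod_not_spam = sum(math.log(1.0 - p) for p in probs)
    ln_prod_spam = sum(math.log(p) for p in probs)
    spamminess = 1.0 - chi2q(-2.0 * ln_prod_not_spam, 2 * n)
    hamminess = 1.0 - chi2q(-2.0 * ln_prod_spam, 2 * n)
    return (spamminess - hamminess + 1.0) / 2.0


neutral = combine([0.5] * 8)    # exactly 0.5: neither test sees any evidence
spammy = combine([0.99] * 5)    # close to 1
hammy = combine([0.01] * 5)     # close to 0
```

The combiner is symmetric: swapping every p for 1-p turns a score s into 1-s.
How a particular mix of weak and strong tokens balances out is best checked by
feeding real token probabilities through it rather than by guessing.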

The only way to get a useful idea of what's really happening is to run a
10-fold cross validation run. 
http://wiki.apache.org/spamassassin/TenFoldCrossValidation



[Bug 5861] Bayes problem (too common tokens etc)

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=5861





--- Comment #14 from Henrik Krohns <he...@hege.li>  2009-03-31 20:59:26 PST ---

I don't see how it's relevant, but no. It's from some US uni.

The point is that there probably should be some limit on how many tokens to get
from a header. If I learn that as spam, all ham mail containing those headers
will be strongly biased to spam (an uneducated, but logical guess).



[Bug 5861] Bayes problem (too common tokens etc)

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=5861





--- Comment #4 from Justin Mason <jm...@jmason.org>  2008-03-22 05:54:11 PST ---
(In reply to comment #3)
> (In reply to comment #2)
> > Maybe it should be wise to add DKIM-Signature and DomainKey-Signature to the
> > default ignore list?
> 
> +1
> 
> There's lots of useful header information, but cryptographic signatures aren't
> included in that imo.

However the presence of the header might be, for some people, or some tokens in
those headers.  For example, you are *DEFINITELY* losing good data from the
Received: headers, I can guarantee that.

It's important that we don't make changes to the ignore list without
benchmarking its effects using 10-fold cross validation testing.  One thing
I've found, time and time again, is that Bayes probability combining is a lot
smarter than you're giving it credit for -- relatively-weak "ham" or "spam"
probabilities will cancel each other out, allowing stronger tokens to have an
effect quite nicely.  It's not always as simple as it may appear in
isolation.

(anyway, having said that, if someone wants to do a 10-fold cross-validation
run testing ignoring the DK/DKIM sig headers, go ahead.)



[Bug 5861] Bayes problem (too common tokens etc)

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=5861


Henrik Krohns <he...@hege.li> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|RESOLVED                    |REOPENED
         Resolution|FIXED                       |




--- Comment #8 from Henrik Krohns <he...@hege.li>  2008-04-10 03:50:18 PST ---

I'm not comfortable in closing this bug yet.

(In reply to comment #7)
> (In reply to comment #5)
> > So is there something that can help with these short messages, that don't
> > create many tokens? When there aren't enough body tokens, by default all those
> > hammy header tokens are sure to prevent correct scoring. It forces me to ignore
> > such headers.
> 
> Training on error should help -- train mostly on FPs and FNs from now on.

How can this help? If it wasn't obvious, of course I trained it. It didn't help.

A mail from gmail had so many hammy tokens, it is impossible to train without
other more specific tokens.

Isn't there more stuff you can create tokens from, like filenames? What if you
get a mass of spam from gmail, containing only a .doc attachment and no body? It
will still score BAYES_50 or something, all the hammy gmail tokens will prevent
better scores!! I demonstrated this already in my first post. At least my DKIM
patch should help remove some of the excess tokens. I'll try to test what
effect it has.

I know you guys are busy, but I think this isn't something to just shrug off.
Or is it just something that is rare and "gotta live with it"? Is there any
interest from your side in enhancing the Bayes engine or does it have to come
from contributions? You are the ones that know the system best.


> > Also what's the deal with saving those X-Spam-Relays-Internal tokens? I ignored
> > it since I can't figure out any purpose to bloat my db.
> 
> Consider a site with 2 MXes -- a primary and secondary MX.  both are listed as
> IPs in internal_networks.  For some reason, spammers tend to like sending spam
> via the secondary.  The presence of that MX's IP in the
> 'X-Spam-Relays-Internal' hdr therefore becomes a spam sign, for that site.
>

There is still at least one question unanswered. Why is the _unique_ mail id
recorded as a token? I understand IP, but not that.

If you don't have time, then please answer when you have it. It seems you just
try to blaze through as fast as you can.

I will try to analyze and help with this, but I could really use some
insightful input.



[Bug 5861] Bayes problem (too common tokens etc)

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=5861





--- Comment #1 from Henrik Krohns <he...@hege.li>  2008-03-21 09:46:01 PST ---

This is the crude patch that I'm testing for now..

--- Bayes.pm.orig       Fri Mar 21 18:44:41 2008
+++ Bayes.pm    Fri Mar 21 18:46:39 2008
@@ -329,6 +329,7 @@
   my %tokens;
   foreach my $token (@tokens) {
     next unless length($token); # skip 0 length tokens
+    next if $token =~ /(?:gmail|yahoo|hotmail)/; # skip too hammy tokens
     $tokens{substr(sha1($token), -5)} = $token;
   }
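
The hard-coded regex above could be generalized into a configurable list of
patterns, along the lines of the "bayes_ignore_token" option suggested in the
original report. That option is only a proposal in this thread, not an
existing setting. A rough Python sketch of the filtering step (not the Perl
code being patched; the function name is made up):

```python
import re


def drop_ignored_tokens(tokens, ignore_patterns):
    """Drop empty tokens and tokens matching any of the ignore patterns.

    ignore_patterns stands in for values of a hypothetical
    'bayes_ignore_token' setting; matching is an unanchored regex search,
    like the =~ test in the patch.
    """
    compiled = [re.compile(pattern) for pattern in ignore_patterns]
    return [
        token
        for token in tokens
        if token and not any(rx.search(token) for rx in compiled)
    ]


tokens = ["", "HReceived:gmail.com", "viagra", "D*yahoo.com", "hotmail"]
kept = drop_ignored_tokens(tokens, ["gmail|yahoo|hotmail"])
```

Note that an unanchored match also drops any body token that merely contains
one of these strings, which may or may not be what is wanted.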



[Bug 5861] Bayes problem (too common tokens etc)

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=5861


Justin Mason <jm...@jmason.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED
   Target Milestone|Undefined                   |3.3.0




--- Comment #7 from Justin Mason <jm...@jmason.org>  2008-04-10 01:48:34 PST ---
(In reply to comment #5)
> So is there something that can help with these short messages, that don't
> create many tokens? When there aren't enough body tokens, by default all those
> hammy header tokens are sure to prevent correct scoring. It forces me to ignore
> such headers.

Training on error should help -- train mostly on FPs and FNs from now on.

> Also what's the deal with saving those X-Spam-Relays-Internal tokens? I ignored
> it since I can't figure out any purpose to bloat my db.

Consider a site with 2 MXes -- a primary and secondary MX.  both are listed as
IPs in internal_networks.  For some reason, spammers tend to like sending spam
via the secondary.  The presence of that MX's IP in the
'X-Spam-Relays-Internal' hdr therefore becomes a spam sign, for that site.

If, on the other hand, a token appears equally in both ham and spam:

  - its P value will tend towards the middle ground: 0.5
  - this means that it will fall inside the $MIN_PROB_STRENGTH band around 0.5:

    # Should we ignore tokens with probs very close to the middle ground (.5)?
    # tokens need to be outside the [ .5-MPS, .5+MPS ] range to be used.
    our $MIN_PROB_STRENGTH = 0.346;

  - tokens inside that range are unused

  - unused tokens don't have their access times updated, and therefore
    are eventually expired from the Bayes db.
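
As an illustration of the rule quoted above, this Python sketch applies the
same threshold to a handful of probabilities taken from the debug output in
comment #12, plus two invented entries. The function names are made up, and
treating a token exactly on the boundary as usable is an assumption.

```python
MIN_PROB_STRENGTH = 0.346  # value quoted from Bayes.pm above


def is_significant(prob, min_strength=MIN_PROB_STRENGTH):
    """True when prob lies outside the [0.5 - MPS, 0.5 + MPS] band."""
    return abs(prob - 0.5) >= min_strength


def significant_tokens(token_probs):
    """Keep only the tokens strong enough to take part in scoring."""
    return {tok: p for tok, p in token_probs.items() if is_significant(p)}


sample = {
    "D*live.com": 0.98880239141895,
    "increase": 0.88997938686673,
    "size": 0.84902152925483,
    "message": 0.151844260253207,
    "Unit": 0.107186995611258,
    "equal-in-ham-and-spam": 0.5,  # invented: seen equally in ham and spam
    "slightly-spammy": 0.6,        # invented: too weak to be used
}
kept = significant_tokens(sample)
```

All five tokens from the debug output pass the test, including "size" at about
0.849, which clears the 0.846 upper bound only narrowly. The two invented
tokens near 0.5 are dropped.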

thanks for the patch -- I'll apply it.  we should probably be running 
a 10-fold cross validation, but I'm a bit busy and I think it's a good
idea as a hunch. ;)

: jm 573...; svn commit -m "bug 5861: add DKIM-Signature and
DomainKey-Signature to the set of headers whose contents are ignored for Bayes;
their presence is marked, however.  thanks to Henrik Krohns"
lib/Mail/SpamAssassin/Plugin/Bayes.pm
Sending        lib/Mail/SpamAssassin/Plugin/Bayes.pm
Transmitting file data .
Committed revision 646688.
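
The commit message says the contents of these headers are ignored while their
presence is still marked. A minimal Python sketch of that idea follows; it is
not the committed Perl change. The "present" marker value and the whitespace
tokenizer are assumptions made for illustration.

```python
# Headers whose contents are skipped; only their presence becomes a token.
PRESENCE_ONLY_HEADERS = {"dkim-signature", "domainkey-signature"}


def header_tokens(name, value):
    """Turn one header into Bayes-style 'H<name>:<value>' tokens."""
    if name.lower() in PRESENCE_ONLY_HEADERS:
        # Hypothetical marker: one fixed token regardless of the contents.
        return [f"H{name}:present"]
    # Simplified tokenizer: split the header value on whitespace.
    return [f"H{name}:{word}" for word in value.split()]


dkim = header_tokens("DKIM-Signature", "v=1; a=rsa-sha256; b=uuyE1wR")
subject = header_tokens("Subject", "hello world")
```

This keeps the "message was signed" signal while discarding the random
signature fragments that showed up as one-off tokens in comment #2.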



[Bug 5861] Bayes problem (too common tokens etc)

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=5861





--- Comment #11 from Justin Mason <jm...@jmason.org>  2008-04-10 04:25:13 PST ---
(In reply to comment #10)
> Ok I'll have a look at that.. might take a while as I need to create some good
> corpus first.

if you like, I can share one for you to use...



[Bug 5861] Bayes problem (too common tokens etc)

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=5861





--- Comment #10 from Henrik Krohns <he...@hege.li>  2008-04-10 04:16:58 PST ---

Ok I'll have a look at that.. might take a while as I need to create some good
corpus first.



[Bug 5861] Bayes problem (too common tokens etc)

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=5861





--- Comment #2 from Henrik Krohns <he...@hege.li>  2008-03-21 09:56:54 PST ---

Maybe it should be wise to add DKIM-Signature and DomainKey-Signature to the
default ignore list?

[29942] dbg: bayes: token 'HDKIM-Signature:beta' => 0.00806992358626491
[29942] dbg: bayes: token 'HDomainKey-Signature:beta' => 0.00893960579973894
[29942] dbg: bayes: token 'HDKIM-Signature:mime-version' => 0.0102741290765491
[29942] dbg: bayes: token 'HDKIM-Signature:received' => 0.0102965519261728
[29942] dbg: bayes: token 'HDKIM-Signature:sk:domaink' => 0.0103000102852899
[29942] dbg: bayes: token 'HDKIM-Signature:relaxed' => 0.0103817043841538
[29942] dbg: bayes: token 'HDKIM-Signature:rsa-sha256' => 0.0111528115306108
[29942] dbg: bayes: token 'HDomainKey-Signature:content-type' => 0.0117297689443074
[29942] dbg: bayes: token 'HDomainKey-Signature:mime-version' => 0.0117316778713491
[29942] dbg: bayes: token 'HDomainKey-Signature:subject' => 0.0120925906621403
[29942] dbg: bayes: token 'HDomainKey-Signature:message-id' => 0.0121248330712668
[29942] dbg: bayes: token 'HDomainKey-Signature:sk:uuyE1wR' => 0.986543689320388
[29942] dbg: bayes: token 'HDKIM-Signature:sk:TMZO4KJ' => 0.986543689320388
[29942] dbg: bayes: token 'HDomainKey-Signature:oiyxr0w' => 0.986543689320388
[29942] dbg: bayes: token 'HDKIM-Signature:sk:i0UgPfZ' => 0.986543689320388
[29942] dbg: bayes: token 'HDomainKey-Signature:peWuJ2k' => 0.986543689320388
[29942] dbg: bayes: token 'HDKIM-Signature:sk:Hyhy38j' => 0.986543689320388
[29942] dbg: bayes: token 'HDKIM-Signature:sk:z8grhBD' => 0.986543689320388
[29942] dbg: bayes: token 'HDomainKey-Signature:sk:WCcy0Q8' => 0.986543689320388

Or else make it more intelligent and skip the static ones at the top, since
they are generally the same everywhere.



[Bug 5861] Bayes problem (too common tokens etc)

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=5861





--- Comment #5 from Henrik Krohns <he...@hege.li>  2008-03-22 06:29:50 PST ---

So is there something that can help with these short messages, that don't
create many tokens? When there aren't enough body tokens, by default all those
hammy header tokens are sure to prevent correct scoring. It forces me to ignore
such headers.

Also what's the deal with saving those X-Spam-Relays-Internal tokens? I ignored
it since I can't figure out any purpose to bloat my db.



[Bug 5861] Bayes problem (too common tokens etc)

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=5861





--- Comment #3 from Theo Van Dinter <fe...@apache.org>  2008-03-21 11:50:19 PST ---
(In reply to comment #2)
> Maybe it should be wise to add DKIM-Signature and DomainKey-Signature to the
> default ignore list?

+1

There's lots of useful header information, but cryptographic signatures aren't
included in that imo.



[Bug 5861] Bayes problem (too common tokens etc)

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=5861





--- Comment #6 from Henrik Krohns <he...@hege.li>  2008-04-09 23:17:02 PST ---
Created an attachment (id=4293)
 --> (https://issues.apache.org/SpamAssassin/attachment.cgi?id=4293)
Mark only presence of DKIM/DomainKey-Signature on Bayes



[Bug 5861] Bayes problem (too common tokens etc)

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=5861





--- Comment #9 from Justin Mason <jm...@jmason.org>  2008-04-10 04:07:16 PST ---
I agree, attachment filenames would be a great source of tokens.  *adding* new
tokens isn't likely to be a problem.

If you would like to see this stuff changed, here's what to do -- run a
ten-fold cross-validation that demonstrates an improvement in accuracy:

http://svn.apache.org/repos/asf/spamassassin/trunk/masses/bayes-testing/

that's how we measure the effects of Bayes tweaks.  stuff that performs well in
that testing is MUCH more likely to get in.
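
For anyone new to the procedure, the splitting part of a ten-fold
cross-validation can be sketched in a few lines of Python. This is not the
masses/bayes-testing tooling linked above, only an illustration of the
technique; the function name and the seed parameter are made up.

```python
import random


def ten_fold_splits(messages, folds=10, seed=0):
    """Yield (train, test) pairs; each message is tested exactly once."""
    shuffled = list(messages)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps runs repeatable
    buckets = [shuffled[i::folds] for i in range(folds)]
    for i in range(folds):
        test = buckets[i]
        train = [m for j, bucket in enumerate(buckets) if j != i for m in bucket]
        yield train, test


corpus = list(range(100))  # stand-in for 100 message identifiers
splits = list(ten_fold_splits(corpus))
```

For each pair, a fresh Bayes database would be trained on the train set and
scored against the test set, once with and once without the proposed tweak,
and the false positive and false negative counts compared across all ten
folds.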

