Posted to users@spamassassin.apache.org by Robert Menschel <Ro...@Menschel.net> on 2004/02/17 15:52:42 UTC

Re[2]: possible bayes poison error to use against them?

Hello Loren,

Monday, February 16, 2004, 7:42:19 AM, you wrote:

LW> body     DUMB_PERIODS    /(?:.*\b[a-z]{3,10}[\.\!][a-z]{3,10}\b){6,30}/i
LW> describe DUMB_PERIODS    Writer doesn't put spaces after periods.
LW> score    DUMB_PERIODS    2.0    # not real high, can match source code listings

LW> This is UNTESTED, but might help.  You can twiddle the score higher if
LW> nobody ever sends you code listings in mail.  I'd really like to run this
LW> against a corpus and see how much ham it catches before putting it in my own
LW> configuration.
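
A quick way to see what the pattern catches is to try it by hand on a few
lines. The sketch below uses Python's re module rather than SpamAssassin
itself, and it assumes each sample reaches the rule as a single line. The
sample strings are made up for illustration and come from no corpus.

```python
import re

# Same pattern as the DUMB_PERIODS body rule; the /i flag becomes re.IGNORECASE.
DUMB_PERIODS = re.compile(
    r"(?:.*\b[a-z]{3,10}[\.\!][a-z]{3,10}\b){6,30}", re.IGNORECASE
)


def hits(line: str) -> bool:
    """True if one line holds at least six word.word or word!word pairs."""
    return DUMB_PERIODS.search(line) is not None


# Hypothetical samples, invented for this sketch.
spammy = ("buy now.cheap pills.order today.free gift.limited "
          "offer.click here.great deal")
hammy = ("see config.txt, notes.doc, build.log, main.cpp, "
         "utils.hpp and readme.txt")
normal = "Thanks for the patch. I will test it tonight. More soon."

for label, text in (("spammy", spammy), ("hammy", hammy), ("normal", normal)):
    print(label, hits(text))
```

The second sample is ordinary mail that happens to list six file names, which
is the kind of ham the pattern cannot tell apart from missing spaces.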

Results against my corpus:

DUMB_PERIODS -- 5029s/1518h of 100794 corpus (82099s/18695h) 02/16/04
DUMB_PERIODS -- suggested score: 0.184 (of 5.0)

OVERALL%   SPAM%     HAM%     S/O    RANK   SCORE  NAME
 100794    82099    18695    0.815   0.00    0.00  (all messages)
100.000  81.4523  18.5477    0.815   0.00    0.00  (all messages as %)

  6.495   6.1255   8.1198    0.430   0.00    2.00  DUMB_PERIODS

It matches 8% of my ham, and only 6% of my spam.
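
The figures in the table can be rechecked from the raw hit counts. This is
a small Python sketch, assuming S/O is the spam hit rate divided by the sum
of the spam and ham hit rates; that reading reproduces the 0.430 shown for
DUMB_PERIODS. The function names are my own.

```python
def percent(hits: int, total: int) -> float:
    """Hit rate as a percentage of the given corpus size."""
    return 100.0 * hits / total


def s_over_o(spam_pct: float, ham_pct: float) -> float:
    """Assumed S/O definition: spam rate over (spam rate + ham rate)."""
    return spam_pct / (spam_pct + ham_pct)


# Counts taken from the DUMB_PERIODS summary line above.
spam_hits, ham_hits = 5029, 1518
spam_total, ham_total = 82099, 18695

spam_pct = percent(spam_hits, spam_total)
ham_pct = percent(ham_hits, ham_total)
overall_pct = percent(spam_hits + ham_hits, spam_total + ham_total)
ratio = s_over_o(spam_pct, ham_pct)

print(f"{overall_pct:.3f} {spam_pct:.4f} {ham_pct:.4f} {ratio:.3f}")
```

An S/O below 0.5 means the rule fires on a larger share of ham than of spam,
so here it counts as evidence against a message being spam.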

Bob Menschel






Re: Re[2]: possible bayes poison error to use against them?

Posted by Loren Wilton <lw...@earthlink.net>.
> It matches 8% of my ham, and only 6% of my spam.
>
> Bob Menschel

I think we can safely qualify that as a pretty bad rule!  It might be
possible to tune it by twiddling the word length values some, but I doubt it
is worth the effort unless nothing better can be found.  I don't have the
tools to do corpus checks with my tiny Linux machine, and I doubt that
anyone else would want to waste the hours fiddling with that to try to
improve it.

I *knew* there was a reason I didn't want to put it on my own machine...
:-)

BTW, I seem to be having some luck with a rule that checks for my email
address in the To and Cc lists and looks to see if the optional name in
front of it is correct.  I only gave this a couple points since it will
obviously fail on all mailing lists, but I still generally end up with a
negative score from bayes or whitelist.  And it adds a couple of points to a
whole lot of real spams.  Can't really say yet how worthwhile this rule is,
and it is certainly a difficult one to implement without user-specific
rules.
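
The logic of that check can be sketched outside SpamAssassin. The following
is Python, not the actual rule, and it is only one reading of the
description: the check fires when the address is missing from To and Cc (as
on list mail) or when it appears with a display name that is not the
expected one. MY_ADDRESS, MY_NAMES and the function name are hypothetical.

```python
from email.utils import getaddresses

MY_ADDRESS = "me@example.com"   # hypothetical address
MY_NAMES = {"jane doe"}         # hypothetical accepted display names


def addressed_incorrectly(header_values: list[str]) -> bool:
    """True if To/Cc lack my address or pair it with a wrong display name."""
    values = [v for v in header_values if v]  # skip missing headers
    for name, addr in getaddresses(values):
        if addr.lower() == MY_ADDRESS:
            shown = name.strip().lower()
            # The display name is optional, so a bare address is accepted.
            return bool(shown) and shown not in MY_NAMES
    return True  # address not found at all, e.g. mailing-list traffic


print(addressed_incorrectly(["Jane Doe <me@example.com>"]))
print(addressed_incorrectly(["Valued Customer <me@example.com>"]))
print(addressed_incorrectly(["users@lists.example.org"]))
```

Because the accepted names and address differ per recipient, a check like
this only makes sense as a per-user rule.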

        Loren