Posted to users@spamassassin.apache.org by mo...@medic.chalmers.se on 2004/09/07 19:32:20 UTC
Re: Re[2]: *** Please, help to add such a rule
In message <14...@Menschel.net>, Robert Menschel writes:
>Hello Loren, Mario,
>Wednesday, August 25, 2004, 12:39:23 PM, Loren wrote:
>LW> The specific rule you asked for would be written as
>LW> header SUB_UNDERSCORES Subject =~ /__/
>LW> score SUB_UNDERSCORES 0.1
>LW> But don't use it, or at least not with any significant score.
>Well, actually, a quick scan of my corpus, 24k ham and 46k spam, shows 40
>spam hits and no ham hits. IMO that could warrant a SARE score as high as
>0.777 (my email client often gives different results than mass-check
>does, so don't take this as gospel). Expect to see this in my next SARE
>mass-check request, so we can see if it works on other corpora.
I would advise against it. At least one big free email provider
(yahoo.se, not sure about the rest of yahoo) will produce this kind of
subject when you send quoted-printable encoded headers to and from it,
due to a buggy QP-encoding.
Essentially, if there's a space before the word with the QP-encoded
letter in it, it erroneously adds one extra `_'.
This eventually leads to subjects like this one:
Subject: Re: Som man bäddar, _____________________får man ligga...
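To illustrate why the rule fires on such mail, here is a minimal Python sketch. It is not part of SpamAssassin; it only mirrors the /__/ pattern with Python's re module. The helper name hits_rule and the "clean" subject are assumptions made up for the comparison.

```python
import re

# Python equivalent of the rule body: Subject =~ /__/
DOUBLE_UNDERSCORE = re.compile(r"__")

def hits_rule(subject: str) -> bool:
    """Return True if the subject contains two consecutive underscores."""
    return DOUBLE_UNDERSCORE.search(subject) is not None

# The mangled subject quoted above (legitimate mail, so a hit is a false positive).
mangled = "Re: Som man bäddar, _____________________får man ligga..."
# Hypothetical clean version of the same subject, for comparison only.
clean = "Re: Som man bäddar, får man ligga..."

print(hits_rule(mangled))  # True
print(hits_rule(clean))    # False
```

Any run of two or more underscores is enough to match, so a single stray `_' would not trigger the rule, but the padding shown above does.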
//Christer
--
| Tellusgatan 54 | Telefon: Hem 031 - 42 52 03 CTH: 031 - 772 5431 |
| 415 19 Göteborg | Epost: mort@cd.chalmers.se Nalle: +46 (0)707 535757 |
| | WWW: http://www.cd.chalmers.se/~mort/ |
"An NT server can be run by an idiot, and usually is." -- Tom Holub, a.h.b-o-i
Re[4]: *** Please, help to add such a rule
Posted by Robert Menschel <Ro...@Menschel.net>.
Hello mort+spamassassin,
Tuesday, September 7, 2004, 10:32:20 AM, you wrote:
> In message <14...@Menschel.net>, Robert Menschel writes:
>>LW> header SUB_UNDERSCORES Subject =~ /__/
>>LW> score SUB_UNDERSCORES 0.1
>>LW> But don't use it, or at least not with any significant score.
>>Well, actually, a quick scan of my corpus, 24k ham and 46k spam, shows 40
>>spam hits and no ham hits. IMO that could warrant a SARE score as high as
>>0.777 (my email client often gives different results than mass-check
>>does, so don't take this as gospel). Expect to see this in my next SARE
>>mass-check request, so we can see if it works on other corpora.
> I would advise against it. At least one big free email provider
> (yahoo.se, not sure about the rest of yahoo) will produce this kind of
> subject when you send quoted-printable encoded headers to and from it,
> due to a buggy QP-encoding.
Can you send me one or two examples of this for my corpus (with full
headers)? As mentioned above, the rule has done well within SARE's
testing:
> header SARE_SUB_2UNDERSCORES Subject =~ /__/
> describe SARE_SUB_2UNDERSCORES Subject contains consecutive underscores
> score SARE_SUB_2UNDERSCORES 0.652
> #hist SARE_SUB_2UNDERSCORES Loren Wilton in response to SA-Users query Aug 26 2004
> #counts SARE_SUB_2UNDERSCORES 31s/0h of 64199 corpus (39383s/24816h RM) 08/28/04
> #counts SARE_SUB_2UNDERSCORES 13s/0h of 18651 corpus (16120s/2531h MY) 08/29/04
> #counts SARE_SUB_2UNDERSCORES 8s/2h of 38751 corpus (15270s/23481h JH-SA3.0rc1) 08/30/04
That's only 2 ham vs 52 spam. If you have more counter-examples we'd like
to include them in our scoring algorithm, to help avoid FPs.
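For anyone who wants to check the arithmetic, the totals can be reproduced with a short Python sketch. The per-corpus numbers are copied from the #counts lines above; the dictionary layout and variable names are assumptions for illustration, and this is not how SARE's scoring is actually computed.

```python
# (spam hits, ham hits) per corpus, copied from the #counts lines above
counts = {
    "RM": (31, 0),
    "MY": (13, 0),
    "JH-SA3.0rc1": (8, 2),
}

spam_hits = sum(s for s, _ in counts.values())
ham_hits = sum(h for _, h in counts.values())

# Fraction of all rule hits that landed on spam
spam_ratio = spam_hits / (spam_hits + ham_hits)

print(spam_hits, ham_hits, round(spam_ratio, 3))  # 52 2 0.963
```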
Thanks.
Bob Menschel