Posted to users@spamassassin.apache.org by mo...@medic.chalmers.se on 2004/09/07 19:32:20 UTC

Re: Re[2]: *** Please, help to add such a rule

In message <14...@Menschel.net>, Robert Menschel writes:
>Hello Loren, Mario,
>Wednesday, August 25, 2004, 12:39:23 PM, Loren wrote:
>LW> The specific rule you  asked for would be written as
>LW> header SUB_UNDERSCORES    Subject =~ /__/
>LW> score    SUB_UNDERSCORES    0.1
>LW> But don't use it, or at least not with any significant score.
>Well, actually, a quick scan of my corpus, 24k ham and 46k spam, shows 40
>spam hits and no ham hits. IMO that could warrant a SARE score as high as
>0.777 (my email client often gives different results than mass-check
>does, so don't take this as gospel). Expect to see this in my next SARE
>mass-check request, so we can see if it works on other corpora.

I would advise against it. At least one big free email provider
(yahoo.se, not sure about the rest of yahoo) will produce this kind of
subject when you send quoted-printable encoded headers to and from it,
due to a buggy QP-encoding.

Essentially, if there's a space before the word with the QP-encoded
letter in it, it erroneously adds one extra `_'.

This eventually leads to subjects like this:
Subject: Re: Som man bäddar, _____________________får man ligga...
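
To illustrate why such a subject trips the rule, here is a minimal Python sketch. It assumes the mechanism described above: in an RFC 2047 Q-encoded word a `_' stands for a space, so a well-formed header decodes cleanly, while underscores left behind as literal text survive and match /__/. The sample encoded-word and the helper name decode_subject are made up for illustration; the damaged subject is the one quoted above, not a captured yahoo.se header.

```python
import re
from email.header import decode_header, make_header

# The pattern from the proposed rule: two consecutive underscores.
DOUBLE_UNDERSCORE = re.compile(r"__")


def decode_subject(raw: str) -> str:
    """Decode any RFC 2047 encoded-words in a raw Subject value."""
    return str(make_header(decode_header(raw)))


# Well-formed Q-encoding: each "_" inside the encoded-word means a space.
good_raw = "=?iso-8859-1?Q?f=E5r_man_ligga?="
good = decode_subject(good_raw)

# The subject as it ends up after the buggy re-encoding: the underscores
# are now literal text and no longer decode to spaces.
damaged = "Re: Som man bäddar, _____________________får man ligga..."

print(repr(good), bool(DOUBLE_UNDERSCORE.search(good)))
print(repr(damaged), bool(DOUBLE_UNDERSCORE.search(damaged)))
```

The first subject does not match the pattern, the second does, even though both come from ordinary ham.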

//Christer

-- 
| Tellusgatan 54    | Telefon: Hem 031 - 42 52 03     CTH: 031 - 772 5431     |
| 415 19 Göteborg   | Epost:   mort@cd.chalmers.se  Nalle: +46 (0)707 535757  |
|                   | WWW:     http://www.cd.chalmers.se/~mort/               |
"An NT server can be run by an idiot, and usually is." -- Tom Holub, a.h.b-o-i



Re[4]: *** Please, help to add such a rule

Posted by Robert Menschel <Ro...@Menschel.net>.
Hello mort+spamassassin,

Tuesday, September 7, 2004, 10:32:20 AM, you wrote:

> In message <14...@Menschel.net>, Robert Menschel writes:
>>LW> header SUB_UNDERSCORES    Subject =~ /__/
>>LW> score    SUB_UNDERSCORES    0.1
>>LW> But don't use it, or at least not with any significant score.

>>Well, actually, a quick scan of my corpus, 24k ham and 46k spam, shows 40
>>spam hits and no ham hits. IMO that could warrant a SARE score as high as
>>0.777 (my email client often gives different results than mass-check
>>does, so don't take this as gospel). Expect to see this in my next SARE
>>mass-check request, so we can see if it works on other corpora.

> I would advise against it. At least one big free email provider
> (yahoo.se, not sure about the rest of yahoo) will produce this kind of
> subject when you send quoted-printable encoded headers to and from it,
> due to a buggy QP-encoding.

Can you send me one or two examples of this for my corpus (with full
headers)? As mentioned above, the rule has done well in SARE's
testing:
> header    SARE_SUB_2UNDERSCORES    Subject =~ /__/
> describe  SARE_SUB_2UNDERSCORES    Subject contains consecutive underscores
> score     SARE_SUB_2UNDERSCORES    0.652
> #hist     SARE_SUB_2UNDERSCORES    Loren Wilton in response to SA-Users query Aug 26 2004
> #counts   SARE_SUB_2UNDERSCORES    31s/0h of 64199 corpus (39383s/24816h RM) 08/28/04
> #counts   SARE_SUB_2UNDERSCORES    13s/0h of 18651 corpus (16120s/2531h MY) 08/29/04
> #counts   SARE_SUB_2UNDERSCORES    8s/2h of 38751 corpus (15270s/23481h JH-SA3.0rc1) 08/30/04
That's only 2 ham vs 52 spam. If you have more counter-examples we'd like
to include them in our scoring algorithm, to help avoid FPs.
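
The totals can be re-derived from the three #counts lines. A minimal Python sketch follows; the regex and the function name tally are illustrative only and are not part of SARE's or mass-check's tooling.

```python
import re

# The three #counts lines exactly as quoted in the rule above.
COUNTS = """\
#counts   SARE_SUB_2UNDERSCORES    31s/0h of 64199 corpus (39383s/24816h RM) 08/28/04
#counts   SARE_SUB_2UNDERSCORES    13s/0h of 18651 corpus (16120s/2531h MY) 08/29/04
#counts   SARE_SUB_2UNDERSCORES    8s/2h of 38751 corpus (15270s/23481h JH-SA3.0rc1) 08/30/04
"""

# Captures: spam hits, ham hits, spam messages in corpus, ham messages in corpus.
LINE = re.compile(r"(\d+)s/(\d+)h of \d+ corpus \((\d+)s/(\d+)h")


def tally(text: str) -> tuple[int, int, int, int]:
    """Sum hits and corpus sizes over all #counts lines in text."""
    totals = [0, 0, 0, 0]
    for match in LINE.finditer(text):
        for i, value in enumerate(match.groups()):
            totals[i] += int(value)
    return tuple(totals)


spam_hits, ham_hits, spam_total, ham_total = tally(COUNTS)
print(f"{spam_hits} spam hits of {spam_total}, {ham_hits} ham hits of {ham_total}")
```

This gives 52 spam hits against 2 ham hits across the three corpora, matching the figures above.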

Thanks.

Bob Menschel