Posted to users@spamassassin.apache.org by Chris Lear <ch...@laculine.com> on 2005/05/20 11:27:00 UTC

SARE_CHARSET_W1251 and SARE_FROM_CHAR_W1251

I've been running quite a lot of sare rules on a site-wide SA
installation for a month or two now. I've been keeping a fairly close
eye on it, and there have been few false positives generally.

But today I noticed that several e-mails are hitting both
SARE_CHARSET_W1251 and SARE_FROM_CHAR_W1251. These are ham, sent from
(one specific address in) Ukraine to a Ukrainian in England, written in
English.
The scoring is such that the e-mail gets a score of 3.333 PLUS 4.0 - so
only bayes saves it from being rejected (we reject at >5.5).
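
To make the arithmetic concrete, here is a minimal sketch of the totals involved. The two rule scores and the 5.5 threshold are the ones quoted in the message; the variable names are illustrative, and the contribution from Bayes and any other rules is not given in the thread, so it is left as an unknown.

```python
# The two scores quoted in the message ("3.333 PLUS 4.0").
charset_rule_scores = [3.333, 4.0]
REJECT_ABOVE = 5.5  # site policy from the message: reject when total > 5.5

# Points contributed by the two charset rules alone.
charset_total = round(sum(charset_rule_scores), 3)

# What all remaining rules (Bayes included) must sum to, at most,
# for the message to stay at or under the rejection threshold.
needed_offset = round(REJECT_ABOVE - charset_total, 3)

print(f"charset rules alone: {charset_total}")
print(f"other rules must sum to at most: {needed_offset}")
```

In other words, the two rules on their own already exceed the threshold, and the rest of the rule set has to contribute a net negative score for the mail to get through.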

I can re-score these rules (or remove sare_header0, which will lower the
scores anyway), but I have 2 questions:
- Is this a slightly unfair double-scoring?
- Are there any other similar rules I should worry about, given that
some Russian mail to this server is ham?

--
Chris

Re: SARE_CHARSET_W1251 and SARE_FROM_CHAR_W1251

Posted by Chris Lear <ch...@laculine.com>.
* John Wilcock wrote (05/20/05 12:15):
> Chris Lear wrote:
>> They're in my header0.cf from sare/rules du jour. And in header.cf with
>> a lower score as well. Have I got the wrong files?
> 
> Methinks you have an old header0.cf that is no longer being updated - 
> these rules aren't in the current header0 on rulesemporium.com.

OK, thanks. I'll try to find out what's wrong with my Rules du Jour.

> 
> And in any case you shouldn't be using header and header0 together...

I didn't know that. I'll fix that as well.

Thanks for your help.

--
Chris

Re: SARE_CHARSET_W1251 and SARE_FROM_CHAR_W1251

Posted by John Wilcock <jo...@tradoc.fr>.
Chris Lear wrote:
> They're in my header0.cf from sare/rules du jour. And in header.cf with
> a lower score as well. Have I got the wrong files?

Methinks you have an old header0.cf that is no longer being updated - 
these rules aren't in the current header0 on rulesemporium.com.

And in any case you shouldn't be using header and header0 together...

John.

-- 
-- Over 2500 webcams from ski resorts around the world - www.snoweye.com
-- Translate your technical documents and web pages    - www.tradoc.fr


Re: SARE_CHARSET_W1251 and SARE_FROM_CHAR_W1251

Posted by Chris Lear <ch...@laculine.com>.
* Robert Menschel wrote (05/20/05 15:13):
> Hello Chris, John,
> 
> Friday, May 20, 2005, 3:47:55 AM, you wrote:
> 
>>>> I can re-score these rules (or remove sare_header0, which will lower the
>>>> scores anyway), but I have 2 questions:
>>>> - Is this a slightly unfair double-scoring?
>>>> - Are there any other similar rules I should worry about, given that
>>>> some Russian mail to this server is ham?
>>> 
>>> These are actually in the header1 file, not header0, but surely they
>>> ought to be moved to the 70_sare_header_eng.cf as they hit non-English
>>> ham. Bob?
> 
> CL> They're in my header0.cf from sare/rules du jour. And in header.cf with
> CL> a lower score as well. Have I got the wrong files?
> 
> Yes, your header0 is old.  Both rules are in header1 in the current
> versions. You need to fix your RDJ for header0, or just delete it,
> since header0 through header3 are included in header.cf.
> 
> Yes, you can and maybe should provide a lower score, at least
> temporarily.
> 
> Yes, they should be moved to header_eng, and will be this weekend.

Thanks for all this. I've been educated.

> 
> Meanwhile, is it possible for you to send me some samples of the ham?
> If I add that to my corpus, it'll be taken into account in the next
> rescoring.

Sent under separate cover.

--
Chris

Re[2]: SARE_CHARSET_W1251 and SARE_FROM_CHAR_W1251

Posted by Robert Menschel <Ro...@Menschel.net>.
Hello Chris, John,

Friday, May 20, 2005, 3:47:55 AM, you wrote:

>>> I can re-score these rules (or remove sare_header0, which will lower the
>>> scores anyway), but I have 2 questions:
>>> - Is this a slightly unfair double-scoring?
>>> - Are there any other similar rules I should worry about, given that
>>> some Russian mail to this server is ham?
>> 
>> These are actually in the header1 file, not header0, but surely they
>> ought to be moved to the 70_sare_header_eng.cf as they hit non-English
>> ham. Bob?

CL> They're in my header0.cf from sare/rules du jour. And in header.cf with
CL> a lower score as well. Have I got the wrong files?

Yes, your header0 is old.  Both rules are in header1 in the current
versions. You need to fix your RDJ for header0, or just delete it,
since header0 through header3 are included in header.cf.
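
Because header.cf already includes header0 through header3, installing both means the same rule can be scored in more than one file. The following is a rough sketch of how one might list such rules. The function name and the directory layout are assumptions for illustration, not part of Rules du Jour or SpamAssassin, and the parser only looks at simple `score NAME VALUE...` lines.

```python
import re
from collections import defaultdict
from pathlib import Path

# Matches lines such as: "score     SARE_FROM_CHAR_W1251     4.000"
SCORE_LINE = re.compile(r"^\s*score\s+(\S+)\s+(\S.*?)\s*$")


def find_multiply_scored_rules(rules_dir):
    """Return {rule: {filename: score text}} for rules scored in 2+ .cf files."""
    seen = defaultdict(dict)
    for cf_file in sorted(Path(rules_dir).glob("*.cf")):
        for line in cf_file.read_text(errors="replace").splitlines():
            # Drop trailing comments; lines like "#counts ..." become empty.
            code = line.split("#", 1)[0]
            match = SCORE_LINE.match(code)
            if match:
                rule, score_text = match.groups()
                seen[rule][cf_file.name] = score_text
    return {rule: files for rule, files in seen.items() if len(files) > 1}
```

Calling `find_multiply_scored_rules()` on the directory holding the downloaded rule files returns each rule that is scored in more than one file, with the score text from each, which is the same information as grepping for one rule name but across the whole rule set.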

Yes, you can and maybe should provide a lower score, at least
temporarily.

Yes, they should be moved to header_eng, and will be this weekend.

Meanwhile, is it possible for you to send me some samples of the ham?
If I add that to my corpus, it'll be taken into account in the next
rescoring.

Bob Menschel




Re: SARE_CHARSET_W1251 and SARE_FROM_CHAR_W1251

Posted by Chris Lear <ch...@laculine.com>.
* John Wilcock wrote (05/20/05 10:51):
> Chris Lear wrote:
>> But today I noticed that several e-mails are hitting both
>> SARE_CHARSET_W1251 and SARE_FROM_CHAR_W1251. These are ham, sent from
>> (one specific address in) Ukraine to a Ukrainian in England, written in
>> English.
>> The scoring is such that the e-mail gets a score of 3.333 PLUS 4.0 - so
>> only bayes saves it from being rejected (we reject at >5.5).
>> 
>> I can re-score these rules (or remove sare_header0, which will lower the
>> scores anyway), but I have 2 questions:
>> - Is this a slightly unfair double-scoring?
>> - Are there any other similar rules I should worry about, given that
>> some Russian mail to this server is ham?
> 
> These are actually in the header1 file, not header0, but surely they 
> ought to be moved to the 70_sare_header_eng.cf as they hit non-English 
> ham. Bob?

They're in my header0.cf from sare/rules du jour. And in header.cf with
a lower score as well. Have I got the wrong files?

RulesDuJour $ grep SARE_FROM_CHAR_W1251 *
70_sare_header.cf:header    SARE_FROM_CHAR_W1251     From:raw =~ /\=\?Windows-1251\?/i
70_sare_header.cf:describe  SARE_FROM_CHAR_W1251     Displays in unexpected charset
70_sare_header.cf:score     SARE_FROM_CHAR_W1251     1.666
70_sare_header.cf:#ham      SARE_FROM_CHAR_W1251     Found in some Russian ham
70_sare_header.cf:#hist     SARE_FROM_CHAR_W1251     Created by Bob Menschel May 17 2004
70_sare_header.cf:#counts   SARE_FROM_CHAR_W1251     245s/4h of 238550 corpus (112525s/126025h RM) 02/28/05
70_sare_header.cf:#counts   SARE_FROM_CHAR_W1251     640s/0h of 54176 corpus (16997s/37179h JH-3.01) 02/01/05
70_sare_header.cf:#counts   SARE_FROM_CHAR_W1251     0s/0h of 17050 corpus (14617s/2433h MY) 08/08/04
70_sare_header0.cf:header    SARE_FROM_CHAR_W1251     From:raw =~ /\=\?Windows-1251\?/i
70_sare_header0.cf:describe  SARE_FROM_CHAR_W1251     Displays in unexpected charset
70_sare_header0.cf:score     SARE_FROM_CHAR_W1251     4.000
70_sare_header0.cf:#stype    SARE_FROM_CHAR_W1251     spamgg
70_sare_header0.cf:#hist     SARE_FROM_CHAR_W1251     Created by Bob Menschel May 17 2004
70_sare_header0.cf:#counts   SARE_FROM_CHAR_W1251     180s/0h of 66979 corpus (41757s/25222h RM) 09/04/04
70_sare_header0.cf:#counts   SARE_FROM_CHAR_W1251     209s/0h of 38398 corpus (14914s/23484h JH) 08/14/04 TM2 SA3.0-pre2
70_sare_header0.cf:#counts   SARE_FROM_CHAR_W1251     0s/0h of 17050 corpus (14617s/2433h MY) 08/08/04
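
The header rule in this output is a plain regular expression on the raw From header, which suggests why mail written in English can still hit it. The pattern looks only for a MIME encoded-word that declares the Windows-1251 charset, for example in the sender's display name, and not at the language of the body. Below is a minimal Python sketch of the same pattern. The sample From headers are made up for illustration, and Python's `re` module stands in here for the Perl regex engine that SpamAssassin uses.

```python
import re

# Same pattern as the rule above: /\=\?Windows-1251\?/i
W1251_ENCODED_WORD = re.compile(r"=\?Windows-1251\?", re.IGNORECASE)

# Hypothetical raw From headers (not taken from the thread).
samples = {
    "windows-1251 display name": "=?windows-1251?B?yOLg7Q==?= <sender@example.com>",
    "plain ASCII display name": "Ivan <sender@example.com>",
    "utf-8 display name": "=?UTF-8?B?0JjQstCw0L0=?= <sender@example.com>",
}

# Record which sample headers the pattern matches.
hits = {label: bool(W1251_ENCODED_WORD.search(raw)) for label, raw in samples.items()}

for label, hit in hits.items():
    print(f"{label}: {'hit' if hit else 'no hit'}")
```

Only the first sample matches, so under these assumptions a sender whose mail client encodes the display name as Windows-1251 would trigger the rule on every message, whatever language the text is in.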


--
Chris

Re: SARE_CHARSET_W1251 and SARE_FROM_CHAR_W1251

Posted by John Wilcock <jo...@tradoc.fr>.
Chris Lear wrote:
> But today I noticed that several e-mails are hitting both
> SARE_CHARSET_W1251 and SARE_FROM_CHAR_W1251. These are ham, sent from
> (one specific address in) Ukraine to a Ukrainian in England, written in
> English.
> The scoring is such that the e-mail gets a score of 3.333 PLUS 4.0 - so
> only bayes saves it from being rejected (we reject at >5.5).
> 
> I can re-score these rules (or remove sare_header0, which will lower the
> scores anyway), but I have 2 questions:
> - Is this a slightly unfair double-scoring?
> - Are there any other similar rules I should worry about, given that
> some Russian mail to this server is ham?

These are actually in the header1 file, not header0, but surely they 
ought to be moved to the 70_sare_header_eng.cf as they hit non-English 
ham. Bob?

And yes, the double scoring effect does seem rather over the top to me, 
even for sites that don't expect any Cyrillic ham.

John.
