Posted to users@spamassassin.apache.org by Heinrich Christian Peters <He...@nurfuerspam.de> on 2008/10/21 11:56:18 UTC

Tuning the bayes-system?

Hello,

I am using a system-wide SpamAssassin setup (MailScanner). Nearly all my
spam mails are detected correctly (~0.1% are not), with no FPs. But some
mails, especially German spam, are "wrongly" classified by the Bayes
system. Should I train these mails manually as spam if they are not
autolearned? Or should I train *all* my spam mails regularly?

Thanks, Yours,
Heiner


Email:     3870  Autolearn:  3575  Cached:   126  AvgScore:  29.35
Spam:      3562  Autolearn:  3411  Cached:   113  AvgScore:  32.22
Ham:        308  Autolearn:   164  Cached:    13  AvgScore:  -3.81

TOP SPAM RULES FIRED    (50/432)
========================================================================
RANK    RULE NAME          SCORE   COUNT %OFMAIL %OFSPAM  %OFHAM   BAYES
------------------------------------------------------------------------
  18    BAYES_99            5.40    1970   50.90   55.31    0.00  100.00
  27    BAYES_50            0.00     801   21.16   22.49    5.84   97.80
  41    BAYES_80            2.00     259    6.69    7.27    0.00  100.00
  43    BAYES_95            3.00     248    6.41    6.96    0.00  100.00

TOP HAM RULES FIRED     (50/78)
========================================================================
RANK    RULE NAME          SCORE   COUNT %OFMAIL %OFSPAM  %OFHAM   BAYES
------------------------------------------------------------------------

   2    BAYES_00           -4.90     253    8.01    1.60   82.14   81.61
  12    BAYES_50            0.00      18   21.16   22.49    5.84    2.20
  21    BAYES_40           -0.18       8    0.93    0.79    2.60   22.22
  32    BAYES_05           -1.11       4    0.62    0.56    1.30   16.67
  43    BAYES_20           -0.74       3    0.65    0.62    0.97   12.00

conf:
bayes_expiry_max_db_size 1500000
bayes_auto_learn_threshold_spam 7.5


Re: Not a reply: spamassassin stats (was Re: Tuning the bayes-system?)

Posted by Heinrich Christian Peters <He...@nurfuerspam.de>.
Moin,

Koopmann, Jan-Peter wrote:
> can you share your new script with the MailScanner changes with us?

Of course I can... But the script will only work with German reports [1],
so you would have to change it. I am no Perl guru, so changes are welcome!

You can find the script here:
<http://www.heinrich-peters.de/mailscanner/sa-stats_MailScanner.txt>


[1]:
> X-heinrich-peters.zz-MailScanner-SpamCheck: not spam,
> 	SpamAssassin (nicht zwischen gespeichert, Wertung=-5.563,
> 	benoetigt 5, autolearn=not spam, AWL -0.66, BAYES_00 -4.90,
> 	NO_RELAYS -0.00)
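
Not the script itself, just a sketch of the kind of parsing that depends
on the report language: it pulls the score and the Bayes rule out of a
report like the one quoted in [1]. The keyword "Wertung=" comes from the
German report shown here; other report languages use a different keyword,
so that part would need changing. The helper names report_score and
report_bayes_rule are made up for illustration, and grep -o is assumed to
be available (it is in GNU and BSD grep).

```shell
#!/bin/sh
# report_score: read a SpamCheck report on stdin and print the number that
# follows "Wertung=" (the keyword used in the German report shown above).
report_score() {
    sed -n 's/.*Wertung=\(-\{0,1\}[0-9][0-9.]*\).*/\1/p'
}

# report_bayes_rule: read a report on stdin and print the first BAYES_xx
# rule name that appears in it.
report_bayes_rule() {
    grep -Eo 'BAYES_[0-9]+' | head -n 1
}
```

Fed the header from [1], report_score prints -5.563 and report_bayes_rule
prints BAYES_00.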


RE: Re: Not a reply: spamassassin stats (was Re: Tuning the bayes-system?)

Posted by "Koopmann, Jan-Peter" <ja...@koopmann.eu>.
Hi,

can you share your new script with the MailScanner changes with us?

Kind regards,
  JP


Re: Not a reply: spamassassin stats (was Re: Tuning the bayes-system?)

Posted by Heinrich Christian Peters <He...@nurfuerspam.de>.
Hello Mathias,

I am using a variant of the sa-stats script:
<http://www.rulesemporium.com/programs/sa-stats-1.0.txt>

I had to change some things to get it to work with my MailScanner setup.

Bye,
Heiner

Mathias Homann wrote:
> On Tuesday, 21 October 2008, Heinrich Christian Peters wrote:
> 
>>  [... my SpamAssassin Statistics ...]
> 
> What are you using to generate those stats?
> I'd like to have that on my server as well.
> 
> 
> bye,
> MH
> 
> 


Not a reply: spamassassin stats (was Re: Tuning the bayes-system?)

Posted by Mathias Homann <ad...@eregion.de>.
On Tuesday, 21 October 2008, Heinrich Christian Peters wrote:

> Email:     3870  Autolearn:  3575  Cached:   126  AvgScore:  29.35
> Spam:      3562  Autolearn:  3411  Cached:   113  AvgScore:  32.22
> Ham:        308  Autolearn:   164  Cached:    13  AvgScore:  -3.81
>
> TOP SPAM RULES FIRED    (50/432)
> ========================================================================
> RANK    RULE NAME          SCORE   COUNT %OFMAIL %OFSPAM  %OFHAM   BAYES
> ------------------------------------------------------------------------
>   18    BAYES_99            5.40    1970   50.90   55.31    0.00  100.00
>   27    BAYES_50            0.00     801   21.16   22.49    5.84   97.80
>   41    BAYES_80            2.00     259    6.69    7.27    0.00  100.00
>   43    BAYES_95            3.00     248    6.41    6.96    0.00  100.00
>
> TOP HAM RULES FIRED     (50/78)
> ========================================================================
> RANK    RULE NAME          SCORE   COUNT %OFMAIL %OFSPAM  %OFHAM   BAYES
> ------------------------------------------------------------------------
>    2    BAYES_00           -4.90     253    8.01    1.60   82.14   81.61
>   12    BAYES_50            0.00      18   21.16   22.49    5.84    2.20
>   21    BAYES_40           -0.18       8    0.93    0.79    2.60   22.22
>   32    BAYES_05           -1.11       4    0.62    0.56    1.30   16.67
>   43    BAYES_20           -0.74       3    0.65    0.62    0.97   12.00
>
> conf:
> bayes_expiry_max_db_size 1500000
> bayes_auto_learn_threshold_spam 7.5

What are you using to generate those stats?
I'd like to have that on my server as well.


bye,
MH


-- 
gpg key fingerprint: 5F64 4C92 9B77 DE37 D184  C5F9 B013 44E7 27BD 763C

Re: Tuning the bayes-system?

Posted by Heinrich Christian Peters <He...@nurfuerspam.de>.
Karsten Bräckelmann wrote:
> On Tue, 2008-10-21 at 14:32 +0200, Heinrich Christian Peters wrote:
>> see, if the mail was classified as "BAYES_50" it is in nearly every case
>> spam, so I think, the mails are wrongly classified, they should be
>> BAYES_60 or higher...
> 
> Again, BAYES_50 is neither classified as ham nor spam. According to Bayes
> there's just no indication to classify it. Thus, IMHO it is not wrongly
> classified. Think about it this way -- the absence of a given URL from
> both black and white lists does not constitute a false hit for either
> list.

Mmh, OK, I think I get it...


>>> Since you merely mentioned "German spam", the details might make a
>>> difference, though. What are you talking about exactly?
>>
>> German is my first language and nearly all (ham-)mails I get, are
>> German.  The few English (ham-)mails I get are correctly classified as
>> BAYES_10 or below.
> 
>> The (spam-)mails I am talking about are eg.:
>>  - phishing-mails (today: DABbank AG)
>>  - casino (Fiesta Club Casino, Euro Club Casino)
> 
> These are not exactly spam IMHO. They are phishing mail and trojan URL
> carrying mail respectively. ClamAV and the SaneSecurity phish sigs weed
> those out before SA even processes the mail in my setup.

MailScanner starts with the spam detection and follows up with the
content analysis.
I think phishing and trojan URL carrying mails are spam, too, but maybe
a special type of spam.


> With a notable exception of the very recent DAB Bank phishes, which
> started today. Massively. Apparently there's no AV sig yet for those.
> However, even though Bayes didn't catch them for me either, they
> typically score around *20* here, with hits in XBL, PBL and URIBL_BLACK.
> If you really have a problem with these, I guess Bayes isn't your main
> issue. ;)

They score very similarly here, 20 +-5.


>>  - pharmacy, mostly caught by ZMIde_Pharmacy
> 
> German pharmacy spam. Similar to the above for me. Hits blacklists
> galore, Bayes of 80 or higher. The bulk of these I get features rather
> static text anyway -- do you really have a problem training them in
> Bayes?
> 
> Since you are using site-wide Bayes, are you sure that your manual
> training uses the *same* Bayes DB? A common oops, and you'd effectively
> end up with auto-learning only, no manual training on low scorers.

Since I am using "70_zmi_german.cf.zmi.sa-update.dostech.net" these
mails score very high (70+).
My MailScanner (with exim4) is running under Debian as user Debian-exim.
SpamAssassin is called as this user, too. And I train Bayes as
Debian-exim only.

>>  - "job offers", finance-sector
> 
> Not as easy to catch indeed.

Now my setup catches it, but with "BAYES_20"...:

> X-heinrich-peters.zz-MailScanner-SpamCheck: spam,
> 	SpamAssassin (nicht zwischen gespeichert, Wertung=16.329,
> 	benoetigt 5, autolearn=spam, BAYES_20 -0.74, CTYME_IXHASH 2.50,
> 	DATE_IN_FUTURE_96_XX 1.44, DCC_CHECK 2.17, DIGEST_MULTIPLE 0.00,
> 	GENERIC_IXHASH 4.50, NIXSPAM_IXHASH 2.50, RAZOR2_CHECK 0.50,
> 	RCVD_IN_BL_SPAMCOP_NET 1.96, RCVD_IN_BRBL 1.50, SPF_HELO_PASS -0.00)

I have no problem catching spam. But I am not happy with a Bayes score
below 50 in spam mails. But indeed, this is a /cosmetic/ problem...

Heiner


Re: Tuning the bayes-system?

Posted by Karsten Bräckelmann <gu...@rudersport.de>.
On Tue, 2008-10-21 at 14:32 +0200, Heinrich Christian Peters wrote:
> Hello Karsten/guenther, (?)

Real name, commonly known nick name (and email address). :)


> >> I am using a system-wide spamassassin setup (MailScanner). Nearly all my
> >> spam-mails are detected correctly (~0,1% is not), no FP. But, especially
> >> German spam-mails, are "wrongly" classified by the bayes-system. Should
> > 
> > According to your stats snippets:  BAYES_50 is not "wrongly" classified,
> > but not-classified-at-all. The difference is the very meaning of a
> > Bayesian score of 0.5 -- undecided, neither really spammy nor hammy
> > tokens.
> 
> I see, but what about the 1.6% of spam (around 57 mails) classified
> by the bayes-system as ham (BAYES_00)? And, another thing, as you can

That's mis-classified alright.

> see, if the mail was classified as "BAYES_50" it is in nearly every case
> spam, so I think, the mails are wrongly classified, they should be
> BAYES_60 or higher...

Again, BAYES_50 is neither classified as ham nor spam. According to Bayes
there's just no indication to classify it. Thus, IMHO it is not wrongly
classified. Think about it this way -- the absence of a given URL from
both black and white lists does not constitute a false hit for either
list.


> > Since you merely mentioned "German spam", the details might make a
> > difference, though. What are you talking about exactly?
> 
> German is my first language and nearly all (ham-)mails I get, are
> German.  The few English (ham-)mails I get are correctly classified as
> BAYES_10 or below.

> The (spam-)mails I am talking about are eg.:
>  - phishing-mails (today: DABbank AG)
>  - casino (Fiesta Club Casino, Euro Club Casino)

These are not exactly spam IMHO. They are phishing mail and trojan URL
carrying mail respectively. ClamAV and the SaneSecurity phish sigs weed
those out before SA even processes the mail in my setup.

With a notable exception of the very recent DAB Bank phishes, which
started today. Massively. Apparently there's no AV sig yet for those.
However, even though Bayes didn't catch them for me either, they
typically score around *20* here, with hits in XBL, PBL and URIBL_BLACK.
If you really have a problem with these, I guess Bayes isn't your main
issue. ;)


>  - pharmacy, mostly caught by ZMIde_Pharmacy

German pharmacy spam. Similar to the above for me. Hits blacklists
galore, Bayes of 80 or higher. The bulk of these I get features rather
static text anyway -- do you really have a problem training them in
Bayes?

Since you are using site-wide Bayes, are you sure that your manual
training uses the *same* Bayes DB? A common oops, and you'd effectively
end up with auto-learning only, no manual training on low scorers.
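
One way to check for this is to run `sa-learn --dump magic` once as the
user the mail scanner runs as and once as the user doing the manual
training, and compare the message counters: if nspam/nham differ, two
different databases are in play. The sketch below only parses that
output. The column layout (value in the third column, label at the end
of the line) is an assumption about what sa-learn prints, and
magic_counts is a hypothetical helper name.

```shell
#!/bin/sh
# magic_counts: read `sa-learn --dump magic` output on stdin and print
# "nspam=<n> nham=<n>". Assumes the third column holds the value and the
# line ends in "non-token data: nspam" or "non-token data: nham".
magic_counts() {
    awk '/non-token data: nspam$/ { nspam = $3 }
         /non-token data: nham$/  { nham  = $3 }
         END { printf "nspam=%s nham=%s\n", nspam, nham }'
}
```

Usage would be `sa-learn --dump magic | magic_counts`, run under each of
the two accounts.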


>  - "job offers", finance-sector

Not as easy to catch indeed.


> > Given your timing, my guess is you're talking about the recent flood of
> > German porn spam, advertising cam sites. Even though they are using
> > pretty explicit phrases, these appear to be hard to catch.
> 
> These mails are not the problem, I didn't get them...

Consider yourself lucky. :)

  guenther


-- 
char *t="\10pse\0r\0dtu\0.@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1:
(c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}


Re: Tuning the bayes-system?

Posted by Heinrich Christian Peters <He...@nurfuerspam.de>.
Hello Karsten/guenther, (?)

Thanks for your reply.

Karsten Bräckelmann wrote:
> On Tue, 2008-10-21 at 11:56 +0200, Heinrich Christian Peters wrote:
>> I am using a system-wide spamassassin setup (MailScanner). Nearly all my
>> spam-mails are detected correctly (~0,1% is not), no FP. But, especially
>> German spam-mails, are "wrongly" classified by the bayes-system. Should
> 
> According to your stats snippets:  BAYES_50 is not "wrongly" classified,
> but not-classified-at-all. The difference is the very meaning of a
> Bayesian score of 0.5 -- undecided, neither really spammy nor hammy
> tokens.

I see, but what about the 1.6% of spam (around 57 mails) classified
by the Bayes system as ham (BAYES_00)? And another thing: as you can
see, if a mail was classified as "BAYES_50" it is spam in nearly every
case, so I think these mails are wrongly classified; they should be
BAYES_60 or higher...
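
For reference, the "around 57 mails" figure follows from the stats posted
at the start of the thread (3562 spam mails, BAYES_00 at 1.60 %OFSPAM,
BAYES_50 with 801 spam hits). A quick check with awk; pct_to_count and
count_to_pct are just illustrative helper names:

```shell
#!/bin/sh
# pct_to_count TOTAL PERCENT: how many mails PERCENT of TOTAL corresponds to,
# rounded to a whole number.
pct_to_count() {
    awk -v total="$1" -v pct="$2" 'BEGIN { printf "%.0f\n", total * pct / 100 }'
}

# count_to_pct COUNT TOTAL: COUNT as a percentage of TOTAL, two decimals.
count_to_pct() {
    awk -v count="$1" -v total="$2" 'BEGIN { printf "%.2f\n", count * 100 / total }'
}
```

`pct_to_count 3562 1.60` prints 57, and `count_to_pct 801 3562` prints
22.49, matching the %OFSPAM column for BAYES_50.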


>> I train these mails manually as spam, if they are not autolearned? Or
>> should I train *all* my spam-mails regularly?
> 
> Personally, I prefer to not learn *all* spam, but to omit the lion's
> share of really high scoring stuff. The reason is an attempt to keep the
> number of tokens somewhat sane, and to not bias my Bayes. If everything
> Bayes gets to see is spam, everything will appear spammy.
> 
> If you get a certain class of sneaky spam, you definitely should feed
> that to Bayes. However, if it also loosely resembles your ham [1], you
> had better make sure to train that ham as well.

Up to now, I have trained only the wrongly marked mails manually;
autolearning is working, and the same goes for ham. But, as I said
before, I have no FPs.


> Since you merely mentioned "German spam", the details might make a
> difference, though. What are you talking about exactly?

German is my first language and nearly all (ham-)mails I get, are
German.  The few English (ham-)mails I get are correctly classified as
BAYES_10 or below.
The (spam-)mails I am talking about are e.g.:
 - phishing-mails (today: DABbank AG)
 - casino (Fiesta Club Casino, Euro Club Casino)
 - pharmacy, mostly caught by ZMIde_Pharmacy
 - "job offers", finance-sector


> Given your timing, my guess is you're talking about the recent flood of
> German porn spam, advertising cam sites. Even though they are using
> pretty explicit phrases, these appear to be hard to catch.

These mails are not the problem, I didn't get them...


> If that's the kind of spam you're talking about, check the archives.
> This has been brought up very recently. Not many solutions though, IIRC.
> They are hard to catch, and a few people are working on rules as we
> speak.  HTH
> 
>   guenther
> 
> 
> [1] In this context this means, German spam tends to be sneaky, and
>     German is your users' first language.

Thanks, Yours,
Heiner


Re: Tuning the bayes-system?

Posted by Karsten Bräckelmann <gu...@rudersport.de>.
On Tue, 2008-10-21 at 11:56 +0200, Heinrich Christian Peters wrote:
> Hello,
> 
> I am using a system-wide spamassassin setup (MailScanner). Nearly all my
> spam-mails are detected correctly (~0,1% is not), no FP. But, especially
> German spam-mails, are "wrongly" classified by the bayes-system. Should

According to your stats snippets:  BAYES_50 is not "wrongly" classified,
but not-classified-at-all. The difference is the very meaning of a
Bayesian score of 0.5 -- undecided, neither really spammy nor hammy
tokens.
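
To make the "undecided" band concrete, here is a small sketch that maps
a Bayes probability to the rule name that would fire. The bucket
boundaries are an assumption, taken from memory of the stock rule set
(23_bayes.cf); please check your own installation, since they may differ
between versions. bayes_bucket is a hypothetical helper, not part of
SpamAssassin.

```shell
#!/bin/sh
# bayes_bucket PROB: print the BAYES_xx rule name for a probability in [0,1].
# Boundaries are assumed from the stock rule set; verify against 23_bayes.cf.
bayes_bucket() {
    awk -v p="$1" 'BEGIN {
        if      (p < 0.01) print "BAYES_00"
        else if (p < 0.05) print "BAYES_05"
        else if (p < 0.20) print "BAYES_20"
        else if (p < 0.40) print "BAYES_40"
        else if (p < 0.60) print "BAYES_50"   # undecided: neither hammy nor spammy
        else if (p < 0.80) print "BAYES_60"
        else if (p < 0.95) print "BAYES_80"
        else if (p < 0.99) print "BAYES_95"
        else               print "BAYES_99"
    }'
}
```

`bayes_bucket 0.5` prints BAYES_50: a probability around 0.5 says nothing
either way, which is why that rule scores 0.00 in the stats.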

> I train these mails manually as spam, if they are not autolearned? Or
> should I train *all* my spam-mails regularly?

Personally, I prefer to not learn *all* spam, but to omit the lion's
share of really high scoring stuff. The reason is an attempt to keep the
number of tokens somewhat sane, and to not bias my Bayes. If everything
Bayes gets to see is spam, everything will appear spammy.

If you get a certain class of sneaky spam, you definitely should feed
that to Bayes. However, if it also loosely resembles your ham [1], you
had better make sure to train that ham as well.


Since you merely mentioned "German spam", the details might make a
difference, though. What are you talking about exactly?

Given your timing, my guess is you're talking about the recent flood of
German porn spam, advertising cam sites. Even though they are using
pretty explicit phrases, these appear to be hard to catch.

If that's the kind of spam you're talking about, check the archives.
This has been brought up very recently. Not many solutions though, IIRC.
They are hard to catch, and a few people are working on rules as we
speak.  HTH

  guenther


[1] In this context this means, German spam tends to be sneaky, and
    German is your users' first language.



Re: Tuning the bayes-system?

Posted by Heinrich Christian Peters <He...@nurfuerspam.de>.
Jeff Mincy wrote:
>    From:  Heinrich Christian Peters <He...@nurfuerspam.de>
>    Date:  Wed, 22 Oct 2008 14:43:55 +0200
>    
>    It learns all detected spam mails with a score of 8 (or higher) and not
>    hit by BAYES_70 (or higher).
>
> I would learn all spam messages, including the ones that hit BAYES_70-BAYES_99.
> There are other tokens in the spam message that will be learned,
> and the known tokens in the message will be reinforced as spam.

OK, I will see how many mails would be learned this way - perhaps I
will change it.

Heiner


Re: Tuning the bayes-system?

Posted by Jeff Mincy <je...@delphioutpost.com>.
   From:  Heinrich Christian Peters <He...@nurfuerspam.de>
   Date:  Wed, 22 Oct 2008 14:43:55 +0200
   
   Kai Schaetzl wrote:
   > Just checked what I actually do, here it is:
   > 
   > yesterday=`date -d "-1 day" +"%Y%m%d"`
   > sa-learn --spam --progress /var/spool/MailScanner/quarantine/${yesterday}/spam/
   
   I implemented a solution with sieve and sa-learn-cyrus:
   > if allof (
   > 	header :contains "X-heinrich-peters.zz-MailScanner-SpamScore" "ssssssss",
   > #	not header :contains "X-heinrich-peters.zz-MailScanner-SpamCheck" "autolearn=spam",
   > 	not header :regex "X-heinrich-peters.zz-MailScanner-SpamCheck" "BAYES_(9|8|7)(0|5|9)"
   > ){
   > 	addflag "\\Seen";
   > 	fileinto "INBOX.SpamAssassin.spam";
   > }
   
   It learns all detected spam mails with a score of 8 (or higher) and not
   hit by BAYES_70 (or higher).

I would learn all spam messages, including the ones that hit BAYES_70-BAYES_99.
There are other tokens in the spam message that will be learned,
and the known tokens in the message will be reinforced as spam.

   But now I am unsure about the autolearning. Should I train autolearned
   messages or not? Or, in other words, can spamassassin learn the same
   message twice (to learn faster), if I tell it to do so?

The autolearned messages have already been learned; you do not need to
learn them again. Nothing bad will happen if you do learn a message
again, other than wasting CPU time.

-jeff

Re: Tuning the bayes-system?

Posted by Heinrich Christian Peters <He...@nurfuerspam.de>.
Kai Schaetzl wrote:
> Heinrich Christian Peters wrote on  Wed, 22 Oct 2008 14:43:55 +0200:
> 
>> It learns all detected spam mails with a score of 8 (or higher) and not
>> hit by BAYES_70 (or higher).
> 
> Which means you miss all the spam that got hit by Bayes and scored that
> high and was not autolearned. I think with a well-trained Bayes the above
> will likely not learn much - i.e. you could just skip it ;-) You do
> know that you can also adjust the autolearning threshold?
> 
>> But now I am unsure about the autolearning. Should I train autolearned
>> messages or not? Or, in other words, can spamassassin learn the same
>> message twice (to learn faster), if I tell it to do so?
> 
> As already said: it will just ignore these for learning.

I just want to be sure...


> BTW: Some MUA software will have problems with dots in headers. You either 
> should upgrade to a newer MailScanner or change %org-name% in your 
> MailScanner.conf.

Thanks for the info. I will change %org-name% - so I can keep the original
Debian package.

Heiner


Re: Tuning the bayes-system?

Posted by Kai Schaetzl <ma...@conactive.com>.
Heinrich Christian Peters wrote on  Wed, 22 Oct 2008 14:43:55 +0200:

> It learns all detected spam mails with a score of 8 (or higher) and not
> hit by BAYES_70 (or higher).

Which means you miss all the spam that got hit by Bayes and scored that
high and was not autolearned. I think with a well-trained Bayes the above
will likely not learn much - i.e. you could just skip it ;-) You do
know that you can also adjust the autolearning threshold?
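
The thresholds in question are bayes_auto_learn_threshold_spam (set to
7.5 in the conf posted at the start of the thread) and its counterpart
bayes_auto_learn_threshold_nonspam. A small sketch to see what a given
config file currently sets; show_autolearn_thresholds is a made-up
helper, and which file your setup actually reads is for you to fill in.
If nothing is printed, the defaults apply (12.0 and 0.1, if I remember
the docs correctly).

```shell
#!/bin/sh
# show_autolearn_thresholds FILE: print the autolearn threshold lines that
# are set in FILE, skipping lines that are commented out.
show_autolearn_thresholds() {
    grep -E '^[[:space:]]*bayes_auto_learn_threshold_(spam|nonspam)[[:space:]]' "$1"
}
```

Lowering the spam threshold makes more detected spam get autolearned,
which reduces what is left over for a separate learning step.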

> 
> But now I am unsure about the autolearning. Should I train autolearned
> messages or not? Or, in other words, can spamassassin learn the same
> message twice (to learn faster), if I tell it to do so?

As already said: it will just ignore these for learning.

BTW: Some MUA software will have problems with dots in headers. You either 
should upgrade to a newer MailScanner or change %org-name% in your 
MailScanner.conf.

Kai

-- 
Kai Schätzl, Berlin, Germany
Get your web at Conactive Internet Services: http://www.conactive.com




Re: Tuning the bayes-system?

Posted by Heinrich Christian Peters <He...@nurfuerspam.de>.
Kai Schaetzl wrote:
> Just checked what I actually do, here it is:
> 
> yesterday=`date -d "-1 day" +"%Y%m%d"`
> sa-learn --spam --progress /var/spool/MailScanner/quarantine/${yesterday}/spam/

I implemented a solution with sieve and sa-learn-cyrus:
> if allof (
> 	header :contains "X-heinrich-peters.zz-MailScanner-SpamScore" "ssssssss",
> #	not header :contains "X-heinrich-peters.zz-MailScanner-SpamCheck" "autolearn=spam",
> 	not header :regex "X-heinrich-peters.zz-MailScanner-SpamCheck" "BAYES_(9|8|7)(0|5|9)"
> ){
> 	addflag "\\Seen";
> 	fileinto "INBOX.SpamAssassin.spam";
> }

It learns all detected spam mails with a score of 8 (or higher) that were
not hit by BAYES_70 (or higher).
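
A quick way to see which Bayes rules the :regex test above actually
excludes is to run the same pattern over the rule names. This sketch
assumes the stock rule names (there is a BAYES_60 and a BAYES_80 but, as
far as I know, no BAYES_70), and excluded_by_filter is a made-up helper:

```shell
#!/bin/sh
# excluded_by_filter RULE: succeed if RULE matches the pattern used in the
# sieve filter, i.e. if a mail hitting RULE would be kept out of the folder.
excluded_by_filter() {
    printf '%s\n' "$1" | grep -Eq 'BAYES_(9|8|7)(0|5|9)'
}

# Walk the assumed stock Bayes rule names and report how the pattern treats each.
for rule in BAYES_00 BAYES_05 BAYES_20 BAYES_40 BAYES_50 BAYES_60 BAYES_80 BAYES_95 BAYES_99; do
    if excluded_by_filter "$rule"; then
        echo "$rule: matches the pattern, mail is skipped"
    else
        echo "$rule: no match, mail passes the Bayes test"
    fi
done
```

Under that assumption only BAYES_80, BAYES_95 and BAYES_99 are skipped; a
mail scoring 8 or more that hits BAYES_60 would still be filed for
learning.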

But now I am unsure about the autolearning. Should I train autolearned
messages or not? Or, in other words, can spamassassin learn the same
message twice (to learn faster), if I tell it to do so?

Thanks,
Heiner


Re: Tuning the bayes-system?

Posted by Kai Schaetzl <ma...@conactive.com>.
Heinrich Christian Peters wrote on  Wed, 22 Oct 2008 14:30:13 +0200:

> Why? Won't Bayes get better if I tell spamassassin more clearly which
> mails are spam *and* which are not?

In general, yes. But we were talking about a specific context (MailScanner
quarantine and automated Bayes learning beyond what SA already does),
and in this context learning ham is undesirable (prone to FPs), and the
ham to learn from is unlikely to even exist (do you archive all ham? In
some countries that may even be illegal).

Kai





Re: Tuning the bayes-system?

Posted by Heinrich Christian Peters <He...@nurfuerspam.de>.
Kai Schaetzl wrote:
> Benny Pedersen wrote on Wed, 22 Oct 2008 01:04:42 +0200 (CEST):
> 
>> it will also be good to learn ham in that script
> 
> no, it won't

Why? Won't Bayes get better if I tell spamassassin more clearly which
mails are spam *and* which are not?

Heiner


Re: Tuning the bayes-system?

Posted by Kai Schaetzl <ma...@conactive.com>.
Benny Pedersen wrote on Wed, 22 Oct 2008 01:04:42 +0200 (CEST):

> it will also be good to learn ham in that script

no, it won't

Kai





Re: Tuning the bayes-system?

Posted by Benny Pedersen <me...@junc.org>.
On Tue, October 21, 2008 17:31, Kai Schaetzl wrote:

> works very well. It learns all that (detected) spam that didn't get
> autolearned.

It will also be good to learn ham in that script.


-- 
Benny Pedersen
Need more webspace ? http://www.servage.net/?coupon=cust37098


Re: Tuning the bayes-system?

Posted by Kai Schaetzl <ma...@conactive.com>.
Just checked what I actually do, here it is:

yesterday=`date -d "-1 day" +"%Y%m%d"`
sa-learn --spam --progress /var/spool/MailScanner/quarantine/${yesterday}/spam/

This works very well. It learns all the (detected) spam that didn't get
autolearned.
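
For anyone wanting to put this into cron, here is a slightly more
defensive variant of the same idea, as a sketch only. It skips days for
which no spam quarantine directory exists. The function names and the
QUARANTINE_BASE override are my additions for illustration; the path and
the sa-learn options are the ones from the two lines above.

```shell
#!/bin/sh
# Base directory of the MailScanner quarantine, as used in the script above.
# Can be overridden from the environment (an addition for illustration).
QUARANTINE_BASE=${QUARANTINE_BASE:-/var/spool/MailScanner/quarantine}

# spam_dir_for DAY: print the spam quarantine directory for a YYYYMMDD day.
spam_dir_for() {
    printf '%s/%s/spam/\n' "$QUARANTINE_BASE" "$1"
}

# learn_spam_for DAY: feed that day's quarantined spam to sa-learn,
# or say so on stderr if there is no such directory.
learn_spam_for() {
    dir=$(spam_dir_for "$1")
    if [ -d "$dir" ]; then
        sa-learn --spam --progress "$dir"
    else
        echo "no spam quarantine for $1, nothing to learn" >&2
    fi
}
```

A cron job would then call `learn_spam_for "$(date -d "-1 day" +%Y%m%d)"`
(the -d option needs GNU date), running as the same user that owns the
Bayes database.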


Kai





Re: Tuning the bayes-system?

Posted by Heinrich Christian Peters <He...@nurfuerspam.de>.
Hello Kai,

thanks for your reply.

Kai Schaetzl wrote:
> Heinrich Christian Peters wrote on  Tue, 21 Oct 2008 11:56:18 +0200:
> 
>> Should I train these mails manually as spam, if they are not
>> autolearned? Or should I train *all* my spam-mails regularly?
> 
> You can do both. A mail already trained won't be learned again, but it 
> won't do any harm doing that (for instance if it is easier to train all 
> spam instead of picking the spam that was not autolearned), it just makes 
> the whole training process last a bit longer.

OK, that is good. Perhaps I will automatically train all spam mails with
a score higher than 6 (1 above the spam threshold) and BAYES_00 to
BAYES_50. I think training *all* mails isn't the /right/ way...

> If I remember right I train all spam from the MailScanner spam quarantine 
> with a daily cron script.
> 
> Kai
> 

Heiner


Re: Tuning the bayes-system?

Posted by Kai Schaetzl <ma...@conactive.com>.
Heinrich Christian Peters wrote on  Tue, 21 Oct 2008 11:56:18 +0200:

> But, especially
> German spam-mails, are "wrongly" classified by the bayes-system.

That's normal. As Germans we get a lot of German ham, but only a little
German spam. So, Bayes isn't trained very well on German spam.

> Should
> I train these mails manually as spam, if they are not autolearned? Or
> should I train *all* my spam-mails regularly?

You can do both. A mail already trained won't be learned again, but it 
won't do any harm doing that (for instance if it is easier to train all 
spam instead of picking the spam that was not autolearned), it just makes 
the whole training process last a bit longer.
If I remember right I train all spam from the MailScanner spam quarantine 
with a daily cron script.

Kai
