Posted to users@spamassassin.apache.org by Beast <be...@ldap.or.id> on 2006/08/14 07:21:16 UTC

bayes not run on some mail

Hi,

Some spam was not caught by SA, and it seems that Bayes was not applied to
those messages.

X-Spam-Report:
     * 0.0 HTML_MESSAGE BODY: HTML included in message
     * 1.7 SARE_SPEC_ROLEX Rolex watch spam
X-Spam-Status: No, score=1.7 required=5.2 tests=HTML_MESSAGE,SARE_SPEC_ROLEX
     autolearn=no version=3.1.4

Is the Bayes check not run for every mail?


--beast
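
One quick way to confirm the symptom is to look for a BAYES_* rule name in the tests= list of the X-Spam-Status header; when Bayes scores a message, one of the BAYES_00 to BAYES_99 rules normally shows up there. A minimal Python sketch, assuming the header value has already been extracted as a string (the function name bayes_rules_hit is made up for illustration):

```python
import re


def bayes_rules_hit(status_header):
    """Return the BAYES_* rule names listed in an X-Spam-Status header value."""
    # Unfold wrapped header lines into a single line.
    unfolded = " ".join(status_header.split())
    # A folded tests= list can leave a space after a comma; remove it.
    unfolded = re.sub(r",\s+", ",", unfolded)
    match = re.search(r"tests=(\S+)", unfolded)
    if match is None:
        return []
    return [name for name in match.group(1).split(",")
            if name.startswith("BAYES_")]


# The header from the message above: no BAYES_* rule is present.
header = ("No, score=1.7 required=5.2 tests=HTML_MESSAGE,SARE_SPEC_ROLEX\n"
          "     autolearn=no version=3.1.4")
print(bayes_rules_hit(header))  # prints []
```

An empty list for a message means no Bayes rule was recorded for it, which matches the report above.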


Re: bayes not run on some mail

Posted by Magnus Holmgren <ho...@lysator.liu.se>.
On Monday 14 August 2006 11:02, Nigel Frankcom took the opportunity to say:
> On Mon, 14 Aug 2006 01:52:33 -0700, "jdow" <jd...@earthlink.net> wrote:
> >(I manually train here. I distrust automatic training.)
> >
> >{^_^}
>
> I agree with not autotraining, imo it's a damned good way to get your
> bayes poisoned. With beast's error I got the impression only _some_
> mails were being missed which would imply either a file lock issue or
> not enough child processes?

Autotraining should be completely safe *if* you are able to relearn all 
miscategorised mail.

-- 
Magnus Holmgren        holmgren@lysator.liu.se
                       (No Cc of list mail needed, thanks)

Re: bayes not run on some mail

Posted by jdow <jd...@earthlink.net>.
From: "Beast" <be...@ldap.or.id>

> Nigel Frankcom wrote:
>>   
>>>> I will turn on autolearn, mostly because I need to feed more HAM to SA
>>>> (so far I only feed it ham from false positives, which are very few each
>>>> day, and I think that is not enough for SA)
>>>>       
>>> If it is well trained then Bayes should be hitting. It may be that
>>> SA cannot get to the Bayes database due to privileges.
>>>
>>> (I manually train here. I distrust automatic training.)
>>>
>>> {^_^}
>>>     
>>
>> I agree with not autotraining, imo it's a damned good way to get your
>> bayes poisoned. With beast's error I got the impression only _some_
>> mails were being missed which would imply either a file lock issue or
>> not enough child processes?
>>   
> I also agree with your point; however, I need to feed it more HAM (not spam)
> messages, which are not easy to obtain unless we dump all users' mail into
> one mailbox.
> 
> As for a Bayes file locking problem, I'm not so sure, because there is no
> complaint in the log:
> 
> Aug 13 22:11:01 blowfish spampd[9828]: clean message 
> <d9...@auracom.net> (1.67/5.20) from 
> <wo...@auracom.net> for <po...@example.com> in 0.33s, 2587 bytes.
> 
> Yesterday I received 5 FN mails which were not scanned by Bayes (low
> score). This is for postmaster only; I'm not sure whether it applies to
> other addresses as well.

As postmaster you can probably set up a spamtrap account with a very
easily guessed name. Perhaps pick one out of the list of <stuff>
that gets rejected as user unknown. Then watch it for a few days.

{^_^}

Re: Re: bayes not run on some mail

Posted by Nigel Frankcom <ni...@blue-canoe.net>.
On Mon, 14 Aug 2006 16:28:21 +0700, Beast <be...@ldap.or.id> wrote:

>Nigel Frankcom wrote:
>>   
>>>> I will turn on autolearn, mostly because I need to feed more HAM to SA
>>>> (so far I only feed it ham from false positives, which are very few each
>>>> day, and I think that is not enough for SA)
>>>>       
>>> If it is well trained then Bayes should be hitting. It may be that
>>> SA cannot get to the Bayes database due to privileges.
>>>
>>> (I manually train here. I distrust automatic training.)
>>>
>>> {^_^}
>>>     
>>
>> I agree with not autotraining, imo it's a damned good way to get your
>> bayes poisoned. With beast's error I got the impression only _some_
>> mails were being missed which would imply either a file lock issue or
>> not enough child processes?
>>   
>I also agree with your point; however, I need to feed it more HAM (not spam)
>messages, which are not easy to obtain unless we dump all users' mail into
>one mailbox.
>
>As for a Bayes file locking problem, I'm not so sure, because there is no
>complaint in the log:
>
>Aug 13 22:11:01 blowfish spampd[9828]: clean message 
><d9...@auracom.net> (1.67/5.20) from 
><wo...@auracom.net> for <po...@example.com> in 0.33s, 2587 bytes.
>
>Yesterday I received 5 FN mails which were not scanned by Bayes (low
>score). This is for postmaster only; I'm not sure whether it applies to
>other addresses as well.
>
>--beast

A lot will depend on the circumstances your email servers run under
and the terms and privacy options your site uses.

Here it's not such an issue fortunately. I have an application that
pulls mails out of the archive for our mailservers; then it's a case
of finding either ham or specific spam to train in. 

You might try training in your own mailbox for ham; though with a
large userbase ideally you want to train in a representative corpus of
mail to all your users.

Either way, it's going to involve some work (though significantly less
work than clearing up after the spammers).

I've found here that after the initial training run, just adding in
reported FPs and FNs is sufficient to keep Bayes accurate. This doesn't
usually involve more than a few mails a month.

Nigel

Re: bayes not run on some mail

Posted by Beast <be...@ldap.or.id>.
Nigel Frankcom wrote:
>   
>>> I will turn on autolearn, mostly because I need to feed more HAM to SA
>>> (so far I only feed it ham from false positives, which are very few each
>>> day, and I think that is not enough for SA)
>>>       
>> If it is well trained then Bayes should be hitting. It may be that
>> SA cannot get to the Bayes database due to privileges.
>>
>> (I manually train here. I distrust automatic training.)
>>
>> {^_^}
>>     
>
> I agree with not autotraining, imo it's a damned good way to get your
> bayes poisoned. With beast's error I got the impression only _some_
> mails were being missed which would imply either a file lock issue or
> not enough child processes?
>   
I also agree with your point; however, I need to feed it more HAM (not spam)
messages, which are not easy to obtain unless we dump all users' mail into
one mailbox.

As for a Bayes file locking problem, I'm not so sure, because there is no
complaint in the log:

Aug 13 22:11:01 blowfish spampd[9828]: clean message 
<d9...@auracom.net> (1.67/5.20) from 
<wo...@auracom.net> for <po...@example.com> in 0.33s, 2587 bytes.

Yesterday I received 5 FN mails which were not scanned by Bayes (low
score). This is for postmaster only; I'm not sure whether it applies to
other addresses as well.

--beast


Re: Re: bayes not run on some mail

Posted by Nigel Frankcom <ni...@blue-canoe.net>.
On Mon, 14 Aug 2006 01:52:33 -0700, "jdow" <jd...@earthlink.net> wrote:

>From: "Beast" <be...@ldap.or.id>
>
>> jdow wrote:
>>> From: "Beast" <be...@ldap.or.id>
>>>
>>>> Hi,
>>>>
>>>> Some spam was not caught by SA, and it seems that Bayes was not applied
>>>> to those messages.
>>>>
>>>> X-Spam-Report:
>>>>     * 0.0 HTML_MESSAGE BODY: HTML included in message
>>>>     * 1.7 SARE_SPEC_ROLEX Rolex watch spam
>>>> X-Spam-Status: No, score=1.7 required=5.2 
>>>> tests=HTML_MESSAGE,SARE_SPEC_ROLEX
>>>>     autolearn=no version=3.1.4
>>>>
>>>> Is the Bayes check not run for every mail?
>>>
>>> It is not run if you have not yet learned from at least 200 each of
>>> spam and ham messages. You do not learn from all messages because the
>>> scores are "indicative" rather than "certain" with regard to estimating
>>> ham or spam properties. If you collect a random bunch of 200 or more
>>> ham messages and 200 or more known spam messages and manually train
>>> with them via sa-learn you can get Bayes working sooner.
>> 
>> It has actually learned from a large enough corpus. I have been running this
>> for more than a year with manual training (trained daily by a human). Bayes
>> works for most mail, but not for all of it.
>> 
>> [root@blowfish ~]# spamassassin --lint -D 2>&1 |  grep 'corpus size'
>> [12081] dbg: bayes: corpus size: nspam = 34035, nham = 7399
>> 
>> I will turn on autolearn, mostly because I need to feed more HAM to SA
>> (so far I only feed it ham from false positives, which are very few each
>> day, and I think that is not enough for SA)
>
>If it is well trained then Bayes should be hitting. It may be that
>SA cannot get to the Bayes database due to privileges.
>
>(I manually train here. I distrust automatic training.)
>
>{^_^}

I agree with not autotraining, imo it's a damned good way to get your
bayes poisoned. With beast's error I got the impression only _some_
mails were being missed which would imply either a file lock issue or
not enough child processes?

Nigel

Re: bayes not run on some mail

Posted by jdow <jd...@earthlink.net>.
From: "Beast" <be...@ldap.or.id>

> jdow wrote:
>> From: "Beast" <be...@ldap.or.id>
>>
>>> Hi,
>>>
>>> Some spam was not caught by SA, and it seems that Bayes was not applied
>>> to those messages.
>>>
>>> X-Spam-Report:
>>>     * 0.0 HTML_MESSAGE BODY: HTML included in message
>>>     * 1.7 SARE_SPEC_ROLEX Rolex watch spam
>>> X-Spam-Status: No, score=1.7 required=5.2 
>>> tests=HTML_MESSAGE,SARE_SPEC_ROLEX
>>>     autolearn=no version=3.1.4
>>>
>>> Is the Bayes check not run for every mail?
>>
>> It is not run if you have not yet learned from at least 200 each of
>> spam and ham messages. You do not learn from all messages because the
>> scores are "indicative" rather than "certain" with regard to estimating
>> ham or spam properties. If you collect a random bunch of 200 or more
>> ham messages and 200 or more known spam messages and manually train
>> with them via sa-learn you can get Bayes working sooner.
> 
> It has actually learned from a large enough corpus. I have been running this
> for more than a year with manual training (trained daily by a human). Bayes
> works for most mail, but not for all of it.
> 
> [root@blowfish ~]# spamassassin --lint -D 2>&1 |  grep 'corpus size'
> [12081] dbg: bayes: corpus size: nspam = 34035, nham = 7399
> 
> I will turn on autolearn, mostly because I need to feed more HAM to SA
> (so far I only feed it ham from false positives, which are very few each
> day, and I think that is not enough for SA)

If it is well trained then Bayes should be hitting. It may be that
SA cannot get to the Bayes database due to privileges.

(I manually train here. I distrust automatic training.)

{^_^}
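
A rough way to test the privileges theory is to check, as the user the scanner runs as, whether the Bayes database files exist and are readable and writable. The sketch below is Python and assumes the default DBM storage, where bayes_path (by default ~/.spamassassin/bayes) gets the suffixes _toks and _seen. check_bayes_files is a made-up helper name, and a SQL-backed setup would need a different check.

```python
import os


def check_bayes_files(bayes_path):
    """Report whether the Bayes DBM files under bayes_path are accessible.

    Assumes the default file layout: <bayes_path>_toks and <bayes_path>_seen.
    """
    results = {}
    for suffix in ("_toks", "_seen"):
        path = bayes_path + suffix
        if not os.path.exists(path):
            results[path] = "missing"
        elif os.access(path, os.R_OK | os.W_OK):
            results[path] = "ok"
        else:
            results[path] = "no read/write access"
    return results


if __name__ == "__main__":
    # Run this as the account the scanner (for example spampd) runs under.
    base = os.path.expanduser("~/.spamassassin/bayes")
    for path, status in check_bayes_files(base).items():
        print(path, status)
```

If the files show up as missing or inaccessible for the scanning user but fine for root, the scanner and the training runs are probably not using the same database.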

Re: bayes not run on some mail

Posted by Beast <be...@ldap.or.id>.
jdow wrote:
> From: "Beast" <be...@ldap.or.id>
>
>> Hi,
>>
>> Some spam was not caught by SA, and it seems that Bayes was not applied
>> to those messages.
>>
>> X-Spam-Report:
>>     * 0.0 HTML_MESSAGE BODY: HTML included in message
>>     * 1.7 SARE_SPEC_ROLEX Rolex watch spam
>> X-Spam-Status: No, score=1.7 required=5.2 
>> tests=HTML_MESSAGE,SARE_SPEC_ROLEX
>>     autolearn=no version=3.1.4
>>
>> Is the Bayes check not run for every mail?
>
> It is not run if you have not yet learned from at least 200 each of
> spam and ham messages. You do not learn from all messages because the
> scores are "indicative" rather than "certain" with regard to estimating
> ham or spam properties. If you collect a random bunch of 200 or more
> ham messages and 200 or more known spam messages and manually train
> with them via sa-learn you can get Bayes working sooner.

It has actually learned from a large enough corpus. I have been running this
for more than a year with manual training (trained daily by a human). Bayes
works for most mail, but not for all of it.

[root@blowfish ~]# spamassassin --lint -D 2>&1 |  grep 'corpus size'
[12081] dbg: bayes: corpus size: nspam = 34035, nham = 7399

I will turn on autolearn, mostly because I need to feed more HAM to SA
(so far I only feed it ham from false positives, which are very few each
day, and I think that is not enough for SA)


--beast
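
For reference, the corpus-size debug line above can be checked against the 200-message minimum mentioned earlier in the thread with a few lines of Python. This is only a sketch: the function names are invented, and the 200/200 defaults are taken from the thread rather than read from the local configuration.

```python
import re

# Minimums as stated in the thread; a local configuration may override them.
MIN_SPAM = 200
MIN_HAM = 200


def parse_corpus_size(debug_line):
    """Extract (nspam, nham) from a 'bayes: corpus size' debug line."""
    match = re.search(r"corpus size: nspam = (\d+), nham = (\d+)", debug_line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def bayes_has_enough_training(debug_line, min_spam=MIN_SPAM, min_ham=MIN_HAM):
    """Return True when both counts reach the required minimums."""
    counts = parse_corpus_size(debug_line)
    if counts is None:
        return False
    nspam, nham = counts
    return nspam >= min_spam and nham >= min_ham


line = "[12081] dbg: bayes: corpus size: nspam = 34035, nham = 7399"
print(parse_corpus_size(line))          # prints (34035, 7399)
print(bayes_has_enough_training(line))  # prints True
```

With 34035 spam and 7399 ham learned, the minimum is clearly not what keeps Bayes from scoring these messages.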


Re: bayes not run on some mail

Posted by jdow <jd...@earthlink.net>.
From: "Beast" <be...@ldap.or.id>

> Hi,
> 
> Some spam was not caught by SA, and it seems that Bayes was not applied
> to those messages.
> 
> X-Spam-Report:
>     * 0.0 HTML_MESSAGE BODY: HTML included in message
>     * 1.7 SARE_SPEC_ROLEX Rolex watch spam
> X-Spam-Status: No, score=1.7 required=5.2 tests=HTML_MESSAGE,SARE_SPEC_ROLEX
>     autolearn=no version=3.1.4
> 
> Is the Bayes check not run for every mail?

It is not run if you have not yet learned from at least 200 each of
spam and ham messages. You do not learn from all messages because the
scores are "indicative" rather than "certain" with regard to estimating
ham or spam properties. If you collect a random bunch of 200 or more
ham messages and 200 or more known spam messages and manually train
with them via sa-learn you can get Bayes working sooner.

{^_^}
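
A minimal sketch of the manual training step, in Python, assuming ham and spam have already been sorted into two directories (the paths shown are placeholders). It only uses the sa-learn options --ham, --spam and --dump magic, and by default it just prints the commands rather than running them. Training should be done as the same user that scans mail, so that the same Bayes database is updated.

```python
import subprocess


def sa_learn_commands(ham_dir, spam_dir):
    """Build the sa-learn invocations for one manual training run."""
    return [
        ["sa-learn", "--ham", ham_dir],    # learn known-good mail
        ["sa-learn", "--spam", spam_dir],  # learn known spam
        ["sa-learn", "--dump", "magic"],   # show counters, including nspam/nham
    ]


def train(ham_dir, spam_dir, dry_run=True):
    """Print the commands, or run them when dry_run is False."""
    commands = sa_learn_commands(ham_dir, spam_dir)
    for argv in commands:
        if dry_run:
            print(" ".join(argv))
        else:
            subprocess.run(argv, check=True)
    return commands


if __name__ == "__main__":
    # Placeholder paths; replace with the real ham and spam directories.
    train("/path/to/ham", "/path/to/spam")
```

Once both counts in the --dump magic output reach 200, Bayes should start contributing to the score.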