Posted to users@spamassassin.apache.org by Paul Boven <p....@chello.nl> on 2005/03/17 11:16:27 UTC

Testing Bayes (auto)-learning

Hi everyone,

There seem to be some learning problems with our Bayes database, which 
I'm trying to track down.

Given a particular spam-message that got auto-trained as ham, then 
re-trained as spam, I would like to be able to do the following:

1.) Determine whether it's in the Bayes database at all, and if so whether 
it is there as ham or as spam. I can use the Berkeley DB tools to dump the 
bayes_seen database, but often the Message-ID isn't in there even though 
the message got learned; it was probably stored under a '@sa-generated' 
Message-ID instead.

Given the original message, how can I determine which Message-ID Bayes 
is using to keep track of the message? When will it accept the original 
Message-ID, and when will it use the generated one? How can I determine 
the sa-generated Message-ID without running it through the learner again?

How sensitive is the generated Message-ID to changes in Received: and 
other headers that happen when the mail gets returned to the learner?

2.) With the new SpamAssassin 3.0.2, I can no longer see what score a 
particular token has, because the tokens are now stored hashed. Is there 
an easy way to generate these hashes, or is there an interface that I can 
use to check the score for a token?
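
As a rough sketch of the first option: to the best of my understanding, 
SpamAssassin 3.0 stores each token as the last five bytes of its SHA-1 
digest, and 'sa-learn --dump data' prints those bytes as hex next to the 
token's probability and counts. Treat both points as assumptions to verify 
against your installed Bayes.pm. The function name below is made up, and it 
only covers plain body tokens; header tokens carry prefixes added by the 
tokenizer, which are not reproduced here.

```python
import hashlib


def bayes_token_hash(token: str) -> str:
    """Return a hex string for comparison with dumped token keys.

    Assumption: the stored key is the last 5 bytes of SHA-1(token).
    Assumption: the token is hashed as UTF-8 bytes, exactly as tokenized.
    """
    digest = hashlib.sha1(token.encode("utf-8")).digest()
    return digest[-5:].hex()


if __name__ == "__main__":
    # Print candidate keys to search for in the output of the dump.
    for word in ("viagra", "Viagra", "unsubscribe"):
        print(word, bayes_token_hash(word))
```

If the assumption holds, searching the dump for the printed hex string 
should show the line for that token. Tokens are case-sensitive, so several 
spellings may need to be tried.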

My problem is that I have end-users that are basically claiming 'the 
more I send to the relearn-address, the lower the Bayes score seems to 
be getting.' The included headers seem to support that claim, so I 
really want to dig a bit deeper into the whole setup.

Regards, Paul Boven.




Re: Testing Bayes (auto)-learning

Posted by Matt Kettler <mk...@evi-inc.com>.
Greg Abbas wrote:

>Paul Boven <p.boven <at> chello.nl> writes:
>
>>Yes, they're forwarding the messages as attachments, and yes, I'm 
>>stripping them out of the message/rfc822 attachments before feeding 
>>them to Bayes. And in all the tests I've done so far this seems to work, 
>>but now that we've upgraded to SA3.0.2 I can't peek 'under the hood' 
>>anymore to see if things are still being learned as they should.
>
>On a related note, if I grab messages from a maildir after
>spamassassin has "quarantined" them ("The original message has
>been attached to this so you can view it... yadda yadda") is
>sa-learn smart enough to realize that the spam is contained in
>the attachment? 

sa-learn is smart enough to undo any changes made by spamassassin
itself, so if you use SA to do your tagging, sa-learn will undo it prior
to learning.

However, if you use a tool like amavis, mimedefang, or mailscanner and
use that tool's own encapsulation methods instead of SA's, then sa-learn
won't undo it.


Re: Testing Bayes (auto)-learning

Posted by Greg Abbas <sp...@abbas.org>.
Paul Boven <p.boven <at> chello.nl> writes:
> Yes, they're forwarding the messages as attachments, and yes, I'm 
> stripping them out of the message/rfc822 attachments before feeding 
> them to Bayes. And in all the tests I've done so far this seems to work, 
> but now that we've upgraded to SA3.0.2 I can't peek 'under the hood' 
> anymore to see if things are still being learned as they should.

On a related note, if I grab messages from a maildir after
spamassassin has "quarantined" them ("The original message has
been attached to this so you can view it... yadda yadda") is
sa-learn smart enough to realize that the spam is contained in
the attachment? Or is this the same situation as a user-forward,
where I would need to write something to strip it out?

And as an aside, I'm curious about "peeking under the hood" too,
but in my case it's because I'm curious how many messages have
been trained. (In order to find out how soon the filter is going
to think the corpus is large enough to start using its bayes
rules.)
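
One way to check this, as a sketch: 'sa-learn --dump magic' prints the 
database's bookkeeping counters, including nspam and nham, and the Bayes 
rules stay inactive until both counts reach bayes_min_spam_num and 
bayes_min_ham_num (200 each by default). The helper names and the sample 
figures below are made up for illustration, and the column layout is 
assumed from typical 3.0 output.

```python
# Illustrative text in the assumed layout of `sa-learn --dump magic`.
# The numbers are invented, not taken from a real database.
SAMPLE = """\
0.000          0          3          0  non-token data: bayes db version
0.000          0        208          0  non-token data: nspam
0.000          0        153          0  non-token data: nham
0.000          0      48211          0  non-token data: ntokens
"""


def parse_magic(dump: str) -> dict:
    """Map each 'non-token data' label to the integer in the third column."""
    counts = {}
    for line in dump.splitlines():
        head, sep, label = line.partition("non-token data:")
        if not sep:
            continue  # not a bookkeeping line
        fields = head.split()
        if len(fields) >= 3:
            counts[label.strip()] = int(fields[2])
    return counts


def bayes_ready(counts: dict, min_spam: int = 200, min_ham: int = 200) -> bool:
    """True once both learned counts reach the configured minimums."""
    return (counts.get("nspam", 0) >= min_spam
            and counts.get("nham", 0) >= min_ham)


if __name__ == "__main__":
    counts = parse_magic(SAMPLE)
    print(counts["nspam"], counts["nham"], bayes_ready(counts))
```

In practice the text would come from running 'sa-learn --dump magic' as 
the user that owns the database rather than from the sample string.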

TIA. -g.



Re: Testing Bayes (auto)-learning

Posted by Paul Boven <p....@chello.nl>.
Hi Daryl, everyone,

Daryl C. W. O'Shea wrote:
> Paul Boven wrote:

>> My problem is that I have end-users that are basically claiming 'the 
>> more I send to the relearn-address, the lower the Bayes score seems to 
>> be getting.' The included headers seem to support that claim, so I 
>> really want to dig a bit deeper into the whole setup.

> That there sounds like your problem.  How are your users sending mail to 
> the 'relearn address'?  If they're not forwarding messages as an 
> attachment, and you're not stripping out these attached messages then it 
> isn't going to work to your benefit, and you'll see the result you 
> describe.

Yes, they're forwarding the messages as attachments, and yes, I'm 
stripping them out of the message/rfc822 attachments before feeding 
them to Bayes. And in all the tests I've done so far this seems to work, 
but now that we've upgraded to SA3.0.2 I can't peek 'under the hood' 
anymore to see if things are still being learned as they should.
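
For reference, the extraction step can be sketched roughly as follows. 
This is a minimal illustration using Python's standard email package, not 
the script actually in use here; the function name and the sample message 
are made up. It re-serializes the attached message, so the result is not 
guaranteed to be byte-identical to what the sender originally received.

```python
from email import message_from_bytes, policy
from typing import List

# Serialize without re-wrapping long headers or escaping "From " lines.
_OUT = policy.compat32.clone(max_line_length=None, mangle_from_=False)

# Made-up example of a user forwarding a spam as an attachment.
SAMPLE_FORWARD = b"""\
From: user@example.org
To: relearn@example.org
Subject: Fwd: spam
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="XYZ"

--XYZ
Content-Type: text/plain

see attached
--XYZ
Content-Type: message/rfc822

Message-ID: <orig@example.net>
Subject: Buy now

spam body
--XYZ--
"""


def extract_attached_messages(raw: bytes) -> List[bytes]:
    """Return every message/rfc822 attachment found in raw, as bytes."""
    outer = message_from_bytes(raw)
    found = []
    for part in outer.walk():
        if part.get_content_type() != "message/rfc822":
            continue
        # The parser exposes the attached message as a one-element list.
        payload = part.get_payload()
        if isinstance(payload, list) and payload:
            found.append(payload[0].as_bytes(policy=_OUT))
    return found


if __name__ == "__main__":
    for msg in extract_attached_messages(SAMPLE_FORWARD):
        print(msg.decode("ascii", "replace"))
```

Each returned message could then be written to a file and passed to 
sa-learn. Note that walk() also descends into the attached message, so an 
attachment nested inside another attachment is returned as well.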

Regards, Paul Boven.

Re: Testing Bayes (auto)-learning

Posted by "Daryl C. W. O'Shea" <sp...@dostech.ca>.
Paul Boven wrote:
> My problem is that I have end-users that are basically claiming 'the 
> more I send to the relearn-address, the lower the Bayes score seems to 
> be getting.' The included headers seem to support that claim, so I 
> really want to dig a bit deeper into the whole setup.

That there sounds like your problem.  How are your users sending mail to 
the 'relearn address'?  If they're not forwarding messages as an 
attachment, and you're not stripping out these attached messages then it 
isn't going to work to your benefit, and you'll see the result you describe.

Daryl