Posted to users@spamassassin.apache.org by andrij <an...@gmail.com> on 2010/07/26 11:58:29 UTC

Bayes classifier

Hi all,

I am new to SpamAssassin and the Bayes classifier. I have several questions
and would greatly appreciate your help with them.

1) Training of the Bayes classifier with _multipart_ e-mails (e.g., an
e-mail that contains other e-mails within its body). If I set
"bayes_ignore_header Some-header", will the Bayes classifier ignore (while
learning) the header "Some-header" in the nested messages as well?

2) Evaluating whether an email is spam or not. Again, if I set
"bayes_ignore_header Some-header", will the bayes classifier ignore the
header while evaluating an e-mail?

3) Evaluating whether an email is spam or not. Does the Bayes classifier
analyze headers if I have, for example, the following rule: "body BAYES_05
eval:check_bayes('0.00', '0.05')"? According to
http://wiki.apache.org/spamassassin/WritingRules : "Body rules also include
the Subject as the first line of the body content". So, are any headers that
precede the Subject header not considered by the Bayes classifier?

Thanks for the help.
-- 
View this message in context: http://old.nabble.com/Bayes-classifier-tp29264841p29264841.html
Sent from the SpamAssassin - Users mailing list archive at Nabble.com.


Re: Bayes classifier

Posted by John Hardin <jh...@impsec.org>.
On Mon, 26 Jul 2010, Matus UHLAR - fantomas wrote:

>> On Mon, 26 Jul 2010, Bowie Bailey wrote:
>>
>>>> 3) Evaluating whether an email is spam or not. Does the bayes
>>>>    classifier analyze headers if I have, for example, the following
>>>>    rule: "body BAYES_05 eval:check_bayes('0.00', '0.05')". According to
>>>>    the http://wiki.apache.org/spamassassin/WritingRules : "Body rules
>>>>    also include the Subject as the first line of the body content". So,
>>>>    any headers that precede subject header are not considered by the
>>>>    bayes classifier?
>>>
>>> I don't have an answer for you here, but just another question.  Why do
>>> you want to mess with the bayes rules?  They work very well as-is as
>>> long as you make sure the database is being fed properly (learning spam
>>> as spam and ham as ham with a decent mix of both being learned on a
>>> regular basis).
>
> On 26.07.10 08:13, John Hardin wrote:
>> A better answer here would be "the order of the headers doesn't matter."
>
> at least until we have a rule that scores by header order :)
> (a bayes score probably)

The context of the question (as far as I can determine - it's a pretty 
rambling question) was "within the Bayes classifier", not "within general 
rules". There _are_ some rules where header order is significant and 
explicitly checked for.

So, let me amend my response:

A better answer here would be "the order of the headers doesn't matter to 
the bayes classifier".

-- 
  John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
  jhardin@impsec.org    FALaholic #11174     pgpk -a jhardin@impsec.org
  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
   Insofar as the police deter by their presence, they are very, very
   good. Criminals take great pains not to commit a crime in front of
   them.                                             -- Jeffrey Snyder
-----------------------------------------------------------------------
  10 days until the 275th anniversary of John Peter Zenger's acquittal

Re: Bayes classifier

Posted by Matus UHLAR - fantomas <uh...@fantomas.sk>.
> On Mon, 26 Jul 2010, Bowie Bailey wrote:
>
>>> 3) Evaluating whether an email is spam or not. Does the bayes
>>>    classifier analyze headers if I have, for example, the following
>>>    rule: "body BAYES_05 eval:check_bayes('0.00', '0.05')". According to
>>>    the http://wiki.apache.org/spamassassin/WritingRules : "Body rules
>>>    also include the Subject as the first line of the body content". So,
>>>    any headers that precede subject header are not considered by the
>>>    bayes classifier?
>>
>> I don't have an answer for you here, but just another question.  Why do 
>> you want to mess with the bayes rules?  They work very well as-is as  
>> long as you make sure the database is being fed properly (learning spam 
>> as spam and ham as ham with a decent mix of both being learned on a  
>> regular basis).

On 26.07.10 08:13, John Hardin wrote:
> A better answer here would be "the order of the headers doesn't matter."

at least until we have a rule that scores by header order :)
(a bayes score probably)
-- 
Matus UHLAR - fantomas, uhlar@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
Microsoft dick is soft to do no harm

Re: Bayes classifier

Posted by John Hardin <jh...@impsec.org>.
On Mon, 26 Jul 2010, Bowie Bailey wrote:

>> 3) Evaluating whether an email is spam or not. Does the bayes
>>    classifier analyze headers if I have, for example, the following
>>    rule: "body BAYES_05 eval:check_bayes('0.00', '0.05')". According to
>>    the http://wiki.apache.org/spamassassin/WritingRules : "Body rules
>>    also include the Subject as the first line of the body content". So,
>>    any headers that precede subject header are not considered by the
>>    bayes classifier?
>
> I don't have an answer for you here, but just another question.  Why do 
> you want to mess with the bayes rules?  They work very well as-is as 
> long as you make sure the database is being fed properly (learning spam 
> as spam and ham as ham with a decent mix of both being learned on a 
> regular basis).

A better answer here would be "the order of the headers doesn't matter."

-- 
  John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
  jhardin@impsec.org    FALaholic #11174     pgpk -a jhardin@impsec.org
  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
   I'm seriously considering getting one of those bright-orange prison
   overalls and stencilling PASSENGER on the back. Along with the paper
   slippers, I ought to be able to walk right through security.
                                              -- Brian Kantor in a.s.r
-----------------------------------------------------------------------
  10 days until the 275th anniversary of John Peter Zenger's acquittal

Re: Bayes classifier

Posted by Bowie Bailey <Bo...@BUC.com>.
 On 7/26/2010 2:46 PM, andrij wrote:
> Bowie Bailey wrote:
>>>>> 3) Evaluating whether an email is spam or not. Does the bayes
>>>>> classifier analyze headers if I have, for example, the following
>>>>> rule: "body BAYES_05 eval:check_bayes('0.00', '0.05')". According to
>>>>> the http://wiki.apache.org/spamassassin/WritingRules : "Body rules
>>>>> also include the Subject as the first line of the body content". So,
>>>>> any headers that precede subject header are not considered by the
>>>>> bayes classifier?
>>>> I don't have an answer for you here, but just another question.  Why do
>>>> you want to mess with the bayes rules?
>>> Maybe I am mistaken, but what is the point of training the bayes
>>> classifier on headers if headers (at least those that precede a subject
>>> header) are not considered during the spam detection phase?
>> Bayes learns based on the entire message -- headers and all. 
>> (Otherwise, what would be the point of the bayes_ignore_header option?)
>>
>> I can see where you might get that impression by looking at the rule,
>> but if I understand it correctly, Bayes has already been run and the
>> rule is just checking the result.
> Thank you for the clarification. The word "body" at the beginning of the rule
> confused me. So, in general it does not matter what word ("body" or
> "header") is put there -- the Bayes classifier analyzes both headers (except
> those named by bayes_ignore_header) and body during both learning and
> scoring phases. Right?

Right.

-- 
Bowie

Re: Bayes classifier

Posted by andrij <an...@gmail.com>.


Bowie Bailey wrote:
> 
>>>> 3) Evaluating whether an email is spam or not. Does the bayes
>>>> classifier analyze headers if I have, for example, the following
>>>> rule: "body BAYES_05 eval:check_bayes('0.00', '0.05')". According to
>>>> the http://wiki.apache.org/spamassassin/WritingRules : "Body rules
>>>> also include the Subject as the first line of the body content". So,
>>>> any headers that precede subject header are not considered by the
>>>> bayes classifier?
>>> I don't have an answer for you here, but just another question.  Why do
>>> you want to mess with the bayes rules?
>> Maybe I am mistaken, but what is the point of training the bayes
>> classifier on headers if headers (at least those that precede a subject
>> header) are not considered during the spam detection phase?
> 
> Bayes learns based on the entire message -- headers and all. 
> (Otherwise, what would be the point of the bayes_ignore_header option?)
> 
> I can see where you might get that impression by looking at the rule,
> but if I understand it correctly, Bayes has already been run and the
> rule is just checking the result.
> 

Thank you for the clarification. The word "body" at the beginning of the rule
confused me. So, in general it does not matter what word ("body" or
"header") is put there -- the Bayes classifier analyzes both headers (except
those named by bayes_ignore_header) and body during both learning and
scoring phases. Right?

-- 
View this message in context: http://old.nabble.com/Bayes-classifier-tp29264841p29269574.html


Re: Bayes classifier

Posted by Bowie Bailey <Bo...@BUC.com>.
 On 7/26/2010 10:12 AM, andrij wrote:
>>> 2) Evaluating whether an email is spam or not. Again, if I set
>>> "bayes_ignore_header Some-header", will the bayes classifier ignore the
>>> header while evaluating an e-mail?
>> Yes.  That's what it's for.
> So, the bayes classifier will ignore "Some-header" in both learning and spam
> detection phases. Did I understand it correctly?

I'm not an expert, just another user, but as I understand it, this
config option causes Bayes to ignore that particular header in both
learning and scoring modes.

>>> 3) Evaluating whether an email is spam or not. Does the bayes
>>> classifier analyze headers if I have, for example, the following
>>> rule: "body BAYES_05 eval:check_bayes('0.00', '0.05')". According to
>>> the http://wiki.apache.org/spamassassin/WritingRules : "Body rules
>>> also include the Subject as the first line of the body content". So,
>>> any headers that precede subject header are not considered by the
>>> bayes classifier?
>> I don't have an answer for you here, but just another question.  Why do
>> you want to mess with the bayes rules?
> Maybe I am mistaken, but what is the point of training the bayes
> classifier on headers if headers (at least those that precede a subject
> header) are not considered during the spam detection phase?

Bayes learns based on the entire message -- headers and all. 
(Otherwise, what would be the point of the bayes_ignore_header option?)

I can see where you might get that impression by looking at the rule,
but if I understand it correctly, Bayes has already been run and the
rule is just checking the result.
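
To make the point above concrete, here is a rough Python sketch. It is not
SpamAssassin's tokenizer; the function name, the token format and the
IGNORED_HEADERS set are all made up for illustration. It only shows the idea
discussed in this thread: one token-extraction step is shared by learning and
scanning, it covers headers and body, it skips headers listed the way
"bayes_ignore_header Some-header" would list them, and header order has no
effect on the result.

```python
import email
import re
from email import policy

# Hypothetical stand-in for "bayes_ignore_header Some-header".
IGNORED_HEADERS = {"some-header"}


def tokens_for_bayes(raw: bytes) -> set:
    """Toy token extraction, meant to be shared by learning and scanning."""
    msg = email.message_from_bytes(raw, policy=policy.default)
    tokens = set()

    # Header tokens: every header contributes unless it is on the ignore list.
    for name, value in msg.items():
        if name.lower() in IGNORED_HEADERS:
            continue
        for word in re.findall(r"\w+", str(value)):
            tokens.add(f"{name.lower()}:{word.lower()}")

    # Body tokens: plain-text part only, to keep the sketch short.
    body = msg.get_body(preferencelist=("plain",))
    text = body.get_content() if body is not None else ""
    tokens.update(word.lower() for word in re.findall(r"\w+", text))

    # A set has no order, so the position of a header cannot matter.
    return tokens
```

Because both phases would call the same function, an ignored header is
invisible to training and to scoring alike, and a header that precedes
Subject is treated no differently from one that follows it.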

-- 
Bowie

Re: Bayes classifier

Posted by andrij <an...@gmail.com>.


>> 2) Evaluating whether an email is spam or not. Again, if I set
>> "bayes_ignore_header Some-header", will the bayes classifier ignore the
>> header while evaluating an e-mail?
> 
> Yes.  That's what it's for.
> 

So, the bayes classifier will ignore "Some-header" in both learning and spam
detection phases. Did I understand it correctly?



>> 3) Evaluating whether an email is spam or not. Does the bayes
>> classifier analyze headers if I have, for example, the following
>> rule: "body BAYES_05 eval:check_bayes('0.00', '0.05')". According to
>> the http://wiki.apache.org/spamassassin/WritingRules : "Body rules
>> also include the Subject as the first line of the body content". So,
>> any headers that precede subject header are not considered by the
>> bayes classifier?
> 
> I don't have an answer for you here, but just another question.  Why do
> you want to mess with the bayes rules?
> 

Maybe I am mistaken, but what is the point of training the bayes classifier
on headers if headers (at least those that precede a subject header) are not
considered during the spam detection phase?

Thank you.
-- 
View this message in context: http://old.nabble.com/Bayes-classifier-tp29264841p29266978.html


Re: Bayes classifier

Posted by RW <rw...@googlemail.com>.
On Mon, 26 Jul 2010 09:47:24 -0400
Bowie Bailey <Bo...@BUC.com> wrote:


> > 3) Evaluating whether an email is spam or not. Does the bayes
> > classifier analyze headers if I have, for example, the following
> > rule: "body BAYES_05 eval:check_bayes('0.00', '0.05')". According
> > to the http://wiki.apache.org/spamassassin/WritingRules : "Body
> > rules also include the Subject as the first line of the body
> > content". So, any headers that precede subject header are not
> > considered by the bayes classifier?
> 
> I don't have an answer for you here, but just another question.  Why
> do you want to mess with the bayes rules?  

That's actually the way BAYES rules are already set up (except that
BAYES_05 has '0.01' not '0.00'), so they are already body rules. It
doesn't mean they only run on the body and subject.
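
As a rough illustration of why the rule type says nothing about what Bayes
looked at, the Python sketch below treats a BAYES_nn rule as a plain range
test on a probability that was computed earlier. This is not SpamAssassin's
code. The function names are made up, the BAYES_00 range is illustrative
(only the 0.01 to 0.05 range for BAYES_05 comes from this thread), and the
boundary handling (lower bound exclusive unless it is 0, upper bound
inclusive) is an assumption.

```python
# Illustrative rule table; only the BAYES_05 range is taken from the thread.
BAYES_RULES = {
    "BAYES_00": (0.00, 0.01),
    "BAYES_05": (0.01, 0.05),
}


def in_bayes_range(probability: float, low: float, high: float) -> bool:
    """Range test on an already-computed Bayes probability.

    Assumed boundaries: lower bound exclusive (unless it is 0),
    upper bound inclusive.
    """
    return (low == 0 or probability > low) and probability <= high


def matching_rules(probability: float) -> list:
    """Return the names of the rules whose range contains the probability."""
    return [
        name
        for name, (low, high) in BAYES_RULES.items()
        if in_bayes_range(probability, low, high)
    ]
```

The check never touches the message text. Whatever tokens went into the
probability (headers included) were handled before the rule runs.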

Re: Bayes classifier

Posted by Bowie Bailey <Bo...@BUC.com>.
 On 7/26/2010 5:58 AM, andrij wrote:
> Hi all,
>
> I am new to SpamAssassin and the Bayes classifier. I have several questions
> and would greatly appreciate your help with them.
>
> 1) Training of the Bayes classifier with _multipart_ e-mails (e.g., an
> e-mail that contains other e-mails within its body). If I set
> "bayes_ignore_header Some-header", will the Bayes classifier ignore (while
> learning) the header "Some-header" in the nested messages as well?

As far as SA is concerned, this is a single message with a single set of
headers.  Bayes will ignore the specified header in the main message,
but not in the body (where the rest of the e-mails are stored).  If you
want them treated as separate messages, you will need to run something
to split them into separate files and then learn them.
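
One possible way to do the splitting, sketched in Python with the standard
library's email package. It assumes the nested e-mails are attached as
message/rfc822 parts (forwarded-as-attachment style); messages pasted inline
into the body text would need different handling. The function names and the
output file naming are made up for this example.

```python
import email
from email import policy
from pathlib import Path


def extract_nested_messages(raw: bytes) -> list:
    """Return each message/rfc822 part of `raw` as standalone message bytes."""
    outer = email.message_from_bytes(raw, policy=policy.default)
    nested = []
    for part in outer.walk():
        if part.get_content_type() == "message/rfc822":
            # A message/rfc822 part holds the embedded message as the
            # single element of its payload list.
            inner = part.get_payload(0)
            nested.append(inner.as_bytes())
    return nested


def write_nested_messages(raw: bytes, out_dir: str) -> list:
    """Write every nested message to its own file; return the file paths."""
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)
    paths = []
    for index, data in enumerate(extract_nested_messages(raw)):
        path = target / f"nested-{index:04d}.eml"
        path.write_bytes(data)
        paths.append(path)
    return paths
```

Each resulting file has its own header block, so it can then be fed to
sa-learn (--spam or --ham) as a separate message. Note that a message nested
two levels deep is written out on its own and also stays embedded in its
parent's file.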

> 2) Evaluating whether an email is spam or not. Again, if I set
> "bayes_ignore_header Some-header", will the bayes classifier ignore the
> header while evaluating an e-mail?

Yes.  That's what it's for.

> 3) Evaluating whether an email is spam or not. Does the bayes classifier
> analyze headers if I have, for example, the following rule: "body BAYES_05
> eval:check_bayes('0.00', '0.05')". According to the
> http://wiki.apache.org/spamassassin/WritingRules : "Body rules also include
> the Subject as the first line of the body content". So, any headers that
> precede subject header are not considered by the bayes classifier?

I don't have an answer for you here, but just another question.  Why do
you want to mess with the bayes rules?  They work very well as-is as
long as you make sure the database is being fed properly (learning spam
as spam and ham as ham with a decent mix of both being learned on a
regular basis).

-- 
Bowie