Posted to users@spamassassin.apache.org by "David F. Skoll" <df...@roaringpenguin.com> on 2014/10/14 16:17:05 UTC

Philosophical question on Bayes (was Re: 23_bayes_ignore_header.cf)

On Tue, 14 Oct 2014 16:10:52 +0200
Axb <ax...@gmail.com> wrote:

> and to avoid further discussions of what header may pollute bayes or
> not, I've removed all header entries which are not directly related
> to AV/filter products.

I'm not sure I agree with being too clever about Bayes.  Surely by its
very nature, the Bayes algorithm will itself indicate which tokens
are relevant and which are not?  Isn't that the whole point of Bayes?

I think being too clever about massaging the data that gets fed to
Bayes may be counter-productive.  For sure, *some* massaging is in order;
a token should be a semantic unit, so something like "www.example.com"
should probably be one token rather than three, but beyond that I wonder
if it's good or not to massage the data?

Regards,

David.
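
To make the "one token rather than three" point concrete, here is a minimal
sketch. It is not SpamAssassin's tokenizer; both function names are
hypothetical and exist only for this illustration. The first splits on
whitespace and trims surrounding punctuation, so a hostname stays whole. The
second splits on every non-word character, so the hostname falls apart.

```perl
use strict;
use warnings;

# Hypothetical helper, NOT SpamAssassin's tokenizer: split on whitespace
# and trim surrounding punctuation, so "www.example.com" stays one token.
sub tokenize_keep_hosts {
    my ($text) = @_;
    my @tokens;
    for my $word (split ' ', $text) {
        $word =~ s/^[^A-Za-z0-9]+//;    # strip leading punctuation
        $word =~ s/[^A-Za-z0-9]+$//;    # strip trailing punctuation
        push @tokens, lc($word) if length $word;
    }
    return @tokens;
}

# Naive alternative (also hypothetical): break on every non-word
# character, which splits the hostname into three separate tokens.
sub tokenize_naive {
    my ($text) = @_;
    return map { lc($_) } grep { length($_) } split /\W+/, $text;
}

print join(' | ', tokenize_keep_hosts('Visit www.example.com now!')), "\n";
print join(' | ', tokenize_naive('Visit www.example.com now!')), "\n";
```

With the naive version, "www" and "com" become tokens in their own right and
appear in nearly every message that contains a URL, which is presumably why
some massaging of this kind is worthwhile.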

Re: Philosophical question on Bayes (was Re: 23_bayes_ignore_header.cf)

Posted by Jeff Mincy <je...@delphioutpost.com>.
   From: Axb <ax...@gmail.com>
   Date: Tue, 14 Oct 2014 23:37:36 +0200
   
   On 10/14/2014 11:08 PM, Adam Katz wrote:
   >> On Tue, 14 Oct 2014 16:10:52 +0200 Axb <ax...@gmail.com> wrote:
   >>> and to avoid further discussions of what header may pollute bayes or
   >>> not, I've removed all header entries which are not directly related
   >>> to AV/filter products.
   >
   > On 10/14/2014 07:17 AM, David F. Skoll wrote:
   >> I'm not sure I agree with being too clever about Bayes.  Surely by its
   >> very nature, the Bayes algorithm will itself indicate which tokens
   >> are relevant and which are not?  Isn't that the whole point of Bayes?
   >>
   >> I think being too clever about massaging the data that gets fed to
   >> Bayes may be counter-productive.  For sure, *some* massaging is in order;
   >> a token should be a semantic unit, so something like "www.example.com"
   >> should probably be one token rather than three, but beyond that I wonder
   >> if it's good or not to massage the data?
   >
   > The purpose of bayes_ignore_header is twofold:
   >
   >   1. Prevent inheriting other systems' false positives (ensure better
   >      independence)
   >   2. Prevent relying upon headers that won't exist at delivery time (e.g.
   >      added by the mailbox server)
   >
   > This is why it's so important to ignore other spam engines, which
   > basically fit into both of those categories.
   
   I'd love to have the option (switch) to use Bayes on msg bodies ONLY, 
   though I doubt anybody would be a taker for such a project.
   (I'd even be willing to "$pon$or" such an addition to SA)
   
Wouldn't that be fairly easy to implement by intercepting the call to
_tokenize_headers in Plugin/Bayes.pm?

  # Tokenize the headers
  my %hdrs = $self->_tokenize_headers ($msg);
  while( my($prefix, $value) = each %hdrs ) {
    push(@tokens, $self->_tokenize_line ($value, "H$prefix:", 0));
  }

-jeff
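
A rough standalone sketch of what such a switch might look like. Everything
here is an assumption for illustration: the `body_only` option, the
`tokenize_message` wrapper and the two `_stub` functions are hypothetical
stand-ins, not the real methods in Plugin/Bayes.pm. The only idea it shows is
that skipping the header loop leaves body tokens alone.

```perl
use strict;
use warnings;

# Stand-ins for the real methods; these are NOT SpamAssassin's
# implementations, only stubs so the sketch runs on its own.
sub tokenize_headers_stub {
    my ($msg) = @_;
    return %{ $msg->{headers} };
}

sub tokenize_line_stub {
    my ($line, $prefix) = @_;
    return map { $prefix . $_ } split ' ', $line;
}

# Hypothetical wrapper: when $opts{body_only} is true, the header loop is
# skipped, so only body tokens are returned.
sub tokenize_message {
    my ($msg, %opts) = @_;
    my @tokens = map { tokenize_line_stub($_, '') } @{ $msg->{body} };

    unless ($opts{body_only}) {
        my %hdrs = tokenize_headers_stub($msg);
        for my $prefix (sort keys %hdrs) {
            push @tokens, tokenize_line_stub($hdrs{$prefix}, "H$prefix:");
        }
    }
    return @tokens;
}

my $msg = {
    headers => { 'X-Mailer' => 'ExampleMailer 1.0' },
    body    => ['cheap pills here'],
};

my @all       = tokenize_message($msg);
my @body_only = tokenize_message($msg, body_only => 1);
print scalar(@all), " tokens with headers, ",
      scalar(@body_only), " tokens body-only\n";
```

In the real plugin the switch would presumably have to be a configuration
setting read by the plugin, and an existing Bayes database trained with
header tokens would still contain them.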

Re: Philosophical question on Bayes (was Re: 23_bayes_ignore_header.cf)

Posted by Reindl Harald <h....@thelounge.net>.
On 14.10.2014 at 23:37, Axb wrote:
> On 10/14/2014 11:08 PM, Adam Katz wrote:
>>> On Tue, 14 Oct 2014 16:10:52 +0200 Axb <ax...@gmail.com> wrote:
>>>> and to avoid further discussions of what header may pollute bayes or
>>>> not, I've removed all header entries which are not directly related
>>>> to AV/filter products.
>>
>> On 10/14/2014 07:17 AM, David F. Skoll wrote:
>>> I'm not sure I agree with being too clever about Bayes.  Surely by its
>>> very nature, the Bayes algorithm will itself indicate which tokens
>>> are relevant and which are not?  Isn't that the whole point of Bayes?
>>>
>>> I think being too clever about massaging the data that gets fed to
>>> Bayes may be counter-productive.  For sure, *some* massaging is in
>>> order;
>>> a token should be a semantic unit, so something like "www.example.com"
>>> should probably be one token rather than three, but beyond that I wonder
>>> if it's good or not to massage the data?
>>
>> The purpose of bayes_ignore_header is twofold:
>>
>>   1. Prevent inheriting other systems' false positives (ensure better
>>      independence)
>>   2. Prevent relying upon headers that won't exist at delivery time (e.g.
>>      added by the mailbox server)
>>
>> This is why it's so important to ignore other spam engines, which
>> basically fit into both of those categories.
>
> I'd love to have the option (switch) to use Bayes on msg bodies ONLY,
> though I doubt anybody would be a taker for such a project.
> (I'd even be willing to "$pon$or" such an addition to SA)

or something like the opposite of what we have now:

bayes_include_header received
bayes_include_header subject
bayes_include_header x-mailer
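
The `bayes_include_header` lines above are a proposal from this message, not
an option this sketch assumes SpamAssassin provides. As a minimal
illustration of the allow-list idea, a hypothetical `filter_headers` helper
could keep only the named headers, matching names case-insensitively:

```perl
use strict;
use warnings;

# Hypothetical allow-list filter, not part of SpamAssassin: return only the
# headers whose names appear in @include (compared case-insensitively).
sub filter_headers {
    my ($headers, @include) = @_;
    my %wanted = map { lc($_) => 1 } @include;
    return map  { $_ => $headers->{$_} }
           grep { $wanted{ lc($_) } } keys %$headers;
}

# Illustrative message headers; the values are made up for the example.
my %headers = (
    'Received'      => 'from mail.example.com',
    'Subject'       => 'Hello',
    'X-Mailer'      => 'ExampleMailer 1.0',
    'X-Spam-Status' => 'Yes, score=9.1',
);

my %kept = filter_headers(\%headers, qw(received subject x-mailer));
print join(', ', sort keys %kept), "\n";
```

An allow-list means any header not named is ignored by default, so a new
upstream filter header would never reach Bayes unless someone adds it.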


Re: Philosophical question on Bayes (was Re: 23_bayes_ignore_header.cf)

Posted by Axb <ax...@gmail.com>.
On 10/14/2014 11:08 PM, Adam Katz wrote:
>> On Tue, 14 Oct 2014 16:10:52 +0200 Axb <ax...@gmail.com> wrote:
>>> and to avoid further discussions of what header may pollute bayes or
>>> not, I've removed all header entries which are not directly related
>>> to AV/filter products.
>
> On 10/14/2014 07:17 AM, David F. Skoll wrote:
>> I'm not sure I agree with being too clever about Bayes.  Surely by its
>> very nature, the Bayes algorithm will itself indicate which tokens
>> are relevant and which are not?  Isn't that the whole point of Bayes?
>>
>> I think being too clever about massaging the data that gets fed to
>> Bayes may be counter-productive.  For sure, *some* massaging is in order;
>> a token should be a semantic unit, so something like "www.example.com"
>> should probably be one token rather than three, but beyond that I wonder
>> if it's good or not to massage the data?
>
> The purpose of bayes_ignore_header is twofold:
>
>   1. Prevent inheriting other systems' false positives (ensure better
>      independence)
>   2. Prevent relying upon headers that won't exist at delivery time (e.g.
>      added by the mailbox server)
>
> This is why it's so important to ignore other spam engines, which
> basically fit into both of those categories.

I'd love to have the option (switch) to use Bayes on msg bodies ONLY, 
though I doubt anybody would be a taker for such a project.
(I'd even be willing to "$pon$or" such an addition to SA)


Re: Philosophical question on Bayes (was Re: 23_bayes_ignore_header.cf)

Posted by Adam Katz <an...@khopis.com>.
> On Tue, 14 Oct 2014 16:10:52 +0200 Axb <ax...@gmail.com> wrote:
>> and to avoid further discussions of what header may pollute bayes or
>> not, I've removed all header entries which are not directly related
>> to AV/filter products.

On 10/14/2014 07:17 AM, David F. Skoll wrote:
> I'm not sure I agree with being too clever about Bayes.  Surely by its
> very nature, the Bayes algorithm will itself indicate which tokens
> are relevant and which are not?  Isn't that the whole point of Bayes?
>
> I think being too clever about massaging the data that gets fed to
> Bayes may be counter-productive.  For sure, *some* massaging is in order;
> a token should be a semantic unit, so something like "www.example.com"
> should probably be one token rather than three, but beyond that I wonder
> if it's good or not to massage the data?

The purpose of bayes_ignore_header is twofold:

 1. Prevent inheriting other systems' false positives (ensure better
    independence)
 2. Prevent relying upon headers that won't exist at delivery time (e.g.
    added by the mailbox server)

This is why it's so important to ignore other spam engines, which
basically fit into both of those categories.
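
For reference, a config fragment along these lines is what the two points
above translate to. The header names are placeholders only; substitute
whatever your upstream filters or mailbox server actually add:

```
# Placeholder header names; replace with the headers your own upstream
# filters or mailbox server insert.
bayes_ignore_header X-Upstream-Spamfilter
bayes_ignore_header X-Upstream-SomethingElse
```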



Re: Philosophical question on Bayes (was Re: 23_bayes_ignore_header.cf)

Posted by Axb <ax...@gmail.com>.
On 10/14/2014 04:17 PM, David F. Skoll wrote:
> On Tue, 14 Oct 2014 16:10:52 +0200
> Axb <ax...@gmail.com> wrote:
>
>> and to avoid further discussions of what header may pollute bayes or
>> not, I've removed all header entries which are not directly related
>> to AV/filter products.
>
> I'm not sure I agree with being too clever about Bayes.  Surely by its
> very nature, the Bayes algorithm will itself indicate which tokens
> are relevant and which are not?  Isn't that the whole point of Bayes?
>
> I think being too clever about massaging the data that gets fed to
> Bayes may be counter-productive.  For sure, *some* massaging is in order;
> a token should be a semantic unit, so something like "www.example.com"
> should probably be one token rather than three, but beyond that I wonder
> if it's good or not to massage the data?

David,

The "bayes_ignore" file will not become a part of SA default .cf files.
My intention is to keep it in a central repository in case somebody else
wants to use it, instead of maintaining it only in my local repo.

I believe in *some* massaging, as in "works for me".

I assume it depends on how you feed bayes and what kind of traffic you 
deal with.

The concept of keeping Bayes from learning other filters' stuff is
ancient (there's a commented example in local.cf), but as with so much
in SA tuning, it's trial and possible error till you feel cozy.