Posted to users@spamassassin.apache.org by Justin Mason <jm...@jmason.org> on 2007/06/01 11:22:05 UTC

Bayes combining and OCR (Was Re: SpamAssassin 3.2 compatiblity)

Matthias Keller writes:
> Nix wrote:
> > On 31 May 2007, Graham Murray said:
> >
> >   
> >> Nix <ni...@esperi.org.uk> writes:
> >>
> >>     
> >>> (And, let's be blunt, the pure this-word-is-spammy recognition part of
> >>> FuzzyOCR is much less smart than the Bayesian system already present
> >>> in SA: FuzzyOCR should really use the Bayesian system to determine the
> >>> spamminess of words, I suppose...)
> >>>       
> >> Or even just act as a MIME part 'decoding' system (like Base64) and feed
> >> all words it finds in images into Bayes, along with all other text in
> >> the mail, rather than generating a score itself.
> >>     
> >
> > Perhaps so, but if so those words should have a score-multiplier of some
> > sort applied, because the fact that those words originated in images is
> > itself an obfuscation technique that should be noted in the score.
> >   
> This has been discussed here again and again and again
> 
> first of all, these 10 words found in an image cannot stand against the 
> bayes poisoning found in all these messages - so it would literally be 
> useless for bayes filtering

by the way, this is a common misconception of how our Bayes system works;
what *should* happen is that the "poison" text winds up with "weak"
Bayesian probability scores between 0.2 and 0.8, since it uses words that
also appear in ham (hence why it appears as poison).  However, the OCR'd
text would wind up with "strong" scores around 0.99 or greater.

The chi-square probability combining algorithm we use takes care of this,
by discounting the "weak" clues and taking more account of the "strong"
clues.  (This is what makes it a more effective combining algorithm for
Bayes than the traditional Graham style.)
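
For illustration only, here is a minimal Python sketch of Fisher/Robinson-style chi-square combining, the general technique referred to above. It is not SpamAssassin's actual code: the function names `chi2q` and `chi_combine` and the sample token probabilities are assumptions made up for this example, and any token selection or thresholding a real filter does is left out.

```python
import math


def chi2q(x2, dof):
    """Upper-tail probability of a chi-square variable, for even dof only."""
    assert dof % 2 == 0, "this closed form only holds for even degrees of freedom"
    m = x2 / 2.0
    term = math.exp(-m)
    total = term
    for i in range(1, dof // 2):
        term *= m / i
        total += term
    # Rounding can push the sum marginally above 1.0.
    return min(total, 1.0)


def chi_combine(probs):
    """Combine per-token spam probabilities (each strictly between 0 and 1)."""
    n = len(probs)
    if n == 0:
        return 0.5  # no evidence either way
    # Fisher's method, applied once for "spamminess" and once for "hamminess".
    spam_stat = -2.0 * sum(math.log(1.0 - p) for p in probs)
    ham_stat = -2.0 * sum(math.log(p) for p in probs)
    s = 1.0 - chi2q(spam_stat, 2 * n)
    h = 1.0 - chi2q(ham_stat, 2 * n)
    return (1.0 + s - h) / 2.0


# Hypothetical inputs: 20 "weak" poison tokens and 3 "strong" tokens.
weak = [0.3, 0.7, 0.4, 0.6] * 5
strong = [0.99] * 3

print(chi_combine(weak))           # weak clues alone stay close to 0.5
print(chi_combine(weak + strong))  # a few strong clues pull the result well above 0.5
```

With these made-up numbers, the weak tokens balance each other out, while three strong tokens are enough to move the combined result clearly toward spam.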

Note: this relies on the use of a different "namespace" for OCR-discovered
words, btw; ie. if the words "make money fast" are found in OCR'd text,
it'd generate "OCR:make", "OCR:money", "OCR:fast".  If the OCR-discovered
words are just thrown in with normal text words, that wouldn't work.
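
As a rough sketch of that namespacing idea, assuming a hypothetical helper called `namespace_ocr_tokens` (not part of SpamAssassin or FuzzyOCR) and simple whitespace tokenization:

```python
def namespace_ocr_tokens(ocr_text, prefix="OCR:"):
    """Prefix each word recovered by OCR so it is learned separately
    from the same word appearing in ordinary body text."""
    return [prefix + word for word in ocr_text.split()]


body_tokens = "make money fast".split()
ocr_tokens = namespace_ocr_tokens("make money fast")

print(body_tokens)
print(ocr_tokens)
```

Because "OCR:money" and "money" are distinct tokens, the Bayes database can learn a strong spam probability for the former even if the latter also turns up in ham.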

--j.

Re: Bayes combining and OCR (Was Re: SpamAssassin 3.2 compatiblity)

Posted by Matthias Keller <li...@matthias-keller.ch>.
Justin Mason wrote:
> Matthias Keller writes:
>   
>> Nix wrote:
>>     
>>> On 31 May 2007, Graham Murray said:
>>>
>>>   
>>>       
>>>> Nix <ni...@esperi.org.uk> writes:
>>>>
>>>>     
>>>>         
>>>>> (And, let's be blunt, the pure this-word-is-spammy recognition part of
>>>>> FuzzyOCR is much less smart than the Bayesian system already present
>>>>> in SA: FuzzyOCR should really use the Bayesian system to determine the
>>>>> spamminess of words, I suppose...)
>>>>>       
>>>>>           
>>>> Or even just act as a MIME part 'decoding' system (like Base64) and feed
>>>> all words it finds in images into Bayes, along with all other text in
>>>> the mail, rather than generating a score itself.
>>>>     
>>>>         
>>> Perhaps so, but if so those words should have a score-multiplier of some
>>> sort applied, because the fact that those words originated in images is
>>> itself an obfuscation technique that should be noted in the score.
>>>   
>>>       
>> This has been discussed here again and again and again
>>
>> first of all, these 10 words found in an image cannot stand against the 
>> bayes poisoning found in all these messages - so it would literally be 
>> useless for bayes filtering
>>     
>
> by the way, this is a common misconception of how our Bayes system works;
> what *should* happen is that the "poison" text winds up with "weak"
> Bayesian probability scores between 0.2 and 0.8, since it uses words that
> also appear in ham (hence why it appears as poison).  However, the OCR'd
> text would wind up with "strong" scores around 0.99 or greater.
>
> The chi-square probability combining algorithm we use takes care of this,
> by discounting the "weak" clues and taking more account of the "strong"
> clues.  (This is what makes it a more effective combining algorithm for
> Bayes than the traditional Graham style.)
>   
It would be nice if that worked, but it doesn't for me. I don't know how 
the algorithm works, but I have observed its results.
I trained on dozens of spams with nearly identical spam text (only the 
spam content, not the poisoning), and an identical mail WITH random text 
got a Bayes score of 0.500, so it really doesn't work for me.

Matt