Posted to users@spamassassin.apache.org by Nix <ni...@esperi.org.uk> on 2007/06/01 09:21:16 UTC

Re: SpamAssassin 3.2 compatibility

On 31 May 2007, Graham Murray said:

> Nix <ni...@esperi.org.uk> writes:
>
>> (And, let's be blunt, the pure this-word-is-spammy recognition part of
>> FuzzyOCR is much less smart than the Bayesian system already present
>> in SA: FuzzyOCR should really use the Bayesian system to determine the
>> spamminess of words, I suppose...)
>
> Or even just act as a MIME part 'decoding' system (like Base64) and feed
> all words it finds in images into Bayes, along with all other text in
> the mail, rather than generating a score itself.

Perhaps so, but if so those words should have a score-multiplier of some
sort applied, because the fact that those words originated in images is
itself an obfuscation technique that should be noted in the score.

-- 
`On a scale of one to ten of usefulness, BBC BASIC was several points ahead
 of the competition, scoring a relatively respectable zero.' --- Peter Corlett

Re: SpamAssassin 3.2 compatibility

Posted by Matthias Keller <li...@matthias-keller.ch>.
Nix wrote:
> On 31 May 2007, Graham Murray said:
>
>> Nix <ni...@esperi.org.uk> writes:
>>
>>> (And, let's be blunt, the pure this-word-is-spammy recognition part of
>>> FuzzyOCR is much less smart than the Bayesian system already present
>>> in SA: FuzzyOCR should really use the Bayesian system to determine the
>>> spamminess of words, I suppose...)
>>
>> Or even just act as a MIME part 'decoding' system (like Base64) and feed
>> all words it finds in images into Bayes, along with all other text in
>> the mail, rather than generating a score itself.
>
> Perhaps so, but if so those words should have a score-multiplier of some
> sort applied, because the fact that those words originated in images is
> itself an obfuscation technique that should be noted in the score.

This has been discussed here again and again and again.

First of all, these 10 words found in an image cannot stand against the
Bayes poisoning found in all these messages, so they would be practically
useless for Bayes filtering.

Secondly, the hit rate of the OCR is pretty bad, so we cannot use exact
matches. That's exactly why this app is named FUZZYocr, compared to the
original version, which wasn't fuzzy. That's why we have such high hit
rates with it: it can still find these words even if one or two
letters are wrong. Try to do that with regular expressions and it gets
ugly and big quite fast.
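
To illustrate what "still find these words even if one or two letters are wrong" means, here is a minimal sketch in Python. It is not FuzzyOCR's actual code: the function names, the word list and the threshold of 2 errors are all assumptions made for the example, and plain Levenshtein distance stands in for whatever matching the plugin really does.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]


def fuzzy_hits(ocr_words, wordlist, max_errors=2):
    """Return (ocr_word, list_word) pairs that differ by at most max_errors edits.

    max_errors=2 is an assumed threshold for this sketch only.
    """
    hits = []
    for word in ocr_words:
        for target in wordlist:
            if edit_distance(word.lower(), target.lower()) <= max_errors:
                hits.append((word, target))
                break  # one match per OCR word is enough
    return hits


if __name__ == "__main__":
    # Hypothetical OCR output with misread letters.
    print(fuzzy_hits(["V1AGRA", "hello", "c1al1s"], ["viagra", "cialis"]))
```

A fixed error budget like this is too generous for very short words, so a real matcher would probably scale the allowed errors with word length.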

FuzzyOCR is perfect just the way it is. It might need some tweaking,
yes, but then it can do exactly what you want. If you want an upper
limit, just hack the source and add it; it's not too hard. I've added a
few tweaks myself. For example, don't stop as soon as the minimum number
of words has been found with one scanset, but continue until minimum+10
have been found. I don't want it to stop at 2 words if a later scanset
could find 15.
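
The scanset tweak could be sketched roughly like this in Python. This is only an illustration of the idea, not the plugin's Perl code: `best_scanset_hits`, `count_hits` and the parameter names are hypothetical, and it assumes the result is the best count from any single scanset rather than a running total.

```python
def best_scanset_hits(scansets, count_hits, minimum=2, extra=10):
    """Try scansets in order and return the best word-hit count seen.

    The unmodified behaviour described above would stop once `minimum`
    hits are reached. Here we only stop early once minimum + extra hits
    are reached, so a later scanset gets a chance to find more words.
    `count_hits` is a hypothetical callable that runs one scanset and
    returns how many listed words it matched.
    """
    best = 0
    for scanset in scansets:
        best = max(best, count_hits(scanset))
        if best >= minimum + extra:
            break  # plenty of hits, no need to run the remaining scansets
    return best


if __name__ == "__main__":
    # Made-up hit counts per scanset, for demonstration only.
    fake_results = {"scan-a": 2, "scan-b": 15, "scan-c": 30}
    print(best_scanset_hits(fake_results, fake_results.get))
```

The price is more OCR runs per image, since a message that only just reaches the minimum now goes through every scanset.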

I agree, an upper bound would be quite interesting for a few folks
(actually I don't mind having a FuzzyOCR hit with 20+ hits; that's just
perfect, because the FP rate has been zero so far), and it shouldn't
be too hard to add. So you might officially request this for a future
version or, like I said, just do it yourself. If you can't do it, I might
have a look and give you a hint in the right direction, even though I'm
not really a good Perl programmer.
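
As a hint at what such an upper bound might look like, here is a rough Python sketch. It assumes the score grows with the number of matched words above a threshold and that the bound is simply a cap on the final score; the function name, parameters and default values are all invented for the example and are not FuzzyOCR settings.

```python
def capped_score(hits, threshold=2, base=5.0, per_extra=1.0, cap=10.0):
    """Score an image by word hits, never exceeding `cap`.

    All numbers are illustrative assumptions: below `threshold` hits the
    image scores nothing, at the threshold it scores `base`, and each
    further hit adds `per_extra` until the cap is reached.
    """
    if hits < threshold:
        return 0.0
    raw = base + per_extra * (hits - threshold)
    return min(raw, cap)


if __name__ == "__main__":
    for n in (1, 2, 4, 25):
        print(n, capped_score(n))
```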


Matt