Posted to users@spamassassin.apache.org by Adam Katz <an...@khopis.com> on 2009/04/26 20:37:06 UTC

A rant about FUZZY_OCR

> On Fri, Apr 24, 2009 at 05:14:21PM -0400, Adam Katz wrote:
>> I wouldn't trust FUZZY_OCR with anything.  12 points is *WAY* too high
>> for any single thing.  I had to disable this plugin a year or three
>> ago because it assigned 20+ points to legit screenshots in ham (and
>> that was /after/ I trimmed its flagging words file down in size)!

Henrik K wrote:
> You do realize that it's configurable? Who to blame if you just run
> things blindly.

I expect the defaults to at least border on sane.  As noted before, I've
tried and failed to configure it.  Could you point me at where the
configuration options are specified, specifically focr_threshold?  All I
see is the installation manual and the .cf file, neither of which is
terribly informative (unlike, say, the perldoc pages for other plugins).

Searching for it (http://google.com/search?q=FUZZY_OCR) finds an
OVERWHELMING MAJORITY of hits describing false positives and
configuration issues.  The official documentation didn't even make it to
the top 100 hits in Google, and after finding it on the SA wiki (google
hit #59), I found it sparse at best (I had to dive into the svn repo!).

The FAQ, which features only two answered questions, includes an
un-answered question about how to cap the score, which IMHO is a
mission-critical feature.

I don't know if I still have the example of the bad hit from those years
ago, but it made absolutely no sense, hitting dozens of "words found"
that did not exist ... and this was a PNG screen capture, not even a
photo or a JPEG-compressed image.  My company deals with screen captures
a LOT, and I just can't afford for such a poorly designed plugin to run
amok the way Fuzzy OCR does.

It's extremely disturbing that there are several tests (which is a good
thing), but none of them are designed to test for false positives, or
even to help you tweak the detection threshold.  You're left guessing
what reasonable levels are, especially when the config file (the best
docs I could find) points you at the manual (which I believe is the
install guide, which doesn't even include the string "thresh").

The last release was two years ago, and even on the svn trunk, the word
list hasn't been updated ... ever (excepting minor tweaks like a
threshold change from 0.1 to 0.01).  How is this fair?

The claim that FUZZY_OCR can't use the Bayesian database is a weak one,
too; just add a custom prefix to the tokens it creates.  (I don't know
SA's bayes token syntax, but other implementations use things like
"subject:foo" to indicate that the word "foo" in the subject differs
from the word "foo" elsewhere, so you could have "fuzzyocr:foo"
instead.)  Implement the fuzziness by inserting a dozen tokens for each
possible parsing.  This would solve the issue of stale or inappropriate
word lists.
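
A minimal sketch of the prefixed-token idea, in Python for illustration
only.  The "fuzzyocr:" prefix, the confusable-character table, and the
function names are all assumptions made up for this example; none of it
reflects SpamAssassin's actual Bayes tokenizer or the plugin's code.

```python
from typing import Iterable, List

# Hypothetical prefix; SpamAssassin's real Bayes token syntax may differ.
OCR_PREFIX = "fuzzyocr:"

# Assumed table of characters an OCR pass might confuse (illustrative only).
CONFUSABLE = {"0": "o", "1": "i", "5": "s", "@": "a", "$": "s"}


def candidate_readings(word: str) -> List[str]:
    """Return the raw OCR reading plus one normalized reading, lower-cased."""
    raw = word.lower()
    normalized = "".join(CONFUSABLE.get(ch, ch) for ch in raw)
    readings = [raw]
    if normalized != raw:
        readings.append(normalized)
    return readings


def ocr_tokens(words: Iterable[str]) -> List[str]:
    """Emit one prefixed token per candidate reading, without duplicates."""
    tokens: List[str] = []
    for word in words:
        for reading in candidate_readings(word):
            token = OCR_PREFIX + reading
            if token not in tokens:
                tokens.append(token)
    return tokens
```

The point of the prefix is that a word recognized inside an image is
learned separately from the same word in the body text, so the Bayes
database would decide which OCR words matter instead of a static list.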

Finally, I have no way of testing the thing live.  Since FUZZY_OCR is a
dynamically scored rule, I can't just push it to 0.001 and see the hits,
the way I can with the BAYES_XX thresholds for example.  (Sure, I can
make all score-changing values 0.001, but I'm not sure that would
properly test it, and given my past experiences, I wouldn't be surprised
if this still causes problems.)

It's a great idea, but I'd like to see it mature some first, especially
with respect to its documentation, test emails, word list, and live testing.

Re: [sa-list] Re: A rant about FUZZY_OCR

Posted by John Hardin <jh...@impsec.org>.

On Mon, 27 Apr 2009, Dan Mahoney, System Admin wrote:

> 3) Wordlists loadable from userprefs, if not bayes.

Along with that, the detected words should be (somehow) fed into bayes for 
analysis along with the other message text.

We touched on that last time fuzzyOCR was active.

-- 
  John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
  jhardin@impsec.org    FALaholic #11174     pgpk -a jhardin@impsec.org
  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
   Vista is at best mildly annoying and at worst makes you want to
   rush to Redmond, Wash. and rip somebody's liver out.      -- Forbes
-----------------------------------------------------------------------
  96 days since Obama's inauguration and still no unicorn!

Re: [sa-list] Re: A rant about FUZZY_OCR

Posted by "Dan Mahoney, System Admin" <da...@prime.gushi.org>.

On Mon, 27 Apr 2009, Jo Rhett wrote:

> On Apr 27, 2009, at 1:16 PM, Dan Mahoney, System Admin wrote:
>> The problem exists now, there is PNG spam, and there will continue to be, 
>> because it gets through.  Right now the only way I find this blocked is if 
>> spamcop blocks it.
>
>
> Just as a point of reference, I'd like to note that we haven't bothered with 
> FuzzyOCR here and absolute none of the spam which reaches my inbox is a PNG 
> or JPG or GIF spam.   SA does block it, and it does so without FuzzyOCR.
>
> That said, we have jacked the scores for e-mail with images and no text and 
> that might be why.   We never, ever receive valid e-mail with no text in it.

The spam I've been getting contains text, lots of it.  Markov-chain-like 
crap that is 100 percent irrelevant to the image.

-Dan


-- 

"She's NOT my girlfriend!"

-Dan Mahoney, Quite a bit recently.

--------Dan Mahoney--------
Techie,  Sysadmin,  WebGeek
Gushi on efnet/undernet IRC
ICQ: 13735144   AIM: LarpGM
Site:  http://www.gushi.org
---------------------------

Re: A rant about FUZZY_OCR

Posted by LuKreme <kr...@kreme.com>.

On 27-Apr-2009, at 16:06, Jo Rhett wrote:
> On Apr 27, 2009, at 1:16 PM, Dan Mahoney, System Admin wrote:
>> The problem exists now, there is PNG spam, and there will continue  
>> to be, because it gets through.  Right now the only way I find this  
>> blocked is if spamcop blocks it.
>
> Just as a point of reference, I'd like to note that we haven't  
> bothered with FuzzyOCR here and absolute none of the spam which  
> reaches my inbox is a PNG or JPG or GIF spam.   SA does block it,  
> and it does so without FuzzyOCR.

Yeah, I've not seen an image spam in my mailboxes in a long time.  I  
figured people were getting spam I'm not getting...

> We never, ever receive valid e-mail with no text in it.

Oh, I do all the time, but it's from people whom the AWL scores well  
down, pulling them out of spam range (My brother often sends me silly  
pictures with nothing else in the email).

BTW, is there any way to see what the AWL adjustment is for a  
particular email or for a specific sender couplet?

-- 
Anybody who could duck the Vietnam war can certainly duck a couple of
shoes. -- Chris Gehlker

Re: [sa-list] Re: A rant about FUZZY_OCR

Posted by Jo Rhett <jr...@netconsonance.com>.

On Apr 27, 2009, at 1:16 PM, Dan Mahoney, System Admin wrote:
> The problem exists now, there is PNG spam, and there will continue  
> to be, because it gets through.  Right now the only way I find this  
> blocked is if spamcop blocks it.

Just as a point of reference, I'd like to note that we haven't  
bothered with FuzzyOCR here and absolutely none of the spam which  
reaches my inbox is a PNG or JPG or GIF spam.   SA does block it, and  
it does so without FuzzyOCR.

That said, we have jacked the scores for e-mail with images and no  
text and that might be why.   We never, ever receive valid e-mail with  
no text in it.
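
For anyone wanting to try the same heuristic, here is a rough sketch of
an "images and no text" check using Python's standard email package.
It is not how our rules are written and it is not a SpamAssassin rule;
the function names are hypothetical, and treating any non-whitespace
text part (HTML included) as "text" is an assumption of this example.

```python
from email import message_from_string
from email.message import Message


def is_image_only(msg: Message) -> bool:
    """True if the message has at least one image part and no non-empty text part."""
    has_image = False
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers carry no content of their own
        maintype = part.get_content_maintype()
        if maintype == "image":
            has_image = True
        elif maintype == "text":
            payload = part.get_payload(decode=True) or b""
            if payload.strip():
                return False  # real text present, so not image-only
    return has_image


def is_image_only_raw(raw: str) -> bool:
    """Convenience wrapper for a raw RFC 822 message string."""
    return is_image_only(message_from_string(raw))
```

A check like this does nothing for image spam that is padded with
unrelated filler text.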

-- 
Jo Rhett
Net Consonance : consonant endings by net philanthropy, open source  
and other randomness

Re: [sa-list] Re: A rant about FUZZY_OCR

Posted by "Dan Mahoney, System Admin" <da...@prime.gushi.org>.

On Mon, 27 Apr 2009, Henrik K wrote:
> Nothing of this makes sense. If you don't have a test server, too bad. If
> you don't trust the "score-changing values" too bad. It all worked for me.
>
>> It's a great idea, but I'd like to see it mature some first, especially
>> with respect to its documentation, test emails, word list, and live testing.
>
> If was quickly developed to an ongoing problem. The problem disappeared
> years ago. It was mature enough for 99% of users at that time. Though it did
> add lots of complexity and stricter MTA rules etc handled the job just fine
> also.

The problem exists now, there is PNG spam, and there will continue to be, 
because it gets through.  Right now the only way I find this blocked is if 
spamcop blocks it.

Ideally, what I'd probably like to see with regard to fuzzyOCR are:

1) Just patch it enough to work with 3.2 and 3.3 -- I don't have the 
internals know-how to do this, and I don't know if Decoder still reads 
this list.

2) A debug mode, whereby the plugin would note its own score, possibly by 
applying an equal negative value.

3) Wordlists loadable from userprefs, if not bayes.

4) A recommended configuration, along with "shortcircuit" documentation.
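
To make item 2 concrete, here is a small sketch of what I mean by
applying an equal negative value.  It is plain Python, not plugin code;
the rule names, the "_DEBUG_OFFSET" suffix, and both functions are
hypothetical.

```python
from typing import Dict


def record_hit(hits: Dict[str, float], rule: str, score: float,
               debug: bool = False) -> None:
    """Record a rule's score; in debug mode also record a cancelling offset."""
    hits[rule] = score
    if debug:
        # The equal negative value keeps the hit visible in the report
        # while leaving the message's total score unchanged.
        hits[rule + "_DEBUG_OFFSET"] = -score


def total(hits: Dict[str, float]) -> float:
    """Sum every recorded score, offsets included."""
    return sum(hits.values())
```

With that, the report would still show what the plugin wanted to score,
but the verdict would be the same as if the plugin were not installed.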

-Dan

-- 

"Ca. Tas. Tro. Phy."

-John Smedley, March 28th 1998, 3AM

--------Dan Mahoney--------
Techie,  Sysadmin,  WebGeek
Gushi on efnet/undernet IRC
ICQ: 13735144   AIM: LarpGM
Site:  http://www.gushi.org
---------------------------

Re: A rant about FUZZY_OCR

Posted by Henrik K <he...@hege.li>.

On Sun, Apr 26, 2009 at 02:37:06PM -0400, Adam Katz wrote:
> > On Fri, Apr 24, 2009 at 05:14:21PM -0400, Adam Katz wrote:
> >> I wouldn't trust FUZZY_OCR with anything.  12 points is *WAY* too high
> >> for any single thing.  I had to disable this plugin a year or three
> >> ago because it assigned 20+ points to legit screenshots in ham (and
> >> that was /after/ I trimmed its flagging words file down in size)!
> 
> Henrik K wrote:
> > You do realize that it's configurable? Who to blame if you just run
> > things blindly.
> 
> I expect the defaults to at least border on sane.  As noted before, I've
> tried and failed to configure it.  Could you point me at where the
> configuration options are specified, specifically focr_threshold?  All I
> see is the installation manual and the .cf file, neither of which is
> terribly informative (like say the perldoc pages for other plugins).

Unfortunately it's not a sane world. But if you have any logic, you will see
that focr_base_score and focr_add_score mainly make up the score. One can
argue that the popular "botnet" plugin also doesn't have sane defaults.
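
As a rough model of that, assuming the score is a base value plus a
fixed increment per additional matched word (a simplification, not a
copy of the plugin's code), it can be sketched like this.  The
required_hits and max_score parameters are hypothetical additions for
the example, and the numbers used with it are illustrative, not the
plugin's defaults.

```python
from typing import Optional


def dynamic_score(hits: int,
                  base_score: float,
                  add_score: float,
                  required_hits: int = 1,
                  max_score: Optional[float] = None) -> float:
    """Assumed model: base score once enough words match, plus an increment per extra word."""
    if hits < required_hits:
        return 0.0
    score = base_score + add_score * (hits - required_hits)
    if max_score is not None:
        # Hypothetical cap; the thread notes the FAQ leaves capping unanswered.
        score = min(score, max_score)
    return score
```

Under this model, lowering the two values is what keeps a message with
many matched words from running away with the score.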

> I don't know if I still have the example of the bad hit from those years
> ago, but it made absolutely no sense, hitting dozens of "words found"
> that did not exist ... and this was a PNG screen capture, not even a
> photo or a JPEG-compressed image.  My company deals with screen captures
> a LOT, and I just can't afford for such a poorly designed plugin to run
> amok the way Fuzzy OCR does.

I'm sorry that you are disappointed with the design. If you need "mission
critical" code, then you must expect that code people generously write in
their spare time for free might have a few kinks. Were you on the fuzzyocr
mailing list a few years ago, participating in the development process?

> It's extremely disturbing that there are several tests (which is a good
> thing), but none of them are designed to test for false positives, or
> even to help you tweak the detection threshold.  You're left guessing
> what reasonable levels are, especially when the config file (the best
> docs I could find) points you at the manual (which I believe is the
> install guide, which doesn't even include the string "thresh").
> The last release was two years ago, and even on the svn trunk, the word
> list hasn't been updated ... ever (excepting minor tweaks like a
> threshold change from 0.1 to 0.01).  How is this fair?

The plugin was last needed a few years ago. Why should it still be updated
to this day when there has been no image spam? There is not much point in
making general word lists. You put in what your mail flow sees. Someone from
a medical company could be using it and come screaming about the "bad defaults".

> The claim that FUZZY_OCR can't use the Bayesian database is a weak one,
> too; just make a custom prefix to the tokens it creates (I don't know
> SA's bayes token syntax, but other implementations use things like
> "subject:foo" to indicate that the word "foo" in the subject differs
> from the word "foo" elsewhere, so you could have "fuzzyocr:foo"
> instead).  Implement the fuzziness by inserting a dozen tokens for each
> possible parsing.)  This would solve the issue of stale or inappropriate
> word lists.

You are free to contribute code. If I remember right, someone might have
been trying it; maybe some discussion can be found in the mailing list archives.

> Finally, I have no way of testing the thing live.  Since FUZZY_OCR is a
> dynamically scored rule, I can't just push it to 0.001 and see the hits,
> the way I can with the BAYES_XX thresholds for example.  (Sure, I can
> make all score-changing values 0.001, but I'm not sure that would
> properly test it, and given my past experiences, I wouldn't be surprised
> if this still causes problems.)

None of this makes sense. If you don't have a test server, too bad. If
you don't trust the "score-changing values", too bad. It all worked for me.

> It's a great idea, but I'd like to see it mature some first, especially
> with respect to its documentation, test emails, word list, and live testing.

It was quickly developed in response to an ongoing problem. The problem
disappeared years ago. It was mature enough for 99% of users at that time,
though it did add lots of complexity, and stricter MTA rules etc. handled
the job just fine as well.

Cheers,
Henrik