Posted to users@spamassassin.apache.org by Richard Mealing <ri...@fastnet.co.uk> on 2016/08/31 12:55:15 UTC

Image spam - FuzzyOCR?

Hi everyone,

I am looking at FuzzyOCR to detect more image spam and I had a couple of questions:


1)      Is this being used? Does it detect image spam, or should I be looking at something else?

2)      I'm getting some horny date spam coming through with just images and text inside an image at the bottom. My bayes seems to be scoring this with -1.90 Bayes_00. I keep sending this to my database as spam but I'm not sure how many I need to feed it and I don't get much. Are there any other means of feeding bayes with image spam (or any spam really) from a source on the internet? Or is that a bad idea since that's not my spam?

3)      If I use Fuzzy OCR on FreeBSD, how does it get updated?

4)      I installed it from the ports and I had to install tesseract or I got a dependency warning message. Now I still get a warning - warn: FuzzyOcr: Cannot find executable for gifinter - Is this normal? How should I omit this error since I can't find gifinter in the ports tree?

Thanks,
Rich


RE: Image spam - FuzzyOCR?

Posted by Richard Mealing <ri...@fastnet.co.uk>.
>-----Original Message-----
>From: Matus UHLAR - fantomas [mailto:uhlar@fantomas.sk] 
>Sent: Thursday, September 1, 2016 14:30
>To: users@spamassassin.apache.org
>Subject: Re: Image spam - FuzzyOCR? 

>>On Wed, 31 Aug 2016 12:55:15 +0000 Richard Mealing wrote:
>>> 2)      I'm getting some horny date spam coming through with just
>>> images and text inside an image at the bottom. My bayes seems to be 
>>> scoring this with -1.90 Bayes_00. I keep sending this to my database 
>>> as spam but I'm not sure how many I need to feed it and I don't get 
>>> much.

>On 01.09.16 14:25, RW wrote:
>>It's not a good sign when spam resists being trained away from BAYES_00.
>>
>>IIWY I'd reset the database, and if possible turn-off autotraining and 
>>train manually.
>>
>>Also you might want to set:
>>
>>  bayes_token_sources  all
>>
>>This adds in mimepart hashes, which may help Bayes identify repeated 
>>images.

>I think what happens more often is that the training data are sent to the wrong user.
>When using amavis, training must be done as the 'amavis' user, or whichever user amavis runs as.

I'm scanning for quite a few different domains (100+) and I'm not that familiar with how bayes works - I can't really find much documentation. TBH it seems to be working fine and scoring quite well, but there are instances where it fails.
Also, I am using it through SQL:

use_bayes 1
bayes_auto_learn 1
bayes_auto_expire 1
bayes_store_module      Mail::SpamAssassin::BayesStore::SQL
bayes_sql_dsn   DBI:mysql:sa_bayes:x.x.x.x:3306
bayes_sql_username      sa_user
bayes_sql_password       xxxx


I need to do more reading on how to make it better, but I have a few dormant domains delivering emails to a POP box and I rsync that to my filtering server and run sa-learn just using some bash script. I read this isn't recommended though, but I would have thought using a domain that no one should know about, like a honeypot, this should be ok? Maybe I should just rethink the whole thing. 
I remember someone telling me about that flesh plugin. I'm sure it was my boss! Was it not called pornsweeper? Looks like the DNS was removed for the website, but I looked at Google's cached copy.

Thanks for all your advice, it is much appreciated. 

>--
>Matus UHLAR - fantomas, uhlar@fantomas.sk ; http://www.fantomas.sk/
>Warning: I wish NOT to receive e-mail advertising to this address.
>Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
>"Where do you want to go to die?" [Microsoft]

Re: Image spam - FuzzyOCR?

Posted by Matus UHLAR - fantomas <uh...@fantomas.sk>.
>On Wed, 31 Aug 2016 12:55:15 +0000 Richard Mealing wrote:
>> 2)      I'm getting some horny date spam coming through with just
>> images and text inside an image at the bottom. My bayes seems to be
>> scoring this with -1.90 Bayes_00. I keep sending this to my database
>> as spam but I'm not sure how many I need to feed it and I don't get
>> much.

On 01.09.16 14:25, RW wrote:
>It's not a good sign when spam resists being trained away from BAYES_00.
>
>IIWY I'd reset the database, and if possible turn-off autotraining and
>train manually.
>
>Also you might want to set:
>
>  bayes_token_sources  all
>
>This adds in mimepart hashes, which may help Bayes identify repeated
>images.

I think what happens more often is that the training data are sent to the wrong
user.
When using amavis, training must be done as the 'amavis' user, or whichever
user amavis runs as.


-- 
Matus UHLAR - fantomas, uhlar@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
"Where do you want to go to die?" [Microsoft]

Re: Image spam - FuzzyOCR?

Posted by RW <rw...@googlemail.com>.
On Wed, 31 Aug 2016 12:55:15 +0000
Richard Mealing wrote:

> 2)      I'm getting some horny date spam coming through with just
> images and text inside an image at the bottom. My bayes seems to be
> scoring this with -1.90 Bayes_00. I keep sending this to my database
> as spam but I'm not sure how many I need to feed it and I don't get
> much. 

It's not a good sign when spam resists being trained away from BAYES_00.

IIWY I'd reset the database, and if possible turn-off autotraining and
train manually.

Also you might want to set:

  bayes_token_sources  all

This adds in mimepart hashes, which may help Bayes identify repeated
images.

Re: Image spam - FuzzyOCR?

Posted by RW <rw...@googlemail.com>.
On Fri, 02 Sep 2016 10:19:22 +0700
Olivier wrote:

> > Not really, he just said it matches against a word list. My point is
> > that out of the several SA OCR plugins that have been written,
> > FuzzyOCR is the one that's specifically designed for doing fuzzy
> > matching on a finite word list. If you just pass the OCR output to
> > Bayes or add it to the body, it's not "fuzzy OCR" anymore.  
> 
> To my understanding, the fuzzy part referred to the way it does OCR
> (several passes, with different angles, colours, etc.), not
> to the word matching.


From:

<https://web.archive.org/web/20070701213609/http://fuzzyocr.own-hero.net/wiki/WhatisFuzzyOcr>


The methods mainly are:

-  Optical Character Recognition using different engines and settings
-  Fuzzy word matching algorithm applied to OCR results
...

Re: Image spam - FuzzyOCR?

Posted by RW <rw...@googlemail.com>.
On Thu, 1 Sep 2016 15:16:37 +0200
Matus UHLAR - fantomas wrote:

> >> On Thu, Sep 1, 2016 at 12:27 AM, Olivier
> >> <Ol...@cs.ait.ac.th> wrote:  
> >> > I am running it, it does not do a very good job at extracting the
> >> > text from the images. Then it uses its own list of keywords to
> >> > detect spam: to me it's the biggest problem, it should push back
> >> > the text to SpamAssassin and let SA rules decide what to do with
> >> > it. 
> >>       I do agree that the OCR program should be doing the OCR'ing
> >> and the text filtering should be left to a program that does that
> >> for a living.  
> 
> On 01.09.16 13:59, RW wrote:
> >It's a long time since I've used it, but IIRC the point of FuzzyOCR
> >is that it does fuzzy matching on a dictionary of "bad" words -
> >similar to the way that spelling checkers find the most likely
> >suggestions. This gives it a very limited ability to deal with
> >imperfectly read words.  
> 
> it's the same as Olivier wrote above :-)

Not really, he just said it matches against a word list. My point is
that out of the several SA OCR plugins that have been written, FuzzyOCR
is the one that's specifically designed for doing fuzzy matching on a
finite word list. If you just pass the OCR output to Bayes or add it to
the body, it's not "fuzzy OCR" anymore.


> >Putting garbled OCR text through SA body rules may be more trouble
> >than it's worth.  
> 
> garbled, yes. I've had this discussion some years back and tesseract
> has currently much much better results than it had those years ago.


Unless it can cope with current CAPTCHAs the spammer has a reserve. 

The first OCR plugin came towards the end of a period where people were
being hammered by image spam. There's been nothing like that since,
probably because it doesn't work well as spam.  As I've said I find it
can be caught by other means. I must have put about 50k spams through
SA since I last had an FN that was an image spam. 

Re: Image spam - FuzzyOCR?

Posted by Matus UHLAR - fantomas <uh...@fantomas.sk>.
>> On Thu, Sep 1, 2016 at 12:27 AM, Olivier <Ol...@cs.ait.ac.th> wrote:
>> > I am running it, it does not do a very good job at extracting the
>> > text from the images. Then it uses its own list of keywords to
>> > detect spam: to me it's the biggest problem, it should push back
>> > the text to SpamAssassin and let SA rules decide what to do with it.
>> >
>>       I do agree that the OCR program should be doing the OCR'ing and
>> the text filtering should be left to a program that does that for a
>> living.

On 01.09.16 13:59, RW wrote:
>It's a long time since I've used it, but IIRC the point of FuzzyOCR is
>that it does fuzzy matching on a dictionary of "bad" words - similar to
>the way that spelling checkers find the most likely suggestions. This
>gives it a very limited ability to deal with imperfectly read words.

it's the same as Olivier wrote above :-)

>Putting garbled OCR text through SA body rules may be more trouble than
>it's worth.

garbled, yes. I've had this discussion some years back and tesseract has
currently much much better results than it had those years ago.

-- 
Matus UHLAR - fantomas, uhlar@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
Boost your system's speed by 500% - DEL C:\WINDOWS\*.*

Re: Image spam - FuzzyOCR?

Posted by RW <rw...@googlemail.com>.
On Thu, 1 Sep 2016 06:23:37 -0400
Mauricio Tavares wrote:

> On Thu, Sep 1, 2016 at 12:27 AM, Olivier
> <Ol...@cs.ait.ac.th> wrote:

> > I am running it, it does not do a very good job at extracting the
> > text from the images. Then it uses its own list of keywords to
> > detect spam: to me it's the biggest problem, it should push back
> > the text to SpamAssassin and let SA rules decide what to do with it.
> >  
>       I do agree that the OCR program should be doing the OCR'ing and
> the text filtering should be left to a program that does that for a
> living.

It's a long time since I've used it, but IIRC the point of FuzzyOCR is
that it does fuzzy matching on a dictionary of "bad" words - similar to
the way that spelling checkers find the most likely suggestions. This
gives it a very limited ability to deal with imperfectly read words.

Putting garbled OCR text through SA body rules may be more trouble than
it's worth.
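
To illustrate the fuzzy matching idea described above, here is a minimal
sketch in Python. It is not FuzzyOCR's actual algorithm or word list: the
word list, the function name fuzzy_hits and the 0.75 cutoff are all
assumptions made up for this example, and the matching uses the standard
library's difflib rather than whatever FuzzyOCR does internally. It only
shows how a misread OCR token can still be tied to a word on a finite list.

```python
import difflib

# Hypothetical word list for illustration; FuzzyOCR uses its own list.
BAD_WORDS = ["viagra", "mortgage", "pharmacy", "casino"]


def fuzzy_hits(ocr_text, words=BAD_WORDS, cutoff=0.75):
    """Return (token, matched_word) pairs for OCR tokens close to a listed word.

    cutoff is a similarity ratio between 0 and 1 (an assumed value here);
    lower values tolerate more OCR damage but produce more false matches.
    """
    hits = []
    for token in ocr_text.lower().split():
        # Drop surrounding punctuation that OCR output often carries.
        token = token.strip(".,;:!?\"'()")
        if not token:
            continue
        # get_close_matches returns the best candidates at or above the cutoff.
        match = difflib.get_close_matches(token, words, n=1, cutoff=cutoff)
        if match:
            hits.append((token, match[0]))
    return hits


if __name__ == "__main__":
    # Simulated OCR output with typical misreads (1 for i, rn for m).
    print(fuzzy_hits("Cheap v1agra from our onl1ne pharrnacy today"))
```

The limitation mentioned above shows up directly: only tokens that are
near a listed word are caught, so badly garbled text or words outside the
list pass through unnoticed.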




Re: Image spam - FuzzyOCR?

Posted by "lists@rhsoft.net" <li...@rhsoft.net>.

On 01.09.2016 at 12:23, Mauricio Tavares wrote:
> I do agree that the OCR program should be doing the OCR'ing and
> the text filtering should be left to a program that does that for a
> living. In the modern, systemd world this is of course an ancient and
> outdated design philosophy

this is simply *not* true; systemd ships a lot of different 
binaries doing different things and so *clearly* follows the unix philosophy

the only difference is that instead of all these tools living in different 
upstream repos, maintained by independent teams and hopefully getting 
adapted properly when a change affects more than one of them, the needed 
changes are done in the same repo

some people just have the illusion that Lennart Poettering is the one and 
only programmer of all those tools - no he is not - the different tools 
are maintained by different people and just get tightly integrated 
because they are all talking together and working in the same team 
instead of different projects fighting against each other in case of 
problems and pointing to the other tool which is broken



Re: Image spam - FuzzyOCR?

Posted by Mauricio Tavares <ra...@gmail.com>.
On Thu, Sep 1, 2016 at 12:27 AM, Olivier <Ol...@cs.ait.ac.th> wrote:
> Richard,
>
>> I am looking at Fuzzy ocr to detect more image spam and I had a couple
>> of questions;
>
> FuzzyOCR does not detect image spam per se, it detects spam text in an
> image. To classify image spam, you could consider Image Cerberus, which
> does a classification on image metadata (size, presence of text, etc.)
>
>> 1)      Is this being used? Does it detect image spam, or should I be
>> looking at something else?
>
> Yes. No, maybe.
>
> I am running it, it does not do a very good job at extracting the text
> from the images. Then it uses its own list of keywords to detect spam:
> to me it's the biggest problem, it should push back the text to
> SpamAssassin and let SA rules decide what to do with it.
>
      I do agree that the OCR program should be doing the OCR'ing and
the text filtering should be left to a program that does that for a
living. In the modern, systemd world this is of course an ancient and
outdated design philosophy.

>> 2)      I'm getting some horny date spam coming through with just
>> images and text inside an image at the bottom. My bayes seems to be
>> scoring this with -1.90 Bayes_00. I keep sending this to my database
>> as spam but I'm not sure how many I need to feed it and I don't get
>> much. Are there any other means of feeding bayes with image spam (or
>> any spam really) from a source on the internet? Or is that a bad idea
>> since that's not my spam?
>
> The ideal plugin would be able to look at a picture and decide that it's
> a horny date :) I remember we once had a student that wanted to work on
> classifying pictures by the amount of flesh to decide whether it was a
> naked picture or not. But I don't think he ever succeeded.
>
      I need to find where I saw this - might even have been in
wikipedia of all places -- but China or some other country has a
program that blocks images on the internet based on the amount of
flesh. As a result, it would block a picture of a bunch of pigs
feeding. Maybe it is the same guy?

>> 3)      If I use Fuzzy OCR on FreeBSD, how does it get updated?
>
> I doubt FuzzyOCR ever gets updated, on FreeBSD or elsewhere.
>
>> 4)      I installed it from the ports and I had to install tesseract
>> or I got a dependency warning message. Now I still get a warning -
>> warn: FuzzyOcr: Cannot find executable for gifinter - Is this normal?
>> How should I omit this error since I can't find gifinter in the ports
>> tree?
>
> gifinter used to be part of /usr/ports/graphics/giflib
> but the NEWS file mentions that:
> Version 5.0.1
> =============
> Retirements
> -----------
> * gifinter is gone.  Use convert -interlace from the ImageMagick suite.
>
> In my case, I still have an old executable of gifinter lying around,
> but I think you would configure FuzzyOcr.cf with an appropriate line of
> the form:
>
> focr_bin_gifinter /usr/local/bin/convert -interlace and the needed
> parameters.
>
> Best regards,
>
> Olivier