Posted to users@spamassassin.apache.org by Brent Clark <br...@gmail.com> on 2018/10/12 13:11:00 UTC

Is fuzzyocr i.e. Image scanning

Good day Guys

I am getting quite a bit of image spam, and googling pointed me in the
direction of a tool called FuzzyOCR.

I configured Vagrant to install SpamAssassin and FuzzyOCR, but FuzzyOCR
does not appear to be catching my spam (the provided example
works).

Before I go down the road of installing and configuring FuzzyOCR on my
MTA, I thought I would double-check with the SpamAssassin community and
ask: is there still a place for image scanning in 2018?

The documentation is fairly old, so it got me wondering whether image
scanning is an old technology and method.

Thanks in advance.

Regards
Brent
P.S. Here is a pastebin link of what I am seeing.
https://pastebin.com/raw/gurvFrZw



Re: Is fuzzyocr i.e. Image scanning

Posted by John Hardin <jh...@impsec.org>.
On Mon, 15 Oct 2018, Brent Clark wrote:

> Good day Guys
>
> I was fortunate that someone emailed me privately, but is there no one else
> who has anything they can share (it's not only for me, but for the community
> as a whole)? I'm sure there are others out there whose users are dealing
> with this nonsense.
>
> Please share.

Text obfuscation via images comes and goes. I've noticed for a while that 
it seems to be in the "coming" phase again. I have been getting 419 frauds 
where the pitch is in an image.

It might be reasonable to review and freshen the fuzzyOCR code.


-- 
  John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
  jhardin@impsec.org    FALaholic #11174     pgpk -a jhardin@impsec.org
  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
   But if there is no such inalienable right [to self defense], the
   entire nature of the social contract is changed. Each man’s worth
   is measured solely by his utility to the state, and as such the
   value of his life rides a roller coaster not unlike the stock
   market: dependent not only upon the preferences of the party in
   power but upon the whims of its political leaders and the
   permanent bureaucratic class.                      -- Mike McDaniel
-----------------------------------------------------------------------
  564 days since the first commercial re-flight of an orbital booster (SpaceX)

Re: Is fuzzyocr i.e. Image scanning

Posted by Brent Clark <br...@gmail.com>.
Good day Guys

I was fortunate that someone emailed me privately, but is there no one
else who has anything they can share (it's not only for me, but for the
community as a whole)? I'm sure there are others out there whose users
are dealing with this nonsense.

Please share.

Regards
Brent


Re: Is fuzzyocr i.e. Image scanning

Posted by Brent Clark <br...@gmail.com>.
Apologies for the subject.

It was meant to read "Is fuzzyocr i.e. Image scanning, warranted in 2018"

Regards
Brent


Re: Is fuzzyocr i.e. Image scanning

Posted by Rupert Gallagher <ru...@protonmail.com>.
I see a VPS and a sender domain on the ".expert" TLD. My servers handle those with a REJECT rule.


Re: Is fuzzyocr i.e. Image scanning

Posted by Henrik K <he...@hege.li>.
On Wed, Oct 17, 2018 at 09:21:33AM +0700, Olivier wrote:
>
> That is the way I meant it, it's an AND, not an OR. I see FuzzyOCR as
> just one more tool that can be added to SA.

The problem is that it's so inefficient. I've never seen image spam as a
problem; mostly it hits other rules and MTA blocks if you know what you are
doing. My current spam corpus contains only 7% images. For ham it's over
60%, so that's a horrible amount of executing image transformation tools and
analyzers for nothing, not to mention how many vulnerabilities ImageMagick
and similar image tools have had. At minimum, FuzzyOCR and the like should
maintain a hash database of good images to skip. All these 10-year-old
plugins are pretty horrid code.
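
The hash-database idea above can be sketched roughly as follows. This is a
minimal illustration in Python, not FuzzyOCR's actual code: the names
KnownImageCache, mark_good, needs_scan and scan_images are hypothetical, and
the ocr argument stands in for whatever image analyzer would otherwise run.
It assumes an exact SHA-256 match on the raw attachment bytes.

```python
import hashlib


class KnownImageCache:
    """Hypothetical skip-list of digests for images already judged harmless."""

    def __init__(self):
        self._good = set()

    @staticmethod
    def digest(image_bytes: bytes) -> str:
        # Exact-match digest of the raw attachment bytes (assumption: SHA-256).
        return hashlib.sha256(image_bytes).hexdigest()

    def mark_good(self, image_bytes: bytes) -> None:
        # Remember an image that was seen in ham, so it is skipped next time.
        self._good.add(self.digest(image_bytes))

    def needs_scan(self, image_bytes: bytes) -> bool:
        # True when the image is not in the known-good set.
        return self.digest(image_bytes) not in self._good


def scan_images(images, cache, ocr):
    """Run the (expensive) ocr callable only on images not in the cache.

    Returns a list of (index, ocr_result) pairs for the images scanned.
    """
    results = []
    for index, data in enumerate(images):
        if not cache.needs_scan(data):
            continue  # known-good image: skip conversion and OCR entirely
        results.append((index, ocr(data)))
    return results
```

An exact digest only skips byte-identical images, such as a logo repeated in
every signature. A re-encoded or resized copy gets a different digest and
would still be scanned, and a real deployment would also need to persist the
set between runs.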


Re: Is fuzzyocr i.e. Image scanning

Posted by Matus UHLAR - fantomas <uh...@fantomas.sk>.
>>On 16.10.18 18:42, RW wrote:
>>>Bayes might work, but I wouldn't like to see it added to body text
>>>because corrupted text could look like obfuscation.

>On Wed, 17 Oct 2018, Matus UHLAR - fantomas wrote:
>>it should be pushed back to body text just for filters like bayes.
>>The same could/should be done for attached .doc, .pdf files etc.

On 17.10.18 07:56, John Hardin wrote:
>...which would be much more reliable than OCR.
>
>If it was a resource-allocation decision for pulling text from doc/pdf 
>vs. updating OCR, I'd push for the former.

this could be easily configured by installing modules or loading them.

btw, both PDF and word documents can contain images too ...


-- 
Matus UHLAR - fantomas, uhlar@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
99 percent of lawyers give the rest a bad name. 

Re: Is fuzzyocr i.e. Image scanning

Posted by John Hardin <jh...@impsec.org>.
On Wed, 17 Oct 2018, Matus UHLAR - fantomas wrote:

> On 16.10.18 18:42, RW wrote:
>> Bayes might work, but I wouldn't like to see it added to body text
>> because corrupted text could look like obfuscation.
>
> it should be pushed back to body text just for filters like bayes.
> The same could/should be done for attached .doc, .pdf files etc.

...which would be much more reliable than OCR.

If it was a resource-allocation decision for pulling text from doc/pdf vs. 
updating OCR, I'd push for the former.

-- 
  John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
  jhardin@impsec.org    FALaholic #11174     pgpk -a jhardin@impsec.org
  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
   The problem is when people look at Yahoo, slashdot, or groklaw and
   jump from obvious and correct observations like "Oh my God, this
   place is teeming with utter morons" to incorrect conclusions like
   "there's nothing of value here".        -- Al Petrofsky, in Y! SCOX
-----------------------------------------------------------------------
  566 days since the first commercial re-flight of an orbital booster (SpaceX)

Re: Is fuzzyocr i.e. Image scanning

Posted by Matus UHLAR - fantomas <uh...@fantomas.sk>.
>> >On Tue, 16 Oct 2018 11:49:54 +0700 Olivier wrote:
>> >> One of my holdbacks with FuzzyOCR is that you have to provide an
>> >> independent word list, while we have a very good tool to analyze
>> >> text contents: SpamAssassin itself. So I would much prefer
>> >> FuzzyOCR to feed the OCR'ed text back to SA for further analysis
>> >> (the way pdfAssassin is working).
>>
>> On 16.10.18 13:34, RW wrote:
>> >That works as long as the OCR remains very accurate. What happened
>> >before was that the deployment of OCR led spammers to make their
>> >text much less readable.

>On Tue, 16 Oct 2018 15:48:34 +0200 Matus UHLAR - fantomas wrote:
>> I think that original reason was that available OCR programs were not
>> reliable enough.
>>
>> I have tested gocr, ocrad and tesseract some >10 years ago, with not
>> very satisfying results, gocr being best at that time.
>>
>> Since then, google took tesseract and made it much better.
>>
>> I believe that currently it would be viable to push ocr output to
>> spamassassin for processing with bayes and other rules.

On 16.10.18 18:42, RW wrote:
>Bayes might work, but I wouldn't like to see it added to body text
>because corrupted text could look like obfuscation.

it should be pushed back to body text just for filters like bayes.
The same could/should be done for attached .doc, .pdf files etc.
-- 
Matus UHLAR - fantomas, uhlar@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
42.7 percent of all statistics are made up on the spot. 

Re: Is fuzzyocr i.e. Image scanning

Posted by RW <rw...@googlemail.com>.
On Tue, 16 Oct 2018 15:48:34 +0200
Matus UHLAR - fantomas wrote:

> >On Tue, 16 Oct 2018 11:49:54 +0700 Olivier wrote:  
> >> One of my holdbacks with FuzzyOCR is that you have to provide an
> >> independent word list, while we have a very good tool to analyze
> >> text contents: SpamAssassin itself. So I would much prefer
> >> FuzzyOCR to feed the OCR'ed text back to SA for further analysis
> >> (the way pdfAssassin is working).  
> 
> On 16.10.18 13:34, RW wrote:
> >That works as long as the OCR remains very accurate. What happened
> >before was that the deployment of OCR led spammers to make their
> >text much less readable.  
> 
> I think that original reason was that available OCR programs were not
> reliable enough.
> 
> I have tested gocr, ocrad and tesseract some >10 years ago, with not
> very satisfying results, gocr being best at that time.
> 
> Since then, google took tesseract and made it much better.
> 
> I believe that currently it would be viable to push ocr output to
> spamassassin for processing with bayes and other rules.


Bayes might work, but I wouldn't like to see it added to body text
because corrupted text could look like obfuscation.

Re: Is fuzzyocr i.e. Image scanning

Posted by Matus UHLAR - fantomas <uh...@fantomas.sk>.
>On Tue, 16 Oct 2018 11:49:54 +0700 Olivier wrote:
>> One of my holdbacks with FuzzyOCR is that you have to provide an
>> independent word list, while we have a very good tool to analyze text
>> contents: SpamAssassin itself. So I would much prefer FuzzyOCR to feed
>> the OCR'ed text back to SA for further analysis (the way pdfAssassin
>> is working).

On 16.10.18 13:34, RW wrote:
>That works as long as the OCR remains very accurate. What happened
>before was that the deployment of OCR led spammers to make their text
>much less readable.

I think the original reason was that available OCR programs were not
reliable enough.

I tested gocr, ocrad and tesseract more than 10 years ago, with not very
satisfying results, gocr being best at that time.

Since then, Google took tesseract and made it much better.

I believe that currently it would be viable to push OCR output to
SpamAssassin for processing with Bayes and other rules.


>> As for your question about the place for image scanning, if your MTA
>> has the resources to do so, why not?
>
>Because it's better if it's combined with other information.

-- 
Matus UHLAR - fantomas, uhlar@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
A day without sunshine is like, night.

Re: Is fuzzyocr i.e. Image scanning

Posted by RW <rw...@googlemail.com>.
On Tue, 16 Oct 2018 11:49:54 +0700
Olivier wrote:


> One of my holdbacks with FuzzyOCR is that you have to provide an
> independent word list, while we have a very good tool to analyze text
> contents: SpamAssassin itself. So I would much prefer FuzzyOCR to feed
> the OCR'ed text back to SA for further analysis (the way pdfAssassin
> is working).

That works as long as the OCR remains very accurate. What happened
before was that the deployment of OCR led spammers to make their text
much less readable.


> As for your question about the place for image scanning, if your MTA
> has the resources to do so, why not?

Because it's better if it's combined with other information.