Posted to users@spamassassin.apache.org by Olivier Nicole <on...@cs.ait.ac.th> on 2007/11/16 07:38:14 UTC

FuzzyOcr: Pushing OCR'ed text back to SA

Hi,

This ticket in FuzzyOcr http://fuzzyocr.own-hero.net/ticket/15 is
proposing to send the text resulting from the OCR process back to
SA for analysis.

I fully second that idea but I am wondering *what* text to push back:
depending on the scanset being used, the same image will decode as:

[20834] dbg: FuzzyOcr: ocrdata=>>. Uíagra tl.7g
[20834] dbg: FuzzyOcr: . Cíalís t2.6g
[20834] dbg: FuzzyOcr: 
[20834] dbg: FuzzyOcr: <<=end

[20834] dbg: FuzzyOcr: ocrdata=>><<=end

[20834] dbg: FuzzyOcr: ocrdata=>>. Uíagra tl.7g
[20834] dbg: FuzzyOcr: . Cíalís t2.6g
[20834] dbg: FuzzyOcr: 
[20834] dbg: FuzzyOcr: <<=end

[20834] dbg: FuzzyOcr: ocrdata=>>' Viagra tl.79
[20834] dbg: FuzzyOcr: ' CiaIis t2.69
[20834] dbg: FuzzyOcr: <<=end

The last scanset is the one preferred by FuzzyOcr when we let it do the
word analysis, but the first may even be enough for SA.

So the question really is: when can we say that the OCR is giving
clean enough results that could be used by SA? We should not give SA
the result of all scansets, else that would artificially raise the
spam score.

On the other hand, for a photograph, the OCR text may look like the
following. This should never be pushed to SA, so how do we decide?

[19120] dbg: FuzzyOcr: ocrdata=>>. ._ .
[19120] dbg: FuzzyOcr: _\
[19120] dbg: FuzzyOcr: | _
[19120] dbg: FuzzyOcr: _ |
[19120] dbg: FuzzyOcr: 
[19120] dbg: FuzzyOcr: _? _4'|
[19120] dbg: FuzzyOcr: , _ ,. . .
[19120] dbg: FuzzyOcr: 
[19120] dbg: FuzzyOcr: __ - . . _
[19120] dbg: FuzzyOcr: _ . . .
[19120] dbg: FuzzyOcr: .._ _ .
[19120] dbg: FuzzyOcr: 
[19120] dbg: FuzzyOcr: <<=end
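
One possible way to make that decision is sketched below in Python. It is only an illustration under stated assumptions, not anything FuzzyOcr does today: the function name looks_like_text and both thresholds are hypothetical and would need tuning on real mail. The idea is to push text back to SA only when most visible characters are letters and at least a couple of runs of three or more letters are present.

```python
import re

def looks_like_text(ocr_text, min_letter_ratio=0.5, min_words=2):
    """Guess whether OCR output is readable text rather than noise.

    Both thresholds are assumed values for illustration only.
    """
    visible = [c for c in ocr_text if not c.isspace()]
    if not visible:
        # An empty scanset result carries nothing worth pushing back.
        return False
    # Share of visible characters that are letters (accented ones included).
    letter_ratio = sum(c.isalpha() for c in visible) / len(visible)
    # Runs of three or more letters stand in for "words".
    words = re.findall(r"[^\W\d_]{3,}", ocr_text)
    return letter_ratio >= min_letter_ratio and len(words) >= min_words

# Samples adapted from the debug output in this message.
spam_scan = ". Uíagra tl.7g\n. Cíalís t2.6g\n"
photo_scan = ". ._ .\n_\\\n| _\n_ |\n_? _4'|\n, _ ,. . .\n"

for name, sample in (("spam", spam_scan), ("photo", photo_scan), ("empty", "")):
    print(name, looks_like_text(sample))
```

With these assumed thresholds the spam samples pass and the photograph noise does not, but a picture that happens to contain real text would pass as well, which is presumably what we want.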

Best regards,

Olivier

Re: FuzzyOcr: Pushing OCR'ed text back to SA

Posted by Olivier Nicole <on...@cs.ait.ac.th>.

Dear Matus,

> > Olivier Nicole writes:
> > > This ticket in FuzzyOcr http://fuzzyocr.own-hero.net/ticket/15 is
> > > proposing to send the text resulting from the OCR process back to
> > > SA for analysis.
> > > 
> > > I fully second that idea but I am wondering *what* text to push back:
> > > depending on the scanset being used, the same image will decode as:
> [...]
> > > The last scanset is the one preferred by FuzzyOcr when we let it do the
> > > word analysis, but the first may even be enough for SA.
> > > 
> > > So the question really is: when can we say that the OCR is giving
> > > clean enough results that could be used by SA? We should not give SA
> > > the result of all scansets, else that would artificially raise the
> > > spam score.
> 
> On 16.11.07 10:30, Justin Mason wrote:
> > actually, that's what I'd recommend. SpamAssassin's ruleset counts
> > a single occurrence of a body rule pattern as equal to multiple
> > occurrences, so this is harmless in SpamAssassin.
> > 
> > We already do this -- when we decode multipart/alternative MIME messages
> > containing both a text/plain and text/html part, we decode *both* parts
> > and concatenate them in the body rendering.
> 
> Are there any plans to do that with Word/PPT/PDF/image decoders?
> Catching spam from images embedded in .pdf or .doc files would be very nice :)
> Not the simple way FuzzyOcr does (did?) it, but using Bayes and other
> rules. (Haha, I can imagine the next thing will be typos in images.)

On another front, I reworked a PDF plugin (PDFassassin) to do that: the
text part of the PDF document is pushed back to SA for SA to analyze,
and the images are pushed back to SA as images for an image plugin
(FuzzyOcr) to analyze.

While I don't think there is a framework in SA for that kind of
plugin cascading, I believe that is the right path:

- the file format plugin (PDF, Word, ...) takes care of the format of
  the file, extracting text as text for SA and images as images for
  some OCR plugin;

- the OCR plugin takes care of analyzing the images.

So the file decoding, which has to keep up with changes in the file
format, is pretty much disconnected from the image analysis task.
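
The split described above can be sketched roughly as follows, in Python for brevity. This is not SpamAssassin's plugin API and not the PDFassassin code: Extracted, cascade and the two stub plugins are hypothetical names used only to show the separation of concerns.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Extracted:
    """What a file format plugin hands back: plain text and raw images."""
    text: List[str] = field(default_factory=list)
    images: List[bytes] = field(default_factory=list)

def cascade(attachment: bytes,
            format_plugin: Callable[[bytes], Extracted],
            ocr_plugin: Callable[[bytes], str]) -> str:
    """Build the text to hand to the rules for one attachment."""
    extracted = format_plugin(attachment)   # knows the file format only
    parts = list(extracted.text)
    for image in extracted.images:
        ocr_text = ocr_plugin(image)        # knows image analysis only
        if ocr_text.strip():
            parts.append(ocr_text)
    return "\n".join(parts)

# Stub plugins standing in for a real PDF decoder and a real OCR engine.
def fake_pdf_plugin(data: bytes) -> Extracted:
    return Extracted(text=["Dear customer"], images=[b"img1", b"img2"])

def fake_ocr_plugin(image: bytes) -> str:
    return "Viagra" if image == b"img1" else ""

print(cascade(b"%PDF", fake_pdf_plugin, fake_ocr_plugin))
```

Because cascade only sees two callables, a Word or PPT decoder could be swapped in without touching the OCR side, and the other way round.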

Best regards,

Olivier

Re: FuzzyOcr: Pushing OCR'ed text back to SA

Posted by Matus UHLAR - fantomas <uh...@fantomas.sk>.

> Olivier Nicole writes:
> > This ticket in FuzzyOcr http://fuzzyocr.own-hero.net/ticket/15 is
> > proposing to send the text resulting from the OCR process back to
> > SA for analysis.
> > 
> > I fully second that idea but I am wondering *what* text to push back:
> > depending on the scanset being used, the same image will decode as:
[...]
> > The last scanset is the one preferred by FuzzyOcr when we let it do the
> > word analysis, but the first may even be enough for SA.
> > 
> > So the question really is: when can we say that the OCR is giving
> > clean enough results that could be used by SA? We should not give SA
> > the result of all scansets, else that would artificially raise the
> > spam score.

On 16.11.07 10:30, Justin Mason wrote:
> actually, that's what I'd recommend. SpamAssassin's ruleset counts
> a single occurrence of a body rule pattern as equal to multiple
> occurrences, so this is harmless in SpamAssassin.
> 
> We already do this -- when we decode multipart/alternative MIME messages
> containing both a text/plain and text/html part, we decode *both* parts
> and concatenate them in the body rendering.
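
The point quoted above can be illustrated with a toy model in Python. This is not SpamAssassin's code: the rule names, patterns and scores are invented, and score_body is a hypothetical helper. Each rule either fires or does not, so repeating the same text by concatenating scansets does not add to the score.

```python
import re

# Invented rules for illustration only: name -> (pattern, score).
RULES = {
    "DRUG_NAME_A": (re.compile(r"\bviagra\b", re.IGNORECASE), 2.0),
    "DRUG_NAME_B": (re.compile(r"\bcialis\b", re.IGNORECASE), 1.5),
}

def score_body(rendered_body):
    """Return (total score, names of rules hit) for a rendered body."""
    total = 0.0
    hits = []
    for name, (pattern, score) in RULES.items():
        # search() is a yes/no test: extra matches of the same rule add nothing.
        if pattern.search(rendered_body):
            hits.append(name)
            total += score
    return total, hits

# Scanset results taken from the debug output earlier in the thread.
scansets = [
    ". Uíagra tl.7g\n. Cíalís t2.6g\n",
    "",
    ". Uíagra tl.7g\n. Cíalís t2.6g\n",
    "' Viagra tl.79\n' CiaIis t2.69\n",
]

print(score_body(scansets[-1]))
print(score_body("\n".join(scansets)))
```

In this toy model the concatenation of all scansets scores the same as the last scanset alone, since the extra copies only repeat or miss the same patterns.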

Are there any plans to do that with Word/PPT/PDF/image decoders?
Catching spam from images embedded in .pdf or .doc files would be very nice :)
Not the simple way FuzzyOcr does (did?) it, but using Bayes and other
rules. (Haha, I can imagine the next thing will be typos in images.)
-- 
Matus UHLAR - fantomas, uhlar@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
2B|!2B, that's a question!