Posted to users@spamassassin.apache.org by Justin Mason <jm...@jmason.org> on 2007/11/16 11:30:06 UTC

Re: FuzzyOcr: Pushing OCR'ed text back to SA

Olivier Nicole writes:
> Hi,
> 
> This ticket in FuzzyOcr http://fuzzyocr.own-hero.net/ticket/15 is
> proposing to send the text resulting from the OCR process back to
> SA for analysis.
> 
> I fully second that idea but I am wondering *what* text to push back:
> depending on the scanset being used the same image will decode as:
> 
> [20834] dbg: FuzzyOcr: ocrdata=>>. U�agra tl.7g
> [20834] dbg: FuzzyOcr: . C�al�s t2.6g
> [20834] dbg: FuzzyOcr: 
> [20834] dbg: FuzzyOcr: <<=end
> 
> [20834] dbg: FuzzyOcr: ocrdata=>><<=end
> 
> [20834] dbg: FuzzyOcr: ocrdata=>>. U�agra tl.7g
> [20834] dbg: FuzzyOcr: . C�al�s t2.6g
> [20834] dbg: FuzzyOcr: 
> [20834] dbg: FuzzyOcr: <<=end
> 
> [20834] dbg: FuzzyOcr: ocrdata=>>' Viagra tl.79
> [20834] dbg: FuzzyOcr: ' CiaIis t2.69
> [20834] dbg: FuzzyOcr: <<=end
> 
> The last scanset is the one preferred by FuzzyOcr when we let it do the
> word analysis, but the first may even be enough for SA.
> 
> So the question really is: when can we say that the OCR is giving
> clean enough results that could be used by SA? We should not give SA
> the result of all scansets, else that would artificially raise the
> spam score.

Actually, passing the output of all scansets is what I'd recommend. A
SpamAssassin body rule scores the same whether its pattern matches once
or many times, so the repetition is harmless.

We already do this -- when we decode multipart/alternative MIME messages
containing both a text/plain and text/html part, we decode *both* parts
and concatenate them in the body rendering.
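
To illustrate the point, here is a minimal sketch in Python. It is not
SpamAssassin code: the rule names, patterns and scores are invented for
the example, and the garbled characters in the OCR output are replaced
by "?". It only models the behaviour described above, where a body rule
contributes its score once no matter how many times it matches.

```python
import re

# Hypothetical rules (name, pattern, score); not real SpamAssassin rules.
RULES = [
    ("DRUG_NAME_A", re.compile(r"viagra", re.IGNORECASE), 2.0),
    ("DRUG_NAME_B", re.compile(r"c[il1]a[il1]is", re.IGNORECASE), 2.0),
]


def score_body(rendered_body: str) -> float:
    """Sum rule scores; each rule counts at most once, however often it matches."""
    return sum(
        score for _name, pattern, score in RULES if pattern.search(rendered_body)
    )


# OCR output of the four scansets quoted above ("?" stands for garbled bytes).
scanset_outputs = [
    ". U?agra tl.7g\n. C?al?s t2.6g\n",
    "",
    ". U?agra tl.7g\n. C?al?s t2.6g\n",
    "' Viagra tl.79\n' CiaIis t2.69\n",
]

best_only = score_body(scanset_outputs[-1])
all_concatenated = score_body("\n".join(scanset_outputs))
```

Under these assumptions, scoring the concatenation of all scansets
gives the same result as scoring the cleanest scanset alone.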

--j.

Re: FuzzyOcr: Pushing OCR'ed text back to SA

Posted by Olivier Nicole <on...@cs.ait.ac.th>.
Dear Matus,

> > Olivier Nicole writes:
> > > This ticket in FuzzyOcr http://fuzzyocr.own-hero.net/ticket/15 is
> > > proposing to send the text resulting from the OCR process back to
> > > SA for analysis.
> > > 
> > > I fully second that idea but I am wondering *what* text to push back:
> > > depending on the scanset being used the same image will decode as:
> [...]
> > > The last scanset is the one preferred by FuzzyOcr when we let it do the
> > > word analysis, but the first may even be enough for SA.
> > > 
> > > So the question really is: when can we say that the OCR is giving
> > > clean enough results that could be used by SA? We should not give SA
> > > the result of all scansets, else that would artificially raise the
> > > spam score.
> 
> On 16.11.07 10:30, Justin Mason wrote:
> > Actually, passing the output of all scansets is what I'd recommend. A
> > SpamAssassin body rule scores the same whether its pattern matches once
> > or many times, so the repetition is harmless.
> > 
> > We already do this -- when we decode multipart/alternative MIME messages
> > containing both a text/plain and text/html part, we decode *both* parts
> > and concatenate them in the body rendering.
> 
> Are there any plans to do that with Word/PPT/PDF/image decoders?
> Catching spam from images embedded in .pdf or .doc files would be very nice :)
> Not the simple way FuzzyOcr does (did?) it, but using Bayes and other
> rules (haha, I can imagine typos in images will come next).

On a related note, I reworked a PDF plugin (PDFassassin) to do that: the
text part of the PDF document is pushed back to SA for analysis, and the
images are pushed back to SA as images for an image plugin (FuzzyOcr) to
analyze.

While I don't think SA has a framework for that kind of plugin
cascading, I believe it is the right path:

- the file format plugin (PDF, Word, ...) takes care of the file format,
  extracting text as text for SA and images as images for some OCR
  plugin;

- the OCR plugin takes care of analyzing the images.

So the file decoding, which has to keep up with changes in the file
formats, is largely decoupled from the image analysis task.
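
The split described above can be sketched as two independent stages.
This is only an illustration in Python, not SA plugin code: the names
Extracted, render_attachment, fake_pdf_extract and fake_ocr are all
hypothetical, and the two "fake" functions are stand-ins for a real
format plugin and a real OCR plugin.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Extracted:
    """What a (hypothetical) file format plugin hands back."""
    text: List[str] = field(default_factory=list)
    images: List[bytes] = field(default_factory=list)


def render_attachment(
    attachment: bytes,
    extract: Callable[[bytes], Extracted],
    ocr: Callable[[bytes], str],
) -> str:
    """Chain the two stages: format decoding first, then OCR on the images."""
    parts = extract(attachment)
    ocr_text = [ocr(image) for image in parts.images]
    # Extracted text and OCR'ed text end up in one body for the rules to scan.
    return "\n".join(parts.text + ocr_text)


# Stand-in stages, for illustration only.
def fake_pdf_extract(attachment: bytes) -> Extracted:
    return Extracted(text=["Dear customer"], images=[b"image-1", b"image-2"])


def fake_ocr(image: bytes) -> str:
    return "ocr(" + image.decode("ascii") + ")"


rendered = render_attachment(b"%PDF-...", fake_pdf_extract, fake_ocr)
```

Because the format stage and the OCR stage only meet through the list
of images, either one can be replaced without touching the other.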

Best regards,

Olivier

Re: FuzzyOcr: Pushing OCR'ed text back to SA

Posted by Matus UHLAR - fantomas <uh...@fantomas.sk>.
> Olivier Nicole writes:
> > This ticket in FuzzyOcr http://fuzzyocr.own-hero.net/ticket/15 is
> > proposing to send the text resulting from the OCR process back to
> > SA for analysis.
> > 
> > I fully second that idea but I am wondering *what* text to push back:
> > depending on the scanset being used the same image will decode as:
[...]
> > The last scanset is the one preferred by FuzzyOcr when we let it do the
> > word analysis, but the first may even be enough for SA.
> > 
> > So the question really is: when can we say that the OCR is giving
> > clean enough results that could be used by SA? We should not give SA
> > the result of all scansets, else that would artificially raise the
> > spam score.

On 16.11.07 10:30, Justin Mason wrote:
> Actually, passing the output of all scansets is what I'd recommend. A
> SpamAssassin body rule scores the same whether its pattern matches once
> or many times, so the repetition is harmless.
> 
> We already do this -- when we decode multipart/alternative MIME messages
> containing both a text/plain and text/html part, we decode *both* parts
> and concatenate them in the body rendering.

Are there any plans to do that with Word/PPT/PDF/image decoders?
Catching spam from images embedded in .pdf or .doc files would be very nice :)
Not the simple way FuzzyOcr does (did?) it, but using Bayes and other
rules (haha, I can imagine typos in images will come next).
-- 
Matus UHLAR - fantomas, uhlar@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
2B|!2B, that's a question!