Posted to users@spamassassin.apache.org by James MacLean <ma...@ednet.ns.ca> on 2007/07/16 23:26:38 UTC

Re: pdf tools clarification? - PDFText

JT DeLys wrote, on 16/07/07 02:14 PM:
> Hi,
>
> Could someone perhaps succinctly summarize the various & sundry 
> anti-pdf-image-spam tools that are currently in play?
>
>   PDFText
>    -- works in 3.2, not 3.1
>
This one is my fault :(. PDFText _does_ work in 3.1, and that is where we 
are getting the most use from it. PDFText2 is for 3.2.

Its goal was/is to get the text from PDFs and do your SPAM matching on 
it. With PDFText, you have to request the match tests with the 
execute command, i.e.:

body PDF_TO_TEXT eval:check_pdftext('stock','profit','Symbol::4')

That example gives the match "Symbol:" a value of 4 points.

With PDFText2, the found text is added (rendered) to the main tests that 
SpamAssassin does.

Both get the info that comes from running pdfinfo and pdftotext on the 
PDFs attached, which gives you access to information like "Title:".
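To illustrate the kind of information those two tools return, here is a minimal Python sketch. It is not the plugin's code (the plugins are Perl); the function names `parse_pdfinfo` and `extract_pdf` are hypothetical, and it assumes the `pdfinfo` and `pdftotext` command-line tools are on the PATH.

```python
import subprocess


def parse_pdfinfo(output: str) -> dict:
    """Turn pdfinfo's "Key:   value" lines into a dictionary."""
    info = {}
    for line in output.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines that have no "Key:" prefix
            info[key.strip()] = value.strip()
    return info


def extract_pdf(path: str) -> tuple:
    """Return (metadata dict, extracted text) for one PDF file.

    Assumes the pdfinfo and pdftotext binaries are installed.
    """
    meta = subprocess.run(
        ["pdfinfo", path], capture_output=True, text=True, check=True
    ).stdout
    # A trailing "-" tells pdftotext to write the text to stdout.
    text = subprocess.run(
        ["pdftotext", path, "-"], capture_output=True, text=True, check=True
    ).stdout
    return parse_pdfinfo(meta), text
```

A field such as "Title:" then becomes available as `extract_pdf(path)[0].get("Title")`.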

PDFText2 can also use gocr to do OCR on any PDF images. I'm not sold on 
that as the first one I tested it on gave back :

SZSN St_nd_ To Proflt 1,4 mllll On In D_V_lopm_nt ProJ_otI

/Sh_ndon_ Zhouyu_n S__d _nd Nur__ry Co,, Ltd (SZsN7
fo,2g up ao,Bx (9_51 EST7

SZSN _nnouno_d lt_ _nt_rln_ lnto _n __r__m_nt ln _ r__l
__t_t_ d_v_lopm_nt th_t _t_nd_ to proflt th_ o an p_ny f1,4
mllllOn_ Thl_ oomp_ny l_ Ju_t ___rln_ up_ Aot f__t _nd __t
on SZSN,

Good luck with that :).

Hope that helps,
JES

Re: pdf tools clarification? - PDFText

Posted by James MacLean <ma...@ednet.ns.ca>.
JT DeLys wrote, on 16/07/07 07:02 PM:
>
> Seems to me that, assuming I can get the prereqs for FuzzyOCR+pdf built 
> correctly (working), that FuzzyOCR /for/ OCR plus PDFText2 for text 
> might be a solid solution ...
>
Wish I had your confidence :). PDFText2 is still too young to know if 
it holds up under a good load, especially since it calls external 
programs to do its work ;).

All the best,
JES

Re: pdf tools clarification? - PDFText

Posted by JT DeLys <jt...@gmail.com>.
> When PDFText2 is loaded, its rendered text will be tested for the word
> stock just like everything else that SpamAssassin offers for your tests to
> match against. You might consider it to be the more SpamAssassin natural way
> of matching against PDF text :).
>

Clear. Thanks.


> Well, I am going to say similar, yet different :). PDFText2 currently does
> an OCR of the images and adds them to the rendered text. The OCRed text may
> not be very accurate and will not match that well.
>
> FuzzyOCR, if I understand what I have seen so far (and the author will be
> much better than I to respond), takes the OCR rendered from any one of the
> available OCR engines and uses String::Approx (and maybe other tools) to
> match against a word list you supply specifically for FuzzyOCR. Much better
> chance of getting a hit on images.
>

Seems to me that, assuming I can get the prereqs for FuzzyOCR+pdf built
correctly (working), that FuzzyOCR /for/ OCR plus PDFText2 for text might be
a solid solution ...

-- 
Thanks,

    JTDeLys

Re: pdf tools clarification? - PDFText

Posted by James MacLean <ma...@ednet.ns.ca>.
JT DeLys wrote, on 16/07/07 06:36 PM:
> Hi,
>
>     With PDFText2, the found text is added (rendered) to the main
>     tests that SpamAssassin does.
>
>
> Do you mean to those tests defined in 80_additional.cf? or others?
It means any test you do on the body of an e-mail will also test against 
this text. For example, in your local.cf you might have:

body STOCK_TEST /stock/i
describe STOCK_TEST Found the word stock
score STOCK_TEST 4.5

When PDFText2 is loaded, its rendered text will be tested for the word 
stock just like everything else that SpamAssassin offers for your tests 
to match against. You might consider it to be the more SpamAssassin 
natural way of matching against PDF text :).
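The idea can be sketched in a few lines of Python. This is only an illustration of "PDF text is appended to what body rules see", not SpamAssassin's actual rendering code; `rendered_text` and `rule_hits` are hypothetical names.

```python
import re


def rendered_text(body: str, pdf_texts: list) -> str:
    """Join the message body with any text extracted from PDF attachments."""
    return "\n".join([body, *pdf_texts])


def rule_hits(pattern: str, text: str) -> bool:
    """Mimic a case-insensitive body rule such as /stock/i."""
    return re.search(pattern, text, re.IGNORECASE) is not None


body = "Please see the attached file."
pdf_texts = ["Hot STOCK pick of the week"]

# The word only appears in the PDF, so the rule fires only on the
# combined (rendered) text.
print(rule_hits(r"stock", body))
print(rule_hits(r"stock", rendered_text(body, pdf_texts)))
```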
>
>     PDFText2 can also use gocr to do OCR on any PDF images. I'm not
>     sold on that as the first one I tested it on gave back :
>
>
> Is that different capability/functionality than FuzzyOCR is undertaking?
>
>
Well, I am going to say similar, yet different :). PDFText2 currently 
does an OCR of the images and adds them to the rendered text. The OCRed 
text may not be very accurate and will not match that well.

FuzzyOCR, if I understand what I have seen so far (and the author will be 
much better than I to respond), takes the OCR rendered from any one of 
the available OCR engines and uses String::Approx (and maybe other 
tools) to match against a word list you supply specifically for 
FuzzyOCR. Much better chance of getting a hit on images.
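To show why approximate matching helps with noisy OCR output, here is a small Python sketch. It swaps in the standard library's `difflib.SequenceMatcher` for Perl's String::Approx, so it is an analogy rather than FuzzyOCR's algorithm; the function name `fuzzy_hits` and the 0.8 threshold are assumptions made for the example.

```python
import re
from difflib import SequenceMatcher


def fuzzy_hits(ocr_text: str, wordlist: list, threshold: float = 0.8) -> list:
    """Return (token, word) pairs whose similarity ratio meets the threshold."""
    hits = []
    # Treat runs of letters and underscores as tokens; gocr-style output
    # often uses "_" for characters it could not recognise.
    for token in re.findall(r"[A-Za-z_]+", ocr_text):
        for word in wordlist:
            score = SequenceMatcher(None, token.lower(), word.lower()).ratio()
            if score >= threshold:
                hits.append((token, word))
    return hits


ocr_line = "SZSN St_nd_ To Proflt 1,4 mllll On In D_V_lopm_nt ProJ_otI"
print(fuzzy_hits(ocr_line, ["profit", "stock"]))
```

An exact match on "profit" would miss "Proflt" in that line, whereas a similarity threshold can still catch it.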
>
> -- 
> Thanks,
>     JTDeLys 
Quite welcome,
JES

Re: pdf tools clarification? - PDFText

Posted by JT DeLys <jt...@gmail.com>.
Hi,

> With PDFText2, the found text is added (rendered) to the main tests that
> SpamAssassin does.
>

Do you mean to those tests defined in 80_additional.cf? or others?

> PDFText2 can also use gocr to do OCR on any PDF images. I'm not sold on that
> as the first one I tested it on gave back :
>

Is that different capability/functionality than FuzzyOCR is undertaking?

> Hope that helps,
> JES
>


-- 
Thanks,

    JTDeLys