Posted to users@spamassassin.apache.org by James MacLean <ma...@ednet.ns.ca> on 2007/07/16 23:26:38 UTC
Re: pdf tools clarification? - PDFText
JT DeLys wrote, on 16/07/07 02:14 PM:
> Hi,
>
> Could someone perhaps succinctly summarize the various & sundry
> anti-pdf-image-spam tools that are currently in play?
>
> PDFText
> -- works in 3.2, not 3.1
>
This one is my fault :(. PDFText _does_ work in 3.1, and that is where we
are getting the most use from it. PDFText2 is for 3.2.
Its goal was/is to get the text from PDFs and do your spam matching on
it. With PDFText, you have to request the match tests with an
eval: rule, e.g.:
body PDF_TO_TEXT eval:check_pdftext('stock','profit','Symbol::4')
That example gives the match "Symbol:" a value of 4 points.
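To make the weighting idea concrete, here is a minimal Python sketch. It is not the plugin's Perl code. It assumes the number after the final colon is the score, that terms without a score count as 1 point, and that terms are plain case-insensitive substrings; the helper names `parse_term` and `score_pdf_text` and all three of those behaviours are assumptions for illustration, not taken from PDFText itself.

```python
import re

def parse_term(term, default_score=1.0):
    """Split 'Symbol::4' into ('Symbol:', 4.0).

    Terms with no trailing ':<number>' get default_score (an assumed default).
    """
    head, sep, tail = term.rpartition(":")
    if sep and re.fullmatch(r"\d+(\.\d+)?", tail):
        return head, float(tail)
    return term, default_score

def score_pdf_text(text, terms):
    """Sum the scores of every term found (case-insensitively) in the text."""
    lowered = text.lower()
    total = 0.0
    for term in terms:
        pattern, score = parse_term(term)
        if pattern.lower() in lowered:
            total += score
    return total
```

Called with the three arguments from the rule above, `score_pdf_text(pdf_text, ["stock", "profit", "Symbol::4"])` would add 4 points whenever "Symbol:" appears in the extracted text.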
With PDFText2, the found text is added (rendered) to the main tests that
SpamAssassin does.
Both get the info that comes from running pdfinfo and pdftotext on the
PDFs attached, which gives you access to information like "Title:".
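As a rough illustration of that extraction step, the Python sketch below calls the same two command-line tools and turns the pdfinfo output into a dictionary so that fields like "Title" can be looked up. The function names are hypothetical, and how the plugins themselves invoke the tools is not shown in this thread; this only assumes that pdfinfo and pdftotext are installed and on the PATH.

```python
import subprocess

def parse_pdfinfo(output):
    """Turn pdfinfo's 'Key:   value' lines into a dict of strings."""
    info = {}
    for line in output.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info

def extract_pdf(path):
    """Run pdfinfo and pdftotext on one file; return (metadata, text)."""
    meta = subprocess.run(
        ["pdfinfo", path], capture_output=True, text=True, check=True
    ).stdout
    # "-" as the output file asks pdftotext to write the text to stdout.
    text = subprocess.run(
        ["pdftotext", path, "-"], capture_output=True, text=True, check=True
    ).stdout
    return parse_pdfinfo(meta), text
```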
PDFText2 can also use gocr to do OCR on any PDF images. I'm not sold on
that as the first one I tested it on gave back :
SZSN St_nd_ To Proflt 1,4 mllll On In D_V_lopm_nt ProJ_otI
/Sh_ndon_ Zhouyu_n S__d _nd Nur__ry Co,, Ltd (SZsN7
fo,2g up ao,Bx (9_51 EST7
SZSN _nnouno_d lt_ _nt_rln_ lnto _n __r__m_nt ln _ r__l
__t_t_ d_v_lopm_nt th_t _t_nd_ to proflt th_ o an p_ny f1,4
mllllOn_ Thl_ oomp_ny l_ Ju_t ___rln_ up_ Aot f__t _nd __t
on SZSN,
Good luck with that :).
Hope that helps,
JES
Re: pdf tools clarification? - PDFText
Posted by James MacLean <ma...@ednet.ns.ca>.
JT DeLys wrote, on 16/07/07 07:02 PM:
>
> Seems to me that, assuming I can get the prereqs for FuzzyOCR+pdf built
> correctly (working), that FuzzyOCR /for/ OCR plus PDFText2 for text
> might be a solid solution ...
>
Wish I had your confidence :). PDFText2 is still too young to know if
it holds up under a good load, especially since it calls external
programs to do its work ;).
All the best,
JES
Re: pdf tools clarification? - PDFText
Posted by JT DeLys <jt...@gmail.com>.
> When PDFText2 is loaded, its rendered text will be tested for the word
> stock just like everything else that SpamAssassin offers for your tests to
> match against. You might consider it to be the more SpamAssassin natural way
> of matching against PDF text :).
>
Clear. Thanks.
> Well, I am going to say similar, yet different :). PDFText2 currently does
> an OCR of the images and adds the result to the rendered text. The OCRed
> text may not be very accurate and will not match that well.
>
> FuzzyOCR, if I understand what I have seen so far (and the author will be
> much better than I to respond), takes the OCR output from any one of the
> available OCR engines and uses String::Approx (and maybe other tools) to
> match against a word list you supply specifically for FuzzyOCR. Much better
> chance of getting a hit on images.
>
Seems to me that, assuming I can get the prereqs for FuzzyOCR+pdf built
correctly (working), that FuzzyOCR /for/ OCR plus PDFText2 for text might be
a solid solution ...
--
Thanks,
JTDeLys
Re: pdf tools clarification? - PDFText
Posted by James MacLean <ma...@ednet.ns.ca>.
JT DeLys wrote, on 16/07/07 06:36 PM:
> Hi,
>
> With PDFText2, the found text is added (rendered) to the main
> tests that SpamAssassin does.
>
>
> Do you mean to those tests defined in 80_additional.cf? or others?
It means any test you do on the body of an e-mail will test against this.
For example, in your local.cf you might have:
body STOCK_TEST /stock/i
describe STOCK_TEST Found the word stock
score STOCK_TEST 4.5
When PDFText2 is loaded, its rendered text will be tested for the word
stock just like everything else that SpamAssassin offers for your tests
to match against. You might consider it to be the more SpamAssassin
natural way of matching against PDF text :).
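The effect can be pictured with a small Python sketch: one pattern, equivalent to /stock/i, checked against the ordinary body text plus whatever text was rendered from the PDF. Joining the two strings with a newline is only an assumption made for illustration; how SpamAssassin combines rendered text internally is not described here, and `body_rule_hits` is a hypothetical name.

```python
import re

# Same pattern as the /stock/i body rule above.
STOCK_TEST = re.compile(r"stock", re.IGNORECASE)

def body_rule_hits(body_text, rendered_pdf_text):
    """True if the rule matches the body or the text rendered from the PDF."""
    combined = body_text + "\n" + rendered_pdf_text  # assumed combination
    return STOCK_TEST.search(combined) is not None
```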
>
> PDFText2 can also use gocr to do OCR on any PDF images. I'm not
> sold on that as the first one I tested it on gave back :
>
>
> Is that different capability/functionality than FuzzyOCR is undertaking?
>
>
Well, I am going to say similar, yet different :). PDFText2 currently
does an OCR of the images and adds the result to the rendered text. The
OCRed text may not be very accurate and will not match that well.
FuzzyOCR, if I understand what I have seen so far (and the author will be
much better than I to respond), takes the OCR output from any one of
the available OCR engines and uses String::Approx (and maybe other
tools) to match against a word list you supply specifically for
FuzzyOCR. Much better chance of getting a hit on images.
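To show why approximate matching helps with noisy OCR, here is a small Python sketch of the general idea. It uses difflib from the Python standard library as a stand-in for String::Approx, so it is not FuzzyOCR's actual algorithm; the function name `fuzzy_hits` and the 0.8 similarity threshold are assumptions chosen for the example.

```python
from difflib import SequenceMatcher

def fuzzy_hits(ocr_text, word_list, threshold=0.8):
    """Return the words from word_list that approximately match an OCR token."""
    tokens = ocr_text.lower().split()
    hits = []
    for word in word_list:
        target = word.lower()
        # A word counts as a hit if any token is similar enough to it.
        if any(
            SequenceMatcher(None, target, tok).ratio() >= threshold
            for tok in tokens
        ):
            hits.append(word)
    return hits
```

Given a fragment of the gocr output quoted earlier in the thread, `fuzzy_hits("SZSN St_nd_ To Proflt 1,4 mllll On", ["profit", "stock"])` picks up "profit" from the misread "Proflt", which an exact body rule would miss.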
>
> --
> Thanks,
> JTDeLys
Quite welcome,
JES
Re: pdf tools clarification? - PDFText
Posted by JT DeLys <jt...@gmail.com>.
Hi,
> With PDFText2, the found text is added (rendered) to the main tests that
> SpamAssassin does.
>
Do you mean to those tests defined in 80_additional.cf? or others?
> PDFText2 can also use gocr to do OCR on any PDF images. I'm not sold on that
> as the first one I tested it on gave back :
>
Is that different capability/functionality than FuzzyOCR is undertaking?
> Hope that helps,
> JES
>
--
Thanks,
JTDeLys