Posted to users@spamassassin.apache.org by decoder <de...@own-hero.net> on 2007/07/03 15:22:05 UTC

FuzzyOcr and PDF files

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hello all,

Because some people insisted on it, I have added an experimental
feature to FuzzyOcr that lets it scan PDFs as if they were images.

The feature is implemented in the latest SVN revision and is
disabled by default.

Personally, I would not use this feature, because the risk of false
positives on important documents is high. If you still want to test
it, here are the steps to enable it:

1. Get the dependencies:
 - A netpbm version that includes pstopnm
 - Poppler (http://poppler.freedesktop.org/) for the pdfinfo and
   pdftops binaries

2. Add those binaries as helper apps in FuzzyOcr.cf (see the .cf file
   included in SVN).

3. Enable PDF scanning with "focr_scan_pdfs 1" in the config.

Optionally, you can skip PDFs that contain more than a given number
of pages (focr_pdf_maxpages).
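
To illustrate the page limit, here is a minimal Python sketch of such a
check. It is not FuzzyOcr's actual code (FuzzyOcr is a Perl plugin); it
only assumes that the page count is read from the "Pages:" line that
pdfinfo prints. The function names and the rule "scan only if the count
is known and not above the limit" are assumptions made for this example.

```python
import re
from typing import Optional


def parse_page_count(pdfinfo_output: str) -> Optional[int]:
    """Return the page count from pdfinfo's text output, or None if missing."""
    # pdfinfo prints one "Key:   value" pair per line, e.g. "Pages:   3".
    match = re.search(r"^Pages:\s+(\d+)\s*$", pdfinfo_output, re.MULTILINE)
    return int(match.group(1)) if match else None


def should_scan(pdfinfo_output: str, max_pages: int) -> bool:
    """Hypothetical policy: scan only PDFs with a known page count <= max_pages."""
    pages = parse_page_count(pdfinfo_output)
    return pages is not None and pages <= max_pages
```

Here max_pages plays the role of focr_pdf_maxpages. How FuzzyOcr treats
a PDF whose page count cannot be determined is not covered above, so
skipping it is only a cautious guess.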

Currently, the parameters for pstopnm are hardcoded (-xsize=1000).
There may be better options or values for translating PDFs into
usable, but not too big, PNM files.
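
For anyone who wants to experiment with other values outside of
FuzzyOcr, the conversion chain can be sketched in Python roughly as
follows. This is an approximation, not the plugin's real invocation:
only the pdftops and pstopnm binaries and the -xsize=1000 value come
from the description above. The function names, the working directory
and the file naming are assumptions made for this example.

```python
import subprocess
from pathlib import Path
from typing import List


def build_commands(pdf_path: str, work_dir: str, xsize: int = 1000) -> List[List[str]]:
    """Build the two commands that turn a PDF into PNM images via PostScript."""
    ps_path = Path(work_dir) / (Path(pdf_path).stem + ".ps")
    return [
        # Step 1: pdftops (Poppler) converts the PDF to PostScript.
        ["pdftops", str(pdf_path), str(ps_path)],
        # Step 2: pstopnm (netpbm) renders the PostScript to PNM,
        # with the output width set by -xsize.
        ["pstopnm", f"-xsize={xsize}", str(ps_path)],
    ]


def convert(pdf_path: str, work_dir: str, xsize: int = 1000) -> None:
    """Run both steps in work_dir; raises CalledProcessError if one fails."""
    for cmd in build_commands(pdf_path, work_dir, xsize):
        subprocess.run(cmd, check=True, cwd=work_dir)
```

Calling convert("sample.pdf", "/tmp/work", xsize=800), for example, lets
you compare the size and readability of the resulting PNM files against
the default of 1000.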

If you know a better way, tell me. I am also missing recent PDF spam
samples (ones that contain images), so if you could upload some
samples, that would help too.


Best regards,


Chris
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.3 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFGik19JQIKXnJyDxURAs04AKDFRAq4khA+iRouIbpVBZEsjxEJ6ACeLpBO
F4GSUMSqpHubHr9bZkSLS+w=
=Nu8d
-----END PGP SIGNATURE-----