Posted to users@spamassassin.apache.org by arni <ma...@arni.name> on 2007/06/28 03:14:02 UTC

pdf spam solution idea

Hi,

It has come up several times now that people ask for a way to detect 
PDF spam directly from the PDF content, and not only through headers or 
other means (hashes, Bayes).
I've found an approach that should be pretty easy to realise in a 
FuzzyOCR-like plugin. Here is what it should do:

Use xpdf (http://www.foolabs.com/xpdf/download.html) to read the PDF 
document
export the images to PPM files using `pdfimages`
export the text parts to a plain text file using `pdftotext`
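
A minimal sketch of the two extraction steps above, written in Python rather than Perl and not as a SpamAssassin plugin. It assumes the xpdf tools `pdfimages` and `pdftotext` are installed and on the PATH. The function names, the `img` image root and the 30-second timeout are hypothetical choices made for illustration only.

```python
import subprocess
import tempfile
from pathlib import Path


def build_commands(pdf_path, image_root):
    """Return the two xpdf command lines for one PDF attachment.

    pdfimages writes one image file per embedded image, named after
    image_root; pdftotext with "-" as the output file prints to stdout.
    """
    return [
        ["pdfimages", str(pdf_path), str(image_root)],
        ["pdftotext", str(pdf_path), "-"],
    ]


def extract_pdf(pdf_bytes, timeout=30):
    """Extract (text, list of image bytes) from one PDF attachment.

    Assumption: the xpdf binaries are available on PATH. The timeout
    guards against PDFs that make the tools hang.
    """
    with tempfile.TemporaryDirectory() as workdir:
        pdf_path = Path(workdir) / "attachment.pdf"
        pdf_path.write_bytes(pdf_bytes)
        image_root = Path(workdir) / "img"

        images_cmd, text_cmd = build_commands(pdf_path, image_root)
        subprocess.run(images_cmd, check=True, timeout=timeout)
        result = subprocess.run(
            text_cmd, check=True, capture_output=True, timeout=timeout
        )

        text = result.stdout.decode("utf-8", errors="replace")
        # Collect whatever image files pdfimages produced under the root.
        images = [p.read_bytes() for p in sorted(Path(workdir).glob("img-*"))]
    return text, images
```

Working in a temporary directory keeps the extracted files from outliving the scan. The returned image bytes are what an OCR step such as FuzzyOCR would then look at.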

This plugin should run as one of the first, so that the extracted raw 
text is available to later rules (for example by attaching it as an 
extra MIME part, or somehow internally) and the images are available to 
FuzzyOCR or similar by the same means.
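
The "extra MIME part" variant could look roughly like the sketch below. It uses Python's standard `email` package purely as an illustration and says nothing about how SpamAssassin handles message parts internally. The function name and the `pdf-extracted.txt` filename are hypothetical.

```python
from email.message import EmailMessage


def attach_extracted_text(msg, text):
    """Append text extracted from a PDF as an extra text/plain part.

    add_attachment() converts the message to multipart/mixed when it
    is not one already, so the original body is kept as the first part.
    """
    msg.add_attachment(text, subtype="plain", filename="pdf-extracted.txt")
    return msg
```

Later checks that only understand text parts would then see the PDF text like any other body text.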

Unfortunately I won't be able to write such a plugin myself. It should be 
rather easy to do, but I can't start learning Perl just for this ;-)

Maybe I gave some hints ...

arni

Re: pdf spam solution idea

Posted by Dallas Engelken <da...@uribl.com>.
arni wrote:
> Hi,
>
> It has come up several times now that people ask for a way to detect 
> PDF spam directly from the PDF content, and not only through headers 
> or other means (hashes, Bayes).
> I've found an approach that should be pretty easy to realise in a 
> FuzzyOCR-like plugin. Here is what it should do:
>
> Use xpdf (http://www.foolabs.com/xpdf/download.html) to read the PDF 
> document
> export the images to PPM files using `pdfimages`
> export the text parts to a plain text file using `pdftotext`
>
> This plugin should run as one of the first, so that the extracted raw 
> text is available to later rules (for example by attaching it as an 
> extra MIME part, or somehow internally) and the images are available 
> to FuzzyOCR or similar by the same means.
>
> Unfortunately I won't be able to write such a plugin myself. It should 
> be rather easy to do, but I can't start learning Perl just for this ;-)

I already have... I'll be releasing the info soon.

-- 
Dallas Engelken
dallase@uribl.com
http://uribl.com