Posted to users@spamassassin.apache.org by arni <ma...@arni.name> on 2007/07/12 04:00:33 UTC

PDF Decoder - Show of concept

Hi,

what I'm going to show you is purely a proof of concept - you should not 
use the code in a production environment, because it most likely has 
exploitable bugs as well as inaccuracies that will keep it from parsing 
all mail properly.

I put this together in about an hour to show how it's possible to 
cope with PDF spam - the script completely decodes the PDF attachment 
into text and images and reattaches them. This way the text is fully 
available to all of SA's processing, and the images are available to 
FuzzyOCR, if installed.
The code is PHP, because that's easiest for me to write.

It also has a nice side effect: you can see the text from a PDF 
without having to open it ;-)

If someone could make an SA plugin that does the same thing in a clean 
and safe manner, that would be great,
arni

Re: PDF Decoder - Show of concept

Posted by Theo Van Dinter <fe...@apache.org>.
On Thu, Jul 12, 2007 at 04:00:33AM +0200, arni wrote:
> I put this together in about an hour to show how it's possible to 
> cope with PDF spam - the script completely decodes the PDF attachment 
> into text and images and reattaches them. This way the text is fully 
> available to all of SA's processing, and the images are available to 
> FuzzyOCR, if installed.

Please don't do that (adding in new message parts), btw.  There's a 3.2
plugin call (post_message_parse, per bug 5069) which was specifically
added so that plugins can manipulate messages after the initial parse
has completed.  This allows for things like OCR of images and PDF->text:
the rendered text can go right in the message part, and then gets
included automatically by SA as body text, so it is available for body
rules, URI parsing, etc.
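
To illustrate the idea only: the sketch below is Python using the standard
email module, not SpamAssassin's Perl plugin API. It walks an already-parsed
message, renders each application/pdf part to text, and keeps that text
associated with the existing part instead of appending new MIME parts. The
names fake_pdf_to_text, render_pdf_parts and build_sample are hypothetical,
and the "extractor" is a stand-in that does not really parse PDF.

```python
from email.message import EmailMessage


def fake_pdf_to_text(pdf_bytes):
    """Hypothetical stand-in for a real PDF text extractor (assumption)."""
    return "extracted text (%d bytes of PDF)" % len(pdf_bytes)


def render_pdf_parts(msg, extractor=fake_pdf_to_text):
    """Return {part index: rendered text} for every application/pdf part.

    The message is left untouched: no MIME parts are added or removed.
    """
    rendered = {}
    for index, part in enumerate(msg.walk()):
        if part.get_content_type() == "application/pdf":
            # decode=True undoes the base64 transfer encoding
            payload = part.get_payload(decode=True) or b""
            rendered[index] = extractor(payload)
    return rendered


def build_sample():
    """Build a small test message with one dummy PDF attachment."""
    msg = EmailMessage()
    msg["Subject"] = "test"
    msg.set_content("plain body")
    msg.add_attachment(b"%PDF-1.4 dummy", maintype="application",
                       subtype="pdf", filename="spam.pdf")
    return msg


if __name__ == "__main__":
    sample = build_sample()
    for index, text in render_pdf_parts(sample).items():
        print(index, text)
```

A scanner could then feed the rendered text to its body rules alongside the
ordinary text parts, while the message structure stays exactly as it was
received.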


-- 
Randomly Selected Tagline:
"Never go off on tangents, which are lines that intersect a curve at only
 one point and were discovered by Euclid, who live in the 6th century,
 which was an era dominated by the Goths, who lived in what we now know
 as Poland." - Unknown from Nov. 1998 issue of Infosystems Executive.