Posted to users@spamassassin.apache.org by James MacLean <ma...@ednet.ns.ca> on 2007/07/14 00:09:06 UTC

PDFText Plugin for PDF file scoring - not for PDF images

Hi folks,

Regrets if this is the wrong list.

Wanted to be able to score on text found in PDF files. Did not see any 
obvious route, so made a plugin that calls XPDF's pdfinfo and pdftotext 
to get the text that is then scored.

Sample local.cf could be :

pdftotext_cmd /usr/local/bin/pdftotext
pdfinfo_cmd /usr/local/bin/pdfinfo
body PDF_TO_TEXT eval:check_pdftext("^Error","sex","drugs",'Title:\s+stock_tmp.pdf:4','Creator:\s+OpenOffice.org 1.1.4:4')

Note that a trailing :4 makes a match on that regex worth 4 points.
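
The thread does not show how PDFText.pm parses these arguments, so the
following is only a sketch of one way the "regex:score" convention could
work. The helper names (split_pattern_score, score_text) and the default
score of 1 for patterns without a suffix are assumptions made for
illustration, not part of the plugin.

```perl
use strict;
use warnings;

# Hypothetical helper: split an argument of the form "regex" or
# "regex:score" into its pattern and its score.
sub split_pattern_score {
  my ($arg, $default_score) = @_;
  # Only a trailing ":<number>" is treated as a score, so colons inside
  # the regex itself (e.g. "Title:\s+...") are left alone.
  if ($arg =~ /^(.*):(\d+(?:\.\d+)?)$/s) {
    return ($1, $2 + 0);
  }
  return ($arg, $default_score);
}

# Hypothetical scorer: add up the scores of every pattern found in $text.
sub score_text {
  my ($text, $default_score, @args) = @_;
  my $total = 0;
  foreach my $arg (@args) {
    my ($pattern, $score) = split_pattern_score($arg, $default_score);
    $total += $score if $text =~ /$pattern/m;
  }
  return $total;
}

# Sample text shaped like pdfinfo output (made up for this example).
my $text = "Title:          stock_tmp.pdf\n"
         . "Creator:        OpenOffice.org 1.1.4\n";

my $total = score_text($text, 1,
  '^Error',
  'Title:\s+stock_tmp.pdf:4',
  'Creator:\s+OpenOffice.org 1.1.4:4');

print "$total\n";   # prints 8: both scored patterns match, ^Error does not
```

One consequence of this convention is that a pattern that really needs to
end in a colon followed by digits would be misread as carrying a score.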

Really don't know if this was the right road to follow, as I copied the 
AntiVirus.pm and came up with this:

http://support.ednet.ns.ca/SpamAssassin/PDFText.pm

So far... it appears to work as expected and didn't take down a pretty 
busy server ;).

Enjoy hearing any positive criticisms :).

JES

Re: PDFText2 Plugin for PDF file scoring

Posted by James MacLean <ma...@ednet.ns.ca>.
James MacLean wrote, on 15/07/07 05:05 PM:
> Theo Van Dinter wrote, on 14/07/07 02:13 PM:
>> On Sat, Jul 14, 2007 at 09:54:36AM -0300, James MacLean wrote:
>>   
>>> Where do I find information on hooking into post_message_parse()? Tried 
>>> grepping in the module area with no luck :(. Certainly agree it would be 
>>> better to get the text out and let everyone at it :).
>>>     
>>
>> You can ask. :)  But yes, I didn't do a good job of fully documenting how
>> this is supposed to work -- you have to know about the plugin call, then
>> hunt around Message and Message::Node, etc.  Sorry.  Here's the basics:
>>
>> First, create a plugin with the post_message_parse method.  Then in
>> there, use $msg->find_parts() to find the parts that you're looking
>> for (find_parts() is pretty well documented).  Then, you simply take
>> the data from $part->decode() and do something to convert it to text.
>> Then you take that text and call $part->set_rendered($text).
>>
>> Later on, when SA looks for the text to use for body rules, uri parsing,
>> etc, it takes anything that has rendered text.
>>
>>   
> Thanks Theo. From this I now have:
>
> http://support.ednet.ns.ca/SpamAssassin/PDFText2.pm
>
> Sorry that I was not aware that I had not been developing for a 
> current version :(. Explains why I could not find the pieces that I 
> was told about ;).
>
> Sample setup in local.cf :
>
> pdftext_pdfinfo_cmd /usr/bin/pdfinfo
> pdftext_pdftotext_cmd /usr/bin/pdftotext
> pdftext_pdfimages_cmd /usr/bin/pdfimages
> pdftext_gocr_cmd /usr/bin/gocr
>
> body PLUGIN_PDFTEXT_TEST /Stock/i
> describe PLUGIN_PDFTEXT_TEST Found word Stock
> score PLUGIN_PDFTEXT_TEST 2.5
>
> body PLUGIN_PDFTEXT2 /PDFText2-Title: stock_tmp.pdf/i
> describe PLUGIN_PDFTEXT2 Found the Title stock_tmp.pdf
> score PLUGIN_PDFTEXT2 4.5
>
> Current comments :
> . now it will prepend PDFText2- to the pdfinfo pushed to render so 
> that accurate PDFinfo matching can be done. Or was that the wrong 
> thing to do?
> . added gocr of the images, but I see FuzzyOCR does fuzzy matching 
> which this doesn't so as long as you don't set pdftext_gocr_cmd, it 
> won't do that part. Maybe there is a way this one can call that one?
> . not comfortable with how I create temporary dirs for pdfimages, so 
> that might make trouble for folks.
> . I can not test it in our production environment as that is still 3.1 
> and I don't want to try the SVN FuzzyOCR just yet :). So that means I 
> am only lightly testing in a development environment.
>
> Is there any similar function to post_message_parse in the 3.1 series?
>
> Thanks again everyone,
> JES
Made a quick update so that it now also handles .fdf files sent as 
application/octet-stream.

JES

Re: PDFText Plugin for PDF file scoring - PDFText2.pm for ver 3.2

Posted by James MacLean <ma...@ednet.ns.ca>.
Michael Parker wrote, on 16/07/07 01:58 PM:
> Theo Van Dinter wrote:
>   
>> IMO, if people find this a useful enough feature of 3.2, it's a relatively
>> trivial change in the code as I recall, so a bugzilla request to backport
>> may get somewhere for a future 3.1 release.
>>
>>     
>
> I would +1 a backport.
>
> Michael
>   
If it were added, I would likely code PDFText for it, but I believe in 
moving ahead, so if I get some time, I will try to migrate to 3.2 :).

JES

Re: PDFText Plugin for PDF file scoring - PDFText2.pm for ver 3.2

Posted by Michael Parker <pa...@pobox.com>.
Theo Van Dinter wrote:
> 
> IMO, if people find this a useful enough feature of 3.2, it's a relatively
> trivial change in the code as I recall, so a bugzilla request to backport
> may get somewhere for a future 3.1 release.
> 

I would +1 a backport.

Michael

Re: PDFText Plugin for PDF file scoring - PDFText2.pm for ver 3.2

Posted by Theo Van Dinter <fe...@apache.org>.
On Sun, Jul 15, 2007 at 05:05:38PM -0300, James MacLean wrote:
> . I can not test it in our production environment as that is still 3.1 
> and I don't want to try the SVN FuzzyOCR just yet :). So that means I am 
> only lightly testing in a development environment.
> 
> Is there any similar function to post_message_parse in the 3.1 series?

Yes and no.  You could probably fake the rendering part via check_start
or something, but 3.1 will only look at text/ and message/ parts, so
even if you did render something it wouldn't get used unless you (imo)
completely mangle the internal data structure by changing the content-type
of the part.

IMO, if people find this a useful enough feature of 3.2, it's a relatively
trivial change in the code as I recall, so a bugzilla request to backport
may get somewhere for a future 3.1 release.

-- 
Randomly Selected Tagline:
"Of course, the more I learn, the more I realize I don't know.  At some
 point, I hope to learn enough to realize that I know nothing at all.
 Then maybe I'll be able to snatch a pebble from Julia Child's hand."
         - Alton Brown, "I'm Just Here For The Food"

Re: PDFText Plugin for PDF file scoring - PDFText2.pm for ver 3.2

Posted by James MacLean <ma...@ednet.ns.ca>.
Theo Van Dinter wrote, on 14/07/07 02:13 PM:
> On Sat, Jul 14, 2007 at 09:54:36AM -0300, James MacLean wrote:
>   
>> Where do I find information on hooking into post_message_parse()? Tried 
>> grepping in the module area with no luck :(. Certainly agree it would be 
>> better to get the text out and let everyone at it :).
>>     
>
> You can ask. :)  But yes, I didn't do a good job of fully documenting how
> this is supposed to work -- you have to know about the plugin call, then
> hunt around Message and Message::Node, etc.  Sorry.  Here's the basics:
>
> First, create a plugin with the post_message_parse method.  Then in
> there, use $msg->find_parts() to find the parts that you're looking
> for (find_parts() is pretty well documented).  Then, you simply take
> the data from $part->decode() and do something to convert it to text.
> Then you take that text and call $part->set_rendered($text).
>
> Later on, when SA looks for the text to use for body rules, uri parsing,
> etc, it takes anything that has rendered text.
>
>   
Thanks Theo. From this I now have:

http://support.ednet.ns.ca/SpamAssassin/PDFText2.pm

Sorry that I was not aware that I had not been developing for a current 
version :(. Explains why I could not find the pieces that I was told 
about ;).

Sample setup in local.cf :

pdftext_pdfinfo_cmd /usr/bin/pdfinfo
pdftext_pdftotext_cmd /usr/bin/pdftotext
pdftext_pdfimages_cmd /usr/bin/pdfimages
pdftext_gocr_cmd /usr/bin/gocr

body PLUGIN_PDFTEXT_TEST /Stock/i
describe PLUGIN_PDFTEXT_TEST Found word Stock
score PLUGIN_PDFTEXT_TEST 2.5

body PLUGIN_PDFTEXT2 /PDFText2-Title: stock_tmp.pdf/i
describe PLUGIN_PDFTEXT2 Found the Title stock_tmp.pdf
score PLUGIN_PDFTEXT2 4.5

Current comments :
. now it will prepend PDFText2- to the pdfinfo pushed to render so that 
accurate PDFinfo matching can be done. Or was that the wrong thing to do?
. added gocr of the images, but I see FuzzyOCR does fuzzy matching, which 
this doesn't. As long as you don't set pdftext_gocr_cmd, it won't do 
that part. Maybe there is a way this one can call that one?
. not comfortable with how I create temporary dirs for pdfimages, so 
that might make trouble for folks.
. I can not test it in our production environment as that is still 3.1 
and I don't want to try the SVN FuzzyOCR just yet :). So that means I am 
only lightly testing in a development environment.
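
On the temporary directory concern above: the thread does not show how
PDFText2.pm creates its directories, so this is only a sketch of one
common approach using the core File::Temp and File::Path modules. The
helper names (with_private_dir, extract_images) and the "img" file prefix
are assumptions made for illustration.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use File::Path qw(rmtree);

# Hypothetical helper: run $work->($dir) inside a fresh private directory
# and remove that directory afterwards, even if $work dies.
sub with_private_dir {
  my ($work) = @_;
  # tempdir() picks an unpredictable name and creates the directory with
  # mode 0700, so other local users cannot plant files in it.
  my $dir = tempdir('pdftext.XXXXXXXX', TMPDIR => 1);
  my $result = eval { $work->($dir) };
  my $err = $@;
  rmtree($dir);
  die $err if $err;
  return $result;
}

# Hypothetical use: extract the images of $pdf_path with pdfimages and
# hand each image file to $handle_image. Returns the number of images.
# The list form of system() keeps the file names away from a shell.
sub extract_images {
  my ($pdfimages_cmd, $pdf_path, $handle_image) = @_;
  return with_private_dir(sub {
    my ($dir) = @_;
    system($pdfimages_cmd, $pdf_path, "$dir/img") == 0 or return 0;
    opendir(my $dh, $dir) or return 0;
    my @images = grep { /^img-/ } readdir($dh);
    closedir($dh);
    $handle_image->("$dir/$_") foreach @images;
    return scalar @images;
  });
}
```

Removing the directory explicitly, rather than relying on cleanup at
program exit, matters in a long-running daemon where exit may be a long
way off.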

Is there any similar function to post_message_parse in the 3.1 series?

Thanks again everyone,
JES

Re: PDFText Plugin for PDF file scoring - not for PDF images

Posted by Theo Van Dinter <fe...@apache.org>.
On Sat, Jul 14, 2007 at 09:54:36AM -0300, James MacLean wrote:
> Where do I find information on hooking into post_message_parse()? Tried 
> grepping in the module area with no luck :(. Certainly agree it would be 
> better to get the text out and let everyone at it :).

You can ask. :)  But yes, I didn't do a good job of fully documenting how
this is supposed to work -- you have to know about the plugin call, then
hunt around Message and Message::Node, etc.  Sorry.  Here's the basics:

First, create a plugin with the post_message_parse method.  Then in
there, use $msg->find_parts() to find the parts that you're looking
for (find_parts() is pretty well documented).  Then, you simply take
the data from $part->decode() and do something to convert it to text.
Then you take that text and call $part->set_rendered($text).

Later on, when SA looks for the text to use for body rules, uri parsing,
etc, it takes anything that has rendered text.

So here's a quick n' dirty sample that takes parts of "image/theo" and
"renders" them into "The plugin works!\n":

------------
package Mail::SpamAssassin::Plugin::RenderExample;

use Mail::SpamAssassin::Plugin;
use strict;
use warnings;

use vars qw(@ISA);
@ISA = qw(Mail::SpamAssassin::Plugin);

sub new {
  my $class = shift; 
  my $mailsaobject = shift;
  $class = ref($class) || $class;
  my $self = $class->SUPER::new($mailsaobject);
  bless ($self, $class);
  return $self;
}

sub post_message_parse {
  my ($self, $opts) = @_;
  my $msg = $opts->{'message'};
  foreach my $p ( $msg->find_parts(qr!^image/theo$!, 1) ) {
    $p->set_rendered("The plugin works!\n");
  }
}

1;
------------
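
Applied to PDF attachments, the same steps (find_parts, decode, convert,
set_rendered) might look like the sketch below. It is not the code from
PDFText2.pm. The package name, the hard-coded pdftotext path and the
choice to match only application/pdf are assumptions for illustration,
and it needs SpamAssassin 3.2 and XPDF's pdftotext installed to run.

```perl
package Mail::SpamAssassin::Plugin::PDFRenderSketch;

use strict;
use warnings;
use Mail::SpamAssassin::Plugin;
use File::Temp qw(tempfile);

our @ISA = qw(Mail::SpamAssassin::Plugin);

# Assumption: location of pdftotext. A real plugin would read this from
# a config setting instead of hard-coding it.
my $PDFTOTEXT = '/usr/bin/pdftotext';

sub new {
  my ($class, $mailsaobject) = @_;
  $class = ref($class) || $class;
  my $self = $class->SUPER::new($mailsaobject);
  bless ($self, $class);
  return $self;
}

sub post_message_parse {
  my ($self, $opts) = @_;
  my $msg = $opts->{'message'};

  # Only application/pdf is matched here; PDFs sent under another
  # content type would need a wider pattern.
  foreach my $p ( $msg->find_parts(qr!^application/pdf$!, 1) ) {
    # Write the decoded attachment to a temporary file for pdftotext.
    my ($fh, $pdf) = tempfile('pdfsketch.XXXXXXXX',
                              SUFFIX => '.pdf', TMPDIR => 1, UNLINK => 1);
    binmode($fh);
    my $data = $p->decode();
    print {$fh} $data;
    close($fh);

    # "-" as the output file makes pdftotext write the text to stdout.
    # The list form of open() keeps the file name away from a shell.
    my $text = '';
    if (open(my $out, '-|', $PDFTOTEXT, $pdf, '-')) {
      local $/;            # slurp the whole output at once
      $text = <$out>;
      close($out);
    }
    unlink($pdf);

    # Only mark the part as rendered if some text came out.
    $p->set_rendered($text) if defined $text && length $text;
  }
}

1;
```

With the text rendered this way, ordinary body rules see it and no
plugin-specific eval rule is needed.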

-- 
Randomly Selected Tagline:
"I'm a programmer: I don't buy software, I write it." - Tom Christiansen

Re: PDFText Plugin for PDF file scoring - not for PDF images

Posted by James MacLean <ma...@ednet.ns.ca>.
Dallas Engelken wrote, on 14/07/07 12:17 AM:
> James MacLean wrote:
>> Hi folks,
>>
>> Regrets if this is the wrong list.
>>
>> Wanted to be able to score on text found in PDF files. Did not see 
>> any obvious route, so made a plugin that calls XPDF's pdfinfo and 
>> pdftotext to get the text that is then scored.
>>
>> Sample local.cf could be :
>>
>> pdftotext_cmd /usr/local/bin/pdftotext
>> pdfinfo_cmd /usr/local/bin/pdfinfo
>> body PDF_TO_TEXT eval:check_pdftext("^Error","sex","drugs",'Title:\s+stock_tmp.pdf:4','Creator:\s+OpenOffice.org 1.1.4:4')
>>
>> Notice that a :4 gives a find of that regex 4 points.
>>
>> Really don't know if this was the right road to follow, as I copied 
>> the AntiVirus.pm and came up with this:
>> http://support.ednet.ns.ca/SpamAssassin/PDFText.pm
>>
>> So far... it appears to work as expected and didn't take down a 
>> pretty busy server ;).
>>
>> Enjoy hearing any positive criticisms :).
>
> I did this the other day with CAM::PDF, but Theo recommended this work 
> should be done in the post_message_parse() plugin call.   Then you 
> could just write body rules against the text, URIs would get checked 
> by the URIDNSBL plugin, etc....
>
> -- 
> Dallas Engelken
> dallase@uribl.com
> http://uribl.com
>
I did start with keeping it all in Perl, but when I tested my first SPAM 
with the CAM::PDF utils, it resulted in just a bunch of space-separated 
letters :(. Interested in getting something working, I switched to the 
XPDF utils. Maybe getpdftext.pl is not a good example of how the modules 
work?

Where do I find information on hooking into post_message_parse()? Tried 
grepping in the module area with no luck :(. Certainly agree it would be 
better to get the text out and let everyone at it :). I couldn't see how 
to do that when I started down this road. I was even first trying to see 
if Exim would add another attachment to the e-mail which would be the 
output of pdftotext, but again, wanted to get something running, so 
opted for what is there now :(.

Thanks,
JES

Re: PDFText Plugin for PDF file scoring - not for PDF images

Posted by Dallas Engelken <da...@uribl.com>.
James MacLean wrote:
> Hi folks,
>
> Regrets if this is the wrong list.
>
> Wanted to be able to score on text found in PDF files. Did not see any 
> obvious route, so made a plugin that calls XPDF's pdfinfo and 
> pdftotext to get the text that is then scored.
>
> Sample local.cf could be :
>
> pdftotext_cmd /usr/local/bin/pdftotext
> pdfinfo_cmd /usr/local/bin/pdfinfo
> body PDF_TO_TEXT eval:check_pdftext("^Error","sex","drugs",'Title:\s+stock_tmp.pdf:4','Creator:\s+OpenOffice.org 1.1.4:4')
>
> Notice that a :4 gives a find of that regex 4 points.
>
> Really don't know if this was the right road to follow, as I copied 
> the AntiVirus.pm and came up with this:
> http://support.ednet.ns.ca/SpamAssassin/PDFText.pm
>
> So far... it appears to work as expected and didn't take down a pretty 
> busy server ;).
>
> Enjoy hearing any positive criticisms :).

I did this the other day with CAM::PDF, but Theo recommended this work 
should be done in the post_message_parse() plugin call.   Then you could 
just write body rules against the text, URIs would get checked by the 
URIDNSBL plugin, etc....

--
Dallas Engelken
dallase@uribl.com
http://uribl.com