Posted to users@spamassassin.apache.org by James MacLean <ma...@ednet.ns.ca> on 2007/07/15 22:05:38 UTC

Re: PDFText Plugin for PDF file scoring - PDFText2.pm for ver 3.2

Theo Van Dinter wrote, on 14/07/07 02:13 PM:
> On Sat, Jul 14, 2007 at 09:54:36AM -0300, James MacLean wrote:
>   
>> Where do I find information on hooking into post_message_parse()? Tried 
>> grepping in the module area with no luck :(. Certainly agree it would be 
>> better to get the text out and let everyone at it :).
>>     
>
> You can ask. :)  But yes, I didn't do a good job of fully documenting how
> this is supposed to work -- you have to know about the plugin call, then
> hunt around Message and Message::Node, etc.  Sorry.  Here are the basics:
>
> First, create a plugin with the post_message_parse method.  Then in
> there, use $msg->find_parts() to find the parts that you're looking
> for (find_parts() is pretty well documented).  Then, you simply take
> the data from $part->decode() and do something to convert it to text.
> Then you take that text and call $part->set_rendered($text).
>
> Later on, when SA looks for the text to use for body rules, uri parsing,
> etc, it takes anything that has rendered text.
>
>   
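
The steps Theo describes could be sketched roughly as follows. This is a
minimal illustration under stated assumptions, not the code in PDFText2.pm:
the package name PDFTextSketch, the helper _pdf_to_text, and the hard-coded
/usr/bin/pdftotext path are all made up for the example, and it assumes
SpamAssassin 3.2 or later, where post_message_parse is available.

```perl
package PDFTextSketch;   # hypothetical name, not the author's plugin

use strict;
use warnings;
use File::Temp ();
use Mail::SpamAssassin::Plugin;
our @ISA = qw(Mail::SpamAssassin::Plugin);

sub new {
  my ($class, $mailsaobject) = @_;
  $class = ref($class) || $class;
  my $self = $class->SUPER::new($mailsaobject);
  bless($self, $class);
  return $self;
}

# Plugin hook: runs after the message has been parsed.
sub post_message_parse {
  my ($self, $opts) = @_;
  my $msg = $opts->{message};

  # Second argument 1 = only return leaf parts.
  foreach my $part ($msg->find_parts(qr{^application/pdf\b}i, 1)) {
    my $text = _pdf_to_text($part->decode());
    # Rendered text is what body rules and URI parsing later see.
    $part->set_rendered($text) if defined $text && length $text;
  }
  return 1;
}

# Hypothetical helper: write the PDF to a temp file and read back
# pdftotext's output ("-" as the output file means stdout).
sub _pdf_to_text {
  my ($data) = @_;
  my $tmp = File::Temp->new(SUFFIX => '.pdf');  # unlinked when $tmp is destroyed
  binmode $tmp;
  print {$tmp} $data;
  $tmp->flush;

  # List-form pipe open, so no shell is involved.
  open(my $fh, '-|', '/usr/bin/pdftotext', $tmp->filename, '-') or return;
  local $/;
  my $text = <$fh>;
  close $fh;
  return $text;
}

1;
```

A plugin along these lines would be loaded with a loadplugin line naming the
package and the path to the .pm file. The sketch ignores error reporting and
taint-mode concerns, which a real plugin would have to deal with.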
Thanks Theo. From this I now have:

http://support.ednet.ns.ca/SpamAssassin/PDFText2.pm

Sorry that I was not aware that I had not been developing for a current 
version :(. Explains why I could not find the pieces that I was told 
about ;).

Sample setup in local.cf :

pdftext_pdfinfo_cmd /usr/bin/pdfinfo
pdftext_pdftotext_cmd /usr/bin/pdftotext
pdftext_pdfimages_cmd /usr/bin/pdfimages
pdftext_gocr_cmd /usr/bin/gocr

body PLUGIN_PDFTEXT_TEST /Stock/i
describe PLUGIN_PDFTEXT_TEST Found word Stock
score PLUGIN_PDFTEXT_TEST 2.5

body PLUGIN_PDFTEXT2 /PDFText2-Title: stock_tmp\.pdf/i
describe PLUGIN_PDFTEXT2 Found the Title stock_tmp.pdf
score PLUGIN_PDFTEXT2 4.5

Current comments :
. It now prepends PDFText2- to the pdfinfo output pushed to the rendered 
text, so that rules can match PDF info fields precisely. Or was that the 
wrong thing to do?
. Added gocr of the images. I see FuzzyOCR does fuzzy matching, which 
this doesn't, so as long as you don't set pdftext_gocr_cmd, it won't do 
that part. Maybe there is a way this one can call that one?
. I am not comfortable with how I create temporary dirs for pdfimages, so 
that might make trouble for folks.
. I cannot test it in our production environment, as that is still 3.1 
and I don't want to try the SVN FuzzyOCR just yet :). So that means I am 
only lightly testing in a development environment.
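
On the temporary directory point, one common approach is to let File::Temp
create and clean up a private directory. The following is only a sketch of
that idea, not what PDFText2.pm does: the helper name with_pdf_images and its
callback interface are hypothetical, and it assumes pdfimages is invoked as
"pdfimages PDF-file image-root".

```perl
use strict;
use warnings;
use File::Temp ();

# Hypothetical helper: extract images into a private temp dir, hand each
# image path to $callback, and return the number of images found.
sub with_pdf_images {
  my ($pdfimages_cmd, $pdf_file, $callback) = @_;

  # Unpredictable name under the system temp dir; the directory and its
  # contents are removed when $dir goes out of scope.
  my $dir = File::Temp->newdir('pdftext-XXXXXX', TMPDIR => 1);

  # List-form system(), so no shell is involved.
  system($pdfimages_cmd, $pdf_file, "$dir/img") == 0 or return 0;

  opendir(my $dh, "$dir") or return 0;
  my @images = sort grep { -f "$dir/$_" } readdir $dh;
  closedir $dh;

  $callback->("$dir/$_") for @images;
  return scalar @images;
}
```

The callback is where something like the gocr step would go, since the image
files no longer exist once the function returns.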

Is there any similar function to post_message_parse in the 3.1 series?

Thanks again everyone,
JES

Re: PDFText2 Plugin for PDF file scoring

Posted by James MacLean <ma...@ednet.ns.ca>.
I made a quick update so that it now also picks up .fdf files in 
application/octet-stream parts.

JES

Re: PDFText Plugin for PDF file scoring - PDFText2.pm for ver 3.2

Posted by James MacLean <ma...@ednet.ns.ca>.
Michael Parker wrote, on 16/07/07 01:58 PM:
> Theo Van Dinter wrote:
>   
>> IMO, if people find this a useful enough feature of 3.2, it's a relatively
>> trivial change in the code as I recall, so a bugzilla request to backport
>> may get somewhere for a future 3.1 release.
>>
>>     
>
> I would +1 a backport.
>
> Michael
>   
If it were added, I would likely code PDFText for it, but I believe in 
moving ahead, so if I get some time, I will try to migrate to 3.2 :).

JES

Re: PDFText Plugin for PDF file scoring - PDFText2.pm for ver 3.2

Posted by Michael Parker <pa...@pobox.com>.
Theo Van Dinter wrote:
> 
> IMO, if people find this a useful enough feature of 3.2, it's a relatively
> trivial change in the code as I recall, so a bugzilla request to backport
> may get somewhere for a future 3.1 release.
> 

I would +1 a backport.

Michael

Re: PDFText Plugin for PDF file scoring - PDFText2.pm for ver 3.2

Posted by Theo Van Dinter <fe...@apache.org>.
On Sun, Jul 15, 2007 at 05:05:38PM -0300, James MacLean wrote:
> . I can not test it in our production environment as that is still 3.1 
> and I don't want to try the SVN FuzzyOCR just yet :). So that means I am 
> only lightly testing in a development environment.
> 
> Is there any similar function to post_message_parse in the 3.1 series?

Yes and no.  You could probably fake the rendering part via check_start
or something, but 3.1 will only look at text/ and message/ parts, so
even if you did render something it wouldn't get used unless you (imo)
completely mangle the internal data structure by changing the content-type
of the part.

IMO, if people find this a useful enough feature of 3.2, it's a relatively
trivial change in the code as I recall, so a bugzilla request to backport
may get somewhere for a future 3.1 release.
