Posted to users@spamassassin.apache.org by Jonas Eckerman <jo...@frukt.org> on 2009/06/25 14:44:42 UTC

Re: Plugin extracting text from docs

Matus UHLAR - fantomas wrote:

>> I'm currently working on a modular plugin for extracting text and add it  
>> to SA message parts.
> 
> if possible, extract images too, so the fuzzyocr and similar plugins would
> be able to look at that too.

You mean extract images and add them as parts to the message?

I guess that should be doable. I know that "unrtf" can extract images 
from RTF files. I'll probably add support for this to the plugin, but I 
probably won't implement the actual extraction right away.

> IIRC spammers did even put PDF's to .doc files to make the stuff harder, but
> if you manage the above, it shouldn't be hard to extract PDF's too :)

This I don't understand. Do they put PDFs inside .doc files as if the 
.doc was an archive?

Regards
/Jonas
-- 
Jonas Eckerman
Fruktträdet & Förbundet Sveriges Dövblinda
http://www.fsdb.org/
http://www.frukt.org/
http://whatever.frukt.org/

Re: Plugin extracting text from docs

Posted by Jonas Eckerman <jo...@frukt.org>.
Matus UHLAR - fantomas wrote:

>> This I don't understand. Do they put PDFs inside .doc files as if the  
>> .doc was an archive?
> 
> I am not sure but I think something alike was done.

Considering that an OpenXML document is basically a zip file with XML 
files inside, and that the actual document can contain hyperlinks, I 
guess it could be possible to do something like that. I don't know 
enough about the format to say for sure though.

> What I mean is to have
> generic chain of format converters, where at the end would be plain image
> or even text, that could be processed by classic rules like bayes,
> replacetags etc.

If I manage to figure out how to add new parts to a message from within 
the "post_message_parse" method, that should work just fine.

An extractor plugin can return a list of parts to be added to the 
message, and my module will keep looping through the message parts if 
new parts are added.

So, if a Word extractor extracts a PDF and returns it, the PDF would be 
added to a new part, and in the next loop the PDF part will be sent to a 
PDF extractor if that exists. And so on. I'm running 
"post_message_parse" at priority -1 so any added image parts should be 
available to plugins like FuzzyOCR as well as plugins running 
"post_message_parse" at default priority.
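
The looping described above can be sketched as a simple work queue. This
is only an illustration in Python (the plugin itself is a SpamAssassin
Perl plugin). The names extract_all, word_extractor and pdf_extractor, the
(mime_type, data) tuple layout and the max_parts guard are all assumptions
made for the example, not the plugin's actual interface.

```python
from collections import deque


def extract_all(parts, extractors, max_parts=100):
    """Run extractors over parts, re-queuing any new parts they return.

    parts:      list of (mime_type, data) tuples from the parsed message.
    extractors: dict mapping mime_type -> function(data) -> (text, new_parts).
    max_parts:  hypothetical safety limit against runaway nesting.
    Returns (extracted texts, every part that was looked at).
    """
    queue = deque(parts)
    texts, seen = [], []
    while queue and len(seen) < max_parts:
        mime_type, data = queue.popleft()
        seen.append((mime_type, data))
        extractor = extractors.get(mime_type)
        if extractor is None:
            continue  # nobody knows this type; leave it for other plugins
        text, new_parts = extractor(data)
        if text:
            texts.append(text)
        queue.extend(new_parts)  # newly extracted parts get their own turn
    return texts, seen


# Toy extractors standing in for real converters.
def word_extractor(data):
    return "word text", [("application/pdf", b"pdf-bytes")]


def pdf_extractor(data):
    return "pdf text", [("image/jpeg", b"jpeg-bytes")]


EXTRACTORS = {
    "application/msword": word_extractor,
    "application/pdf": pdf_extractor,
}

texts, seen = extract_all([("application/msword", b"doc-bytes")], EXTRACTORS)
```

In this toy run the JPEG ends up in the list of seen parts without being
converted, which is the kind of leftover that would have to be handed on
to image plugins.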

The missing parts are:

1: How do I add a new part to a parsed message (including a singlepart 
one)? This is of course the main problem.

2: The actual extractor plugin that extracts whatever files are included 
in the word document. Antiword only extracts text, and my extractor for 
OpenXML is little more than an extremely basic XML remover.
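
For illustration, a "basic XML remover" for OpenXML along the lines of
point 2 might look roughly like the Python sketch below. It is not the
plugin's code. It assumes the usual .docx layout, with the main text in
"word/document.xml" and embedded files under "word/media/" or
"word/embeddings/"; the function names are made up for the example.

```python
import io
import re
import zipfile


def docx_text(data: bytes) -> str:
    """Return crude plain text from a .docx file given as bytes."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        xml = archive.read("word/document.xml").decode("utf-8")
    # Turn paragraph ends into line breaks, then drop every remaining tag.
    xml = re.sub(r"</w:p>", "\n", xml)
    return re.sub(r"<[^>]+>", "", xml).strip()


def embedded_files(data: bytes) -> list:
    """List archive members that look like embedded images or objects."""
    prefixes = ("word/media/", "word/embeddings/")  # assumed typical layout
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        return [name for name in archive.namelist() if name.startswith(prefixes)]


def make_sample_docx() -> bytes:
    """Build a tiny docx-like zip in memory, for demonstration only."""
    body = (
        "<w:document><w:body>"
        "<w:p><w:r><w:t>Hello</w:t></w:r></w:p>"
        "<w:p><w:r><w:t>World</w:t></w:r></w:p>"
        "</w:body></w:document>"
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as archive:
        archive.writestr("word/document.xml", body)
        archive.writestr("word/media/image1.png", b"not really a png")
    return buf.getvalue()
```

A real extractor would also need to unescape XML entities and cope with
damaged archives; the sketch skips both.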

Regards
/Jonas

Re: Plugin extracting text from docs

Posted by David B Funk <db...@engineering.uiowa.edu>.
On Fri, 26 Jun 2009, Jonas Eckerman wrote:

> Theo Van Dinter wrote:
>
> > the convolution is a
> > fingerprint that you could write a rule for and then you don't care
> > what the content actually is.  For example, you'd render something
> > like "doc_pdf_jpg", which would make an obvious Bayes token.  In the
> > same way for a zip file, you could do "zip_pdf zip_jpg zip_txt", etc,
> > and they'd all be different tokens.
>
> That's really a good idea. Put the chains of extraction in a
> pseudoheader that can be tested in rules and seen as a token by bayes.
>
> I'm putting that in the todo for the plugin.

It would be a bit cumbersome but you could:
create a "pre-filter" program/milter which would parse attachments &
MIME structures, create special pseudoheaders with the analysis
results in them, insert them into the message and then pass it on
to SA. The full power of SA would then be available to attack the
exposed info in any way that you wanted and wouldn't require any
mods to SA.
If you were worried about information leakage you could create a
post-filter that would remove the pseudoheaders.
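
A rough sketch of such a pre-filter/post-filter pair, in Python and using
only the standard email module, is shown here. The header name
"X-Attachment-Chain" and both function names are invented for the
example, and the analysis itself is left out; the sketch only shows the
header being inserted before scanning and removed afterwards.

```python
from email import message_from_string

PSEUDO_HEADER = "X-Attachment-Chain"  # hypothetical header name


def add_pseudoheader(raw: str, chains: list) -> str:
    """Pre-filter: insert the analysis result as a pseudoheader."""
    msg = message_from_string(raw)
    del msg[PSEUDO_HEADER]  # drop any copy a sender may have forged
    msg[PSEUDO_HEADER] = " ".join(chains)
    return msg.as_string()


def strip_pseudoheader(raw: str) -> str:
    """Post-filter: remove the pseudoheader again before delivery."""
    msg = message_from_string(raw)
    del msg[PSEUDO_HEADER]  # no error is raised if the header is absent
    return msg.as_string()


raw = "From: a@example.org\nSubject: hi\n\nbody\n"
tagged = add_pseudoheader(raw, ["doc_pdf_jpg"])
cleaned = strip_pseudoheader(tagged)
```

Deleting the header before setting it is a design choice for the sketch:
it keeps a spammer from pre-seeding the header with misleading values.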


-- 
Dave Funk                                  University of Iowa
<dbfunk (at) engineering.uiowa.edu>        College of Engineering
319/335-5751   FAX: 319/384-0549           1256 Seamans Center
Sys_admin/Postmaster/cell_admin            Iowa City, IA 52242-1527
#include <std_disclaimer.h>
Better is not better, 'standard' is better. B{

Re: Plugin extracting text from docs

Posted by Jonas Eckerman <jo...@frukt.org>.
Theo Van Dinter wrote:

> the convolution is a
> fingerprint that you could write a rule for and then you don't care
> what the content actually is.  For example, you'd render something
> like "doc_pdf_jpg", which would make an obvious Bayes token.  In the
> same way for a zip file, you could do "zip_pdf zip_jpg zip_txt", etc,
> and they'd all be different tokens.

That's really a good idea. Put the chains of extraction in a 
pseudoheader that can be tested in rules and seen as a token by bayes.

I'm putting that in the todo for the plugin.
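
As a small illustration of how such chain tokens could be built, the
Python sketch below turns a nested description of what was found inside
an attachment into tokens like the ones Theo mentions. The
(kind, children) tuple layout and the function name are assumptions for
the example only.

```python
def chain_tokens(node, prefix=()):
    """Return one token per leaf, joining the container path with '_'.

    node is a (kind, children) tuple, e.g. ("zip", [("pdf", []), ...]).
    """
    kind, children = node
    path = prefix + (kind,)
    if not children:
        return ["_".join(path)]
    tokens = []
    for child in children:
        tokens.extend(chain_tokens(child, path))
    return tokens


# A Word document holding a PDF that holds a JPEG.
nested_doc = ("doc", [("pdf", [("jpg", [])])])
# A zip file holding three unrelated files.
archive = ("zip", [("pdf", []), ("jpg", []), ("txt", [])])

pseudoheader_value = " ".join(chain_tokens(nested_doc) + chain_tokens(archive))
```

The joined string is what would go into the pseudoheader, so that both
rules and bayes get to see each chain as a separate token.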

>> The most common thing to extract apart from text will most likely be images.
>> Any OCR text extractor tied into my plugin would get to see those images,
>> but any OCR SA plugins run after my plugin won't. It might be good to make
>> extracted images available to those, and other image handling plugins.

> But yours already ran, so who cares about the others?

Because they work very differently?

An OCR plugin that adds the rendered text to the message for bayes and 
text rules is very different from one that does its own scoring based 
on the OCRed text.

> If you're expending the resources to OCR the same image in an email
> multiple times ...  You clearly either have a lot of hardware or not a
> lot of mail.

*I* don't use any OCR at all. We don't have the resources for that 
(being a small non-profit NGO), and so far I haven't seen any need for 
OCR either since we never had much image spam slip through anyway.

So I will not implement an OCR extractor for my plugin. I'll leave that 
for others. This is actually one of the reasons I'd like to let existing 
OCR plugins have access to any images extracted by my plugin, so that 
those who already do use OCR can get a benefit from the extraction.

I'm not going to spend much time on it though. I'm happy just extracting 
text. :-) And it does extract text (currently from Word, OpenXML, 
OpenDocument and RTF documents). :-)

I actually hadn't even thought about this image/OCR etc stuff before 
Matus suggested it.

Regards
/Jonas

Re: Plugin extracting text from docs

Posted by Theo Van Dinter <fe...@apache.org>.
On Thu, Jun 25, 2009 at 3:41 PM, Jonas Eckerman<jo...@frukt.org> wrote:
> Matus' example was a Word document that contained a PDF (which might in turn
> contain an image). A plugin that knows how to read Word documents could
> extract the text of the Word document and then use "set_rendered" to make
> that available to SA. It cannot currently extract the PDF and make it
> available to any plugins that know how to read PDFs though.

My view would be that if someone is going to try making things so
convoluted such as that, a) we've won because no one is going to go
through the trouble of opening that doc, b) the convolution is a
fingerprint that you could write a rule for and then you don't care
what the content actually is.  For example, you'd render something
like "doc_pdf_jpg", which would make an obvious Bayes token.  In the
same way for a zip file, you could do "zip_pdf zip_jpg zip_txt", etc,
and they'd all be different tokens.

But yes, you're right, the Message/Message::Node stuff wasn't designed
with the idea of supporting multiple independent data objects from a
single mime part.  I can see the argument for "treat embedded files
similar to multipart", but I still lean towards mime structure only.

> For some stuff coordination would be needed, yes. But not for what I'm
> thinking of.

Why not?  If you have no coordination, you would possibly look for
images first, then pdfs, then word docs, and end up not getting
anywhere.  If it's all your plugin, you can configure the order.  If
it's not, you need coordination.  For example, as from above, if
there's a zip file with a doc which has a pdf which has a jpg, and your
plugin doesn't handle zip but another one does ...
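
To make the ordering problem concrete, here is a toy Python model of
priority-ordered callbacks. It is not SpamAssassin's plugin handler; the
Dispatcher class and the handler names are invented, and it assumes what
the thread implies, namely that lower priority numbers run first and that
the default priority is 0.

```python
class Dispatcher:
    """Calls registered handlers in ascending priority order."""

    def __init__(self):
        self._handlers = []  # (priority, registration order, name, handler)

    def register(self, name, handler, priority=0):
        self._handlers.append((priority, len(self._handlers), name, handler))

    def run(self, message):
        called = []
        ordered = sorted(self._handlers, key=lambda h: (h[0], h[1]))
        for _priority, _seq, name, handler in ordered:
            handler(message)
            called.append(name)
        return called


def extract_images(message):
    # Pretend to pull images out of documents and expose them.
    message["images"].extend(message["embedded_images"])


def ocr_images(message):
    # Record how many images were visible when OCR ran.
    message["ocr_seen"] = len(message["images"])


def run_with(extractor_priority):
    dispatcher = Dispatcher()
    dispatcher.register("ocr", ocr_images)  # default priority 0
    dispatcher.register("extractor", extract_images, priority=extractor_priority)
    message = {"images": [], "embedded_images": ["logo.jpg"], "ocr_seen": None}
    return dispatcher.run(message), message["ocr_seen"]
```

With the extractor at priority -1 the OCR handler sees the extracted
image; with the extractor at priority 1 it runs too late and the OCR
handler sees nothing, which is the situation that needs coordination.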

> The most common thing to extract apart from text will most likely be images.
> Any OCR text extractor tied into my plugin would get to see those images,
> but any OCR SA plugins run after my plugin won't. It might be good to make
> extracted images available to those, and other image handling plugins.

But yours already ran, so who cares about the others?

Seriously.

If you're expending the resources to OCR the same image in an email
multiple times ...  You clearly either have a lot of hardware or not a
lot of mail.

Re: Plugin extracting text from docs

Posted by Jonas Eckerman <jo...@frukt.org>.
Theo Van Dinter wrote:

> I would comment that plugins should probably skip parts they want to
> render that already have rendered text available.

Ah. That's a good idea. Now I'll have to search for a nice way to check 
that. :-)

>> I can't see how "set_rendered" would help in creating a functioning chain
>> where one converter could put an arbitrary extracted object (image, pdf,
>> whatever) where another converter could have a go at it.

> If a plugin wants to get image/* parts and do something with the
> contents, they can do that already.

Not if the image/* parts are actually inside a document.

> If you want to have a plugin do some work on a part's contents, then
> store that result and let another plugin pick up and continue doing
> other work ...  There's no official method to do that.

I guessed as much. This however is what me and Matus were talking about.

> You can store
> data as part of the Node object.

> But what would be a use case for that?

Matus' example was a Word document that contained a PDF (which might in 
turn contain an image). A plugin that knows how to read Word documents 
could extract the text of the Word document and then use "set_rendered" 
to make that available to SA. It cannot currently extract the PDF and 
make it available to any plugins that know how to read PDFs though.

Matus' idea about chains would be that in this example the plugin 
reading the Word document would store any other objects somehow. In this 
case a PDF. After that, any plugin that knows how to handle PDFs will 
get to look at the PDF and extract text and other stuff from it. In case 
it extracts an image, it would then store it the same way, and any image 
handling plugins would find it.

I really don't know how common that is. I have never seen a Word 
document with a PDF inside it myself.

I have however seen many documents that contain images, and I think it 
would be a good idea to make those images available to things like 
FuzzyOCR and ImageInfo.

> Arguably, there could be multiple people developing plugins for
> different types, but you'd need some coordination for the
> register_method_priority calls to figure out who goes in what order.

For some stuff coordination would be needed, yes. But not for what I'm 
thinking of.

The text extraction plugin I'm working on (which started this) itself 
has simple extractor plugins. These plugins will be able to return 
arbitrary objects as well as text, and my plugin will check the returned 
objects the same way it checks the original message parts. This way, all 
the extractors that are tied into my plugin will be able to extract 
stuff from objects extracted by other extractors. So far so good.

The most common thing to extract apart from text will most likely be 
images. Any OCR text extractor tied into my plugin would get to see 
those images, but any OCR SA plugins run after my plugin won't. It might 
be good to make extracted images available to those, and other image 
handling plugins.

My plugin is called after the message is parsed, which is very good for a 
text extractor. FuzzyOCR (as an example) however works by scoring OCR 
output (which may well be very different from the text in the image as we 
see it), and therefore has to be called at a later stage. The same goes 
for ImageInfo.

It might therefore be a good idea to make the extracted images and other 
objects available to scoring plugins as well.

> (btw: I just found the register_method_priority() method. \o/)

It's nice, isn't it? :-)

I'm using it in my URLRedirect plugin.

> Note: Do not try to add or remove parts in the tree.  The tree is
> meant to represent the mime structure of the mail, and each node
> relates to that specific mime part.  The tree is not meant to be a
> temporary data storage mechanism.

Ok. That makes things easier and less easy for me. I know that I'll have 
to implement my own list of stuff to loop through when extractors return 
additional parts in my plugin. That's the easy part.

The difficult part is how to make extracted stuff available to other 
plugins in a way they understand. I see two main ways to do this:

1: Invent a new way. This would require modifications of any plugins 
that should check the extracted objects.

2: Add a container part somewhere that "find_parts" would find, but which 
is not actually a member of the message tree, and then add a simple way 
to add parts to that container. This would require modification of 
Mail::SpamAssassin::Message, but not of the plugins.
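
Option 2 can be pictured with the toy Python model below. It is not how
Mail::SpamAssassin::Message is implemented; the Part and Message classes,
the "extracted" list and the include_extracted flag are all assumptions
for the sketch. The point is only that the MIME tree stays untouched
while a search still finds the extracted objects.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Part:
    mime_type: str
    data: bytes = b""
    children: list = field(default_factory=list)


class Message:
    def __init__(self, root):
        self.root = root      # the real MIME tree, never modified
        self.extracted = []   # side container for extracted objects

    def add_extracted(self, part):
        self.extracted.append(part)

    def _walk(self, part):
        yield part
        for child in part.children:
            yield from self._walk(child)

    def find_parts(self, pattern, include_extracted=True):
        """Return parts whose MIME type matches the regex pattern."""
        regex = re.compile(pattern)
        roots = [self.root] + (self.extracted if include_extracted else [])
        return [p for r in roots for p in self._walk(r) if regex.match(p.mime_type)]


msg = Message(Part("multipart/mixed", children=[
    Part("text/plain", b"hi"),
    Part("application/msword", b"doc-bytes"),
]))
msg.add_extracted(Part("image/jpeg", b"jpeg-bytes"))
images = msg.find_parts(r"image/")
```

An image plugin calling find_parts in this model would get the extracted
JPEG without knowing it never existed as a MIME part.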

Regards
/Jonas

Re: Plugin extracting text from docs

Posted by Theo Van Dinter <fe...@apache.org>.
On Thu, Jun 25, 2009 at 1:12 PM, Jonas Eckerman<jo...@frukt.org> wrote:
>> Already exists, check recent list history for "set_rendered".
>
> I thought that was for text only.

It is only for text.

> In any case, any plugin looking for images, or a PDF, will most likely look
> at MIME type and/or file name, and then use the "decode" method to get the
> data, and AFAICT the "set_rendered" method doesn't have any impact on any of
> that.

Of course.  There are three states for the data in a Message::Node object:
  - raw: whatever the email had originally.  may be encoded, etc.
  - decoded: the raw content, decoded (ie: base64 or
quoted-printable).  may be binary.
  - rendered: the text content.  if it was a text part, it's the same
as decoded.  if it was a html part, the decoded data gets "rendered"
into text.  if it's anything else, the rendered text is blank because
nothing else is supported.
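
The three states can be illustrated with a short Python sketch. This is a
simplified model, not the Message::Node code: the function names are
invented, and the HTML "rendering" here is just tag stripping, which is
far cruder than what SA does.

```python
import base64
import html
import quopri
import re


def decode(raw: bytes, transfer_encoding: str) -> bytes:
    """raw -> decoded: undo the content transfer encoding."""
    encoding = transfer_encoding.lower()
    if encoding == "base64":
        return base64.b64decode(raw)
    if encoding == "quoted-printable":
        return quopri.decodestring(raw)
    return raw  # 7bit, 8bit, binary: nothing to undo


def render(decoded: bytes, mime_type: str) -> str:
    """decoded -> rendered: text for rules and bayes, or blank."""
    if mime_type == "text/plain":
        return decoded.decode("utf-8", errors="replace")
    if mime_type == "text/html":
        text = decoded.decode("utf-8", errors="replace")
        return html.unescape(re.sub(r"<[^>]+>", "", text))
    return ""  # images, PDFs, etc. have no rendered text by default


raw = base64.b64encode(b"<p>Hi &amp; bye</p>")
decoded = decode(raw, "base64")
rendered = render(decoded, "text/html")
```

The blank result for unsupported types is the gap that the plugin calls
and set_rendered are meant to fill.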

The goal with the plugin calls and set_rendered is to allow other
plugins to find parts that they understand how to convert into text,
and set the rendered version of the part to whatever as appropriate.
So if you want to do OCR on image/*, you can do that.  If you want to
convert PDF/DOC/whatever to text, you can do that.

I would comment that plugins should probably skip parts they want to
render that already have rendered text available.

Rules, Bayes, etc, then take all the rendered parts and use them.
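
A minimal sketch of that "skip what is already rendered" rule, in Python
and with a made-up Node class, could look like this. The method name
set_rendered is borrowed from the thread, but its signature here is an
assumption and is not taken from the SA source.

```python
class Node:
    """Toy stand-in for a message part with optional rendered text."""

    def __init__(self, mime_type, decoded, rendered=""):
        self.mime_type = mime_type
        self.decoded = decoded
        self.rendered = rendered

    def set_rendered(self, text):
        self.rendered = text


def render_missing(nodes, converters):
    """Render only parts that have no rendered text yet.

    converters maps mime_type -> function(decoded bytes) -> text.
    Returns the nodes that were actually converted.
    """
    converted = []
    for node in nodes:
        if node.rendered:
            continue  # another plugin (or the parser) already rendered it
        convert = converters.get(node.mime_type)
        if convert is None:
            continue  # no converter for this type
        node.set_rendered(convert(node.decoded))
        converted.append(node)
    return converted
```

Skipping on existing rendered text means the first plugin to handle a
part wins, so plugin order still matters when two plugins understand the
same type.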

> I can't see how "set_rendered" would help in creating a functioning chain
> where one converter could put an arbitrary extracted object (image, pdf,
> whatever) where another converter could have a go at it.

Well, you wouldn't do that because there's no point. ;)   (feel free
to disagree with me though)
If a plugin wants to get image/* parts and do something with the
contents, they can do that already.
If a plugin wants to get application/octet-stream w/ filename "*.pdf"
and do something with the contents, they can do that already.

If you want to have a plugin do some work on a part's contents, then
store that result and let another plugin pick up and continue doing
other work ...  There's no official method to do that.  You can store
data as part of the Node object.  You could potentially also write a
tempfile, though you'll want to be careful to clean up the tempfile as
necessary.

But what would be a use case for that?  I guess something like
converting a PDF to a TIFF, then OCR the TIFF?
I'd probably implement that as a single plugin w/ "ocr" as a function
that gets called from both the PDF and TIFF handlers.
Arguably, there could be multiple people developing plugins for
different types, but you'd need some coordination for the
register_method_priority calls to figure out who goes in what order.
(btw: I just found the register_method_priority() method. \o/)
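
The single-plugin layout suggested above might be sketched like this in
Python. Both ocr and pdf_to_tiff are stand-ins that do no real OCR or PDF
rendering (the "image" bytes are simply treated as ASCII text), and every
name is invented for the example.

```python
def ocr(image: bytes) -> str:
    """Stand-in for a real OCR engine: the 'image' is just ASCII text."""
    return image.decode("ascii", errors="replace")


def pdf_to_tiff(pdf: bytes) -> bytes:
    """Stand-in for a real PDF rasteriser: strips a fake b'PDF:' marker."""
    return pdf[4:] if pdf.startswith(b"PDF:") else pdf


def handle_tiff(data: bytes) -> str:
    return ocr(data)


def handle_pdf(data: bytes) -> str:
    # Convert first, then reuse the same ocr() the TIFF handler uses.
    return ocr(pdf_to_tiff(data))


HANDLERS = {"image/tiff": handle_tiff, "application/pdf": handle_pdf}


def render_part(mime_type: str, data: bytes) -> str:
    """Dispatch on MIME type; unknown types render to nothing."""
    handler = HANDLERS.get(mime_type)
    return handler(data) if handler else ""
```

Keeping both handlers in one plugin avoids the ordering question
entirely, at the cost of the plugin having to know every format itself.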

Note: Do not try to add or remove parts in the tree.  The tree is
meant to represent the mime structure of the mail, and each node
relates to that specific mime part.  The tree is not meant to be a
temporary data storage mechanism.


Hope this helps.

Re: Plugin extracting text from docs

Posted by Jonas Eckerman <jo...@frukt.org>.
Theo Van Dinter wrote:

>> I am not sure but I think something alike was done. What I mean is to have
>> generic chain of format converters, where at the end would be plain image
>> or even text, that could be processed by classic rules like bayes,
>> replacetags etc.

> Already exists, check recent list history for "set_rendered".
> :)

I thought that was for text only.

In any case, any plugin looking for images, or a PDF, will most likely 
look at MIME type and/or file name, and then use the "decode" method to 
get the data, and AFAICT the "set_rendered" method doesn't have any 
impact on any of that.

I can't see how "set_rendered" would help in creating a functioning 
chain where one converter could put an arbitrary extracted object 
(image, pdf, whatever) where another converter could have a go at it.

Since the "set_rendered" method seems largely undocumented I could of 
course be wrong here. In that case I hope to be verbosely corrected. :-)

/Jonas

Re: Plugin extracting text from docs

Posted by Theo Van Dinter <fe...@apache.org>.
On Thu, Jun 25, 2009 at 11:48 AM, Matus UHLAR -
fantomas<uh...@fantomas.sk> wrote:
> I am not sure but I think something alike was done. What I mean is to have
> generic chain of format converters, where at the end would be plain image
> or even text, that could be processed by classic rules like bayes,
> replacetags etc.

Already exists, check recent list history for "set_rendered".
:)

Re: Plugin extracting text from docs

Posted by Matus UHLAR - fantomas <uh...@fantomas.sk>.
> Matus UHLAR - fantomas wrote:
>
>>> I'm currently working on a modular plugin for extracting text and add 
>>> it  to SA message parts.
>>
>> if possible, extract images too, so the fuzzyocr and similar plugins would
>> be able to look at that too.
>
> You mean extract images and add them as parts to the message?
>
> I guess that should be doable. I know that "unrtf" can extract images  
> from RTF files. I'll probably implement support for this, but I'll  
> probably not implement actually doing it right away.
>
>> IIRC spammers did even put PDF's to .doc files to make the stuff harder, but
>> if you manage the above, it shouldn't be hard to extract PDF's too :)

On 25.06.09 14:44, Jonas Eckerman wrote:
> This I don't understand. Do they put PDFs inside .doc files as if the  
> .doc was an archive?

I am not sure but I think something alike was done. What I mean is to have
generic chain of format converters, where at the end would be plain image
or even text, that could be processed by classic rules like bayes,
replacetags etc.

-- 
Matus UHLAR - fantomas, uhlar@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
They that can give up essential liberty to obtain a little temporary
safety deserve neither liberty nor safety. -- Benjamin Franklin, 1759

Re: Plugin extracting text from docs

Posted by Jonas Eckerman <jo...@frukt.org>.
Jonas Eckerman wrote:

> You mean extract images and add them as parts to the message?
> 
> I guess that should be doable. I know that "unrtf" can extract images 
> from RTF files. I'll probably implement support for this, but I'll 
> probably not implement actually doing it right away.

This'll probably have to wait. Browsing the POD and source of 
Mail::SpamAssassin::Message::Node and Mail::SpamAssassin::Message I 
found no obvious way of adding new parts to a message node. Especially 
if the node is a leaf node (I'm guessing that singlepart messages only 
have a leaf node).

Regards
/Jonas