Posted to users@spamassassin.apache.org by Jason Haar <Ja...@trimble.co.nz> on 2009/06/19 03:04:40 UTC

new spam using large images

Hi there, just an FYI

I just received this: http://pastebin.com/m54006b68

420K in size - a standard configuration of SA wouldn't even have scanned
this message. Also the inline image is too large for FuzzyOCR to trigger
- I would guess FuzzyOCR has the (screen) size limit as a mechanism to
reduce FPs. Anyway, if you increase focr_max_height/focr_max_width then
FuzzyOCR grabs the text out just fine - and it looks like your standard
"you're a w...r!" scam.
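
(For anyone wanting to try the same thing: the FuzzyOCR settings in
question look roughly like this in FuzzyOCR.cf - the values here are
illustrative only, and raising them means more CPU spent per image.)

```
# Illustrative values - raise the image size limits so oversized
# spam images are still OCRed (check FuzzyOCR.cf for the defaults)
focr_max_width   1000
focr_max_height  1800
```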

This is the sort of thing that always worried me. As spammers don't care
about the load their apps put on stolen PCs, they can simply increase the
size of their email formats until antispam tools start to break.

Speaking of image/rtf/word attachment spam; is there any work going on
to standardize this so that the textual output of such attachments could
be fed back into SA?

-- 
Cheers

Jason Haar
Information Security Manager, Trimble Navigation Ltd.
Phone: +64 3 9635 377 Fax: +64 3 9635 417
PGP Fingerprint: 7A2E 0407 C9A6 CAF6 2B9F 8422 C063 5EBB FE1D 66D1


Re: new spam using large images

Posted by Theo Van Dinter <fe...@apache.org>.
On Fri, Jun 19, 2009 at 4:42 PM, Charles Gregory<cg...@hwcn.org> wrote:
> Hmmmm. Big question for developers: Does the performance 'burden' of a large
> e-mail come from the 'reading' of that mail into spamassassin and initial
> processing? Or is the 'cost' of a large message only 'paid' when SA attempts
> to run 'rawbody' or 'full' rules against the entire message?

There is very little load for reading in a message.  It's all about
the running of rules.  Some rules "cost" more than others - "full" rules,
for example.
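
(As a rough illustration - the rule names here are made up - the rule
types differ like this in a .cf file; "full" rules run a regexp across
the entire pristine message, so their cost grows with message size:)

```
# body: matches against the rendered text parts only - relatively cheap
body     LOCAL_WHINE_TEXT   /you're a w\w+r/i
# rawbody: matches against the raw (undecoded) body text of each part
rawbody  LOCAL_RAW_MARKER   /suspicious-marker/
# full: matches against the complete pristine message, headers included -
# the expensive one on a 420K mail
full     LOCAL_BIG_B64      /^Content-Transfer-Encoding: base64$/m
score    LOCAL_WHINE_TEXT   0.1
```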

Re: new spam using large images

Posted by Charles Gregory <cg...@hwcn.org>.
On Fri, 19 Jun 2009, Jason Haar wrote:
> Hi there, just an FYI
> I just received this: http://pastebin.com/m54006b68
> 420K in size...

Hmmmm. Big question for developers: Does the performance 'burden' of a 
large e-mail come from the 'reading' of that mail into spamassassin and 
initial processing? Or is the 'cost' of a large message only 'paid' when 
SA attempts to run 'rawbody' or 'full' rules against the entire message?

I am *hoping* it is the latter, and that a parameter value can be coded 
within the spamassassin config (or as a command line option) that will 
amount to 'ignore attachments larger than...', while still allowing the 
headers and any text body parts to be scanned. In particular, given the 
success of RBL's, it seems reasonable to have a way to process the headers 
from *all* messages, as long as loading the oversize message does not (for 
example) tie up memory merely by loading the message into spamassassin....

Yes, I already use RBL's at the MTA level for the ones I trust to be a 
poison pill. But I often still see spam hit multiple 'lower trust' RBL's 
in spamassassin, adding up to a rejection score. So it's worth figuring out 
way to check larger mails if that is what spammers are going to do.

If the cost has more to do with SA reading the mail at the beginning, then 
perhaps we could figure a 'subfunction' of spamassassin that would accept 
a command line option to only read the headers (all lines up to the first 
blank line) and then return a score as a result code? Obviously it could 
not modify the message in that case, but if the spammers are going to 
just make their spew over-sized, then it's something that may be 
needed.... and it would at least help with the rejection of mails that
surpass the 'auto reject' threshold.
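
(Note that part of this already exists at the client end: spamc's -s
option sets the maximum message size it will hand to spamd, with larger
mail passed through unscanned rather than header-scanned. A usage
sketch, size in bytes:)

```
# mail larger than the -s limit is passed through without scanning
spamc -s 1024000 < message.eml > scanned.eml
```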

Thoughts?

- Charles



Re: Plugin extracting text from docs

Posted by Jonas Eckerman <jo...@frukt.org>.
Matus UHLAR - fantomas wrote:

>> This I don't understand. Do they put PDFs inside .doc files as if the  
>> .doc was an archive?
> 
> I am not sure, but I think something like that was done.

Considering that an OpenXML format is basically a zip file with XML 
files inside, and that the actual document can contain hyperlinks, I guess 
it could be possible to do something like that. I don't know enough about 
the format to say, though.

> What I mean is to have
> generic chain of format converters, where at the end would be plain image
> or even text, that could be processed by classic rules like bayes,
> replacetags etc.

If I manage to figure out how to add new parts to a message from within 
the "post_message_parse" method, that should work just fine.

An extractor plugin can return a list of parts to be added to the 
message, and my module will keep looping through the message parts if 
new parts are added.

So, if a Word extractor extracts a PDF and returns it, the PDF would be 
added to a new part, and in the next loop the PDF part will be sent to a 
PDF extractor if that exists. And so on. I'm running 
"post_message_parse" at priority -1 so any added image parts should be 
available to plugins like FuzzyOCR as well as plugins running 
"post_message_parse" at default priority.
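
The looping scheme described above is essentially a work queue over
message parts. A minimal sketch of the idea (in Python rather than the
plugin's Perl, with hypothetical toy extractors keyed by content type):

```python
# Work queue over message parts: extractors may return new parts (e.g. a
# PDF found inside a .doc), which are queued for further extraction
# until nothing new appears.
def extract_all(initial_parts, extractors):
    queue = list(initial_parts)      # (content_type, payload) pairs
    texts = []
    while queue:
        ctype, payload = queue.pop(0)
        extractor = extractors.get(ctype)
        if extractor is None:
            continue                 # nothing knows this type
        text, new_parts = extractor(payload)
        if text:
            texts.append(text)
        queue.extend(new_parts)      # chain: Word -> PDF -> image ...
    return texts

# toy extractors: a "doc" yielding text plus an embedded "pdf"
extractors = {
    "application/msword": lambda p: ("doc text", [("application/pdf", b"%PDF")]),
    "application/pdf":    lambda p: ("pdf text", []),
}
print(extract_all([("application/msword", b"...")], extractors))
# -> ['doc text', 'pdf text']
```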

The missing parts are:

1: How do I add a new part to a parsed message (including a single-part 
one)? This is of course the main problem.

2: The actual extractor plugin that extracts whatever files are included 
in the word document. Antiword only extracts text, and my extractor for 
OpenXML is little more than an extremely basic XML remover.
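
(For what it's worth, since OpenXML is a zip with the text in w:t
elements of word/document.xml, the "basic XML remover" approach can be
sketched with nothing but stdlib tools - Python here for illustration,
the plugin itself being Perl:)

```python
import io
import re
import zipfile

def openxml_to_text(data: bytes) -> str:
    """Crude OpenXML extraction: unzip, read word/document.xml,
    strip the tags, collapse whitespace."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        xml = zf.read("word/document.xml").decode("utf-8")
    return re.sub(r"\s+", " ", re.sub(r"<[^>]+>", " ", xml)).strip()

# build a tiny fake OpenXML document in memory to demonstrate
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("word/document.xml",
                "<w:document><w:p><w:t>hello</w:t><w:t>world</w:t></w:p></w:document>")
print(openxml_to_text(buf.getvalue()))   # -> hello world
```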

Regards
/Jonas
-- 
Jonas Eckerman
Fruktträdet & Förbundet Sveriges Dövblinda
http://www.fsdb.org/
http://www.frukt.org/
http://whatever.frukt.org/

Re: Plugin extracting text from docs

Posted by David B Funk <db...@engineering.uiowa.edu>.
On Fri, 26 Jun 2009, Jonas Eckerman wrote:

> Theo Van Dinter wrote:
>
> > the convolution is a
> > fingerprint that you could write a rule for and then you don't care
> > what the content actually is.  For example, you'd render something
> > like "doc_pdf_jpg", which would make an obvious Bayes token.  In the
> > same way for a zip file, you could do "zip_pdf zip_jpg zip_txt", etc,
> > and they'd all be different tokens.
>
> That's really a good idea. Put the chains of extraction in a
> pseudoheader that can be tested in rules and seen as a token by bayes.
>
> I'm putting that in the todo for the plugin.

It would be a bit cumbersome, but you could create a "pre-filter"
program/milter which would parse attachments & MIME structures, create
special pseudoheaders with the analysis results in them, insert them
into the message, and then pass it on to SA. The full power of SA would
then be available to attack the exposed info in any way you wanted, and
it wouldn't require any mods to SA.
If you were worried about information leakage you could create a
post-filter that would remove the pseudoheaders.
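
If the pre-filter inserted, say, an X-Attachment-Chain pseudoheader
(the header name and rule below are hypothetical), ordinary header
rules could then attack it:

```
# hypothetical pseudoheader added by the pre-filter, e.g.:
#   X-Attachment-Chain: doc_pdf_jpg
header   LOCAL_DOC_PDF_JPG  X-Attachment-Chain =~ /\bdoc_pdf_jpg\b/
describe LOCAL_DOC_PDF_JPG  Word doc containing a PDF containing a JPEG
score    LOCAL_DOC_PDF_JPG  2.0
```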


-- 
Dave Funk                                  University of Iowa
<dbfunk (at) engineering.uiowa.edu>        College of Engineering
319/335-5751   FAX: 319/384-0549           1256 Seamans Center
Sys_admin/Postmaster/cell_admin            Iowa City, IA 52242-1527
#include <std_disclaimer.h>
Better is not better, 'standard' is better. B{

Re: Plugin extracting text from docs

Posted by Jonas Eckerman <jo...@frukt.org>.
Theo Van Dinter wrote:

> the convolution is a
> fingerprint that you could write a rule for and then you don't care
> what the content actually is.  For example, you'd render something
> like "doc_pdf_jpg", which would make an obvious Bayes token.  In the
> same way for a zip file, you could do "zip_pdf zip_jpg zip_txt", etc,
> and they'd all be different tokens.

That's really a good idea. Put the chains of extraction in a 
pseudoheader that can be tested in rules and seen as a token by bayes.

I'm putting that in the todo for the plugin.

>> The most common thing to extract apart from text will most likely be images.
>> Any OCR text extractor tied into my plugin would get to see those images,
>> but any OCR SA plugins run after my plugin won't. It might be good to make
>> extracted images available to those, and other image handling plugins.

> But yours already ran, so who cares about the others?

Because they work very differently?

An OCR plugin that adds the rendered text to the message for bayes and 
text rules is very different from one that does its own scoring based 
on the OCRed text.

> If you're expending the resources to OCR the same image in an email
> multiple times ...  You clearly either have a lot of hardware or not a
> lot of mail.

*I* don't use any OCR at all. We don't have the resources for that 
(being a small non-profit NGO), and so far I haven't seen any need for 
OCR either, since we never had much image spam slip through anyway.

So I will not implement an OCR extractor for my plugin. I'll leave that 
for others. This is actually one of the reasons I'd like to let existing 
OCR plugins have access to any images extracted by my plugin, so that 
those who already do use OCR can benefit from the extraction.

I'm not going to spend much time on it though. I'm happy just extracting 
text. :-) And it does extract text (currently from Word, OpenXML, 
OpenDocument and RTF documents). :-)

I actually hadn't even thought about this image/OCR etc stuff before 
Matus suggested it.

Regards
/Jonas
-- 
Jonas Eckerman
Fruktträdet & Förbundet Sveriges Dövblinda
http://www.fsdb.org/
http://www.frukt.org/
http://whatever.frukt.org/

Re: Plugin extracting text from docs

Posted by Theo Van Dinter <fe...@apache.org>.
On Thu, Jun 25, 2009 at 3:41 PM, Jonas Eckerman<jo...@frukt.org> wrote:
> Matus's example was a Word document that contained a PDF (which might in turn
> contain an image). A plugin that knows how to read Word documents could
> extract the text of the Word document and then use "set_rendered" to make
> that available to SA. It cannot currently extract the PDF and make it
> available to any plugins that know how to read PDFs though.

My view would be that if someone is going to try making things so
convoluted such as that, a) we've won because no one is going to go
through the trouble of opening that doc, b) the convolution is a
fingerprint that you could write a rule for and then you don't care
what the content actually is.  For example, you'd render something
like "doc_pdf_jpg", which would make an obvious Bayes token.  In the
same way for a zip file, you could do "zip_pdf zip_jpg zip_txt", etc,
and they'd all be different tokens.

But yes, you're right, the Message/Message::Node stuff wasn't designed
with the idea of supporting multiple independent data objects from a
single mime part.  I can see the argument for "treat embedded files
similar to multipart", but I still lean towards mime structure only.

> For some stuff coordination would be needed, yes. But not for what I'm
> thinking of.

Why not?  If you have no coordination, you would possibly look for
images first, then pdfs, then word docs, and end up not getting
anywhere.  If it's all your plugin, you can configure the order.  If
it's not, you need coordination.  For example, as from above, if
there's zip file with a doc which has a pdf which has a jpg, and your
plugin doesn't handle zip but another one does ...

> The most common thing to extract apart from text will most likely be images.
> Any OCR text extractor tied into my plugin would get to see those images,
> but any OCR SA plugins run after my plugin won't. It might be good to make
> extracted images available to those, and other image handling plugins.

But yours already ran, so who cares about the others?

Seriously.

If you're expending the resources to OCR the same image in an email
multiple times ...  You clearly either have a lot of hardware or not a
lot of mail.

Re: Plugin extracting text from docs

Posted by Jonas Eckerman <jo...@frukt.org>.
Theo Van Dinter wrote:

> I would comment that plugins should probably skip parts they want to
> render that already have rendered text available.

Ah. That's a good idea. Now I'll have to search for a nice way to check 
that. :-)

>> I can't see how "set_rendered" would help in creating a functioning chain
>> where one converter could put an arbitrary extracted object (image, pdf,
>> whatever) where another converter could have a go at it.

> If a plugin wants to get image/* parts and do something with the
> contents, they can do that already.

Not if the image/* parts are actually inside a document.

> If you want to have a plugin do some work on a part's contents, then
> store that result and let another plugin pick up and continue doing
> other work ...  There's no official method to do that.

I guessed as much. This however is what Matus and I were talking about.

> You can store
> data as part of the Node object.

> But what would be a use case for that?

Matus's example was a Word document that contained a PDF (which might in 
turn contain an image). A plugin that knows how to read Word documents 
could extract the text of the Word document and then use "set_rendered" 
to make that available to SA. It cannot currently extract the PDF and 
make it available to any plugins that know how to read PDFs though.

Matus's idea about chains is that, in this example, the plugin reading 
the Word document would store any other objects somehow - in this case a 
PDF. After that, any plugin that knows how to handle PDFs would get to 
look at the PDF and extract text and other stuff from it. If it extracts 
an image, it would store that the same way, and any image handling 
plugins would find it.

I really don't know how common that is. I have never seen a Word 
document with a PDF inside it myself.

I have however seen many documents that contain images, and I think it 
would be a good idea to make those images available to things like 
FuzzyOCR and ImageInfo.

> Arguably, there could be multiple people developing plugins for
> different types, but you'd need some coordination for the
> register_method_priority calls to figure out who goes in what order.

For some stuff coordination would be needed, yes. But not for what I'm 
thinking of.

The text extraction plugin I'm working on (which started this thread) 
itself has simple extractor plugins. These plugins will be able to return 
arbitrary objects as well as text, and my plugin will check the returned 
objects the same way it checks the original message parts. This way, all 
the extractors that are tied into my plugin will be able to extract 
stuff from objects extracted by other extractors. So far so good.

The most common thing to extract apart from text will most likely be 
images. Any OCR text extractor tied into my plugin would get to see 
those images, but any OCR SA plugins run after my plugin won't. It might 
be good to make extracted images available to those, and other image 
handling plugins.

My plugin is called after the message is parsed, which is very good for a 
text extractor. FuzzyOCR (as an example) however works by scoring OCR 
output (which may well be very different from the text in the image as we 
see it), and therefore has to be called at a later stage. The same goes 
for ImageInfo.

It might therefore be a good idea to make the extracted images and other 
objects available to scoring plugins as well.

> (btw: I just found the register_method_priority() method. \o/)

It's nice, isn't it? :-)

I'm using it in my URLRedirect plugin.

> Note: Do not try to add or remove parts in the tree.  The tree is
> meant to represent the mime structure of the mail, and each node
> relates to that specific mime part.  The tree is not meant to be a
> temporary data storage mechanism.

Ok. That makes things both easier and harder for me. I know that I'll have 
to implement my own list of stuff to loop through when extractors return 
additional parts in my plugin. That's the easy part.

The difficult part is how to make extracted stuff available to other 
plugins in a way they understand. I see two main ways to do this:

1: Invent a new way. This would require modifications of any plugins 
that should check the extracted objects.

2: Add a container part somewhere that "find_parts" would find, but which 
is not actually a member of the message tree, and then add a simple way 
to add parts to that container. This would require modification of 
Mail::SpamAssassin::Message, but not of the plugins.

Regards
/Jonas
-- 
Jonas Eckerman
Fruktträdet & Förbundet Sveriges Dövblinda
http://www.fsdb.org/
http://www.frukt.org/
http://whatever.frukt.org/

Re: Plugin extracting text from docs

Posted by Theo Van Dinter <fe...@apache.org>.
On Thu, Jun 25, 2009 at 1:12 PM, Jonas Eckerman<jo...@frukt.org> wrote:
>> Already exists, check recent list history for "set_rendered".
>
> I thought that was for text only.

It is only for text.

> In any case, any plugin looking for images, or a PDF, will most likely look
> at MIME type and/or file name, and then use the "decode" method to get the
> data, and AFAICT the "set_rendered" method doesn't have any impact on any of
> that.

Of course.  There are three states for the data in a Message::Node object:
  - raw: whatever the email had originally.  may be encoded, etc.
  - decoded: the raw content, decoded (ie: base64 or
quoted-printable).  may be binary.
  - rendered: the text content.  if it was a text part, it's the same
as decoded.  if it was a html part, the decoded data gets "rendered"
into text.  if it's anything else, the rendered text is blank because
nothing else is supported.
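
(The raw/decoded distinction is the same one every MIME library makes;
a quick sketch with Python's stdlib email module - SA itself being Perl -
of what ->decode() corresponds to:)

```python
from email import message_from_string

raw = """\
MIME-Version: 1.0
Content-Type: text/plain
Content-Transfer-Encoding: base64

aGVsbG8gd29ybGQ=
"""

msg = message_from_string(raw)
print(msg.get_payload())              # raw: the base64 text as transmitted
print(msg.get_payload(decode=True))   # decoded: b'hello world'
```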

The goal with the plugin calls and set_rendered is to allow other
plugins to find parts that they understand how to convert into text,
and set the rendered version of the part to whatever as appropriate.
So if you want to do OCR on image/*, you can do that.  If you want to
convert PDF/DOC/whatever to text, you can do that.

I would comment that plugins should probably skip parts they want to
render that already have rendered text available.

Rules, Bayes, etc, then take all the rendered parts and use them.

> I can't see how "set_rendered" would help in creating a functioning chain
> where one converter could put an arbitrary extracted object (image, pdf,
> whatever) where another converter could have a go at it.

Well, you wouldn't do that because there's no point. ;)   (feel free
to disagree with me though)
If a plugin wants to get image/* parts and do something with the
contents, they can do that already.
If a plugin wants to get application/octet-stream w/ filename "*.pdf"
and do something with the contents, they can do that already.

If you want to have a plugin do some work on a part's contents, then
store that result and let another plugin pick up and continue doing
other work ...  There's no official method to do that.  You can store
data as part of the Node object.  You could potentially also write a
tempfile, though you'll want to be careful to clean up the tempfile as
necessary.

But what would be a use case for that?  I guess something like
converting a PDF to a TIFF, then OCR the TIFF?
I'd probably implement that as a single plugin w/ "ocr" as a function
that gets called from both the PDF and TIFF handlers.
Arguably, there could be multiple people developing plugins for
different types, but you'd need some coordination for the
register_method_priority calls to figure out who goes in what order.
(btw: I just found the register_method_priority() method. \o/)

Note: Do not try to add or remove parts in the tree.  The tree is
meant to represent the mime structure of the mail, and each node
relates to that specific mime part.  The tree is not meant to be a
temporary data storage mechanism.


Hope this helps.

Re: Plugin extracting text from docs

Posted by Jonas Eckerman <jo...@frukt.org>.
Theo Van Dinter wrote:

>> I am not sure, but I think something like that was done. What I mean is to have
>> generic chain of format converters, where at the end would be plain image
>> or even text, that could be processed by classic rules like bayes,
>> replacetags etc.

> Already exists, check recent list history for "set_rendered".
> :)

I thought that was for text only.

In any case, any plugin looking for images, or a PDF, will most likely 
look at MIME type and/or file name, and then use the "decode" method to 
get the data, and AFAICT the "set_rendered" method doesn't have any 
impact on any of that.

I can't see how "set_rendered" would help in creating a functioning 
chain where one converter could put an arbitrary extracted object 
(image, pdf, whatever) where another converter could have a go at it.

Since the "set_rendered" method is largely undocumented, I could of 
course be wrong here. In that case I hope to be verbosely corrected. :-)

/Jonas
-- 
Jonas Eckerman
Fruktträdet & Förbundet Sveriges Dövblinda
http://www.fsdb.org/
http://www.frukt.org/
http://whatever.frukt.org/

Re: Plugin extracting text from docs

Posted by Theo Van Dinter <fe...@apache.org>.
On Thu, Jun 25, 2009 at 11:48 AM, Matus UHLAR -
fantomas<uh...@fantomas.sk> wrote:
> I am not sure, but I think something like that was done. What I mean is to have
> generic chain of format converters, where at the end would be plain image
> or even text, that could be processed by classic rules like bayes,
> replacetags etc.

Already exists, check recent list history for "set_rendered".
:)

Re: Plugin extracting text from docs

Posted by Matus UHLAR - fantomas <uh...@fantomas.sk>.
> Matus UHLAR - fantomas wrote:
>
>>> I'm currently working on a modular plugin for extracting text and
>>> adding it to SA message parts.
>>
>> if possible, extract images too, so the fuzzyocr and similar plugins would
>> be able to look at that too.
>
> You mean extract images and add them as parts to the message?
>
> I guess that should be doable. I know that "unrtf" can extract images  
> from RTF files. I'll probably implement support for this, but I'll  
> probably not implement actually doing it right away.
>
>> IIRC spammers did even put PDF's to .doc files to make the stuff harder, but
>> if you manage the above, it shouldn't be hard to extract PDF's too :)

On 25.06.09 14:44, Jonas Eckerman wrote:
> This I don't understand. Do they put PDFs inside .doc files as if the
> .doc was an archive?

I am not sure, but I think something like that was done. What I mean is
to have a generic chain of format converters, where at the end would be
a plain image or even text, that could be processed by classic rules
like bayes, replacetags etc.

-- 
Matus UHLAR - fantomas, uhlar@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
They that can give up essential liberty to obtain a little temporary
safety deserve neither liberty nor safety. -- Benjamin Franklin, 1759

Re: Plugin extracting text from docs

Posted by Jonas Eckerman <jo...@frukt.org>.
Jonas Eckerman wrote:

> You mean extract images and add them as parts to the message?
> 
> I guess that should be doable. I know that "unrtf" can extract images 
> from RTF files. I'll probably implement support for this, but I'll 
> probably not implement actually doing it right away.

This'll probably have to wait. Browsing the POD and source of 
Mail::SpamAssassin::Message::Node and Mail::SpamAssassin::Message, I 
found no obvious way of adding new parts to a message node. Especially 
if the node is a leaf node (I'm guessing that single-part messages have 
only a leaf node).

Regards
/Jonas
-- 
Jonas Eckerman
Fruktträdet & Förbundet Sveriges Dövblinda
http://www.fsdb.org/
http://www.frukt.org/
http://whatever.frukt.org/

Re: Plugin extracting text from docs

Posted by Jonas Eckerman <jo...@frukt.org>.
Matus UHLAR - fantomas wrote:

>> I'm currently working on a modular plugin for extracting text and add it  
>> to SA message parts.
> 
> if possible, extract images too, so the fuzzyocr and similar plugins would
> be able to look at that too.

You mean extract images and add them as parts to the message?

I guess that should be doable. I know that "unrtf" can extract images 
from RTF files. I'll probably implement support for this, but I probably 
won't actually do it right away.

> IIRC spammers did even put PDF's to .doc files to make the stuff harder, but
> if you manage the above, it shouldn't be hard to extract PDF's too :)

This I don't understand. Do they put PDFs inside .doc files as if the 
.doc was an archive?

Regards
/Jonas
-- 
Jonas Eckerman
Fruktträdet & Förbundet Sveriges Dövblinda
http://www.fsdb.org/
http://www.frukt.org/
http://whatever.frukt.org/

Re: Plugin extracting text from docs (was: new spam using large images)

Posted by Matus UHLAR - fantomas <uh...@fantomas.sk>.
> Jason Haar wrote:
>
>> Speaking of image/rtf/word attachment spam; is there any work going on
>> to standardize this so that the textual output of such attachments could
>> be fed back into SA?

On 24.06.09 19:33, Jonas Eckerman wrote:
> Just as a note:
>
> I'm currently working on a modular plugin for extracting text and
> adding it to SA message parts.

if possible, extract images too, so the fuzzyocr and similar plugins would
be able to look at that too.

IIRC spammers did even put PDF's to .doc files to make the stuff harder, but
if you manage the above, it shouldn't be hard to extract PDF's too :)

(and then extracting text/images from PDF's too)

> The plugin can use either external tools or its own simple plugin
> modules. How to extract text from parts is configurable, and based on
> mime types and file names, so new formats can be added by simply
> configuring new external tools or creating a new plugin module.
>
> My *far* from finished module currently manages to extract text from  
> Word documents (using antiword), OpenXML text documents (using a simple  
> plugin) and RTF (using unrtf).
>
> I haven't tested where and how the extracted text is available to
> SpamAssassin yet (as noted, it's *far* from finished), but I am using
> the "set_rendered" method as in the example, so it should work. ;-)

great!
-- 
Matus UHLAR - fantomas, uhlar@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
If Barbie is so popular, why do you have to buy her friends? 

Re: Plugin extracting text from docs

Posted by Jonas Eckerman <jo...@frukt.org>.
Benny Pedersen wrote:

>> <http://whatever.frukt.org/graphdefang/ExtractText.zip>).

I've now mirrored the file as
<http://mmm.truls.org/m/ExtractText.zip>

I hope that will work better.

Regards
/Jonas
-- 
Jonas Eckerman
Fruktträdet & Förbundet Sveriges Dövblinda
http://www.fsdb.org/
http://www.frukt.org/
http://whatever.frukt.org/

Re: Plugin extracting text from docs

Posted by Benny Pedersen <me...@junc.org>.
On Wed, July 1, 2009 21:51, Jonas Eckerman wrote:

> <http://whatever.frukt.org/graphdefang/ExtractText.zip>).

I had to use wget --continue to get it downloaded; is this a firewall
limit? It stalls at 8k here, so it took multiple wget tries to get the
full zip down :(

-- 
xpoint


Re: Plugin extracting text from docs

Posted by Jonas Eckerman <jo...@frukt.org>.
Rosenbaum, Larry M. wrote:

> We can use antiword to render text from MSWord files, and unrtf to render text from RTF files.  What is the best tool to render text from PDF files?

I don't know what the best tool is, but I'm currently using pdftohtml in 
XML mode (and then stripping the XML) in my ExtractText plugin.
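
(The stripping step itself is trivial; a hedged Python sketch - the
plugin is Perl - over a sample resembling pdftohtml's XML output:)

```python
import re

# sample resembling pdftohtml -xml output (structure illustrative)
xml = ('<pdf2xml><page number="1">'
       '<text top="10" left="10">Dear</text>'
       '<text top="10" left="60">friend</text>'
       '</page></pdf2xml>')

def strip_xml(xml: str) -> str:
    """Drop the tags and collapse whitespace - the 'stripping the XML' step."""
    return re.sub(r"\s+", " ", re.sub(r"<[^>]+>", " ", xml)).strip()

print(strip_xml(xml))   # -> Dear friend
```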

(For more info about the plugin, see my post with subject "ExtractText 
plugin", or download it from 
<http://whatever.frukt.org/graphdefang/ExtractText.zip>).

Regards
/Jonas
-- 
Jonas Eckerman
Fruktträdet & Förbundet Sveriges Dövblinda
http://www.fsdb.org/
http://www.frukt.org/
http://whatever.frukt.org/

RE: Plugin extracting text from docs (was: new spam using large images)

Posted by Giampaolo Tomassoni <g....@libero.it>.
> We can use antiword to render text from MSWord files, and unrtf to
> render text from RTF files.  What is the best tool to render text from
> PDF files?
> 
> (We are running Solaris 9)

AFAIK, antiword is the best tradeoff between speed and conversion quality.

The best converter I know of, even for batch use, is actually OpenOffice
with its "uno" interface, but it isn't that easy to drive from Perl,
since it uses a kind of Java JNDI to exchange Word files and converted
text with the conversion controller. It also tends to consume a lot of
memory, since current versions keep "growing" core size with each
document you convert (even after you close them...).

Antiword seems more resource conscious in this...

Giampaolo



> 
> L
> 
> > -----Original Message-----
> > From: Jonas Eckerman [mailto:jonas_lists@frukt.org]
> > Sent: Wednesday, June 24, 2009 1:34 PM
> > To: users@spamassassin.apache.org
> > Subject: Plugin extracting text from docs (was: new spam using large
> > images)
> >
> > Jason Haar wrote:
> >
> > > Speaking of image/rtf/word attachment spam; is there any work going
> > on
> > > to standardize this so that the textual output of such attachments
> > could
> > > be fed back into SA?
> >
> > Just as a note:
> >
> > I'm currently working on a modular plugin for extracting text and
> > adding it to SA message parts.
> >
> > The plugin can use either external tools or its own simple plugin
> > modules. How to extract text from parts is configurable, and based on
> > mime types and file names, so new formats can be added by simply
> > configuring new external tools or creating a new plugin module.
> >
> > My *far* from finished module currently manages to extract text from
> > Word documents (using antiword), OpenXML text documents (using a
> simple
> > plugin) and RTF (using unrtf).
> >
> > I haven't tested where and how the extracted text is available to
> > SpamAssassin yet (as noted, it's *far* from finished), but I am using
> > the "set_rendered" method as in the example, so it should work. ;-)
> >
> > Regards
> > /Jonas
> > --
> > Jonas Eckerman
> > Fruktträdet & Förbundet Sveriges Dövblinda
> > http://www.fsdb.org/
> > http://www.frukt.org/
> > http://whatever.frukt.org/


RE: Plugin extracting text from docs (was: new spam using large images)

Posted by "Rosenbaum, Larry M." <ro...@ornl.gov>.
We can use antiword to render text from MSWord files, and unrtf to render text from RTF files.  What is the best tool to render text from PDF files?

(We are running Solaris 9)

L

> -----Original Message-----
> From: Jonas Eckerman [mailto:jonas_lists@frukt.org]
> Sent: Wednesday, June 24, 2009 1:34 PM
> To: users@spamassassin.apache.org
> Subject: Plugin extracting text from docs (was: new spam using large
> images)
> 
> Jason Haar wrote:
> 
> > Speaking of image/rtf/word attachment spam; is there any work going
> on
> > to standardize this so that the textual output of such attachments
> could
> > be fed back into SA?
> 
> Just as a note:
> 
> I'm currently working on a modular plugin for extracting text and
> adding it to SA message parts.
> 
> The plugin can use either external tools or its own simple plugin
> modules. How to extract text from parts is configurable, and based on
> mime types and file names, so new formats can be added by simply
> configuring new external tools or creating a new plugin module.
> 
> My *far* from finished module currently manages to extract text from
> Word documents (using antiword), OpenXML text documents (using a simple
> plugin) and RTF (using unrtf).
> 
> I haven't tested where and how the extracted text is available to
> SpamAssassin yet (as noted, it's *far* from finished), but I am using
> the "set_rendered" method as in the example, so it should work. ;-)
> 
> Regards
> /Jonas
> --
> Jonas Eckerman
> Fruktträdet & Förbundet Sveriges Dövblinda
> http://www.fsdb.org/
> http://www.frukt.org/
> http://whatever.frukt.org/

Plugin extracting text from docs (was: new spam using large images)

Posted by Jonas Eckerman <jo...@frukt.org>.
Jason Haar wrote:

> Speaking of image/rtf/word attachment spam; is there any work going on
> to standardize this so that the textual output of such attachments could
> be fed back into SA?

Just as a note:

I'm currently working on a modular plugin for extracting text and adding 
it to SA message parts.

The plugin can use either external tools or its own simple plugin 
modules. How to extract text from parts is configurable, and based on 
mime types and file names, so new formats can be added by simply 
configuring for new external tools or creating a new plugin module.
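As a language-neutral sketch of that dispatch idea (Python here, though the plugin itself is Perl; the MIME types, tool names, and flags below are illustrative assumptions, not the plugin's actual configuration):

```python
# Hypothetical dispatch table: MIME type -> external converter command.
# Tool names (antiword, unrtf, pdftotext) follow this thread; the exact
# flags would depend on the local installation.
EXTRACTORS = {
    "application/msword": ["antiword", "-"],
    "application/rtf":    ["unrtf", "--text"],
    "application/pdf":    ["pdftotext", "-", "-"],
}

def command_for(content_type):
    """Return the converter command for a MIME type, or None if unhandled."""
    return EXTRACTORS.get(content_type.strip().lower())
```

Adding a new format would then just mean adding a table (or config) entry, which is the point of making the mapping configurable.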

My *far* from finished module currently manages to extract text from 
Word documents (using antiword), OpenXML text documents (using a simple 
plugin) and RTF (using unrtf).

I haven't tested where and how the extracted text is available to 
SpamAssassin yet (as noted, it's *far* from finished), but I am using 
the "set_rendered" method as in the example, so it should work. ;-)

Regards
/Jonas
-- 
Jonas Eckerman
Fruktträdet & Förbundet Sveriges Dövblinda
http://www.fsdb.org/
http://www.frukt.org/
http://whatever.frukt.org/

Re: new spam using large images

Posted by Theo Van Dinter <fe...@apache.org>.
Once you have a part you can use the documented methods in
Message::Node to access data (see "perldoc
Mail::SpamAssassin::Message::Node").  You will probably want
$p->decode() which returns a decoded (base64, quoted-printable) string
of the part contents.
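For anyone following along outside Perl, the decode-then-pipe step Theo describes might look like the following Python sketch (the converter command is whatever tool handles the part's type, and is an assumption here, not part of the SA API):

```python
import subprocess

def run_extractor(cmd, decoded_bytes, timeout=30):
    """Pipe the decoded attachment bytes through an external converter
    (e.g. antiword) and return its stdout as text."""
    proc = subprocess.run(cmd, input=decoded_bytes,
                          stdout=subprocess.PIPE,
                          stderr=subprocess.DEVNULL,
                          timeout=timeout)
    return proc.stdout.decode("utf-8", "replace")
```

In the Perl plugin, the equivalent input would be the string returned by $p->decode().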


On Fri, Jun 19, 2009 at 7:00 PM, Rosenbaum, Larry
M.<ro...@ornl.gov> wrote:
>> From: felicity@kluge.net On Behalf Of Theo Van Dinter
>>
>> On Fri, Jun 19, 2009 at 3:04 AM, Jason Haar<Ja...@trimble.co.nz>
>> wrote:
>> > Speaking of image/rtf/word attachment spam; is there any work going
>> on
>> > to standardize this so that the textual output of such attachments
>> could
>> > be fed back into SA?
>>
>> That functionality already exists (has for almost 3 years, actually),
>> but as in the past (list archives) the documentation hasn't improved
>> for it. :(
>>
>> Here's my last(?) post about it which has some sample code and
>> everything:
>>
>> http://www.nabble.com/Re:-PDFText-Plugin-for-PDF-file-scoring---not-
>> for-PDF-images-p11595641.html
>
> Thanks for the sample code.  Once you get the $p object from $msg->find_parts(), how do you extract the contents of the message part to run it through antiword or whatever?
>
> L
>

RE: new spam using large images

Posted by "Rosenbaum, Larry M." <ro...@ornl.gov>.
> From: felicity@kluge.net On Behalf Of Theo Van Dinter
>
> On Fri, Jun 19, 2009 at 3:04 AM, Jason Haar<Ja...@trimble.co.nz>
> wrote:
> > Speaking of image/rtf/word attachment spam; is there any work going
> on
> > to standardize this so that the textual output of such attachments
> could
> > be fed back into SA?
> 
> That functionality already exists (has for almost 3 years, actually),
> but as in the past (list archives) the documentation hasn't improved
> for it. :(
> 
> Here's my last(?) post about it which has some sample code and
> everything:
> 
> http://www.nabble.com/Re:-PDFText-Plugin-for-PDF-file-scoring---not-
> for-PDF-images-p11595641.html

Thanks for the sample code.  Once you get the $p object from $msg->find_parts(), how do you extract the contents of the message part to run it through antiword or whatever?

L

Re: new spam using large images

Posted by Theo Van Dinter <fe...@apache.org>.
On Fri, Jun 19, 2009 at 3:04 AM, Jason Haar<Ja...@trimble.co.nz> wrote:
> Speaking of image/rtf/word attachment spam; is there any work going on
> to standardize this so that the textual output of such attachments could
> be fed back into SA?

That functionality already exists (has for almost 3 years, actually),
but as in the past (list archives) the documentation hasn't improved
for it. :(

Here's my last(?) post about it which has some sample code and everything:

http://www.nabble.com/Re:-PDFText-Plugin-for-PDF-file-scoring---not-for-PDF-images-p11595641.html

Re: new spam using large images

Posted by LuKreme <kr...@kreme.com>.
On 19 Jun, 2009, at 14:38 , Karsten Bräckelmann wrote:
> On Fri, 2009-06-19 at 13:57 -0600, LuKreme wrote:
>> On 19 Jun, 2009, at 06:12 , Karsten Bräckelmann wrote:
>
>>>> I just received this: http://pastebin.com/m54006b68
>>>>
>>>> 420K in size - standard configuration of SA wouldn't have even  
>>>> run over
>>>> this message. [...]
>>>
>>> SA would have scanned it by default just fine. The default size  
>>> limit
>>> for spamc is 500 KB. No size limit imposed by spamd.
>>
>> A 420K image will result in an email well over 500K in size.
>
> Next time, please do your homework, Lu.
>
> We're talking about a message that's 427,868 Byte large in total.

Oh, I just looked at the numbers in the emails and figured they would  
be right; I did not look at the actual message. If the image was, in  
fact, 420K, it would have to be over 500K once encoded.

> Including the base64 *encoded* image. You probably would have noticed
> that, if you would have bothered to check any detail at all before
> posting.

Yep, had I downloaded the email in question and checked its byte count,  
I would have found that the *encoded* image was 417,920 bytes (408KB),  
not that the actual image was 420KB (430,080). I trusted the OP to post  
the right information. For the record, the decoded image was 304KB  
(309,272 bytes), an inflation rate that would easily have put any true  
420KB image over the 500KB threshold.
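The arithmetic behind that: base64 emits 4 output bytes for every 3 input bytes, plus a CRLF roughly every 76 characters, so a genuine 420KB image encodes to about 560KB of base64 (roughly 575KB with line breaks), well over the 500KB default. A quick sketch:

```python
def base64_encoded_size(raw_bytes, line_length=76):
    """Approximate MIME base64 size: 4 bytes out per 3 in,
    plus a CRLF after every (full or partial) output line."""
    body = (raw_bytes + 2) // 3 * 4
    lines = (body + line_length - 1) // line_length
    return body + 2 * lines

# base64_encoded_size(430080) -> 588532 bytes (~575KB), over 500KB
```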



-- 
Advance and attack! Attack and destroy! Destroy and rejoice!


Re: new spam using large images

Posted by Karsten Bräckelmann <gu...@rudersport.de>.
On Fri, 2009-06-19 at 13:57 -0600, LuKreme wrote:
> On 19 Jun, 2009, at 06:12 , Karsten Bräckelmann wrote:

> >> I just received this: http://pastebin.com/m54006b68
> >>
> >> 420K in size - standard configuration of SA wouldn't have even run over
> >> this message. [...]
> >
> > SA would have scanned it by default just fine. The default size limit
> > for spamc is 500 KB. No size limit imposed by spamd.
> 
> A 420K image will result in an email well over 500K in size.

Next time, please do your homework, Lu.

We're talking about a message that's 427,868 Byte large in total.
Including the base64 *encoded* image. You probably would have noticed
that, if you would have bothered to check any detail at all before
posting.


-- 
char *t="\10pse\0r\0dtu\0.@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1:
(c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}


Re: new spam using large images

Posted by LuKreme <kr...@kreme.com>.
On 19 Jun, 2009, at 06:12 , Karsten Bräckelmann wrote:
> On Fri, 2009-06-19 at 13:04 +1200, Jason Haar wrote:
>> Hi there, just a FYI
>>
>> I just received this: http://pastebin.com/m54006b68
>>
>> 420K in size - standard configuration of SA wouldn't have even run  
>> over
>> this message. [...]
>
> SA would have scanned it by default just fine. The default size limit
> for spamc is 500 KB. No size limit imposed by spamd.

A 420K image will result in an email well over 500K in size.

-- 
Rincewind had always been happy to think of himself as a racist.
	The One Hundred Meters, the Mile, the Marathon -- he'd run them
	all.


Re: new spam using large images

Posted by Karsten Bräckelmann <gu...@rudersport.de>.
On Fri, 2009-06-19 at 13:04 +1200, Jason Haar wrote:
> Hi there, just a FYI
> 
> I just received this: http://pastebin.com/m54006b68
> 
> 420K in size - standard configuration of SA wouldn't have even run over
> this message. [...]

SA would have scanned it by default just fine. The default size limit
for spamc is 500 KB. No size limit imposed by spamd.


-- 
char *t="\10pse\0r\0dtu\0.@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1:
(c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}