Posted to users@spamassassin.apache.org by Jonas Eckerman <jo...@frukt.org> on 2009/06/29 20:06:54 UTC

ExtractText plugin

Hello!

For anyone who likes to test stuff, I've uploaded my plugin that 
extracts text from documents to
<http://whatever.frukt.org/graphdefang/ExtractText.zip>

I started writing it last week, so it hasn't been heavily tested yet,
but it has been running here over the weekend with no showstopping
problems.

What it does is use external tools and simple (interface-wise) extractor
plugins to extract text from message parts. The extractors are chosen
by MIME type, file name and optionally content magic. The extracted text
is seen by bayes and SA rules. It is entirely possible to create an
OCR extractor, but I haven't done so, and I currently don't plan on
doing it.
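
A minimal sketch of how selection by MIME type, file name and optional
content magic could work. This is Python for illustration only; the
plugin itself is a SpamAssassin plugin and its real code is not shown
here. The registry layout, the function name choose_extractors and the
rule "MIME type or file name must match, then magic must match if one is
given" are all assumptions, not the plugin's actual behaviour.

```python
import fnmatch

# Hypothetical registry. The entry names, MIME types and patterns are
# illustrative and are not taken from the plugin's configuration.
EXTRACTORS = [
    {"name": "pdf", "types": {"application/pdf"},
     "names": ["*.pdf"], "magic": b"%PDF-"},
    {"name": "rtf", "types": {"application/rtf", "text/rtf"},
     "names": ["*.rtf"], "magic": b"{\\rtf"},
    {"name": "msword", "types": {"application/msword"},
     "names": ["*.doc"], "magic": None},  # no magic check configured
]


def choose_extractors(mime_type, filename, data):
    """Return the names of the extractors that accept this message part."""
    chosen = []
    for ext in EXTRACTORS:
        type_ok = mime_type.lower() in ext["types"]
        name_ok = any(fnmatch.fnmatch(filename.lower(), pattern)
                      for pattern in ext["names"])
        # Assumption: either the MIME type or the file name has to match.
        if not (type_ok or name_ok):
            continue
        # Optional content magic: when set, the data must start with it.
        if ext["magic"] is not None and not data.startswith(ext["magic"]):
            continue
        chosen.append(ext["name"])
    return chosen
```

With this sketch, a part sent as application/octet-stream but named
"report.PDF" is still handed to the "pdf" extractor, while a part
labelled application/pdf whose content does not start with "%PDF-" is
skipped.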

The plugin currently comes with a *very* rudimentary OpenXML (recent MS 
Word) extractor, and a configuration using external tools "antiword", 
"unrtf", "odt2txt" and "pdftohtml" to extract text from MS Word, RTF, 
OpenDocument (OpenOffice/StarOffice) and PDF files.
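
As a rough illustration of what "using external tools" can mean in
practice, the sketch below writes a message part to a temporary file,
runs a command on it and collects whatever the command prints. It is
Python, not the plugin's own code, and the helper name
extract_with_tool, the timeout and the stand-in command are all
hypothetical. The command lines the plugin's configuration really uses
for antiword, unrtf, odt2txt and pdftohtml are not shown here.

```python
import os
import subprocess
import sys
import tempfile


def extract_with_tool(command, data, suffix="", timeout=10):
    """Run command + [temp file path]; return its stdout as text, or None."""
    fd, path = tempfile.mkstemp(suffix=suffix)
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(data)
        result = subprocess.run(
            command + [path],
            stdout=subprocess.PIPE,
            stderr=subprocess.DEVNULL,
            timeout=timeout,   # do not let a stuck tool block the scan
            check=True,        # a non-zero exit status counts as failure
        )
    except (OSError, subprocess.SubprocessError):
        # Missing tool, timeout or failed run: no text for this part.
        return None
    finally:
        os.unlink(path)
    return result.stdout.decode("utf-8", errors="replace")


# Stand-in for a real converter so the sketch runs anywhere: it prints
# the file content in upper case.
STAND_IN = [sys.executable, "-c",
            "import sys; sys.stdout.write(open(sys.argv[1]).read().upper())"]
```

Calling extract_with_tool(STAND_IN, b"hello") returns the stand-in's
output, and a command that is not installed simply yields None instead
of raising an error.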

It is also possible for an extractor plugin to return several binary
objects as well as text. These objects will also be processed by all
extractors, so an extractor for a container type of file can return (as
an example) a bunch of images, which are then processed by an OCR
extractor. I have not implemented any extractor that does this, so it's
completely untested.

Stuff I already know is missing:

* A safe-guarding maximum depth of processing.

* A way for extractor plugins to get config lines.
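
To illustrate why a maximum depth matters once extractors can return
objects that are themselves fed back to all extractors, here is a small
Python sketch (not the plugin's code). The function extract_all, the
MAX_DEPTH value, the (text, children) return convention and the toy
"BOX:"/"TXT:" formats are all hypothetical and exist only to make the
example runnable.

```python
MAX_DEPTH = 3  # hypothetical limit; the plugin does not have one yet


def extract_all(obj, extractors, depth=0, max_depth=MAX_DEPTH):
    """Collect text from obj and from any objects extractors pull out of it."""
    if depth > max_depth:
        # Guard: stop descending into deeply nested containers.
        return []
    texts = []
    for extractor in extractors:
        result = extractor(obj)
        if result is None:      # this extractor does not handle the object
            continue
        text, children = result
        if text:
            texts.append(text)
        for child in children:  # returned objects go through all extractors
            texts.extend(extract_all(child, extractors, depth + 1, max_depth))
    return texts


def unpack_box(obj):
    """Toy container format: b'BOX:' followed by parts separated by b'|'."""
    if not obj.startswith(b"BOX:"):
        return None
    return "", obj[4:].split(b"|")


def read_txt(obj):
    """Toy text format: b'TXT:' followed by ASCII text."""
    if not obj.startswith(b"TXT:"):
        return None
    return obj[4:].decode("ascii"), []
```

Here extract_all(b"BOX:TXT:a|TXT:b", [unpack_box, read_txt]) yields the
text of both inner parts, while a container nested more deeply than
max_depth contributes nothing instead of being unpacked indefinitely.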

Test it if you feel like it.

Regards
/Jonas
-- 
Jonas Eckerman
Fruktträdet & Förbundet Sveriges Dövblinda
http://www.fsdb.org/
http://www.frukt.org/
http://whatever.frukt.org/

Re: ExtractText plugin

Posted by Jonas Eckerman <jo...@frukt.org>.
Jonas Eckerman wrote:

> For anyone who likes to test stuff, I've uploaded my plugin that 
> extracts text from documents to
> <http://whatever.frukt.org/graphdefang/ExtractText.zip>

In case any of you have problems downloading the file, it's now mirrored as
<http://mmm.truls.org/m/ExtractText.zip>

And please tell me about any problems you run into.

Regards
/Jonas
