Posted to dev@poi.apache.org by Serge Huber <sh...@jahia.com> on 2003/05/20 13:49:03 UTC

Interested pure Word text extraction patch ?

Hi all,

First of all, thanks for all the work that went into POI. I've only started 
working with the code recently, and I must say there's a kind of "magic" to 
finally be reading file formats that seem to have been deliberately 
obfuscated :)

Anyway, I have been working on integration of POI with Lucene, mostly to 
get Word file indexing working well enough to fit my needs. Despite the 
fact that I still have some problems with some "complex" files, the result 
is acceptable for now.

I must admit that my modifications are quite "hacky", and I'm not sure if 
they are fit for a real patch. Should I submit my modifications as they 
are to Bugzilla, or should I host them somewhere else so that 
people can try them out ?

The modifications I've done are :
- deactivate formatting parsing. I didn't need it so I commented out the 
"findFormatting" in the WordDocument class
- small patches here and there to remove exceptions
- modifications to fall back to the main document stream text if parsing 
the piece tables seemed to give nothing (it seems there are a lot of 
problems with some files here, but I don't know the format well 
enough to be sure of what I'm doing). And it seems the binary file format 
document is not telling us everything that is really going on here :(
- modifications in the writeAllText method of the WordDocument
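
As an illustration only, the fall-back described in the list above could be 
sketched roughly as follows. The class and method names are hypothetical and 
are not part of POI or of the patch; the two suppliers stand in for whatever 
code reads the text via the piece tables and via the main document stream.

```java
import java.util.function.Supplier;

/** Hypothetical helper; none of these names exist in POI. */
final class FallbackTextExtraction {

    /**
     * Returns the piece-table text when it looks usable, otherwise the
     * text read straight from the main document stream.
     */
    static String extract(Supplier<String> pieceTableText,
                          Supplier<String> mainStreamText) {
        String text;
        try {
            text = pieceTableText.get();
        } catch (RuntimeException e) {
            // Piece-table parsing failed: treat it like "gave nothing".
            text = null;
        }
        if (text == null || text.trim().isEmpty()) {
            // Assumption: an empty or whitespace-only result means the
            // piece tables could not be trusted for this file.
            return mainStreamText.get();
        }
        return text;
    }
}
```

The main stream is only read when the piece-table result is unusable, so 
files that already parse correctly are not affected by the fall-back.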

The result I got :
- I tested on the 384 Word files I found on my computer
- 1 couldn't be parsed at all because of a signature problem (POIFS problem ?)
- 3 were actually RTF files so they are ignored
- 5 files seemed to have problems with piece tables. If I "Save As..." the 
files to transform them into "simple" files, the text extraction works fine. 
The piece table always seemed to point me to text after the value of 
fib.fcMax. Here I made a patch that reverts to the main document text stream 
in this case
- 4 files had piece tables that covered some of the main document stream 
and some parts outside, which means I only got part of the text in my 
extractions.
- the rest of the files worked very well !
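
To make the two piece-table cases above concrete, here is a minimal sketch 
that classifies a single piece against the main text range. It assumes the 
piece offsets have already been decoded to plain byte positions, and it uses 
a hypothetical fcMin for the start of the main text (the results above only 
mention fib.fcMax). None of these names come from POI or from the patch.

```java
/** Hypothetical helper; not part of POI. */
final class PieceRangeCheck {

    enum Coverage { INSIDE, PARTIAL, OUTSIDE }

    /**
     * Classifies the byte range [start, end) of one piece against the
     * main document text range [fcMin, fcMax).
     */
    static Coverage classify(int start, int end, int fcMin, int fcMax) {
        if (start > end || fcMin > fcMax) {
            throw new IllegalArgumentException("range bounds are reversed");
        }
        if (start >= fcMin && end <= fcMax) {
            return Coverage.INSIDE;   // piece lies entirely in the main text
        }
        if (end <= fcMin || start >= fcMax) {
            return Coverage.OUTSIDE;  // e.g. text located after fcMax
        }
        return Coverage.PARTIAL;      // straddles a boundary of the range
    }
}
```

In these terms, the 5 files handled by the fall-back would have pieces 
classified as OUTSIDE, and the 4 partially extracted files would have a mix 
of INSIDE and PARTIAL or OUTSIDE pieces.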

I'm sorry to say that most of these files are not test cases I could send 
off just like this, as some of the data is personal and/or not for public 
eyes. I also seemed to have problems with the test case files that are 
included in POI, which don't even work in the real MS Word !

Basically, what I have now is a method that looks like this :

         public String HDFExtractor.getHDFContent(File f);

That gives me a String containing all the text of an HDF encoded file. I 
then index this into Lucene to do the text indexing. It doesn't work with 
every Word file I've encountered but it's better than nothing for me.
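
Since extraction does not succeed for every file, an indexing loop probably 
wants to skip failures rather than abort. The following is a rough sketch 
under that assumption; TextExtractor is a hypothetical stand-in for a method 
like getHDFContent, and the Lucene side is deliberately left out.

```java
import java.io.File;
import java.util.Optional;

/** Hypothetical wrapper; not part of POI, Lucene, or the patch. */
final class SafeExtraction {

    /** Stand-in for a method such as HDFExtractor.getHDFContent(File). */
    interface TextExtractor {
        String extract(File f) throws Exception;
    }

    /**
     * Returns the extracted text, or an empty Optional when the file
     * could not be parsed or yielded no text.
     */
    static Optional<String> tryExtract(TextExtractor extractor, File f) {
        try {
            String text = extractor.extract(f);
            if (text == null || text.isEmpty()) {
                return Optional.empty();
            }
            return Optional.of(text);
        } catch (Exception e) {
            // Log and move on so one bad file does not stop the indexing run.
            System.err.println("Skipping " + f + ": " + e);
            return Optional.empty();
        }
    }
}
```

A caller would then add a document to the index only when the returned 
Optional is present.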

Let me know if you still want me to contribute my "hacks" (or patches if 
you prefer)...

Regards,
   Serge Huber.


- -- --- -----=[ serge.huber at jahia dot com ]=---- --- -- -
Jahia : A collaborative source CMS and Portal Server
www.jahia.org Community and product web site
www.jahia.com Commercial services company

Re: Interested pure Word text extraction patch ?

Posted by Serge Huber <sh...@jahia.com>.

Hi Ryan,

Sorry for posting to the mailing list twice. I wasn't sure my first mail 
got through because I got this in return from the postmaster at jakarta :

----- Your message was undeliverable for the following reason -----
The mailbox for laredotornado is currently full.
----- The original message follows -----

Anyway I posted the patch under 
http://issues.apache.org/bugzilla/show_bug.cgi?id=20060 . I hope it's not 
too messy and of interest.

Regards,
   Serge Huber.

At 09:15 AM 5/20/2003 -0400, you wrote:
>You can submit it and I will try to find a place for it, I am not promising
>anything.
>
>FYI, I wrote a little library to do text extraction from Word documents with
>POI. I am using it for my thesis research. You can get it at
>http://www.textmining.org .
>
>You may have had a problem with some of your documents because they were
>fast-saved. I didn't even attempt to support that because it didn't seem
>worth the effort to support the very few documents that are fast-saved.
>
>Ryan Ackley


Re: Interested pure Word text extraction patch ?

Posted by Ryan Ackley <sa...@cfl.rr.com>.

You can submit it and I will try to find a place for it, but I am not
promising anything.

FYI, I wrote a little library to do text extraction from Word documents with
POI. I am using it for my thesis research. You can get it at
http://www.textmining.org .

You may have had a problem with some of your documents because they were
fast-saved. I didn't even attempt to support that because it didn't seem
worth the effort to support the very few documents that are fast-saved.

Ryan Ackley
