Posted to user@poi.apache.org by Nir Nußbaum <ni...@gmail.com> on 2006/08/10 11:35:54 UTC

Re: Word extraction doesn't work??

 Hi

I am trying to extract pure text from Word (to index into Lucene):
I did:
    org.apache.poi.hwpf.extractor.WordExtractor we =
        new org.apache.poi.hwpf.extractor.WordExtractor(is);
    bodyText = we.getText();

I tested it on 48 documents. Most are quite simple (no pictures or the
like), but some are quite old (from around 2000).

I get an exception on 47 of the 48 documents. The stack trace is, for
instance:

java.io.IOException: Invalid header signature; read 7015536635646467195, expected -2226271756974174256
        at org.apache.poi.poifs.storage.HeaderBlockReader.<init>(HeaderBlockReader.java:91)
        at org.apache.poi.poifs.filesystem.POIFSFileSystem.<init>(POIFSFileSystem.java:83)
        at org.apache.poi.hdf.extractor.WordDocument.<init>(WordDocument.java:193)

Would love to get replies.
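
The "expected" number in that exception is the OLE2 magic, the bytes D0 CF 11 E0 A1 B1 1A E1, read as a little-endian long. The "read" number can be turned back into bytes the same way to see what the file really starts with. A minimal sketch using only the JDK; the class and method names are hypothetical and not part of POI:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class SignatureDecoder {

    // Turn the long printed in the exception back into the 8 bytes
    // in the order they appear at the start of the file.
    static byte[] toFileBytes(long signature) {
        return ByteBuffer.allocate(8)
                .order(ByteOrder.LITTLE_ENDIAN)
                .putLong(signature)
                .array();
    }

    // Space-separated upper-case hex, e.g. "D0 CF 11 ...".
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Printable ASCII is kept as-is, everything else becomes '.'.
    static String toPrintable(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            int c = b & 0xFF;
            sb.append(c >= 0x20 && c <= 0x7E ? (char) c : '.');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Pass the "read" value(s) from the exception message as arguments.
        for (String arg : args) {
            byte[] bytes = toFileBytes(Long.parseLong(arg.trim()));
            System.out.println(arg + " -> " + toHex(bytes)
                    + "  |" + toPrintable(bytes) + "|");
        }
    }
}
```

Run on the value 7015536635646467195 from the trace above, this gives the ASCII text {\rtf1\a, which is how an RTF file begins.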

Re: Word extraction doesn't work??

Posted by Nick Burch <ni...@torchbox.com>.
On Thu, 10 Aug 2006, Nir Nußbaum wrote:
> http://nussbaum.bizhat.com/poi/olddoc.doc - a word document from 2000

This seems to be a Word 95 file.

> http://nussbaum.bizhat.com/poi/newdoc.doc - a word document from 2006

This seems to be a Corel WordPerfect file.

Nick

Re: Word extraction doesn't work??

Posted by Nir Nußbaum <ni...@gmail.com>.
I am quite sure it is a "real" word document.
I uploaded two files:
http://nussbaum.bizhat.com/poi/olddoc.doc - a word document from 2000
http://nussbaum.bizhat.com/poi/newdoc.doc - a word document from 2006

I would be grateful if you or anyone else could have a look at them.

Re: Word extraction doesn't work??

Posted by Nick Burch <ni...@torchbox.com>.
On Thu, 10 Aug 2006, Nir Nußbaum wrote:
> Here is a stack trace example:
> Caused by: java.io.IOException: Invalid header signature; read 2337475350589629764, expected -2226271756974174256
>       at org.apache.poi.poifs.storage.HeaderBlockReader.<init>(HeaderBlockReader.java:91)
>       at org.apache.poi.poifs.filesystem.POIFSFileSystem.<init>(POIFSFileSystem.java:83)

Hmm, something deep within poifs doesn't think that your document is a
valid OLE2 stream. Are you sure the files are really in Word format, and
not something else that happens to have been saved with a .doc extension?

For example, I commonly see that error when I try to open RTF files with 
hwpf.

Nick
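
Nick's question can be checked without opening Word by looking at a file's first bytes. The sketch below uses only the JDK and assumes two well-known prefixes: OLE2 files begin with D0 CF 11 E0 A1 B1 1A E1 and RTF files begin with {\rtf. The class, enum, and method names are hypothetical and not part of POI.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class DocSniffer {

    enum Kind { OLE2, RTF, UNKNOWN }

    static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };
    static final byte[] RTF_MAGIC = "{\\rtf".getBytes(StandardCharsets.US_ASCII);

    static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    // Classify a file from its leading bytes.
    static Kind sniff(byte[] head) {
        if (startsWith(head, OLE2_MAGIC)) {
            return Kind.OLE2;
        }
        if (startsWith(head, RTF_MAGIC)) {
            return Kind.RTF;
        }
        return Kind.UNKNOWN;
    }

    // Read up to 8 bytes; fewer are returned if the stream is shorter.
    static byte[] readHead(InputStream in) throws IOException {
        byte[] buf = new byte[8];
        int n = 0;
        while (n < buf.length) {
            int r = in.read(buf, n, buf.length - n);
            if (r < 0) {
                break;
            }
            n += r;
        }
        return Arrays.copyOf(buf, n);
    }

    public static void main(String[] args) throws IOException {
        for (String path : args) {
            try (InputStream in = new FileInputStream(path)) {
                System.out.println(path + ": " + sniff(readHead(in)));
            }
        }
    }
}
```

Note that readHead consumes the bytes it reads, so open a fresh stream before handing the file to WordExtractor. A result of OLE2 only says the container is right; it does not guarantee that HWPF can read what is inside.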

Re: Word extraction doesn't work??

Posted by Nir Nußbaum <ni...@gmail.com>.
I have no idea what version those files were created in. They were created
in the year 2000.
I just tried converting 1600 documents from 2006, created with a newer
version; 464 of 1592 files were converted successfully (29%).
Here is a stack trace example:
Caused by: java.io.IOException: Invalid header signature; read 2337475350589629764, expected -2226271756974174256
        at org.apache.poi.poifs.storage.HeaderBlockReader.<init>(HeaderBlockReader.java:91)
        at org.apache.poi.poifs.filesystem.POIFSFileSystem.<init>(POIFSFileSystem.java:83)
        at org.apache.poi.hwpf.extractor.WordExtractor.<init>(WordExtractor.java:32)
        at nl.uva.ilps.lib.dochandlers.WordDocumentHandler.extract(WordDocumentHandler.java:60)

Any help?
Thanks

Re: Word extraction doesn't work??

Posted by Nick Burch <ni...@torchbox.com>.
On Thu, 10 Aug 2006, Nir Nußbaum wrote:
> I get it also with hwpf:
> Caused by: java.io.FileNotFoundException: no such entry: "0Table"
>       at org.apache.poi.poifs.filesystem.DirectoryNode.getEntry(DirectoryNode.java:246)

I think you have a really early (pre-Word 97? pre-Word 95?) Word document,
if the file doesn't have the Table streams.

Nick

Re: Word extraction doesn't work??

Posted by Nir Nußbaum <ni...@gmail.com>.
I get it also with hwpf:
Caused by: java.io.FileNotFoundException: no such entry: "0Table"
        at org.apache.poi.poifs.filesystem.DirectoryNode.getEntry(DirectoryNode.java:246)
        at org.apache.poi.hwpf.HWPFDocument.<init>(HWPFDocument.java:134)
        at org.apache.poi.hwpf.extractor.WordExtractor.<init>(WordExtractor.java:40)
        at org.apache.poi.hwpf.extractor.WordExtractor.<init>(WordExtractor.java:32)
        at nl.uva.ilps.lib.dochandlers.WordDocumentHandler.extract(WordDocumentHandler.java:60)
        ... 2 more


2006/8/10, Nick Burch <ni...@torchbox.com>:
>
> On Thu, 10 Aug 2006, Nir Nußbaum wrote:
> > As you see from:
> > *org.apache.poi.hwpf.extractor.WordExtractor *
> > I used hwpf.
>
> But your stack trace was from hdf!
>
> Nick
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: poi-user-unsubscribe@jakarta.apache.org
> Mailing List:     http://jakarta.apache.org/site/mail2.html#poi
> The Apache Jakarta Poi Project:  http://jakarta.apache.org/poi/
>
>


-- 
----------------------------------------------
Nir Nußbaum

Re: Word extraction doesn't work??

Posted by Nick Burch <ni...@torchbox.com>.
On Thu, 10 Aug 2006, Nir Nußbaum wrote:
> As you see from:
> *org.apache.poi.hwpf.extractor.WordExtractor *
> I used hwpf.

But your stack trace was from hdf!

Nick

Re: Word extraction doesn't work??

Posted by Nir Nußbaum <ni...@gmail.com>.
Thanks Nick,
As you see from:
*org.apache.poi.hwpf.extractor.WordExtractor *
I used hwpf.
It is Word, not RTF, albeit from 2000.
By the way, I tried converting with the AbiWord command line and it all
went well, more or less.
I will try to attach one of the documents that can't be converted. Thanks
again.


2006/8/10, Nick Burch <ni...@torchbox.com>:
>
> On Thu, 10 Aug 2006, Nir Nußbaum wrote:
> > I am trying to extract pure text from Word (to index into Lucene):
> > I did:
> >     org.apache.poi.hwpf.extractor.WordExtractor we =
> >         new org.apache.poi.hwpf.extractor.WordExtractor(is);
> >     bodyText = we.getText();
>
> *snip*
>
> >       at org.apache.poi.hdf.extractor.WordDocument.<init>(WordDocument.java:193)
>
> Which are you using, hdf or hwpf? You will probably have more luck with
> hwpf than hdf.
>
>
> My best guess, though, is that these aren't Word documents. Try opening
> them in Word, and see what they really are (e.g. RTF).
>
> Nick
>


-- 
----------------------------------------------
Nir Nußbaum

Re: Word extraction doesn't work??

Posted by Nick Burch <ni...@torchbox.com>.
On Thu, 10 Aug 2006, Nir Nußbaum wrote:
> I am trying to extract pure text from Word (to index into Lucene):
> I did:
>     org.apache.poi.hwpf.extractor.WordExtractor we =
>         new org.apache.poi.hwpf.extractor.WordExtractor(is);
>     bodyText = we.getText();

*snip*

>       at org.apache.poi.hdf.extractor.WordDocument.<init>(WordDocument.java:193)

Which are you using, hdf or hwpf? You will probably have more luck with 
hwpf than hdf.


My best guess, though, is that these aren't Word documents. Try opening
them in Word, and see what they really are (e.g. RTF).

Nick