Posted to user@poi.apache.org by ahnf <in...@yahoo.com> on 2007/02/12 21:16:40 UTC

high failure rate in WordDocument.writeAllText() extraction?

Hi,
We have roughly 1,900 MS Word documents in a file repository that is used in a DAM system. We need to extract the text from these Word documents for indexing purposes and figured we would give POI a try. We have tried the stable 2.5.1 release as well as the alpha code, both with similar results: a high percentage of failures.

Using WordDocument.writeAllText() 

SUCCESS= 1341 FAIL=585

Using WordExtractor.getText() instead, we get fewer than 10 failures.

Below are the 3 main exceptions we constantly get from WordDocument:
------------------------------------------------------------------------------------------------------

java.io.IOException: Invalid header signature; read 7015536635646467195, expected -2226271756974174256
    at org.apache.poi.poifs.storage.HeaderBlockReader.<init>(HeaderBlockReader.java:91)
    at org.apache.poi.poifs.filesystem.POIFSFileSystem.<init>(POIFSFileSystem.java:83)
    at org.apache.poi.hdf.extractor.WordDocument.<init>(WordDocument.java:189)


java.lang.NegativeArraySizeException
    at org.apache.poi.hdf.extractor.data.ListTables.createLVL(ListTables.java:176)
    at org.apache.poi.hdf.extractor.data.ListTables.initLFO(ListTables.java:148)
    at org.apache.poi.hdf.extractor.data.ListTables.<init>(ListTables.java:42)
    at org.apache.poi.hdf.extractor.WordDocument.createListTables(WordDocument.java:1639)
    at org.apache.poi.hdf.extractor.WordDocument.findFormatting(WordDocument.java:364)
    at org.apache.poi.hdf.extractor.WordDocument.processComplexFile(WordDocument.java:291)
    at org.apache.poi.hdf.extractor.WordDocument.readFIB(WordDocument.java:243)
    at org.apache.poi.hdf.extractor.WordDocument.<init>(WordDocument.java:193)



java.lang.ArrayIndexOutOfBoundsException: 396
    at org.apache.poi.hdf.extractor.Utils.convertBytesToShort(Utils.java:47)
    at org.apache.poi.hdf.extractor.data.ListTables.createLVL(ListTables.java:175)
    at org.apache.poi.hdf.extractor.data.ListTables.initLFO(ListTables.java:148)
    at org.apache.poi.hdf.extractor.data.ListTables.<init>(ListTables.java:42)
    at org.apache.poi.hdf.extractor.WordDocument.createListTables(WordDocument.java:1639)
    at org.apache.poi.hdf.extractor.WordDocument.findFormatting(WordDocument.java:364)
    at org.apache.poi.hdf.extractor.WordDocument.processComplexFile(WordDocument.java:291)
    at org.apache.poi.hdf.extractor.WordDocument.readFIB(WordDocument.java:243)
    at org.apache.poi.hdf.extractor.WordDocument.<init>(WordDocument.java:193)
    at org.openmrm.core.file.service.POIConverterService.executeConversion(POIConverterService.java:147)
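
A tally like the SUCCESS/FAIL counts above can be broken down by exception type, which makes it easier to see how many files fall under each of the three traces. The following is only a minimal sketch: the Extractor interface is a hypothetical stand-in for whichever extraction call is being tested, not part of POI, and ExtractionTally is an illustrative name.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class ExtractionTally {
    /** Hypothetical stand-in for the real extraction call; not a POI type. */
    interface Extractor {
        String extract(String path) throws Exception;
    }

    int success = 0;
    int fail = 0;
    /** Failure count per exception class name, sorted by name. */
    final Map<String, Integer> failuresByType = new TreeMap<>();

    /** Runs the extractor over every path and records the outcome of each file. */
    void run(List<String> paths, Extractor extractor) {
        for (String path : paths) {
            try {
                extractor.extract(path);
                success++;
            } catch (Exception e) {
                // Catches checked exceptions (IOException) as well as runtime
                // ones (NegativeArraySizeException, ArrayIndexOutOfBoundsException).
                fail++;
                failuresByType.merge(e.getClass().getName(), 1, Integer::sum);
            }
        }
    }
}
```

After run() returns, failuresByType maps each exception class name to the number of files that triggered it.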


Re: high failure rate in WordDocument.writeAllText() extraction?

Posted by Nick Burch <ni...@torchbox.com>.
On Mon, 12 Feb 2007, ahnf wrote:
> Using WordDocument.writeAllText()

Try using org.apache.poi.hwpf.extractor.WordExtractor from the latest
alpha - people tend to have the best luck with that.
	http://jakarta.apache.org/poi/hwpf/quick-guide.html

> java.io.IOException: Invalid header signature; read 7015536635646467195, expected -2226271756974174256
>     at org.apache.poi.poifs.storage.HeaderBlockReader.<init>(HeaderBlockReader.java:91)
>     at org.apache.poi.poifs.filesystem.POIFSFileSystem.<init>(POIFSFileSystem.java:83)
>     at org.apache.poi.hdf.extractor.WordDocument.<init>(WordDocument.java:189)

The document isn't an OLE2 document, so POI can't read it. Try with HWPF;
it tends to give more helpful error messages for invalid files (e.g.
IllegalArgumentException("The document is really a RTF file") )
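
The two numbers in the "Invalid header signature" message are the first 8 bytes of the file read as a little-endian long, so they can be decoded by hand. The sketch below uses only the JDK; the class and method names are illustrative and not part of POI.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

class HeaderSignature {
    /** OLE2 magic number; as a signed long this is -2226271756974174256. */
    static final long OLE2_SIGNATURE = 0xE11AB1A1E011CFD0L;

    /** Reads the first 8 bytes of a file as a little-endian long. */
    static long read(byte[] fileStart) {
        return ByteBuffer.wrap(fileStart, 0, 8)
                .order(ByteOrder.LITTLE_ENDIAN)
                .getLong();
    }

    /** Converts a signature value from the error message back into its 8 file bytes, as ASCII. */
    static String toAscii(long signature) {
        byte[] bytes = ByteBuffer.allocate(8)
                .order(ByteOrder.LITTLE_ENDIAN)
                .putLong(signature)
                .array();
        return new String(bytes, StandardCharsets.US_ASCII);
    }

    /** True if the file starts with the OLE2 signature. */
    static boolean isOle2(byte[] fileStart) {
        return fileStart.length >= 8 && read(fileStart) == OLE2_SIGNATURE;
    }
}
```

Decoding the value from the trace, HeaderSignature.toAscii(7015536635646467195L) gives the eight characters {\rtf1\a, which is how an RTF file begins. A pre-check like isOle2() on the first bytes of each file could be used to set such files aside before handing them to an extractor.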


As for the others, see if you get a different error with hwpf. You might
have more luck (I usually do)

Nick

---------------------------------------------------------------------
To unsubscribe, e-mail: poi-user-unsubscribe@jakarta.apache.org
Mailing List:     http://jakarta.apache.org/site/mail2.html#poi
The Apache Jakarta Poi Project:  http://jakarta.apache.org/poi/


Re: Excel text extraction?

Posted by Daniel Noll <da...@nuix.com>.
ahnf wrote:
> Hi, Similar to the WordExtractor, does POI (or anyone else for that
> matter) provide a simple text extraction utility for just pulling all
> textual content out of an Excel file?

It's reasonably easy to write one yourself.

One reason it might not be in there already is that people have wildly 
differing ideas on what constitutes "text" in an Excel document.

For example, my own opinion is that:
   - String values are text
   - Numeric and date values are not text
   - Cell comments are text
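
As a rough illustration of that policy, the sketch below filters a list of cells down to string values and cell comments. The Cell class and Kind enum are hypothetical stand-ins, not POI types; a real extractor would fill them from whatever the spreadsheet library returns.

```java
import java.util.ArrayList;
import java.util.List;

class SheetTextPolicy {
    enum Kind { STRING, NUMERIC, DATE }

    /** Hypothetical cell model; not a POI class. */
    static class Cell {
        final Kind kind;
        final String value;    // display value of the cell
        final String comment;  // cell comment, or null if there is none

        Cell(Kind kind, String value, String comment) {
            this.kind = kind;
            this.value = value;
            this.comment = comment;
        }
    }

    /** Collects string values and comments; numeric and date values are skipped. */
    static List<String> extractText(List<Cell> cells) {
        List<String> out = new ArrayList<>();
        for (Cell c : cells) {
            if (c.kind == Kind.STRING) {
                out.add(c.value);
            }
            if (c.comment != null) {
                out.add(c.comment);  // a comment counts as text even on a numeric cell
            }
        }
        return out;
    }
}
```

Someone with a different idea of "text" would only need to change the conditions inside extractText().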

Daniel


-- 
Daniel Noll

Nuix Pty Ltd
Suite 79, 89 Jones St, Ultimo NSW 2007, Australia    Ph: +61 2 9280 0699
Web: http://nuix.com/                               Fax: +61 2 9212 6902




Excel text extraction?

Posted by ahnf <in...@yahoo.com>.
Hi,
Similar to the WordExtractor, does POI (or anyone else for that matter) provide a simple text extraction utility for just pulling all textual content out of an Excel file?

thanks!

 