Posted to user@poi.apache.org by ahnf <in...@yahoo.com> on 2007/02/12 21:16:40 UTC
high failure rate in WordDocument.writeAllText() extraction?
Hi,
We have roughly 1900 MS Word documents in a file repository that is used by a DAM system. We need to extract the text from these documents for indexing and figured we would give POI a try. We have tried the stable 2.5.1 release as well as the alpha code, both with similarly high failure rates.
Using WordDocument.writeAllText():
SUCCESS=1341 FAIL=585
The three main exceptions we keep getting are listed below.
Using WordExtractor.getText(), we get fewer than 10 failures.
------------------------------------------------------------------------------------------------------
java.io.IOException: Invalid header signature; read 7015536635646467195, expected -2226271756974174256
at org.apache.poi.poifs.storage.HeaderBlockReader.<init>(HeaderBlockReader.java:91)
at org.apache.poi.poifs.filesystem.POIFSFileSystem.<init>(POIFSFileSystem.java:83)
at org.apache.poi.hdf.extractor.WordDocument.<init>(WordDocument.java:189)
java.lang.NegativeArraySizeException
at org.apache.poi.hdf.extractor.data.ListTables.createLVL(ListTables.java:176)
at org.apache.poi.hdf.extractor.data.ListTables.initLFO(ListTables.java:148)
at org.apache.poi.hdf.extractor.data.ListTables.<init>(ListTables.java:42)
at org.apache.poi.hdf.extractor.WordDocument.createListTables(WordDocument.java:1639)
at org.apache.poi.hdf.extractor.WordDocument.findFormatting(WordDocument.java:364)
at org.apache.poi.hdf.extractor.WordDocument.processComplexFile(WordDocument.java:291)
at org.apache.poi.hdf.extractor.WordDocument.readFIB(WordDocument.java:243)
at org.apache.poi.hdf.extractor.WordDocument.<init>(WordDocument.java:193)
java.lang.ArrayIndexOutOfBoundsException: 396
at org.apache.poi.hdf.extractor.Utils.convertBytesToShort(Utils.java:47)
at org.apache.poi.hdf.extractor.data.ListTables.createLVL(ListTables.java:175)
at org.apache.poi.hdf.extractor.data.ListTables.initLFO(ListTables.java:148)
at org.apache.poi.hdf.extractor.data.ListTables.<init>(ListTables.java:42)
at org.apache.poi.hdf.extractor.WordDocument.createListTables(WordDocument.java:1639)
at org.apache.poi.hdf.extractor.WordDocument.findFormatting(WordDocument.java:364)
at org.apache.poi.hdf.extractor.WordDocument.processComplexFile(WordDocument.java:291)
at org.apache.poi.hdf.extractor.WordDocument.readFIB(WordDocument.java:243)
at org.apache.poi.hdf.extractor.WordDocument.<init>(WordDocument.java:193)
at org.openmrm.core.file.service.POIConverterService.executeConversion(POIConverterService.java:147)
Re: high failure rate in WordDocument.writeAllText() extraction?
Posted by Nick Burch <ni...@torchbox.com>.
On Mon, 12 Feb 2007, ahnf wrote:
> Using WordDocument.writeAllText()
Try using org.apache.poi.hwpf.extractor.WordExtractor from the latest
alpha - people tend to have the best luck with that.
http://jakarta.apache.org/poi/hwpf/quick-guide.html
> java.io.IOException: Invalid header signature; read 7015536635646467195, expected -2226271756974174256
> at org.apache.poi.poifs.storage.HeaderBlockReader.<init>(HeaderBlockReader.java:91)
> at org.apache.poi.poifs.filesystem.POIFSFileSystem.<init>(POIFSFileSystem.java:83)
> at org.apache.poi.hdf.extractor.WordDocument.<init>(WordDocument.java:189)
The document isn't an OLE2 document, so POI can't read it. Try with hwpf;
it tends to give more helpful error messages for invalid files (e.g.
IllegalArgumentException("The document is really a RTF file") )
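The two numbers in that exception are the first eight bytes of the file read as a little-endian long, so they can be turned back into bytes to see what the file really starts with. The following is a minimal sketch using only the JDK; the class and method names are made up for illustration and are not part of POI. For the value reported above, the bytes come out as the start of an RTF header.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not a POI class: decodes the "read ..." value
// from the "Invalid header signature" message back into file bytes.
final class SignatureDecode {
    // OLE2 magic bytes D0 CF 11 E0 A1 B1 1A E1 read as a little-endian long.
    // As a signed long this is the "expected" value in the exception.
    static final long OLE2_SIGNATURE = 0xE11AB1A1E011CFD0L;

    // Returns the eight bytes behind the long, in file order, as ASCII.
    static String leadingBytesAsAscii(long signature) {
        byte[] bytes = ByteBuffer.allocate(8)
                .order(ByteOrder.LITTLE_ENDIAN)
                .putLong(signature)
                .array();
        return new String(bytes, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        // Value copied from the exception message in this thread.
        System.out.println(leadingBytesAsAscii(7015536635646467195L)); // prints {\rtf1\a
        System.out.println(OLE2_SIGNATURE); // prints -2226271756974174256
    }
}
```

Running the same decoding over the values from other failing files may help sort them by what they really are before deciding how to handle them.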
As for the others, see if you get a different error with hwpf. You might
have more luck (I usually do)
Nick
---------------------------------------------------------------------
To unsubscribe, e-mail: poi-user-unsubscribe@jakarta.apache.org
Mailing List: http://jakarta.apache.org/site/mail2.html#poi
The Apache Jakarta Poi Project: http://jakarta.apache.org/poi/
Re: Excel text extraction?
Posted by Daniel Noll <da...@nuix.com>.
ahnf wrote:
> Hi, Similar to the WordExtractor, does POI (or anyone else for that
> matter) provide a simple text extraction utility for just pulling all
> textual content out of an Excel file?
It's reasonably easy to write one yourself.
One reason it might not be in there already is that people have wildly
differing ideas on what constitutes "text" in an Excel document.
For example, my own opinion is that:
- String values are text
- Numeric and date values are not text
- Cell comments are text
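As a rough illustration of that policy, here is a minimal sketch in plain Java. The Cell and Kind types are hypothetical stand-ins invented for this example, not POI's classes; with POI you would walk the sheets, rows and cells of a workbook and apply the same test to each cell.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the policy above; none of these types are POI's.
final class ExcelTextPolicy {
    enum Kind { STRING, NUMERIC, DATE }

    // Stand-in for a spreadsheet cell: its kind, its displayed value,
    // and an optional comment (null when the cell has none).
    static final class Cell {
        final Kind kind;
        final String value;
        final String comment;

        Cell(Kind kind, String value, String comment) {
            this.kind = kind;
            this.value = value;
            this.comment = comment;
        }
    }

    // Collects what this policy counts as text: string values and
    // cell comments. Numeric and date values are skipped.
    static List<String> extractText(List<Cell> cells) {
        List<String> text = new ArrayList<>();
        for (Cell cell : cells) {
            if (cell.kind == Kind.STRING) {
                text.add(cell.value);
            }
            if (cell.comment != null) {
                text.add(cell.comment);
            }
        }
        return text;
    }
}
```

Someone who wants numbers or dates indexed as well would only need to change the test inside the loop, which is why a single built-in extractor is unlikely to suit everybody.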
Daniel
--
Daniel Noll
Nuix Pty Ltd
Suite 79, 89 Jones St, Ultimo NSW 2007, Australia Ph: +61 2 9280 0699
Web: http://nuix.com/ Fax: +61 2 9212 6902
Excel text extraction?
Posted by ahnf <in...@yahoo.com>.
Hi,
Similar to the WordExtractor, does POI (or anyone else for that matter) provide a simple text extraction utility for just pulling all textual content out of an Excel file?
thanks!