Posted to user@poi.apache.org by Antony Bowesman <ad...@teamware.com> on 2006/11/14 01:14:32 UTC

Word document extraction fails - bad file length

I'm using Lucene+POI to index documents.  Text extraction from a Word document 
fails, using either HDF WordDocument or HWPF WordExtractor.  Essentially it is 
the same IOException in both cases:

java.io.IOException: Unable to read entire block; 511 bytes read; expected 512 bytes

coming from

org.apache.poi.poifs.storage.RawDataBlock.<init>(RawDataBlock.java:62)
org.apache.poi.poifs.storage.RawDataBlockList.<init>(RawDataBlockList.java:51)
org.apache.poi.poifs.filesystem.POIFSFileSystem.<init>(POIFSFileSystem.java:86)

The original Word document is 28671 bytes long, which is 1 byte short of a 
512-byte boundary.  If I use Word to just remove the final line of the document 
and resave it, it becomes a 512-byte-aligned 28672 bytes.
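
The boundary arithmetic above can be checked with a small sketch.  The class and 
method names are illustrative only (they are not part of POI); the 512-byte 
block size is taken from the exception message.

```java
// Hypothetical helper, not POI API: reports how far a file length is
// from the next 512-byte block boundary.
class BlockAlignment {
    // Block size named in the IOException ("expected 512 bytes").
    static final int BLOCK_SIZE = 512;

    /** Number of bytes missing to reach the next 512-byte boundary (0 if aligned). */
    static long missingBytes(long fileLength) {
        long remainder = fileLength % BLOCK_SIZE;
        return remainder == 0 ? 0 : BLOCK_SIZE - remainder;
    }

    public static void main(String[] args) {
        System.out.println(missingBytes(28671)); // 1
        System.out.println(missingBytes(28672)); // 0
    }
}
```

28671 = 55 * 512 + 511, so the last block holds 511 bytes, which matches the 
"511 bytes read" in the exception.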

The original does seem to be a Word document, i.e. it's not RTF and has a similar 
binary structure to other .doc files.

Is it usual to find documents that are not padded to 512-byte boundaries? 
Looking back at old Word documents from the '90s, I can see a number that are not.

As an experiment I took one of the old docs and padded it suitably and got the 
Exception

Caused by: java.io.FileNotFoundException: no such entry: "0Table"
         at 
org.apache.poi.poifs.filesystem.DirectoryNode.getEntry(DirectoryNode.java:245)
         at org.apache.poi.hwpf.HWPFDocument.<init>(HWPFDocument.java:134)
         at 
org.apache.poi.hwpf.extractor.WordExtractor.<init>(WordExtractor.java:39)
         at 
org.apache.poi.hwpf.extractor.WordExtractor.<init>(WordExtractor.java:31)

Does anyone know the rule about this?  Are documents that are not padded to a 
512-byte boundary invalid, or are they just some older version of the doc format?

Can anyone shed any light...
Antony


---------------------------------------------------------------------
To unsubscribe, e-mail: poi-user-unsubscribe@jakarta.apache.org
Mailing List:     http://jakarta.apache.org/site/mail2.html#poi
The Apache Jakarta Poi Project:  http://jakarta.apache.org/poi/

Re: Word document extraction fails - bad file length

Posted by Antony Bowesman <ad...@teamware.com>.

Hi Rainer,

Thanks for your comments.

>> The original does seem to be a Word document, i.e. it's not RTF and 
>> has similar binary structure as other .doc files.
> 
> Do you know which version it is? (Can you find something like
> "Word.Document.#" where # is a number in the hex dump?)

The one problematic document has "Word.Document.8" in the last 512-byte 
block in the file (which is 511 bytes).  Earlier in the file it has 
"Microsoft Word 10.0".  The properties of the document when opened with my Word 
2002 show "Microsoft Word 97-2002 Document".  I understand this is an Office XP 
file.

> The Word files which HDF/HWPF can handle must have a table stream
> with the name "0Table" or "1Table". So either the file is not an OLE2
> docfile, or it is but does not have a table stream (not sure whether
> the second case exists or not).
> So my guess is, that the '90s file is too old for HWPF in that it is
> not an OLE2 docfile.

The '90s files are shown as Word 6.0/95.  For the problem doc, the Root Entry 
block has "1_T_a_b_l_e" at pos root + 0x100 and the doc was created Jan 2006.

> Maybe a quick indicator would be: Look for "0_T_a_b_l_e_" or
> "1_T_a_b_l_e_" in the hex dump ('_' shall represent the 0x00 byte for
> now). If it's not there, HWPF/HDF can't read it.

Thanks for that.  I've looked at the Nutch sources and seen that it uses a 
specific WordDocument6 parser to parse the older files and POI for others.

Just opening the problem file with Word 2002 and saving it to a new filename 
adds the extra trailing null to the file.

If MS Word can open my problem document without telling me there is a problem, 
shouldn't POI be able to handle this case?

Ha, I see that even Jon Postel now has a Wikipedia entry for his "robustness 
principle" called Postel's Law!

http://en.wikipedia.org/wiki/Postel's_law

I can make a hack to my document handler to check for Word doc 512 byte padding 
in the event of Exceptions and repad the file and have another go, but it's a 
bit of a hack :)
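
A minimal sketch of the repadding step described above, assuming the whole file 
has already been read into memory (for example with 
java.nio.file.Files.readAllBytes).  DocPadding and padToBlock are hypothetical 
names, not POI API, and the retry itself is only indicated in a comment.

```java
import java.util.Arrays;

// Hypothetical helper, not POI API: pads file contents with trailing
// zero bytes up to the next 512-byte boundary.
class DocPadding {
    static final int BLOCK_SIZE = 512;

    /** Returns data unchanged if already aligned, else a zero-padded copy. */
    static byte[] padToBlock(byte[] data) {
        int remainder = data.length % BLOCK_SIZE;
        if (remainder == 0) {
            return data;
        }
        // Arrays.copyOf fills the added tail with zero bytes.
        return Arrays.copyOf(data, data.length + (BLOCK_SIZE - remainder));
    }

    public static void main(String[] args) {
        byte[] padded = padToBlock(new byte[28671]);
        System.out.println(padded.length); // 28672
        // Assumed next step after the first attempt threw an IOException:
        // new POIFSFileSystem(new java.io.ByteArrayInputStream(padded));
    }
}
```

This only mirrors what resaving in Word did to this one file (a single trailing 
null).  It cannot repair a file whose missing bytes carried real data.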

Antony




Re: Word document extraction fails - bad file length

Posted by Rainer Schwarze <rs...@admadic.de>.

Hi Antony,

At 01:14 14.11.2006, Antony Bowesman wrote:
>I'm using Lucene+POI to index documents.  Text extraction from a
>Word document 
>fails, either using HDF WordDocument or HWPF WordExtractor.  Essentially it is 
>the same IOException of
>
>java.io.IOException: Unable to read entire block; 511 bytes read; 
>expected 512 bytes

I'm not too sure about POIFS - I would expect that Word files using
the OLE2 docfile format should have a size which is a multiple of 512.

>The original does seem to be a Word document, i.e. it's not RTF and 
>has similar binary structure as other .doc files.

Do you know which version it is? (Can you find something like
"Word.Document.#" where # is a number in the hex dump?)

>Is it usual to find documents that are not padded to 512 byte boundaries. 
>Looking back at old Word documents from the '90s, I can see a number 
>that are not.
[...]
>As an experiment I took one of the old docs and padded it suitably 
>and got the Exception
[...]
>Caused by: java.io.FileNotFoundException: no such entry: "0Table"

The Word files which HDF/HWPF can handle must have a table stream
with the name "0Table" or "1Table". So either the file is not an OLE2
docfile, or it is but does not have a table stream (not sure whether
the second case exists or not).
So my guess is, that the '90s file is too old for HWPF in that it is
not an OLE2 docfile.
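
One way to test the guess above is to look at the first eight bytes of the file: 
an OLE2 compound file starts with the signature D0 CF 11 E0 A1 B1 1A E1.  The 
following is only a sketch, and the class and method names are made up for 
illustration.

```java
// Hypothetical helper, not POI API: checks for the OLE2 compound file
// signature at the start of a file.
class Ole2Signature {
    private static final byte[] MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    /** True if the given bytes begin with the OLE2 signature. */
    static boolean looksLikeOle2(byte[] fileStart) {
        if (fileStart.length < MAGIC.length) {
            return false;
        }
        for (int i = 0; i < MAGIC.length; i++) {
            if (fileStart[i] != MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] rtf = {'{', '\\', 'r', 't', 'f', '1', ' ', ' '};
        System.out.println(looksLikeOle2(rtf)); // false
    }
}
```

A match only says the container is OLE2.  It says nothing about whether the 
file has a table stream.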

>Does anyone know the rule about this?  Are non 512 byte padded 
>documents invalid 
>or just some older version of the doc format.

Maybe a quick indicator would be: Look for "0_T_a_b_l_e_" or
"1_T_a_b_l_e_" in the hex dump ('_' shall represent the 0x00 byte for
now). If it's not there, HWPF/HDF can't read it.
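
The hex dump check can also be done in code.  A rough sketch, assuming the file 
fits in memory: the pattern with a 0x00 after each character is simply the name 
encoded as UTF-16LE.  TableStreamScan and its methods are illustrative names, 
not part of POI.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not POI API: scans raw file bytes for the
// UTF-16LE encoded names "0Table" or "1Table".
class TableStreamScan {
    /** Naive byte search; returns the first index of needle, or -1. */
    static int indexOf(byte[] haystack, byte[] needle) {
        outer:
        for (int i = 0; i + needle.length <= haystack.length; i++) {
            for (int j = 0; j < needle.length; j++) {
                if (haystack[i + j] != needle[j]) {
                    continue outer;
                }
            }
            return i;
        }
        return -1;
    }

    /** True if either table stream name appears anywhere in the bytes. */
    static boolean hasTableStreamName(byte[] file) {
        for (String name : new String[] {"0Table", "1Table"}) {
            // UTF-16LE gives e.g. 31 00 54 00 61 00 62 00 6C 00 65 00
            byte[] pattern = name.getBytes(StandardCharsets.UTF_16LE);
            if (indexOf(file, pattern) >= 0) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        byte[] sample = "1Table".getBytes(StandardCharsets.UTF_16LE);
        System.out.println(hasTableStreamName(sample)); // true
    }
}
```

This is only a heuristic: a hit does not prove the bytes belong to a directory 
entry, and a miss does not say why the stream is absent.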

Best wishes,
Rainer
