Posted to user@poi.apache.org by "Allison, Timothy B." <ta...@mitre.org> on 2014/11/24 21:23:47 UTC

Invalid header for xls: 0x0010000000060409?

All,
  I recently ran Tika against the ~1 million files in govdocs1.  Nearly 91% (2,579/2,828) of the XLS exceptions via Tika 1.7 are the following.  Tika is detecting these as XLS and then the header exception is thrown.
  Does this header ring any bells?  Old version of XLS, perhaps?  The triggering files open in Excel and I think I see that they are "Excel 4".
  I can't get the link to work, but one triggering file is 004444.xls.

          Best,

                   Tim


Caused by: java.io.IOException: Invalid header signature; read 0x0010000000060409, expected 0xE11AB1A1E011CFD0 - Your file appears not to be a valid OLE2 document
    at org.apache.poi.poifs.storage.HeaderBlock.<init>(HeaderBlock.java:140)
    at org.apache.poi.poifs.storage.HeaderBlock.<init>(HeaderBlock.java:115)
    at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.<init>(NPOIFSFileSystem.java:198)
    at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.<init>(NPOIFSFileSystem.java:184)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:162)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    ... 13 more

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@poi.apache.org
For additional commands, e-mail: user-help@poi.apache.org


RE: Invalid header for xls: 0x0010000000060409?

Posted by Nick Burch <ap...@gagravarr.org>.
On Tue, 25 Nov 2014, Allison, Timothy B. wrote:
> Thank you, Nick!  I'll post a file to Tika's JIRA.  Or, should I raise 
> this on POI's bugzilla?  I can't imagine there's a burning need (or 
> interest) to add processing for pre-OLE2 docs.

It'll want to be a Tika issue - POI won't handle these, and Tika needs to 
return a different mimetype for them, as they're not OLE2-based Excel 
files.

Nick



RE: Invalid header for xls: 0x0010000000060409?

Posted by "Allison, Timothy B." <ta...@mitre.org>.
Thank you, Nick!  I'll post a file to Tika's JIRA.  Or, should I raise this on POI's bugzilla?  I can't imagine there's a burning need (or interest) to add processing for pre-OLE2 docs.





Re: Invalid header for xls: 0x0010000000060409?

Posted by Nick Burch <ap...@gagravarr.org>.
On Mon, 24 Nov 2014, Allison, Timothy B. wrote:
> I recently ran Tika against the ~1 million files in govdocs1.  Nearly 
> 91% (2,579/2,828) of the XLS exceptions via Tika 1.7 are the following. 
> Tika is detecting these as XLS and then the header exception is thrown.

You need to read that backwards to see the pattern: the signature is read 
as a little-endian long, so the file starts with the bytes 09 04 06 
(0x090406)
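
To make the "read it backwards" step concrete, here is a minimal sketch in plain Java (no POI classes involved) that turns the long from the exception message back into the bytes as they sit at the start of the file. The class and method names (SignatureBytes, toFileOrder, hex) are made up for illustration; the only assumption is that the signature was decoded as a little-endian 64-bit value, which is consistent with the expected value in the message mapping onto the familiar OLE2 magic bytes.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper, not part of POI or Tika.
class SignatureBytes {

    /** Returns the 8 bytes, in file order, that decode to the given little-endian long. */
    static byte[] toFileOrder(long signature) {
        return ByteBuffer.allocate(8)
                .order(ByteOrder.LITTLE_ENDIAN)
                .putLong(signature)
                .array();
    }

    /** Formats bytes as space-separated, two-digit upper-case hex. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Value reported as "read" in the exception
        System.out.println(hex(toFileOrder(0x0010000000060409L))); // 09 04 06 00 00 00 10 00
        // Value reported as "expected" (the OLE2 signature)
        System.out.println(hex(toFileOrder(0xE11AB1A1E011CFD0L))); // D0 CF 11 E0 A1 B1 1A E1
    }
}
```

The first line of output is what a hex dump of one of the triggering files should begin with; the second is what an OLE2-based .xls begins with.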

> Does this header ring any bells?  Old version of XLS, perhaps?  The 
> triggering files open in Excel and I think I see that they are "Excel 
> 4".

Sounds like one of the very old, pre-OLE2 versions.

Looking at the OpenOffice documentation, sections 2.2 and 2.3:
http://www.openoffice.org/sc/excelfileformat.pdf

That suggests that Excel 5 onwards (5, 95, 97 etc) used OLE2, so that'd 
mean it's Excel 1 through Excel 4

> I can't get the link to work, but one triggering file is 004444.xls.

If you can get that file out, and raise a JIRA, then we can look to add in 
magic to correctly detect/handle those files!
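
As a rough illustration of what such magic could look like, the sketch below checks the first four bytes of a file for a pre-OLE2 BOF record. It is not Tika's or POI's detection code. The class name PreOle2Sniffer and the returned labels are hypothetical, and the record IDs (0x0009, 0x0209, 0x0409) and record lengths (4, 6, 6) are an assumption based on my reading of the BOF record description in the OpenOffice document linked above; only the BIFF4 case (09 04 06 00) is confirmed by the header in this thread.

```java
// Hypothetical sketch, not part of POI or Tika.
class PreOle2Sniffer {

    /**
     * Looks at the leading bytes of a file and returns "BIFF2", "BIFF3" or
     * "BIFF4" if they look like a pre-OLE2 BOF record, or null otherwise.
     * Record IDs and lengths are assumed from the OpenOffice Excel file
     * format document; verify them before relying on this.
     */
    static String sniff(byte[] head) {
        if (head == null || head.length < 4) {
            return null;
        }
        // BIFF records start with a 2-byte record ID and a 2-byte length,
        // both little-endian.
        int recordId = (head[0] & 0xFF) | ((head[1] & 0xFF) << 8);
        int length = (head[2] & 0xFF) | ((head[3] & 0xFF) << 8);
        switch (recordId) {
            case 0x0009:
                return length == 4 ? "BIFF2" : null;
            case 0x0209:
                return length == 6 ? "BIFF3" : null;
            case 0x0409:
                return length == 6 ? "BIFF4" : null;
            default:
                return null;
        }
    }
}
```

Four bytes is a weak signature, so a real detector would probably want to look further into the record (or at the file extension) before settling on a mimetype.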

Nick
