Posted to dev@tika.apache.org by "John Mastarone (Created) (JIRA)" <ji...@apache.org> on 2011/12/22 05:15:30 UTC

[jira] [Created] (TIKA-826) TikaException / OfficeXmlFileException with .xlsb files

TikaException / OfficeXmlFileException with .xlsb files
-------------------------------------------------------

                 Key: TIKA-826
                 URL: https://issues.apache.org/jira/browse/TIKA-826
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 1.1
         Environment: Windows 7
            Reporter: John Mastarone


The file testEXCEL.xlsb in the tika-parsers test-documents folder causes a POI OfficeXmlFileException when one tries to open it with TikaGUI or TikaCLI, using the latest build.  The reason: Tika has it configured to be opened with the OfficeParser class rather than the OOXMLParser class; it is an Office 2007 file, and should be opened with the OOXMLParser class.  Neither the ExcelParserTest class nor the OOXMLParserTest class has any test related to .xlsb files.  Once these two parsers are changed so that the OOXMLParser is used (I'll submit a patch for this shortly), the OfficeXmlFileException goes away, and a new POI exception (IllegalArgumentException in the ExtractorFactory class) arises in its place.  It is somewhat related to unsolved POI bug 51921, whose creator mentions a .xlsb file among others.  This exception appears to occur because POI cannot handle .xlsb files at all.  A cursory search of the POI source for "xlsb" or its mime type yields nothing relevant, and the POI project has no .xlsb test files that I can see.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

[jira] [Commented] (TIKA-826) TikaException / OfficeXmlFileException with .xlsb files

Posted by "John Mastarone (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13175625#comment-13175625 ] 

John Mastarone commented on TIKA-826:
-------------------------------------

As things stand, there's an explicit line in the OfficeParser for the xlsb mime type, which results in the OfficeXmlFileException with the message "The supplied data appears to be in the Office 2007+ XML. You are calling the part of POI that deals with OLE2 Office Documents. You need to call a different part of POI to process this data (eg XSSF instead of HSSF)".  If one removes this line from OfficeParser, the OOXMLParser tries to handle the document, whether the line is moved into the OOXMLParser or left out of the parsers entirely, and the same exception ensues in either case: "org.apache.tika.exception.TikaException: TIKA-418: RuntimeException while getting content for thmx and xps file types
	at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:106)
	at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:70)
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)....
Caused by: java.lang.IllegalArgumentException: No supported documents found in the OOXML package (found application/vnd.ms-excel.sheet.binary.macroEnabled.main)
	at org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:191)
	at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:63)
	... 43 more "
I don't know enough about the code base to know how to get the parsers to leave .xlsb files alone entirely.
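
For illustration only, here is a minimal sketch of how one might recognise such a file before handing it to POI. It is not Tika or POI code: it uses only java.util.zip, the class and method names (XlsbSniffSketch, looksLikeXlsb, zipWithContentTypes) are hypothetical, and it assumes the content type quoted in the IllegalArgumentException above is listed in the package's [Content_Types].xml entry.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

final class XlsbSniffSketch {
    // Main-part content type quoted in the IllegalArgumentException above.
    static final String XLSB_MAIN =
            "application/vnd.ms-excel.sheet.binary.macroEnabled.main";

    // True if the package's [Content_Types].xml mentions the xlsb main-part type.
    // A plain substring match keeps the sketch short; it is not a real XML parse.
    static boolean looksLikeXlsb(InputStream in) {
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if ("[Content_Types].xml".equals(entry.getName())) {
                    String xml = new String(zip.readAllBytes(), StandardCharsets.UTF_8);
                    return xml.contains(XLSB_MAIN);
                }
            }
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Hypothetical helper: builds a one-entry zip in memory for the demo.
    static byte[] zipWithContentTypes(String contentTypesXml) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry("[Content_Types].xml"));
            zip.write(contentTypesXml.getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        // Hand-written stand-in for a real package manifest (an assumption).
        String xml = "<Types><Override PartName=\"/xl/workbook.bin\" ContentType=\""
                + XLSB_MAIN + "\"/></Types>";
        byte[] pkg = zipWithContentTypes(xml);
        System.out.println(looksLikeXlsb(new ByteArrayInputStream(pkg)));
    }
}
```

A check along these lines would let a caller skip .xlsb packages instead of letting the extractor factory throw.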
                

[jira] [Commented] (TIKA-826) TikaException / OfficeXmlFileException with .xlsb files

Posted by "John Mastarone (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13174803#comment-13174803 ] 

John Mastarone commented on TIKA-826:
-------------------------------------

After reading a little more on this, I see that POI doesn't plan on supporting xlsb files anytime soon (?), so, should Tika really try to handle them at all?
                

[jira] [Commented] (TIKA-826) TikaException / OfficeXmlFileException with .xlsb files

Posted by "Nick Burch (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13175240#comment-13175240 ] 

Nick Burch commented on TIKA-826:
---------------------------------

POI doesn't support .xlsb files, nor is it likely to any time soon (there's just no volunteer energy to work on it, nor interest).

.xlsb files should be detected as application/vnd.ms-excel.sheet.binary.macroenabled.12, and I don't believe any parser should be claiming them.

Are you able to identify where along the way things go wrong?
                

[jira] [Commented] (TIKA-826) TikaException / OfficeXmlFileException with .xlsb files

Posted by "Nick Burch (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13178632#comment-13178632 ] 

Nick Burch commented on TIKA-826:
---------------------------------

Should be fixed in r1226651. Neither parser now claims the format, and if a file reaches the OOXML parser on the basis of its parent type, the parser declines it. Tests have also been added for these cases.
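
The behaviour described above can be modelled roughly as follows. This is a self-contained sketch, not the code from r1226651: every class and method name (XlsbDispatchSketch, SketchParser, pick, parents, parsersAfterFix) is hypothetical, and the parent type application/x-tika-ooxml is an assumption that this thread does not state.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

final class XlsbDispatchSketch {
    static final String XLSB = "application/vnd.ms-excel.sheet.binary.macroenabled.12";
    static final String XLSX =
            "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet";
    // Assumed parent type for OOXML formats; not stated in this thread.
    static final String OOXML_PARENT = "application/x-tika-ooxml";

    // Hypothetical parser model: the types it claims, and the types it
    // declines even when it is reached through a parent type.
    static final class SketchParser {
        final String name;
        final Set<String> claimed;
        final Set<String> declined;

        SketchParser(String name, Set<String> claimed, Set<String> declined) {
            this.name = name;
            this.claimed = claimed;
            this.declined = declined;
        }

        boolean accepts(String lookupType, String actualType) {
            return claimed.contains(lookupType) && !declined.contains(actualType);
        }
    }

    // Child type -> parent type, for the two spreadsheet types in the demo.
    static Map<String, String> parents() {
        return Map.of(XLSB, OOXML_PARENT, XLSX, OOXML_PARENT);
    }

    // After the fix: neither parser claims XLSB, and the OOXML one declines it.
    static List<SketchParser> parsersAfterFix() {
        return List.of(
                new SketchParser("office", Set.of("application/vnd.ms-excel"), Set.of()),
                new SketchParser("ooxml", Set.of(OOXML_PARENT, XLSX), Set.of(XLSB)));
    }

    // Walks from the detected type up its parent chain and returns the first
    // parser that accepts the document, or empty if every parser passes.
    static Optional<String> pick(String type, Map<String, String> parents,
                                 List<SketchParser> parsers) {
        for (String t = type; t != null; t = parents.get(t)) {
            for (SketchParser p : parsers) {
                if (p.accepts(t, type)) {
                    return Optional.of(p.name);
                }
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(pick(XLSB, parents(), parsersAfterFix()));
        System.out.println(pick(XLSX, parents(), parsersAfterFix()));
    }
}
```

In this model an .xlsb document ends up with no parser at all, while an .xlsx document is still routed to the OOXML parser.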
                

[jira] [Resolved] (TIKA-826) TikaException / OfficeXmlFileException with .xlsb files

Posted by "Nick Burch (Resolved) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nick Burch resolved TIKA-826.
-----------------------------

       Resolution: Fixed
    Fix Version/s: 1.1
    

[jira] [Updated] (TIKA-826) TikaException / OfficeXmlFileException with .xlsb files

Posted by "John Mastarone (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

John Mastarone updated TIKA-826:
--------------------------------

    Attachment: TIKA-826.patch

Patch for OfficeParser and OOXMLParser classes so that .xlsb files are handled by the latter.  Solves one problem and exposes another.
                

[jira] [Issue Comment Edited] (TIKA-826) TikaException / OfficeXmlFileException with .xlsb files

Posted by "John Mastarone (Issue Comment Edited) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13174803#comment-13174803 ] 

John Mastarone edited comment on TIKA-826 at 12/22/11 1:35 PM:
---------------------------------------------------------------

After reading a little more on this, I see that POI doesn't plan on supporting xlsb files anytime soon (I think), so, should Tika really try to handle them at all?
                
      was (Author: jfm.apache):
    After reading a little more on this, I see that POI doesn't plan on supporting xlsb files anytime soon (?), so, should Tika really try to handle them at all?
                  