Posted to dev@tika.apache.org by "Michael McCandless (JIRA)" <ji...@apache.org> on 2011/09/05 16:14:10 UTC

[jira] [Created] (TIKA-705) Valid OOXML PPT file hits InvalidFormatException thrown in POI

Valid OOXML PPT file hits InvalidFormatException thrown in POI
--------------------------------------------------------------

                 Key: TIKA-705
                 URL: https://issues.apache.org/jira/browse/TIKA-705
             Project: Tika
          Issue Type: Bug
            Reporter: Michael McCandless
         Attachments: testPPT_various.pptx

I took the "testRTFVarious.rtf" test case from TIKA-683, and saved it as various other doc types, to generate more test cases.

But when I did this for PPTX, the resulting file hits this exception:
{noformat}

Exception in thread "main" org.apache.tika.exception.TikaException: Broken OOXML file
	at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedParts(AbstractOOXMLExtractor.java:141)
	at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:112)
	at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:95)
	at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:70)
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:129)
	at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:126)
	at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:363)
	at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:97)
Caused by: org.apache.poi.openxml4j.exceptions.InvalidFormatException: A segment shall not hold any characters other than pchar characters. [M1.6]
	at org.apache.poi.openxml4j.opc.PackagePartName.checkPCharCompliance(PackagePartName.java:370)
	at org.apache.poi.openxml4j.opc.PackagePartName.throwExceptionIfPartNameHaveInvalidSegments(PackagePartName.java:270)
	at org.apache.poi.openxml4j.opc.PackagePartName.throwExceptionIfInvalidPartUri(PackagePartName.java:185)
	at org.apache.poi.openxml4j.opc.PackagePartName.<init>(PackagePartName.java:83)
	at org.apache.poi.openxml4j.opc.PackagingURIHelper.createPartName(PackagingURIHelper.java:490)
	at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedParts(AbstractOOXMLExtractor.java:124)
	... 9 more
{noformat}

All I did was open Office 2007, copy/paste over the text from the Word doc, and save it.  I.e., it should be a valid OOXML file, unless Office 2007 is buggy?

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (TIKA-705) Valid OOXML PPT file hits InvalidFormatException thrown in POI

Posted by "Nick Burch (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13097319#comment-13097319 ] 

Nick Burch commented on TIKA-705:
---------------------------------

I'll need to read the spec to be sure, but I have a feeling it could be our issue with not removing anchors before fetching parts.

Either way, we probably want to make it easier for people to get related parts anyway, as the current method is a bit more fiddly than we really want.

This will probably mostly be done on the POI side though, with the only Tika bit being the move to the new, simpler code once it's available.
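
The anchor removal mentioned above can be sketched with the JDK alone. This is only an illustration and not the actual POI or Tika change; the class and method names (AnchorStripper, stripFragment) are hypothetical, and it assumes the relationship target is available as a java.net.URI.

```java
import java.net.URI;

// Hypothetical helper, not part of POI or Tika.
class AnchorStripper {
    /** Returns the given URI without its "#..." fragment, if it has one. */
    public static URI stripFragment(URI target) {
        String text = target.toString();
        // In a well-formed URI a literal '#' only ever starts the fragment.
        int hash = text.indexOf('#');
        return hash < 0 ? target : URI.create(text.substring(0, hash));
    }

    public static void main(String[] args) {
        URI withAnchor = URI.create("/ppt/slides/slide1.xml#_ftn1");
        System.out.println(stripFragment(withAnchor)); // prints /ppt/slides/slide1.xml
    }
}
```

The stripped URI is what one would then hand to the part-name lookup, so that the anchor never reaches the segment validation.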


        

[jira] [Commented] (TIKA-705) Valid OOXML PPT file hits InvalidFormatException thrown in POI

Posted by "Nick Burch (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13097217#comment-13097217 ] 

Nick Burch commented on TIKA-705:
---------------------------------

Looks to be a problem with a reference to part of a slide, rather than a whole slide:
   /ppt/slides/slide1.xml#_ftn1
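
To illustrate why that reference trips the "pchar" check in the stack trace: RFC 3986 defines pchar as unreserved characters, percent-encoded triplets, sub-delimiters, ":" and "@", and "#" is none of those. The sketch below is an assumption-laden approximation of such a check, not POI's actual code (which applies further OPC part-name rules); the names PCharCheck and isValidSegment are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical sketch of an RFC 3986 pchar check; not POI's implementation.
class PCharCheck {
    // pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
    private static final Pattern PCHAR_SEGMENT = Pattern.compile(
            "(?:[A-Za-z0-9\\-._~!$&'()*+,;=:@]|%[0-9A-Fa-f]{2})*");

    /** True if every character of the segment is a pchar. */
    public static boolean isValidSegment(String segment) {
        return PCHAR_SEGMENT.matcher(segment).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValidSegment("slide1.xml"));       // prints true
        System.out.println(isValidSegment("slide1.xml#_ftn1")); // prints false
    }
}
```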


        

[jira] [Updated] (TIKA-705) Valid OOXML PPT file hits InvalidFormatException thrown in POI

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated TIKA-705:
------------------------------------

    Attachment: testPPT_various.pptx

PPTX file showing the exception.


        

[jira] [Commented] (TIKA-705) Valid OOXML PPT file hits InvalidFormatException thrown in POI

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13107974#comment-13107974 ] 

Michael McCandless commented on TIKA-705:
-----------------------------------------

Thanks Nick!

I verified that the testVarious test case (in OOXMLParserTest) now passes (I had left it commented out), so I'll go uncomment & commit.


        

[jira] [Updated] (TIKA-705) Valid OOXML PPT file hits InvalidFormatException thrown in POI

Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting updated TIKA-705:
-------------------------------

    Fix Version/s:     (was: 0.10)

Removing from the 0.10 roadmap; let's set the fix version to the next release once the fix is in.


        

[jira] [Commented] (TIKA-705) Valid OOXML PPT file hits InvalidFormatException thrown in POI

Posted by "Nick Burch (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13172973#comment-13172973 ] 

Nick Burch commented on TIKA-705:
---------------------------------

Code simplified in r1221115 now that we've upgraded POI.
                

        

[jira] [Commented] (TIKA-705) Valid OOXML PPT file hits InvalidFormatException thrown in POI

Posted by "Nick Burch (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13107959#comment-13107959 ] 

Nick Burch commented on TIKA-705:
---------------------------------

Initial workaround committed in r1172690.

The proper fix is commented out in the code, and can be activated when we upgrade to POI 3.8 beta 5 (I've added a new method there).


        

[jira] [Updated] (TIKA-705) Valid OOXML PPT file hits InvalidFormatException thrown in POI

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated TIKA-705:
------------------------------------

          Component/s: parser
    Affects Version/s: 0.9
        Fix Version/s: 1.0


        

[jira] [Resolved] (TIKA-705) Valid OOXML PPT file hits InvalidFormatException thrown in POI

Posted by "Michael McCandless (Resolved) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless resolved TIKA-705.
-------------------------------------

    Resolution: Fixed

I think we can resolve this now?  TIKA-757 is open to address TODOs on next POI upgrade.
                

        

[jira] [Commented] (TIKA-705) Valid OOXML PPT file hits InvalidFormatException thrown in POI

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13097261#comment-13097261 ] 

Michael McCandless commented on TIKA-705:
-----------------------------------------

Thanks for looking at this, Nick!

So, is this something I somehow screwed up using PowerPoint 2007?  Or is PowerPoint 2007 simply producing an invalid OOXML file?

Is there anything we (or POI) can do here?  It's bad if users can produce files "normally" (i.e., just using PowerPoint) that Tika then chokes on...

> Valid OOXML PPT file hits InvalidFormatException thrown in POI
> --------------------------------------------------------------
>
>                 Key: TIKA-705
>                 URL: https://issues.apache.org/jira/browse/TIKA-705
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.9
>            Reporter: Michael McCandless
>             Fix For: 1.0
>
>         Attachments: testPPT_various.pptx
>
>
> I took the "testRTFVarious.rtf" test case from TIKA-683, and saved it as various other doc types, to generate more test cases.
> But when I did this for PPTX, the resulting file hits this exception:
> {noformat}
> Exception in thread "main" org.apache.tika.exception.TikaException: Broken OOXML file
> 	at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedParts(AbstractOOXMLExtractor.java:141)
> 	at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:112)
> 	at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:95)
> 	at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:70)
> 	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
> 	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
> 	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:129)
> 	at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:126)
> 	at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:363)
> 	at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:97)
> Caused by: org.apache.poi.openxml4j.exceptions.InvalidFormatException: A segment shall not hold any characters other than pchar characters. [M1.6]
> 	at org.apache.poi.openxml4j.opc.PackagePartName.checkPCharCompliance(PackagePartName.java:370)
> 	at org.apache.poi.openxml4j.opc.PackagePartName.throwExceptionIfPartNameHaveInvalidSegments(PackagePartName.java:270)
> 	at org.apache.poi.openxml4j.opc.PackagePartName.throwExceptionIfInvalidPartUri(PackagePartName.java:185)
> 	at org.apache.poi.openxml4j.opc.PackagePartName.<init>(PackagePartName.java:83)
> 	at org.apache.poi.openxml4j.opc.PackagingURIHelper.createPartName(PackagingURIHelper.java:490)
> 	at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedParts(AbstractOOXMLExtractor.java:124)
> 	... 9 more
> {noformat}
> All I did was open Office 2007, copy/paste over the text from the Word doc, and save it.  Ie, it should be a valid OOXML file, unless Office 2007 is buggy?


        

[jira] [Updated] (TIKA-705) Valid OOXML PPT file hits InvalidFormatException thrown in POI

Posted by "Michael McCandless (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated TIKA-705:
------------------------------------

    Fix Version/s: 1.0
    
