Posted to dev@tika.apache.org by "Chris A. Mattmann (Updated) (JIRA)" <ji...@apache.org> on 2011/10/25 23:12:33 UTC

[jira] [Updated] (TIKA-715) Some parsers produce non-well-formed XHTML SAX events

     [ https://issues.apache.org/jira/browse/TIKA-715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann updated TIKA-715:
-----------------------------------

    Fix Version/s:     (was: 1.0)
                   1.1

- Pushing this out to 1.1 while we prep for 1.0.
                
> Some parsers produce non-well-formed XHTML SAX events
> -----------------------------------------------------
>
>                 Key: TIKA-715
>                 URL: https://issues.apache.org/jira/browse/TIKA-715
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.10
>            Reporter: Michael McCandless
>             Fix For: 1.1
>
>         Attachments: TIKA-715.patch
>
>
> With TIKA-683 I committed simple, commented-out code to
> SafeContentHandler that verifies that the SAX events produced by the
> parser have valid (matched) tags, i.e., each startElement("foo") is
> matched by a closing endElement("foo").
> I only did a basic nesting test, plus a check that <p> is never
> embedded inside another <p>; we could strengthen this further to check
> that every tag appears only in valid parents...
> I was able to use this to fix issues with the new RTF parser
> (TIKA-683), but I was surprised that some other parsers failed the new
> asserts.
> These could be relatively minor offenses (e.g., closing a table
> without closing the tr), and we may not need to do anything here... but I
> think it'd be cleaner if all our parsers produced matched, well-formed
> XHTML events.
> I haven't looked into any of these... they could be easy to fix.
> Failures:
> {noformat}
> testOutlookHTMLVersion(org.apache.tika.parser.microsoft.OutlookParserTest)  Time elapsed: 0.032 sec  <<< ERROR!
> java.lang.AssertionError: end tag=body with no startElement
> 	at org.apache.tika.sax.SafeContentHandler.verifyEndElement(SafeContentHandler.java:224)
> 	at org.apache.tika.sax.SafeContentHandler.endElement(SafeContentHandler.java:275)
> 	at org.apache.tika.sax.XHTMLContentHandler.endDocument(XHTMLContentHandler.java:210)
> 	at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:242)
> 	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
> 	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
> 	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:129)
> 	at org.apache.tika.parser.microsoft.OutlookParserTest.testOutlookHTMLVersion(OutlookParserTest.java:158)
> testParseKeynote(org.apache.tika.parser.iwork.IWorkParserTest)  Time elapsed: 0.116 sec  <<< ERROR!
> java.lang.AssertionError: mismatched elements open=tr close=table
> 	at org.apache.tika.sax.SafeContentHandler.verifyEndElement(SafeContentHandler.java:226)
> 	at org.apache.tika.sax.SafeContentHandler.endElement(SafeContentHandler.java:275)
> 	at org.apache.tika.sax.XHTMLContentHandler.endElement(XHTMLContentHandler.java:252)
> 	at org.apache.tika.sax.XHTMLContentHandler.endElement(XHTMLContentHandler.java:287)
> 	at org.apache.tika.parser.iwork.KeynoteContentHandler.endElement(KeynoteContentHandler.java:136)
> 	at org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136)
> 	at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.endElement(AbstractSAXParser.java:601)
> 	at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEndElement(XMLDocumentFragmentScannerImpl.java:1782)
> 	at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2938)
> 	at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:648)
> 	at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:140)
> 	at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:511)
> 	at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:808)
> 	at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:737)
> 	at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:119)
> 	at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1205)
> 	at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:522)
> 	at javax.xml.parsers.SAXParser.parse(SAXParser.java:395)
> 	at javax.xml.parsers.SAXParser.parse(SAXParser.java:198)
> 	at org.apache.tika.parser.iwork.IWorkPackageParser.parse(IWorkPackageParser.java:190)
> 	at org.apache.tika.parser.iwork.IWorkParserTest.testParseKeynote(IWorkParserTest.java:49)
> testMultipart(org.apache.tika.parser.mail.RFC822ParserTest)  Time elapsed: 0.025 sec  <<< ERROR!
> java.lang.AssertionError: p inside p
> 	at org.apache.tika.sax.SafeContentHandler.verifyStartElement(SafeContentHandler.java:216)
> 	at org.apache.tika.sax.SafeContentHandler.startElement(SafeContentHandler.java:245)
> 	at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:241)
> 	at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
> 	at org.apache.tika.sax.xpath.MatchingContentHandler.startElement(MatchingContentHandler.java:60)
> 	at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
> 	at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
> 	at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
> 	at org.apache.tika.sax.SecureContentHandler.startElement(SecureContentHandler.java:203)
> 	at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
> 	at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
> 	at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
> 	at org.apache.tika.sax.SafeContentHandler.startElement(SafeContentHandler.java:267)
> 	at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:241)
> 	at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:271)
> 	at org.apache.tika.parser.txt.TXTParser.parse(TXTParser.java:128)
> 	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
> 	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
> 	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:129)
> 	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:145)
> 	at org.apache.tika.parser.mail.MailContentHandler.body(MailContentHandler.java:77)
> 	at org.apache.james.mime4j.parser.MimeStreamParser.parse(MimeStreamParser.java:101)
> 	at org.apache.tika.parser.mail.RFC822Parser.parse(RFC822Parser.java:72)
> 	at org.apache.tika.parser.mail.RFC822ParserTest.testMultipart(RFC822ParserTest.java:75)
> testUnusualFromAddress(org.apache.tika.parser.mail.RFC822ParserTest)  Time elapsed: 0.037 sec  <<< ERROR!
> java.lang.AssertionError: p inside p
> 	at org.apache.tika.sax.SafeContentHandler.verifyStartElement(SafeContentHandler.java:216)
> 	at org.apache.tika.sax.SafeContentHandler.startElement(SafeContentHandler.java:245)
> 	at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:241)
> 	at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
> 	at org.apache.tika.sax.xpath.MatchingContentHandler.startElement(MatchingContentHandler.java:60)
> 	at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
> 	at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
> 	at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
> 	at org.apache.tika.sax.SecureContentHandler.startElement(SecureContentHandler.java:203)
> 	at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
> 	at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
> 	at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
> 	at org.apache.tika.sax.SafeContentHandler.startElement(SafeContentHandler.java:267)
> 	at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:241)
> 	at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:283)
> 	at org.apache.tika.parser.html.HtmlHandler.startElementWithSafeAttributes(HtmlHandler.java:201)
> 	at org.apache.tika.parser.html.HtmlHandler.startElement(HtmlHandler.java:135)
> 	at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
> 	at org.apache.tika.parser.html.XHTMLDowngradeHandler.startElement(XHTMLDowngradeHandler.java:61)
> 	at org.ccil.cowan.tagsoup.Parser.push(Parser.java:794)
> 	at org.ccil.cowan.tagsoup.Parser.rectify(Parser.java:1061)
> 	at org.ccil.cowan.tagsoup.Parser.stagc(Parser.java:1016)
> 	at org.ccil.cowan.tagsoup.HTMLScanner.scan(HTMLScanner.java:472)
> 	at org.ccil.cowan.tagsoup.Parser.parse(Parser.java:449)
> 	at org.apache.tika.parser.html.HtmlParser.parse(HtmlParser.java:202)
> 	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
> 	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
> 	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:129)
> 	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:145)
> 	at org.apache.tika.parser.mail.MailContentHandler.body(MailContentHandler.java:77)
> 	at org.apache.james.mime4j.parser.MimeStreamParser.parse(MimeStreamParser.java:101)
> 	at org.apache.tika.parser.mail.RFC822Parser.parse(RFC822Parser.java:72)
> 	at org.apache.tika.parser.mail.RFC822ParserTest.testUnusualFromAddress(RFC822ParserTest.java:166)
> testOO3(org.apache.tika.parser.odf.ODFParserTest)  Time elapsed: 0.003 sec  <<< ERROR!
> java.lang.AssertionError: p inside p
> 	at org.apache.tika.sax.SafeContentHandler.verifyStartElement(SafeContentHandler.java:216)
> 	at org.apache.tika.sax.SafeContentHandler.startElement(SafeContentHandler.java:245)
> 	at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:241)
> 	at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
> 	at org.apache.tika.sax.ElementMappingContentHandler.startElement(ElementMappingContentHandler.java:54)
> 	at org.apache.tika.parser.odf.OpenDocumentContentParser$1.startElement(OpenDocumentContentParser.java:271)
> 	at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
> 	at org.apache.tika.parser.odf.NSNormalizerContentHandler.startElement(NSNormalizerContentHandler.java:68)
> 	at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
> 	at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.startElement(AbstractSAXParser.java:501)
> 	at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.scanStartElement(XMLNSDocumentScannerImpl.java:400)
> 	at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2755)
> 	at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:648)
> 	at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:140)
> 	at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:511)
> 	at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:808)
> 	at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:737)
> 	at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:119)
> 	at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1205)
> 	at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:522)
> 	at javax.xml.parsers.SAXParser.parse(SAXParser.java:395)
> 	at javax.xml.parsers.SAXParser.parse(SAXParser.java:198)
> 	at org.apache.tika.parser.odf.OpenDocumentContentParser.parse(OpenDocumentContentParser.java:335)
> 	at org.apache.tika.parser.odf.OpenDocumentParser.parse(OpenDocumentParser.java:125)
> 	at org.apache.tika.parser.odf.ODFParserTest.testOO3(ODFParserTest.java:49)
> testOO3Metadata(org.apache.tika.parser.odf.ODFParserTest)  Time elapsed: 0.001 sec  <<< ERROR!
> java.lang.AssertionError: p inside p
> 	at org.apache.tika.sax.SafeContentHandler.verifyStartElement(SafeContentHandler.java:216)
> 	at org.apache.tika.sax.SafeContentHandler.startElement(SafeContentHandler.java:245)
> 	at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:241)
> 	at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
> 	at org.apache.tika.sax.ElementMappingContentHandler.startElement(ElementMappingContentHandler.java:54)
> 	at org.apache.tika.parser.odf.OpenDocumentContentParser$1.startElement(OpenDocumentContentParser.java:271)
> 	at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
> 	at org.apache.tika.parser.odf.NSNormalizerContentHandler.startElement(NSNormalizerContentHandler.java:68)
> 	at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
> 	at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.startElement(AbstractSAXParser.java:501)
> 	at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.scanStartElement(XMLNSDocumentScannerImpl.java:400)
> 	at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2755)
> 	at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:648)
> 	at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:140)
> 	at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:511)
> 	at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:808)
> 	at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:737)
> 	at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:119)
> 	at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1205)
> 	at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:522)
> 	at javax.xml.parsers.SAXParser.parse(SAXParser.java:395)
> 	at javax.xml.parsers.SAXParser.parse(SAXParser.java:198)
> 	at org.apache.tika.parser.odf.OpenDocumentContentParser.parse(OpenDocumentContentParser.java:335)
> 	at org.apache.tika.parser.odf.OpenDocumentParser.parse(OpenDocumentParser.java:125)
> 	at org.apache.tika.parser.AbstractParser.parse(AbstractParser.java:48)
> 	at org.apache.tika.parser.odf.ODFParserTest.testOO3Metadata(ODFParserTest.java:168)
> testZipBombPrevention(org.apache.tika.parser.AutoDetectParserTest)  Time elapsed: 0.055 sec  <<< ERROR!
> java.lang.AssertionError: mismatched elements open=p close=div
> 	at org.apache.tika.sax.SafeContentHandler.verifyEndElement(SafeContentHandler.java:226)
> 	at org.apache.tika.sax.SafeContentHandler.endElement(SafeContentHandler.java:275)
> 	at org.apache.tika.sax.XHTMLContentHandler.endElement(XHTMLContentHandler.java:252)
> 	at org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:106)
> 	at org.apache.tika.parser.pkg.PackageExtractor.unpack(PackageExtractor.java:167)
> 	at org.apache.tika.parser.pkg.PackageExtractor.parse(PackageExtractor.java:107)
> 	at org.apache.tika.parser.pkg.PackageParser.parse(PackageParser.java:61)
> 	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
> 	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
> 	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:129)
> 	at org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72)
> 	at org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:95)
> 	at org.apache.tika.parser.pkg.PackageExtractor.decompress(PackageExtractor.java:135)
> 	at org.apache.tika.parser.pkg.PackageExtractor.parse(PackageExtractor.java:93)
> 	at org.apache.tika.parser.pkg.PackageParser.parse(PackageParser.java:61)
> 	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
> 	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
> 	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:129)
> 	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:145)
> 	at org.apache.tika.parser.AutoDetectParserTest.testZipBombPrevention(AutoDetectParserTest.java:224)
> {noformat}
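
For readers who want to reproduce this kind of check outside Tika, here is a minimal sketch of the two tests described in the issue: matched start/end elements, and no <p> opened while another <p> is still open. It is not the code committed in TIKA-683. The class name NestingCheckHandler is hypothetical, the error messages only imitate the ones in the traces above, and treating any open ancestor <p> (not just the direct parent) as a violation is an assumption.

```java
import java.util.ArrayDeque;
import java.util.Deque;

import org.xml.sax.Attributes;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical checker for illustration; not Tika's SafeContentHandler. */
final class NestingCheckHandler extends DefaultHandler {

    // Local names of the currently open elements, innermost first.
    private final Deque<String> open = new ArrayDeque<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // Assumption: a <p> anywhere among the open ancestors counts as "p inside p".
        if ("p".equals(localName) && open.contains("p")) {
            throw new AssertionError("p inside p");
        }
        open.push(localName);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (open.isEmpty()) {
            throw new AssertionError("end tag=" + localName + " with no startElement");
        }
        String expected = open.pop();
        if (!expected.equals(localName)) {
            throw new AssertionError(
                    "mismatched elements open=" + expected + " close=" + localName);
        }
    }

    public static void main(String[] args) {
        NestingCheckHandler handler = new NestingCheckHandler();
        Attributes none = new AttributesImpl();
        handler.startElement("", "table", "table", none);
        handler.startElement("", "tr", "tr", none);
        try {
            // Closing the table while the tr is still open, as in the Keynote failure.
            handler.endElement("", "table", "table");
        } catch (AssertionError e) {
            System.out.println(e.getMessage()); // mismatched elements open=tr close=table
        }
    }
}
```

A handler like this could be placed in front of any SAX ContentHandler chain in a test. Throwing AssertionError directly, rather than using the assert keyword, keeps the check active even when the JVM runs without -ea.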
