Posted to dev@tika.apache.org by "Nick Burch (JIRA)" <ji...@apache.org> on 2013/07/25 12:09:54 UTC
[jira] [Commented] (TIKA-1154) Tika hangs on format detection of malformed HTML file.

    [ https://issues.apache.org/jira/browse/TIKA-1154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13719468#comment-13719468 ] 

Nick Burch commented on TIKA-1154:
----------------------------------

Stack trace for the hang seems to be:

	at org.apache.xerces.impl.XMLScanner.scanExternalID(Unknown Source)
	at org.apache.xerces.impl.XMLDocumentScannerImpl.scanDoctypeDecl(Unknown Source)
	at org.apache.xerces.impl.XMLDocumentScannerImpl$PrologDispatcher.dispatch(Unknown Source)
	at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
	at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
	at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
	at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
	at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
	at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
	at org.apache.xerces.jaxp.SAXParserImpl.parse(Unknown Source)
	at javax.xml.parsers.SAXParser.parse(SAXParser.java:195)
	at org.apache.tika.detect.XmlRootExtractor.extractRootElement(XmlRootExtractor.java:54)
	at org.apache.tika.detect.XmlRootExtractor.extractRootElement(XmlRootExtractor.java:41)
	at org.apache.tika.mime.MimeTypes.getMimeType(MimeTypes.java:192)
	at org.apache.tika.mime.MimeTypes.detect(MimeTypes.java:439)
	at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:61)
	at org.apache.tika.cli.TikaCLI$10.process(TikaCLI.java:252)
	at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:417)
	at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:111)

Perhaps we need to tweak the options/configuration we pass to the SAX parser when we ask it to work out what kind of XML this is, so that it avoids this hang?
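
As a rough illustration of the kind of tweak meant here, below is a minimal sketch of a root-element sniffer with a more defensive SAX configuration. The class and method names (RootElementSniffer, sniffRoot) are hypothetical and this is not the actual XmlRootExtractor code. It turns on secure processing, asks the parser not to load external DTDs (a Xerces feature URI, which other parsers may not recognise), resolves any external entity to an empty document, and stops as soon as the root element is seen. I have not verified that any of these settings avoids the hang in scanExternalID on the attached file.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper for illustration only; not part of Tika.
final class RootElementSniffer {

    /** Thrown on purpose to stop parsing once the root element is known. */
    private static final class StopParsing extends SAXException {
        StopParsing() {
            super("root element found");
        }
    }

    private static final class RootHandler extends DefaultHandler {
        String root;

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts)
                throws SAXException {
            // Record "{namespace}local" (or just "local") and abort the parse.
            root = uri.isEmpty() ? localName : "{" + uri + "}" + localName;
            throw new StopParsing();
        }

        @Override
        public InputSource resolveEntity(String publicId, String systemId) {
            // Never fetch an external DTD or entity; hand back an empty document.
            return new InputSource(new StringReader(""));
        }
    }

    /** Returns the root element name, or null if none could be determined. */
    static String sniffRoot(byte[] prefix) throws ParserConfigurationException, IOException {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        factory.setValidating(false);
        try {
            factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
            // Xerces-specific feature; assumed to be supported by the parser in use.
            factory.setFeature(
                    "http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
        } catch (SAXException | ParserConfigurationException e) {
            // The parser does not support one of the features; carry on without it.
        }

        RootHandler handler = new RootHandler();
        try {
            SAXParser parser = factory.newSAXParser();
            parser.parse(new ByteArrayInputStream(prefix), handler);
        } catch (SAXException e) {
            // Either our deliberate StopParsing or a malformed document.
            // In both cases, report whatever root (if any) was seen.
        }
        return handler.root;
    }
}
```

Note that configuration alone cannot bound the time spent if the scanner itself loops, so running the detection call with a timeout might still be needed as a fallback.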
                
> Tika hangs on format detection of malformed HTML file.
> ------------------------------------------------------
>
>                 Key: TIKA-1154
>                 URL: https://issues.apache.org/jira/browse/TIKA-1154
>             Project: Tika
>          Issue Type: Bug
>          Components: mime
>    Affects Versions: 1.4
>            Reporter: Andrew Jackson
>            Priority: Minor
>         Attachments: tika-breaker.html
>
>
> We are using Tika on large web archives, which also happen to contain some malformed files. In particular, we found an HTML file with binary characters in the DOCTYPE declaration. This hangs Tika, either embedded or from the command line, during format detection.
> An example file is attached.
