Posted to dev@tika.apache.org by "Erik Hetzner (JIRA)" <ji...@apache.org> on 2010/05/19 02:34:56 UTC

[jira] Created: (TIKA-427) Parsing CSS as XML

Parsing CSS as XML
------------------

                 Key: TIKA-427
                 URL: https://issues.apache.org/jira/browse/TIKA-427
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 0.7
            Reporter: Erik Hetzner
            Priority: Minor


Perhaps related to TIKA-426?

$ curl -s http://datacenter.cit.nih.gov/interface/styles/nihstyles.css | java -jar tika-app-0.7.jar
Exception in thread "main" org.apache.tika.exception.TikaException: TIKA-237: Illegal SAXException from org.apache.tika.parser.xml.DcXMLParser@28bb0d0d
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:142)
	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:99)
	at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:155)
	at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:65)
Caused by: org.xml.sax.SAXParseException: Content is not allowed in prolog.
	at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.createSAXParseException(ErrorHandlerWrapper.java:195)
	at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.fatalError(ErrorHandlerWrapper.java:174)
	at com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:388)
	at com.sun.org.apache.xerces.internal.impl.XMLScanner.reportFatalError(XMLScanner.java:1414)
	at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl$PrologDriver.next(XMLDocumentScannerImpl.java:1039)
	at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:648)
	at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:140)
	at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:511)
	at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:808)
	at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:737)
	at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:119)
	at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1205)
	at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:522)
	at javax.xml.parsers.SAXParser.parse(SAXParser.java:395)
	at javax.xml.parsers.SAXParser.parse(SAXParser.java:198)
	at org.apache.tika.parser.xml.XMLParser.parse(XMLParser.java:86)
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:132)
	... 3 more


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (TIKA-427) Parsing CSS as XML

Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12871611#action_12871611 ] 

Jukka Zitting commented on TIKA-427:
------------------------------------

The type detection code in Tika gets confused by the <!-- comment --> at the beginning of the file.
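The failure can probably be reproduced without Tika. The sketch below is only an illustration: the class name CssAsXmlRepro, the helper parsesAsXml, and the sample stylesheet text are made up, since the real nihstyles.css content is not shown in this thread. It feeds text to the JDK's SAX parser and reports whether the text is accepted as well-formed XML. A leading comment is legal in the XML prolog, but the CSS rules after it are not, which matches the "Content is not allowed in prolog" error in the trace.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical reproduction helper; not part of Tika.
final class CssAsXmlRepro {

    private CssAsXmlRepro() {}

    /** Returns true if the JDK's SAX parser accepts the text as well-formed XML. */
    static boolean parsesAsXml(String text) {
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                    new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8)),
                    new DefaultHandler()); // DefaultHandler rethrows fatal errors
            return true;
        } catch (SAXException e) {
            // Includes SAXParseException, e.g. for content in the prolog.
            return false;
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException("Unexpected parser setup or I/O failure", e);
        }
    }

    public static void main(String[] args) {
        // Assumed stand-in for a stylesheet that opens with an SGML-style comment.
        String css = "<!-- site styles -->\nbody { color: red; }\n";
        System.out.println(parsesAsXml(css));
        System.out.println(parsesAsXml("<!-- note --><root/>"));
    }
}
```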

We should probably make the XML detector look beyond the first comment(s) to see what the text that follows really looks like. Alternatively, we could catch early parse errors in the XMLParser class and fall back to TXTParser in such cases.
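A minimal sketch of the first idea, assuming the detector works on a short decoded prefix of the stream. The class LeadingCommentSniffer and its method looksLikeXml are hypothetical names for illustration, not Tika's actual detector API, and the checks are deliberately rough rather than a full implementation of the XML grammar.

```java
// Hypothetical sketch; not Tika's actual detection code.
final class LeadingCommentSniffer {

    private LeadingCommentSniffer() {}

    /**
     * Skips leading whitespace and <!-- ... --> comments, then checks whether
     * what follows starts like XML: an XML declaration, a DOCTYPE, or a start tag.
     */
    static boolean looksLikeXml(String head) {
        int i = 0;
        final int n = head.length();
        while (true) {
            while (i < n && Character.isWhitespace(head.charAt(i))) {
                i++;
            }
            if (!head.startsWith("<!--", i)) {
                break;
            }
            int end = head.indexOf("-->", i + 4);
            if (end < 0) {
                // The comment does not close within the sniffed prefix, so
                // there is no evidence of XML; treat it as not XML.
                return false;
            }
            i = end + 3;
        }
        if (head.startsWith("<?xml", i) || head.startsWith("<!DOCTYPE", i)) {
            return true;
        }
        if (i + 1 < n && head.charAt(i) == '<') {
            // Rough approximation of an XML name start character.
            char c = head.charAt(i + 1);
            return Character.isLetter(c) || c == '_' || c == ':';
        }
        return false;
    }
}
```

With this check, a prefix such as "<!-- comment -->" followed by CSS rules would not be reported as XML, while a comment followed by a root element still would. Returning false for an unterminated comment is a design choice in this sketch; a real detector might instead read more of the stream before deciding.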



[jira] Resolved: (TIKA-427) Parsing CSS as XML

Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting resolved TIKA-427.
--------------------------------

         Assignee: Jukka Zitting
    Fix Version/s: 0.8
       Resolution: Duplicate

The TIKA-426 fix helps here as well, so resolving as a duplicate of that issue.
