Posted to dev@tika.apache.org by "Erik Hetzner (JIRA)" <ji...@apache.org> on 2010/05/19 02:34:56 UTC
[jira] Created: (TIKA-427) Parsing CSS as XML
Parsing CSS as XML
------------------
Key: TIKA-427
URL: https://issues.apache.org/jira/browse/TIKA-427
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 0.7
Reporter: Erik Hetzner
Priority: Minor
Perhaps related to TIKA-426?
$ curl -s http://datacenter.cit.nih.gov/interface/styles/nihstyles.css | java -jar tika-app-0.7.jar
Exception in thread "main" org.apache.tika.exception.TikaException: TIKA-237: Illegal SAXException from org.apache.tika.parser.xml.DcXMLParser@28bb0d0d
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:142)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:99)
    at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:155)
    at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:65)
Caused by: org.xml.sax.SAXParseException: Content is not allowed in prolog.
    at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.createSAXParseException(ErrorHandlerWrapper.java:195)
    at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.fatalError(ErrorHandlerWrapper.java:174)
    at com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:388)
    at com.sun.org.apache.xerces.internal.impl.XMLScanner.reportFatalError(XMLScanner.java:1414)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl$PrologDriver.next(XMLDocumentScannerImpl.java:1039)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:648)
    at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:140)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:511)
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:808)
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:737)
    at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:119)
    at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1205)
    at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:522)
    at javax.xml.parsers.SAXParser.parse(SAXParser.java:395)
    at javax.xml.parsers.SAXParser.parse(SAXParser.java:198)
    at org.apache.tika.parser.xml.XMLParser.parse(XMLParser.java:86)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:132)
    ... 3 more
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (TIKA-427) Parsing CSS as XML
Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/TIKA-427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12871611#action_12871611 ]
Jukka Zitting commented on TIKA-427:
------------------------------------
The type detection code in Tika gets confused by the <!-- comment --> at the beginning of the file.
We should probably make the XML detector look beyond the first comment(s) to see what the followup text really looks like. Alternatively we could capture early parse errors in the XMLParser class and fall back to TXTParser in such cases.
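To make the two options above concrete, here is a rough sketch using only the JDK. It is not Tika code and not the fix that was eventually applied: the class name XmlSniffSketch and both method names are made up for illustration. looksLikeXml() skips leading whitespace and <!-- ... --> comments before deciding, and parseWithFallback() tries a SAX parse and reports "text" on any SAXParseException (the comment above suggests catching only early errors; this sketch does not distinguish early from late ones).

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper for illustration only; none of these names exist in Tika.
final class XmlSniffSketch {

    private XmlSniffSketch() {}

    /**
     * Option 1: look beyond the first comment(s). Returns true only if the
     * first thing after leading whitespace and comments starts with '<'.
     */
    static boolean looksLikeXml(String prefix) {
        int i = 0;
        final int n = prefix.length();
        while (true) {
            // Skip whitespace before and between comments.
            while (i < n && Character.isWhitespace(prefix.charAt(i))) {
                i++;
            }
            if (!prefix.startsWith("<!--", i)) {
                break;
            }
            int end = prefix.indexOf("-->", i + 4);
            if (end < 0) {
                return false; // comment not closed within the prefix: do not claim XML
            }
            i = end + 3;
        }
        return i < n && prefix.charAt(i) == '<';
    }

    /**
     * Option 2: attempt an XML parse and fall back to plain text when the
     * parser rejects the input. Returns "xml" or "text".
     */
    static String parseWithFallback(byte[] content) {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new ByteArrayInputStream(content), new DefaultHandler());
            return "xml";
        } catch (SAXParseException e) {
            // e.g. "Content is not allowed in prolog." for CSS behind a comment
            return "text";
        } catch (SAXException | ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

The input is held in a byte array so that a real fallback parser could re-read it from the start. With a one-shot stream, the caller would need to buffer or mark/reset before attempting the XML parse.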
[jira] Resolved: (TIKA-427) Parsing CSS as XML
Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/TIKA-427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jukka Zitting resolved TIKA-427.
--------------------------------
Assignee: Jukka Zitting
Fix Version/s: 0.8
Resolution: Duplicate
The TIKA-426 fix helps here as well, so resolving as a duplicate of that issue.