Posted to dev@tika.apache.org by "Jukka Zitting (JIRA)" <ji...@apache.org> on 2009/06/03 12:28:07 UTC

[jira] Created: (TIKA-239) System.err prints from XmlRootExtractor

System.err prints from XmlRootExtractor
---------------------------------------

                 Key: TIKA-239
                 URL: https://issues.apache.org/jira/browse/TIKA-239
             Project: Tika
          Issue Type: Bug
          Components: parser
            Reporter: Jukka Zitting


The XmlRootExtractor is often given non-XML files to look at, which causes the XML parser to fail with error messages. It looks like the default behaviour is to print some of these error messages to System.err, as shown below:

$ java -jar tika-app-0.4-SNAPSHOT.jar --text lucene-2.2.0-src.zip > /dev/null
java.io.EOFException
	at com.sun.org.apache.xerces.internal.impl.XMLDTDScannerImpl.endEntity(XMLDTDScannerImpl.java:662)
	at com.sun.org.apache.xerces.internal.impl.XMLEntityManager.endEntity(XMLEntityManager.java:1393)
	at com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.load(XMLEntityScanner.java:1763)
	at com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.skipSpaces(XMLEntityScanner.java:1543)
	at com.sun.org.apache.xerces.internal.impl.XMLDTDScannerImpl.skipSeparator(XMLDTDScannerImpl.java:2055)
	at com.sun.org.apache.xerces.internal.impl.XMLDTDScannerImpl.scanEntityDecl(XMLDTDScannerImpl.java:1533)
	at com.sun.org.apache.xerces.internal.impl.XMLDTDScannerImpl.scanDecls(XMLDTDScannerImpl.java:1986)
	at com.sun.org.apache.xerces.internal.impl.XMLDTDScannerImpl.scanDTDInternalSubset(XMLDTDScannerImpl.java:377)
	at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl$DTDDriver.dispatch(XMLDocumentScannerImpl.java:1140)
	at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl$DTDDriver.next(XMLDocumentScannerImpl.java:1089)
	at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl$PrologDriver.next(XMLDocumentScannerImpl.java:976)
	at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:648)
	at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:140)
	at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:510)
	at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:807)
	at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:737)
	at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:107)
	at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1205)
	at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:522)
	at javax.xml.parsers.SAXParser.parse(SAXParser.java:395)
	at javax.xml.parsers.SAXParser.parse(SAXParser.java:198)
	at org.apache.tika.detect.XmlRootExtractor.extractRootElement(XmlRootExtractor.java:55)
	at org.apache.tika.mime.MimeTypes.getMimeType(MimeTypes.java:219)
	at org.apache.tika.mime.MimeTypes.detect(MimeTypes.java:514)
	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:76)
	at org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:83)
	at org.apache.tika.parser.pkg.PackageParser.parseEntry(PackageParser.java:65)
	at org.apache.tika.parser.pkg.ZipParser.parse(ZipParser.java:56)
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:119)
	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:85)
	at org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:83)
	at org.apache.tika.parser.pkg.PackageParser.parseEntry(PackageParser.java:65)
	at org.apache.tika.parser.pkg.ZipParser.parse(ZipParser.java:56)
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:119)
	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:85)
	at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:116)
	at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:57)

Tika should never print anything to System.out or System.err unless explicitly instructed to do so.
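
For illustration only, here is a minimal sketch of a root-element probe that reports parse failures to the caller instead of the console. It assumes the noise comes from the parser's default error reporting. The trace above is raised inside the JDK's bundled Xerces, so an explicit handler may or may not silence that particular output. QuietRootProbe and rootElement are hypothetical names, not Tika's XmlRootExtractor API.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical helper for illustration; not Tika's XmlRootExtractor. */
public class QuietRootProbe {

    /** Records the first element seen, then aborts the parse. */
    private static final class RootHandler extends DefaultHandler {
        String root; // stays null if no element was ever reached

        @Override
        public void startElement(String uri, String localName,
                                 String qName, Attributes atts)
                throws SAXException {
            root = localName;
            // Aborting here avoids reading the rest of the document.
            throw new SAXException("root element found, aborting parse");
        }

        // SAXParser.parse(InputStream, DefaultHandler) also installs the
        // handler as the ErrorHandler. Warnings and recoverable errors are
        // dropped here; fatalError is inherited and rethrows to the caller.
        @Override
        public void warning(SAXParseException e) { }

        @Override
        public void error(SAXParseException e) { }
    }

    /** Returns the local name of the root element, or null for non-XML input. */
    public static String rootElement(byte[] data) {
        RootHandler handler = new RootHandler();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true); // needed for localName to be set
            SAXParser parser = factory.newSAXParser();
            parser.parse(new ByteArrayInputStream(data), handler);
        } catch (SAXException | IOException | ParserConfigurationException e) {
            // Expected both for the deliberate abort and for non-XML input;
            // nothing is printed, the result is conveyed by the return value.
        }
        return handler.root;
    }
}
```

Calling QuietRootProbe.rootElement on the bytes of a ZIP entry would simply return null. Any logging decision is left to the caller, which is the behaviour this issue asks for.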

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Resolved: (TIKA-239) System.err prints from XmlRootExtractor

Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting resolved TIKA-239.
--------------------------------

    Resolution: Cannot Reproduce

Not sure what we've changed related to this, but I can no longer reproduce this problem with Tika 0.6.
