Posted to dev@tika.apache.org by "David Morana (JIRA)" <ji...@apache.org> on 2013/04/04 22:14:15 UTC

[jira] [Updated] (TIKA-1101) XML parse error caused by org.xml.sax.SAXParseException;The entity "nbsp" was referenced, but not declared

     [ https://issues.apache.org/jira/browse/TIKA-1101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

David Morana updated TIKA-1101:
-------------------------------

    Description: 
Good afternoon,
When ManifoldCF crawls the web page below, Solr logs a SEVERE error and ManifoldCF aborts the current job.
I verified the error by sending the URL to tika-app 1.2 and 1.3.
I can't find a fix for this.
Please advise.
P.S. Can you also provide a list of all of Tika's supporting JARs (e.g. POI, JempBox, etc.)?
Thanks,

Here's the HTML:
{code}
<div id="leftcol">
	  <ul>
        <li><a href="/mission/sec/sec.html"> Security and Information Sciences Home&nbsp;&rsaquo;</a>        </li>
        <li><a href="/mission/sec/publications/-publications.html">Publications&nbsp;&rsaquo;</a> </li>
        <li><a href="/mission/sec/corpora/corpora.html">Corpora&nbsp;&rsaquo;</a> </li>
        <li><a href="/mission/sec/softwaretools/tools.html">Software Tools&nbsp;&rsaquo;</a></li>
        <li><a href="/mission/sec/CSO/CSO.html"> Systems and Operations&nbsp;&rsaquo;</a>
          <ul>
            <li><a href="/mission/sec/publications/-publications.html">Publications &rsaquo;</a></li>
            <li><a href="/mission/sec/CSO/biographies/CSObios.html">Biographies&nbsp;&rsaquo;</a></li>
          </ul>
        </li>
        <li><a href="/mission/sec/CST/CST.html"> Systems and Technology&nbsp;&rsaquo;</a> </li>
        <li><a href="/mission/sec/CSA/CSA.html"> System Assessments&nbsp;&rsaquo;</a> </li>
	    <li><a href="/mission/sec/HLT/HLT.html">Human Language Technology&nbsp;&rsaquo;</a>
<li><a href="/mission/sec/computing/computing.html">Computing and Analytics&nbsp;&rsaquo;</a></li>
  </ul>
</div>
{code}

Here's the error:
{code}
Apr 03, 2013 4:23:23 PM org.apache.solr.common.SolrException log
SEVERE: org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: XML parse error
	at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:225)
	at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
	at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
	at org.apache.solr.core.SolrCore.execute(SolrCore.java:1699)
	at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:455)
	at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:276)
	at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243)
	at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)
	at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:222)
	at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:123)
	at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:581)
	at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:171)
	at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:99)
	at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:936)
	at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:118)
	at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:407)
	at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1004)
	at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:589)
	at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.run(NioEndpoint.java:1686)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
	at java.lang.Thread.run(Thread.java:722)
Caused by: org.apache.tika.exception.TikaException: XML parse error
	at org.apache.tika.parser.xml.XMLParser.parse(XMLParser.java:78)
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
	at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:219)
	... 21 more
Caused by: org.xml.sax.SAXParseException; lineNumber: 4; columnNumber: 105; The entity "nbsp" was referenced, but not declared.
	at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.createSAXParseException(ErrorHandlerWrapper.java:198)
	at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.fatalError(ErrorHandlerWrapper.java:177)
	at com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:441)
	at com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:368)
	at com.sun.org.apache.xerces.internal.impl.XMLScanner.reportFatalError(XMLScanner.java:1388)
	at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEntityReference(XMLDocumentFragmentScannerImpl.java:1861)
	at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2994)
	at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:607)
	at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:116)
	at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:489)
	at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:835)
	at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:764)
	at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:123)
	at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1210)
	at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:568)
	at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl.parse(SAXParserImpl.java:302)
	at javax.xml.parsers.SAXParser.parse(SAXParser.java:195)
	at org.apache.tika.parser.xml.XMLParser.parse(XMLParser.java:72)
	... 25 more
{code}
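
The stack trace shows the page being handled by org.apache.tika.parser.xml.XMLParser, which hands it to the JDK's SAX parser. XML itself predefines only five entities (lt, gt, amp, apos, quot), so &nbsp; and &rsaquo; are fatal errors unless a DTD declares them. The following is a minimal sketch that reproduces the same failure with the JDK parser alone, without Tika or Solr. The class name NbspEntityRepro and the method parseError are hypothetical helpers made up for this illustration, and the markup is one list item taken from the HTML above.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper for illustration only; not part of Tika, Solr, or ManifoldCF.
class NbspEntityRepro {

    // Returns null if the markup parses as XML, otherwise the parser's error message.
    static String parseError(String markup) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            // DefaultHandler rethrows fatal errors, so an undeclared entity
            // surfaces here as a SAXParseException.
            factory.newSAXParser().parse(
                    new InputSource(new StringReader(markup)), new DefaultHandler());
            return null;
        } catch (SAXException e) {
            return e.getMessage();
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Named HTML entities, as in the crawled page.
        String withNamedEntities =
                "<li><a href=\"/mission/sec/corpora/corpora.html\">Corpora&nbsp;&rsaquo;</a></li>";
        // The same characters written as numeric references (U+00A0 and U+203A).
        String withNumericReferences =
                "<li><a href=\"/mission/sec/corpora/corpora.html\">Corpora&#160;&#8250;</a></li>";

        System.out.println("named entities:     " + parseError(withNamedEntities));
        System.out.println("numeric references: " + parseError(withNumericReferences));
    }
}
```

The first call is expected to return the parser's complaint about the undeclared entity, and the second to return null, because numeric character references need no DTD. This only illustrates why an XML parser rejects the page; it does not show how Tika chooses between its XML and HTML parsers.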


    
> XML parse error caused by org.xml.sax.SAXParseException;The entity "nbsp" was referenced, but not declared
> ----------------------------------------------------------------------------------------------------------
>
>                 Key: TIKA-1101
>                 URL: https://issues.apache.org/jira/browse/TIKA-1101
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.2, 1.3
>         Environment: I'm using solr 4.0 final with tika 1.2 and ManifoldCF v1.2 dev on tomcat 7 (RHL)
>            Reporter: David Morana
>             Fix For: 1.2, 1.3
