Posted to solr-user@lucene.apache.org by Marcello Lorenzi <ml...@sorint.it> on 2013/11/14 15:26:23 UTC
Solr xml img parsing exception
Hi,
I have installed a Solr 4.3 instance and configured ManifoldCF to pass
web content to the shard collection, but during crawling we have seen
many occurrences of this exception:
ERROR - 2013-11-14 15:13:57.954; org.apache.solr.common.SolrException; org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: XML parse error
    at com.lsegroup.solr.handler.CwsExtractingDocumentLoader.load(CwsExtractingDocumentLoader.java:150)
    at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
    at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:242)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1816)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:656)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:359)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:155)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:221)
    at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:107)
    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:155)
    at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:76)
    at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:934)
    at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:90)
    at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:515)
    at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1012)
    at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:642)
    at org.apache.coyote.http11.Http11NioProtocol$Http11ConnectionHandler.process(Http11NioProtocol.java:223)
    at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun(NioEndpoint.java:1597)
    at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.run(NioEndpoint.java:1555)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:724)
Caused by: org.apache.tika.exception.TikaException: XML parse error
    at org.apache.tika.parser.xml.XMLParser.parse(XMLParser.java:78)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at com.lsegroup.solr.handler.CwsExtractingDocumentLoader.load(CwsExtractingDocumentLoader.java:147)
    ... 24 more
Caused by: org.xml.sax.SAXParseException; lineNumber: 91; columnNumber: 105; The element type "img" must be terminated by the matching end-tag "</img>".
    at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.createSAXParseException(ErrorHandlerWrapper.java:198)
    at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.fatalError(ErrorHandlerWrapper.java:177)
    at com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:441)
    at com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:368)
    at com.sun.org.apache.xerces.internal.impl.XMLScanner.reportFatalError(XMLScanner.java:1388)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEndElement(XMLDocumentFragmentScannerImpl.java:1753)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2951)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:606)
    at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:116)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:511)
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:846)
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:775)
    at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:123)
    at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1210)
    at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:628)
    at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl.parse(SAXParserImpl.java:332)
    at javax.xml.parsers.SAXParser.parse(SAXParser.java:195)
    at org.apache.tika.parser.xml.XMLParser.parse(XMLParser.java:72)
    ... 28 more
Could the Solr collection be misconfigured?
Thanks,
Marcello
Re: Solr xml img parsing exception
Posted by Marcello Lorenzi <ml...@sorint.it>.
Hi Erik,
but in this case, does the custom loader receive an HTTP 500 error from Solr?
Thanks,
Marcello
On 11/14/2013 04:29 PM, Erik Hatcher wrote:
> Also there's a custom loader here that is the culprit: com.lsegroup.solr.handler.CwsExtractingDocumentLoader
Re: Solr xml img parsing exception
Posted by Erik Hatcher <er...@gmail.com>.
Also there's a custom loader here that is the culprit: com.lsegroup.solr.handler.CwsExtractingDocumentLoader
On Nov 14, 2013, at 10:20, Erick Erickson <er...@gmail.com> wrote:
> It looks like bad data. The XML you're sending to Solr looks malformed, so I
> suspect this is completely outside of Solr's purview.
Re: Solr xml img parsing exception
Posted by Erick Erickson <er...@gmail.com>.
It looks like bad data. The XML you're sending to Solr looks malformed, so I
suspect this is completely outside of Solr's purview.
Best,
Erick
Re: Solr xml img parsing exception
Posted by Marcello Lorenzi <ml...@sorint.it>.
Hi Jack,
we have analyzed the issue and found duplicate Tika JARs on the Tomcat
classpath. After removing the duplicate library, the search engine
works as expected.
Thanks for the support,
Marcello
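One way to spot this kind of duplication is to ask the class loader for every location that provides a given class. The following is a minimal sketch, not something from the thread: the class name DuplicateClassFinder and the helper locations are hypothetical, and the Tika class used as the default is simply one that appears in the stack trace. Note the assumption that the code runs inside the affected web application (for example from a servlet); run standalone, it only sees the command-line classpath.

```java
import java.io.IOException;
import java.net.URL;
import java.util.Collections;
import java.util.List;

// Hypothetical helper: lists every classpath location that provides a class.
final class DuplicateClassFinder {

    // Returns all URLs from which the given class could be loaded.
    static List<URL> locations(ClassLoader loader, String className) throws IOException {
        String resource = className.replace('.', '/') + ".class";
        return Collections.list(loader.getResources(resource));
    }

    public static void main(String[] args) throws IOException {
        // Default to a Tika class seen in the stack trace (an illustrative choice).
        String className = args.length > 0
                ? args[0]
                : "org.apache.tika.parser.AutoDetectParser";
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        List<URL> found = locations(loader, className);
        for (URL url : found) {
            System.out.println(url);
        }
        if (found.size() > 1) {
            System.out.println("More than one copy of " + className + " is visible.");
        }
    }
}
```

If more than one URL is printed for the same class, two JARs provide it, and which copy wins depends on class loader ordering.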
On 11/14/2013 05:24 PM, Jack Krupansky wrote:
> The actual error appears to be:
>
> Caused by: org.xml.sax.SAXParseException; lineNumber: 91; columnNumber:
> 105; The element type "img" must be terminated by the matching end-tag
> "</img>".
>
> So, check the input document at line 91, column 105. There should be
> an <img> tag there, but SAX is complaining that there is no matching
> </img>.
>
> -- Jack Krupansky
Re: Solr xml img parsing exception
Posted by Jack Krupansky <ja...@basetechnology.com>.
The actual error appears to be:
Caused by: org.xml.sax.SAXParseException; lineNumber: 91; columnNumber: 105; The element type "img" must be terminated by the matching end-tag "</img>".
So, check the input document at line 91, column 105. There should be an
<img> tag there, but SAX is complaining that there is no matching </img>.
-- Jack Krupansky
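The SAX complaint can be reproduced outside Solr with the JDK's own parser. This is a minimal sketch under stated assumptions: the class ImgTagCheck, the method check, and both sample strings are hypothetical and are not taken from the crawled document. It shows that an HTML-style unclosed img element is rejected by a strict XML parser, while the self-closed form is accepted.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: checks whether a string is well-formed XML.
final class ImgTagCheck {

    // Returns null if the input parses, otherwise "line:column message".
    static String check(String xml) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        byte[] bytes = xml.getBytes(StandardCharsets.UTF_8);
        try {
            // DefaultHandler rethrows fatal errors as SAXParseException.
            parser.parse(new ByteArrayInputStream(bytes), new DefaultHandler());
            return null;
        } catch (SAXParseException e) {
            return e.getLineNumber() + ":" + e.getColumnNumber() + " " + e.getMessage();
        }
    }

    public static void main(String[] args) throws Exception {
        // Valid HTML, but not well-formed XML: img is never closed.
        String htmlStyle = "<div><img src=\"logo.png\"></div>";
        // Well-formed XML: img is self-closed.
        String xmlStyle = "<div><img src=\"logo.png\"/></div>";
        System.out.println("HTML-style: " + check(htmlStyle));
        System.out.println("XML-style:  " + check(xmlStyle));
    }
}
```

The reported line and column correspond to the lineNumber and columnNumber fields in the exception above, which is how the offending position in the input document can be located.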