Posted to solr-user@lucene.apache.org by Marcello Lorenzi <ml...@sorint.it> on 2013/11/14 15:26:23 UTC

Solr xml img parsing exception

Hi,
I have installed a Solr 4.3 instance and we have configured ManifoldCF 
to pass web content to the shard collection, but during crawling we 
have noticed many occurrences of this exception:

ERROR - 2013-11-14 15:13:57.954; org.apache.solr.common.SolrException; org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: XML parse error
         at com.lsegroup.solr.handler.CwsExtractingDocumentLoader.load(CwsExtractingDocumentLoader.java:150)
         at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
         at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
         at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:242)
         at org.apache.solr.core.SolrCore.execute(SolrCore.java:1816)
         at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:656)
         at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:359)
         at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:155)
         at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241)
         at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
         at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:221)
         at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:107)
         at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:155)
         at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:76)
         at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:934)
         at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:90)
         at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:515)
         at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1012)
         at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:642)
         at org.apache.coyote.http11.Http11NioProtocol$Http11ConnectionHandler.process(Http11NioProtocol.java:223)
         at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun(NioEndpoint.java:1597)
         at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.run(NioEndpoint.java:1555)
         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
         at java.lang.Thread.run(Thread.java:724)
Caused by: org.apache.tika.exception.TikaException: XML parse error
         at org.apache.tika.parser.xml.XMLParser.parse(XMLParser.java:78)
         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
         at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
         at com.lsegroup.solr.handler.CwsExtractingDocumentLoader.load(CwsExtractingDocumentLoader.java:147)
         ... 24 more
Caused by: org.xml.sax.SAXParseException; lineNumber: 91; columnNumber: 105; The element type "img" must be terminated by the matching end-tag "</img>".
         at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.createSAXParseException(ErrorHandlerWrapper.java:198)
         at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.fatalError(ErrorHandlerWrapper.java:177)
         at com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:441)
         at com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:368)
         at com.sun.org.apache.xerces.internal.impl.XMLScanner.reportFatalError(XMLScanner.java:1388)
         at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEndElement(XMLDocumentFragmentScannerImpl.java:1753)
         at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2951)
         at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:606)
         at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:116)
         at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:511)
         at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:846)
         at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:775)
         at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:123)
         at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1210)
         at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:628)
         at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl.parse(SAXParserImpl.java:332)
         at javax.xml.parsers.SAXParser.parse(SAXParser.java:195)
         at org.apache.tika.parser.xml.XMLParser.parse(XMLParser.java:72)
         ... 28 more

Could the Solr collection be configured incorrectly?

Thanks,
Marcello

Re: Solr xml img parsing exception

Posted by Marcello Lorenzi <ml...@sorint.it>.
Hi Erik,
but in this case, does the custom loader receive an HTTP 500 error from Solr?

Thanks,
Marcello
On 11/14/2013 04:29 PM, Erik Hatcher wrote:
> Also there's a custom loader here that is the culprit:  com.lsegroup.solr.handler.CwsExtractingDocumentLoader


Re: Solr xml img parsing exception

Posted by Erik Hatcher <er...@gmail.com>.
Also, there's a custom loader here that is the culprit: com.lsegroup.solr.handler.CwsExtractingDocumentLoader


Re: Solr xml img parsing exception

Posted by Erick Erickson <er...@gmail.com>.
It looks like bad data. The XML you're sending to Solr looks malformed, so
I suspect this is completely outside of Solr's purview.

Best,
Erick



Re: Solr xml img parsing exception

Posted by Marcello Lorenzi <ml...@sorint.it>.
Hi Jack,
we have analyzed the issue and found duplicate Tika JARs on the Tomcat 
classpath. After removing the duplicate library, the search engine 
works as expected.

Thanks for the support,
Marcello
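
For anyone hitting the same problem, here is a rough sketch of one way to spot duplicate libraries from a list of JAR file names. It is only an illustration: the class name DuplicateJarCheck, the file names, and the version numbers are hypothetical, and it assumes Maven-style names of the form <artifact>-<version>.jar. It is not the procedure that was actually used here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DuplicateJarCheck {
    // Assumption: names look like <artifact>-<version>.jar, with the version starting with a digit.
    private static final Pattern JAR_NAME = Pattern.compile("^(.+?)-(\\d[\\w.\\-]*)\\.jar$");

    /** Groups JAR file names by artifact and keeps only artifacts that appear more than once. */
    static Map<String, List<String>> findDuplicates(List<String> jarNames) {
        Map<String, List<String>> byArtifact = new TreeMap<>();
        for (String name : jarNames) {
            Matcher m = JAR_NAME.matcher(name);
            // Fall back to the full file name when it does not follow the assumed pattern.
            String artifact = m.matches() ? m.group(1) : name;
            byArtifact.computeIfAbsent(artifact, k -> new ArrayList<>()).add(name);
        }
        // Drop every artifact that is present only once.
        byArtifact.values().removeIf(files -> files.size() < 2);
        return byArtifact;
    }

    public static void main(String[] args) {
        // Hypothetical listing, e.g. the combined contents of Tomcat's lib and the webapp's WEB-INF/lib.
        List<String> jars = Arrays.asList(
                "tika-core-1.3.jar", "tika-core-1.4.jar",
                "tika-parsers-1.3.jar", "solr-cell-4.3.0.jar");
        System.out.println(findDuplicates(jars));
    }
}
```

In practice the input list would come from listing every directory that ends up on the web application's classpath. A JAR that appears under the same name in two directories is also reported, because its name is added to the same group twice.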

On 11/14/2013 05:24 PM, Jack Krupansky wrote:
> The actual error appears to be:
>
> Caused by: org.xml.sax.SAXParseException; lineNumber: 91; columnNumber:
> 105; The element type "img" must be terminated by the matching end-tag
> "</img>".
>
> So, check the input document at line 91, column 105. There should be 
> an <img> tag there, but SAX is complaining that there is no matching 
> </img>.
>
> -- Jack Krupansky


Re: Solr xml img parsing exception

Posted by Jack Krupansky <ja...@basetechnology.com>.
The actual error appears to be:

Caused by: org.xml.sax.SAXParseException; lineNumber: 91; columnNumber:
105; The element type "img" must be terminated by the matching end-tag
"</img>".

So, check the input document at line 91, column 105. There should be an 
<img> tag there, but SAX is complaining that there is no matching </img>.

-- Jack Krupansky
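
To make the SAX complaint concrete, the following is a minimal sketch (not taken from the thread) that feeds a strict JAXP SAX parser an HTML-style <img> tag with no end tag, then the same tag in self-closing form. The class name ImgParseDemo, the helper parseError, and the sample markup are made up for illustration. The exact message wording can vary by JDK and locale.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

class ImgParseDemo {
    /** Returns null if the input is well-formed XML, otherwise "line:column message". */
    static String parseError(String xml) throws Exception {
        try {
            // DefaultHandler rethrows fatal errors, so malformed input surfaces as an exception.
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), new DefaultHandler());
            return null;
        } catch (SAXParseException e) {
            return e.getLineNumber() + ":" + e.getColumnNumber() + " " + e.getMessage();
        }
    }

    public static void main(String[] args) throws Exception {
        // Valid HTML, but not well-formed XML: <img> is never closed.
        System.out.println(parseError("<p><img src=\"logo.png\"></p>"));
        // Self-closing form is well-formed XML, so no error is reported.
        System.out.println(parseError("<p><img src=\"logo.png\"/></p>"));
    }
}
```

The first call reports an error like the one in the trace, because an XML parser requires every element to be closed. The second call returns null. This is why ordinary HTML pages can fail when they are routed to an XML parser instead of an HTML parser.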

-----Original Message----- 
From: Marcello Lorenzi
Sent: Thursday, November 14, 2013 9:26 AM
To: solr-user@lucene.apache.org
Subject: Solr xml img parsing exception

Hi,
I have installed a Solr 4.3 instance and we have configured ManifoldCF
to pass web content to the shard collection, but during crawling we
have noticed many occurrences of this exception:

ERROR - 2013-11-14 15:13:57.954; org.apache.solr.common.SolrException;
org.apache.solr.common.SolrException:
org.apache.tika.exception.TikaException: XML parse error
         at
com.lsegroup.solr.handler.CwsExtractingDocumentLoader.load(CwsExtractingDocumentLoader.java:150)
         at
org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
         at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
         at
org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:242)
         at org.apache.solr.core.SolrCore.execute(SolrCore.java:1816)
         at
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:656)
         at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:359)
         at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:155)
         at
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241)
         at
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
         at
org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:221)
         at
org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:107)
         at
org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:155)
         at
org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:76)
         at
org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:934)
         at
org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:90)
         at
org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:515)
         at
org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1012)
         at
org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:642)
         at
org.apache.coyote.http11.Http11NioProtocol$Http11ConnectionHandler.process(Http11NioProtocol.java:223)
         at
org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun(NioEndpoint.java:1597)
         at
org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.run(NioEndpoint.java:1555)
         at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
         at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
         at java.lang.Thread.run(Thread.java:724)
Caused by: org.apache.tika.exception.TikaException: XML parse error
         at org.apache.tika.parser.xml.XMLParser.parse(XMLParser.java:78)
         at
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
         at
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
         at
org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
         at
com.lsegroup.solr.handler.CwsExtractingDocumentLoader.load(CwsExtractingDocumentLoader.java:147)
         ... 24 more
Caused by: org.xml.sax.SAXParseException; lineNumber: 91; columnNumber:
105; The element type "img" must be terminated by the matching end-tag
"</img>".
         at
com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.createSAXParseException(ErrorHandlerWrapper.java:198)
         at
com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.fatalError(ErrorHandlerWrapper.java:177)
         at
com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:441)
         at
com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:368)
         at
com.sun.org.apache.xerces.internal.impl.XMLScanner.reportFatalError(XMLScanner.java:1388)
         at
com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEndElement(XMLDocumentFragmentScannerImpl.java:1753)
         at
com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2951)
         at
com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:606)
         at
com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:116)
         at
com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:511)
         at
com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:846)
         at
com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:775)
         at
com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:123)
         at
com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1210)
         at
com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:628)
         at
com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl.parse(SAXParserImpl.java:332)
         at javax.xml.parsers.SAXParser.parse(SAXParser.java:195)
         at org.apache.tika.parser.xml.XMLParser.parse(XMLParser.java:72)
         ... 28 more

Could the Solr collection be configured incorrectly?

Thanks,
Marcello