Posted to solr-user@lucene.apache.org by eakarsu <ea...@gmail.com> on 2013/04/11 17:02:32 UTC

java.io.CharConversionException] Invalid UTF-8 character 0xffff at char #478803, byte #606190)

Hello,

I am crawling some sites with Apache Nutch and indexing them with Solr. This
had been working fine until a few days ago. The crawled data can contain
200K or more documents. When I send it to Solr for indexing with

 bin/nutch solrindex http://xxxx.com:8080/solr crawl/crawldb -linkdb crawl/linkdb crawl/segments/*

Nutch reports "SORL server internal error", and the Solr 4.1 logs show the
error reproduced at the end of this message.

It is very hard to find out which document is causing this issue.

What I need is either a way to configure Solr so that it skips documents
that contain bad data and continues indexing the remaining documents coming
from Nutch, or, although I am new to Solr, a way to write an update
pre/post-processor plugin for the Solr update job that ignores XML errors.
Is there a solution for this problem?
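
To illustrate the kind of filtering I have in mind, here is a minimal
sketch in plain Java. It assumes the bad 0xffff ends up in a field value
and that the value can be cleaned before the document is sent to Solr; I
have not verified that this is where the problem arises. The class
XmlTextSanitizer and its methods are hypothetical names for illustration
and are not part of Solr or Nutch. The check follows the Char production
of XML 1.0, which excludes U+FFFE, U+FFFF, unpaired surrogates and most
control characters.

```java
/**
 * Hypothetical helper (not part of Solr or Nutch): removes code points that
 * XML 1.0 does not allow, such as U+FFFF, from a field value.
 */
final class XmlTextSanitizer {

    private XmlTextSanitizer() {
    }

    /** True if the code point matches the XML 1.0 Char production. */
    static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Returns the input with every invalid code point dropped. */
    static String strip(String input) {
        if (input == null) {
            return null;
        }
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); ) {
            // codePointAt returns the bare surrogate value for an unpaired
            // surrogate, which isValidXmlChar rejects.
            int cp = input.codePointAt(i);
            if (isValidXmlChar(cp)) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    /**
     * Returns the char index of the first invalid code point, or -1 if the
     * input is clean. Could be used to log which document is affected.
     */
    static int indexOfInvalid(String input) {
        for (int i = 0; i < input.length(); ) {
            int cp = input.codePointAt(i);
            if (!isValidXmlChar(cp)) {
                return i;
            }
            i += Character.charCount(cp);
        }
        return -1;
    }
}
```

Calling XmlTextSanitizer.strip(value) on each field value would drop the
offending characters, while indexOfInvalid(value) together with the
document URL could be logged to identify the bad documents instead of
silently changing them.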

I appreciate your help

class java.io.CharConversionException] Invalid UTF-8 character 0xffff at char #478803, byte #606190)
java.lang.RuntimeException: [was class java.io.CharConversionException] Invalid UTF-8 character 0xffff at char #478803, byte #606190)
    at com.ctc.wstx.util.ExceptionUtil.throwRuntimeException(ExceptionUtil.java:18)
    at com.ctc.wstx.sr.StreamScanner.throwLazyError(StreamScanner.java:731)
    at com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3657)
    at com.ctc.wstx.sr.BasicStreamReader.getText(BasicStreamReader.java:809)
    at org.apache.solr.handler.loader.XMLLoader.readDoc(XMLLoader.java:393)
    at org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:245)
    at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:173)
    at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
    at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1816)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:448)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:269)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)
    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:222)
    at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:123)
    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:171)
    at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:99)
    at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:931)
    at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:118)
    at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:407)
    at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1004)
    at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:589)
    at org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:312)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
    at java.lang.Thread.run(Thread.java:722)
Caused by: java.io.CharConversionException: Invalid UTF-8 character 0xffff at char #478803, byte #606190)
    at com.ctc.wstx.io.UTF8Reader.reportInvalid(UTF8Reader.java:335)
    at com.ctc.wstx.io.UTF8Reader.read(UTF8Reader.java:249)
    at com.ctc.wstx.io.MergedReader.read(MergedReader.java:101)
    at com.ctc.wstx.io.ReaderSource.readInto(ReaderSource.java:84)
    at com.ctc.wstx.io.BranchingReaderSource.readInto(BranchingReaderSource.java:57)
    at com.ctc.wstx.sr.StreamScanner.loadMore(StreamScanner.java:992)
    at com.ctc.wstx.sr.BasicStreamReader.readTextSecondary(BasicStreamReader.java:4628)
    at com.ctc.wstx.sr.BasicStreamReader.readCoalescedText(BasicStreamReader.java:4126)
    at com.ctc.wstx.sr.BasicStreamReader.finishToken(BasicStreamReader.java:3701)
    at com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3649)
    ... 25 more




--
View this message in context: http://lucene.472066.n3.nabble.com/java-io-CharConversionException-Invalid-UTF-8-character-0xffff-at-char-478803-byte-606190-tp4055323.html