Posted to solr-user@lucene.apache.org by ramires <uy...@beriltech.com> on 2011/06/22 09:15:38 UTC

strange utf-8 problem

I use Solr 4 trunk to index some sites crawled with Nutch 1-2-rc4. When I try to
index 300k documents with Solr 4, I get the error shown in the stack trace.
With Solr 1.4.1 the problem does not occur. I have installed
Solr 4 on Tomcat 5 and 6 and on Jetty 7 and 8, and the result is the same.

I use apache-solr-core-1.4.0.jar and apache-solr-solrj-1.4.0.jar for Solr 1.4.1
because of javabin errors.

Here are the problematic characters: "Sao Tom���nd Princip���STP"

SEVERE: java.lang.RuntimeException: [was class java.io.CharConversionException] Invalid UTF-8 character 0xffff at char #681112, byte #700315)
        at com.ctc.wstx.util.ExceptionUtil.throwRuntimeException(ExceptionUtil.java:18)
        at com.ctc.wstx.sr.StreamScanner.throwLazyError(StreamScanner.java:731)
        at com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3657)
        at com.ctc.wstx.sr.BasicStreamReader.getText(BasicStreamReader.java:809)
        at org.apache.solr.handler.XMLLoader.readDoc(XMLLoader.java:266)
        at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:126)
        at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:77)
        at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:67)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1308)
        at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:353)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:248)
        at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1323)
        at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:476)
        at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:119)
        at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:480)
        at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:225)
        at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:937)
        at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:406)
        at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:183)
        at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:871)
        at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:117)
        at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:247)
        at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:149)
        at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:110)
        at org.eclipse.jetty.server.Server.handle(Server.java:346)
        at org.eclipse.jetty.server.HttpConnection.handleRequest(HttpConnection.java:589)
        at org.eclipse.jetty.server.HttpConnection$RequestHandler.content(HttpConnection.java:1065)
        at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:915)
        at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:220)
        at org.eclipse.jetty.server.HttpConnection.handle(HttpConnection.java:411)
        at org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:535)
        at org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:40)
        at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:529)
        at java.lang.Thread.run(Thread.java:662)
Caused by: java.io.CharConversionException: Invalid UTF-8 character 0xffff at char #681112, byte #700315)
        at com.ctc.wstx.io.UTF8Reader.reportInvalid(UTF8Reader.java:335)
        at com.ctc.wstx.io.UTF8Reader.read(UTF8Reader.java:249)
        at com.ctc.wstx.io.MergedReader.read(MergedReader.java:101)
        at com.ctc.wstx.io.ReaderSource.readInto(ReaderSource.java:84)
        at com.ctc.wstx.io.BranchingReaderSource.readInto(BranchingReaderSource.java:57)
        at com.ctc.wstx.sr.StreamScanner.loadMore(StreamScanner.java:992)
        at com.ctc.wstx.sr.BasicStreamReader.readTextSecondary(BasicStreamReader.java:4628)
        at com.ctc.wstx.sr.BasicStreamReader.readCoalescedText(BasicStreamReader.java:4126)
        at com.ctc.wstx.sr.BasicStreamReader.finishToken(BasicStreamReader.java:3701)
        at com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3649)
        ... 32 more
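
The 0xffff reported in the trace is U+FFFF, which falls outside the character range that XML 1.0 allows, so the XML parser rejects the update request. One possible workaround, offered here only as a sketch and not as a confirmed fix for this setup, is to strip such characters from field values on the client side before they are posted. The class name XmlCharFilter and the method stripInvalidXmlChars below are hypothetical and are not part of Solr or Nutch:

```java
// Hypothetical helper, not part of Solr or Nutch: removes code points
// that the XML 1.0 "Char" production does not allow.
final class XmlCharFilter {
    private XmlCharFilter() {}

    /** Returns true if the code point is allowed by the XML 1.0 Char production. */
    static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Copies the input, dropping every code point that XML 1.0 forbids. */
    static String stripInvalidXmlChars(String in) {
        if (in == null) {
            return null;
        }
        StringBuilder out = new StringBuilder(in.length());
        for (int i = 0; i < in.length(); ) {
            // codePointAt handles surrogate pairs; a lone surrogate is
            // returned as-is and then rejected by isValidXmlChar.
            int cp = in.codePointAt(i);
            if (isValidXmlChar(cp)) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }
}
```

A caller would pass each field value through XmlCharFilter.stripInvalidXmlChars(value) before adding it to the document. This only hides the symptom; if the crawled pages are being decoded with the wrong charset, the underlying text will still be damaged.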


--
View this message in context: http://lucene.472066.n3.nabble.com/strange-utf-8-problem-tp3094473p3094473.html
Sent from the Solr - User mailing list archive at Nabble.com.