Posted to user@nutch.apache.org by Stefan Scheffler <ss...@avantgarde-labs.de> on 2012/09/24 10:25:45 UTC

Indexing Exception

Hello,
I have a strange problem. While indexing a crawl to Solr, I got the
following exception:

java.lang.RuntimeException: [was class java.io.CharConversionException] Invalid UTF-8 character 0xfffe at char #6886708, byte #11578429)
     at com.ctc.wstx.util.ExceptionUtil.throwRuntimeException(ExceptionUtil.java:18)
     at com.ctc.wstx.sr.StreamScanner.throwLazyError(StreamScanner.java:731)
     at com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3657)
     at com.ctc.wstx.sr.BasicStreamReader.getText(BasicStreamReader.java:809)
     at org.apache.solr.handler.XMLLoader.readDoc(XMLLoader.java:315)
     at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:156)
     at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:79)
     at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:58)
     at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
     at org.apache.solr.core.SolrCore.execute(SolrCore.java:1376)
     at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:365)
     at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:260)
     at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
     at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
     at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
     at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
     at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
     at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
     at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
     at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
     at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
     at org.mortbay.jetty.Server.handle(Server.java:326)
     at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
     at org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:945)
     at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:843)
     at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:218)
     at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
     at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:228)
     at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
Caused by: java.io.CharConversionException: Invalid UTF-8 character 0xfffe at char #6886708, byte #11578429)
...

It seems to be an encoding exception. Is there a way to avoid this?

Regards
Stefan

-- 
Stefan Scheffler
Avantgarde Labs GmbH
Löbauer Straße 19, 01099 Dresden
Telefon: + 49 (0) 351 21590834
Email: sscheffler@avantgarde-labs.de


RE: Indexing Exception

Posted by Markus Jelsma <ma...@openindex.io>.
It is hardcoded to process the `content` field only but it could be changed to process any string field. 
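
A minimal sketch of what "process any string field" could look like, assuming a document is modelled as a plain map from field name to a list of values. The Map stand-in, the class name AllFieldsCleaner, and both method names are hypothetical and not part of Nutch. The stripping rule is a simplified approximation that only inspects UTF-16 code units.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: cleans every String value of a document, not just "content".
final class AllFieldsCleaner {
    private AllFieldsCleaner() {}

    // Simplified rule (an assumption, not Nutch's code): drop BMP noncharacters
    // and control characters other than tab, line feed and carriage return.
    static String strip(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char ch = input.charAt(i);
            boolean nonChar = ch == 0xFFFE || ch == 0xFFFF || (ch >= 0xFDD0 && ch <= 0xFDEF);
            boolean badControl = ch < 0x20 && ch != '\t' && ch != '\n' && ch != '\r';
            if (!nonChar && !badControl) {
                out.append(ch);
            }
        }
        return out.toString();
    }

    // Returns a cleaned copy; non-String values (numbers, dates) pass through untouched.
    static Map<String, List<Object>> cleanAllStringFields(Map<String, List<Object>> doc) {
        Map<String, List<Object>> cleaned = new LinkedHashMap<>();
        for (Map.Entry<String, List<Object>> field : doc.entrySet()) {
            List<Object> values = new ArrayList<>(field.getValue().size());
            for (Object value : field.getValue()) {
                values.add(value instanceof String ? strip((String) value) : value);
            }
            cleaned.put(field.getKey(), values);
        }
        return cleaned;
    }
}
```

Returning a copy rather than mutating in place keeps the sketch easy to test; a real indexing filter would more likely rewrite the field values on the document it is given.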
 

Re: Indexing Exception

Posted by Stefan Scheffler <ss...@avantgarde-labs.de>.
Hey,
Thank you, I used this method in the meantime and it worked fine.
Is there a general way in Nutch to apply the UTF-8 cleanup to this
field as well?



RE: Indexing Exception

Posted by Markus Jelsma <ma...@openindex.io>.
Hi Stefan,

You can take the stripNonCharCodepoints() method and pass your content through it. It should fix the problem.
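
A rough sketch of what such a stripping step could look like. This is not the Nutch implementation of stripNonCharCodepoints(); the class NonCharStripper and its method names are made up for illustration. It assumes the goal is to drop Unicode noncharacters (such as the 0xfffe from the exception) and anything else XML 1.0 does not allow in character data.

```java
// Hypothetical stand-alone helper; not the code shipped with Nutch.
final class NonCharStripper {
    private NonCharStripper() {}

    // Unicode noncharacters: U+FDD0..U+FDEF plus the last two code points of every plane.
    static boolean isNonCharacter(int cp) {
        return (cp >= 0xFDD0 && cp <= 0xFDEF) || (cp & 0xFFFE) == 0xFFFE;
    }

    // Code points permitted by the XML 1.0 Char production.
    static boolean isXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Copies the input code point by code point and skips everything unsafe.
    // Unpaired surrogates come back from codePointAt as themselves and are dropped too.
    static String strip(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); ) {
            int cp = input.codePointAt(i);
            i += Character.charCount(cp);
            if (isXmlChar(cp) && !isNonCharacter(cp)) {
                out.appendCodePoint(cp);
            }
        }
        return out.toString();
    }
}
```

Under these assumptions, a call such as NonCharStripper.strip(fullcontent) before the value is added to the document would keep the offending code point out of the update request.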

Cheers,
 

Re: Indexing Exception

Posted by Stefan Scheffler <ss...@avantgarde-labs.de>.
Hey Markus, you gave me the right hint.
In addition to the normal content field, I added a field, fullcontent,
which simply holds the HTML document behind the content field,
because we need it in a later processing step. This field is not
encoded like the content field. I implemented this with a custom
ParsingFilter, which stores it in the ParseResult, and an
IndexingFilter then merges it into the NutchDocument.

Is there a better way to do this, or should I just apply the same
encoding to fullcontent as to content?

Regards
Stefan


RE: Indexing Exception

Posted by Markus Jelsma <ma...@openindex.io>.
It was fixed for the content field in NUTCH-1016. Can you pinpoint the problematic field?
https://issues.apache.org/jira/browse/NUTCH-1016
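
One way to pinpoint the field, sketched under the assumption that the document's field values are available as plain strings before they are sent to Solr. The class BadFieldFinder and its methods are hypothetical. The offsets it reports are positions within a single field value, so they will presumably not match the char and byte offsets in the exception, which refer to the whole XML update request.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical diagnostic helper; not part of Nutch or Solr.
final class BadFieldFinder {
    private BadFieldFinder() {}

    // Index of the first code point an XML update would likely reject, or -1 if none.
    static int firstBadOffset(String value) {
        for (int i = 0; i < value.length(); ) {
            int cp = value.codePointAt(i);
            boolean nonChar = (cp >= 0xFDD0 && cp <= 0xFDEF) || (cp & 0xFFFE) == 0xFFFE;
            boolean badControl = cp < 0x20 && cp != 0x9 && cp != 0xA && cp != 0xD;
            boolean loneSurrogate = cp >= 0xD800 && cp <= 0xDFFF;
            if (nonChar || badControl || loneSurrogate) {
                return i;
            }
            i += Character.charCount(cp);
        }
        return -1;
    }

    // Maps each offending field name to the offset of its first bad code point.
    static Map<String, Integer> findBadFields(Map<String, String> fields) {
        Map<String, Integer> bad = new LinkedHashMap<>();
        for (Map.Entry<String, String> field : fields.entrySet()) {
            int offset = firstBadOffset(field.getValue());
            if (offset >= 0) {
                bad.put(field.getKey(), offset);
            }
        }
        return bad;
    }
}
```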

 
 

Re: Indexing Exception

Posted by Stefan Scheffler <ss...@avantgarde-labs.de>.
Nutch 1.5, Solr 3.6


RE: Indexing Exception

Posted by Markus Jelsma <ma...@openindex.io>.
Hi - What version?

 
 