Posted to user@nutch.apache.org by Stefan Scheffler <ss...@avantgarde-labs.de> on 2012/09/24 10:25:45 UTC
Indexing Exception
Hello,
I have a strange problem. While indexing a crawl to Solr, I got the
following exception:
java.lang.RuntimeException: [was class java.io.CharConversionException] Invalid UTF-8 character 0xfffe at char #6886708, byte #11578429)
    at com.ctc.wstx.util.ExceptionUtil.throwRuntimeException(ExceptionUtil.java:18)
    at com.ctc.wstx.sr.StreamScanner.throwLazyError(StreamScanner.java:731)
    at com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3657)
    at com.ctc.wstx.sr.BasicStreamReader.getText(BasicStreamReader.java:809)
    at org.apache.solr.handler.XMLLoader.readDoc(XMLLoader.java:315)
    at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:156)
    at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:79)
    at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:58)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1376)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:365)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:260)
    at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
    at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
    at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
    at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
    at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
    at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
    at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
    at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
    at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
    at org.mortbay.jetty.Server.handle(Server.java:326)
    at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
    at org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:945)
    at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:843)
    at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:218)
    at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
    at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:228)
    at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
Caused by: java.io.CharConversionException: Invalid UTF-8 character 0xfffe at char #6886708, byte #11578429)
...
It seems to be an encoding exception. Is there a way to avoid this?
Regards
Stefan
--
Stefan Scheffler
Avantgarde Labs GmbH
Löbauer Straße 19, 01099 Dresden
Telefon: + 49 (0) 351 21590834
Email: sscheffler@avantgarde-labs.de
RE: Indexing Exception
Posted by Markus Jelsma <ma...@openindex.io>.
In Nutch, the stripping is hardcoded to process the `content` field only, but it could be changed to process any string field.
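As a rough illustration of what processing any string field could look like, the sketch below walks every field of a document and cleans each string value. It is not Nutch code: the `Map<String, List<Object>>` is only an assumed stand-in for a NutchDocument, and `FieldCleaner`, `cleanStringFields` and `dropNonCharacters` are hypothetical names chosen for this example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Nutch: cleans every String value of a document.
final class FieldCleaner {

    // True for Unicode noncharacters: U+FDD0..U+FDEF and the last two code points of each plane.
    static boolean isNonCharacter(int cp) {
        return (cp >= 0xFDD0 && cp <= 0xFDEF) || (cp & 0xFFFE) == 0xFFFE;
    }

    // Returns a copy of the string with all noncharacters removed.
    static String dropNonCharacters(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        s.codePoints().filter(cp -> !isNonCharacter(cp)).forEachOrdered(sb::appendCodePoint);
        return sb.toString();
    }

    // The map is an assumed stand-in for a document: field name -> list of values.
    // String values are cleaned; values of any other type are copied unchanged.
    static Map<String, List<Object>> cleanStringFields(Map<String, List<Object>> doc) {
        Map<String, List<Object>> cleaned = new LinkedHashMap<>();
        for (Map.Entry<String, List<Object>> field : doc.entrySet()) {
            List<Object> values = new ArrayList<>();
            for (Object value : field.getValue()) {
                values.add(value instanceof String ? dropNonCharacters((String) value) : value);
            }
            cleaned.put(field.getKey(), values);
        }
        return cleaned;
    }
}
```

Cleaning by value type rather than by field name means a custom field such as fullcontent is covered without listing it anywhere, at the cost of touching every string in the document.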
-----Original message-----
> From:Stefan Scheffler <ss...@avantgarde-labs.de>
> Sent: Mon 24-Sep-2012 12:27
> To: user@nutch.apache.org
> Subject: Re: Indexing Exception
>
> Hey,
> Thank you. I used this method in the meantime and it worked fine.
> Is there a general way in Nutch to do the UTF-8 encoding for this
> field as well?
Re: Indexing Exception
Posted by Stefan Scheffler <ss...@avantgarde-labs.de>.
Hey,
Thank you. I used this method in the meantime and it worked fine.
Is there a general way in Nutch to do the UTF-8 encoding for this
field as well?
On 24.09.2012 12:04, Markus Jelsma wrote:
> Hi Stefan,
>
> You can take the stripNonCharCodepoints() method and pass your content through it. It should fix the problem.
>
> Cheers,
RE: Indexing Exception
Posted by Markus Jelsma <ma...@openindex.io>.
Hi Stefan,
You can take the stripNonCharCodepoints() method and pass your content through it. It should fix the problem.
Cheers,
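For readers without the Nutch source at hand, here is a minimal stand-alone sketch of the idea behind such a method. It is not the Nutch implementation: the class name `NonCharStripper` and the method `strip` are made up for this example. It drops Unicode noncharacters (such as the 0xfffe from the exception) and, as an assumption about what an XML update sent to Solr needs, the control characters that XML 1.0 does not allow.

```java
// Hypothetical stand-alone sketch; not the Nutch implementation.
final class NonCharStripper {

    // Unicode noncharacters: U+FDD0..U+FDEF plus the last two code points of every plane.
    static boolean isNonCharacter(int cp) {
        return (cp >= 0xFDD0 && cp <= 0xFDEF) || (cp & 0xFFFE) == 0xFFFE;
    }

    // Control characters below U+0020 that XML 1.0 forbids (everything except tab, LF, CR).
    static boolean isForbiddenControl(int cp) {
        return cp < 0x20 && cp != 0x09 && cp != 0x0A && cp != 0x0D;
    }

    // Returns the input without noncharacters and forbidden control characters.
    static String strip(String input) {
        StringBuilder out = new StringBuilder(input.length());
        int i = 0;
        while (i < input.length()) {
            // Work on whole code points so characters outside the BMP stay intact.
            int cp = input.codePointAt(i);
            i += Character.charCount(cp);
            if (!isNonCharacter(cp) && !isForbiddenControl(cp)) {
                out.appendCodePoint(cp);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String dirty = "full" + (char) 0xFFFE + "content";
        System.out.println(strip(dirty)); // the U+FFFE between the two halves is dropped
    }
}
```

Iterating by code point rather than by char is a design choice of this sketch: it lets the check also catch noncharacters in the supplementary planes, such as U+1FFFE.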
-----Original message-----
> From:Stefan Scheffler <ss...@avantgarde-labs.de>
> Sent: Mon 24-Sep-2012 11:23
> To: user@nutch.apache.org
> Subject: Re: Indexing Exception
>
> Hey Markus, you gave me the right hint.
> In addition to the normal content field, I added a field, fullcontent,
> which simply holds the HTML document for the relevant content field,
> because we need it in a later processing step. This field is not
> encoded like the content field. I implemented this with a custom
> ParsingFilter, which stores it in the ParseResult, and then an
> IndexingFilter merges it into the NutchDocument.
>
> Is there a better way to do this, or should I just apply the same
> encoding to fullcontent as to content?
>
> Regards
> Stefan
Re: Indexing Exception
Posted by Stefan Scheffler <ss...@avantgarde-labs.de>.
Hey Markus, you gave me the right hint.
In addition to the normal content field, I added a field, fullcontent,
which simply holds the HTML document for the relevant content field,
because we need it in a later processing step. This field is not
encoded like the content field. I implemented this with a custom
ParsingFilter, which stores it in the ParseResult, and then an
IndexingFilter merges it into the NutchDocument.
Is there a better way to do this, or should I just apply the same
encoding to fullcontent as to content?
Regards
Stefan
On 24.09.2012 10:41, Markus Jelsma wrote:
> It was fixed for the content field with NUTCH-1016. Can you pinpoint the problematic field?
> https://issues.apache.org/jira/browse/NUTCH-1016
RE: Indexing Exception
Posted by Markus Jelsma <ma...@openindex.io>.
It was fixed for the content field with NUTCH-1016. Can you pinpoint the problematic field?
https://issues.apache.org/jira/browse/NUTCH-1016
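One way to pinpoint the field is to scan each field value for a noncharacter before the document is sent to Solr. The sketch below is illustrative only: the `Map<String, String>` is an assumed stand-in for the document's fields, and `NonCharFinder`, `firstNonCharacter` and `findOffendingFields` are hypothetical names, not Nutch or Solr APIs.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical diagnostic helper; not part of Nutch or Solr.
final class NonCharFinder {

    // Unicode noncharacters: U+FDD0..U+FDEF plus the last two code points of every plane.
    static boolean isNonCharacter(int cp) {
        return (cp >= 0xFDD0 && cp <= 0xFDEF) || (cp & 0xFFFE) == 0xFFFE;
    }

    // Returns the char index of the first noncharacter in the value, or -1 if there is none.
    static int firstNonCharacter(String value) {
        int i = 0;
        while (i < value.length()) {
            int cp = value.codePointAt(i);
            if (isNonCharacter(cp)) {
                return i;
            }
            i += Character.charCount(cp);
        }
        return -1;
    }

    // Maps each offending field name to the index of its first noncharacter.
    static Map<String, Integer> findOffendingFields(Map<String, String> fields) {
        Map<String, Integer> hits = new LinkedHashMap<>();
        for (Map.Entry<String, String> field : fields.entrySet()) {
            int index = firstNonCharacter(field.getValue());
            if (index >= 0) {
                hits.put(field.getKey(), index);
            }
        }
        return hits;
    }
}
```

The reported index is a position within that one field value, so it will not match the char and byte offsets in the exception, which count from the start of the whole update request.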
-----Original message-----
> From:Stefan Scheffler <ss...@avantgarde-labs.de>
> Sent: Mon 24-Sep-2012 10:37
> To: user@nutch.apache.org
> Subject: Re: Indexing Exception
>
> Nutch 1.5, Solr 3.6
Re: Indexing Exception
Posted by Stefan Scheffler <ss...@avantgarde-labs.de>.
Nutch 1.5, Solr 3.6
On 24.09.2012 10:34, Markus Jelsma wrote:
> Hi - What version?
RE: Indexing Exception
Posted by Markus Jelsma <ma...@openindex.io>.
Hi - What version?