Posted to user@nutch.apache.org by Michael Coffey <mc...@yahoo.com.INVALID> on 2017/08/25 02:42:41 UTC

invalid utf8 chars when indexing or cleaning

Lately, I have seen many tasks and jobs fail in Solr when doing nutch index and nutch clean.
Messages during indexing look like this.
17/08/24 19:18:59 INFO mapreduce.Job:  map 100% reduce 99%
17/08/24 19:19:36 INFO mapreduce.Job: Task Id : attempt_1502929850483_1329_r_000007_2, Status : FAILED
Error: org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://codero4.neocortix.com:8984/solr/popular: [com.ctc.wstx.exc.WstxLazyException] Invalid UTF-8 character 0xffff at char #104705, byte #219135)
        at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:575)
        at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:241)
        at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:230)
        at org.apache.solr.client.solrj.SolrClient.request(SolrClient.java:1220)
        at org.apache.nutch.indexwriter.solr.SolrIndexWriter.push(SolrIndexWriter.java:209)
        at org.apache.nutch.indexwriter.solr.SolrIndexWriter.write(SolrIndexWriter.java:173)
        at org.apache.nutch.indexer.IndexWriters.write(IndexWriters.java:85)
        at org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:50)
        at org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:41)
        at org.apache.hadoop.mapred.ReduceTask$OldTrackingRecordWriter.write(ReduceTask.java:493)

Messages during cleaning look like this.
17/08/22 09:24:01 INFO mapreduce.Job:  map 100% reduce 92%
17/08/22 09:25:57 INFO mapreduce.Job: Task Id : attempt_1502929850483_1016_r_000003_1, Status : FAILED
Error: org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://codero4.neocortix.com:8984/solr/popular: [com.ctc.wstx.exc.WstxLazyException] Invalid UTF-8 character 0xffff at char #16099, byte #16383)
        at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:575)
        at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:241)
        at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:230)
        at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:150)
        at org.apache.solr.client.solrj.SolrClient.deleteById(SolrClient.java:825)
        at org.apache.solr.client.solrj.SolrClient.deleteById(SolrClient.java:788)
        at org.apache.solr.client.solrj.SolrClient.deleteById(SolrClient.java:803)
        at org.apache.nutch.indexwriter.solr.SolrIndexWriter.push(SolrIndexWriter.java:222)
        at org.apache.nutch.indexwriter.solr.SolrIndexWriter.commit(SolrIndexWriter.java:187)
        at org.apache.nutch.indexwriter.solr.SolrIndexWriter.close(SolrIndexWriter.java:178)
        at org.apache.nutch.indexer.IndexWriters.close(IndexWriters.java:115)
        at org.apache.nutch.indexer.CleaningJob$DeleterReducer.close(CleaningJob.java:120)
        at org.apache.hadoop.io.IOUtils.cleanup(IOUtils.java:245)
Can anyone suggest a way to fix this? I am using Nutch 1.12 and Solr 5.4.1. I recently upgraded to Hadoop 2.7.4 and Java 1.8; I don't remember noticing this with Hadoop 2.7.2 and Java 1.7, but it happens very often now.
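
For context: 0xFFFF (U+FFFF) is a Unicode non-character and is not a legal character in XML 1.0, so the Woodstox parser (com.ctc.wstx) on the Solr side rejects the whole update request as soon as it meets one. One workaround is to strip such codepoints from every field value before the document is handed to SolrJ. Below is a minimal sketch of that idea, assuming you can hook in before indexing; the class and method names are hypothetical, not part of Nutch or Solr.

    // Minimal sketch (hypothetical helper, not Nutch code): remove codepoints
    // that are not legal in XML 1.0. The legal ranges are #x9, #xA, #xD,
    // #x20-#xD7FF, #xE000-#xFFFD and #x10000-#x10FFFF; U+FFFF falls outside them.
    public class XmlCharSanitizer {

        public static String stripInvalidXmlChars(String value) {
            StringBuilder out = new StringBuilder(value.length());
            for (int i = 0; i < value.length(); ) {
                int cp = value.codePointAt(i);
                if (isLegalXmlChar(cp)) {
                    out.appendCodePoint(cp);
                }
                i += Character.charCount(cp);
            }
            return out.toString();
        }

        static boolean isLegalXmlChar(int cp) {
            return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
        }
    }

Applied to every field value before it goes into a SolrInputDocument, a filter like this keeps a single bad page from failing an entire reducer batch.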

Re: invalid utf8 chars when indexing or cleaning

Posted by Jorge Betancourt <be...@gmail.com>.
From the logs, it looks like the error is coming from the Solr side. Do you 
mind checking/sharing the logs on your Solr server? Can you pinpoint which 
URL is causing the issue?
Best Regards, Jorge
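
For the pinpointing suggested above, one option is a standalone check (hypothetical code, not anything that exists in Nutch) that scans each field value for the first codepoint that is illegal in XML 1.0 and reports its position, so the document key can be logged before Solr ever sees the request:

    // Hypothetical diagnostic: returns the index of the first codepoint that
    // is not legal in XML 1.0, or -1 if the string is clean. Logging the
    // document key whenever this returns >= 0 identifies the offending URL.
    public static int firstInvalidXmlChar(String value) {
        for (int i = 0; i < value.length(); ) {
            int cp = value.codePointAt(i);
            boolean legal = cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
            if (!legal) {
                return i;
            }
            i += Character.charCount(cp);
        }
        return -1;
    }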


Re: invalid utf8 chars when indexing or cleaning

Posted by Michael Coffey <mc...@yahoo.com.INVALID>.
Does anybody have any thoughts on this? It seems similar to the NUTCH-1016 bug that was fixed in version 1.4.
Some more bits of information: the indexer job rarely fails (only 1 of the last 99 segments), but the cleaning job now fails every time. Once again, this is Nutch 1.12 and Solr 5.4.1. I recently upgraded to Hadoop 2.7.4 and Java 1.8 from Hadoop 2.7.2 and Java 1.7. Could this be some kind of version mismatch?
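
One detail worth noting from the cleaning-job trace above: the failure goes through SolrClient.deleteById, which suggests the illegal character may be in the ids (URLs) being deleted rather than in page content. A NUTCH-1016-style guard for that path, again only a sketch with hypothetical names, would skip such ids instead of sanitizing them, since a modified id would no longer match anything stored in Solr:

    // Hypothetical guard for the delete path (assumes it lives in the same
    // class as the firstInvalidXmlChar sketch earlier in this thread): drop
    // and log ids containing codepoints that are illegal in XML 1.0.
    public static java.util.List<String> filterDeletableIds(java.util.List<String> ids) {
        java.util.List<String> safe = new java.util.ArrayList<>(ids.size());
        for (String id : ids) {
            if (firstInvalidXmlChar(id) == -1) {
                safe.add(id);
            } else {
                System.err.println("Skipping undeletable id: " + id);
            }
        }
        return safe;
    }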

