Posted to user@nutch.apache.org by Kshitij Shukla <ks...@cisinlabs.com> on 2016/01/25 08:23:06 UTC
[CIS-CMMI-3] Invalid UTF-8 character 0xffff at char exception
Hello everyone,
During a very large crawl when indexing to Solr this will yield the
following exception:
**************************************************
root@cism479:/usr/share/searchEngine/apache-nutch-2.3.1/runtime/deploy/bin#
/usr/share/searchEngine/apache-nutch-2.3.1/runtime/deploy/bin/nutch
index -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D
mapred.reduce.tasks.speculative.execution=false -D
mapred.map.tasks.speculative.execution=false -D
mapred.compress.map.output=true -D
solr.server.url=http://localhost:8983/solr/ddcds -all -crawlId 1
16/01/25 11:44:52 INFO indexer.IndexingJob: IndexingJob: starting
16/01/25 11:44:53 INFO Configuration.deprecation:
mapred.output.key.comparator.class is deprecated. Instead, use
mapreduce.job.output.key.comparator.class
16/01/25 11:44:53 INFO plugin.PluginRepository: Plugins: looking in:
/tmp/hadoop-root/hadoop-unjar4772724649160367470/classes/plugins
16/01/25 11:44:54 INFO plugin.PluginRepository: Plugin Auto-activation
mode: [true]
16/01/25 11:44:54 INFO plugin.PluginRepository: Registered Plugins:
16/01/25 11:44:54 INFO plugin.PluginRepository: HTTP Framework
(lib-http)
16/01/25 11:44:54 INFO plugin.PluginRepository: Html Parse Plug-in
(parse-html)
16/01/25 11:44:54 INFO plugin.PluginRepository: MetaTags
(parse-metatags)
16/01/25 11:44:54 INFO plugin.PluginRepository: Html Indexing Filter
(index-html)
16/01/25 11:44:54 INFO plugin.PluginRepository: the nutch core
extension points (nutch-extensionpoints)
16/01/25 11:44:54 INFO plugin.PluginRepository: Basic Indexing
Filter (index-basic)
16/01/25 11:44:54 INFO plugin.PluginRepository: XML Libraries (lib-xml)
16/01/25 11:44:54 INFO plugin.PluginRepository: Anchor Indexing
Filter (index-anchor)
16/01/25 11:44:54 INFO plugin.PluginRepository: Basic URL Normalizer
(urlnormalizer-basic)
16/01/25 11:44:54 INFO plugin.PluginRepository: Language
Identification Parser/Filter (language-identifier)
16/01/25 11:44:54 INFO plugin.PluginRepository: Metadata Indexing
Filter (index-metadata)
16/01/25 11:44:54 INFO plugin.PluginRepository: CyberNeko HTML
Parser (lib-nekohtml)
16/01/25 11:44:54 INFO plugin.PluginRepository: Subcollection
indexing and query filter (subcollection)
16/01/25 11:44:54 INFO plugin.PluginRepository: SOLRIndexWriter
(indexer-solr)
16/01/25 11:44:54 INFO plugin.PluginRepository: Rel-Tag microformat
Parser/Indexer/Querier (microformats-reltag)
16/01/25 11:44:54 INFO plugin.PluginRepository: Http / Https
Protocol Plug-in (protocol-httpclient)
16/01/25 11:44:54 INFO plugin.PluginRepository: JavaScript Parser
(parse-js)
16/01/25 11:44:54 INFO plugin.PluginRepository: Tika Parser Plug-in
(parse-tika)
16/01/25 11:44:54 INFO plugin.PluginRepository: Top Level Domain
Plugin (tld)
16/01/25 11:44:54 INFO plugin.PluginRepository: Regex URL Filter
Framework (lib-regex-filter)
16/01/25 11:44:54 INFO plugin.PluginRepository: Regex URL Normalizer
(urlnormalizer-regex)
16/01/25 11:44:54 INFO plugin.PluginRepository: Link Analysis
Scoring Plug-in (scoring-link)
16/01/25 11:44:54 INFO plugin.PluginRepository: OPIC Scoring Plug-in
(scoring-opic)
16/01/25 11:44:54 INFO plugin.PluginRepository: More Indexing Filter
(index-more)
16/01/25 11:44:54 INFO plugin.PluginRepository: Http Protocol
Plug-in (protocol-http)
16/01/25 11:44:54 INFO plugin.PluginRepository: Creative Commons
Plugins (creativecommons)
16/01/25 11:44:54 INFO plugin.PluginRepository: Registered Extension-Points:
16/01/25 11:44:54 INFO plugin.PluginRepository: Parse Filter
(org.apache.nutch.parse.ParseFilter)
16/01/25 11:44:54 INFO plugin.PluginRepository: Nutch Index Cleaning
Filter (org.apache.nutch.indexer.IndexCleaningFilter)
16/01/25 11:44:54 INFO plugin.PluginRepository: Nutch Content Parser
(org.apache.nutch.parse.Parser)
16/01/25 11:44:54 INFO plugin.PluginRepository: Nutch URL Filter
(org.apache.nutch.net.URLFilter)
16/01/25 11:44:54 INFO plugin.PluginRepository: Nutch Scoring
(org.apache.nutch.scoring.ScoringFilter)
16/01/25 11:44:54 INFO plugin.PluginRepository: Nutch URL Normalizer
(org.apache.nutch.net.URLNormalizer)
16/01/25 11:44:54 INFO plugin.PluginRepository: Nutch Protocol
(org.apache.nutch.protocol.Protocol)
16/01/25 11:44:54 INFO plugin.PluginRepository: Nutch Index Writer
(org.apache.nutch.indexer.IndexWriter)
16/01/25 11:44:54 INFO plugin.PluginRepository: Nutch Indexing
Filter (org.apache.nutch.indexer.IndexingFilter)
16/01/25 11:44:54 INFO indexer.IndexingFilters: Adding
org.apache.nutch.indexer.html.HtmlIndexingFilter
16/01/25 11:44:54 INFO basic.BasicIndexingFilter: Maximum title length
for indexing set to: 100
16/01/25 11:44:54 INFO indexer.IndexingFilters: Adding
org.apache.nutch.indexer.basic.BasicIndexingFilter
16/01/25 11:44:54 INFO anchor.AnchorIndexingFilter: Anchor deduplication
is: off
16/01/25 11:45:07 INFO mapreduce.JobSubmitter: Submitting tokens for
job: job_1453472314066_0007
16/01/25 11:45:08 INFO impl.YarnClientImpl: Submitted application
application_1453472314066_0007
16/01/25 11:45:09 INFO mapreduce.Job: The url to track the job:
http://cism479:8088/proxy/application_1453472314066_0007/
16/01/25 11:45:09 INFO mapreduce.Job: Running job: job_1453472314066_0007
16/01/25 11:45:29 INFO mapreduce.Job: Job job_1453472314066_0007 running
in uber mode : false
16/01/25 11:45:29 INFO mapreduce.Job: map 0% reduce 0%
16/01/25 11:49:24 INFO mapreduce.Job: map 50% reduce 0%
16/01/25 11:49:29 INFO mapreduce.Job: map 0% reduce 0%
16/01/25 11:49:29 INFO mapreduce.Job: Task Id :
attempt_1453472314066_0007_m_000000_0, Status : FAILED
Error:
org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
[com.ctc.wstx.exc.WstxLazyException] Invalid UTF-8 character 0xffff at
char #1296459, byte #1310719)
at
org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:491)
at
org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:197)
at
org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:68)
at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:54)
at
org.apache.nutch.indexwriter.solr.SolrIndexWriter.write(SolrIndexWriter.java:84)
at org.apache.nutch.indexer.IndexWriters.write(IndexWriters.java:84)
at
org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:48)
at
org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:43)
at
org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.write(MapTask.java:635)
at
org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:89)
at
org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.write(WrappedMapper.java:112)
at
org.apache.nutch.indexer.IndexingJob$IndexerMapper.map(IndexingJob.java:120)
at
org.apache.nutch.indexer.IndexingJob$IndexerMapper.map(IndexingJob.java:69)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
16/01/25 11:52:27 INFO mapreduce.Job: map 50% reduce 0%
16/01/25 11:53:01 INFO mapreduce.Job: map 100% reduce 0%
16/01/25 11:53:01 INFO mapreduce.Job: Task Id :
attempt_1453472314066_0007_m_000000_1, Status : FAILED
Error:
org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
[com.ctc.wstx.exc.WstxLazyException] Invalid UTF-8 character 0xffff at
char #1296459, byte #1310719)
at
org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:491)
at
org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:197)
at
org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:68)
at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:54)
at
org.apache.nutch.indexwriter.solr.SolrIndexWriter.write(SolrIndexWriter.java:84)
at org.apache.nutch.indexer.IndexWriters.write(IndexWriters.java:84)
at
org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:48)
at
org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:43)
at
org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.write(MapTask.java:635)
at
org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:89)
at
org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.write(WrappedMapper.java:112)
at
org.apache.nutch.indexer.IndexingJob$IndexerMapper.map(IndexingJob.java:120)
at
org.apache.nutch.indexer.IndexingJob$IndexerMapper.map(IndexingJob.java:69)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
16/01/25 11:53:02 INFO mapreduce.Job: map 50% reduce 0%
16/01/25 11:54:52 INFO mapreduce.Job: map 100% reduce 0%
16/01/25 11:54:52 INFO mapreduce.Job: Task Id :
attempt_1453472314066_0007_m_000000_2, Status : FAILED
Error:
org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
[com.ctc.wstx.exc.WstxLazyException] Invalid UTF-8 character 0xffff at
char #1296459, byte #1310719)
at
org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:491)
at
org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:197)
at
org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:68)
at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:54)
at
org.apache.nutch.indexwriter.solr.SolrIndexWriter.write(SolrIndexWriter.java:84)
at org.apache.nutch.indexer.IndexWriters.write(IndexWriters.java:84)
at
org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:48)
at
org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:43)
at
org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.write(MapTask.java:635)
at
org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:89)
at
org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.write(WrappedMapper.java:112)
at
org.apache.nutch.indexer.IndexingJob$IndexerMapper.map(IndexingJob.java:120)
at
org.apache.nutch.indexer.IndexingJob$IndexerMapper.map(IndexingJob.java:69)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
16/01/25 11:54:53 INFO mapreduce.Job: map 50% reduce 0%
16/01/25 11:56:22 INFO mapreduce.Job: map 100% reduce 0%
16/01/25 11:56:23 INFO mapreduce.Job: Job job_1453472314066_0007 failed
with state FAILED due to: Task failed task_1453472314066_0007_m_000000
Job failed as tasks failed. failedMaps:1 failedReduces:0
16/01/25 11:56:23 INFO mapreduce.Job: Counters: 33
File System Counters
FILE: Number of bytes read=0
FILE: Number of bytes written=116194
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=1033
HDFS: Number of bytes written=0
HDFS: Number of read operations=1
HDFS: Number of large read operations=0
HDFS: Number of write operations=0
Job Counters
Failed map tasks=4
Launched map tasks=5
Other local map tasks=3
Data-local map tasks=2
Total time spent by all maps in occupied slots (ms)=3168342
Total time spent by all reduces in occupied slots (ms)=0
Total time spent by all map tasks (ms)=1056114
Total vcore-seconds taken by all map tasks=1056114
Total megabyte-seconds taken by all map tasks=3244382208
Map-Reduce Framework
Map input records=2762511
Map output records=17629
Input split bytes=1033
Spilled Records=0
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=2995
CPU time spent (ms)=116860
Physical memory (bytes) snapshot=1272868864
Virtual memory (bytes) snapshot=5104431104
Total committed heap usage (bytes)=1017118720
IndexerJob
DocumentCount=17629
File Input Format Counters
Bytes Read=0
File Output Format Counters
Bytes Written=0
16/01/25 11:56:23 ERROR indexer.IndexingJob: SolrIndexerJob:
java.lang.RuntimeException: job failed: name=[1]Indexer,
jobid=job_1453472314066_0007
at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:120)
at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:154)
at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:176)
at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:202)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:211)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
*******************************************************
--
Please let me know if you have any questions , concerns or updates.
Have a great day ahead :)
Thanks and Regards,
Kshitij Shukla
Software developer
Cyber Infrastructure (CIS)
The RightSourcing Specialists with 1250 man years of experience!
DISCLAIMER: INFORMATION PRIVACY is important for us, If you are not the
intended recipient, you should delete this message and are notified that
any disclosure, copying or distribution of this message, or taking any
action based on it, is strictly prohibited by Law.
Please don't print this e-mail unless you really need to.
--
------------------------------
Cyber Infrastructure (P) Limited, [CIS] (CMMI Level 3 Certified)
Central India's largest Technology company.
Ensuring the success of our clients and partners through our highly
optimized Technology solutions.
www.cisin.com | +Cisin <https://plus.google.com/+Cisin/> | Linkedin
<https://www.linkedin.com/company/cyber-infrastructure-private-limited> |
Offices: Indore, India. Singapore. Silicon Valley, USA.
RE: [CIS-CMMI-3] Re: [CIS-CMMI-3] Re: [CIS-CMMI-3] Invalid UTF-8
character 0xffff at char exception
Posted by Markus Jelsma <ma...@openindex.io>.
You could try removing the content or title condition:
if (e.getKey().equals("content") || e.getKey().equals("title")) {
val2 = SolrUtils.stripNonCharCodepoints(val);
}
Then all fields will get stripped. But usually it only happens on the content field, strange.
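If you do remove that condition so every string field gets stripped, the stripping itself amounts to dropping Unicode non-characters before the document is serialized to XML. Below is a minimal, self-contained sketch of the idea behind SolrUtils.stripNonCharCodepoints; it is not the actual Nutch source, and the class/method names are illustrative:

```java
// Sketch (not the Nutch source): drop BMP non-characters such as U+FFFF
// before a field value is handed to SolrJ, since the Woodstox XML parser
// on the Solr side rejects them with "Invalid UTF-8 character 0xffff".
public class StripNonChars {

    public static String stripNonCharCodepoints(String input) {
        StringBuilder sb = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            // U+FDD0..U+FDEF is a dedicated non-character block.
            if (c >= 0xFDD0 && c <= 0xFDEF) continue;
            // U+FFFE and U+FFFF are the BMP non-characters seen in this error.
            if (c == 0xFFFE || c == 0xFFFF) continue;
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String dirty = "hello\uFFFFworld";
        System.out.println(stripNonCharCodepoints(dirty)); // prints "helloworld"
    }
}
```

This only covers the BMP cases that typically trigger the exception above; supplementary-plane non-characters (U+1FFFE etc.) arrive as surrogate pairs and would need codepoint-based iteration to catch.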
Markus
-----Original message-----
> From:Kshitij Shukla <ks...@cisinlabs.com>
> Sent: Monday 25th January 2016 14:23
> To: user@nutch.apache.org
> Subject: [CIS-CMMI-3] Re: [CIS-CMMI-3] Re: [CIS-CMMI-3] Invalid UTF-8 character 0xffff at char exception
>
> I have been trying to get the name of the field, but the error it shows
> is a generic one and doesn't have any field name associated with it. I
> looked for the name in the Hadoop, Nutch, and Solr logs, but didn't
> find any field name.
>
> Thanks
>
> On Monday 25 January 2016 06:10 PM, Markus Jelsma wrote:
> > That is odd! Is it on your content or title field?
> > Markus
> >
> > -----Original message-----
> >> From:Kshitij Shukla <ks...@cisinlabs.com>
> >> Sent: Monday 25th January 2016 11:41
> >> To: user@nutch.apache.org
> >> Subject: [CIS-CMMI-3] Re: [CIS-CMMI-3] Invalid UTF-8 character 0xffff at char exception
> >>
> >> Thanks for your response, Markus. I checked the code and found the
> >> workaround you suggested in this file:
> >>
> >> *Source:*
> >> /src/plugin/indexer-solr/src/java/org/apache/nutch/indexwriter/solr/SolrUtils.java
> >>
> >> and the method was called in this file:
> >>
> >> *Invoked:*
> >> /src/plugin/indexer-solr/src/java/org/apache/nutch/indexwriter/solr/SolrIndexWriter.java
> >> like this
> >> if (e.getKey().equals("content") || e.getKey().equals("title")) {
> >> val2 = SolrUtils.stripNonCharCodepoints(val);
> >> }
> >>
> >> So the method is there and apparently invoked in the right place.
> >> Where do you think the problem could be?
> >>
> >> Thanks again for your help.
> >>
> >> On Monday 25 January 2016 03:35 PM, Markus Jelsma wrote:
> >>> Hi - this is NUTCH-1016, which was never ported to 2.x.
> >>>
> >>> https://issues.apache.org/jira/browse/NUTCH-1016
> >>>
> >>>
> >>>
> >>> -----Original message-----
> >>>> From:Kshitij Shukla <ks...@cisinlabs.com>
> >>>> Sent: Monday 25th January 2016 8:23
> >>>> To: user@nutch.apache.org
> >>>> Subject: [CIS-CMMI-3] Invalid UTF-8 character 0xffff at char exception
> >>>>
> >>>> Hello everyone,
> >>>>
> >>>> During a very large crawl when indexing to Solr this will yield the
> >>>> following exception:
> >>>>
> >>>> [log snipped; identical to the log in the original message above]
> >>>> at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
> >>>> *******************************************************
> >>>> --
> >>>>
> >>>> Please let me know if you have any questions, concerns, or updates.
> >>>> Have a great day ahead :)
> >>>>
> >>>> Thanks and Regards,
> >>>>
> >>>> Kshitij Shukla
> >>>> Software developer
> >>>>
> >>>> *Cyber Infrastructure(CIS)
> >>>> **/The RightSourcing Specialists with 1250 man years of experience!/*
> >>>>
> >>>> DISCLAIMER: INFORMATION PRIVACY is important for us, If you are not the
> >>>> intended recipient, you should delete this message and are notified that
> >>>> any disclosure, copying or distribution of this message, or taking any
> >>>> action based on it, is strictly prohibited by Law.
> >>>>
> >>>> Please don't print this e-mail unless you really need to.
> >>>>
> >>>> --
> >>>>
> >>>> ------------------------------
> >>>>
> >>>> *Cyber Infrastructure (P) Limited, [CIS] **(CMMI Level 3 Certified)*
> >>>>
> >>>> Central India's largest Technology company.
> >>>>
> >>>> *Ensuring the success of our clients and partners through our highly
> >>>> optimized Technology solutions.*
> >>>>
> >>>> www.cisin.com | +Cisin <https://plus.google.com/+Cisin/> | Linkedin
> >>>> <https://www.linkedin.com/company/cyber-infrastructure-private-limited> |
> >>>> Offices: *Indore, India.* *Singapore. Silicon Valley, USA*.
> >>>>
> >>>> DISCLAIMER: INFORMATION PRIVACY is important for us, If you are not the
> >>>> intended recipient, you should delete this message and are notified that
> >>>> any disclosure, copying or distribution of this message, or taking any
> >>>> action based on it, is strictly prohibited by Law.
> >>>>
[CIS-CMMI-3] Re: [CIS-CMMI-3] Re: [CIS-CMMI-3] Invalid UTF-8 character 0xffff at
char exception
Posted by Kshitij Shukla <ks...@cisinlabs.com>.
I have been trying to get the name of the field, but the error shown is
generic and has no field name associated with it. I looked for the field
name in the Hadoop, Nutch, and Solr logs, but none of them mention it.
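Since the workaround in SolrIndexWriter only strips non-characters from the
"content" and "title" fields, a document carrying U+FFFF in some other string
field would still trip Solr's XML parser. Below is a minimal, hypothetical
sketch (not the actual Nutch/NUTCH-1016 code; class and method names are
illustrative) of a filter that removes Unicode non-characters from any field
value:

```java
// Hypothetical sketch, not the actual Nutch patch: strip Unicode
// non-characters (U+FDD0..U+FDEF and U+nFFFE/U+nFFFF in every plane) from a
// field value before it is serialized into the Solr update XML.
public class StripNonChars {

    public static String stripNonCharCodepoints(String input) {
        StringBuilder out = new StringBuilder(input.length());
        int i = 0;
        while (i < input.length()) {
            int cp = input.codePointAt(i);
            boolean nonChar = (cp >= 0xFDD0 && cp <= 0xFDEF) // contiguous non-character block
                    || (cp & 0xFFFE) == 0xFFFE;              // U+FFFE/U+FFFF in every plane
            if (!nonChar) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp); // advance by 1 or 2 chars (surrogate pairs)
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // U+FFFF is exactly the codepoint Solr's Woodstox parser rejects.
        System.out.println(stripNonCharCodepoints("hello\uFFFFworld")); // prints "helloworld"
    }
}
```

Running such a filter over every string field of the NutchDocument, rather
than only content and title, would explain why the error can persist even
though the workaround is in place for those two fields.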
Thanks
On Monday 25 January 2016 06:10 PM, Markus Jelsma wrote:
> That is odd! Is it on your content or title field?
> Markus
>
> -----Original message-----
>> From:Kshitij Shukla <ks...@cisinlabs.com>
>> Sent: Monday 25th January 2016 11:41
>> To: user@nutch.apache.org
>> Subject: [CIS-CMMI-3] Re: [CIS-CMMI-3] Invalid UTF-8 character 0xffff at char exception
>>
>> Thanks for your response, Markus. I checked the code and found the
>> workaround you suggested in this file:
>>
>> *Source:*
>> /src/plugin/indexer-solr/src/java/org/apache/nutch/indexwriter/solr/SolrUtils.java
>>
>> and the method was called in this file:
>>
>> *Invoked:*
>> /src/plugin/indexer-solr/src/java/org/apache/nutch/indexwriter/solr/SolrIndexWriter.java
>> like this
>> if (e.getKey().equals("content") || e.getKey().equals("title")) {
>> val2 = SolrUtils.stripNonCharCodepoints(val);
>> }
>>
>> So the method exists and is apparently invoked in the right place. Where
>> do you think the problem could be?
>>
>> Thanks again for your help.
>>
>> On Monday 25 January 2016 03:35 PM, Markus Jelsma wrote:
>>> Hi - this is NUTCH-1016, which was never ported to 2.x.
>>>
>>> https://issues.apache.org/jira/browse/NUTCH-1016
>>>
>>>
>>>
>>> -----Original message-----
>>>> From:Kshitij Shukla <ks...@cisinlabs.com>
>>>> Sent: Monday 25th January 2016 8:23
>>>> To: user@nutch.apache.org
>>>> Subject: [CIS-CMMI-3] Invalid UTF-8 character 0xffff at char exception
>>>>
>>>> Hello everyone,
>>>>
>>>> During a very large crawl when indexing to Solr this will yield the
>>>> following exception:
>>>>
>>>> **************************************************
>>>> root@cism479:/usr/share/searchEngine/apache-nutch-2.3.1/runtime/deploy/bin#
>>>> /usr/share/searchEngine/apache-nutch-2.3.1/runtime/deploy/bin/nutch
>>>> index -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D
>>>> mapred.reduce.tasks.speculative.execution=false -D
>>>> mapred.map.tasks.speculative.execution=false -D
>>>> mapred.compress.map.output=true -D
>>>> solr.server.url=http://localhost:8983/solr/ddcds -all -crawlId 1
>>>> 16/01/25 11:44:52 INFO indexer.IndexingJob: IndexingJob: starting
>>>> 16/01/25 11:44:53 INFO Configuration.deprecation:
>>>> mapred.output.key.comparator.class is deprecated. Instead, use
>>>> mapreduce.job.output.key.comparator.class
>>>> 16/01/25 11:44:53 INFO plugin.PluginRepository: Plugins: looking in:
>>>> /tmp/hadoop-root/hadoop-unjar4772724649160367470/classes/plugins
>>>> 16/01/25 11:44:54 INFO plugin.PluginRepository: Plugin Auto-activation
>>>> mode: [true]
>>>> 16/01/25 11:44:54 INFO plugin.PluginRepository: Registered Plugins:
>>>> 16/01/25 11:44:54 INFO plugin.PluginRepository: HTTP Framework
>>>> (lib-http)
>>>> 16/01/25 11:44:54 INFO plugin.PluginRepository: Html Parse Plug-in
>>>> (parse-html)
>>>> 16/01/25 11:44:54 INFO plugin.PluginRepository: MetaTags
>>>> (parse-metatags)
>>>> 16/01/25 11:44:54 INFO plugin.PluginRepository: Html Indexing Filter
>>>> (index-html)
>>>> 16/01/25 11:44:54 INFO plugin.PluginRepository: the nutch core
>>>> extension points (nutch-extensionpoints)
>>>> 16/01/25 11:44:54 INFO plugin.PluginRepository: Basic Indexing
>>>> Filter (index-basic)
>>>> 16/01/25 11:44:54 INFO plugin.PluginRepository: XML Libraries (lib-xml)
>>>> 16/01/25 11:44:54 INFO plugin.PluginRepository: Anchor Indexing
>>>> Filter (index-anchor)
>>>> 16/01/25 11:44:54 INFO plugin.PluginRepository: Basic URL Normalizer
>>>> (urlnormalizer-basic)
>>>> 16/01/25 11:44:54 INFO plugin.PluginRepository: Language
>>>> Identification Parser/Filter (language-identifier)
>>>> 16/01/25 11:44:54 INFO plugin.PluginRepository: Metadata Indexing
>>>> Filter (index-metadata)
>>>> 16/01/25 11:44:54 INFO plugin.PluginRepository: CyberNeko HTML
>>>> Parser (lib-nekohtml)
>>>> 16/01/25 11:44:54 INFO plugin.PluginRepository: Subcollection
>>>> indexing and query filter (subcollection)
>>>> 16/01/25 11:44:54 INFO plugin.PluginRepository: SOLRIndexWriter
>>>> (indexer-solr)
>>>> 16/01/25 11:44:54 INFO plugin.PluginRepository: Rel-Tag microformat
>>>> Parser/Indexer/Querier (microformats-reltag)
>>>> 16/01/25 11:44:54 INFO plugin.PluginRepository: Http / Https
>>>> Protocol Plug-in (protocol-httpclient)
>>>> 16/01/25 11:44:54 INFO plugin.PluginRepository: JavaScript Parser
>>>> (parse-js)
>>>> 16/01/25 11:44:54 INFO plugin.PluginRepository: Tika Parser Plug-in
>>>> (parse-tika)
>>>> 16/01/25 11:44:54 INFO plugin.PluginRepository: Top Level Domain
>>>> Plugin (tld)
>>>> 16/01/25 11:44:54 INFO plugin.PluginRepository: Regex URL Filter
>>>> Framework (lib-regex-filter)
>>>> 16/01/25 11:44:54 INFO plugin.PluginRepository: Regex URL Normalizer
>>>> (urlnormalizer-regex)
>>>> 16/01/25 11:44:54 INFO plugin.PluginRepository: Link Analysis
>>>> Scoring Plug-in (scoring-link)
>>>> 16/01/25 11:44:54 INFO plugin.PluginRepository: OPIC Scoring Plug-in
>>>> (scoring-opic)
>>>> 16/01/25 11:44:54 INFO plugin.PluginRepository: More Indexing Filter
>>>> (index-more)
>>>> 16/01/25 11:44:54 INFO plugin.PluginRepository: Http Protocol
>>>> Plug-in (protocol-http)
>>>> 16/01/25 11:44:54 INFO plugin.PluginRepository: Creative Commons
>>>> Plugins (creativecommons)
>>>> 16/01/25 11:44:54 INFO plugin.PluginRepository: Registered Extension-Points:
>>>> 16/01/25 11:44:54 INFO plugin.PluginRepository: Parse Filter
>>>> (org.apache.nutch.parse.ParseFilter)
>>>> 16/01/25 11:44:54 INFO plugin.PluginRepository: Nutch Index Cleaning
>>>> Filter (org.apache.nutch.indexer.IndexCleaningFilter)
>>>> 16/01/25 11:44:54 INFO plugin.PluginRepository: Nutch Content Parser
>>>> (org.apache.nutch.parse.Parser)
>>>> 16/01/25 11:44:54 INFO plugin.PluginRepository: Nutch URL Filter
>>>> (org.apache.nutch.net.URLFilter)
>>>> 16/01/25 11:44:54 INFO plugin.PluginRepository: Nutch Scoring
>>>> (org.apache.nutch.scoring.ScoringFilter)
>>>> 16/01/25 11:44:54 INFO plugin.PluginRepository: Nutch URL Normalizer
>>>> (org.apache.nutch.net.URLNormalizer)
>>>> 16/01/25 11:44:54 INFO plugin.PluginRepository: Nutch Protocol
>>>> (org.apache.nutch.protocol.Protocol)
>>>> 16/01/25 11:44:54 INFO plugin.PluginRepository: Nutch Index Writer
>>>> (org.apache.nutch.indexer.IndexWriter)
>>>> 16/01/25 11:44:54 INFO plugin.PluginRepository: Nutch Indexing
>>>> Filter (org.apache.nutch.indexer.IndexingFilter)
>>>> 16/01/25 11:44:54 INFO indexer.IndexingFilters: Adding
>>>> org.apache.nutch.indexer.html.HtmlIndexingFilter
>>>> 16/01/25 11:44:54 INFO basic.BasicIndexingFilter: Maximum title length
>>>> for indexing set to: 100
>>>> 16/01/25 11:44:54 INFO indexer.IndexingFilters: Adding
>>>> org.apache.nutch.indexer.basic.BasicIndexingFilter
>>>> 16/01/25 11:44:54 INFO anchor.AnchorIndexingFilter: Anchor deduplication
>>>> is: off
>>>> 16/01/25 11:45:07 INFO mapreduce.JobSubmitter: Submitting tokens for
>>>> job: job_1453472314066_0007
>>>> 16/01/25 11:45:08 INFO impl.YarnClientImpl: Submitted application
>>>> application_1453472314066_0007
>>>> 16/01/25 11:45:09 INFO mapreduce.Job: The url to track the job:
>>>> http://cism479:8088/proxy/application_1453472314066_0007/
>>>> 16/01/25 11:45:09 INFO mapreduce.Job: Running job: job_1453472314066_0007
>>>> 16/01/25 11:45:29 INFO mapreduce.Job: Job job_1453472314066_0007 running
>>>> in uber mode : false
>>>> 16/01/25 11:45:29 INFO mapreduce.Job: map 0% reduce 0%
>>>> 16/01/25 11:49:24 INFO mapreduce.Job: map 50% reduce 0%
>>>> 16/01/25 11:49:29 INFO mapreduce.Job: map 0% reduce 0%
>>>> 16/01/25 11:49:29 INFO mapreduce.Job: Task Id :
>>>> attempt_1453472314066_0007_m_000000_0, Status : FAILED
>>>> Error:
>>>> org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
>>>> [com.ctc.wstx.exc.WstxLazyException] Invalid UTF-8 character 0xffff at
>>>> char #1296459, byte #1310719)
>>>> at
>>>> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:491)
>>>> at
>>>> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:197)
>>>> at
>>>> org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
>>>> at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:68)
>>>> at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:54)
>>>> at
>>>> org.apache.nutch.indexwriter.solr.SolrIndexWriter.write(SolrIndexWriter.java:84)
>>>> at org.apache.nutch.indexer.IndexWriters.write(IndexWriters.java:84)
>>>> at
>>>> org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:48)
>>>> at
>>>> org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:43)
>>>> at
>>>> org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.write(MapTask.java:635)
>>>> at
>>>> org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:89)
>>>> at
>>>> org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.write(WrappedMapper.java:112)
>>>> at
>>>> org.apache.nutch.indexer.IndexingJob$IndexerMapper.map(IndexingJob.java:120)
>>>> at
>>>> org.apache.nutch.indexer.IndexingJob$IndexerMapper.map(IndexingJob.java:69)
>>>> at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
>>>> at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
>>>> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
>>>> at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
>>>> at java.security.AccessController.doPrivileged(Native Method)
>>>> at javax.security.auth.Subject.doAs(Subject.java:422)
>>>> at
>>>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
>>>> at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
>>>>
>>>> 16/01/25 11:52:27 INFO mapreduce.Job: map 50% reduce 0%
>>>> 16/01/25 11:53:01 INFO mapreduce.Job: map 100% reduce 0%
>>>> 16/01/25 11:53:01 INFO mapreduce.Job: Task Id :
>>>> attempt_1453472314066_0007_m_000000_1, Status : FAILED
>>>> Error:
>>>> org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
>>>> [com.ctc.wstx.exc.WstxLazyException] Invalid UTF-8 character 0xffff at
>>>> char #1296459, byte #1310719)
>>>> at
>>>> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:491)
>>>> at
>>>> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:197)
>>>> at
>>>> org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
>>>> at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:68)
>>>> at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:54)
>>>> at
>>>> org.apache.nutch.indexwriter.solr.SolrIndexWriter.write(SolrIndexWriter.java:84)
>>>> at org.apache.nutch.indexer.IndexWriters.write(IndexWriters.java:84)
>>>> at
>>>> org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:48)
>>>> at
>>>> org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:43)
>>>> at
>>>> org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.write(MapTask.java:635)
>>>> at
>>>> org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:89)
>>>> at
>>>> org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.write(WrappedMapper.java:112)
>>>> at
>>>> org.apache.nutch.indexer.IndexingJob$IndexerMapper.map(IndexingJob.java:120)
>>>> at
>>>> org.apache.nutch.indexer.IndexingJob$IndexerMapper.map(IndexingJob.java:69)
>>>> at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
>>>> at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
>>>> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
>>>> at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
>>>> at java.security.AccessController.doPrivileged(Native Method)
>>>> at javax.security.auth.Subject.doAs(Subject.java:422)
>>>> at
>>>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
>>>> at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
>>>>
>>>> 16/01/25 11:53:02 INFO mapreduce.Job: map 50% reduce 0%
>>>> 16/01/25 11:54:52 INFO mapreduce.Job: map 100% reduce 0%
>>>> 16/01/25 11:54:52 INFO mapreduce.Job: Task Id :
>>>> attempt_1453472314066_0007_m_000000_2, Status : FAILED
>>>> Error:
>>>> org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
>>>> [com.ctc.wstx.exc.WstxLazyException] Invalid UTF-8 character 0xffff at
>>>> char #1296459, byte #1310719)
>>>> at
>>>> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:491)
>>>> at
>>>> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:197)
>>>> at
>>>> org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
>>>> at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:68)
>>>> at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:54)
>>>> at
>>>> org.apache.nutch.indexwriter.solr.SolrIndexWriter.write(SolrIndexWriter.java:84)
>>>> at org.apache.nutch.indexer.IndexWriters.write(IndexWriters.java:84)
>>>> at
>>>> org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:48)
>>>> at
>>>> org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:43)
>>>> at
>>>> org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.write(MapTask.java:635)
>>>> at
>>>> org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:89)
>>>> at
>>>> org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.write(WrappedMapper.java:112)
>>>> at
>>>> org.apache.nutch.indexer.IndexingJob$IndexerMapper.map(IndexingJob.java:120)
>>>> at
>>>> org.apache.nutch.indexer.IndexingJob$IndexerMapper.map(IndexingJob.java:69)
>>>> at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
>>>> at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
>>>> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
>>>> at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
>>>> at java.security.AccessController.doPrivileged(Native Method)
>>>> at javax.security.auth.Subject.doAs(Subject.java:422)
>>>> at
>>>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
>>>> at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
>>>>
>>>> 16/01/25 11:54:53 INFO mapreduce.Job: map 50% reduce 0%
>>>> 16/01/25 11:56:22 INFO mapreduce.Job: map 100% reduce 0%
>>>> 16/01/25 11:56:23 INFO mapreduce.Job: Job job_1453472314066_0007 failed
>>>> with state FAILED due to: Task failed task_1453472314066_0007_m_000000
>>>> Job failed as tasks failed. failedMaps:1 failedReduces:0
>>>>
>>>> 16/01/25 11:56:23 INFO mapreduce.Job: Counters: 33
>>>> File System Counters
>>>> FILE: Number of bytes read=0
>>>> FILE: Number of bytes written=116194
>>>> FILE: Number of read operations=0
>>>> FILE: Number of large read operations=0
>>>> FILE: Number of write operations=0
>>>> HDFS: Number of bytes read=1033
>>>> HDFS: Number of bytes written=0
>>>> HDFS: Number of read operations=1
>>>> HDFS: Number of large read operations=0
>>>> HDFS: Number of write operations=0
>>>> Job Counters
>>>> Failed map tasks=4
>>>> Launched map tasks=5
>>>> Other local map tasks=3
>>>> Data-local map tasks=2
>>>> Total time spent by all maps in occupied slots (ms)=3168342
>>>> Total time spent by all reduces in occupied slots (ms)=0
>>>> Total time spent by all map tasks (ms)=1056114
>>>> Total vcore-seconds taken by all map tasks=1056114
>>>> Total megabyte-seconds taken by all map tasks=3244382208
>>>> Map-Reduce Framework
>>>> Map input records=2762511
>>>> Map output records=17629
>>>> Input split bytes=1033
>>>> Spilled Records=0
>>>> Failed Shuffles=0
>>>> Merged Map outputs=0
>>>> GC time elapsed (ms)=2995
>>>> CPU time spent (ms)=116860
>>>> Physical memory (bytes) snapshot=1272868864
>>>> Virtual memory (bytes) snapshot=5104431104
>>>> Total committed heap usage (bytes)=1017118720
>>>> IndexerJob
>>>> DocumentCount=17629
>>>> File Input Format Counters
>>>> Bytes Read=0
>>>> File Output Format Counters
>>>> Bytes Written=0
>>>> 16/01/25 11:56:23 ERROR indexer.IndexingJob: SolrIndexerJob:
>>>> java.lang.RuntimeException: job failed: name=[1]Indexer,
>>>> jobid=job_1453472314066_0007
>>>> at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:120)
>>>> at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:154)
>>>> at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:176)
>>>> at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:202)
>>>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>>>> at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:211)
>>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>> at
>>>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>>>> at
>>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>> at java.lang.reflect.Method.invoke(Method.java:497)
>>>> at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
>>>> *******************************************************
RE: [CIS-CMMI-3] Re: [CIS-CMMI-3] Invalid UTF-8 character 0xffff at
char exception
Posted by Markus Jelsma <ma...@openindex.io>.
That is odd! Is it on your content or title field?
Markus
-----Original message-----
> From:Kshitij Shukla <ks...@cisinlabs.com>
> Sent: Monday 25th January 2016 11:41
> To: user@nutch.apache.org
> Subject: [CIS-CMMI-3] Re: [CIS-CMMI-3] Invalid UTF-8 character 0xffff at char exception
>
> Thanks for your response, Markus. I checked the code and found the
> workaround you suggested in this file:
>
> *Source:*
> /src/plugin/indexer-solr/src/java/org/apache/nutch/indexwriter/solr/SolrUtils.java
>
> and the method was called in this file:
>
> *Invoked:*
> /src/plugin/indexer-solr/src/java/org/apache/nutch/indexwriter/solr/SolrIndexWriter.java
> like this
> if (e.getKey().equals("content") || e.getKey().equals("title")) {
> val2 = SolrUtils.stripNonCharCodepoints(val);
> }
>
> So the method exists and is apparently invoked in the right place. Where
> do you think the problem could be?
>
> Thanks again for your help.
>
> On Monday 25 January 2016 03:35 PM, Markus Jelsma wrote:
> > Hi - this is NUTCH-1016, which was never ported to 2.x.
> >
> > https://issues.apache.org/jira/browse/NUTCH-1016
> >
> >
> >
> > -----Original message-----
> >> From:Kshitij Shukla <ks...@cisinlabs.com>
> >> Sent: Monday 25th January 2016 8:23
> >> To: user@nutch.apache.org
> >> Subject: [CIS-CMMI-3] Invalid UTF-8 character 0xffff at char exception
> >>
> >> Hello everyone,
> >>
> >> During a very large crawl when indexing to Solr this will yield the
> >> following exception:
> >>
> >> **************************************************
> >> root@cism479:/usr/share/searchEngine/apache-nutch-2.3.1/runtime/deploy/bin#
> >> /usr/share/searchEngine/apache-nutch-2.3.1/runtime/deploy/bin/nutch
> >> index -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D
> >> mapred.reduce.tasks.speculative.execution=false -D
> >> mapred.map.tasks.speculative.execution=false -D
> >> mapred.compress.map.output=true -D
> >> solr.server.url=http://localhost:8983/solr/ddcds -all -crawlId 1
> >> 16/01/25 11:44:52 INFO indexer.IndexingJob: IndexingJob: starting
> >> 16/01/25 11:44:53 INFO Configuration.deprecation:
> >> mapred.output.key.comparator.class is deprecated. Instead, use
> >> mapreduce.job.output.key.comparator.class
> >> 16/01/25 11:44:53 INFO plugin.PluginRepository: Plugins: looking in:
> >> /tmp/hadoop-root/hadoop-unjar4772724649160367470/classes/plugins
> >> 16/01/25 11:44:54 INFO plugin.PluginRepository: Plugin Auto-activation
> >> mode: [true]
> >> 16/01/25 11:44:54 INFO plugin.PluginRepository: Registered Plugins:
> >> 16/01/25 11:44:54 INFO plugin.PluginRepository: HTTP Framework
> >> (lib-http)
> >> 16/01/25 11:44:54 INFO plugin.PluginRepository: Html Parse Plug-in
> >> (parse-html)
> >> 16/01/25 11:44:54 INFO plugin.PluginRepository: MetaTags
> >> (parse-metatags)
> >> 16/01/25 11:44:54 INFO plugin.PluginRepository: Html Indexing Filter
> >> (index-html)
> >> 16/01/25 11:44:54 INFO plugin.PluginRepository: the nutch core
> >> extension points (nutch-extensionpoints)
> >> 16/01/25 11:44:54 INFO plugin.PluginRepository: Basic Indexing
> >> Filter (index-basic)
> >> 16/01/25 11:44:54 INFO plugin.PluginRepository: XML Libraries (lib-xml)
> >> 16/01/25 11:44:54 INFO plugin.PluginRepository: Anchor Indexing
> >> Filter (index-anchor)
> >> 16/01/25 11:44:54 INFO plugin.PluginRepository: Basic URL Normalizer
> >> (urlnormalizer-basic)
> >> 16/01/25 11:44:54 INFO plugin.PluginRepository: Language
> >> Identification Parser/Filter (language-identifier)
> >> 16/01/25 11:44:54 INFO plugin.PluginRepository: Metadata Indexing
> >> Filter (index-metadata)
> >> 16/01/25 11:44:54 INFO plugin.PluginRepository: CyberNeko HTML
> >> Parser (lib-nekohtml)
> >> 16/01/25 11:44:54 INFO plugin.PluginRepository: Subcollection
> >> indexing and query filter (subcollection)
> >> 16/01/25 11:44:54 INFO plugin.PluginRepository: SOLRIndexWriter
> >> (indexer-solr)
> >> 16/01/25 11:44:54 INFO plugin.PluginRepository: Rel-Tag microformat
> >> Parser/Indexer/Querier (microformats-reltag)
> >> 16/01/25 11:44:54 INFO plugin.PluginRepository: Http / Https
> >> Protocol Plug-in (protocol-httpclient)
> >> 16/01/25 11:44:54 INFO plugin.PluginRepository: JavaScript Parser
> >> (parse-js)
> >> 16/01/25 11:44:54 INFO plugin.PluginRepository: Tika Parser Plug-in
> >> (parse-tika)
> >> 16/01/25 11:44:54 INFO plugin.PluginRepository: Top Level Domain
> >> Plugin (tld)
> >> 16/01/25 11:44:54 INFO plugin.PluginRepository: Regex URL Filter
> >> Framework (lib-regex-filter)
> >> 16/01/25 11:44:54 INFO plugin.PluginRepository: Regex URL Normalizer
> >> (urlnormalizer-regex)
> >> 16/01/25 11:44:54 INFO plugin.PluginRepository: Link Analysis
> >> Scoring Plug-in (scoring-link)
> >> 16/01/25 11:44:54 INFO plugin.PluginRepository: OPIC Scoring Plug-in
> >> (scoring-opic)
> >> 16/01/25 11:44:54 INFO plugin.PluginRepository: More Indexing Filter
> >> (index-more)
> >> 16/01/25 11:44:54 INFO plugin.PluginRepository: Http Protocol
> >> Plug-in (protocol-http)
> >> 16/01/25 11:44:54 INFO plugin.PluginRepository: Creative Commons
> >> Plugins (creativecommons)
> >> 16/01/25 11:44:54 INFO plugin.PluginRepository: Registered Extension-Points:
> >> 16/01/25 11:44:54 INFO plugin.PluginRepository: Parse Filter
> >> (org.apache.nutch.parse.ParseFilter)
> >> 16/01/25 11:44:54 INFO plugin.PluginRepository: Nutch Index Cleaning
> >> Filter (org.apache.nutch.indexer.IndexCleaningFilter)
> >> 16/01/25 11:44:54 INFO plugin.PluginRepository: Nutch Content Parser
> >> (org.apache.nutch.parse.Parser)
> >> 16/01/25 11:44:54 INFO plugin.PluginRepository: Nutch URL Filter
> >> (org.apache.nutch.net.URLFilter)
> >> 16/01/25 11:44:54 INFO plugin.PluginRepository: Nutch Scoring
> >> (org.apache.nutch.scoring.ScoringFilter)
> >> 16/01/25 11:44:54 INFO plugin.PluginRepository: Nutch URL Normalizer
> >> (org.apache.nutch.net.URLNormalizer)
> >> 16/01/25 11:44:54 INFO plugin.PluginRepository: Nutch Protocol
> >> (org.apache.nutch.protocol.Protocol)
> >> 16/01/25 11:44:54 INFO plugin.PluginRepository: Nutch Index Writer
> >> (org.apache.nutch.indexer.IndexWriter)
> >> 16/01/25 11:44:54 INFO plugin.PluginRepository: Nutch Indexing
> >> Filter (org.apache.nutch.indexer.IndexingFilter)
> >> 16/01/25 11:44:54 INFO indexer.IndexingFilters: Adding
> >> org.apache.nutch.indexer.html.HtmlIndexingFilter
> >> 16/01/25 11:44:54 INFO basic.BasicIndexingFilter: Maximum title length
> >> for indexing set to: 100
> >> 16/01/25 11:44:54 INFO indexer.IndexingFilters: Adding
> >> org.apache.nutch.indexer.basic.BasicIndexingFilter
> >> 16/01/25 11:44:54 INFO anchor.AnchorIndexingFilter: Anchor deduplication
> >> is: off
> >> 16/01/25 11:45:07 INFO mapreduce.JobSubmitter: Submitting tokens for
> >> job: job_1453472314066_0007
> >> 16/01/25 11:45:08 INFO impl.YarnClientImpl: Submitted application
> >> application_1453472314066_0007
> >> 16/01/25 11:45:09 INFO mapreduce.Job: The url to track the job:
> >> http://cism479:8088/proxy/application_1453472314066_0007/
> >> 16/01/25 11:45:09 INFO mapreduce.Job: Running job: job_1453472314066_0007
> >> 16/01/25 11:45:29 INFO mapreduce.Job: Job job_1453472314066_0007 running
> >> in uber mode : false
> >> 16/01/25 11:45:29 INFO mapreduce.Job: map 0% reduce 0%
> >> 16/01/25 11:49:24 INFO mapreduce.Job: map 50% reduce 0%
> >> 16/01/25 11:49:29 INFO mapreduce.Job: map 0% reduce 0%
> >> 16/01/25 11:49:29 INFO mapreduce.Job: Task Id :
> >> attempt_1453472314066_0007_m_000000_0, Status : FAILED
> >> Error:
> >> org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
> >> [com.ctc.wstx.exc.WstxLazyException] Invalid UTF-8 character 0xffff at
> >> char #1296459, byte #1310719)
> >> at
> >> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:491)
> >> at
> >> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:197)
> >> at
> >> org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
> >> at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:68)
> >> at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:54)
> >> at
> >> org.apache.nutch.indexwriter.solr.SolrIndexWriter.write(SolrIndexWriter.java:84)
> >> at org.apache.nutch.indexer.IndexWriters.write(IndexWriters.java:84)
> >> at
> >> org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:48)
> >> at
> >> org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:43)
> >> at
> >> org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.write(MapTask.java:635)
> >> at
> >> org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:89)
> >> at
> >> org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.write(WrappedMapper.java:112)
> >> at
> >> org.apache.nutch.indexer.IndexingJob$IndexerMapper.map(IndexingJob.java:120)
> >> at
> >> org.apache.nutch.indexer.IndexingJob$IndexerMapper.map(IndexingJob.java:69)
> >> at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
> >> at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
> >> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
> >> at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
> >> at java.security.AccessController.doPrivileged(Native Method)
> >> at javax.security.auth.Subject.doAs(Subject.java:422)
> >> at
> >> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
> >> at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
> >>
> >> 16/01/25 11:52:27 INFO mapreduce.Job: map 50% reduce 0%
> >> 16/01/25 11:53:01 INFO mapreduce.Job: map 100% reduce 0%
> >> 16/01/25 11:53:01 INFO mapreduce.Job: Task Id :
> >> attempt_1453472314066_0007_m_000000_1, Status : FAILED
> >> Error:
> >> org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
> >> [com.ctc.wstx.exc.WstxLazyException] Invalid UTF-8 character 0xffff at
> >> char #1296459, byte #1310719)
> >> at
> >> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:491)
> >> at
> >> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:197)
> >> at
> >> org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
> >> at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:68)
> >> at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:54)
> >> at
> >> org.apache.nutch.indexwriter.solr.SolrIndexWriter.write(SolrIndexWriter.java:84)
> >> at org.apache.nutch.indexer.IndexWriters.write(IndexWriters.java:84)
> >> at
> >> org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:48)
> >> at
> >> org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:43)
> >> at
> >> org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.write(MapTask.java:635)
> >> at
> >> org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:89)
> >> at
> >> org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.write(WrappedMapper.java:112)
> >> at
> >> org.apache.nutch.indexer.IndexingJob$IndexerMapper.map(IndexingJob.java:120)
> >> at
> >> org.apache.nutch.indexer.IndexingJob$IndexerMapper.map(IndexingJob.java:69)
> >> at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
> >> at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
> >> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
> >> at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
> >> at java.security.AccessController.doPrivileged(Native Method)
> >> at javax.security.auth.Subject.doAs(Subject.java:422)
> >> at
> >> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
> >> at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
> >>
> >> 16/01/25 11:53:02 INFO mapreduce.Job: map 50% reduce 0%
> >> 16/01/25 11:54:52 INFO mapreduce.Job: map 100% reduce 0%
> >> 16/01/25 11:54:52 INFO mapreduce.Job: Task Id :
> >> attempt_1453472314066_0007_m_000000_2, Status : FAILED
> >> Error:
> >> org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
> >> [com.ctc.wstx.exc.WstxLazyException] Invalid UTF-8 character 0xffff at
> >> char #1296459, byte #1310719)
> >> at
> >> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:491)
> >> at
> >> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:197)
> >> at
> >> org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
> >> at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:68)
> >> at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:54)
> >> at
> >> org.apache.nutch.indexwriter.solr.SolrIndexWriter.write(SolrIndexWriter.java:84)
> >> at org.apache.nutch.indexer.IndexWriters.write(IndexWriters.java:84)
> >> at
> >> org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:48)
> >> at
> >> org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:43)
> >> at
> >> org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.write(MapTask.java:635)
> >> at
> >> org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:89)
> >> at
> >> org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.write(WrappedMapper.java:112)
> >> at
> >> org.apache.nutch.indexer.IndexingJob$IndexerMapper.map(IndexingJob.java:120)
> >> at
> >> org.apache.nutch.indexer.IndexingJob$IndexerMapper.map(IndexingJob.java:69)
> >> at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
> >> at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
> >> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
> >> at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
> >> at java.security.AccessController.doPrivileged(Native Method)
> >> at javax.security.auth.Subject.doAs(Subject.java:422)
> >> at
> >> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
> >> at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
> >>
> >> 16/01/25 11:54:53 INFO mapreduce.Job: map 50% reduce 0%
> >> 16/01/25 11:56:22 INFO mapreduce.Job: map 100% reduce 0%
> >> 16/01/25 11:56:23 INFO mapreduce.Job: Job job_1453472314066_0007 failed
> >> with state FAILED due to: Task failed task_1453472314066_0007_m_000000
> >> Job failed as tasks failed. failedMaps:1 failedReduces:0
> >>
> >> 16/01/25 11:56:23 INFO mapreduce.Job: Counters: 33
> >> File System Counters
> >> FILE: Number of bytes read=0
> >> FILE: Number of bytes written=116194
> >> FILE: Number of read operations=0
> >> FILE: Number of large read operations=0
> >> FILE: Number of write operations=0
> >> HDFS: Number of bytes read=1033
> >> HDFS: Number of bytes written=0
> >> HDFS: Number of read operations=1
> >> HDFS: Number of large read operations=0
> >> HDFS: Number of write operations=0
> >> Job Counters
> >> Failed map tasks=4
> >> Launched map tasks=5
> >> Other local map tasks=3
> >> Data-local map tasks=2
> >> Total time spent by all maps in occupied slots (ms)=3168342
> >> Total time spent by all reduces in occupied slots (ms)=0
> >> Total time spent by all map tasks (ms)=1056114
> >> Total vcore-seconds taken by all map tasks=1056114
> >> Total megabyte-seconds taken by all map tasks=3244382208
> >> Map-Reduce Framework
> >> Map input records=2762511
> >> Map output records=17629
> >> Input split bytes=1033
> >> Spilled Records=0
> >> Failed Shuffles=0
> >> Merged Map outputs=0
> >> GC time elapsed (ms)=2995
> >> CPU time spent (ms)=116860
> >> Physical memory (bytes) snapshot=1272868864
> >> Virtual memory (bytes) snapshot=5104431104
> >> Total committed heap usage (bytes)=1017118720
> >> IndexerJob
> >> DocumentCount=17629
> >> File Input Format Counters
> >> Bytes Read=0
> >> File Output Format Counters
> >> Bytes Written=0
> >> 16/01/25 11:56:23 ERROR indexer.IndexingJob: SolrIndexerJob:
> >> java.lang.RuntimeException: job failed: name=[1]Indexer,
> >> jobid=job_1453472314066_0007
> >> at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:120)
> >> at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:154)
> >> at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:176)
> >> at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:202)
> >> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
> >> at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:211)
> >> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> >> at
> >> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> >> at
> >> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> >> at java.lang.reflect.Method.invoke(Method.java:497)
> >> at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
> >> *******************************************************
> >> --
> >>
> >> Please let me know if you have any questions , concerns or updates.
> >> Have a great day ahead :)
> >>
> >> Thanks and Regards,
> >>
> >> Kshitij Shukla
> >> Software developer
> >>
> >> *Cyber Infrastructure(CIS)
> >> **/The RightSourcing Specialists with 1250 man years of experience!/*
> >>
> >> DISCLAIMER: INFORMATION PRIVACY is important for us, If you are not the
> >> intended recipient, you should delete this message and are notified that
> >> any disclosure, copying or distribution of this message, or taking any
> >> action based on it, is strictly prohibited by Law.
> >>
> >> Please don't print this e-mail unless you really need to.
> >>
> >> --
> >>
> >> ------------------------------
> >>
> >> *Cyber Infrastructure (P) Limited, [CIS] **(CMMI Level 3 Certified)*
> >>
> >> Central India's largest Technology company.
> >>
> >> *Ensuring the success of our clients and partners through our highly
> >> optimized Technology solutions.*
> >>
> >> www.cisin.com | +Cisin <https://plus.google.com/+Cisin/> | Linkedin
> >> <https://www.linkedin.com/company/cyber-infrastructure-private-limited> |
> >> Offices: *Indore, India.* *Singapore. Silicon Valley, USA*.
> >>
> >> DISCLAIMER: INFORMATION PRIVACY is important for us, If you are not the
> >> intended recipient, you should delete this message and are notified that
> >> any disclosure, copying or distribution of this message, or taking any
> >> action based on it, is strictly prohibited by Law.
> >>
>
>
[CIS-CMMI-3] Re: [CIS-CMMI-3] Invalid UTF-8 character 0xffff at char exception
Posted by Kshitij Shukla <ks...@cisinlabs.com>.
Thanks for your response, Markus. I checked the code and found the workaround you suggested in this file:
*Source:*
/src/plugin/indexer-solr/src/java/org/apache/nutch/indexwriter/solr/SolrUtils.java
and the method is invoked in this file:
*Invoked:*
/src/plugin/indexer-solr/src/java/org/apache/nutch/indexwriter/solr/SolrIndexWriter.java
like this:
if (e.getKey().equals("content") || e.getKey().equals("title")) {
    val2 = SolrUtils.stripNonCharCodepoints(val);
}
So the method is there and is apparently invoked in the right place. Where do you think the problem could be?
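For reference, the technique in question — dropping Unicode non-characters such as U+FFFF from a field value before sending it to Solr — can be sketched as a standalone class. This is an illustration of the idea, not the actual Nutch SolrUtils source; the class name is hypothetical, and for simplicity it only handles BMP code units (supplementary non-characters like U+1FFFE would need code-point iteration):

```java
// Sketch: strip Unicode non-characters from a string so Solr's XML
// parser (Woodstox) does not reject the update request.
public class StripNonChars {

    public static String stripNonCharCodepoints(String input) {
        StringBuilder sb = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            // Skip the non-character block U+FDD0..U+FDEF.
            if (c >= 0xFDD0 && c <= 0xFDEF) continue;
            // Skip U+FFFE and U+FFFF (low bit masked off leaves 0xFFFE).
            if ((c & 0xFFFE) == 0xFFFE) continue;
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String dirty = "title\uFFFFwith\uFDD0junk";
        System.out.println(stripNonCharCodepoints(dirty)); // prints "titlewithjunk"
    }
}
```

Note that in Nutch the filtering is only applied to selected fields ("content" and "title" in the snippet above), so a bad character arriving through any other field would still reach Solr unfiltered.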
Thanks again for your help.
On Monday 25 January 2016 03:35 PM, Markus Jelsma wrote:
> Hi - this is NUTCH-1016, which was never ported to 2.x.
>
> https://issues.apache.org/jira/browse/NUTCH-1016
>
>
>
> -----Original message-----
>> From:Kshitij Shukla <ks...@cisinlabs.com>
>> Sent: Monday 25th January 2016 8:23
>> To: user@nutch.apache.org
>> Subject: [CIS-CMMI-3] Invalid UTF-8 character 0xffff at char exception
>>
>> Hello everyone,
>>
>> During a very large crawl when indexing to Solr this will yield the
>> following exception:
>>
>> **************************************************
>> root@cism479:/usr/share/searchEngine/apache-nutch-2.3.1/runtime/deploy/bin#
>> /usr/share/searchEngine/apache-nutch-2.3.1/runtime/deploy/bin/nutch
>> index -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D
>> mapred.reduce.tasks.speculative.execution=false -D
>> mapred.map.tasks.speculative.execution=false -D
>> mapred.compress.map.output=true -D
>> solr.server.url=http://localhost:8983/solr/ddcds -all -crawlId 1
>> 16/01/25 11:44:52 INFO indexer.IndexingJob: IndexingJob: starting
>> 16/01/25 11:44:53 INFO Configuration.deprecation:
>> mapred.output.key.comparator.class is deprecated. Instead, use
>> mapreduce.job.output.key.comparator.class
>> 16/01/25 11:44:53 INFO plugin.PluginRepository: Plugins: looking in:
>> /tmp/hadoop-root/hadoop-unjar4772724649160367470/classes/plugins
>> 16/01/25 11:44:54 INFO plugin.PluginRepository: Plugin Auto-activation
>> mode: [true]
>> 16/01/25 11:44:54 INFO plugin.PluginRepository: Registered Plugins:
>> 16/01/25 11:44:54 INFO plugin.PluginRepository: HTTP Framework
>> (lib-http)
>> 16/01/25 11:44:54 INFO plugin.PluginRepository: Html Parse Plug-in
>> (parse-html)
>> 16/01/25 11:44:54 INFO plugin.PluginRepository: MetaTags
>> (parse-metatags)
>> 16/01/25 11:44:54 INFO plugin.PluginRepository: Html Indexing Filter
>> (index-html)
>> 16/01/25 11:44:54 INFO plugin.PluginRepository: the nutch core
>> extension points (nutch-extensionpoints)
>> 16/01/25 11:44:54 INFO plugin.PluginRepository: Basic Indexing
>> Filter (index-basic)
>> 16/01/25 11:44:54 INFO plugin.PluginRepository: XML Libraries (lib-xml)
>> 16/01/25 11:44:54 INFO plugin.PluginRepository: Anchor Indexing
>> Filter (index-anchor)
>> 16/01/25 11:44:54 INFO plugin.PluginRepository: Basic URL Normalizer
>> (urlnormalizer-basic)
>> 16/01/25 11:44:54 INFO plugin.PluginRepository: Language
>> Identification Parser/Filter (language-identifier)
>> 16/01/25 11:44:54 INFO plugin.PluginRepository: Metadata Indexing
>> Filter (index-metadata)
>> 16/01/25 11:44:54 INFO plugin.PluginRepository: CyberNeko HTML
>> Parser (lib-nekohtml)
>> 16/01/25 11:44:54 INFO plugin.PluginRepository: Subcollection
>> indexing and query filter (subcollection)
>> 16/01/25 11:44:54 INFO plugin.PluginRepository: SOLRIndexWriter
>> (indexer-solr)
>> 16/01/25 11:44:54 INFO plugin.PluginRepository: Rel-Tag microformat
>> Parser/Indexer/Querier (microformats-reltag)
>> 16/01/25 11:44:54 INFO plugin.PluginRepository: Http / Https
>> Protocol Plug-in (protocol-httpclient)
>> 16/01/25 11:44:54 INFO plugin.PluginRepository: JavaScript Parser
>> (parse-js)
>> 16/01/25 11:44:54 INFO plugin.PluginRepository: Tika Parser Plug-in
>> (parse-tika)
>> 16/01/25 11:44:54 INFO plugin.PluginRepository: Top Level Domain
>> Plugin (tld)
>> 16/01/25 11:44:54 INFO plugin.PluginRepository: Regex URL Filter
>> Framework (lib-regex-filter)
>> 16/01/25 11:44:54 INFO plugin.PluginRepository: Regex URL Normalizer
>> (urlnormalizer-regex)
>> 16/01/25 11:44:54 INFO plugin.PluginRepository: Link Analysis
>> Scoring Plug-in (scoring-link)
>> 16/01/25 11:44:54 INFO plugin.PluginRepository: OPIC Scoring Plug-in
>> (scoring-opic)
>> 16/01/25 11:44:54 INFO plugin.PluginRepository: More Indexing Filter
>> (index-more)
>> 16/01/25 11:44:54 INFO plugin.PluginRepository: Http Protocol
>> Plug-in (protocol-http)
>> 16/01/25 11:44:54 INFO plugin.PluginRepository: Creative Commons
>> Plugins (creativecommons)
>> 16/01/25 11:44:54 INFO plugin.PluginRepository: Registered Extension-Points:
>> 16/01/25 11:44:54 INFO plugin.PluginRepository: Parse Filter
>> (org.apache.nutch.parse.ParseFilter)
>> 16/01/25 11:44:54 INFO plugin.PluginRepository: Nutch Index Cleaning
>> Filter (org.apache.nutch.indexer.IndexCleaningFilter)
>> 16/01/25 11:44:54 INFO plugin.PluginRepository: Nutch Content Parser
>> (org.apache.nutch.parse.Parser)
>> 16/01/25 11:44:54 INFO plugin.PluginRepository: Nutch URL Filter
>> (org.apache.nutch.net.URLFilter)
>> 16/01/25 11:44:54 INFO plugin.PluginRepository: Nutch Scoring
>> (org.apache.nutch.scoring.ScoringFilter)
>> 16/01/25 11:44:54 INFO plugin.PluginRepository: Nutch URL Normalizer
>> (org.apache.nutch.net.URLNormalizer)
>> 16/01/25 11:44:54 INFO plugin.PluginRepository: Nutch Protocol
>> (org.apache.nutch.protocol.Protocol)
>> 16/01/25 11:44:54 INFO plugin.PluginRepository: Nutch Index Writer
>> (org.apache.nutch.indexer.IndexWriter)
>> 16/01/25 11:44:54 INFO plugin.PluginRepository: Nutch Indexing
>> Filter (org.apache.nutch.indexer.IndexingFilter)
>> 16/01/25 11:44:54 INFO indexer.IndexingFilters: Adding
>> org.apache.nutch.indexer.html.HtmlIndexingFilter
>> 16/01/25 11:44:54 INFO basic.BasicIndexingFilter: Maximum title length
>> for indexing set to: 100
>> 16/01/25 11:44:54 INFO indexer.IndexingFilters: Adding
>> org.apache.nutch.indexer.basic.BasicIndexingFilter
>> 16/01/25 11:44:54 INFO anchor.AnchorIndexingFilter: Anchor deduplication
>> is: off
>> 16/01/25 11:45:07 INFO mapreduce.JobSubmitter: Submitting tokens for
>> job: job_1453472314066_0007
>> 16/01/25 11:45:08 INFO impl.YarnClientImpl: Submitted application
>> application_1453472314066_0007
>> 16/01/25 11:45:09 INFO mapreduce.Job: The url to track the job:
>> http://cism479:8088/proxy/application_1453472314066_0007/
>> 16/01/25 11:45:09 INFO mapreduce.Job: Running job: job_1453472314066_0007
>> 16/01/25 11:45:29 INFO mapreduce.Job: Job job_1453472314066_0007 running
>> in uber mode : false
>> 16/01/25 11:45:29 INFO mapreduce.Job: map 0% reduce 0%
>> 16/01/25 11:49:24 INFO mapreduce.Job: map 50% reduce 0%
>> 16/01/25 11:49:29 INFO mapreduce.Job: map 0% reduce 0%
>> 16/01/25 11:49:29 INFO mapreduce.Job: Task Id :
>> attempt_1453472314066_0007_m_000000_0, Status : FAILED
>> Error:
>> org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
>> [com.ctc.wstx.exc.WstxLazyException] Invalid UTF-8 character 0xffff at
>> char #1296459, byte #1310719)
>> at
>> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:491)
>> at
>> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:197)
>> at
>> org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
>> at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:68)
>> at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:54)
>> at
>> org.apache.nutch.indexwriter.solr.SolrIndexWriter.write(SolrIndexWriter.java:84)
>> at org.apache.nutch.indexer.IndexWriters.write(IndexWriters.java:84)
>> at
>> org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:48)
>> at
>> org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:43)
>> at
>> org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.write(MapTask.java:635)
>> at
>> org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:89)
>> at
>> org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.write(WrappedMapper.java:112)
>> at
>> org.apache.nutch.indexer.IndexingJob$IndexerMapper.map(IndexingJob.java:120)
>> at
>> org.apache.nutch.indexer.IndexingJob$IndexerMapper.map(IndexingJob.java:69)
>> at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
>> at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
>> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
>> at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
>> at java.security.AccessController.doPrivileged(Native Method)
>> at javax.security.auth.Subject.doAs(Subject.java:422)
>> at
>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
>> at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
>>
>> 16/01/25 11:52:27 INFO mapreduce.Job: map 50% reduce 0%
>> 16/01/25 11:53:01 INFO mapreduce.Job: map 100% reduce 0%
>> 16/01/25 11:53:01 INFO mapreduce.Job: Task Id :
>> attempt_1453472314066_0007_m_000000_1, Status : FAILED
>> Error:
>> org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
>> [com.ctc.wstx.exc.WstxLazyException] Invalid UTF-8 character 0xffff at
>> char #1296459, byte #1310719)
>> at
>> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:491)
>> at
>> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:197)
>> at
>> org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
>> at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:68)
>> at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:54)
>> at
>> org.apache.nutch.indexwriter.solr.SolrIndexWriter.write(SolrIndexWriter.java:84)
>> at org.apache.nutch.indexer.IndexWriters.write(IndexWriters.java:84)
>> at
>> org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:48)
>> at
>> org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:43)
>> at
>> org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.write(MapTask.java:635)
>> at
>> org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:89)
>> at
>> org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.write(WrappedMapper.java:112)
>> at
>> org.apache.nutch.indexer.IndexingJob$IndexerMapper.map(IndexingJob.java:120)
>> at
>> org.apache.nutch.indexer.IndexingJob$IndexerMapper.map(IndexingJob.java:69)
>> at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
>> at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
>> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
>> at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
>> at java.security.AccessController.doPrivileged(Native Method)
>> at javax.security.auth.Subject.doAs(Subject.java:422)
>> at
>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
>> at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
>>
>> 16/01/25 11:53:02 INFO mapreduce.Job: map 50% reduce 0%
>> 16/01/25 11:54:52 INFO mapreduce.Job: map 100% reduce 0%
>> 16/01/25 11:54:52 INFO mapreduce.Job: Task Id :
>> attempt_1453472314066_0007_m_000000_2, Status : FAILED
>> Error:
>> org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
>> [com.ctc.wstx.exc.WstxLazyException] Invalid UTF-8 character 0xffff at
>> char #1296459, byte #1310719)
>> at
>> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:491)
>> at
>> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:197)
>> at
>> org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
>> at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:68)
>> at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:54)
>> at
>> org.apache.nutch.indexwriter.solr.SolrIndexWriter.write(SolrIndexWriter.java:84)
>> at org.apache.nutch.indexer.IndexWriters.write(IndexWriters.java:84)
>> at
>> org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:48)
>> at
>> org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:43)
>> at
>> org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.write(MapTask.java:635)
>> at
>> org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:89)
>> at
>> org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.write(WrappedMapper.java:112)
>> at
>> org.apache.nutch.indexer.IndexingJob$IndexerMapper.map(IndexingJob.java:120)
>> at
>> org.apache.nutch.indexer.IndexingJob$IndexerMapper.map(IndexingJob.java:69)
>> at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
>> at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
>> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
>> at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
>> at java.security.AccessController.doPrivileged(Native Method)
>> at javax.security.auth.Subject.doAs(Subject.java:422)
>> at
>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
>> at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
>>
>> 16/01/25 11:54:53 INFO mapreduce.Job: map 50% reduce 0%
>> 16/01/25 11:56:22 INFO mapreduce.Job: map 100% reduce 0%
>> 16/01/25 11:56:23 INFO mapreduce.Job: Job job_1453472314066_0007 failed
>> with state FAILED due to: Task failed task_1453472314066_0007_m_000000
>> Job failed as tasks failed. failedMaps:1 failedReduces:0
>>
>> 16/01/25 11:56:23 INFO mapreduce.Job: Counters: 33
>> File System Counters
>> FILE: Number of bytes read=0
>> FILE: Number of bytes written=116194
>> FILE: Number of read operations=0
>> FILE: Number of large read operations=0
>> FILE: Number of write operations=0
>> HDFS: Number of bytes read=1033
>> HDFS: Number of bytes written=0
>> HDFS: Number of read operations=1
>> HDFS: Number of large read operations=0
>> HDFS: Number of write operations=0
>> Job Counters
>> Failed map tasks=4
>> Launched map tasks=5
>> Other local map tasks=3
>> Data-local map tasks=2
>> Total time spent by all maps in occupied slots (ms)=3168342
>> Total time spent by all reduces in occupied slots (ms)=0
>> Total time spent by all map tasks (ms)=1056114
>> Total vcore-seconds taken by all map tasks=1056114
>> Total megabyte-seconds taken by all map tasks=3244382208
>> Map-Reduce Framework
>> Map input records=2762511
>> Map output records=17629
>> Input split bytes=1033
>> Spilled Records=0
>> Failed Shuffles=0
>> Merged Map outputs=0
>> GC time elapsed (ms)=2995
>> CPU time spent (ms)=116860
>> Physical memory (bytes) snapshot=1272868864
>> Virtual memory (bytes) snapshot=5104431104
>> Total committed heap usage (bytes)=1017118720
>> IndexerJob
>> DocumentCount=17629
>> File Input Format Counters
>> Bytes Read=0
>> File Output Format Counters
>> Bytes Written=0
>> 16/01/25 11:56:23 ERROR indexer.IndexingJob: SolrIndexerJob:
>> java.lang.RuntimeException: job failed: name=[1]Indexer, jobid=job_1453472314066_0007
>> at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:120)
>> at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:154)
>> at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:176)
>> at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:202)
>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>> at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:211)
>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>> at java.lang.reflect.Method.invoke(Method.java:497)
>> at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
>> *******************************************************
>> --
>>
>> Please let me know if you have any questions, concerns or updates.
>> Have a great day ahead :)
>>
>> Thanks and Regards,
>>
>> Kshitij Shukla
>> Software developer
>>
>> *Cyber Infrastructure(CIS)
>> **/The RightSourcing Specialists with 1250 man years of experience!/*
>>
>> DISCLAIMER: INFORMATION PRIVACY is important for us, If you are not the
>> intended recipient, you should delete this message and are notified that
>> any disclosure, copying or distribution of this message, or taking any
>> action based on it, is strictly prohibited by Law.
>>
>> Please don't print this e-mail unless you really need to.
>>
>> --
>>
>> ------------------------------
>>
>> *Cyber Infrastructure (P) Limited, [CIS] **(CMMI Level 3 Certified)*
>>
>> Central India's largest Technology company.
>>
>> *Ensuring the success of our clients and partners through our highly
>> optimized Technology solutions.*
>>
>> www.cisin.com | +Cisin <https://plus.google.com/+Cisin/> | Linkedin
>> <https://www.linkedin.com/company/cyber-infrastructure-private-limited> |
>> Offices: *Indore, India.* *Singapore. Silicon Valley, USA*.
>>
>> DISCLAIMER: INFORMATION PRIVACY is important for us, If you are not the
>> intended recipient, you should delete this message and are notified that
>> any disclosure, copying or distribution of this message, or taking any
>> action based on it, is strictly prohibited by Law.
>>
RE: [CIS-CMMI-3] Invalid UTF-8 character 0xffff at char exception
Posted by Markus Jelsma <ma...@openindex.io>.
Hi - this is NUTCH-1016, which was never ported to 2.x.
https://issues.apache.org/jira/browse/NUTCH-1016
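NUTCH-1016 addresses this by filtering out code points that are not legal in XML 1.0 (such as the non-character U+FFFF that trips Solr's Woodstox parser here) before document fields are handed to the index writer. A minimal sketch of that kind of filter is below; the class and method names are illustrative only, not the actual patch, and the accepted ranges simply mirror the XML 1.0 `Char` production:

```java
public class StripNonXmlChars {
    // Drop code points that are invalid in XML 1.0 documents, e.g. U+FFFF.
    // Valid per the XML 1.0 Char production: #x9 | #xA | #xD | [#x20-#xD7FF]
    // | [#xE000-#xFFFD] | [#x10000-#x10FFFF].
    public static String strip(String in) {
        StringBuilder sb = new StringBuilder(in.length());
        for (int i = 0; i < in.length(); ) {
            int cp = in.codePointAt(i);
            i += Character.charCount(cp);
            boolean valid =
                cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
            if (valid) {
                sb.appendCodePoint(cp);
            }
        }
        return sb.toString();
    }
}
```

Running each string field of a NutchDocument through a filter like this before the Solr add should avoid the WstxLazyException, though on 2.x you would have to apply it yourself (for instance in the indexing filter or SolrIndexWriter) since the fix was never ported.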
-----Original message-----
> From:Kshitij Shukla <ks...@cisinlabs.com>
> Sent: Monday 25th January 2016 8:23
> To: user@nutch.apache.org
> Subject: [CIS-CMMI-3] Invalid UTF-8 character 0xffff at char exception
>
> Hello everyone,
>
> During a very large crawl when indexing to Solr this will yield the
> following exception:
>
> **************************************************
> root@cism479:/usr/share/searchEngine/apache-nutch-2.3.1/runtime/deploy/bin#
> /usr/share/searchEngine/apache-nutch-2.3.1/runtime/deploy/bin/nutch
> index -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D
> mapred.reduce.tasks.speculative.execution=false -D
> mapred.map.tasks.speculative.execution=false -D
> mapred.compress.map.output=true -D
> solr.server.url=http://localhost:8983/solr/ddcds -all -crawlId 1
> 16/01/25 11:44:52 INFO indexer.IndexingJob: IndexingJob: starting
> 16/01/25 11:44:53 INFO Configuration.deprecation:
> mapred.output.key.comparator.class is deprecated. Instead, use
> mapreduce.job.output.key.comparator.class
> 16/01/25 11:44:53 INFO plugin.PluginRepository: Plugins: looking in:
> /tmp/hadoop-root/hadoop-unjar4772724649160367470/classes/plugins
> 16/01/25 11:44:54 INFO plugin.PluginRepository: Plugin Auto-activation
> mode: [true]
> 16/01/25 11:44:54 INFO plugin.PluginRepository: Registered Plugins:
> 16/01/25 11:44:54 INFO plugin.PluginRepository: HTTP Framework
> (lib-http)
> 16/01/25 11:44:54 INFO plugin.PluginRepository: Html Parse Plug-in
> (parse-html)
> 16/01/25 11:44:54 INFO plugin.PluginRepository: MetaTags
> (parse-metatags)
> 16/01/25 11:44:54 INFO plugin.PluginRepository: Html Indexing Filter
> (index-html)
> 16/01/25 11:44:54 INFO plugin.PluginRepository: the nutch core
> extension points (nutch-extensionpoints)
> 16/01/25 11:44:54 INFO plugin.PluginRepository: Basic Indexing
> Filter (index-basic)
> 16/01/25 11:44:54 INFO plugin.PluginRepository: XML Libraries (lib-xml)
> 16/01/25 11:44:54 INFO plugin.PluginRepository: Anchor Indexing
> Filter (index-anchor)
> 16/01/25 11:44:54 INFO plugin.PluginRepository: Basic URL Normalizer
> (urlnormalizer-basic)
> 16/01/25 11:44:54 INFO plugin.PluginRepository: Language
> Identification Parser/Filter (language-identifier)
> 16/01/25 11:44:54 INFO plugin.PluginRepository: Metadata Indexing
> Filter (index-metadata)
> 16/01/25 11:44:54 INFO plugin.PluginRepository: CyberNeko HTML
> Parser (lib-nekohtml)
> 16/01/25 11:44:54 INFO plugin.PluginRepository: Subcollection
> indexing and query filter (subcollection)
> 16/01/25 11:44:54 INFO plugin.PluginRepository: SOLRIndexWriter
> (indexer-solr)
> 16/01/25 11:44:54 INFO plugin.PluginRepository: Rel-Tag microformat
> Parser/Indexer/Querier (microformats-reltag)
> 16/01/25 11:44:54 INFO plugin.PluginRepository: Http / Https
> Protocol Plug-in (protocol-httpclient)
> 16/01/25 11:44:54 INFO plugin.PluginRepository: JavaScript Parser
> (parse-js)
> 16/01/25 11:44:54 INFO plugin.PluginRepository: Tika Parser Plug-in
> (parse-tika)
> 16/01/25 11:44:54 INFO plugin.PluginRepository: Top Level Domain
> Plugin (tld)
> 16/01/25 11:44:54 INFO plugin.PluginRepository: Regex URL Filter
> Framework (lib-regex-filter)
> 16/01/25 11:44:54 INFO plugin.PluginRepository: Regex URL Normalizer
> (urlnormalizer-regex)
> 16/01/25 11:44:54 INFO plugin.PluginRepository: Link Analysis
> Scoring Plug-in (scoring-link)
> 16/01/25 11:44:54 INFO plugin.PluginRepository: OPIC Scoring Plug-in
> (scoring-opic)
> 16/01/25 11:44:54 INFO plugin.PluginRepository: More Indexing Filter
> (index-more)
> 16/01/25 11:44:54 INFO plugin.PluginRepository: Http Protocol
> Plug-in (protocol-http)
> 16/01/25 11:44:54 INFO plugin.PluginRepository: Creative Commons
> Plugins (creativecommons)
> 16/01/25 11:44:54 INFO plugin.PluginRepository: Registered Extension-Points:
> 16/01/25 11:44:54 INFO plugin.PluginRepository: Parse Filter
> (org.apache.nutch.parse.ParseFilter)
> 16/01/25 11:44:54 INFO plugin.PluginRepository: Nutch Index Cleaning
> Filter (org.apache.nutch.indexer.IndexCleaningFilter)
> 16/01/25 11:44:54 INFO plugin.PluginRepository: Nutch Content Parser
> (org.apache.nutch.parse.Parser)
> 16/01/25 11:44:54 INFO plugin.PluginRepository: Nutch URL Filter
> (org.apache.nutch.net.URLFilter)
> 16/01/25 11:44:54 INFO plugin.PluginRepository: Nutch Scoring
> (org.apache.nutch.scoring.ScoringFilter)
> 16/01/25 11:44:54 INFO plugin.PluginRepository: Nutch URL Normalizer
> (org.apache.nutch.net.URLNormalizer)
> 16/01/25 11:44:54 INFO plugin.PluginRepository: Nutch Protocol
> (org.apache.nutch.protocol.Protocol)
> 16/01/25 11:44:54 INFO plugin.PluginRepository: Nutch Index Writer
> (org.apache.nutch.indexer.IndexWriter)
> 16/01/25 11:44:54 INFO plugin.PluginRepository: Nutch Indexing
> Filter (org.apache.nutch.indexer.IndexingFilter)
> 16/01/25 11:44:54 INFO indexer.IndexingFilters: Adding
> org.apache.nutch.indexer.html.HtmlIndexingFilter
> 16/01/25 11:44:54 INFO basic.BasicIndexingFilter: Maximum title length
> for indexing set to: 100
> 16/01/25 11:44:54 INFO indexer.IndexingFilters: Adding
> org.apache.nutch.indexer.basic.BasicIndexingFilter
> 16/01/25 11:44:54 INFO anchor.AnchorIndexingFilter: Anchor deduplication
> is: off
> 16/01/25 11:45:07 INFO mapreduce.JobSubmitter: Submitting tokens for
> job: job_1453472314066_0007
> 16/01/25 11:45:08 INFO impl.YarnClientImpl: Submitted application
> application_1453472314066_0007
> 16/01/25 11:45:09 INFO mapreduce.Job: The url to track the job:
> http://cism479:8088/proxy/application_1453472314066_0007/
> 16/01/25 11:45:09 INFO mapreduce.Job: Running job: job_1453472314066_0007
> 16/01/25 11:45:29 INFO mapreduce.Job: Job job_1453472314066_0007 running
> in uber mode : false
> 16/01/25 11:45:29 INFO mapreduce.Job: map 0% reduce 0%
> 16/01/25 11:49:24 INFO mapreduce.Job: map 50% reduce 0%
> 16/01/25 11:49:29 INFO mapreduce.Job: map 0% reduce 0%
> 16/01/25 11:49:29 INFO mapreduce.Job: Task Id :
> attempt_1453472314066_0007_m_000000_0, Status : FAILED
> Error: org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
> [com.ctc.wstx.exc.WstxLazyException] Invalid UTF-8 character 0xffff at
> char #1296459, byte #1310719)
> at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:491)
> at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:197)
> at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
> at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:68)
> at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:54)
> at org.apache.nutch.indexwriter.solr.SolrIndexWriter.write(SolrIndexWriter.java:84)
> at org.apache.nutch.indexer.IndexWriters.write(IndexWriters.java:84)
> at org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:48)
> at org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:43)
> at org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.write(MapTask.java:635)
> at org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:89)
> at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.write(WrappedMapper.java:112)
> at org.apache.nutch.indexer.IndexingJob$IndexerMapper.map(IndexingJob.java:120)
> at org.apache.nutch.indexer.IndexingJob$IndexerMapper.map(IndexingJob.java:69)
> at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
> at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
> at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
> at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
>
> 16/01/25 11:52:27 INFO mapreduce.Job: map 50% reduce 0%
> 16/01/25 11:53:01 INFO mapreduce.Job: map 100% reduce 0%
> 16/01/25 11:53:01 INFO mapreduce.Job: Task Id :
> attempt_1453472314066_0007_m_000000_1, Status : FAILED
> Error: org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
> [com.ctc.wstx.exc.WstxLazyException] Invalid UTF-8 character 0xffff at
> char #1296459, byte #1310719)
> (stack trace identical to the first failed attempt)
>
> 16/01/25 11:53:02 INFO mapreduce.Job: map 50% reduce 0%
> 16/01/25 11:54:52 INFO mapreduce.Job: map 100% reduce 0%
> 16/01/25 11:54:52 INFO mapreduce.Job: Task Id :
> attempt_1453472314066_0007_m_000000_2, Status : FAILED
> Error: org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
> [com.ctc.wstx.exc.WstxLazyException] Invalid UTF-8 character 0xffff at
> char #1296459, byte #1310719)
> (stack trace identical to the first failed attempt)
>
> 16/01/25 11:54:53 INFO mapreduce.Job: map 50% reduce 0%
> 16/01/25 11:56:22 INFO mapreduce.Job: map 100% reduce 0%
> 16/01/25 11:56:23 INFO mapreduce.Job: Job job_1453472314066_0007 failed
> with state FAILED due to: Task failed task_1453472314066_0007_m_000000
> Job failed as tasks failed. failedMaps:1 failedReduces:0
>
> 16/01/25 11:56:23 INFO mapreduce.Job: Counters: 33
> File System Counters
> FILE: Number of bytes read=0
> FILE: Number of bytes written=116194
> FILE: Number of read operations=0
> FILE: Number of large read operations=0
> FILE: Number of write operations=0
> HDFS: Number of bytes read=1033
> HDFS: Number of bytes written=0
> HDFS: Number of read operations=1
> HDFS: Number of large read operations=0
> HDFS: Number of write operations=0
> Job Counters
> Failed map tasks=4
> Launched map tasks=5
> Other local map tasks=3
> Data-local map tasks=2
> Total time spent by all maps in occupied slots (ms)=3168342
> Total time spent by all reduces in occupied slots (ms)=0
> Total time spent by all map tasks (ms)=1056114
> Total vcore-seconds taken by all map tasks=1056114
> Total megabyte-seconds taken by all map tasks=3244382208
> Map-Reduce Framework
> Map input records=2762511
> Map output records=17629
> Input split bytes=1033
> Spilled Records=0
> Failed Shuffles=0
> Merged Map outputs=0
> GC time elapsed (ms)=2995
> CPU time spent (ms)=116860
> Physical memory (bytes) snapshot=1272868864
> Virtual memory (bytes) snapshot=5104431104
> Total committed heap usage (bytes)=1017118720
> IndexerJob
> DocumentCount=17629
> File Input Format Counters
> Bytes Read=0
> File Output Format Counters
> Bytes Written=0
> 16/01/25 11:56:23 ERROR indexer.IndexingJob: SolrIndexerJob:
> java.lang.RuntimeException: job failed: name=[1]Indexer, jobid=job_1453472314066_0007
> at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:120)
> at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:154)
> at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:176)
> at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:202)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
> at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:211)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:497)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
> *******************************************************