Posted to user@nutch.apache.org by Kshitij Shukla <ks...@cisinlabs.com> on 2016/01/25 08:23:06 UTC
[CIS-CMMI-3] Invalid UTF-8 character 0xffff at char exception
Hello everyone,
During a very large crawl when indexing to Solr this will yield the
following exception:
**************************************************
root@cism479:/usr/share/searchEngine/apache-nutch-2.3.1/runtime/deploy/bin#
/usr/share/searchEngine/apache-nutch-2.3.1/runtime/deploy/bin/nutch
index -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D
mapred.reduce.tasks.speculative.execution=false -D
mapred.map.tasks.speculative.execution=false -D
mapred.compress.map.output=true -D
solr.server.url=http://localhost:8983/solr/ddcds -all -crawlId 1
16/01/25 11:44:52 INFO indexer.IndexingJob: IndexingJob: starting
16/01/25 11:44:53 INFO Configuration.deprecation:
mapred.output.key.comparator.class is deprecated. Instead, use
mapreduce.job.output.key.comparator.class
16/01/25 11:44:53 INFO plugin.PluginRepository: Plugins: looking in:
/tmp/hadoop-root/hadoop-unjar4772724649160367470/classes/plugins
16/01/25 11:44:54 INFO plugin.PluginRepository: Plugin Auto-activation
mode: [true]
16/01/25 11:44:54 INFO plugin.PluginRepository: Registered Plugins:
16/01/25 11:44:54 INFO plugin.PluginRepository: HTTP Framework
(lib-http)
16/01/25 11:44:54 INFO plugin.PluginRepository: Html Parse Plug-in
(parse-html)
16/01/25 11:44:54 INFO plugin.PluginRepository: MetaTags
(parse-metatags)
16/01/25 11:44:54 INFO plugin.PluginRepository: Html Indexing Filter
(index-html)
16/01/25 11:44:54 INFO plugin.PluginRepository: the nutch core
extension points (nutch-extensionpoints)
16/01/25 11:44:54 INFO plugin.PluginRepository: Basic Indexing
Filter (index-basic)
16/01/25 11:44:54 INFO plugin.PluginRepository: XML Libraries (lib-xml)
16/01/25 11:44:54 INFO plugin.PluginRepository: Anchor Indexing
Filter (index-anchor)
16/01/25 11:44:54 INFO plugin.PluginRepository: Basic URL Normalizer
(urlnormalizer-basic)
16/01/25 11:44:54 INFO plugin.PluginRepository: Language
Identification Parser/Filter (language-identifier)
16/01/25 11:44:54 INFO plugin.PluginRepository: Metadata Indexing
Filter (index-metadata)
16/01/25 11:44:54 INFO plugin.PluginRepository: CyberNeko HTML
Parser (lib-nekohtml)
16/01/25 11:44:54 INFO plugin.PluginRepository: Subcollection
indexing and query filter (subcollection)
16/01/25 11:44:54 INFO plugin.PluginRepository: SOLRIndexWriter
(indexer-solr)
16/01/25 11:44:54 INFO plugin.PluginRepository: Rel-Tag microformat
Parser/Indexer/Querier (microformats-reltag)
16/01/25 11:44:54 INFO plugin.PluginRepository: Http / Https
Protocol Plug-in (protocol-httpclient)
16/01/25 11:44:54 INFO plugin.PluginRepository: JavaScript Parser
(parse-js)
16/01/25 11:44:54 INFO plugin.PluginRepository: Tika Parser Plug-in
(parse-tika)
16/01/25 11:44:54 INFO plugin.PluginRepository: Top Level Domain
Plugin (tld)
16/01/25 11:44:54 INFO plugin.PluginRepository: Regex URL Filter
Framework (lib-regex-filter)
16/01/25 11:44:54 INFO plugin.PluginRepository: Regex URL Normalizer
(urlnormalizer-regex)
16/01/25 11:44:54 INFO plugin.PluginRepository: Link Analysis
Scoring Plug-in (scoring-link)
16/01/25 11:44:54 INFO plugin.PluginRepository: OPIC Scoring Plug-in
(scoring-opic)
16/01/25 11:44:54 INFO plugin.PluginRepository: More Indexing Filter
(index-more)
16/01/25 11:44:54 INFO plugin.PluginRepository: Http Protocol
Plug-in (protocol-http)
16/01/25 11:44:54 INFO plugin.PluginRepository: Creative Commons
Plugins (creativecommons)
16/01/25 11:44:54 INFO plugin.PluginRepository: Registered Extension-Points:
16/01/25 11:44:54 INFO plugin.PluginRepository: Parse Filter
(org.apache.nutch.parse.ParseFilter)
16/01/25 11:44:54 INFO plugin.PluginRepository: Nutch Index Cleaning
Filter (org.apache.nutch.indexer.IndexCleaningFilter)
16/01/25 11:44:54 INFO plugin.PluginRepository: Nutch Content Parser
(org.apache.nutch.parse.Parser)
16/01/25 11:44:54 INFO plugin.PluginRepository: Nutch URL Filter
(org.apache.nutch.net.URLFilter)
16/01/25 11:44:54 INFO plugin.PluginRepository: Nutch Scoring
(org.apache.nutch.scoring.ScoringFilter)
16/01/25 11:44:54 INFO plugin.PluginRepository: Nutch URL Normalizer
(org.apache.nutch.net.URLNormalizer)
16/01/25 11:44:54 INFO plugin.PluginRepository: Nutch Protocol
(org.apache.nutch.protocol.Protocol)
16/01/25 11:44:54 INFO plugin.PluginRepository: Nutch Index Writer
(org.apache.nutch.indexer.IndexWriter)
16/01/25 11:44:54 INFO plugin.PluginRepository: Nutch Indexing
Filter (org.apache.nutch.indexer.IndexingFilter)
16/01/25 11:44:54 INFO indexer.IndexingFilters: Adding
org.apache.nutch.indexer.html.HtmlIndexingFilter
16/01/25 11:44:54 INFO basic.BasicIndexingFilter: Maximum title length
for indexing set to: 100
16/01/25 11:44:54 INFO indexer.IndexingFilters: Adding
org.apache.nutch.indexer.basic.BasicIndexingFilter
16/01/25 11:44:54 INFO anchor.AnchorIndexingFilter: Anchor deduplication
is: off
16/01/25 11:45:07 INFO mapreduce.JobSubmitter: Submitting tokens for
job: job_1453472314066_0007
16/01/25 11:45:08 INFO impl.YarnClientImpl: Submitted application
application_1453472314066_0007
16/01/25 11:45:09 INFO mapreduce.Job: The url to track the job:
http://cism479:8088/proxy/application_1453472314066_0007/
16/01/25 11:45:09 INFO mapreduce.Job: Running job: job_1453472314066_0007
16/01/25 11:45:29 INFO mapreduce.Job: Job job_1453472314066_0007 running
in uber mode : false
16/01/25 11:45:29 INFO mapreduce.Job: map 0% reduce 0%
16/01/25 11:49:24 INFO mapreduce.Job: map 50% reduce 0%
16/01/25 11:49:29 INFO mapreduce.Job: map 0% reduce 0%
16/01/25 11:49:29 INFO mapreduce.Job: Task Id :
attempt_1453472314066_0007_m_000000_0, Status : FAILED
Error:
org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
[com.ctc.wstx.exc.WstxLazyException] Invalid UTF-8 character 0xffff at
char #1296459, byte #1310719)
at
org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:491)
at
org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:197)
at
org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:68)
at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:54)
at
org.apache.nutch.indexwriter.solr.SolrIndexWriter.write(SolrIndexWriter.java:84)
at org.apache.nutch.indexer.IndexWriters.write(IndexWriters.java:84)
at
org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:48)
at
org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:43)
at
org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.write(MapTask.java:635)
at
org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:89)
at
org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.write(WrappedMapper.java:112)
at
org.apache.nutch.indexer.IndexingJob$IndexerMapper.map(IndexingJob.java:120)
at
org.apache.nutch.indexer.IndexingJob$IndexerMapper.map(IndexingJob.java:69)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
16/01/25 11:52:27 INFO mapreduce.Job: map 50% reduce 0%
16/01/25 11:53:01 INFO mapreduce.Job: map 100% reduce 0%
16/01/25 11:53:01 INFO mapreduce.Job: Task Id :
attempt_1453472314066_0007_m_000000_1, Status : FAILED
Error:
org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
[com.ctc.wstx.exc.WstxLazyException] Invalid UTF-8 character 0xffff at
char #1296459, byte #1310719)
at
org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:491)
at
org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:197)
at
org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:68)
at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:54)
at
org.apache.nutch.indexwriter.solr.SolrIndexWriter.write(SolrIndexWriter.java:84)
at org.apache.nutch.indexer.IndexWriters.write(IndexWriters.java:84)
at
org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:48)
at
org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:43)
at
org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.write(MapTask.java:635)
at
org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:89)
at
org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.write(WrappedMapper.java:112)
at
org.apache.nutch.indexer.IndexingJob$IndexerMapper.map(IndexingJob.java:120)
at
org.apache.nutch.indexer.IndexingJob$IndexerMapper.map(IndexingJob.java:69)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
16/01/25 11:53:02 INFO mapreduce.Job: map 50% reduce 0%
16/01/25 11:54:52 INFO mapreduce.Job: map 100% reduce 0%
16/01/25 11:54:52 INFO mapreduce.Job: Task Id :
attempt_1453472314066_0007_m_000000_2, Status : FAILED
Error:
org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
[com.ctc.wstx.exc.WstxLazyException] Invalid UTF-8 character 0xffff at
char #1296459, byte #1310719)
at
org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:491)
at
org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:197)
at
org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:68)
at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:54)
at
org.apache.nutch.indexwriter.solr.SolrIndexWriter.write(SolrIndexWriter.java:84)
at org.apache.nutch.indexer.IndexWriters.write(IndexWriters.java:84)
at
org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:48)
at
org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:43)
at
org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.write(MapTask.java:635)
at
org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:89)
at
org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.write(WrappedMapper.java:112)
at
org.apache.nutch.indexer.IndexingJob$IndexerMapper.map(IndexingJob.java:120)
at
org.apache.nutch.indexer.IndexingJob$IndexerMapper.map(IndexingJob.java:69)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
16/01/25 11:54:53 INFO mapreduce.Job: map 50% reduce 0%
16/01/25 11:56:22 INFO mapreduce.Job: map 100% reduce 0%
16/01/25 11:56:23 INFO mapreduce.Job: Job job_1453472314066_0007 failed
with state FAILED due to: Task failed task_1453472314066_0007_m_000000
Job failed as tasks failed. failedMaps:1 failedReduces:0
16/01/25 11:56:23 INFO mapreduce.Job: Counters: 33
File System Counters
FILE: Number of bytes read=0
FILE: Number of bytes written=116194
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=1033
HDFS: Number of bytes written=0
HDFS: Number of read operations=1
HDFS: Number of large read operations=0
HDFS: Number of write operations=0
Job Counters
Failed map tasks=4
Launched map tasks=5
Other local map tasks=3
Data-local map tasks=2
Total time spent by all maps in occupied slots (ms)=3168342
Total time spent by all reduces in occupied slots (ms)=0
Total time spent by all map tasks (ms)=1056114
Total vcore-seconds taken by all map tasks=1056114
Total megabyte-seconds taken by all map tasks=3244382208
Map-Reduce Framework
Map input records=2762511
Map output records=17629
Input split bytes=1033
Spilled Records=0
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=2995
CPU time spent (ms)=116860
Physical memory (bytes) snapshot=1272868864
Virtual memory (bytes) snapshot=5104431104
Total committed heap usage (bytes)=1017118720
IndexerJob
DocumentCount=17629
File Input Format Counters
Bytes Read=0
File Output Format Counters
Bytes Written=0
16/01/25 11:56:23 ERROR indexer.IndexingJob: SolrIndexerJob:
java.lang.RuntimeException: job failed: name=[1]Indexer,
jobid=job_1453472314066_0007
at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:120)
at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:154)
at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:176)
at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:202)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:211)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
*******************************************************
--
Please let me know if you have any questions , concerns or updates.
Have a great day ahead :)
Thanks and Regards,
Kshitij Shukla
Software developer
Cyber Infrastructure (CIS)
The RightSourcing Specialists with 1250 man years of experience!
DISCLAIMER: INFORMATION PRIVACY is important for us, If you are not the
intended recipient, you should delete this message and are notified that
any disclosure, copying or distribution of this message, or taking any
action based on it, is strictly prohibited by Law.
Please don't print this e-mail unless you really need to.
--
------------------------------
Cyber Infrastructure (P) Limited, [CIS] (CMMI Level 3 Certified)
Central India's largest Technology company.
Ensuring the success of our clients and partners through our highly
optimized Technology solutions.
www.cisin.com | +Cisin <https://plus.google.com/+Cisin/> | Linkedin
<https://www.linkedin.com/company/cyber-infrastructure-private-limited> |
Offices: Indore, India. Singapore. Silicon Valley, USA.
RE: [CIS-CMMI-3] Re: [CIS-CMMI-3] Re: [CIS-CMMI-3] Invalid UTF-8
character 0xffff at char exception
Posted by Markus Jelsma <ma...@openindex.io>.
You could try removing the content or title condition:
if (e.getKey().equals("content") || e.getKey().equals("title")) {
val2 = SolrUtils.stripNonCharCodepoints(val);
}
Then all fields will get stripped. But usually it only happens on the content field, strange.
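If you do remove that condition so every string field gets stripped, the stripping itself amounts to dropping Unicode non-characters before the document is serialized to XML. Below is a minimal, self-contained sketch of the idea behind SolrUtils.stripNonCharCodepoints; it is not the actual Nutch source, and the class/method names are illustrative:

```java
// Sketch (not the Nutch source): drop BMP non-characters such as U+FFFF
// before a field value is handed to SolrJ, since the Woodstox XML parser
// on the Solr side rejects them with "Invalid UTF-8 character 0xffff".
public class StripNonChars {

    public static String stripNonCharCodepoints(String input) {
        StringBuilder sb = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            // U+FDD0..U+FDEF is a dedicated non-character block.
            if (c >= 0xFDD0 && c <= 0xFDEF) continue;
            // U+FFFE and U+FFFF are the BMP non-characters seen in this error.
            if (c == 0xFFFE || c == 0xFFFF) continue;
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String dirty = "hello\uFFFFworld";
        System.out.println(stripNonCharCodepoints(dirty)); // prints "helloworld"
    }
}
```

This only covers the BMP cases that typically trigger the exception above; supplementary-plane non-characters (U+1FFFE etc.) arrive as surrogate pairs and would need codepoint-based iteration to catch.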
Markus
-----Original message-----
> From:Kshitij Shukla <ks...@cisinlabs.com>
> Sent: Monday 25th January 2016 14:23
> To: user@nutch.apache.org
> Subject: [CIS-CMMI-3] Re: [CIS-CMMI-3] Re: [CIS-CMMI-3] Invalid UTF-8 character 0xffff at char exception
>
> I have been trying to get the name of the field, but the error it shows
> is a generic one and doesn't have any field name associated with it. I
> looked for the name in the Hadoop, Nutch, and Solr logs, but didn't
> find any field name.
>
> Thanks
>
> On Monday 25 January 2016 06:10 PM, Markus Jelsma wrote:
> > That is odd! Is it on your content or title field?
> > Markus
> >
> > -----Original message-----
> >> From:Kshitij Shukla <ks...@cisinlabs.com>
> >> Sent: Monday 25th January 2016 11:41
> >> To: user@nutch.apache.org
> >> Subject: [CIS-CMMI-3] Re: [CIS-CMMI-3] Invalid UTF-8 character 0xffff at char exception
> >>
> >> Thanks for your response, Markus. I checked the code and found the
> >> workaround you suggested in this file:
> >>
> >> *Source:*
> >> /src/plugin/indexer-solr/src/java/org/apache/nutch/indexwriter/solr/SolrUtils.java
> >>
> >> and the method was called in this file:
> >>
> >> *Invoked:*
> >> /src/plugin/indexer-solr/src/java/org/apache/nutch/indexwriter/solr/SolrIndexWriter.java
> >> like this
> >> if (e.getKey().equals("content") || e.getKey().equals("title")) {
> >> val2 = SolrUtils.stripNonCharCodepoints(val);
> >> }
> >>
> >> So the method is there and apparently invoked in the right place.
> >> Where do you think the problem could be?
> >>
> >> Thanks again for your help.
> >>
> >> On Monday 25 January 2016 03:35 PM, Markus Jelsma wrote:
> >>> Hi - this is NUTCH-1016, which was never ported to 2.x.
> >>>
> >>> https://issues.apache.org/jira/browse/NUTCH-1016
> >>>
> >>>
> >>>
> >>> -----Original message-----
> >>>> From:Kshitij Shukla <ks...@cisinlabs.com>
> >>>> Sent: Monday 25th January 2016 8:23
> >>>> To: user@nutch.apache.org
> >>>> Subject: [CIS-CMMI-3] Invalid UTF-8 character 0xffff at char exception
> >>>>
> >>>> Hello everyone,
> >>>>
> >>>> During a very large crawl when indexing to Solr this will yield the
> >>>> following exception:
> >>>>
> >>>> [log snipped; identical to the log in the original message above]
> >>>> at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
> >>>> *******************************************************
> >>>> --
> >>>>
> >>>> Please let me know if you have any questions, concerns, or updates.
> >>>> Have a great day ahead :)
> >>>>
> >>>> Thanks and Regards,
> >>>>
> >>>> Kshitij Shukla
> >>>> Software developer
> >>>>
> >>>> *Cyber Infrastructure(CIS)
> >>>> **/The RightSourcing Specialists with 1250 man years of experience!/*
> >>>>
> >>>> DISCLAIMER: INFORMATION PRIVACY is important for us, If you are not the
> >>>> intended recipient, you should delete this message and are notified that
> >>>> any disclosure, copying or distribution of this message, or taking any
> >>>> action based on it, is strictly prohibited by Law.
> >>>>
> >>>> Please don't print this e-mail unless you really need to.
> >>>>
> >>>> --
> >>>>
> >>>> ------------------------------
> >>>>
> >>>> *Cyber Infrastructure (P) Limited, [CIS] **(CMMI Level 3 Certified)*
> >>>>
> >>>> Central India's largest Technology company.
> >>>>
> >>>> *Ensuring the success of our clients and partners through our highly
> >>>> optimized Technology solutions.*
> >>>>
> >>>> www.cisin.com | +Cisin <https://plus.google.com/+Cisin/> | Linkedin
> >>>> <https://www.linkedin.com/company/cyber-infrastructure-private-limited> |
> >>>> Offices: *Indore, India.* *Singapore. Silicon Valley, USA*.
> >>>>
> >>>> DISCLAIMER: INFORMATION PRIVACY is important for us, If you are not the
> >>>> intended recipient, you should delete this message and are notified that
> >>>> any disclosure, copying or distribution of this message, or taking any
> >>>> action based on it, is strictly prohibited by Law.
> >>>>
[CIS-CMMI-3] Re: [CIS-CMMI-3] Re: [CIS-CMMI-3] Invalid UTF-8 character 0xffff at
char exception
Posted by Kshitij Shukla <ks...@cisinlabs.com>.
I have been trying to get the name of the field, but the error shown is
generic and has no field name associated with it. I looked for the field
name in the Hadoop, Nutch, and Solr logs, but none of them mention it.
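Since the workaround in SolrIndexWriter only strips non-characters from the
"content" and "title" fields, a document carrying U+FFFF in some other string
field would still trip Solr's XML parser. Below is a minimal, hypothetical
sketch (not the actual Nutch/NUTCH-1016 code; class and method names are
illustrative) of a filter that removes Unicode non-characters from any field
value:

```java
// Hypothetical sketch, not the actual Nutch patch: strip Unicode
// non-characters (U+FDD0..U+FDEF and U+nFFFE/U+nFFFF in every plane) from a
// field value before it is serialized into the Solr update XML.
public class StripNonChars {

    public static String stripNonCharCodepoints(String input) {
        StringBuilder out = new StringBuilder(input.length());
        int i = 0;
        while (i < input.length()) {
            int cp = input.codePointAt(i);
            boolean nonChar = (cp >= 0xFDD0 && cp <= 0xFDEF) // contiguous non-character block
                    || (cp & 0xFFFE) == 0xFFFE;              // U+FFFE/U+FFFF in every plane
            if (!nonChar) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp); // advance by 1 or 2 chars (surrogate pairs)
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // U+FFFF is exactly the codepoint Solr's Woodstox parser rejects.
        System.out.println(stripNonCharCodepoints("hello\uFFFFworld")); // prints "helloworld"
    }
}
```

Running such a filter over every string field of the NutchDocument, rather
than only content and title, would explain why the error can persist even
though the workaround is in place for those two fields.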
Thanks
On Monday 25 January 2016 06:10 PM, Markus Jelsma wrote:
> That is odd! Is it on your content or title field?
> Markus
>
> -----Original message-----
>> From:Kshitij Shukla <ks...@cisinlabs.com>
>> Sent: Monday 25th January 2016 11:41
>> To: user@nutch.apache.org
>> Subject: [CIS-CMMI-3] Re: [CIS-CMMI-3] Invalid UTF-8 character 0xffff at char exception
>>
>> Thanks for your response, Markus. I checked the code and found the
>> workaround you suggested in this file:
>>
>> *Source:*
>> /src/plugin/indexer-solr/src/java/org/apache/nutch/indexwriter/solr/SolrUtils.java
>>
>> and the method was called in this file:
>>
>> *Invoked:*
>> /src/plugin/indexer-solr/src/java/org/apache/nutch/indexwriter/solr/SolrIndexWriter.java
>> like this
>> if (e.getKey().equals("content") || e.getKey().equals("title")) {
>> val2 = SolrUtils.stripNonCharCodepoints(val);
>> }
>>
>> So the method exists and is apparently invoked in the right place. Where
>> do you think the problem could be?
>>
>> Thanks again for your help.
>>
>> On Monday 25 January 2016 03:35 PM, Markus Jelsma wrote:
>>> Hi - this is NUTCH-1016, which was never ported to 2.x.
>>>
>>> https://issues.apache.org/jira/browse/NUTCH-1016
>>>
>>>
>>>
>>> -----Original message-----
>>>> From:Kshitij Shukla <ks...@cisinlabs.com>
>>>> Sent: Monday 25th January 2016 8:23
>>>> To: user@nutch.apache.org
>>>> Subject: [CIS-CMMI-3] Invalid UTF-8 character 0xffff at char exception
>>>>
>>>> Hello everyone,
>>>>
>>>> During a very large crawl when indexing to Solr this will yield the
>>>> following exception:
>>>>
>>>> **************************************************
>>>> root@cism479:/usr/share/searchEngine/apache-nutch-2.3.1/runtime/deploy/bin#
>>>> /usr/share/searchEngine/apache-nutch-2.3.1/runtime/deploy/bin/nutch
>>>> index -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D
>>>> mapred.reduce.tasks.speculative.execution=false -D
>>>> mapred.map.tasks.speculative.execution=false -D
>>>> mapred.compress.map.output=true -D
>>>> solr.server.url=http://localhost:8983/solr/ddcds -all -crawlId 1
>>>> 16/01/25 11:44:52 INFO indexer.IndexingJob: IndexingJob: starting
>>>> 16/01/25 11:44:53 INFO Configuration.deprecation:
>>>> mapred.output.key.comparator.class is deprecated. Instead, use
>>>> mapreduce.job.output.key.comparator.class
>>>> 16/01/25 11:44:53 INFO plugin.PluginRepository: Plugins: looking in:
>>>> /tmp/hadoop-root/hadoop-unjar4772724649160367470/classes/plugins
>>>> 16/01/25 11:44:54 INFO plugin.PluginRepository: Plugin Auto-activation
>>>> mode: [true]
>>>> 16/01/25 11:44:54 INFO plugin.PluginRepository: Registered Plugins:
>>>> 16/01/25 11:44:54 INFO plugin.PluginRepository: HTTP Framework
>>>> (lib-http)
>>>> 16/01/25 11:44:54 INFO plugin.PluginRepository: Html Parse Plug-in
>>>> (parse-html)
>>>> 16/01/25 11:44:54 INFO plugin.PluginRepository: MetaTags
>>>> (parse-metatags)
>>>> 16/01/25 11:44:54 INFO plugin.PluginRepository: Html Indexing Filter
>>>> (index-html)
>>>> 16/01/25 11:44:54 INFO plugin.PluginRepository: the nutch core
>>>> extension points (nutch-extensionpoints)
>>>> 16/01/25 11:44:54 INFO plugin.PluginRepository: Basic Indexing
>>>> Filter (index-basic)
>>>> 16/01/25 11:44:54 INFO plugin.PluginRepository: XML Libraries (lib-xml)
>>>> 16/01/25 11:44:54 INFO plugin.PluginRepository: Anchor Indexing
>>>> Filter (index-anchor)
>>>> 16/01/25 11:44:54 INFO plugin.PluginRepository: Basic URL Normalizer
>>>> (urlnormalizer-basic)
>>>> 16/01/25 11:44:54 INFO plugin.PluginRepository: Language
>>>> Identification Parser/Filter (language-identifier)
>>>> 16/01/25 11:44:54 INFO plugin.PluginRepository: Metadata Indexing
>>>> Filter (index-metadata)
>>>> 16/01/25 11:44:54 INFO plugin.PluginRepository: CyberNeko HTML
>>>> Parser (lib-nekohtml)
>>>> 16/01/25 11:44:54 INFO plugin.PluginRepository: Subcollection
>>>> indexing and query filter (subcollection)
>>>> 16/01/25 11:44:54 INFO plugin.PluginRepository: SOLRIndexWriter
>>>> (indexer-solr)
>>>> 16/01/25 11:44:54 INFO plugin.PluginRepository: Rel-Tag microformat
>>>> Parser/Indexer/Querier (microformats-reltag)
>>>> 16/01/25 11:44:54 INFO plugin.PluginRepository: Http / Https
>>>> Protocol Plug-in (protocol-httpclient)
>>>> 16/01/25 11:44:54 INFO plugin.PluginRepository: JavaScript Parser
>>>> (parse-js)
>>>> 16/01/25 11:44:54 INFO plugin.PluginRepository: Tika Parser Plug-in
>>>> (parse-tika)
>>>> 16/01/25 11:44:54 INFO plugin.PluginRepository: Top Level Domain
>>>> Plugin (tld)
>>>> 16/01/25 11:44:54 INFO plugin.PluginRepository: Regex URL Filter
>>>> Framework (lib-regex-filter)
>>>> 16/01/25 11:44:54 INFO plugin.PluginRepository: Regex URL Normalizer
>>>> (urlnormalizer-regex)
>>>> 16/01/25 11:44:54 INFO plugin.PluginRepository: Link Analysis
>>>> Scoring Plug-in (scoring-link)
>>>> 16/01/25 11:44:54 INFO plugin.PluginRepository: OPIC Scoring Plug-in
>>>> (scoring-opic)
>>>> 16/01/25 11:44:54 INFO plugin.PluginRepository: More Indexing Filter
>>>> (index-more)
>>>> 16/01/25 11:44:54 INFO plugin.PluginRepository: Http Protocol
>>>> Plug-in (protocol-http)
>>>> 16/01/25 11:44:54 INFO plugin.PluginRepository: Creative Commons
>>>> Plugins (creativecommons)
>>>> 16/01/25 11:44:54 INFO plugin.PluginRepository: Registered Extension-Points:
>>>> 16/01/25 11:44:54 INFO plugin.PluginRepository: Parse Filter
>>>> (org.apache.nutch.parse.ParseFilter)
>>>> 16/01/25 11:44:54 INFO plugin.PluginRepository: Nutch Index Cleaning
>>>> Filter (org.apache.nutch.indexer.IndexCleaningFilter)
>>>> 16/01/25 11:44:54 INFO plugin.PluginRepository: Nutch Content Parser
>>>> (org.apache.nutch.parse.Parser)
>>>> 16/01/25 11:44:54 INFO plugin.PluginRepository: Nutch URL Filter
>>>> (org.apache.nutch.net.URLFilter)
>>>> 16/01/25 11:44:54 INFO plugin.PluginRepository: Nutch Scoring
>>>> (org.apache.nutch.scoring.ScoringFilter)
>>>> 16/01/25 11:44:54 INFO plugin.PluginRepository: Nutch URL Normalizer
>>>> (org.apache.nutch.net.URLNormalizer)
>>>> 16/01/25 11:44:54 INFO plugin.PluginRepository: Nutch Protocol
>>>> (org.apache.nutch.protocol.Protocol)
>>>> 16/01/25 11:44:54 INFO plugin.PluginRepository: Nutch Index Writer
>>>> (org.apache.nutch.indexer.IndexWriter)
>>>> 16/01/25 11:44:54 INFO plugin.PluginRepository: Nutch Indexing
>>>> Filter (org.apache.nutch.indexer.IndexingFilter)
>>>> 16/01/25 11:44:54 INFO indexer.IndexingFilters: Adding
>>>> org.apache.nutch.indexer.html.HtmlIndexingFilter
>>>> 16/01/25 11:44:54 INFO basic.BasicIndexingFilter: Maximum title length
>>>> for indexing set to: 100
>>>> 16/01/25 11:44:54 INFO indexer.IndexingFilters: Adding
>>>> org.apache.nutch.indexer.basic.BasicIndexingFilter
>>>> 16/01/25 11:44:54 INFO anchor.AnchorIndexingFilter: Anchor deduplication
>>>> is: off
>>>> 16/01/25 11:45:07 INFO mapreduce.JobSubmitter: Submitting tokens for
>>>> job: job_1453472314066_0007
>>>> 16/01/25 11:45:08 INFO impl.YarnClientImpl: Submitted application
>>>> application_1453472314066_0007
>>>> 16/01/25 11:45:09 INFO mapreduce.Job: The url to track the job:
>>>> http://cism479:8088/proxy/application_1453472314066_0007/
>>>> 16/01/25 11:45:09 INFO mapreduce.Job: Running job: job_1453472314066_0007
>>>> 16/01/25 11:45:29 INFO mapreduce.Job: Job job_1453472314066_0007 running
>>>> in uber mode : false
>>>> 16/01/25 11:45:29 INFO mapreduce.Job: map 0% reduce 0%
>>>> 16/01/25 11:49:24 INFO mapreduce.Job: map 50% reduce 0%
>>>> 16/01/25 11:49:29 INFO mapreduce.Job: map 0% reduce 0%
>>>> 16/01/25 11:49:29 INFO mapreduce.Job: Task Id :
>>>> attempt_1453472314066_0007_m_000000_0, Status : FAILED
>>>> Error:
>>>> org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
>>>> [com.ctc.wstx.exc.WstxLazyException] Invalid UTF-8 character 0xffff at
>>>> char #1296459, byte #1310719)
>>>> at
>>>> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:491)
>>>> at
>>>> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:197)
>>>> at
>>>> org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
>>>> at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:68)
>>>> at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:54)
>>>> at
>>>> org.apache.nutch.indexwriter.solr.SolrIndexWriter.write(SolrIndexWriter.java:84)
>>>> at org.apache.nutch.indexer.IndexWriters.write(IndexWriters.java:84)
>>>> at
>>>> org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:48)
>>>> at
>>>> org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:43)
>>>> at
>>>> org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.write(MapTask.java:635)
>>>> at
>>>> org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:89)
>>>> at
>>>> org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.write(WrappedMapper.java:112)
>>>> at
>>>> org.apache.nutch.indexer.IndexingJob$IndexerMapper.map(IndexingJob.java:120)
>>>> at
>>>> org.apache.nutch.indexer.IndexingJob$IndexerMapper.map(IndexingJob.java:69)
>>>> at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
>>>> at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
>>>> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
>>>> at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
>>>> at java.security.AccessController.doPrivileged(Native Method)
>>>> at javax.security.auth.Subject.doAs(Subject.java:422)
>>>> at
>>>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
>>>> at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
>>>>
>>>> 16/01/25 11:52:27 INFO mapreduce.Job: map 50% reduce 0%
>>>> 16/01/25 11:53:01 INFO mapreduce.Job: map 100% reduce 0%
>>>> 16/01/25 11:53:01 INFO mapreduce.Job: Task Id :
>>>> attempt_1453472314066_0007_m_000000_1, Status : FAILED
>>>> Error:
>>>> org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
>>>> [com.ctc.wstx.exc.WstxLazyException] Invalid UTF-8 character 0xffff at
>>>> char #1296459, byte #1310719)
>>>> at
>>>> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:491)
>>>> at
>>>> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:197)
>>>> at
>>>> org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
>>>> at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:68)
>>>> at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:54)
>>>> at
>>>> org.apache.nutch.indexwriter.solr.SolrIndexWriter.write(SolrIndexWriter.java:84)
>>>> at org.apache.nutch.indexer.IndexWriters.write(IndexWriters.java:84)
>>>> at
>>>> org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:48)
>>>> at
>>>> org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:43)
>>>> at
>>>> org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.write(MapTask.java:635)
>>>> at
>>>> org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:89)
>>>> at
>>>> org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.write(WrappedMapper.java:112)
>>>> at
>>>> org.apache.nutch.indexer.IndexingJob$IndexerMapper.map(IndexingJob.java:120)
>>>> at
>>>> org.apache.nutch.indexer.IndexingJob$IndexerMapper.map(IndexingJob.java:69)
>>>> at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
>>>> at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
>>>> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
>>>> at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
>>>> at java.security.AccessController.doPrivileged(Native Method)
>>>> at javax.security.auth.Subject.doAs(Subject.java:422)
>>>> at
>>>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
>>>> at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
>>>>
>>>> 16/01/25 11:53:02 INFO mapreduce.Job: map 50% reduce 0%
>>>> 16/01/25 11:54:52 INFO mapreduce.Job: map 100% reduce 0%
>>>> 16/01/25 11:54:52 INFO mapreduce.Job: Task Id :
>>>> attempt_1453472314066_0007_m_000000_2, Status : FAILED
>>>> Error:
>>>> org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
>>>> [com.ctc.wstx.exc.WstxLazyException] Invalid UTF-8 character 0xffff at
>>>> char #1296459, byte #1310719)
>>>> at
>>>> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:491)
>>>> at
>>>> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:197)
>>>> at
>>>> org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
>>>> at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:68)
>>>> at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:54)
>>>> at
>>>> org.apache.nutch.indexwriter.solr.SolrIndexWriter.write(SolrIndexWriter.java:84)
>>>> at org.apache.nutch.indexer.IndexWriters.write(IndexWriters.java:84)
>>>> at
>>>> org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:48)
>>>> at
>>>> org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:43)
>>>> at
>>>> org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.write(MapTask.java:635)
>>>> at
>>>> org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:89)
>>>> at
>>>> org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.write(WrappedMapper.java:112)
>>>> at
>>>> org.apache.nutch.indexer.IndexingJob$IndexerMapper.map(IndexingJob.java:120)
>>>> at
>>>> org.apache.nutch.indexer.IndexingJob$IndexerMapper.map(IndexingJob.java:69)
>>>> at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
>>>> at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
>>>> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
>>>> at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
>>>> at java.security.AccessController.doPrivileged(Native Method)
>>>> at javax.security.auth.Subject.doAs(Subject.java:422)
>>>> at
>>>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
>>>> at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
>>>>
>>>> 16/01/25 11:54:53 INFO mapreduce.Job: map 50% reduce 0%
>>>> 16/01/25 11:56:22 INFO mapreduce.Job: map 100% reduce 0%
>>>> 16/01/25 11:56:23 INFO mapreduce.Job: Job job_1453472314066_0007 failed
>>>> with state FAILED due to: Task failed task_1453472314066_0007_m_000000
>>>> Job failed as tasks failed. failedMaps:1 failedReduces:0
>>>>
>>>> 16/01/25 11:56:23 INFO mapreduce.Job: Counters: 33
>>>> File System Counters
>>>> FILE: Number of bytes read=0
>>>> FILE: Number of bytes written=116194
>>>> FILE: Number of read operations=0
>>>> FILE: Number of large read operations=0
>>>> FILE: Number of write operations=0
>>>> HDFS: Number of bytes read=1033
>>>> HDFS: Number of bytes written=0
>>>> HDFS: Number of read operations=1
>>>> HDFS: Number of large read operations=0
>>>> HDFS: Number of write operations=0
>>>> Job Counters
>>>> Failed map tasks=4
>>>> Launched map tasks=5
>>>> Other local map tasks=3
>>>> Data-local map tasks=2
>>>> Total time spent by all maps in occupied slots (ms)=3168342
>>>> Total time spent by all reduces in occupied slots (ms)=0
>>>> Total time spent by all map tasks (ms)=1056114
>>>> Total vcore-seconds taken by all map tasks=1056114
>>>> Total megabyte-seconds taken by all map tasks=3244382208
>>>> Map-Reduce Framework
>>>> Map input records=2762511
>>>> Map output records=17629
>>>> Input split bytes=1033
>>>> Spilled Records=0
>>>> Failed Shuffles=0
>>>> Merged Map outputs=0
>>>> GC time elapsed (ms)=2995
>>>> CPU time spent (ms)=116860
>>>> Physical memory (bytes) snapshot=1272868864
>>>> Virtual memory (bytes) snapshot=5104431104
>>>> Total committed heap usage (bytes)=1017118720
>>>> IndexerJob
>>>> DocumentCount=17629
>>>> File Input Format Counters
>>>> Bytes Read=0
>>>> File Output Format Counters
>>>> Bytes Written=0
>>>> 16/01/25 11:56:23 ERROR indexer.IndexingJob: SolrIndexerJob:
>>>> java.lang.RuntimeException: job failed: name=[1]Indexer,
>>>> jobid=job_1453472314066_0007
>>>> at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:120)
>>>> at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:154)
>>>> at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:176)
>>>> at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:202)
>>>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>>>> at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:211)
>>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>> at
>>>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>>>> at
>>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>> at java.lang.reflect.Method.invoke(Method.java:497)
>>>> at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
>>>> *******************************************************
RE: [CIS-CMMI-3] Re: [CIS-CMMI-3] Invalid UTF-8 character 0xffff at
char exception
Posted by Markus Jelsma <ma...@openindex.io>.
That is odd! Is it on your content or title field?
Markus
-----Original message-----
> From:Kshitij Shukla <ks...@cisinlabs.com>
> Sent: Monday 25th January 2016 11:41
> To: user@nutch.apache.org
> Subject: [CIS-CMMI-3] Re: [CIS-CMMI-3] Invalid UTF-8 character 0xffff at char exception
>
> Thanks for your response, Markus. I checked the code and found the
> workaround you suggested in this file:
>
> *Source:*
> /src/plugin/indexer-solr/src/java/org/apache/nutch/indexwriter/solr/SolrUtils.java
>
> and the method was called in this file:
>
> *Invoked:*
> /src/plugin/indexer-solr/src/java/org/apache/nutch/indexwriter/solr/SolrIndexWriter.java
> like this
> if (e.getKey().equals("content") || e.getKey().equals("title")) {
> val2 = SolrUtils.stripNonCharCodepoints(val);
> }
>
> So the method exists and is apparently invoked in the right place. Where
> do you think the problem could be?
>
> Thanks again for your help.
>
> On Monday 25 January 2016 03:35 PM, Markus Jelsma wrote:
> > Hi - this is NUTCH-1016, which was never ported to 2.x.
> >
> > https://issues.apache.org/jira/browse/NUTCH-1016
> >
> >
> >
> > -----Original message-----
> >> From:Kshitij Shukla <ks...@cisinlabs.com>
> >> Sent: Monday 25th January 2016 8:23
> >> To: user@nutch.apache.org
> >> Subject: [CIS-CMMI-3] Invalid UTF-8 character 0xffff at char exception
> >>
> >> Hello everyone,
> >>
> >> During a very large crawl when indexing to Solr this will yield the
> >> following exception:
> >>
> >> **************************************************
> >> root@cism479:/usr/share/searchEngine/apache-nutch-2.3.1/runtime/deploy/bin#
> >> /usr/share/searchEngine/apache-nutch-2.3.1/runtime/deploy/bin/nutch
> >> index -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D
> >> mapred.reduce.tasks.speculative.execution=false -D
> >> mapred.map.tasks.speculative.execution=false -D
> >> mapred.compress.map.output=true -D
> >> solr.server.url=http://localhost:8983/solr/ddcds -all -crawlId 1
> >> 16/01/25 11:44:52 INFO indexer.IndexingJob: IndexingJob: starting
> >> 16/01/25 11:44:53 INFO Configuration.deprecation:
> >> mapred.output.key.comparator.class is deprecated. Instead, use
> >> mapreduce.job.output.key.comparator.class
> >> 16/01/25 11:44:53 INFO plugin.PluginRepository: Plugins: looking in:
> >> /tmp/hadoop-root/hadoop-unjar4772724649160367470/classes/plugins
> >> 16/01/25 11:44:54 INFO plugin.PluginRepository: Plugin Auto-activation
> >> mode: [true]
> >> 16/01/25 11:44:54 INFO plugin.PluginRepository: Registered Plugins:
> >> 16/01/25 11:44:54 INFO plugin.PluginRepository: HTTP Framework
> >> (lib-http)
> >> 16/01/25 11:44:54 INFO plugin.PluginRepository: Html Parse Plug-in
> >> (parse-html)
> >> 16/01/25 11:44:54 INFO plugin.PluginRepository: MetaTags
> >> (parse-metatags)
> >> 16/01/25 11:44:54 INFO plugin.PluginRepository: Html Indexing Filter
> >> (index-html)
> >> 16/01/25 11:44:54 INFO plugin.PluginRepository: the nutch core
> >> extension points (nutch-extensionpoints)
> >> 16/01/25 11:44:54 INFO plugin.PluginRepository: Basic Indexing
> >> Filter (index-basic)
> >> 16/01/25 11:44:54 INFO plugin.PluginRepository: XML Libraries (lib-xml)
> >> 16/01/25 11:44:54 INFO plugin.PluginRepository: Anchor Indexing
> >> Filter (index-anchor)
> >> 16/01/25 11:44:54 INFO plugin.PluginRepository: Basic URL Normalizer
> >> (urlnormalizer-basic)
> >> 16/01/25 11:44:54 INFO plugin.PluginRepository: Language
> >> Identification Parser/Filter (language-identifier)
> >> 16/01/25 11:44:54 INFO plugin.PluginRepository: Metadata Indexing
> >> Filter (index-metadata)
> >> 16/01/25 11:44:54 INFO plugin.PluginRepository: CyberNeko HTML
> >> Parser (lib-nekohtml)
> >> 16/01/25 11:44:54 INFO plugin.PluginRepository: Subcollection
> >> indexing and query filter (subcollection)
> >> 16/01/25 11:44:54 INFO plugin.PluginRepository: SOLRIndexWriter
> >> (indexer-solr)
> >> 16/01/25 11:44:54 INFO plugin.PluginRepository: Rel-Tag microformat
> >> Parser/Indexer/Querier (microformats-reltag)
> >> 16/01/25 11:44:54 INFO plugin.PluginRepository: Http / Https
> >> Protocol Plug-in (protocol-httpclient)
> >> 16/01/25 11:44:54 INFO plugin.PluginRepository: JavaScript Parser
> >> (parse-js)
> >> 16/01/25 11:44:54 INFO plugin.PluginRepository: Tika Parser Plug-in
> >> (parse-tika)
> >> 16/01/25 11:44:54 INFO plugin.PluginRepository: Top Level Domain
> >> Plugin (tld)
> >> 16/01/25 11:44:54 INFO plugin.PluginRepository: Regex URL Filter
> >> Framework (lib-regex-filter)
> >> 16/01/25 11:44:54 INFO plugin.PluginRepository: Regex URL Normalizer
> >> (urlnormalizer-regex)
> >> 16/01/25 11:44:54 INFO plugin.PluginRepository: Link Analysis
> >> Scoring Plug-in (scoring-link)
> >> 16/01/25 11:44:54 INFO plugin.PluginRepository: OPIC Scoring Plug-in
> >> (scoring-opic)
> >> 16/01/25 11:44:54 INFO plugin.PluginRepository: More Indexing Filter
> >> (index-more)
> >> 16/01/25 11:44:54 INFO plugin.PluginRepository: Http Protocol
> >> Plug-in (protocol-http)
> >> 16/01/25 11:44:54 INFO plugin.PluginRepository: Creative Commons
> >> Plugins (creativecommons)
> >> 16/01/25 11:44:54 INFO plugin.PluginRepository: Registered Extension-Points:
> >> 16/01/25 11:44:54 INFO plugin.PluginRepository: Parse Filter
> >> (org.apache.nutch.parse.ParseFilter)
> >> 16/01/25 11:44:54 INFO plugin.PluginRepository: Nutch Index Cleaning
> >> Filter (org.apache.nutch.indexer.IndexCleaningFilter)
> >> 16/01/25 11:44:54 INFO plugin.PluginRepository: Nutch Content Parser
> >> (org.apache.nutch.parse.Parser)
> >> 16/01/25 11:44:54 INFO plugin.PluginRepository: Nutch URL Filter
> >> (org.apache.nutch.net.URLFilter)
> >> 16/01/25 11:44:54 INFO plugin.PluginRepository: Nutch Scoring
> >> (org.apache.nutch.scoring.ScoringFilter)
> >> 16/01/25 11:44:54 INFO plugin.PluginRepository: Nutch URL Normalizer
> >> (org.apache.nutch.net.URLNormalizer)
> >> 16/01/25 11:44:54 INFO plugin.PluginRepository: Nutch Protocol
> >> (org.apache.nutch.protocol.Protocol)
> >> 16/01/25 11:44:54 INFO plugin.PluginRepository: Nutch Index Writer
> >> (org.apache.nutch.indexer.IndexWriter)
> >> 16/01/25 11:44:54 INFO plugin.PluginRepository: Nutch Indexing
> >> Filter (org.apache.nutch.indexer.IndexingFilter)
> >> 16/01/25 11:44:54 INFO indexer.IndexingFilters: Adding
> >> org.apache.nutch.indexer.html.HtmlIndexingFilter
> >> 16/01/25 11:44:54 INFO basic.BasicIndexingFilter: Maximum title length
> >> for indexing set to: 100
> >> 16/01/25 11:44:54 INFO indexer.IndexingFilters: Adding
> >> org.apache.nutch.indexer.basic.BasicIndexingFilter
> >> 16/01/25 11:44:54 INFO anchor.AnchorIndexingFilter: Anchor deduplication
> >> is: off
> >> 16/01/25 11:45:07 INFO mapreduce.JobSubmitter: Submitting tokens for
> >> job: job_1453472314066_0007
> >> 16/01/25 11:45:08 INFO impl.YarnClientImpl: Submitted application
> >> application_1453472314066_0007
> >> 16/01/25 11:45:09 INFO mapreduce.Job: The url to track the job:
> >> http://cism479:8088/proxy/application_1453472314066_0007/
> >> 16/01/25 11:45:09 INFO mapreduce.Job: Running job: job_1453472314066_0007
> >> 16/01/25 11:45:29 INFO mapreduce.Job: Job job_1453472314066_0007 running
> >> in uber mode : false
> >> 16/01/25 11:45:29 INFO mapreduce.Job: map 0% reduce 0%
> >> 16/01/25 11:49:24 INFO mapreduce.Job: map 50% reduce 0%
> >> 16/01/25 11:49:29 INFO mapreduce.Job: map 0% reduce 0%
> >> 16/01/25 11:49:29 INFO mapreduce.Job: Task Id :
> >> attempt_1453472314066_0007_m_000000_0, Status : FAILED
> >> Error:
> >> org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
> >> [com.ctc.wstx.exc.WstxLazyException] Invalid UTF-8 character 0xffff at
> >> char #1296459, byte #1310719)
> >> at
> >> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:491)
> >> at
> >> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:197)
> >> at
> >> org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
> >> at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:68)
> >> at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:54)
> >> at
> >> org.apache.nutch.indexwriter.solr.SolrIndexWriter.write(SolrIndexWriter.java:84)
> >> at org.apache.nutch.indexer.IndexWriters.write(IndexWriters.java:84)
> >> at
> >> org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:48)
> >> at
> >> org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:43)
> >> at
> >> org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.write(MapTask.java:635)
> >> at
> >> org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:89)
> >> at
> >> org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.write(WrappedMapper.java:112)
> >> at
> >> org.apache.nutch.indexer.IndexingJob$IndexerMapper.map(IndexingJob.java:120)
> >> at
> >> org.apache.nutch.indexer.IndexingJob$IndexerMapper.map(IndexingJob.java:69)
> >> at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
> >> at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
> >> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
> >> at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
> >> at java.security.AccessController.doPrivileged(Native Method)
> >> at javax.security.auth.Subject.doAs(Subject.java:422)
> >> at
> >> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
> >> at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
> >>
> >> 16/01/25 11:52:27 INFO mapreduce.Job: map 50% reduce 0%
> >> 16/01/25 11:53:01 INFO mapreduce.Job: map 100% reduce 0%
> >> 16/01/25 11:53:01 INFO mapreduce.Job: Task Id :
> >> attempt_1453472314066_0007_m_000000_1, Status : FAILED
> >> Error:
> >> org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
> >> [com.ctc.wstx.exc.WstxLazyException] Invalid UTF-8 character 0xffff at
> >> char #1296459, byte #1310719)
> >> at
> >> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:491)
> >> at
> >> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:197)
> >> at
> >> org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
> >> at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:68)
> >> at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:54)
> >> at
> >> org.apache.nutch.indexwriter.solr.SolrIndexWriter.write(SolrIndexWriter.java:84)
> >> at org.apache.nutch.indexer.IndexWriters.write(IndexWriters.java:84)
> >> at
> >> org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:48)
> >> at
> >> org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:43)
> >> at
> >> org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.write(MapTask.java:635)
> >> at
> >> org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:89)
> >> at
> >> org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.write(WrappedMapper.java:112)
> >> at
> >> org.apache.nutch.indexer.IndexingJob$IndexerMapper.map(IndexingJob.java:120)
> >> at
> >> org.apache.nutch.indexer.IndexingJob$IndexerMapper.map(IndexingJob.java:69)
> >> at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
> >> at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
> >> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
> >> at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
> >> at java.security.AccessController.doPrivileged(Native Method)
> >> at javax.security.auth.Subject.doAs(Subject.java:422)
> >> at
> >> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
> >> at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
> >>
> >> 16/01/25 11:53:02 INFO mapreduce.Job: map 50% reduce 0%
> >> 16/01/25 11:54:52 INFO mapreduce.Job: map 100% reduce 0%
> >> 16/01/25 11:54:52 INFO mapreduce.Job: Task Id :
> >> attempt_1453472314066_0007_m_000000_2, Status : FAILED
> >> Error:
> >> org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
> >> [com.ctc.wstx.exc.WstxLazyException] Invalid UTF-8 character 0xffff at
> >> char #1296459, byte #1310719)
> >> at
> >> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:491)
> >> at
> >> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:197)
> >> at
> >> org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
> >> at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:68)
> >> at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:54)
> >> at
> >> org.apache.nutch.indexwriter.solr.SolrIndexWriter.write(SolrIndexWriter.java:84)
> >> at org.apache.nutch.indexer.IndexWriters.write(IndexWriters.java:84)
> >> at
> >> org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:48)
> >> at
> >> org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:43)
> >> at
> >> org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.write(MapTask.java:635)
> >> at
> >> org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:89)
> >> at
> >> org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.write(WrappedMapper.java:112)
> >> at
> >> org.apache.nutch.indexer.IndexingJob$IndexerMapper.map(IndexingJob.java:120)
> >> at
> >> org.apache.nutch.indexer.IndexingJob$IndexerMapper.map(IndexingJob.java:69)
> >> at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
> >> at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
> >> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
> >> at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
> >> at java.security.AccessController.doPrivileged(Native Method)
> >> at javax.security.auth.Subject.doAs(Subject.java:422)
> >> at
> >> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
> >> at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
> >>
> >> 16/01/25 11:54:53 INFO mapreduce.Job: map 50% reduce 0%
> >> 16/01/25 11:56:22 INFO mapreduce.Job: map 100% reduce 0%
> >> 16/01/25 11:56:23 INFO mapreduce.Job: Job job_1453472314066_0007 failed
> >> with state FAILED due to: Task failed task_1453472314066_0007_m_000000
> >> Job failed as tasks failed. failedMaps:1 failedReduces:0
> >>
> >> 16/01/25 11:56:23 INFO mapreduce.Job: Counters: 33
> >> File System Counters
> >> FILE: Number of bytes read=0
> >> FILE: Number of bytes written=116194
> >> FILE: Number of read operations=0
> >> FILE: Number of large read operations=0
> >> FILE: Number of write operations=0
> >> HDFS: Number of bytes read=1033
> >> HDFS: Number of bytes written=0
> >> HDFS: Number of read operations=1
> >> HDFS: Number of large read operations=0
> >> HDFS: Number of write operations=0
> >> Job Counters
> >> Failed map tasks=4
> >> Launched map tasks=5
> >> Other local map tasks=3
> >> Data-local map tasks=2
> >> Total time spent by all maps in occupied slots (ms)=3168342
> >> Total time spent by all reduces in occupied slots (ms)=0
> >> Total time spent by all map tasks (ms)=1056114
> >> Total vcore-seconds taken by all map tasks=1056114
> >> Total megabyte-seconds taken by all map tasks=3244382208
> >> Map-Reduce Framework
> >> Map input records=2762511
> >> Map output records=17629
> >> Input split bytes=1033
> >> Spilled Records=0
> >> Failed Shuffles=0
> >> Merged Map outputs=0
> >> GC time elapsed (ms)=2995
> >> CPU time spent (ms)=116860
> >> Physical memory (bytes) snapshot=1272868864
> >> Virtual memory (bytes) snapshot=5104431104
> >> Total committed heap usage (bytes)=1017118720
> >> IndexerJob
> >> DocumentCount=17629
> >> File Input Format Counters
> >> Bytes Read=0
> >> File Output Format Counters
> >> Bytes Written=0
> >> 16/01/25 11:56:23 ERROR indexer.IndexingJob: SolrIndexerJob:
> >> java.lang.RuntimeException: job failed: name=[1]Indexer,
> >> jobid=job_1453472314066_0007
> >> at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:120)
> >> at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:154)
> >> at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:176)
> >> at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:202)
> >> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
> >> at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:211)
> >> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> >> at
> >> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> >> at
> >> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> >> at java.lang.reflect.Method.invoke(Method.java:497)
> >> at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
> >> *******************************************************
> >> --
> >>
> >> Please let me know if you have any questions , concerns or updates.
> >> Have a great day ahead :)
> >>
> >> Thanks and Regards,
> >>
> >> Kshitij Shukla
> >> Software developer
> >>
> >> *Cyber Infrastructure(CIS)
> >> **/The RightSourcing Specialists with 1250 man years of experience!/*
> >>
> >> DISCLAIMER: INFORMATION PRIVACY is important for us, If you are not the
> >> intended recipient, you should delete this message and are notified that
> >> any disclosure, copying or distribution of this message, or taking any
> >> action based on it, is strictly prohibited by Law.
> >>
> >> Please don't print this e-mail unless you really need to.
> >>
> >> --
> >>
> >> ------------------------------
> >>
> >> *Cyber Infrastructure (P) Limited, [CIS] **(CMMI Level 3 Certified)*
> >>
> >> Central India's largest Technology company.
> >>
> >> *Ensuring the success of our clients and partners through our highly
> >> optimized Technology solutions.*
> >>
> >> www.cisin.com | +Cisin <https://plus.google.com/+Cisin/> | Linkedin
> >> <https://www.linkedin.com/company/cyber-infrastructure-private-limited> |
> >> Offices: *Indore, India.* *Singapore. Silicon Valley, USA*.
> >>
> >> DISCLAIMER: INFORMATION PRIVACY is important for us, If you are not the
> >> intended recipient, you should delete this message and are notified that
> >> any disclosure, copying or distribution of this message, or taking any
> >> action based on it, is strictly prohibited by Law.
> >>
>
>
[CIS-CMMI-3] Re: [CIS-CMMI-3] Invalid UTF-8 character 0xffff at char exception
Posted by Kshitij Shukla <ks...@cisinlabs.com>.
Thanks for your response, Markus. I checked the code and found the workaround you suggested in this file:
*Source:*
/src/plugin/indexer-solr/src/java/org/apache/nutch/indexwriter/solr/SolrUtils.java
and the method is invoked in this file:
*Invoked:*
/src/plugin/indexer-solr/src/java/org/apache/nutch/indexwriter/solr/SolrIndexWriter.java
like this:
if (e.getKey().equals("content") || e.getKey().equals("title")) {
    val2 = SolrUtils.stripNonCharCodepoints(val);
}
So the method is there and is apparently invoked in the right place. Where do you think the problem could be?
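For reference, the technique in question — dropping Unicode non-characters such as U+FFFF from a field value before sending it to Solr — can be sketched as a standalone class. This is an illustration of the idea, not the actual Nutch SolrUtils source; the class name is hypothetical, and for simplicity it only handles BMP code units (supplementary non-characters like U+1FFFE would need code-point iteration):

```java
// Sketch: strip Unicode non-characters from a string so Solr's XML
// parser (Woodstox) does not reject the update request.
public class StripNonChars {

    public static String stripNonCharCodepoints(String input) {
        StringBuilder sb = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            // Skip the non-character block U+FDD0..U+FDEF.
            if (c >= 0xFDD0 && c <= 0xFDEF) continue;
            // Skip U+FFFE and U+FFFF (low bit masked off leaves 0xFFFE).
            if ((c & 0xFFFE) == 0xFFFE) continue;
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String dirty = "title\uFFFFwith\uFDD0junk";
        System.out.println(stripNonCharCodepoints(dirty)); // prints "titlewithjunk"
    }
}
```

Note that in Nutch the filtering is only applied to selected fields ("content" and "title" in the snippet above), so a bad character arriving through any other field would still reach Solr unfiltered.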
Thanks again for your help.
On Monday 25 January 2016 03:35 PM, Markus Jelsma wrote:
> Hi - this is NUTCH-1016, which was never ported to 2.x.
>
> https://issues.apache.org/jira/browse/NUTCH-1016
>
>
>
> -----Original message-----
>> From:Kshitij Shukla <ks...@cisinlabs.com>
>> Sent: Monday 25th January 2016 8:23
>> To: user@nutch.apache.org
>> Subject: [CIS-CMMI-3] Invalid UTF-8 character 0xffff at char exception
>>
>> Hello everyone,
>>
>> During a very large crawl when indexing to Solr this will yield the
>> following exception:
>>
>> **************************************************
>> root@cism479:/usr/share/searchEngine/apache-nutch-2.3.1/runtime/deploy/bin#
>> /usr/share/searchEngine/apache-nutch-2.3.1/runtime/deploy/bin/nutch
>> index -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D
>> mapred.reduce.tasks.speculative.execution=false -D
>> mapred.map.tasks.speculative.execution=false -D
>> mapred.compress.map.output=true -D
>> solr.server.url=http://localhost:8983/solr/ddcds -all -crawlId 1
>> 16/01/25 11:44:52 INFO indexer.IndexingJob: IndexingJob: starting
>> 16/01/25 11:44:53 INFO Configuration.deprecation:
>> mapred.output.key.comparator.class is deprecated. Instead, use
>> mapreduce.job.output.key.comparator.class
>> 16/01/25 11:44:53 INFO plugin.PluginRepository: Plugins: looking in:
>> /tmp/hadoop-root/hadoop-unjar4772724649160367470/classes/plugins
>> 16/01/25 11:44:54 INFO plugin.PluginRepository: Plugin Auto-activation
>> mode: [true]
>> 16/01/25 11:44:54 INFO plugin.PluginRepository: Registered Plugins:
>> 16/01/25 11:44:54 INFO plugin.PluginRepository: HTTP Framework
>> (lib-http)
>> 16/01/25 11:44:54 INFO plugin.PluginRepository: Html Parse Plug-in
>> (parse-html)
>> 16/01/25 11:44:54 INFO plugin.PluginRepository: MetaTags
>> (parse-metatags)
>> 16/01/25 11:44:54 INFO plugin.PluginRepository: Html Indexing Filter
>> (index-html)
>> 16/01/25 11:44:54 INFO plugin.PluginRepository: the nutch core
>> extension points (nutch-extensionpoints)
>> 16/01/25 11:44:54 INFO plugin.PluginRepository: Basic Indexing
>> Filter (index-basic)
>> 16/01/25 11:44:54 INFO plugin.PluginRepository: XML Libraries (lib-xml)
>> 16/01/25 11:44:54 INFO plugin.PluginRepository: Anchor Indexing
>> Filter (index-anchor)
>> 16/01/25 11:44:54 INFO plugin.PluginRepository: Basic URL Normalizer
>> (urlnormalizer-basic)
>> 16/01/25 11:44:54 INFO plugin.PluginRepository: Language
>> Identification Parser/Filter (language-identifier)
>> 16/01/25 11:44:54 INFO plugin.PluginRepository: Metadata Indexing
>> Filter (index-metadata)
>> 16/01/25 11:44:54 INFO plugin.PluginRepository: CyberNeko HTML
>> Parser (lib-nekohtml)
>> 16/01/25 11:44:54 INFO plugin.PluginRepository: Subcollection
>> indexing and query filter (subcollection)
>> 16/01/25 11:44:54 INFO plugin.PluginRepository: SOLRIndexWriter
>> (indexer-solr)
>> 16/01/25 11:44:54 INFO plugin.PluginRepository: Rel-Tag microformat
>> Parser/Indexer/Querier (microformats-reltag)
>> 16/01/25 11:44:54 INFO plugin.PluginRepository: Http / Https
>> Protocol Plug-in (protocol-httpclient)
>> 16/01/25 11:44:54 INFO plugin.PluginRepository: JavaScript Parser
>> (parse-js)
>> 16/01/25 11:44:54 INFO plugin.PluginRepository: Tika Parser Plug-in
>> (parse-tika)
>> 16/01/25 11:44:54 INFO plugin.PluginRepository: Top Level Domain
>> Plugin (tld)
>> 16/01/25 11:44:54 INFO plugin.PluginRepository: Regex URL Filter
>> Framework (lib-regex-filter)
>> 16/01/25 11:44:54 INFO plugin.PluginRepository: Regex URL Normalizer
>> (urlnormalizer-regex)
>> 16/01/25 11:44:54 INFO plugin.PluginRepository: Link Analysis
>> Scoring Plug-in (scoring-link)
>> 16/01/25 11:44:54 INFO plugin.PluginRepository: OPIC Scoring Plug-in
>> (scoring-opic)
>> 16/01/25 11:44:54 INFO plugin.PluginRepository: More Indexing Filter
>> (index-more)
>> 16/01/25 11:44:54 INFO plugin.PluginRepository: Http Protocol
>> Plug-in (protocol-http)
>> 16/01/25 11:44:54 INFO plugin.PluginRepository: Creative Commons
>> Plugins (creativecommons)
>> 16/01/25 11:44:54 INFO plugin.PluginRepository: Registered Extension-Points:
>> 16/01/25 11:44:54 INFO plugin.PluginRepository: Parse Filter
>> (org.apache.nutch.parse.ParseFilter)
>> 16/01/25 11:44:54 INFO plugin.PluginRepository: Nutch Index Cleaning
>> Filter (org.apache.nutch.indexer.IndexCleaningFilter)
>> 16/01/25 11:44:54 INFO plugin.PluginRepository: Nutch Content Parser
>> (org.apache.nutch.parse.Parser)
>> 16/01/25 11:44:54 INFO plugin.PluginRepository: Nutch URL Filter
>> (org.apache.nutch.net.URLFilter)
>> 16/01/25 11:44:54 INFO plugin.PluginRepository: Nutch Scoring
>> (org.apache.nutch.scoring.ScoringFilter)
>> 16/01/25 11:44:54 INFO plugin.PluginRepository: Nutch URL Normalizer
>> (org.apache.nutch.net.URLNormalizer)
>> 16/01/25 11:44:54 INFO plugin.PluginRepository: Nutch Protocol
>> (org.apache.nutch.protocol.Protocol)
>> 16/01/25 11:44:54 INFO plugin.PluginRepository: Nutch Index Writer
>> (org.apache.nutch.indexer.IndexWriter)
>> 16/01/25 11:44:54 INFO plugin.PluginRepository: Nutch Indexing
>> Filter (org.apache.nutch.indexer.IndexingFilter)
>> 16/01/25 11:44:54 INFO indexer.IndexingFilters: Adding
>> org.apache.nutch.indexer.html.HtmlIndexingFilter
>> 16/01/25 11:44:54 INFO basic.BasicIndexingFilter: Maximum title length
>> for indexing set to: 100
>> 16/01/25 11:44:54 INFO indexer.IndexingFilters: Adding
>> org.apache.nutch.indexer.basic.BasicIndexingFilter
>> 16/01/25 11:44:54 INFO anchor.AnchorIndexingFilter: Anchor deduplication
>> is: off
>> 16/01/25 11:45:07 INFO mapreduce.JobSubmitter: Submitting tokens for
>> job: job_1453472314066_0007
>> 16/01/25 11:45:08 INFO impl.YarnClientImpl: Submitted application
>> application_1453472314066_0007
>> 16/01/25 11:45:09 INFO mapreduce.Job: The url to track the job:
>> http://cism479:8088/proxy/application_1453472314066_0007/
>> 16/01/25 11:45:09 INFO mapreduce.Job: Running job: job_1453472314066_0007
>> 16/01/25 11:45:29 INFO mapreduce.Job: Job job_1453472314066_0007 running
>> in uber mode : false
>> 16/01/25 11:45:29 INFO mapreduce.Job: map 0% reduce 0%
>> 16/01/25 11:49:24 INFO mapreduce.Job: map 50% reduce 0%
>> 16/01/25 11:49:29 INFO mapreduce.Job: map 0% reduce 0%
>> 16/01/25 11:49:29 INFO mapreduce.Job: Task Id :
>> attempt_1453472314066_0007_m_000000_0, Status : FAILED
>> Error:
>> org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
>> [com.ctc.wstx.exc.WstxLazyException] Invalid UTF-8 character 0xffff at
>> char #1296459, byte #1310719)
>> at
>> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:491)
>> at
>> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:197)
>> at
>> org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
>> at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:68)
>> at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:54)
>> at
>> org.apache.nutch.indexwriter.solr.SolrIndexWriter.write(SolrIndexWriter.java:84)
>> at org.apache.nutch.indexer.IndexWriters.write(IndexWriters.java:84)
>> at
>> org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:48)
>> at
>> org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:43)
>> at
>> org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.write(MapTask.java:635)
>> at
>> org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:89)
>> at
>> org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.write(WrappedMapper.java:112)
>> at
>> org.apache.nutch.indexer.IndexingJob$IndexerMapper.map(IndexingJob.java:120)
>> at
>> org.apache.nutch.indexer.IndexingJob$IndexerMapper.map(IndexingJob.java:69)
>> at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
>> at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
>> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
>> at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
>> at java.security.AccessController.doPrivileged(Native Method)
>> at javax.security.auth.Subject.doAs(Subject.java:422)
>> at
>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
>> at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
>>
>> 16/01/25 11:52:27 INFO mapreduce.Job: map 50% reduce 0%
>> 16/01/25 11:53:01 INFO mapreduce.Job: map 100% reduce 0%
>> 16/01/25 11:53:01 INFO mapreduce.Job: Task Id :
>> attempt_1453472314066_0007_m_000000_1, Status : FAILED
>> Error:
>> org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
>> [com.ctc.wstx.exc.WstxLazyException] Invalid UTF-8 character 0xffff at
>> char #1296459, byte #1310719)
>> at
>> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:491)
>> at
>> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:197)
>> at
>> org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
>> at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:68)
>> at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:54)
>> at
>> org.apache.nutch.indexwriter.solr.SolrIndexWriter.write(SolrIndexWriter.java:84)
>> at org.apache.nutch.indexer.IndexWriters.write(IndexWriters.java:84)
>> at
>> org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:48)
>> at
>> org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:43)
>> at
>> org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.write(MapTask.java:635)
>> at
>> org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:89)
>> at
>> org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.write(WrappedMapper.java:112)
>> at
>> org.apache.nutch.indexer.IndexingJob$IndexerMapper.map(IndexingJob.java:120)
>> at
>> org.apache.nutch.indexer.IndexingJob$IndexerMapper.map(IndexingJob.java:69)
>> at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
>> at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
>> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
>> at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
>> at java.security.AccessController.doPrivileged(Native Method)
>> at javax.security.auth.Subject.doAs(Subject.java:422)
>> at
>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
>> at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
>>
>> 16/01/25 11:53:02 INFO mapreduce.Job: map 50% reduce 0%
>> 16/01/25 11:54:52 INFO mapreduce.Job: map 100% reduce 0%
>> 16/01/25 11:54:52 INFO mapreduce.Job: Task Id :
>> attempt_1453472314066_0007_m_000000_2, Status : FAILED
>> Error:
>> org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
>> [com.ctc.wstx.exc.WstxLazyException] Invalid UTF-8 character 0xffff at
>> char #1296459, byte #1310719)
>> at
>> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:491)
>> at
>> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:197)
>> at
>> org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
>> at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:68)
>> at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:54)
>> at
>> org.apache.nutch.indexwriter.solr.SolrIndexWriter.write(SolrIndexWriter.java:84)
>> at org.apache.nutch.indexer.IndexWriters.write(IndexWriters.java:84)
>> at
>> org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:48)
>> at
>> org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:43)
>> at
>> org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.write(MapTask.java:635)
>> at
>> org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:89)
>> at
>> org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.write(WrappedMapper.java:112)
>> at
>> org.apache.nutch.indexer.IndexingJob$IndexerMapper.map(IndexingJob.java:120)
>> at
>> org.apache.nutch.indexer.IndexingJob$IndexerMapper.map(IndexingJob.java:69)
>> at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
>> at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
>> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
>> at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
>> at java.security.AccessController.doPrivileged(Native Method)
>> at javax.security.auth.Subject.doAs(Subject.java:422)
>> at
>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
>> at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
>>
>> 16/01/25 11:54:53 INFO mapreduce.Job: map 50% reduce 0%
>> 16/01/25 11:56:22 INFO mapreduce.Job: map 100% reduce 0%
>> 16/01/25 11:56:23 INFO mapreduce.Job: Job job_1453472314066_0007 failed
>> with state FAILED due to: Task failed task_1453472314066_0007_m_000000
>> Job failed as tasks failed. failedMaps:1 failedReduces:0
>>
>> 16/01/25 11:56:23 INFO mapreduce.Job: Counters: 33
>> File System Counters
>> FILE: Number of bytes read=0
>> FILE: Number of bytes written=116194
>> FILE: Number of read operations=0
>> FILE: Number of large read operations=0
>> FILE: Number of write operations=0
>> HDFS: Number of bytes read=1033
>> HDFS: Number of bytes written=0
>> HDFS: Number of read operations=1
>> HDFS: Number of large read operations=0
>> HDFS: Number of write operations=0
>> Job Counters
>> Failed map tasks=4
>> Launched map tasks=5
>> Other local map tasks=3
>> Data-local map tasks=2
>> Total time spent by all maps in occupied slots (ms)=3168342
>> Total time spent by all reduces in occupied slots (ms)=0
>> Total time spent by all map tasks (ms)=1056114
>> Total vcore-seconds taken by all map tasks=1056114
>> Total megabyte-seconds taken by all map tasks=3244382208
>> Map-Reduce Framework
>> Map input records=2762511
>> Map output records=17629
>> Input split bytes=1033
>> Spilled Records=0
>> Failed Shuffles=0
>> Merged Map outputs=0
>> GC time elapsed (ms)=2995
>> CPU time spent (ms)=116860
>> Physical memory (bytes) snapshot=1272868864
>> Virtual memory (bytes) snapshot=5104431104
>> Total committed heap usage (bytes)=1017118720
>> IndexerJob
>> DocumentCount=17629
>> File Input Format Counters
>> Bytes Read=0
>> File Output Format Counters
>> Bytes Written=0
>> 16/01/25 11:56:23 ERROR indexer.IndexingJob: SolrIndexerJob:
>> java.lang.RuntimeException: job failed: name=[1]Indexer, jobid=job_1453472314066_0007
>> at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:120)
>> at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:154)
>> at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:176)
>> at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:202)
>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>> at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:211)
>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>> at java.lang.reflect.Method.invoke(Method.java:497)
>> at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
>> *******************************************************
>> --
>>
>> Please let me know if you have any questions, concerns or updates.
>> Have a great day ahead :)
>>
>> Thanks and Regards,
>>
>> Kshitij Shukla
>> Software developer
>>
>> *Cyber Infrastructure(CIS)
>> **/The RightSourcing Specialists with 1250 man years of experience!/*
>>
>> DISCLAIMER: INFORMATION PRIVACY is important for us, If you are not the
>> intended recipient, you should delete this message and are notified that
>> any disclosure, copying or distribution of this message, or taking any
>> action based on it, is strictly prohibited by Law.
>>
>> Please don't print this e-mail unless you really need to.
>>
>> --
>>
>> ------------------------------
>>
>> *Cyber Infrastructure (P) Limited, [CIS] **(CMMI Level 3 Certified)*
>>
>> Central India's largest Technology company.
>>
>> *Ensuring the success of our clients and partners through our highly
>> optimized Technology solutions.*
>>
>> www.cisin.com | +Cisin <https://plus.google.com/+Cisin/> | Linkedin
>> <https://www.linkedin.com/company/cyber-infrastructure-private-limited> |
>> Offices: *Indore, India.* *Singapore. Silicon Valley, USA*.
>>
>> DISCLAIMER: INFORMATION PRIVACY is important for us, If you are not the
>> intended recipient, you should delete this message and are notified that
>> any disclosure, copying or distribution of this message, or taking any
>> action based on it, is strictly prohibited by Law.
>>
RE: [CIS-CMMI-3] Invalid UTF-8 character 0xffff at char exception
Posted by Markus Jelsma <ma...@openindex.io>.
Hi - this is NUTCH-1016, which was never ported to 2.x.
https://issues.apache.org/jira/browse/NUTCH-1016
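NUTCH-1016 addresses this by filtering out code points that are not legal in XML 1.0 (such as the non-character U+FFFF that trips Solr's Woodstox parser here) before document fields are handed to the index writer. A minimal sketch of that kind of filter is below; the class and method names are illustrative only, not the actual patch, and the accepted ranges simply mirror the XML 1.0 `Char` production:

```java
public class StripNonXmlChars {
    // Drop code points that are invalid in XML 1.0 documents, e.g. U+FFFF.
    // Valid per the XML 1.0 Char production: #x9 | #xA | #xD | [#x20-#xD7FF]
    // | [#xE000-#xFFFD] | [#x10000-#x10FFFF].
    public static String strip(String in) {
        StringBuilder sb = new StringBuilder(in.length());
        for (int i = 0; i < in.length(); ) {
            int cp = in.codePointAt(i);
            i += Character.charCount(cp);
            boolean valid =
                cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
            if (valid) {
                sb.appendCodePoint(cp);
            }
        }
        return sb.toString();
    }
}
```

Running each string field of a NutchDocument through a filter like this before the Solr add should avoid the WstxLazyException, though on 2.x you would have to apply it yourself (for instance in the indexing filter or SolrIndexWriter) since the fix was never ported.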
-----Original message-----
> From:Kshitij Shukla <ks...@cisinlabs.com>
> Sent: Monday 25th January 2016 8:23
> To: user@nutch.apache.org
> Subject: [CIS-CMMI-3] Invalid UTF-8 character 0xffff at char exception
>
> Hello everyone,
>
> During a very large crawl when indexing to Solr this will yield the
> following exception:
>
> **************************************************
> root@cism479:/usr/share/searchEngine/apache-nutch-2.3.1/runtime/deploy/bin#
> /usr/share/searchEngine/apache-nutch-2.3.1/runtime/deploy/bin/nutch
> index -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D
> mapred.reduce.tasks.speculative.execution=false -D
> mapred.map.tasks.speculative.execution=false -D
> mapred.compress.map.output=true -D
> solr.server.url=http://localhost:8983/solr/ddcds -all -crawlId 1
> 16/01/25 11:44:52 INFO indexer.IndexingJob: IndexingJob: starting
> 16/01/25 11:44:53 INFO Configuration.deprecation:
> mapred.output.key.comparator.class is deprecated. Instead, use
> mapreduce.job.output.key.comparator.class
> 16/01/25 11:44:53 INFO plugin.PluginRepository: Plugins: looking in:
> /tmp/hadoop-root/hadoop-unjar4772724649160367470/classes/plugins
> 16/01/25 11:44:54 INFO plugin.PluginRepository: Plugin Auto-activation
> mode: [true]
> 16/01/25 11:44:54 INFO plugin.PluginRepository: Registered Plugins:
> 16/01/25 11:44:54 INFO plugin.PluginRepository: HTTP Framework
> (lib-http)
> 16/01/25 11:44:54 INFO plugin.PluginRepository: Html Parse Plug-in
> (parse-html)
> 16/01/25 11:44:54 INFO plugin.PluginRepository: MetaTags
> (parse-metatags)
> 16/01/25 11:44:54 INFO plugin.PluginRepository: Html Indexing Filter
> (index-html)
> 16/01/25 11:44:54 INFO plugin.PluginRepository: the nutch core
> extension points (nutch-extensionpoints)
> 16/01/25 11:44:54 INFO plugin.PluginRepository: Basic Indexing
> Filter (index-basic)
> 16/01/25 11:44:54 INFO plugin.PluginRepository: XML Libraries (lib-xml)
> 16/01/25 11:44:54 INFO plugin.PluginRepository: Anchor Indexing
> Filter (index-anchor)
> 16/01/25 11:44:54 INFO plugin.PluginRepository: Basic URL Normalizer
> (urlnormalizer-basic)
> 16/01/25 11:44:54 INFO plugin.PluginRepository: Language
> Identification Parser/Filter (language-identifier)
> 16/01/25 11:44:54 INFO plugin.PluginRepository: Metadata Indexing
> Filter (index-metadata)
> 16/01/25 11:44:54 INFO plugin.PluginRepository: CyberNeko HTML
> Parser (lib-nekohtml)
> 16/01/25 11:44:54 INFO plugin.PluginRepository: Subcollection
> indexing and query filter (subcollection)
> 16/01/25 11:44:54 INFO plugin.PluginRepository: SOLRIndexWriter
> (indexer-solr)
> 16/01/25 11:44:54 INFO plugin.PluginRepository: Rel-Tag microformat
> Parser/Indexer/Querier (microformats-reltag)
> 16/01/25 11:44:54 INFO plugin.PluginRepository: Http / Https
> Protocol Plug-in (protocol-httpclient)
> 16/01/25 11:44:54 INFO plugin.PluginRepository: JavaScript Parser
> (parse-js)
> 16/01/25 11:44:54 INFO plugin.PluginRepository: Tika Parser Plug-in
> (parse-tika)
> 16/01/25 11:44:54 INFO plugin.PluginRepository: Top Level Domain
> Plugin (tld)
> 16/01/25 11:44:54 INFO plugin.PluginRepository: Regex URL Filter
> Framework (lib-regex-filter)
> 16/01/25 11:44:54 INFO plugin.PluginRepository: Regex URL Normalizer
> (urlnormalizer-regex)
> 16/01/25 11:44:54 INFO plugin.PluginRepository: Link Analysis
> Scoring Plug-in (scoring-link)
> 16/01/25 11:44:54 INFO plugin.PluginRepository: OPIC Scoring Plug-in
> (scoring-opic)
> 16/01/25 11:44:54 INFO plugin.PluginRepository: More Indexing Filter
> (index-more)
> 16/01/25 11:44:54 INFO plugin.PluginRepository: Http Protocol
> Plug-in (protocol-http)
> 16/01/25 11:44:54 INFO plugin.PluginRepository: Creative Commons
> Plugins (creativecommons)
> 16/01/25 11:44:54 INFO plugin.PluginRepository: Registered Extension-Points:
> 16/01/25 11:44:54 INFO plugin.PluginRepository: Parse Filter
> (org.apache.nutch.parse.ParseFilter)
> 16/01/25 11:44:54 INFO plugin.PluginRepository: Nutch Index Cleaning
> Filter (org.apache.nutch.indexer.IndexCleaningFilter)
> 16/01/25 11:44:54 INFO plugin.PluginRepository: Nutch Content Parser
> (org.apache.nutch.parse.Parser)
> 16/01/25 11:44:54 INFO plugin.PluginRepository: Nutch URL Filter
> (org.apache.nutch.net.URLFilter)
> 16/01/25 11:44:54 INFO plugin.PluginRepository: Nutch Scoring
> (org.apache.nutch.scoring.ScoringFilter)
> 16/01/25 11:44:54 INFO plugin.PluginRepository: Nutch URL Normalizer
> (org.apache.nutch.net.URLNormalizer)
> 16/01/25 11:44:54 INFO plugin.PluginRepository: Nutch Protocol
> (org.apache.nutch.protocol.Protocol)
> 16/01/25 11:44:54 INFO plugin.PluginRepository: Nutch Index Writer
> (org.apache.nutch.indexer.IndexWriter)
> 16/01/25 11:44:54 INFO plugin.PluginRepository: Nutch Indexing
> Filter (org.apache.nutch.indexer.IndexingFilter)
> 16/01/25 11:44:54 INFO indexer.IndexingFilters: Adding
> org.apache.nutch.indexer.html.HtmlIndexingFilter
> 16/01/25 11:44:54 INFO basic.BasicIndexingFilter: Maximum title length
> for indexing set to: 100
> 16/01/25 11:44:54 INFO indexer.IndexingFilters: Adding
> org.apache.nutch.indexer.basic.BasicIndexingFilter
> 16/01/25 11:44:54 INFO anchor.AnchorIndexingFilter: Anchor deduplication
> is: off
> 16/01/25 11:45:07 INFO mapreduce.JobSubmitter: Submitting tokens for
> job: job_1453472314066_0007
> 16/01/25 11:45:08 INFO impl.YarnClientImpl: Submitted application
> application_1453472314066_0007
> 16/01/25 11:45:09 INFO mapreduce.Job: The url to track the job:
> http://cism479:8088/proxy/application_1453472314066_0007/
> 16/01/25 11:45:09 INFO mapreduce.Job: Running job: job_1453472314066_0007
> 16/01/25 11:45:29 INFO mapreduce.Job: Job job_1453472314066_0007 running
> in uber mode : false
> 16/01/25 11:45:29 INFO mapreduce.Job: map 0% reduce 0%
> 16/01/25 11:49:24 INFO mapreduce.Job: map 50% reduce 0%
> 16/01/25 11:49:29 INFO mapreduce.Job: map 0% reduce 0%
> 16/01/25 11:49:29 INFO mapreduce.Job: Task Id :
> attempt_1453472314066_0007_m_000000_0, Status : FAILED
> Error: org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
> [com.ctc.wstx.exc.WstxLazyException] Invalid UTF-8 character 0xffff at
> char #1296459, byte #1310719)
> at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:491)
> at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:197)
> at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
> at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:68)
> at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:54)
> at org.apache.nutch.indexwriter.solr.SolrIndexWriter.write(SolrIndexWriter.java:84)
> at org.apache.nutch.indexer.IndexWriters.write(IndexWriters.java:84)
> at org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:48)
> at org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:43)
> at org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.write(MapTask.java:635)
> at org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:89)
> at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.write(WrappedMapper.java:112)
> at org.apache.nutch.indexer.IndexingJob$IndexerMapper.map(IndexingJob.java:120)
> at org.apache.nutch.indexer.IndexingJob$IndexerMapper.map(IndexingJob.java:69)
> at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
> at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
> at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
> at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
>
> 16/01/25 11:52:27 INFO mapreduce.Job: map 50% reduce 0%
> 16/01/25 11:53:01 INFO mapreduce.Job: map 100% reduce 0%
> 16/01/25 11:53:01 INFO mapreduce.Job: Task Id :
> attempt_1453472314066_0007_m_000000_1, Status : FAILED
> Error: org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
> [com.ctc.wstx.exc.WstxLazyException] Invalid UTF-8 character 0xffff at
> char #1296459, byte #1310719)
> (stack trace identical to the first failed attempt)
>
> 16/01/25 11:53:02 INFO mapreduce.Job: map 50% reduce 0%
> 16/01/25 11:54:52 INFO mapreduce.Job: map 100% reduce 0%
> 16/01/25 11:54:52 INFO mapreduce.Job: Task Id :
> attempt_1453472314066_0007_m_000000_2, Status : FAILED
> Error: org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
> [com.ctc.wstx.exc.WstxLazyException] Invalid UTF-8 character 0xffff at
> char #1296459, byte #1310719)
> (stack trace identical to the first failed attempt)
>
> 16/01/25 11:54:53 INFO mapreduce.Job: map 50% reduce 0%
> 16/01/25 11:56:22 INFO mapreduce.Job: map 100% reduce 0%
> 16/01/25 11:56:23 INFO mapreduce.Job: Job job_1453472314066_0007 failed
> with state FAILED due to: Task failed task_1453472314066_0007_m_000000
> Job failed as tasks failed. failedMaps:1 failedReduces:0
>
> 16/01/25 11:56:23 INFO mapreduce.Job: Counters: 33
> File System Counters
> FILE: Number of bytes read=0
> FILE: Number of bytes written=116194
> FILE: Number of read operations=0
> FILE: Number of large read operations=0
> FILE: Number of write operations=0
> HDFS: Number of bytes read=1033
> HDFS: Number of bytes written=0
> HDFS: Number of read operations=1
> HDFS: Number of large read operations=0
> HDFS: Number of write operations=0
> Job Counters
> Failed map tasks=4
> Launched map tasks=5
> Other local map tasks=3
> Data-local map tasks=2
> Total time spent by all maps in occupied slots (ms)=3168342
> Total time spent by all reduces in occupied slots (ms)=0
> Total time spent by all map tasks (ms)=1056114
> Total vcore-seconds taken by all map tasks=1056114
> Total megabyte-seconds taken by all map tasks=3244382208
> Map-Reduce Framework
> Map input records=2762511
> Map output records=17629
> Input split bytes=1033
> Spilled Records=0
> Failed Shuffles=0
> Merged Map outputs=0
> GC time elapsed (ms)=2995
> CPU time spent (ms)=116860
> Physical memory (bytes) snapshot=1272868864
> Virtual memory (bytes) snapshot=5104431104
> Total committed heap usage (bytes)=1017118720
> IndexerJob
> DocumentCount=17629
> File Input Format Counters
> Bytes Read=0
> File Output Format Counters
> Bytes Written=0
> 16/01/25 11:56:23 ERROR indexer.IndexingJob: SolrIndexerJob:
> java.lang.RuntimeException: job failed: name=[1]Indexer, jobid=job_1453472314066_0007
> at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:120)
> at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:154)
> at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:176)
> at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:202)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
> at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:211)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:497)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
> *******************************************************