Posted to dev@nutch.apache.org by "Shubham Gupta (JIRA)" <ji...@apache.org> on 2016/09/26 04:33:20 UTC

[jira] [Comment Edited] (NUTCH-2315) UpdateDb jobs fails everytime (Nutch 2.3.1)

    [ https://issues.apache.org/jira/browse/NUTCH-2315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15521993#comment-15521993 ] 

Shubham Gupta edited comment on NUTCH-2315 at 9/26/16 4:33 AM:
---------------------------------------------------------------

Now the MalformedURLException no longer occurs, but the following exception causes the update job to fail:

Error: java.lang.RuntimeException: com.mongodb.WriteConcernException: { "serverUsed" : "host:37006" , "ok" : 1 , "n" : 0 , "updatedExisting" : false , "err" : "insertDocument :: caused by :: 17280 Btree::insert: key too large to index, failing .$_id_ 2772 { : \" “but when someone is on the s register, not the 15,000 [suspicious] people, the first few hundred on list – should we wait for them to act or sho...\" }" , "code" : 17280}
	at org.apache.gora.mapreduce.GoraRecordWriter.write(GoraRecordWriter.java:76)
	at org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.write(ReduceTask.java:558)
	at org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:89)
	at org.apache.hadoop.mapreduce.lib.reduce.WrappedReducer$Context.write(WrappedReducer.java:105)
	at org.apache.nutch.crawl.DbUpdateReducer.reduce(DbUpdateReducer.java:236)
	at org.apache.nutch.crawl.DbUpdateReducer.reduce(DbUpdateReducer.java:42)
	at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:171)
	at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:627)
	at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:389)
	at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: com.mongodb.WriteConcernException: { "serverUsed" : "host:37006" , "ok" : 1 , "n" : 0 , "updatedExisting" : false , "err" : "insertDocument :: caused by :: 17280 Btree::insert: key too large to index, _id_ 2772 { : \" “but when someone is on the s register, not the 15,000 [suspicious] people, the first few hundred on list – should we wait for them to act or sho...\" }" , "code" : 17280}
	at com.mongodb.CommandResult.getWriteException(CommandResult.java:90)
	at com.mongodb.CommandResult.getException(CommandResult.java:79)
	at com.mongodb.DBCollectionImpl.translateBulkWriteException(DBCollectionImpl.java:316)
	at com.mongodb.DBCollectionImpl.update(DBCollectionImpl.java:274)
	at com.mongodb.DBCollection.update(DBCollection.java:214)
	at com.mongodb.DBCollection.update(DBCollection.java:247)
	at org.apache.gora.mongodb.store.MongoStore.performPut(MongoStore.java:361)
	at org.apache.gora.mongodb.store.MongoStore.put(MongoStore.java:326)
	at org.apache.gora.mongodb.store.MongoStore.put(MongoStore.java:70)
	at org.apache.gora.mapreduce.GoraRecordWriter.write(GoraRecordWriter.java:67)


The _id stored in MongoDB is not the ObjectId that Mongo auto-generates but the reversed URL crawled by Nutch. MongoDB rejects index entries larger than 1024 bytes (error 17280, "key too large to index"), so an overly long key is likely what triggers this exception.
What changes should be made to avoid this?
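One workaround, sketched below, is to clamp the key before it ever reaches the Mongo index: keep short keys as-is, and replace over-long ones with a readable prefix plus a digest so distinct URLs still map to distinct keys. This is only an illustration; the class name `SafeKey` and the 1000-byte cutoff (chosen to stay under MongoDB's 1024-byte limit, which includes internal overhead) are my own, not part of Nutch or Gora.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class SafeKey {
    // MongoDB refuses to index keys larger than 1024 bytes (error 17280);
    // stay comfortably below that, since the limit includes overhead.
    private static final int MAX_KEY_BYTES = 1000;

    /** Returns the key unchanged if it fits, otherwise a truncated prefix
     *  plus a SHA-256 digest so distinct long keys remain distinct. */
    public static String clamp(String key) {
        byte[] bytes = key.getBytes(StandardCharsets.UTF_8);
        if (bytes.length <= MAX_KEY_BYTES) {
            return key;
        }
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest(bytes)) {
                hex.append(String.format("%02x", b));
            }
            // Keep a readable prefix, append the digest to preserve uniqueness.
            return key.substring(0, 200) + "#" + hex;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-256 is always available
        }
    }
}
```

Applying something like this where the reversed URL becomes the row key would prevent the WriteConcernException, at the cost of no longer being able to reconstruct the original URL from a clamped key alone.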



> UpdateDb jobs fails everytime (Nutch 2.3.1)
> -------------------------------------------
>
>                 Key: NUTCH-2315
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2315
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 2.3.1
>         Environment: I am using it with Hadoop 2.7.1 + MongoDB + YARN + Gora 0.6.1
>            Reporter: Shubham Gupta
>              Labels: newbie
>             Fix For: 2.4
>
>         Attachments: NUTCH-2315-2.3.1-1.patch
>
>
> Hey,
> Whenever I run the update job, the following error occurs:
> INFO mapreduce.Job: Task Id : attempt_1473832356852_0107_m_000000_2, Status : FAILED
> Error: java.net.MalformedURLException: no protocol: http%3A%2F%2Fwww.smh.com.au%2Fact-news%2Fcanberra-weather-warm-april-expected-after-record-breaking-march-temperatures-20160401-gnw2pg.html&title=Canberra+weather%3A+warm+April+expected+after+record+breaking+March+temperatures&source=The+Sydney+Morning+Herald&summary=Canberra+can+expect+warmer+than+average+temperatures+to+continue+for+April+after+enjoying+its+equal+second+warmest+March+on+record
> 	at java.net.URL.<init>(URL.java:586)
> 	at java.net.URL.<init>(URL.java:483)
> 	at java.net.URL.<init>(URL.java:432)
> 	at org.apache.nutch.util.TableUtil.reverseUrl(TableUtil.java:43)
> 	at org.apache.nutch.crawl.DbUpdateMapper.map(DbUpdateMapper.java:96)
> 	at org.apache.nutch.crawl.DbUpdateMapper.map(DbUpdateMapper.java:38)
> 	at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
> 	at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
> 	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
> 	at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
> 	at java.security.AccessController.doPrivileged(Native Method)
> 	at javax.security.auth.Subject.doAs(Subject.java:422)
> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
> 	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
> 16/09/15 12:44:35 INFO mapreduce.Job:  map 100% reduce 100%
> 16/09/15 12:44:36 INFO mapreduce.Job: Job job_1473832356852_0107 failed with state FAILED due to: Task failed task_1473832356852_0107_m_000000
> Job failed as tasks failed. failedMaps:1 failedReduces:0
> 16/09/15 12:44:36 INFO mapreduce.Job: Counters: 8
> 	Job Counters 
> 		Failed map tasks=4
> 		Launched map tasks=4
> 		Other local map tasks=4
> 		Total time spent by all maps in occupied slots (ms)=388304
> 		Total time spent by all reduces in occupied slots (ms)=0
> 		Total time spent by all map tasks (ms)=55472
> 		Total vcore-seconds taken by all map tasks=55472
> 		Total megabyte-seconds taken by all map tasks=198145984
> Exception in thread "main" java.lang.RuntimeException: job failed: name=[rss]update-table, jobid=job_1473832356852_0107
> 	at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:119)
> 	at org.apache.nutch.crawl.DbUpdaterJob.run(DbUpdaterJob.java:111)
> 	at org.apache.nutch.crawl.DbUpdaterJob.updateTable(DbUpdaterJob.java:140)
> 	at org.apache.nutch.crawl.DbUpdaterJob.run(DbUpdaterJob.java:174)
> 	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
> 	at org.apache.nutch.crawl.DbUpdaterJob.main(DbUpdaterJob.java:178)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 	at java.lang.reflect.Method.invoke(Method.java:606)
> 	at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
> 	at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
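A side note on the quoted MalformedURLException: the string handed to java.net.URL is percent-encoded ("http%3A%2F%2F..."), so the parser finds no scheme and reports "no protocol". A minimal sketch of decoding such strings before parsing is below; the class name `UrlFix` is illustrative, and this does not address the extra share-link parameters (`&title=...`) tacked onto that particular URL.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

public class UrlFix {
    /** Decodes a percent-encoded URL string before handing it to java.net.URL,
     *  which otherwise fails with "no protocol: http%3A%2F%2F...". */
    public static URL parse(String raw) throws MalformedURLException {
        String candidate = raw;
        // Only decode when the scheme itself is encoded; a normal URL may
        // legitimately contain %-escapes in its path or query string.
        if (candidate.startsWith("http%3A%2F%2F") || candidate.startsWith("https%3A%2F%2F")) {
            candidate = URLDecoder.decode(candidate, StandardCharsets.UTF_8);
        }
        return new URL(candidate);
    }
}
```

In the Nutch codepath above, the equivalent guard would sit before TableUtil.reverseUrl is called on the outlink, so that malformed outlinks are repaired or skipped instead of failing the whole map task.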



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)