Posted to user@nutch.apache.org by "shubham.gupta" <sh...@orkash.com> on 2016/09/15 06:58:41 UTC
UpdateDb job fails everytime
Hey,
Whenever the update job is executed, the following errors occur:
INFO mapreduce.Job: Task Id : attempt_1473832356852_0104_m_000000_2, Status : FAILED
Error: java.net.MalformedURLException: no protocol:
http%3A%2F%2Fwww.smh.com.au%2Fact-news%2Fcanberra-weather-warm-april-expected-after-record-breaking-march-temperatures-20160401-gnw2pg.html&title=Canberra+weather%3A+warm+April+expected+after+record+breaking+March+temperatures&source=The+Sydney+Morning+Herald&summary=Canberra+can+expect+warmer+than+average+temperatures+to+continue+for+April+after+enjoying+its+equal+second+warmest+March+on+record
at java.net.URL.<init>(URL.java:586)
at java.net.URL.<init>(URL.java:483)
at java.net.URL.<init>(URL.java:432)
at org.apache.nutch.util.TableUtil.reverseUrl(TableUtil.java:43)
at org.apache.nutch.crawl.DbUpdateMapper.map(DbUpdateMapper.java:96)
at org.apache.nutch.crawl.DbUpdateMapper.map(DbUpdateMapper.java:38)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Job Counters
Failed map tasks=4
Launched map tasks=4
Other local map tasks=4
Total time spent by all maps in occupied slots (ms)=417438
Total time spent by all reduces in occupied slots (ms)=0
Total time spent by all map tasks (ms)=59634
Total vcore-seconds taken by all map tasks=59634
Total megabyte-seconds taken by all map tasks=213012648
Exception in thread "main" java.lang.RuntimeException: job failed: name=[]update-table, jobid=job_1473832356852_0104
at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:119)
at org.apache.nutch.crawl.DbUpdaterJob.run(DbUpdaterJob.java:111)
at org.apache.nutch.crawl.DbUpdaterJob.updateTable(DbUpdaterJob.java:140)
at org.apache.nutch.crawl.DbUpdaterJob.run(DbUpdaterJob.java:174)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.nutch.crawl.DbUpdaterJob.main(DbUpdaterJob.java:178)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
As a result, no URLs are updated in the corresponding tables.
Please help.
Thanks in advance
--
Shubham
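[Editorial note] The "no protocol" message in the trace above can be reproduced outside Nutch. In the failing string the scheme separator is percent-encoded (`http%3A%2F%2F` instead of `http://`), so `java.net.URL` finds no `:` and therefore no protocol. The following is a minimal sketch, not Nutch code: the class name `NoProtocolDemo` and the helper `parseError` are made up for illustration, and the sample string is a shortened form of the URL from the log.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical demo class, not part of Nutch.
final class NoProtocolDemo {

    /** Returns null if the string parses as a URL, otherwise the exception message. */
    static String parseError(String candidate) {
        try {
            new URL(candidate);
            return null;
        } catch (MalformedURLException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        // Shortened form of the string from the log: ':' and '/' are percent-encoded.
        String encoded = "http%3A%2F%2Fwww.smh.com.au%2Fact-news%2F";
        // The same address with a literal scheme separator.
        String plain = "http://www.smh.com.au/act-news/";

        System.out.println("encoded: " + parseError(encoded));
        System.out.println("plain:   " + parseError(plain));
    }
}
```

The encoded form fails in the `URL` constructor before any Nutch logic runs, which matches the top frames of the trace (`URL.<init>` called from `TableUtil.reverseUrl`).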
Re: UpdateDb job fails everytime
Posted by Sebastian Nagel <wa...@googlemail.com>.
Sorry, the correct link is:
https://issues.apache.org/jira/browse/NUTCH
On 09/15/2016 01:34 PM, Sebastian Nagel wrote:
> Hi,
>
> this looks like a bug in Nutch 2.x.
>
> Please, open an issue at http://issues.apache.org/jira/NUTCH
> and add information about the exact Nutch version and the
> configuration. Invalid URLs should normally be filtered out
> or corrected by URL normalizers during the parsing step.
>
> Thanks,
> Sebastian
Re: UpdateDb job fails everytime
Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi,
this looks like a bug in Nutch 2.x.
Please, open an issue at http://issues.apache.org/jira/NUTCH
and add information about the exact Nutch version and the
configuration. Invalid URLs should normally be filtered out
or corrected by URL normalizers during the parsing step.
Thanks,
Sebastian
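[Editorial note] As a rough illustration of what "filtered out or corrected" could mean for the URL in this thread, here is a minimal sketch in plain Java. It is not wired into Nutch's URL filter or normalizer plugins, and the names `OutlinkGuard`, `parses` and `repairOrDrop` are hypothetical. It assumes the bad string is a percent-encoded URL followed by unrelated `&key=value` parameters (as in the log), so it decodes only the part before the first `&` and drops anything that still does not parse.

```java
import java.io.UnsupportedEncodingException;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLDecoder;

// Hypothetical helper, not part of Nutch.
final class OutlinkGuard {

    /** True if java.net.URL accepts the string. */
    static boolean parses(String candidate) {
        try {
            new URL(candidate);
            return true;
        } catch (MalformedURLException e) {
            return false;
        }
    }

    /**
     * Returns a string that java.net.URL can parse, or null if the
     * candidate should be dropped.
     */
    static String repairOrDrop(String raw) {
        if (raw == null) {
            return null;
        }
        if (parses(raw)) {
            return raw; // already usable, leave untouched
        }
        // Assumption: everything before the first '&' is a percent-encoded URL.
        int amp = raw.indexOf('&');
        String head = amp >= 0 ? raw.substring(0, amp) : raw;
        try {
            String decoded = URLDecoder.decode(head, "UTF-8");
            return parses(decoded) ? decoded : null;
        } catch (UnsupportedEncodingException | IllegalArgumentException e) {
            // Unknown charset or malformed percent escape: drop the candidate.
            return null;
        }
    }

    public static void main(String[] args) {
        String bad = "http%3A%2F%2Fwww.smh.com.au%2Fact-news%2Fpage.html&title=Canberra+weather";
        System.out.println(repairOrDrop(bad));
        System.out.println(repairOrDrop("not a url"));
    }
}
```

Whether to repair such strings or simply drop them is a design choice. Dropping is the safer default, because `URLDecoder` also turns `+` into a space and can change the meaning of a string that was never a URL in the first place.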