Posted to user@nutch.apache.org by "shubham.gupta" <sh...@orkash.com> on 2016/09/15 06:58:41 UTC

UpdateDb job fails every time

Hey,

Whenever the update job is executed, the following error occurs:

INFO mapreduce.Job: Task Id : attempt_1473832356852_0104_m_000000_2, 
Status : FAILED
Error: java.net.MalformedURLException: no protocol: 
http%3A%2F%2Fwww.smh.com.au%2Fact-news%2Fcanberra-weather-warm-april-expected-after-record-breaking-march-temperatures-20160401-gnw2pg.html&title=Canberra+weather%3A+warm+April+expected+after+record+breaking+March+temperatures&source=The+Sydney+Morning+Herald&summary=Canberra+can+expect+warmer+than+average+temperatures+to+continue+for+April+after+enjoying+its+equal+second+warmest+March+on+record
     at java.net.URL.<init>(URL.java:586)
     at java.net.URL.<init>(URL.java:483)
     at java.net.URL.<init>(URL.java:432)
     at org.apache.nutch.util.TableUtil.reverseUrl(TableUtil.java:43)
     at org.apache.nutch.crawl.DbUpdateMapper.map(DbUpdateMapper.java:96)
     at org.apache.nutch.crawl.DbUpdateMapper.map(DbUpdateMapper.java:38)
     at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
     at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
     at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
     at java.security.AccessController.doPrivileged(Native Method)
     at javax.security.auth.Subject.doAs(Subject.java:422)
     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
     at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)


Job Counters
         Failed map tasks=4
         Launched map tasks=4
         Other local map tasks=4
         Total time spent by all maps in occupied slots (ms)=417438
         Total time spent by all reduces in occupied slots (ms)=0
         Total time spent by all map tasks (ms)=59634
         Total vcore-seconds taken by all map tasks=59634
         Total megabyte-seconds taken by all map tasks=213012648
Exception in thread "main" java.lang.RuntimeException: job failed: name=[]update-table, jobid=job_1473832356852_0104
     at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:119)
     at org.apache.nutch.crawl.DbUpdaterJob.run(DbUpdaterJob.java:111)
     at org.apache.nutch.crawl.DbUpdaterJob.updateTable(DbUpdaterJob.java:140)
     at org.apache.nutch.crawl.DbUpdaterJob.run(DbUpdaterJob.java:174)
     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
     at org.apache.nutch.crawl.DbUpdaterJob.main(DbUpdaterJob.java:178)
     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
     at java.lang.reflect.Method.invoke(Method.java:606)
     at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
     at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
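
For what it's worth, the exception can be reproduced outside Hadoop with nothing but the JDK: java.net.URL rejects the string in the log because its scheme is percent-encoded ("http%3A%2F%2F..." instead of "http://..."), so it sees no protocol at all, and that is the same constructor TableUtil.reverseUrl ends up calling. A minimal standalone sketch (the truncated URL and the class name here are illustrative, not taken from Nutch):

import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLDecoder;

public class NoProtocolDemo {
    public static void main(String[] args) throws Exception {
        // Outlink as it appears in the log (truncated): the scheme itself is
        // percent-encoded, so java.net.URL finds no "protocol:" prefix.
        String raw = "http%3A%2F%2Fwww.smh.com.au%2Fact-news%2Fcanberra-weather"
                   + "&title=Canberra+weather&source=The+Sydney+Morning+Herald";

        try {
            new URL(raw);  // same constructor that TableUtil.reverseUrl ends up calling
        } catch (MalformedURLException e) {
            System.out.println("rejected: " + e.getMessage());  // "no protocol: http%3A%2F%2F..."
        }

        // Decoding shows what the string really is: a normal http:// URL followed by
        // share-widget parameters (title=, source=, summary=), i.e. a badly extracted
        // outlink rather than a crawlable URL.
        String decoded = URLDecoder.decode(raw, "UTF-8");
        System.out.println(decoded);
        System.out.println(new URL(decoded.split("&title=", 2)[0]));  // valid once decoded and trimmed
    }
}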

As a result, no new URLs are updated in the corresponding tables.
Please help.
Thanks in advance

-- 

Shubham


Re: UpdateDb job fails every time

Posted by Sebastian Nagel <wa...@googlemail.com>.
Sorry, the correct link is:
https://issues.apache.org/jira/browse/NUTCH

On 09/15/2016 01:34 PM, Sebastian Nagel wrote:
> Hi,
> 
> this looks like a bug in Nutch 2.x.
> 
> Please, open an issue at http://issues.apache.org/jira/NUTCH
> and add information about the exact Nutch version and the
> configuration.  Invalid URLs should normally be filtered out
> or corrected by URL normalizers during the parsing step.
> 
> Thanks,
> Sebastian


Re: UpdateDb job fails every time

Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi,

this looks like a bug in Nutch 2.x.

Please, open an issue at http://issues.apache.org/jira/NUTCH
and add information about the exact Nutch version and the
configuration.  Invalid URLs should normally be filtered out
or corrected by URL normalizers during the parsing step.
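
Until then, a defensive guard applied wherever outlinks enter the update job (for example in a urlfilter plugin, or locally before TableUtil.reverseUrl is called) would keep a single bad outlink from failing the whole map task. A minimal standalone sketch, not taken from the Nutch code base; the class and method names are hypothetical:

import java.net.MalformedURLException;
import java.net.URL;

/** Hypothetical helper: the kind of check a urlfilter plugin or a local patch
 *  could apply so that one malformed outlink does not abort the updatedb job. */
public final class OutlinkSanityCheck {

    private OutlinkSanityCheck() {}

    /** Returns the URL unchanged if java.net.URL accepts it, or null to drop it. */
    public static String filter(String url) {
        try {
            new URL(url);          // same validation TableUtil.reverseUrl relies on
            return url;
        } catch (MalformedURLException e) {
            return null;           // reject instead of letting the mapper throw
        }
    }

    public static void main(String[] args) {
        System.out.println(filter("http://www.smh.com.au/act-news/"));   // kept as-is
        System.out.println(filter("http%3A%2F%2Fwww.smh.com.au%2F"));    // null -> dropped
    }
}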

Thanks,
Sebastian
