Posted to dev@nutch.apache.org by "Hudson (JIRA)" <ji...@apache.org> on 2007/09/19 07:09:44 UTC

[jira] Commented: (NUTCH-554) Generator throws java.io.IOException and dies on injected urls with no protocol

    [ https://issues.apache.org/jira/browse/NUTCH-554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12528658 ] 

Hudson commented on NUTCH-554:
------------------------------

Integrated in Nutch-Nightly #211 (See [http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/211/])

> Generator throws java.io.IOException and dies on injected urls with no protocol 
> --------------------------------------------------------------------------------
>
>                 Key: NUTCH-554
>                 URL: https://issues.apache.org/jira/browse/NUTCH-554
>             Project: Nutch
>          Issue Type: Bug
>          Components: generator
>    Affects Versions: 1.0.0
>         Environment: Linux(debian) Java 1.6
>            Reporter: Brian Whitman
>            Assignee: Andrzej Bialecki 
>             Fix For: 1.0.0
>
>         Attachments: genpatch.diff
>
>
> On Nutch trunk, injecting URLs with no protocol (e.g. issues.apache.org/jira/ rather than https://issues.apache.org/jira/) causes the generator to fail with an IOException, triggered by a MalformedURLException:
> java.net.MalformedURLException: no protocol: www.variogr.am
>         at java.net.URL.<init>(URL.java:567)
>         at java.net.URL.<init>(URL.java:464)
>         at java.net.URL.<init>(URL.java:413)
>         at org.apache.nutch.crawl.Generator$Selector.reduce(Generator.java:187)
>         at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:326)
>         at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:155)
> 2007-09-15 11:11:26,986 FATAL crawl.Generator - Generator: java.io.IOException: Job failed!
>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
>         at org.apache.nutch.crawl.Generator.generate(Generator.java:416)
>         at org.apache.nutch.crawl.Generator.run(Generator.java:557)
>         at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:189)
>         at org.apache.nutch.crawl.Generator.main(Generator.java:520)
> To test:
> # cat test/urls.txt
> www.variogr.am
> http://www.variogr.am/
> # bin/nutch inject testcrawl/crawldb test/
> (this goes fine)
> # bin/nutch generate testcrawl/crawldb testcrawl/segments -topN 10
> Generator: Selecting best-scoring urls due for fetch.
> Generator: starting
> Generator: segment: testcrawl/segments/20070915111125
> Generator: filtering: true
> Generator: topN: 10
> Generator: jobtracker is 'local', generating exactly one partition.
> Generator: java.io.IOException: Job failed!
>  
> This issue did not exist in earlier versions of Nutch, which ignored the malformed URL without crashing.
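
The stack trace above shows the java.net.URL constructor rejecting a string that has no protocol. The following is a minimal standalone sketch of that failure mode and of one defensive way to skip such entries instead of aborting. It is not the contents of genpatch.diff and not Nutch code; the class UrlGuard and its methods isParseable and keepParseable are hypothetical names introduced only for illustration.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; not part of Nutch or of the attached patch.
class UrlGuard {

    // Returns true if java.net.URL accepts the string, false if the
    // constructor throws MalformedURLException (e.g. "no protocol").
    // Note: this constructor is deprecated in recent JDKs but is the
    // one that appears in the stack trace from the report.
    static boolean isParseable(String candidate) {
        try {
            new URL(candidate);
            return true;
        } catch (MalformedURLException e) {
            return false;
        }
    }

    // Keeps only the entries that parse, so one bad entry does not
    // abort processing of the whole list.
    static List<String> keepParseable(List<String> candidates) {
        List<String> kept = new ArrayList<>();
        for (String candidate : candidates) {
            if (isParseable(candidate)) {
                kept.add(candidate);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // The two seed entries from test/urls.txt in the report.
        List<String> seeds = Arrays.asList("www.variogr.am", "http://www.variogr.am/");
        System.out.println(keepParseable(seeds)); // prints [http://www.variogr.am/]
    }
}
```

With the two seed entries from the report, only http://www.variogr.am/ survives the check. Whether the actual fix skips, logs, or normalizes such entries is not stated in this message.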

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.