Posted to user@nutch.apache.org by dtiodtio <dt...@gmail.com> on 2009/10/07 12:21:57 UTC

URLNormalizer not found and integrating nutch programmatically

Hello,

I'm trying to integrate Nutch into our Java project, not through its
command-line interface but by calling its classes and methods directly from
our own code.

However, when I try to do a simple run of the FreeGenerator using:

myconf.set("plugin.folders","/path/to/nutch-1.0/plugins");

int res = ToolRunner.run(myconf, new FreeGenerator(), frgen_args);

where frgen_args holds a String array with my input and output, I get
(snipping for brevity):

Oct 7, 2009 12:54:33 PM org.apache.nutch.plugin.PluginRepository displayStatus
INFO: Registered Extension-Points:
Oct 7, 2009 12:54:33 PM org.apache.nutch.plugin.PluginRepository displayStatus
INFO:         NONE
Oct 7, 2009 12:54:33 PM org.apache.hadoop.mapred.LocalJobRunner$Job run
WARNING: job_local_0001
java.lang.RuntimeException: x point org.apache.nutch.net.URLNormalizer not found.
        at org.apache.nutch.net.URLNormalizers.<init>(URLNormalizers.java:122)
        at org.apache.nutch.crawl.PartitionUrlByHost.configure(PartitionUrlByHost.java:38)
        at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58)
        at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:83)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.<init>(MapTask.java:472)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:138)
Oct 7, 2009 12:54:33 PM org.apache.nutch.tools.FreeGenerator run
SEVERE: FAILED: java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1232)
        at org.apache.nutch.tools.FreeGenerator.run(FreeGenerator.java:179)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at Crawler.CrawlerLauncher.run(CrawlerLauncher.java:83)
        at Crawler.ExecCrawl.main(ExecCrawl.java:42)

This error persists even after I added the URLNormalizer-basic.jar plugin to
my libraries.
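Since the log above reports "Registered Extension-Points: NONE", one thing worth checking is whether the directory given in plugin.folders actually contains plugin descriptors. The following is only a minimal sketch, assuming each plugin sits in its own subdirectory with a plugin.xml file; the class PluginFolderCheck and its method are hypothetical helpers and not part of Nutch.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical diagnostic helper, not part of Nutch.
final class PluginFolderCheck {

    /**
     * Returns the sorted names of the immediate subdirectories of pluginRoot
     * that contain a plugin.xml file. Returns an empty list if pluginRoot
     * does not exist or is not a directory.
     */
    static List<String> findPluginDirs(File pluginRoot) {
        List<String> found = new ArrayList<>();
        File[] children = pluginRoot.listFiles(); // null if not a readable directory
        if (children == null) {
            return found;
        }
        for (File child : children) {
            if (child.isDirectory() && new File(child, "plugin.xml").isFile()) {
                found.add(child.getName());
            }
        }
        Collections.sort(found);
        return found;
    }

    public static void main(String[] args) {
        // The default path is the placeholder from the snippet above.
        File root = new File(args.length > 0 ? args[0] : "/path/to/nutch-1.0/plugins");
        List<String> plugins = findPluginDirs(root);
        System.out.println(root.getAbsolutePath() + ": " + plugins.size()
                + " plugin descriptor(s) found");
        for (String name : plugins) {
            System.out.println("  " + name);
        }
    }
}
```

If this prints zero descriptors for the path passed to plugin.folders, the path itself (for example a relative path resolved against an unexpected working directory) would be the first thing to look at.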

I suppose the problem is that I haven't explicitly defined all of the Nutch
configuration settings, which in command-line mode would presumably be read
from the conf directory of the Nutch installation.
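A quick way to test that guess is to see whether the configuration files are visible to the application at all. This is a minimal sketch, assuming the settings are looked up as classpath resources named nutch-default.xml and nutch-site.xml (the files shipped in the Nutch conf directory); ConfResourceCheck is a hypothetical helper, not a Nutch class.

```java
import java.net.URL;

// Hypothetical diagnostic helper, not part of Nutch.
final class ConfResourceCheck {

    /** Returns the classpath URL of resourceName, or null if it is not on the classpath. */
    static URL locate(String resourceName) {
        ClassLoader cl = Thread.currentThread().getContextClassLoader();
        if (cl == null) {
            cl = ConfResourceCheck.class.getClassLoader();
        }
        return cl.getResource(resourceName);
    }

    public static void main(String[] args) {
        // Assumed resource names: the files found in the Nutch conf directory.
        String[] names = {"nutch-default.xml", "nutch-site.xml"};
        for (String name : names) {
            URL url = locate(name);
            System.out.println(name + " -> " + (url == null ? "NOT on classpath" : url));
        }
    }
}
```

If both files are reported as missing, adding the conf directory to the application classpath may be enough. As far as I understand, the command-line tools build their Configuration with NutchConfiguration.create(), which reads these two files, rather than starting from an empty Configuration.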

This brings me to a more general question: are these configurations
hardwired somewhere in the Nutch source code, or should I be defining them
myself?

Thanks,
Dimitris