Posted to user@nutch.apache.org by dtiodtio <dt...@gmail.com> on 2009/10/07 12:21:57 UTC
URLNormalizer not found and integrating nutch programmatically
Hello,
I'm trying to integrate Nutch into our Java project, not through its
command-line interface but by calling its classes and methods directly from
our code.
However, when I try to do a simple run of the FreeGenerator using:
myconf.set("plugin.folders","/path/to/nutch-1.0/plugins");
int res = ToolRunner.run(myconf, new FreeGenerator(), frgen_args);
where frgen_args is a String array holding my input and output, I get
(snipped for brevity):
Oct 7, 2009 12:54:33 PM org.apache.nutch.plugin.PluginRepository displayStatus
INFO: Registered Extension-Points:
Oct 7, 2009 12:54:33 PM org.apache.nutch.plugin.PluginRepository displayStatus
INFO: NONE
Oct 7, 2009 12:54:33 PM org.apache.hadoop.mapred.LocalJobRunner$Job run
WARNING: job_local_0001
java.lang.RuntimeException: x point org.apache.nutch.net.URLNormalizer not found.
    at org.apache.nutch.net.URLNormalizers.<init>(URLNormalizers.java:122)
    at org.apache.nutch.crawl.PartitionUrlByHost.configure(PartitionUrlByHost.java:38)
    at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58)
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:83)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.<init>(MapTask.java:472)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:138)
Oct 7, 2009 12:54:33 PM org.apache.nutch.tools.FreeGenerator run
SEVERE: FAILED: java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1232)
    at org.apache.nutch.tools.FreeGenerator.run(FreeGenerator.java:179)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at Crawler.CrawlerLauncher.run(CrawlerLauncher.java:83)
    at Crawler.ExecCrawl.main(ExecCrawl.java:42)
The error persists even after I added the URLNormalizer-basic.jar plugin to
my libraries.
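Since the log reports "Registered Extension-Points: NONE", one thing worth
checking is whether the directory given in plugin.folders actually contains
plugin descriptors. The following is a minimal diagnostic sketch using only
the JDK, not Nutch's own loading code. It assumes that each plugin lives in
its own subdirectory with a plugin.xml file, and that the URLNormalizer
extension point is declared by the nutch-extensionpoints plugin; the class
and method names are hypothetical.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper: lists plugin subdirectories that carry a plugin.xml.
class PluginFolderCheck {

    /** Returns the sorted names of immediate subdirectories containing plugin.xml. */
    public static List<String> pluginsWithDescriptor(Path pluginRoot) throws IOException {
        List<String> found = new ArrayList<>();
        if (!Files.isDirectory(pluginRoot)) {
            return found; // a missing or non-directory root yields an empty list
        }
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(pluginRoot)) {
            for (Path entry : entries) {
                // Assumption: a plugin is a directory holding a plugin.xml descriptor.
                if (Files.isDirectory(entry) && Files.isRegularFile(entry.resolve("plugin.xml"))) {
                    found.add(entry.getFileName().toString());
                }
            }
        }
        Collections.sort(found);
        return found;
    }

    public static void main(String[] args) throws IOException {
        // Pass the same value that is set for plugin.folders.
        Path root = Paths.get(args.length > 0 ? args[0] : "/path/to/nutch-1.0/plugins");
        List<String> plugins = pluginsWithDescriptor(root);
        System.out.println("Plugins with a descriptor under " + root + ": " + plugins);
        // Assumption: these two directory names match the Nutch 1.0 layout.
        System.out.println("nutch-extensionpoints present: " + plugins.contains("nutch-extensionpoints"));
        System.out.println("urlnormalizer-basic present: " + plugins.contains("urlnormalizer-basic"));
    }
}
```

An empty list here would suggest that the path is wrong or unreadable from
the running process, in which case adding a plugin jar to the project
libraries would probably not help.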
I suppose the problem is that I haven't explicitly defined all the Nutch
configuration settings, which, when running from the command line, would
presumably be picked up from the conf directory of the Nutch installation.
This raises a more general question: are these settings hardwired somewhere
in the Nutch source code, or should I be defining them myself?
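A related check is whether the Nutch configuration files are visible to the
embedding application at all. The sketch below uses only the JDK and assumes
that the defaults live in nutch-default.xml and nutch-site.xml (the files
shipped in the conf directory) and that they are looked up as classpath
resources; the class and method names are hypothetical.

```java
import java.net.URL;

// Hypothetical helper: reports where a named resource resolves on the classpath.
class NutchConfigCheck {

    /** Returns the URL of the resource on the classpath, or null if it is not visible. */
    public static URL locate(String resourceName) {
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        if (loader == null) {
            loader = NutchConfigCheck.class.getClassLoader();
        }
        return loader.getResource(resourceName);
    }

    public static void main(String[] args) {
        // Assumption: these are the resource names the Nutch configuration is built from.
        String[] names = {"nutch-default.xml", "nutch-site.xml"};
        for (String name : names) {
            URL url = locate(name);
            System.out.println(name + " -> " + (url == null ? "NOT on classpath" : url.toString()));
        }
    }
}
```

If both files are reported as missing, putting the Nutch conf directory on
the project's classpath would be one thing to try before setting individual
properties by hand.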
Thanks,
Dimitris