Posted to dev@nutch.apache.org by feng lu <am...@gmail.com> on 2013/04/02 16:40:12 UTC

Re: IOException during #Crawl.run -> #JobClient.runJob()

hi,

First, make sure your configuration is correct, e.g. the plugin.folders
property in nutch-site.xml. Go through the RunNutchInEclipse guide again and
verify every config value. If the exception is thrown again, raise the log
level to DEBUG to see what is actually happening.
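For reference, the kind of nutch-site.xml fragment the guide asks for looks
roughly like this (the path below is only a placeholder; point it at the
plugins directory of your own checkout):

```xml
<!-- nutch-site.xml: tell Nutch where to find its plugins when run from Eclipse -->
<property>
  <name>plugin.folders</name>
  <!-- placeholder path; use the absolute path to your Nutch checkout's plugins dir -->
  <value>/path/to/nutch/build/plugins</value>
</property>
```

To raise the log level, setting the root logger to DEBUG in
conf/log4j.properties usually surfaces the real cause hiding behind the
generic "Job failed!" message.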

Hope that helps.


On Monday, April 1, 2013, cephtahrioh wrote:

> Hello guys, I am pretty new with nutch so bear with me. I have been
> encountering an IOException during one of my test crawls. I am using nutch
> 1.6 with Hadoop 0.20.2 (chose this version for Windows compatibility in
> setting file access rights).
>
> I am running nutch through eclipse. I followed this guide in importing
> nutch
> from an SVN: http://wiki.apache.org/nutch/RunNutchInEclipse
>
> My crawler's code is from this website:
>
> http://cmusphinx.sourceforge.net/2012/06/building-a-java-application-with-apache-nutch-and-solr/
>
> Here is the system exception log:
>
> solrUrl is not set, indexing will be skipped...
> crawl started in: crawl
> rootUrlDir = urls
> threads = 1
> depth = 1
> solrUrl=null
> topN = 1
> Injector: starting at 2013-03-31 23:51:11
> Injector: crawlDb: crawl/crawldb
> Injector: urlDir: urls
> Injector: Converting injected urls to crawl db entries.
> *java.io.IOException: Job failed!
>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1252)
>         at org.apache.nutch.crawl.Injector.inject(Injector.java:218)
>         at org.apache.nutch.crawl.Crawl.run(Crawl.java:127)
>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>         at rjpb.sp.crawler.CrawlerTest.main(CrawlerTest.java:51)*
>
> I see these calls involving paths before #Injector.inject() in Crawl.java
>
> *Path crawlDb = new Path(dir + "/crawldb");
> Path linkDb = new Path(dir + "/linkdb");
> Path segments = new Path(dir + "/segments");
> Path indexes = new Path(dir + "/indexes");
> Path index = new Path(dir + "/index");*
>
> Currently my Eclipse project does not include the folders
> crawldb, linkdb, segments... I think my problem is that I have not set up
> all the necessary files for crawling. I have only set up
> nutch-site.xml, regex-urlfilter.txt, and urls/seed.txt. Any advice on the
> matter would be of great help. Thanks!
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/IOException-during-Crawl-run-JobClient-runJob-tp4052732.html
> Sent from the Nutch - Dev mailing list archive at Nabble.com.
>


-- 
Don't Grow Old, Grow Up... :-)