Posted to user@nutch.apache.org by Milica Bogicevic <mb...@nsphere.net> on 2012/03/21 16:46:41 UTC

Nutch on Elastic MapReduce

Hi,

Is it possible to run the Nutch crawler on Elastic MapReduce?

I'm running the Nutch 1.4 crawler from Java by calling the Crawl.java main
method with the necessary arguments on my local Hadoop (0.20.2), and
everything works fine.
When I try to run the same program on Elastic MapReduce (via the AWS SDK
for Java), I get this exception:

Exception in thread "main"
org.apache.hadoop.mapred.InvalidInputException: Input path does not
exist: hdfs://domU-12-31-39-0B-00-88.compute-1.internal:9000/data/crawl/flowers/urls/seed.txt
	at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
	at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
	at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:858)
	at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:829)
	at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:777)
	at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1297)
	at org.apache.nutch.crawl.Injector.inject(Injector.java:217)
	at org.apache.nutch.crawl.Crawl.run(Crawl.java:127)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)


And I'm sure that I've uploaded the file to the required location in my bucket.
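If I read the stack trace correctly, my scheme-less seed path is being
qualified against the cluster's default filesystem (HDFS on the EMR master),
not against S3. Here's a minimal sketch of how I understand that resolution
works (the bucket name in the second example is made up); it suggests I may
need to pass a fully qualified s3n:// URI instead:

```java
import java.net.URI;

public class PathResolution {
    // Mirrors how Hadoop qualifies an input path against fs.default.name:
    // a scheme-less path inherits the default filesystem, while a fully
    // qualified URI (e.g. s3n://...) keeps its own scheme and authority.
    static String qualify(String defaultFs, String path) {
        return URI.create(defaultFs).resolve(path).toString();
    }

    public static void main(String[] args) {
        String emrDefault = "hdfs://domU-12-31-39-0B-00-88.compute-1.internal:9000/";

        // A bare path lands on the cluster's HDFS -- matching the exception above:
        System.out.println(qualify(emrDefault, "/data/crawl/flowers/urls/seed.txt"));
        // -> hdfs://domU-12-31-39-0B-00-88.compute-1.internal:9000/data/crawl/flowers/urls/seed.txt

        // A fully qualified S3 URI (hypothetical bucket name) is left intact:
        System.out.println(qualify(emrDefault, "s3n://my-bucket/urls/seed.txt"));
        // -> s3n://my-bucket/urls/seed.txt
    }
}
```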

Also, I'm confused: Injector.java initializes its configuration using
JobConf, which is deprecated in Hadoop 0.20.2. So which Hadoop version is
most appropriate for Nutch 1.4?


Do you have any suggestions for me?

(Just to mention: I use the same Hadoop version (0.20.2) for other
Elastic MapReduce tasks and everything works.)


Thanks in advance,

Milica