Posted to user@nutch.apache.org by Milica Bogicevic <mb...@nsphere.net> on 2012/03/22 15:15:14 UTC
Amazon S3 and EC2
Hi,
I'm trying to save crawled data on S3.
I am using Nutch 1.4 with Hadoop 0.20.2, and everything works fine on my
local machine. When I run the same crawl on EC2 using EMR, storing the
data on S3, I get the following exception:
Exception in thread "main"
org.apache.hadoop.mapred.InvalidInputException: Input path does not
exist: hdfs://domU-12-31-39-0B-00-88.compute-1.internal:9000/data/crawl/flowers/urls/seed.txt
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:858)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:829)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:777)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1297)
at org.apache.nutch.crawl.Injector.inject(Injector.java:217)
at org.apache.nutch.crawl.Crawl.run(Crawl.java:127)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
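The trace shows the seed path being resolved against HDFS (hdfs://...:9000/...), which usually means the input path was passed without a URI scheme, so Hadoop fell back to fs.default.name. A minimal sketch of a likely fix on Hadoop 0.20.x, assuming the S3 native filesystem (s3n) is used: put the credentials in core-site.xml (the key values below are placeholders):

```xml
<!-- core-site.xml: credentials for the S3 native filesystem (s3n).
     Property names are the Hadoop 0.20.x ones; values are placeholders. -->
<property>
  <name>fs.s3n.awsAccessKeyId</name>
  <value>YOUR_ACCESS_KEY</value>
</property>
<property>
  <name>fs.s3n.awsSecretAccessKey</name>
  <value>YOUR_SECRET_KEY</value>
</property>
```

With that in place, the crawl can be pointed at fully qualified URIs (the bucket name here is hypothetical), e.g. `bin/nutch crawl s3n://my-bucket/data/crawl/flowers/urls -dir s3n://my-bucket/data/crawl/flowers -depth 3`, so the Injector reads the seed list from S3 instead of the cluster's HDFS.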
I'm sure that I've set up the input path correctly.
If you have any ideas, they will be more than welcome.
Or, if I succeed in my attempts, I'll let you know.
Milica