Posted to user@nutch.apache.org by Milica Bogicevic <mb...@nsphere.net> on 2012/03/22 15:15:14 UTC
Amazon S3 and EC2
Hi,
I'm trying to save crawled data on S3.
I am using Nutch 1.4 with Hadoop 0.20.2, and everything works fine on my
local machine. When I run the same crawl on EC2 using EMR, storing the
data on S3, I get the following exception:
Exception in thread "main"
org.apache.hadoop.mapred.InvalidInputException: Input path does not
exist: hdfs://domU-12-31-39-0B-00-88.compute-1.internal:9000/data/crawl/flowers/urls/seed.txt
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:858)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:829)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:777)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1297)
at org.apache.nutch.crawl.Injector.inject(Injector.java:217)
at org.apache.nutch.crawl.Crawl.run(Crawl.java:127)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
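The trace shows the seed path being resolved against HDFS (hdfs://...:9000/...), which usually means the input path was passed without a URI scheme, so Hadoop fell back to fs.default.name. A minimal sketch of a likely fix on Hadoop 0.20.x, assuming the S3 native filesystem (s3n) is used: put the credentials in core-site.xml (the key values below are placeholders):

```xml
<!-- core-site.xml: credentials for the S3 native filesystem (s3n).
     Property names are the Hadoop 0.20.x ones; values are placeholders. -->
<property>
  <name>fs.s3n.awsAccessKeyId</name>
  <value>YOUR_ACCESS_KEY</value>
</property>
<property>
  <name>fs.s3n.awsSecretAccessKey</name>
  <value>YOUR_SECRET_KEY</value>
</property>
```

With that in place, the crawl can be pointed at fully qualified URIs (the bucket name here is hypothetical), e.g. `bin/nutch crawl s3n://my-bucket/data/crawl/flowers/urls -dir s3n://my-bucket/data/crawl/flowers -depth 3`, so the Injector reads the seed list from S3 instead of the cluster's HDFS.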
I'm sure that I've set up the input path correctly.
If you have any ideas, they will be more than welcome.
Or, if I succeed in my attempts, I'll let you know.
Milica