Posted to user@nutch.apache.org by Christian <ch...@gmail.com> on 2015/10/22 17:12:52 UTC

Does Nutch 1.7 support working only with S3 buckets?

I am trying to reproduce the Common Crawl infrastructure for crawling just
a *few sites* on a *weekly/nightly* basis using AWS, EMR and S3. I am using
the Common Crawl fork of Nutch at https://github.com/Aloisius/nutch (cc
branch).

I use the Crawl job and pass S3 bucket paths for everything; the exact
invocation is shown below. The inject and fetch steps work perfectly, but it
fails at the ParseSegment step (see the stack trace below). I have tried the
s3, s3n and s3a schemes.
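
Outside of Nutch, the s3a wiring itself can be sanity-checked by resolving a
FileSystem directly from an s3a:// path, roughly like this (just a sketch,
assuming the hadoop-aws jar with S3AFileSystem is on the classpath; the keys
are elided and would normally come from core-site.xml):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class S3aResolveCheck {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // S3A credentials, elided here; normally they are set in core-site.xml.
    conf.set("fs.s3a.access.key", "...");
    conf.set("fs.s3a.secret.key", "...");
    Path segments = new Path("s3a://some-bucket/crawl/segments");
    // Resolving the filesystem from the path (rather than from fs.defaultFS)
    // should return an S3AFileSystem, not the cluster's DistributedFileSystem.
    FileSystem fs = segments.getFileSystem(conf);
    System.out.println(fs.getUri() + " -> " + fs.getClass().getName());
  }
}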

org.apache.nutch.crawl.Crawl s3a://some-bucket/urls -dir s3a://some-bucket/crawl -depth 2 -topN 5
  |
  V
org.apache.nutch.parse.ParseSegment s3a://some-bucket/crawl/segments/20151022105922

Exception in thread "main" java.lang.IllegalArgumentException: Wrong FS: s3a://some-bucket/crawl/segments/20151022105922/crawl_parse, expected: hdfs://ip-152-71-19-40.eu-west-1.compute.internal:8020
	at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:649)
	at org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:193)
	at org.apache.hadoop.hdfs.DistributedFileSystem.access$000(DistributedFileSystem.java:105)
	at org.apache.hadoop.hdfs.DistributedFileSystem$18.doCall(DistributedFileSystem.java:1118)
	at org.apache.hadoop.hdfs.DistributedFileSystem$18.doCall(DistributedFileSystem.java:1114)
	at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
	at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1114)
	at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1404)
	at org.apache.nutch.parse.ParseOutputFormat.checkOutputSpecs(ParseOutputFormat.java:88)
	at org.apache.hadoop.mapreduce.JobSubmitter.checkSpecs(JobSubmitter.java:564)
	at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:432)
	at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1296)
	at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1293)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
	at org.apache.hadoop.mapreduce.Job.submit(Job.java:1293)
	at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:562)
	at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:557)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
	at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:557)
	at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:548)
	at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:833)
	at org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:224)
	at org.apache.nutch.parse.ParseSegment.run(ParseSegment.java:258)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
	at org.apache.nutch.parse.ParseSegment.main(ParseSegment.java:231)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
	at org.apache.hadoop.util.RunJar.main(RunJar.java:136)


Is working only with S3 buckets fully supported by all the steps of the
crawler? Any clue about what the problem is?
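
My own guess from the trace is that ParseOutputFormat.checkOutputSpecs runs its
exists() check against the filesystem handed in by the job client (the cluster
default, i.e. HDFS on EMR) instead of resolving the filesystem from the output
path, so any non-default scheme trips the "Wrong FS" check. In sketch form
(this is my reading of the trace, not the actual Nutch source):

import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;

public class CheckOutputSpecsSketch {

  // What I suspect happens: the exists() check uses the default (HDFS)
  // filesystem, so an s3a:// output path fails checkPath() with "Wrong FS".
  static void defaultFsCheck(FileSystem defaultFs, JobConf job) throws IOException {
    Path out = FileOutputFormat.getOutputPath(job);      // s3a://some-bucket/crawl/segments/...
    if (defaultFs.exists(new Path(out, "crawl_parse")))  // defaultFs is hdfs://... here
      throw new IOException("Segment already parsed!");
  }

  // Resolving the filesystem from the output path itself would avoid the
  // mismatch, since getFileSystem() returns an S3AFileSystem for s3a:// paths.
  static void pathAwareCheck(JobConf job) throws IOException {
    Path out = FileOutputFormat.getOutputPath(job);
    FileSystem outFs = out.getFileSystem(job);
    if (outFs.exists(new Path(out, "crawl_parse")))
      throw new IOException("Segment already parsed!");
  }
}

Does that reading make sense, or am I missing some configuration?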

I guess I could use HDFS for the segments, but I would like to avoid that, as
I want to terminate the cluster as soon as the crawl finishes and keep the
data in S3 for future crawls.

Thank you so much,
Christian Perez-Llamas