Posted to user@nutch.apache.org by Christian <ch...@gmail.com> on 2015/10/22 17:12:52 UTC
Does Nutch 1.7 support working with S3 buckets only?
I am trying to reproduce the Common Crawl infrastructure for crawling just
a *few sites* on a *weekly/nightly* basis using AWS EMR and S3. I am using
the Common Crawl fork of Nutch at https://github.com/Aloisius/nutch (cc
branch).
I use the Crawl job and pass S3 bucket paths for everything. The inject and
fetch steps work perfectly, but it fails in the ParseSegment step (see the
following stack trace). I have tried the s3, s3n and s3a schemes.
org.apache.nutch.crawl.Crawl s3a://some-bucket/urls -dir
s3a://some-bucket/crawl -depth 2 -topN 5
|
V
org.apache.nutch.parse.ParseSegment
s3a://some-bucket/crawl/segments/20151022105922
Exception in thread "main" java.lang.IllegalArgumentException: Wrong
FS: s3a://some-bucket/crawl/segments/20151022105922/crawl_parse,
expected: hdfs://ip-152-71-19-40.eu-west-1.compute.internal:8020
at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:649)
at org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:193)
at org.apache.hadoop.hdfs.DistributedFileSystem.access$000(DistributedFileSystem.java:105)
at org.apache.hadoop.hdfs.DistributedFileSystem$18.doCall(DistributedFileSystem.java:1118)
at org.apache.hadoop.hdfs.DistributedFileSystem$18.doCall(DistributedFileSystem.java:1114)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1114)
at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1404)
at org.apache.nutch.parse.ParseOutputFormat.checkOutputSpecs(ParseOutputFormat.java:88)
at org.apache.hadoop.mapreduce.JobSubmitter.checkSpecs(JobSubmitter.java:564)
at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:432)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1296)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1293)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:1293)
at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:562)
at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:557)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:557)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:548)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:833)
at org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:224)
at org.apache.nutch.parse.ParseSegment.run(ParseSegment.java:258)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.nutch.parse.ParseSegment.main(ParseSegment.java:231)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
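Reading the trace, my guess is that ParseOutputFormat.checkOutputSpecs calls
FileSystem.exists against the cluster's default filesystem (HDFS) instead of
resolving the filesystem from the output path itself, so the s3a:// path fails
Hadoop's checkPath. Here is a minimal self-contained sketch of that
scheme/authority comparison (my own model of the behavior, not Hadoop's actual
code):

```java
import java.net.URI;

/**
 * Minimal model of the "Wrong FS" failure: the job resolves the
 * cluster's default filesystem (hdfs://...) and then verifies every
 * output path against that filesystem's scheme and authority, so an
 * s3a:// path is rejected. Illustrative sketch only; Hadoop's real
 * FileSystem.checkPath does more than this.
 */
public class WrongFsDemo {

    // Throws if the path does not belong to the given filesystem root,
    // mirroring the IllegalArgumentException in the stack trace above.
    static void checkPath(URI fsRoot, URI path) {
        boolean sameScheme = fsRoot.getScheme().equalsIgnoreCase(path.getScheme());
        boolean sameAuthority = fsRoot.getAuthority() != null
                && fsRoot.getAuthority().equalsIgnoreCase(path.getAuthority());
        if (!sameScheme || !sameAuthority) {
            throw new IllegalArgumentException(
                "Wrong FS: " + path + ", expected: " + fsRoot);
        }
    }

    public static void main(String[] args) {
        URI defaultFs = URI.create(
            "hdfs://ip-152-71-19-40.eu-west-1.compute.internal:8020");
        URI segment = URI.create(
            "s3a://some-bucket/crawl/segments/20151022105922/crawl_parse");
        try {
            checkPath(defaultFs, segment);
        } catch (IllegalArgumentException e) {
            // Reproduces the error message shape from the trace.
            System.out.println(e.getMessage());
        }
    }
}
```

If that is right, the fix on the Nutch side would be to obtain the filesystem
from the output path (path.getFileSystem(conf)) rather than the default one,
but I may be misreading the code.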
Is working with S3 buckets only fully supported by all the steps of
the crawler? Any clue as to what the problem is?
I guess I could use HDFS for the segments, but I would like to avoid that, as
I want to terminate the cluster as soon as the crawl finishes and keep the
data in S3 for future crawls.
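For completeness, the workaround I would rather avoid looks roughly like this
(bucket name and HDFS paths are placeholders), shuttling the crawl directory
between the transient cluster and S3 with distcp:

```shell
# Hypothetical workaround: crawl into HDFS, then persist the data to S3
# with distcp before terminating the EMR cluster.
hadoop distcp hdfs:///user/hadoop/crawl s3a://some-bucket/crawl

# Before the next scheduled crawl, restore it onto the fresh cluster:
hadoop distcp s3a://some-bucket/crawl hdfs:///user/hadoop/crawl
```

That adds a copy step in both directions on every run, which is exactly the
overhead I was hoping crawling directly against S3 would remove.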
Thank you so much,
Christian Perez-Llamas