Posted to user@nutch.apache.org by Ali S Kureishy <sa...@gmail.com> on 2012/02/22 13:42:36 UTC

Error running Nutch 1.4 crawl on Amazon EMR using the S3 (s3n://) filesystem

Hi,

[This might be more relevant to Amazon's EMR support; however, I'm posting
it here as I'm not sure whether the issue is on the EMR side or the Nutch
side.]

I'm trying to run a Nutch crawl (v1.4) on Amazon's EMR (Elastic Map
Reduce). I've set up the configuration parameters for the task as follows:

*Job jar:* s3n://mybucket/engine/job/nutch-1.4.job
*Arguments: *org.apache.nutch.crawl.Crawl s3n://mybucket/engine/seedurls/
-dir s3n://mybucket/engine/crawls

The job eventually fails with the following exception in the stderr log.

Exception in thread "main" java.lang.IllegalArgumentException: This
file system object (hdfs://10.2.21.205:9000) does not support access
to the request path 's3n://mybucket/engine/crawls/crawldb/current' You
possibly called FileSystem.get(conf) when you should have called
FileSystem.get(uri, conf) to obtain a file system supporting your
path.
	at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:372)
	at org.apache.hadoop.hdfs.DistributedFileSystem.checkPath(DistributedFileSystem.java:106)
	at org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:162)
	at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:530)
	at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:709)
	at org.apache.nutch.crawl.CrawlDb.createJob(CrawlDb.java:129)
	at org.apache.nutch.crawl.Injector.inject(Injector.java:223)
	at org.apache.nutch.crawl.Crawl.run(Crawl.java:127)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
	at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at org.apache.hadoop.util.RunJar.main(RunJar.java:156)

I've read in a few places that previous versions of Nutch had a bug in
handling filesystems other than HDFS (such as s3n://). Is that still an
issue in Nutch 1.4? If so, what is the workaround for running a Nutch job
on Amazon's EMR? (Not specifying the S3 filesystem would mean the HDFS
output vanishes once the EMR task completes.) And if it has been fixed,
what do you think might be causing the issue above?
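For what it's worth, the failure is visible from the URI schemes alone:
FileSystem.get(conf) hands back the cluster's default filesystem (the
hdfs:// one), and that object refuses any path whose scheme differs from
its own. A minimal, Hadoop-free sketch of the check that throws (the
host and bucket names are just the ones from the trace, used as
placeholders):

```java
import java.net.URI;

public class CheckPathDemo {
    // Mirrors the gist of Hadoop's FileSystem.checkPath(): a filesystem
    // object rejects paths whose scheme differs from its own.
    static void checkPath(URI fsUri, URI path) {
        String scheme = path.getScheme();
        if (scheme != null && !scheme.equalsIgnoreCase(fsUri.getScheme())) {
            throw new IllegalArgumentException("This file system object ("
                    + fsUri + ") does not support access to the request path '"
                    + path + "'");
        }
    }

    public static void main(String[] args) {
        URI defaultFs = URI.create("hdfs://10.2.21.205:9000");
        // A scheme-less path resolves against the default FS: no error.
        checkPath(defaultFs, URI.create("/user/hadoop/crawls"));
        try {
            checkPath(defaultFs, URI.create("s3n://mybucket/engine/crawls"));
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

This is only an analogue of the check, not Hadoop's actual code, but it
matches the message in the stack trace above.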

Thanks,
Safdar

Re: Error running Nutch 1.4 crawl on Amazon EMR using the S3 (s3n://) filesystem

Posted by Oakage <hn...@uw.edu>.
I tried doing as you suggested, but it doesn't work. JobConf doesn't have
a constructor that takes a URI, so instead I changed the FileSystem.get
call two or three lines below, but the problem still persists in CrawlDb's
install method, with the following:

Exception in thread "main" java.lang.IllegalArgumentException: This file
system object (hdfs://10.240.55.134:9000) does not support access to the
request path 's3n://nutchdeploy/crawls/crawldb/current' You possibly called
FileSystem.get(conf) when you should have called FileSystem.get(uri, conf)
to obtain a file system supporting your path.
	at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:375)
	at org.apache.hadoop.hdfs.DistributedFileSystem.checkPath(DistributedFileSystem.java:106)
	at org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:162)
	at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:530)
	at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:712)
	at org.apache.nutch.crawl.CrawlDb.install(CrawlDb.java:155)
	at org.apache.nutch.crawl.Injector.inject(Injector.java:227)
	at org.apache.nutch.crawl.Crawl.run(Crawl.java:127)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
	at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at org.apache.hadoop.util.RunJar.main(RunJar.java:156)

I'm a beginner programmer, so going any further than this to fix the bug
is beyond me.
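For reference, the pattern the exception message points at is to resolve
the filesystem from the path's own URI rather than from the configured
default, i.e. path.getFileSystem(conf) or FileSystem.get(uri, conf) in
Hadoop. The selection logic can be sketched without Hadoop roughly like
this (the scheme-to-implementation map is only a stand-in for Hadoop's
fs.*.impl registry, and the class names are just labels):

```java
import java.net.URI;
import java.util.Map;

public class FsResolver {
    // Stand-in for Hadoop's scheme -> FileSystem registry
    // (fs.hdfs.impl, fs.s3n.impl, ...); values are labels, not real classes.
    static final Map<String, String> REGISTRY = Map.of(
            "hdfs", "DistributedFileSystem",
            "s3n", "NativeS3FileSystem");

    // Roughly what path.getFileSystem(conf) does: use the path's own
    // scheme when present, otherwise fall back to the default FS scheme.
    static String resolve(URI path, URI defaultFs) {
        String scheme = path.getScheme() != null ? path.getScheme()
                                                 : defaultFs.getScheme();
        return REGISTRY.get(scheme);
    }

    public static void main(String[] args) {
        URI defaultFs = URI.create("hdfs://10.240.55.134:9000");
        System.out.println(resolve(URI.create("s3n://nutchdeploy/crawls"), defaultFs));
        System.out.println(resolve(URI.create("/tmp/crawls"), defaultFs));
    }
}
```

The actual fix inside CrawlDb would be the Hadoop equivalent of this:
derive the FileSystem from the crawldb Path instead of calling the
one-argument FileSystem.get(conf).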


--
View this message in context: http://lucene.472066.n3.nabble.com/Error-running-Nutch-1-4-crawl-on-Amazon-EMR-using-the-S3-s3n-filesystem-tp3766352p3996102.html
Sent from the Nutch - User mailing list archive at Nabble.com.

Re: Error running Nutch 1.4 crawl on Amazon EMR using the S3 (s3n://) filesystem

Posted by Lewis John Mcgibbney <le...@gmail.com>.
So maybe try hacking CrawlDb#createJob() so that when you create the new
NutchJob object you pass in the URI parameter, as suggested in the thrown
stack trace:

124 JobConf job = new NutchJob(uri, config);

Please get back to us with the results. I've not been using anything like
Amazon EMR and would be really interested to find out if this solves it.
Fingers crossed.
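The distinction the exception message draws can also be sketched without
Hadoop at all: the one-argument FileSystem.get(conf) always resolves the
configured default filesystem, while the two-argument
FileSystem.get(uri, conf) lets the path's own scheme win. A rough,
Hadoop-free analogue (plain scheme strings stand in for FileSystem
instances):

```java
import java.net.URI;

public class GetVariantsDemo {
    // Analogue of FileSystem.get(conf): you always get the default FS,
    // regardless of where the output path actually lives.
    static String get(String defaultFsScheme) {
        return defaultFsScheme;
    }

    // Analogue of FileSystem.get(uri, conf): the URI's scheme takes
    // precedence when it is present.
    static String get(URI uri, String defaultFsScheme) {
        return uri.getScheme() != null ? uri.getScheme() : defaultFsScheme;
    }

    public static void main(String[] args) {
        URI out = URI.create("s3n://mybucket/engine/crawls");
        System.out.println(get("hdfs"));       // default FS wins: hdfs
        System.out.println(get(out, "hdfs"));  // path scheme wins: s3n
    }
}
```

So any place in the Nutch code that calls the one-argument form on an
s3n:// output path will hit the checkPath error above.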



-- 
*Lewis*