Posted to user@nutch.apache.org by John Thornton <po...@john.thornton.name> on 2018/03/16 12:45:59 UTC

Fetcher error when running on Amazon EMR with S3

Hello,

I'm currently running Nutch on Amazon EMR 5.12.0 with Hadoop 2.8.3, using
S3 (EMRFS) as the filesystem.  If I build the latest version from the
master branch and run a crawl in distributed mode, I get a fetcher error
like:

    fetcher.Fetcher: Fetcher: java.lang.IllegalArgumentException: Wrong FS: s3:..., expected: hdfs://...
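
For context: in Hadoop, a "Wrong FS" exception of this shape is typically
raised when a path on one filesystem (here s3) is handed to a FileSystem
object obtained for another (here the cluster default, hdfs), for example
via FileSystem.get(conf) instead of path.getFileSystem(conf).  The sketch
below is a simplified stand-alone model of that scheme check, not Nutch
code and not the actual NUTCH-2494 patch.  The names WrongFsSketch,
SketchFs, defaultFs and fsForPath, the host "namenode" and the bucket
"example-bucket" are all hypothetical, and the real Hadoop check also
compares the URI authority.

```java
import java.net.URI;

public class WrongFsSketch {

    /** Simplified stand-in for a Hadoop FileSystem bound to one URI scheme. */
    static final class SketchFs {
        final String scheme;

        SketchFs(String scheme) {
            this.scheme = scheme;
        }

        /** Rejects paths whose scheme belongs to a different filesystem. */
        void checkPath(URI path) {
            String pathScheme = path.getScheme();
            if (pathScheme != null && !pathScheme.equalsIgnoreCase(scheme)) {
                throw new IllegalArgumentException(
                    "Wrong FS: " + path + ", expected: " + scheme + "://");
            }
        }
    }

    /** Analogous to FileSystem.get(conf): always the cluster default. */
    static SketchFs defaultFs(URI defaultUri) {
        return new SketchFs(defaultUri.getScheme());
    }

    /** Analogous to path.getFileSystem(conf): chosen from the path's own scheme. */
    static SketchFs fsForPath(URI path, URI defaultUri) {
        String pathScheme = path.getScheme();
        return new SketchFs(pathScheme != null ? pathScheme : defaultUri.getScheme());
    }

    public static void main(String[] args) {
        URI defaultUri = URI.create("hdfs://namenode:8020/");            // hypothetical default FS
        URI segment = URI.create("s3://example-bucket/crawl/segments");  // hypothetical S3 path

        // Resolving the filesystem from the path itself passes the check.
        fsForPath(segment, defaultUri).checkPath(segment);

        // Using the default filesystem for an S3 path fails the check.
        try {
            defaultFs(defaultUri).checkPath(segment);
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

If the same mismatch is behind the error on master, the stack trace should
show which call site obtains the default filesystem for an S3 path.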

This problem was reported in NUTCH-2494 and fixed in PR-274, and indeed when
I run the same crawl using a build of commit 87c7a2e, it works without
error.  So my question is: has a regression been introduced, or am I missing
something?

Regards,

John

Re: Fetcher error when running on Amazon EMR with S3

Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi John,

The recent master has seen an upgrade to the new MapReduce API (NUTCH-2375).
It was a huge change and is already known to have introduced some issues.
For production, it's recommended to use 1.14 and patch it if necessary.

Could you open a new issue on
      https://issues.apache.org/jira/projects/NUTCH
and provide the detailed stack trace there?

Thanks,
Sebastian
