Posted to user@nutch.apache.org by "S.L" <si...@gmail.com> on 2014/04/29 16:57:53 UTC

Disable the Link Inversion Phase - Number of Reduce Tasks.

Hi All,

I am running Nutch on a single-node Hadoop cluster. I do not use an
indexing URL, and I have disabled the LinkInversion phase because I do not
need any scores attached to any URL.

My question is: if the LinkInversion phase is the only phase in Nutch that
requires a reduce task to be run, then since I have disabled it in the
Crawl.java class, can I go ahead and set the number of reduce tasks to
zero at Hadoop job submission, or is there another phase that still
requires reduce tasks?
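
In other words, something like this at submission time (an untested sketch
using the classic mapred API; MapOnlySketch is just a made-up name, not
anything in Nutch):

    import java.io.IOException;

    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class MapOnlySketch {
      // With zero reduce tasks Hadoop skips the sort/shuffle entirely and
      // writes each mapper's output directly to the job's output directory.
      public static void submit(JobConf job) throws IOException {
        job.setNumReduceTasks(0);
        JobClient.runJob(job);
      }
    }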

Re: Disable the Link Inversion Phase - Number of Reduce Tasks.

Posted by Bin Wang <bi...@gmail.com>.
Here is my understanding:

Nutch uses MapReduce everywhere. Looking at the source code of Nutch, even
1.x (1.8 in this case), just in the nutch/crawl folder, here are the files
that import hadoop.mapred:

$ grep 'hadoop.mapred' * | awk 'BEGIN{FS=":"}{print $1}' | sort | uniq
CrawlDb.java
CrawlDbFilter.java
CrawlDbMerger.java
CrawlDbReader.java
CrawlDbReducer.java
DeduplicationJob.java
Generator.java
Injector.java
LinkDb.java
LinkDbFilter.java
LinkDbMerger.java
LinkDbReader.java
URLPartitioner.java
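
(As an aside, grep -l 'hadoop.mapred' *.java would produce the same list
directly, since -l prints each matching file name only once.)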

And for example, in CrawlDb.java, the code looks like this:

 public void update(Path crawlDb, ...) throws IOException {

    FileSystem fs = FileSystem.get(getConf());
    ...
    JobConf job = CrawlDb.createJob(getConf(), crawlDb);
    ...

Based on my understanding, it is reading the Hadoop system configuration
and telling the job, hey, here are all the nodes that you can use...

And also, there is a reducer in that job, CrawlDbReducer, whose Javadoc
says it exists to "Merge new page entries with existing entries."
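
To illustrate that pattern, here is a generic sketch of an old-style
mapred reducer that merges all values for a key (hypothetical code with
plain Text types, not Nutch's actual CrawlDbReducer, which works on
CrawlDatum records):

    import java.io.IOException;
    import java.util.Iterator;

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    public class MergeSketchReducer extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {

      public void reduce(Text url, Iterator<Text> values,
          OutputCollector<Text, Text> output, Reporter reporter)
          throws IOException {
        // The shuffle groups every entry for the same URL together, so the
        // existing db entry and any new entries all arrive in one call.
        Text merged = null;
        while (values.hasNext()) {
          // Hadoop reuses the value object, so copy anything we keep.
          merged = new Text(values.next());
        }
        if (merged != null) {
          output.collect(url, merged); // one merged record per URL
        }
      }
    }

That grouping by key is exactly what requires a reduce phase: with zero
reducers, the merge step would simply never run.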


In conclusion, there are several steps, and they are all implemented using
MapReduce.


Correct me if I'm wrong.

Bin

On Tue, Apr 29, 2014 at 8:57 AM, S.L <si...@gmail.com> wrote:

> Hi All,
>
> I am running Nutch on a single-node Hadoop cluster. I do not use an
> indexing URL, and I have disabled the LinkInversion phase because I do
> not need any scores attached to any URL.
>
> My question is: if the LinkInversion phase is the only phase in Nutch
> that requires a reduce task to be run, then since I have disabled it in
> the Crawl.java class, can I go ahead and set the number of reduce tasks
> to zero at Hadoop job submission, or is there another phase that still
> requires reduce tasks?
>