Posted to dev@nutch.apache.org by Tejas Patil <te...@gmail.com> on 2014/01/04 09:00:47 UTC

Inject operation: can't it be done in a single map-reduce job ?

Hi nutch-dev,

I am looking at the Injector code in trunk and I see that we currently
launch two map-reduce jobs for it:
1. sort job: read the urls from the seeds file, emit CrawlDatum objects.
2. merge job: read CrawlDatum objects from both the crawldb and the output
of the sort job, merge them, and emit the final CrawlDatum objects.

I realized that by using MultipleInputs, we can read CrawlDatum objects
from the crawldb and urls from the seeds file simultaneously, and perform
the inject in a single map-reduce job. PFA Injector2.java, which is an
implementation of this approach. I did some basic testing on it and so far
I have not encountered any problems.
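To make the idea concrete, here is a minimal standalone sketch of the
merge semantics the single job would implement. It is only a toy model:
the class and method names are mine, plain maps stand in for the two
mapper outputs (in the real job, MultipleInputs would route the seeds
text file through one mapper and the crawldb SequenceFile through
another), and the "existing entry wins" rule is an assumption about the
reducer, not Nutch's exact code.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of the single-job inject: records from the seeds file and
// from the existing crawldb meet in the same reducer, grouped by URL.
public class InjectMergeSketch {

    // Assumed reducer rule: if the URL already has a crawldb datum, keep
    // it; otherwise inject the seed URL with an "unfetched" status.
    static Map<String, String> merge(Map<String, String> crawlDb,
                                     Map<String, String> seeds) {
        Map<String, String> out = new HashMap<>(crawlDb);
        for (String url : seeds.keySet()) {
            out.putIfAbsent(url, "db_unfetched");
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> db = new HashMap<>();
        db.put("http://a.example/", "db_fetched");

        Map<String, String> seeds = new HashMap<>();
        seeds.put("http://a.example/", "db_unfetched");
        seeds.put("http://b.example/", "db_unfetched");

        Map<String, String> merged = merge(db, seeds);
        System.out.println(merged.get("http://a.example/")); // db_fetched
        System.out.println(merged.get("http://b.example/")); // db_unfetched
    }
}
```

Since both inputs arrive keyed by URL in one shuffle, there is no need
for the intermediate output that the current sort job writes to disk.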

I am not sure why Injector was not written this way, since it is more
efficient than the implementation currently in trunk (maybe MultipleInputs
was added to Hadoop later). I am wondering if I am wrong somewhere in my
understanding. Any comments on this?

Thanks,
Tejas