Posted to dev@nutch.apache.org by "Julien Nioche (JIRA)" <ji...@apache.org> on 2014/05/12 17:31:16 UTC
[jira] [Created] (NUTCH-1772) Injector does not need merging if no
pre-existing crawldb
Julien Nioche created NUTCH-1772:
------------------------------------
Summary: Injector does not need merging if no pre-existing crawldb
Key: NUTCH-1772
URL: https://issues.apache.org/jira/browse/NUTCH-1772
Project: Nutch
Issue Type: Improvement
Components: injector
Affects Versions: 1.8
Reporter: Julien Nioche
The injector currently works as follows:
* MapReduce job 1 - Mapper : converts input lines into CrawlDatum objects, with normalisation and filtering
* MapReduce job 1 - Reducer : identity reducer. The output can still contain duplicates at this stage
* MapReduce job 2 - Mapper : CrawlDbFilter on the existing crawldb (if any) + the output of the previous job
* MapReduce job 2 - Reducer : deduplication
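To illustrate why duplicates survive job 1 and only disappear in job 2, here is a minimal in-memory sketch of the four steps above. It does not use Hadoop or any Nutch class: the class name InjectorFlowSketch, the helper methods, and the trivial normalisation and filtering rules (trim, skip blank lines and '#' comments) are assumptions made for illustration, and URLs are plain strings rather than CrawlDatum objects. The CrawlDbFilter step of job 2 is left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: simulates the two-job flow on in-memory lists.
public class InjectorFlowSketch {

    // Job 1 mapper: hypothetical normalisation (trim) and filtering
    // (skip blank lines and '#' comment lines).
    public static List<String> job1Map(List<String> seedLines) {
        List<String> urls = new ArrayList<>();
        for (String line : seedLines) {
            String url = line.trim();
            if (url.isEmpty() || url.startsWith("#")) {
                continue;
            }
            urls.add(url);
        }
        return urls;
    }

    // Shuffle: group records by URL key, as the framework does
    // between the map and reduce phases.
    public static Map<String, List<String>> shuffle(List<String> urls) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String url : urls) {
            grouped.computeIfAbsent(url, k -> new ArrayList<>()).add(url);
        }
        return grouped;
    }

    // Identity reducer: emits every value it receives, so duplicates survive.
    public static List<String> identityReduce(Map<String, List<String>> grouped) {
        List<String> out = new ArrayList<>();
        for (List<String> values : grouped.values()) {
            out.addAll(values);
        }
        return out;
    }

    // Deduplicating reducer: emits exactly one record per key.
    public static List<String> dedupReduce(Map<String, List<String>> grouped) {
        return new ArrayList<>(grouped.keySet());
    }

    public static void main(String[] args) {
        List<String> seeds = List.of(
            "http://a.example/", " http://a.example/ ", "# comment", "http://b.example/");

        // Job 1: map + identity reduce. The duplicate URL is still present.
        List<String> afterJob1 = identityReduce(shuffle(job1Map(seeds)));

        // Job 2: existing crawldb (empty here) + job 1 output, then deduplication.
        List<String> job2Input = new ArrayList<>();
        job2Input.addAll(afterJob1);
        List<String> afterJob2 = dedupReduce(shuffle(job2Input));

        System.out.println("after job 1: " + afterJob1);
        System.out.println("after job 2: " + afterJob2);
    }
}
```

With these seeds, job 1 emits three records (one URL twice) and job 2 reduces them to two. In the real injector the deduplication step works on CrawlDatum values, not bare URL strings, so this sketch only shows the shape of the data flow.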
If there is no existing crawldb (which will often be the case at injection time), we don't need the second MapReduce job: we could simply take the output of MR job #1 as the CrawlDB, provided that we do the deduplication in its reduce step.
If there is a crawldb, then the reduce step of MR job #1 is not needed and that job could run as map-only.
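The proposal boils down to choosing a job plan based on whether a crawldb already exists. The sketch below expresses that choice in plain Java. It is not Nutch code: the names InjectorPlanSketch, JobSpec, and plan, and the job and reducer labels, are hypothetical and only describe which jobs would run.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: describes which jobs the injector would run
// under the proposed change. No Hadoop or Nutch API is used.
public class InjectorPlanSketch {

    // Hypothetical description of one MapReduce job.
    // A null reducer means the job is map-only.
    public static final class JobSpec {
        public final String name;
        public final String reducer;

        public JobSpec(String name, String reducer) {
            this.name = name;
            this.reducer = reducer;
        }

        public boolean isMapOnly() {
            return reducer == null;
        }

        @Override
        public String toString() {
            return name + (isMapOnly() ? " (map-only)" : " (reducer: " + reducer + ")");
        }
    }

    public static List<JobSpec> plan(boolean crawlDbExists) {
        List<JobSpec> jobs = new ArrayList<>();
        if (!crawlDbExists) {
            // No crawldb: one job whose reducer deduplicates;
            // its output is used directly as the new crawldb.
            jobs.add(new JobSpec("inject", "dedup"));
        } else {
            // Existing crawldb: job 1 needs no reduce step, because
            // job 2 merges and deduplicates anyway.
            jobs.add(new JobSpec("inject", null));
            jobs.add(new JobSpec("merge", "dedup"));
        }
        return jobs;
    }

    public static void main(String[] args) {
        System.out.println("no crawldb:       " + plan(false));
        System.out.println("existing crawldb: " + plan(true));
    }
}
```

Either branch saves work compared with the current flow: the first drops a whole job, the second drops the shuffle and reduce phase of job 1.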
--
This message was sent by Atlassian JIRA
(v6.2#6252)