Posted to dev@nutch.apache.org by "Tejas Patil (JIRA)" <ji...@apache.org> on 2014/01/23 20:54:46 UTC
[jira] [Commented] (NUTCH-1712) Use MultipleInputs in Injector to make it a single mapreduce job
[ https://issues.apache.org/jira/browse/NUTCH-1712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13880288#comment-13880288 ]
Tejas Patil commented on NUTCH-1712:
------------------------------------
The performance gain from this patch won't be phenomenal for a small seeds file without any metadata and a large crawldb. The only time this patch saves is the time spent on:
1. dumping the output of the first job (i.e. datum objects for the seed urls)
2. reading this output as input for the next job
3. job launch and cleanup.
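
To make the three savings concrete, here is a minimal plain-Java sketch of the single-pass idea. It is not the patch itself and uses no Hadoop or Nutch classes: SingleJobInjectSketch, Pair, mapSeeds, mapCrawlDb, reduce and inject are all hypothetical names, statuses are plain strings, and the rule that an existing crawldb entry wins over a freshly injected one is an assumption made for illustration. The point is only that both inputs feed one reduce step, so nothing is written out and read back in between.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustration only: these names are hypothetical and do not come from Nutch or Hadoop.
class SingleJobInjectSketch {

    // A (url, status) pair tagged with the input it came from.
    static final class Pair {
        final String url;
        final String status;
        final boolean fromSeeds;

        Pair(String url, String status, boolean fromSeeds) {
            this.url = url;
            this.status = status;
            this.fromSeeds = fromSeeds;
        }
    }

    // "Mapper" for the seeds file: each non-blank line becomes an injected record.
    static List<Pair> mapSeeds(List<String> seedUrls) {
        List<Pair> out = new ArrayList<>();
        for (String line : seedUrls) {
            String url = line.trim();
            if (!url.isEmpty()) {
                out.add(new Pair(url, "injected", true));
            }
        }
        return out;
    }

    // "Mapper" for the existing crawldb: entries pass through unchanged.
    static List<Pair> mapCrawlDb(Map<String, String> crawlDb) {
        List<Pair> out = new ArrayList<>();
        for (Map.Entry<String, String> e : crawlDb.entrySet()) {
            out.add(new Pair(e.getKey(), e.getValue(), false));
        }
        return out;
    }

    // "Reducer": one result per URL. Assumption: an existing entry is kept as-is,
    // and a URL seen only in the seeds becomes "db_unfetched".
    static Map<String, String> reduce(List<Pair> pairs) {
        Map<String, String> merged = new TreeMap<>();
        for (Pair p : pairs) {
            if (p.fromSeeds) {
                merged.putIfAbsent(p.url, "db_unfetched");
            } else {
                merged.put(p.url, p.status);
            }
        }
        return merged;
    }

    // Both inputs are mapped and reduced in one pass, with no intermediate dump.
    static Map<String, String> inject(List<String> seedUrls, Map<String, String> crawlDb) {
        List<Pair> all = new ArrayList<>(mapSeeds(seedUrls));
        all.addAll(mapCrawlDb(crawlDb));
        return reduce(all);
    }
}
```

In the two-job version, the list returned by mapSeeds would correspond to the intermediate output that is written by the first job and read again by the second.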
> Use MultipleInputs in Injector to make it a single mapreduce job
> ----------------------------------------------------------------
>
> Key: NUTCH-1712
> URL: https://issues.apache.org/jira/browse/NUTCH-1712
> Project: Nutch
> Issue Type: Improvement
> Components: injector
> Affects Versions: 1.7
> Reporter: Tejas Patil
> Assignee: Tejas Patil
> Fix For: 1.8
>
> Attachments: NUTCH-1712-trunk.v1.patch
>
>
> Currently Injector creates two mapreduce jobs:
> 1. sort job: get the urls from seeds file, emit CrawlDatum objects.
> 2. merge job: read CrawlDatum objects from both crawldb and output of sort job. Merge and emit final CrawlDatum objects.
> Using MultipleInputs, we can read CrawlDatum objects from crawldb and urls from seeds file simultaneously and perform inject in a single map-reduce job.
> This JIRA also covers the following:
> 1. Pushed filtering and normalization above metadata extraction so that unwanted records are ruled out early.
> 2. Migrated to the new mapreduce API
> 3. Improved documentation
> 4. New JUnit tests with better coverage
> Relevant discussion over nutch-dev can be found here:
> http://mail-archives.apache.org/mod_mbox/nutch-dev/201401.mbox/%3CCAFKhtFyXO6WL7gyUV+a5Y1pzNtdCoqPz4jz_up_bkp9cJe80kg@mail.gmail.com%3E
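
The reordering in item 1 of the description (filter and normalize before extracting metadata) can be sketched as below. This is a rough illustration under assumptions, not the patch's code: SeedLineSketch, Seed and parse are hypothetical names, the seed line is assumed to be a URL followed by tab-separated name=value pairs, trim() stands in for real normalization, and the filter is whatever predicate the caller passes in.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;
import java.util.function.Predicate;

// Illustration only: these names are hypothetical and do not come from Nutch.
class SeedLineSketch {

    // A parsed seed: the URL plus any name=value metadata found after it.
    static final class Seed {
        final String url;
        final Map<String, String> metadata;

        Seed(String url, Map<String, String> metadata) {
            this.url = url;
            this.metadata = metadata;
        }
    }

    // Normalizes and filters the URL first; metadata is parsed only for accepted URLs.
    static Optional<Seed> parse(String line, Predicate<String> urlFilter) {
        int tab = line.indexOf('\t');
        String rawUrl = tab < 0 ? line : line.substring(0, tab);
        String url = rawUrl.trim(); // stand-in for real URL normalization

        // Rejected records leave here, before any metadata work is done.
        if (url.isEmpty() || !urlFilter.test(url)) {
            return Optional.empty();
        }

        Map<String, String> metadata = new LinkedHashMap<>();
        if (tab >= 0) {
            for (String field : line.substring(tab + 1).split("\t")) {
                int eq = field.indexOf('=');
                if (eq > 0) { // skip fields without a name
                    metadata.put(field.substring(0, eq).trim(), field.substring(eq + 1).trim());
                }
            }
        }
        return Optional.of(new Seed(url, metadata));
    }
}
```

For example, calling parse with a filter that accepts only URLs starting with "http" returns an empty Optional for an ftp seed without ever splitting its metadata fields.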
--
This message was sent by Atlassian JIRA
(v6.1.5#6160)