Posted to dev@nutch.apache.org by Sandeep Tata <sa...@gmail.com> on 2008/04/11 23:25:31 UTC
Fetcher2 Reduce Phase Question
Hi Folks,
I was just wondering what computation really happens in the reduce
phase for Fetcher2?
I know that it is implemented as a MapRunnable -- but I see no
explicit reducer being set for the job. Is the identity reducer being
used? Why can't we simply use job.setNumReduceTasks(0)?
Wouldn't this be faster?
Sandeep
Re: Fetcher2 Reduce Phase Question
Posted by Andrzej Bialecki <ab...@getopt.org>.
Sandeep Tata wrote:
> Hi Folks,
>
> I was just wondering what computation really happens in the reduce
> phase for Fetcher2?
If Fetcher is running in parsing mode, then in the reduce phase the
Outlinks are separated from the Parse output and stored in crawl_parse,
and the other data in parse_text and parse_data. This actually happens
in FetcherOutputFormat / ParseOutputFormat, so there is no need for any
reducer apart from the default IdentityReducer.
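The split described above can be pictured with a small stand-in (the class
and method names below are invented for illustration only; in real Nutch the
work is done by FetcherOutputFormat / ParseOutputFormat writing Hadoop
SequenceFiles and MapFiles into the segment directories):

```java
import java.util.HashMap;
import java.util.Map;

// Toy stand-in for the demultiplexing that FetcherOutputFormat /
// ParseOutputFormat perform: the pieces of one parsed page are routed
// to different output "stores" (in real Nutch these are the
// crawl_parse, parse_text, and parse_data segment directories).
public class ParseDemux {

    // One in-memory map per output store; real Nutch writes files.
    final Map<String, Map<String, String>> stores = new HashMap<>();

    public ParseDemux() {
        stores.put("crawl_parse", new HashMap<>()); // outlinks / CrawlDatum
        stores.put("parse_text", new HashMap<>());  // extracted plain text
        stores.put("parse_data", new HashMap<>());  // title, metadata, etc.
    }

    // Route the pieces of a single parsed page to their stores.
    public void write(String url, String outlinks, String text, String meta) {
        stores.get("crawl_parse").put(url, outlinks);
        stores.get("parse_text").put(url, text);
        stores.get("parse_data").put(url, meta);
    }

    public String get(String store, String url) {
        return stores.get(store).get(url);
    }
}
```

The point is that the demultiplexing lives entirely in the output format,
which is why an identity reduce suffices.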
>
> I know that it is implemented as a MapRunnable -- but I see no
> explicit reducer being set for the job. Is the identity reducer being
> used? Why can't we simply use job.setNumReduceTasks(0)?
> Wouldn't this be faster?
First, when Fetcher / Fetcher2 were written there was no such option in
Hadoop. Second, the meaning of this setting is that the output from the
maps becomes the final output - but this won't cut it, because map
outputs are always plain SequenceFiles, whereas we need to split the
FetcherOutput into a bunch of SequenceFiles and MapFiles (which have to
be sorted) ...
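A rough illustration of why that sort matters (a toy stand-in, not Hadoop's
actual MapFile class): a MapFile supports fast lookup via binary search over
its keys, which only works if the keys were written in sorted order --
exactly what the shuffle/sort before the identity reduce provides for free.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Toy stand-in showing the MapFile constraint: keys appended in the
// arbitrary order a map task emits them cannot be binary-searched;
// they must be sorted first, which is what the shuffle/sort ahead of
// the (identity) reduce phase does.
public class SortedLookup {

    final List<String> keys = new ArrayList<>();

    // Append in arbitrary order, as a map task would emit records.
    public void append(String key) {
        keys.add(key);
    }

    // The global sort that the shuffle/reduce phase performs.
    public void sortLikeShuffle() {
        Collections.sort(keys);
    }

    // Binary search is only valid once the keys are sorted.
    public boolean contains(String key) {
        return Collections.binarySearch(keys, key) >= 0;
    }
}
```

With setNumReduceTasks(0) the sort step disappears, so the unsorted map
output could not be turned into MapFiles.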
--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com