Posted to dev@nutch.apache.org by "Enis Soztutar (JIRA)" <ji...@apache.org> on 2007/06/15 09:58:26 UTC
[jira] Commented: (NUTCH-498) Use Combiner in LinkDb to increase
speed of linkdb generation
[ https://issues.apache.org/jira/browse/NUTCH-498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12505079 ]
Enis Soztutar commented on NUTCH-498:
-------------------------------------
I think you may not want
{code}
reporter.incrCounter(Counters.COMBINED, combined);
{code}
which increments the counter by the total count so far, but rather you may use
{code}
reporter.incrCounter(Counters.COMBINED, 1);
{code}
for each url combined.
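To make the difference concrete, here is a minimal sketch in plain Java with no Hadoop dependency. The class CounterDemo and its static incrCounter are hypothetical stand-ins for a job counter, not the real Reporter API. It assumes a single reduce call that merges four values and compares three ways of updating the counter: adding the running total inside the loop, adding 1 per merged value, and adding the final total once when the call exits.

```java
public class CounterDemo {
    // Hypothetical stand-in for a job counter; not the Hadoop Reporter API.
    static long counter;

    static void incrCounter(long amount) {
        counter += amount;
    }

    // Adds the running total each time a value is merged: 1 + 2 + ... + n.
    static long runningTotalInsideLoop(int mergedValues) {
        counter = 0;
        int combined = 0;
        for (int i = 0; i < mergedValues; i++) {
            combined++;
            incrCounter(combined);
        }
        return counter;
    }

    // Adds 1 each time a value is merged: n in total.
    static long onePerMergedValue(int mergedValues) {
        counter = 0;
        for (int i = 0; i < mergedValues; i++) {
            incrCounter(1);
        }
        return counter;
    }

    // Tracks a local total and adds it once at the end: also n in total.
    static long totalOnceAtEnd(int mergedValues) {
        counter = 0;
        int combined = 0;
        for (int i = 0; i < mergedValues; i++) {
            combined++;
        }
        if (combined > 0) {
            incrCounter(combined);
        }
        return counter;
    }

    public static void main(String[] args) {
        System.out.println(runningTotalInsideLoop(4)); // 1 + 2 + 3 + 4 = 10
        System.out.println(onePerMergedValue(4));      // 4
        System.out.println(totalOnceAtEnd(4));         // 4
    }
}
```

Under these assumptions, only the first variant overcounts. Incrementing by 1 per merged value and incrementing once by a per-call total give the same result, so which of the two to use is mostly a question of readability.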
Could you attach the patch against current trunk, so that we can apply it directly?
> Use Combiner in LinkDb to increase speed of linkdb generation
> -------------------------------------------------------------
>
> Key: NUTCH-498
> URL: https://issues.apache.org/jira/browse/NUTCH-498
> Project: Nutch
> Issue Type: Improvement
> Components: linkdb
> Affects Versions: 0.9.0
> Reporter: Espen Amble Kolstad
> Priority: Minor
>
> I tried to add the following combiner to LinkDb:
> public static enum Counters {COMBINED}
>
> public static class LinkDbCombiner extends MapReduceBase implements Reducer {
>   private int _maxInlinks;
>
>   @Override
>   public void configure(JobConf job) {
>     super.configure(job);
>     _maxInlinks = job.getInt("db.max.inlinks", 10000);
>   }
>
>   public void reduce(WritableComparable key, Iterator values, OutputCollector output, Reporter reporter) throws IOException {
>     final Inlinks inlinks = (Inlinks) values.next();
>     int combined = 0;
>     while (values.hasNext()) {
>       Inlinks val = (Inlinks) values.next();
>       for (Iterator it = val.iterator(); it.hasNext();) {
>         if (inlinks.size() >= _maxInlinks) {
>           if (combined > 0) {
>             reporter.incrCounter(Counters.COMBINED, combined);
>           }
>           output.collect(key, inlinks);
>           return;
>         }
>         Inlink in = (Inlink) it.next();
>         inlinks.add(in);
>       }
>       combined++;
>     }
>     if (inlinks.size() == 0) {
>       return;
>     }
>     if (combined > 0) {
>       reporter.incrCounter(Counters.COMBINED, combined);
>     }
>     output.collect(key, inlinks);
>   }
> }
> This greatly reduced the time it took to generate a new linkdb. In my case it reduced the time by half.
> Map output records     8717810541
> Combined               7632541507
> Resulting output rec   1085269034
> That's roughly an 87.6% reduction of output records from the map phase.
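The counter values quoted above can be checked with a little arithmetic. The following is a small sketch in plain Java; the class ReductionCheck and its method names are hypothetical and exist only for this illustration. It tests whether the combined count plus the resulting record count adds up to the map output count, and computes the reduction as a percentage.

```java
public class ReductionCheck {
    // Counter values copied from the issue description.
    static final long MAP_OUTPUT = 8717810541L;
    static final long COMBINED = 7632541507L;
    static final long RESULT = 1085269034L;

    // Percentage of map output records removed by the combiner.
    static double reductionPercent(long mapOutput, long result) {
        return 100.0 * (mapOutput - result) / mapOutput;
    }

    // True when every map output record is either combined away or emitted.
    static boolean countsAddUp(long mapOutput, long combined, long result) {
        return combined + result == mapOutput;
    }

    public static void main(String[] args) {
        System.out.println(countsAddUp(MAP_OUTPUT, COMBINED, RESULT));
        System.out.println(reductionPercent(MAP_OUTPUT, RESULT));
    }
}
```

With these figures the three counters are consistent with each other, and the reduction comes out between 87.5% and 87.6%.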