Posted to dev@nutch.apache.org by "Espen Amble Kolstad (JIRA)" <ji...@apache.org> on 2007/06/14 10:07:25 UTC
[jira] Created: (NUTCH-498) Use Combiner in LinkDb to increase speed of linkdb generation
Use Combiner in LinkDb to increase speed of linkdb generation
-------------------------------------------------------------
Key: NUTCH-498
URL: https://issues.apache.org/jira/browse/NUTCH-498
Project: Nutch
Issue Type: Improvement
Components: linkdb
Affects Versions: 0.9.0
Reporter: Espen Amble Kolstad
Priority: Minor
I tried to add the following combiner to LinkDb:
{code}
public static class LinkDbCombiner extends MapReduceBase implements Reducer {
  private int _maxInlinks;

  @Override
  public void configure(JobConf job) {
    super.configure(job);
    _maxInlinks = job.getInt("db.max.inlinks", 10000);
  }

  public void reduce(WritableComparable key, Iterator values,
      OutputCollector output, Reporter reporter) throws IOException {
    // Merge all Inlinks values for this key into the first one.
    final Inlinks inlinks = (Inlinks) values.next();
    int combined = 0;
    while (values.hasNext()) {
      Inlinks val = (Inlinks) values.next();
      for (Iterator it = val.iterator(); it.hasNext();) {
        // Stop once the configured inlink cap is reached.
        if (inlinks.size() >= _maxInlinks) {
          output.collect(key, inlinks);
          return;
        }
        Inlink in = (Inlink) it.next();
        inlinks.add(in);
      }
      combined++;
    }
    if (inlinks.size() == 0) {
      return;
    }
    if (combined > 0) {
      // Counters is an enum not shown in this snippet (see the follow-up
      // comment below; it was added in a later revision of the description).
      reporter.incrCounter(Counters.COMBINED, combined);
    }
    output.collect(key, inlinks);
  }
}
{code}
This greatly reduced the time it took to generate a new linkdb. In my case it reduced the time by half.
|Map output records|8717810541|
|Combined|7632541507|
|Resulting output records|1085269034|
That's an 87% reduction of output records from the map phase (7,632,541,507 of the 8,717,810,541 map output records were merged away by the combiner).
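For reference, a combiner like this gets registered in the job setup. Below is a minimal sketch against the old org.apache.hadoop.mapred API used by Nutch 0.9; the surrounding job configuration is illustrative, not taken from the actual patch:
{code}
// Sketch only: the setup around it is illustrative. The essential call is
// setCombinerClass, which runs LinkDbCombiner on each map task's output
// before records are shuffled to the reducers.
JobConf job = new JobConf(config);           // config: an existing Configuration
job.setJobName("linkdb invert");
job.setCombinerClass(LinkDbCombiner.class);  // pre-merge Inlinks on the map side
job.setReducerClass(LinkDb.class);           // final merge happens in reduce
JobClient.runJob(job);
{code}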
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-498) Use Combiner in LinkDb to increase speed of linkdb generation
Posted by "Espen Amble Kolstad (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12505242 ]
Espen Amble Kolstad commented on NUTCH-498:
-------------------------------------------
Yes, you're right. I forgot I added a new class just to get the Counter ...
[jira] Commented: (NUTCH-498) Use Combiner in LinkDb to increase speed of linkdb generation
Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12508506 ]
Andrzej Bialecki commented on NUTCH-498:
-----------------------------------------
+1.
[jira] Commented: (NUTCH-498) Use Combiner in LinkDb to increase speed of linkdb generation
Posted by "Doğacan Güney (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12508505 ]
Doğacan Güney commented on NUTCH-498:
-------------------------------------
I tested creating a linkdb from ~6M urls:
|Combine input records|42,091,902|
|Combine output records|15,684,838|
(The combiner reduces the number of records to roughly a third.)
The job took ~15 minutes overall with the combiner, ~20 minutes without it.
So, +1 from me.
[jira] Updated: (NUTCH-498) Use Combiner in LinkDb to increase speed of linkdb generation
Posted by "Espen Amble Kolstad (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Espen Amble Kolstad updated NUTCH-498:
--------------------------------------
Attachment: LinkDbCombiner.patch
Made a patch for the one-liner mentioned above.
[jira] Commented: (NUTCH-498) Use Combiner in LinkDb to increase speed of linkdb generation
Posted by "Doğacan Güney (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12505454 ]
Doğacan Güney commented on NUTCH-498:
-------------------------------------
> Currently there is no difference, indeed. The version in LinkDb.reduce is safer, because it uses a separate instance of Inlinks. Perhaps we could
> replace LinkDb.Merger.reduce with the body of LinkDb.reduce, and completely remove LinkDb.reduce.
Sounds good. I opened NUTCH-499 for this.
[jira] Updated: (NUTCH-498) Use Combiner in LinkDb to increase speed of linkdb generation
Posted by "Espen Amble Kolstad (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Espen Amble Kolstad updated NUTCH-498:
--------------------------------------
Attachment: LinkDbCombiner.patch
Here's a patch for trunk. I removed the Counter since it's not really useful information; it only showed the reduction of output records.
[jira] Commented: (NUTCH-498) Use Combiner in LinkDb to increase speed of linkdb generation
Posted by "Sami Siren (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12508508 ]
Sami Siren commented on NUTCH-498:
----------------------------------
+1
[jira] Commented: (NUTCH-498) Use Combiner in LinkDb to increase speed of linkdb generation
Posted by "Enis Soztutar (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12505079 ]
Enis Soztutar commented on NUTCH-498:
-------------------------------------
I think you may not want
{code}
reporter.incrCounter(Counters.COMBINED, combined);
{code}
which increments the counter by the total count so far. Rather, you may use
{code}
reporter.incrCounter(Counters.COMBINED, 1);
{code}
for each url combined.
Could you attach the patch against current trunk, so that we can apply it directly?
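For what it's worth, here is a sketch of the reduce method with that change applied; it assumes the same class, fields, and Inlinks API as the LinkDbCombiner above, and only the counter handling differs:
{code}
public void reduce(WritableComparable key, Iterator values,
    OutputCollector output, Reporter reporter) throws IOException {
  final Inlinks inlinks = (Inlinks) values.next();
  while (values.hasNext()) {
    Inlinks val = (Inlinks) values.next();
    for (Iterator it = val.iterator(); it.hasNext();) {
      if (inlinks.size() >= _maxInlinks) {
        output.collect(key, inlinks);
        return;
      }
      inlinks.add((Inlink) it.next());
    }
    // Increment once per record merged, rather than by a running total.
    reporter.incrCounter(Counters.COMBINED, 1);
  }
  if (inlinks.size() > 0) {
    output.collect(key, inlinks);
  }
}
{code}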
[jira] Updated: (NUTCH-498) Use Combiner in LinkDb to increase speed of linkdb generation
Posted by "Espen Amble Kolstad (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Espen Amble Kolstad updated NUTCH-498:
--------------------------------------
Description:
I tried to add the following combiner to LinkDb:
{code}
public static enum Counters {COMBINED}

public static class LinkDbCombiner extends MapReduceBase implements Reducer {
  private int _maxInlinks;

  @Override
  public void configure(JobConf job) {
    super.configure(job);
    _maxInlinks = job.getInt("db.max.inlinks", 10000);
  }

  public void reduce(WritableComparable key, Iterator values,
      OutputCollector output, Reporter reporter) throws IOException {
    final Inlinks inlinks = (Inlinks) values.next();
    int combined = 0;
    while (values.hasNext()) {
      Inlinks val = (Inlinks) values.next();
      for (Iterator it = val.iterator(); it.hasNext();) {
        if (inlinks.size() >= _maxInlinks) {
          if (combined > 0) {
            reporter.incrCounter(Counters.COMBINED, combined);
          }
          output.collect(key, inlinks);
          return;
        }
        Inlink in = (Inlink) it.next();
        inlinks.add(in);
      }
      combined++;
    }
    if (inlinks.size() == 0) {
      return;
    }
    if (combined > 0) {
      reporter.incrCounter(Counters.COMBINED, combined);
    }
    output.collect(key, inlinks);
  }
}
{code}
This greatly reduced the time it took to generate a new linkdb. In my case it reduced the time by half.
|Map output records|8717810541|
|Combined|7632541507|
|Resulting output records|1085269034|
That's an 87% reduction of output records from the map phase.
[jira] Closed: (NUTCH-498) Use Combiner in LinkDb to increase speed of linkdb generation
Posted by "Doğacan Güney (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Doğacan Güney closed NUTCH-498.
-------------------------------
Issue resolved and committed.
[jira] Resolved: (NUTCH-498) Use Combiner in LinkDb to increase speed of linkdb generation
Posted by "Doğacan Güney (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Doğacan Güney resolved NUTCH-498.
---------------------------------
Resolution: Fixed
Fix Version/s: 1.0.0
Assignee: Doğacan Güney
Committed in rev. 551147.
[jira] Commented: (NUTCH-498) Use Combiner in LinkDb to increase speed of linkdb generation
Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12505302 ]
Andrzej Bialecki commented on NUTCH-498:
-----------------------------------------
Currently there is no difference, indeed. The version in LinkDb.reduce is safer, because it uses a separate instance of Inlinks. Perhaps we could replace LinkDb.Merger.reduce with the body of LinkDb.reduce, and completely remove LinkDb.reduce.
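To illustrate the point about a separate instance: a reduce along these lines merges into a fresh Inlinks object instead of mutating the first value from the iterator, which is safer when the framework reuses value objects. This is a sketch under that assumption, not the actual LinkDb.reduce body:
{code}
public void reduce(WritableComparable key, Iterator values,
    OutputCollector output, Reporter reporter) throws IOException {
  // A fresh Inlinks instance, independent of the (possibly reused) input values.
  Inlinks result = new Inlinks();
  while (values.hasNext()) {
    Inlinks inlinks = (Inlinks) values.next();
    for (Iterator it = inlinks.iterator(); it.hasNext();) {
      if (result.size() >= maxInlinks) {  // maxInlinks: configured as in the combiner
        output.collect(key, result);
        return;
      }
      result.add((Inlink) it.next());
    }
  }
  if (result.size() > 0) {
    output.collect(key, result);
  }
}
{code}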
[jira] Commented: (NUTCH-498) Use Combiner in LinkDb to increase speed of linkdb generation
Posted by "Doğacan Güney (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12505197 ]
Doğacan Güney commented on NUTCH-498:
-------------------------------------
Why can't we just set the combiner class to LinkDb? AFAICS, you are not doing anything different from LinkDb.reduce in LinkDbCombiner.reduce. A one-liner
job.setCombinerClass(LinkDb.class);
should do the trick, shouldn't it?
[jira] Commented: (NUTCH-498) Use Combiner in LinkDb to increase speed of linkdb generation
Posted by "Doğacan Güney (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12505249 ]
Doğacan Güney commented on NUTCH-498:
-------------------------------------
After examining the code better, I am a bit confused. We have a LinkDb.Merger.reduce and a LinkDb.reduce. They both do the same thing (aggregate inlinks until the size reaches maxInlinks, then collect). Why do we have them separately? Is there a difference between them that I am missing?
[jira] Commented: (NUTCH-498) Use Combiner in LinkDb to increase speed of linkdb generation
Posted by "Hudson (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12508748 ]
Hudson commented on NUTCH-498:
------------------------------
Integrated in Nutch-Nightly #131 (See [http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/131/])