Posted to dev@nutch.apache.org by "Espen Amble Kolstad (JIRA)" <ji...@apache.org> on 2007/06/14 10:07:25 UTC
[jira] Created: (NUTCH-498) Use Combiner in LinkDb to increase speed of linkdb generation
Use Combiner in LinkDb to increase speed of linkdb generation
-------------------------------------------------------------
Key: NUTCH-498
URL: https://issues.apache.org/jira/browse/NUTCH-498
Project: Nutch
Issue Type: Improvement
Components: linkdb
Affects Versions: 0.9.0
Reporter: Espen Amble Kolstad
Priority: Minor
I tried to add the following combiner to LinkDb:
{code}
public static class LinkDbCombiner extends MapReduceBase implements Reducer {
  private int _maxInlinks;

  @Override
  public void configure(JobConf job) {
    super.configure(job);
    _maxInlinks = job.getInt("db.max.inlinks", 10000);
  }

  public void reduce(WritableComparable key, Iterator values,
      OutputCollector output, Reporter reporter) throws IOException {
    // Merge all Inlinks values for this key into the first one.
    final Inlinks inlinks = (Inlinks) values.next();
    int combined = 0;
    while (values.hasNext()) {
      Inlinks val = (Inlinks) values.next();
      for (Iterator it = val.iterator(); it.hasNext();) {
        // Stop once the configured inlink cap is reached.
        if (inlinks.size() >= _maxInlinks) {
          output.collect(key, inlinks);
          return;
        }
        Inlink in = (Inlink) it.next();
        inlinks.add(in);
      }
      combined++;
    }
    if (inlinks.size() == 0) {
      return;
    }
    if (combined > 0) {
      // Counters is an enum not shown in this snippet (see the follow-up
      // comment below; it was added in a later revision of the description).
      reporter.incrCounter(Counters.COMBINED, combined);
    }
    output.collect(key, inlinks);
  }
}
{code}
This greatly reduced the time it took to generate a new linkdb. In my case it reduced the time by half.
|Map output records|8717810541|
|Combined|7632541507|
|Resulting output records|1085269034|
That's an 87% reduction of output records from the map phase (7,632,541,507 of the 8,717,810,541 map output records were merged away by the combiner).
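For reference, a combiner like this gets registered in the job setup. Below is a minimal sketch against the old org.apache.hadoop.mapred API used by Nutch 0.9; the surrounding job configuration is illustrative, not taken from the actual patch:
{code}
// Sketch only: the setup around it is illustrative. The essential call is
// setCombinerClass, which runs LinkDbCombiner on each map task's output
// before records are shuffled to the reducers.
JobConf job = new JobConf(config);           // config: an existing Configuration
job.setJobName("linkdb invert");
job.setCombinerClass(LinkDbCombiner.class);  // pre-merge Inlinks on the map side
job.setReducerClass(LinkDb.class);           // final merge happens in reduce
JobClient.runJob(job);
{code}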
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-498) Use Combiner in LinkDb to increase speed of linkdb generation
Posted by "Espen Amble Kolstad (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12505242 ]
Espen Amble Kolstad commented on NUTCH-498:
-------------------------------------------
Yes, you're right. I forgot I added a new class just to get the Counter ...
[jira] Commented: (NUTCH-498) Use Combiner in LinkDb to increase speed of linkdb generation
Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12508506 ]
Andrzej Bialecki commented on NUTCH-498:
-----------------------------------------
+1.
[jira] Commented: (NUTCH-498) Use Combiner in LinkDb to increase speed of linkdb generation
Posted by "Doğacan Güney (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12508505 ]
Doğacan Güney commented on NUTCH-498:
-------------------------------------
I tested creating a linkdb from ~6M urls:
|Combine input records|42,091,902|
|Combine output records|15,684,838|
(The combiner reduces the number of records to roughly a third.)
The job took ~15 minutes overall with the combiner, ~20 minutes without it.
So, +1 from me.
[jira] Updated: (NUTCH-498) Use Combiner in LinkDb to increase speed of linkdb generation
Posted by "Espen Amble Kolstad (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Espen Amble Kolstad updated NUTCH-498:
--------------------------------------
Attachment: LinkDbCombiner.patch
Made a patch for the one-liner mentioned above.
[jira] Commented: (NUTCH-498) Use Combiner in LinkDb to increase speed of linkdb generation
Posted by "Doğacan Güney (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12505454 ]
Doğacan Güney commented on NUTCH-498:
-------------------------------------
> Currently there is no difference, indeed. The version in LinkDb.reduce is safer, because it uses a separate instance of Inlinks. Perhaps we could
> replace LinkDb.Merger.reduce with the body of LinkDb.reduce, and completely remove LinkDb.reduce.
Sounds good. I opened NUTCH-499 for this.
[jira] Updated: (NUTCH-498) Use Combiner in LinkDb to increase speed of linkdb generation
Posted by "Espen Amble Kolstad (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Espen Amble Kolstad updated NUTCH-498:
--------------------------------------
Attachment: LinkDbCombiner.patch
Here's a patch for trunk. I removed the Counter since it's not really useful information; it only showed the reduction of output records.
[jira] Commented: (NUTCH-498) Use Combiner in LinkDb to increase speed of linkdb generation
Posted by "Sami Siren (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12508508 ]
Sami Siren commented on NUTCH-498:
----------------------------------
+1
[jira] Commented: (NUTCH-498) Use Combiner in LinkDb to increase speed of linkdb generation
Posted by "Enis Soztutar (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12505079 ]
Enis Soztutar commented on NUTCH-498:
-------------------------------------
I think you may not want
{code}
reporter.incrCounter(Counters.COMBINED, combined);
{code}
which increments the counter by the total count so far. Rather, you may use
{code}
reporter.incrCounter(Counters.COMBINED, 1);
{code}
for each url combined.
Could you attach the patch against current trunk, so that we can apply it directly?
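For what it's worth, here is a sketch of the reduce method with that change applied; it assumes the same class, fields, and Inlinks API as the LinkDbCombiner above, and only the counter handling differs:
{code}
public void reduce(WritableComparable key, Iterator values,
    OutputCollector output, Reporter reporter) throws IOException {
  final Inlinks inlinks = (Inlinks) values.next();
  while (values.hasNext()) {
    Inlinks val = (Inlinks) values.next();
    for (Iterator it = val.iterator(); it.hasNext();) {
      if (inlinks.size() >= _maxInlinks) {
        output.collect(key, inlinks);
        return;
      }
      inlinks.add((Inlink) it.next());
    }
    // Increment once per record merged, rather than by a running total.
    reporter.incrCounter(Counters.COMBINED, 1);
  }
  if (inlinks.size() > 0) {
    output.collect(key, inlinks);
  }
}
{code}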
[jira] Updated: (NUTCH-498) Use Combiner in LinkDb to increase speed of linkdb generation
Posted by "Espen Amble Kolstad (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Espen Amble Kolstad updated NUTCH-498:
--------------------------------------
Description:
I tried to add the following combiner to LinkDb:
{code}
public static enum Counters {COMBINED}

public static class LinkDbCombiner extends MapReduceBase implements Reducer {
  private int _maxInlinks;

  @Override
  public void configure(JobConf job) {
    super.configure(job);
    _maxInlinks = job.getInt("db.max.inlinks", 10000);
  }

  public void reduce(WritableComparable key, Iterator values,
      OutputCollector output, Reporter reporter) throws IOException {
    final Inlinks inlinks = (Inlinks) values.next();
    int combined = 0;
    while (values.hasNext()) {
      Inlinks val = (Inlinks) values.next();
      for (Iterator it = val.iterator(); it.hasNext();) {
        if (inlinks.size() >= _maxInlinks) {
          if (combined > 0) {
            reporter.incrCounter(Counters.COMBINED, combined);
          }
          output.collect(key, inlinks);
          return;
        }
        Inlink in = (Inlink) it.next();
        inlinks.add(in);
      }
      combined++;
    }
    if (inlinks.size() == 0) {
      return;
    }
    if (combined > 0) {
      reporter.incrCounter(Counters.COMBINED, combined);
    }
    output.collect(key, inlinks);
  }
}
{code}
This greatly reduced the time it took to generate a new linkdb. In my case it reduced the time by half.
|Map output records|8717810541|
|Combined|7632541507|
|Resulting output records|1085269034|
That's an 87% reduction of output records from the map phase.
[jira] Closed: (NUTCH-498) Use Combiner in LinkDb to increase speed of linkdb generation
Posted by "Doğacan Güney (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Doğacan Güney closed NUTCH-498.
-------------------------------
Issue resolved and committed.
[jira] Resolved: (NUTCH-498) Use Combiner in LinkDb to increase speed of linkdb generation
Posted by "Doğacan Güney (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Doğacan Güney resolved NUTCH-498.
---------------------------------
Resolution: Fixed
Fix Version/s: 1.0.0
Assignee: Doğacan Güney
Committed in rev. 551147.
[jira] Commented: (NUTCH-498) Use Combiner in LinkDb to increase speed of linkdb generation
Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12505302 ]
Andrzej Bialecki commented on NUTCH-498:
-----------------------------------------
Currently there is no difference, indeed. The version in LinkDb.reduce is safer, because it uses a separate instance of Inlinks. Perhaps we could replace LinkDb.Merger.reduce with the body of LinkDb.reduce, and completely remove LinkDb.reduce.
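To illustrate the point about a separate instance: a reduce along these lines merges into a fresh Inlinks object instead of mutating the first value from the iterator, which is safer when the framework reuses value objects. This is a sketch under that assumption, not the actual LinkDb.reduce body:
{code}
public void reduce(WritableComparable key, Iterator values,
    OutputCollector output, Reporter reporter) throws IOException {
  // A fresh Inlinks instance, independent of the (possibly reused) input values.
  Inlinks result = new Inlinks();
  while (values.hasNext()) {
    Inlinks inlinks = (Inlinks) values.next();
    for (Iterator it = inlinks.iterator(); it.hasNext();) {
      if (result.size() >= maxInlinks) {  // maxInlinks: configured as in the combiner
        output.collect(key, result);
        return;
      }
      result.add((Inlink) it.next());
    }
  }
  if (result.size() > 0) {
    output.collect(key, result);
  }
}
{code}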
[jira] Commented: (NUTCH-498) Use Combiner in LinkDb to increase speed of linkdb generation
Posted by "Doğacan Güney (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12505197 ]
Doğacan Güney commented on NUTCH-498:
-------------------------------------
Why can't we just set the combiner class to LinkDb? AFAICS, you are not doing anything different from LinkDb.reduce in LinkDbCombiner.reduce. A one-liner
job.setCombinerClass(LinkDb.class);
should do the trick, shouldn't it?
[jira] Commented: (NUTCH-498) Use Combiner in LinkDb to increase speed of linkdb generation
Posted by "Doğacan Güney (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12505249 ]
Doğacan Güney commented on NUTCH-498:
-------------------------------------
After examining the code better, I am a bit confused. We have a LinkDb.Merger.reduce and a LinkDb.reduce. They both do the same thing (aggregate inlinks until the size reaches maxInlinks, then collect). Why do we have them separately? Is there a difference between them that I am missing?
[jira] Commented: (NUTCH-498) Use Combiner in LinkDb to increase speed of linkdb generation
Posted by "Hudson (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12508748 ]
Hudson commented on NUTCH-498:
------------------------------
Integrated in Nutch-Nightly #131 (See [http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/131/])