Posted to user@nutch.apache.org by weishenyun <we...@gmail.com> on 2014/01/14 12:02:32 UTC
Nutch 2.2.1 missing inbound link when using HBase
Hi all,
I recently started using Nutch 2.2.1 with HBase as the storage backend, and I
found that the column family il (inbound links) was missing. I have set
db.update.max.inlinks = 1000, but no il entries were there. Has anyone else
run into this problem?
--
View this message in context: http://lucene.472066.n3.nabble.com/Nutch-2-2-1-missing-inbound-link-when-using-HBase-tp4111216.html
Sent from the Nutch - User mailing list archive at Nabble.com.
Re: Nutch 2.2.1 missing inbound link when using HBase
Posted by weishenyun <we...@gmail.com>.
Hi Lewis,
I also found that something looks wrong in DBUpdaterReducer. See the code
block below:
if (page.getInlinks() != null) {
  page.getInlinks().clear();
}
for (ScoreDatum inlink : inlinkedScoreData) {
  page.putToInlinks(new Utf8(inlink.getUrl()), new Utf8(inlink.getAnchor()));
}
// Distance calculation.
// Retrieve smallest distance from all inlinks distances
// Calculate new distance for current page: smallest inlink distance plus 1.
// If the new distance is smaller than old one (or if old did not exist yet),
// write it to the page.
int smallestDist=Integer.MAX_VALUE;
for (ScoreDatum inlink : inlinkedScoreData) {
  int inlinkDist = inlink.getDistance();
  if (inlinkDist < smallestDist) {
    smallestDist=inlinkDist;
  }
  page.putToInlinks(new Utf8(inlink.getUrl()), new Utf8(inlink.getAnchor()));
}
The statement 'page.putToInlinks(new Utf8(inlink.getUrl()), new
Utf8(inlink.getAnchor()));' is invoked twice, once in each loop. When I
removed the second call, the inbound links came back in my case. I think the
second call is redundant, and it seems to cause this bug.
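
To make the suggested change concrete, here is a minimal standalone sketch of the same logic with the inlink map filled only once, inside the loop that also finds the smallest distance. It does not use the real Nutch classes: Inlink is a hypothetical stand-in for ScoreDatum, a plain Map<String, String> stands in for the page's inlinks, and the names InlinkUpdateSketch and rebuildInlinks are made up for illustration. It shows only the loop structure, not why the duplicate call affects what ends up in HBase.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class InlinkUpdateSketch {

    /** Hypothetical stand-in for Nutch's ScoreDatum: url, anchor and distance only. */
    static final class Inlink {
        final String url;
        final String anchor;
        final int distance;

        Inlink(String url, String anchor, int distance) {
            this.url = url;
            this.anchor = anchor;
            this.distance = distance;
        }
    }

    /**
     * Clears the existing inlinks, stores each inlink exactly once and
     * returns the smallest inlink distance (Integer.MAX_VALUE if there are none).
     */
    static int rebuildInlinks(Map<String, String> inlinks, List<Inlink> inlinkedScoreData) {
        inlinks.clear();
        int smallestDist = Integer.MAX_VALUE;
        for (Inlink inlink : inlinkedScoreData) {
            // Single put per inlink; the second loop's duplicate put is gone.
            inlinks.put(inlink.url, inlink.anchor);
            if (inlink.distance < smallestDist) {
                smallestDist = inlink.distance;
            }
        }
        return smallestDist;
    }

    public static void main(String[] args) {
        Map<String, String> inlinks = new LinkedHashMap<>();
        inlinks.put("http://stale.example/", "old anchor"); // left over from a previous update

        List<Inlink> data = new ArrayList<>();
        data.add(new Inlink("http://a.example/", "anchor a", 3));
        data.add(new Inlink("http://b.example/", "anchor b", 1));

        int smallest = rebuildInlinks(inlinks, data);
        System.out.println("inlinks = " + inlinks);
        System.out.println("smallest distance = " + smallest);
    }
}
```

In the real reducer the equivalent change would be to keep the first loop as it is and drop the putToInlinks call from the distance loop, which is what I did in my local copy.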
--
View this message in context: http://lucene.472066.n3.nabble.com/Nutch-2-2-1-missing-inbound-link-when-using-HBase-tp4111216p4112656.html