Posted to user@nutch.apache.org by weishenyun <we...@gmail.com> on 2014/01/14 12:02:32 UTC

Nutch 2.2.1 missing inbound link when using HBase

Hi all,

I recently started using Nutch 2.2.1 with HBase as the storage backend, and I
found that the column family il (inbound links) was missing. I have set
db.update.max.inlinks = 1000, but no il entries were written at all. Has anyone
else run into this problem?



--
View this message in context: http://lucene.472066.n3.nabble.com/Nutch-2-2-1-missing-inbound-link-when-using-HBase-tp4111216.html
Sent from the Nutch - User mailing list archive at Nabble.com.

Re: Nutch 2.2.1 missing inbound link when using HBase

Posted by weishenyun <we...@gmail.com>.
Hi Lewis,

I also found that something looks wrong in DbUpdateReducer. See the code
block below:
    if (page.getInlinks() != null) {
      page.getInlinks().clear();
    }
    for (ScoreDatum inlink : inlinkedScoreData) {
      page.putToInlinks(new Utf8(inlink.getUrl()), new Utf8(inlink.getAnchor()));
    }

    // Distance calculation.
    // Retrieve smallest distance from all inlinks distances
    // Calculate new distance for current page: smallest inlink distance plus 1.
    // If the new distance is smaller than old one (or if old did not exist yet),
    // write it to the page.
    int smallestDist=Integer.MAX_VALUE;
    for (ScoreDatum inlink : inlinkedScoreData) {
      int inlinkDist = inlink.getDistance();
      if (inlinkDist < smallestDist) {
        smallestDist=inlinkDist;
      }
      page.putToInlinks(new Utf8(inlink.getUrl()), new Utf8(inlink.getAnchor()));
    }

The statement 'page.putToInlinks(new Utf8(inlink.getUrl()), new
Utf8(inlink.getAnchor()));' is called twice: once in the first loop and again
in the distance loop. When I removed the second call, the inbound links came
back in my case. I think the second call is redundant, and it seems to be what
causes this bug.
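
To make the change concrete, here is a rough sketch of the two loops with the
second put removed. It is only an illustration: Inlink is a hypothetical
stand-in for ScoreDatum, the inlink map is a plain java.util.Map rather than
the Gora-backed field on the page, and the method names are made up. It does
not reproduce the HBase persistence behaviour, only the shape of the logic.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class InlinkUpdateSketch {

  /** Hypothetical stand-in for ScoreDatum; only the fields used here. */
  static final class Inlink {
    final String url;
    final String anchor;
    final int distance;

    Inlink(String url, String anchor, int distance) {
      this.url = url;
      this.anchor = anchor;
      this.distance = distance;
    }
  }

  /** Clears the old inlinks and writes each incoming inlink exactly once. */
  static Map<String, String> rebuildInlinks(Map<String, String> inlinks,
                                            List<Inlink> incoming) {
    if (inlinks != null) {
      inlinks.clear();
    } else {
      inlinks = new LinkedHashMap<>();
    }
    for (Inlink inlink : incoming) {
      inlinks.put(inlink.url, inlink.anchor);
    }
    return inlinks;
  }

  /** Distance loop only: no second write to the inlink map. */
  static int smallestDistance(List<Inlink> incoming) {
    int smallestDist = Integer.MAX_VALUE;
    for (Inlink inlink : incoming) {
      if (inlink.distance < smallestDist) {
        smallestDist = inlink.distance;
      }
    }
    return smallestDist;
  }

  public static void main(String[] args) {
    List<Inlink> incoming = new ArrayList<>();
    incoming.add(new Inlink("http://a.example/", "anchor a", 3));
    incoming.add(new Inlink("http://b.example/", "anchor b", 1));

    Map<String, String> inlinks = rebuildInlinks(new LinkedHashMap<>(), incoming);
    System.out.println("inlinks: " + inlinks);
    System.out.println("smallest inlink distance: " + smallestDistance(incoming));
  }
}
```

Keeping the map update and the distance calculation in separate methods makes
it obvious that each inlink is written only once per update.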



--
View this message in context: http://lucene.472066.n3.nabble.com/Nutch-2-2-1-missing-inbound-link-when-using-HBase-tp4111216p4112656.html