Posted to dev@nutch.apache.org by Ali Safdar Kureishy <sa...@gmail.com> on 2012/06/04 13:52:44 UTC

Questions about the "hostCount" and related variables in org.apache.nutch.crawl.Generator$Selector::reduce()

Hi,

I'm trying to understand how the Nutch Generator class works, but am
finding it very hard to understand the following lines of code, in
org.apache.nutch.crawl.Generator$Selector::reduce(), because the
comments aren't very clear and the code is intricate:

          ...

[~Line 264 onwards]
        // only filter if we are counting hosts or domains
        if (maxCount > 0) {
          int[] hostCount = hostCounts.get(hostordomain);
          if (hostCount == null) {
            hostCount = new int[] {1, 0};
            hostCounts.put(hostordomain, hostCount);
          }

          // increment hostCount
          hostCount[1]++;

          // check if topN reached, select next segment if it is
          while (segCounts[hostCount[0]-1] >= limit && hostCount[0] < maxNumSegments) {
            hostCount[0]++;
            hostCount[1] = 0;
          }

          ...

1) What is the purpose of the 2-element array 'hostCount', which is
initialized to {1, 0}? And what does each of the two index slots represent?

2) And, what is this code doing (from above)? I don't understand what the
relation is between segCounts and hostCount[0] ... nor the relation between
hostCount[0] and maxNumSegments. All of these variables appear orthogonal
to me ... unless I'm misunderstanding their use relative to each other. So,
if someone could elaborate on this, that'd be greatly appreciated.
          while (segCounts[hostCount[0]-1] >= limit && hostCount[0] < maxNumSegments) {
            hostCount[0]++;
            hostCount[1] = 0;
          }

3) Lastly, I am not clear about what effect "maxNumSegments" has on the
generate phase. Perhaps if I understood the code above I wouldn't ask this
question either...
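
For anyone who wants to step through the excerpt in isolation, the quoted
lines can be wrapped in a small stand-alone class like the one below. This is
only a sketch for tracing, not Nutch code: the class name HostCountTrace, the
method select(), and the sample hosts and values are all made up, and the
final segCounts increment is my assumption about what the elided code does
with the chosen segment.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical harness around the quoted lines; not part of Nutch.
class HostCountTrace {
    // Field names mirror the variables in the excerpt.
    final Map<String, int[]> hostCounts = new HashMap<>();
    final int[] segCounts;
    final int limit;
    final int maxNumSegments;

    HostCountTrace(int limit, int maxNumSegments) {
        this.limit = limit;
        this.maxNumSegments = maxNumSegments;
        this.segCounts = new int[maxNumSegments];
    }

    /** Runs the quoted lines for one URL and returns a copy of hostCount. */
    int[] select(String hostordomain) {
        int[] hostCount = hostCounts.get(hostordomain);
        if (hostCount == null) {
            hostCount = new int[] {1, 0};
            hostCounts.put(hostordomain, hostCount);
        }

        // increment hostCount
        hostCount[1]++;

        // same loop as in the excerpt
        while (segCounts[hostCount[0] - 1] >= limit && hostCount[0] < maxNumSegments) {
            hostCount[0]++;
            hostCount[1] = 0;
        }

        // ASSUMPTION: the elided code counts this URL against the chosen segment.
        segCounts[hostCount[0] - 1]++;
        return hostCount.clone();
    }

    public static void main(String[] args) {
        // Made-up values: at most 2 URLs per segment, 3 segments.
        HostCountTrace trace = new HostCountTrace(2, 3);
        String[] hosts = {"a.example", "a.example", "b.example", "a.example"};
        for (String host : hosts) {
            int[] hc = trace.select(host);
            System.out.println(host + " -> hostCount = {" + hc[0] + ", " + hc[1] + "}");
        }
    }
}
```

Changing the limit and maxNumSegments passed to the constructor shows when
the while loop advances hostCount[0] and resets hostCount[1].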

Many thanks in advance!

Safdar
