You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@nutch.apache.org by Junqiang Zhang <ju...@gmail.com> on 2017/05/05 03:50:26 UTC

A question regarding CrawlDbReducer

Hello,

I have a question regarding some code in
org.apache.nutch.crawl.CrawlDbReducer (Nutch version 1.13).

How does Lines 333-338 define the lessThan() function? Is arg0 with the
higher score of the two arguments defined as the less element of the two?
Which of arg0 and arg1 is poped out first?

In line 143, which element will be poped by linked.pop()? The CrawlDatum
with the highest score or the one with the lowest score?

Thanks.
Junqiang

Re: A question regarding CrawlDbReducer

Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi Junqiang,

looks like you uncovered a bug which is there since years. :( Gongrats and thanks!

See [1] for insert(element) and lessThan(...):
   Adds element to the PriorityQueue in log(size) time if either the PriorityQueue is not full,
   or not lessThan(element, top()).

In CrawlDbReducer
  lessThan(Object arg0, Object arg1)
returns true if
  arg0.getScore() > arg1.getScore()
resp. if
  element.getScore() > top().getScore()
Then the expression "not lessThan(element, top())" is false and arg0/element isn't added
although its score is larger than the least in the InlinkPriorityQueue.

The lowest ranking inlinks are kept which is in contradiction to

<property>
  <name>db.update.max.inlinks</name>
  <value>10000</value>
  <description>Maximum number of inlinks to take into account when updating
  a URL score in the crawlDB. Only the best scoring inlinks are kept.
  </description>
</property>


> In line 143, which element will be poped by linked.pop()? The CrawlDatum
> with the highest score or the one with the lowest score?

According to [2] pop()
   Removes and returns the least element of the PriorityQueue in log(size) time.

In this case the behavior is correct though I think that's unintentionally - the lessThan(...)
method "defines" the element with the highest score as the "least" one. So, the highest scoring of
the 10000 lowest scoring elements is taken first.

Of course, it should be
 "the highest scoring of the 10000 highest scoring elements"
At least, afaics. It's not a easy to understand...

Could you open an issue on
  https://issues.apache.org/jira/browse/NUTCH
to track the problem?

Thanks,
Sebastian

[1] https://hadoop.apache.org/docs/r1.2.1/api/org/apache/hadoop/util/PriorityQueue.html#insert(T)
[2] https://hadoop.apache.org/docs/r1.2.1/api/org/apache/hadoop/util/PriorityQueue.html#pop()

On 05/05/2017 05:50 AM, Junqiang Zhang wrote:
> Hello,
> 
> I have a question regarding some code in
> org.apache.nutch.crawl.CrawlDbReducer (Nutch version 1.13).
> 
> How does Lines 333-338 define the lessThan() function? Is arg0 with the
> higher score of the two arguments defined as the less element of the two?
> Which of arg0 and arg1 is poped out first?
> 
> In line 143, which element will be poped by linked.pop()? The CrawlDatum
> with the highest score or the one with the lowest score?
> 
> Thanks.
> Junqiang
>