Posted to user@nutch.apache.org by Nutch User - 1 <nu...@gmail.com> on 2011/07/12 15:25:36 UTC

A possible bug or misleading documentation

This concerns the 1.3 distribution; I don't know whether this has been
fixed in a newer revision.

From nutch-default.xml:

"
<property>
 <name>fetcher.max.crawl.delay</name>
 <value>30</value>
 <description>
 If the Crawl-Delay in robots.txt is set to greater than this value (in
 seconds) then the fetcher will skip this page, generating an error report.
 If set to -1 the fetcher will never skip such pages and will wait the
 amount of time retrieved from robots.txt Crawl-Delay, however long that
 might be.
 </description>
</property>
"

Fetcher.java:
(http://svn.apache.org/viewvc/nutch/branches/branch-1.3/src/java/org/apache/nutch/fetcher/Fetcher.java?view=markup).

Line 554 in Fetcher.java:

"
this.maxCrawlDelay = conf.getInt("fetcher.max.crawl.delay", 30) * 1000;
"

Lines 615-616 in Fetcher.java:

"
if (rules.getCrawlDelay() > 0) {
  if (rules.getCrawlDelay() > maxCrawlDelay) {
"

Now, the documentation states that if fetcher.max.crawl.delay is set to
-1, the crawler will always wait for the amount of time the Crawl-Delay
parameter specifies. However, line 554 turns that -1 into a
maxCrawlDelay of -1000, so for any positive Crawl-Delay the condition
on line 616 is always true, and the page whose Crawl-Delay is set is
skipped instead of waited for.
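
For illustration, here is a minimal standalone sketch (not the actual
Fetcher code; only the names maxCrawlDelay and crawlDelay come from the
snippets above, everything else is made up) showing why the skip
condition is always true for a negative maxCrawlDelay, and one way the
check could be guarded so that -1 really means "never skip":

"
// Standalone sketch, not the real Fetcher: names mirror the snippets
// above, the rest is illustrative only.
public class CrawlDelayCheck {

  // Mirrors line 554: -1 from the config becomes -1000 ms here.
  static long maxCrawlDelay = -1 * 1000;

  static boolean shouldSkip(long crawlDelayMs) {
    if (crawlDelayMs > 0) {
      // Current check (lines 615-616): with maxCrawlDelay == -1000 this
      // is always true, so every page with a Crawl-Delay gets skipped.
      // return crawlDelayMs > maxCrawlDelay;

      // Guarded check matching the documented behaviour: a negative
      // maxCrawlDelay means "never skip, always wait".
      return maxCrawlDelay >= 0 && crawlDelayMs > maxCrawlDelay;
    }
    return false;
  }

  public static void main(String[] args) {
    // A robots.txt Crawl-Delay of 120 seconds, expressed in milliseconds.
    System.out.println(shouldSkip(120 * 1000L)); // false with the guard
  }
}
"

With such a guard in place, setting fetcher.max.crawl.delay to -1 would
behave as the description in nutch-default.xml says.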

Re: A possible bug or misleading documentation

Posted by Nutch User - 1 <nu...@gmail.com>.
On 07/12/2011 04:34 PM, Julien Nioche wrote:
> Please open an issue on JIRA (https://issues.apache.org/jira/browse/NUTCH)
>
> Thanks

It's now here: (https://issues.apache.org/jira/browse/NUTCH-1042).

Re: A possible bug or misleading documentation

Posted by Julien Nioche <li...@gmail.com>.
Please open an issue on JIRA (https://issues.apache.org/jira/browse/NUTCH)

Thanks


-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com