
Posted to user@nutch.apache.org by Yossi Tamari <yo...@pipl.com> on 2017/12/04 18:47:34 UTC

crawlcomplete

Hi,

I'm trying to understand some of the design decisions behind the
crawlcomplete tool. I find the concept itself very useful, but there are a
couple of behaviors that I don't understand:

1. URLs that resulted in a redirect (even a permanent one) are counted as
unfetched. So if my crawl had only one URL, and that URL returned a
redirect whose target was fetched successfully, crawlcomplete would report
1 FETCHED and 1 UNFETCHED, and there is no inherent way for me to know that
my crawl is really 100% complete. I would expect URLs that resulted in a
redirect either not to be counted at all (as they have been replaced by new
URLs), or to be counted in a separate group (which can then be ignored).
2. URLs that are db_gone are also counted as unfetched. It seems to me
these URLs were "successfully" crawled. It's the reality of the web that
pages disappear over time, and knowing that this happened is useful. These
URLs do not need to be crawled again, so they should not be counted as
unfetched. I can see why counting them as FETCHED would be confusing, so
maybe the groups should be renamed (COMPLETE and INCOMPLETE), or a new
group (GONE) added.
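
To make the grouping suggested in the two points above concrete, here is a minimal, self-contained sketch. It is not the code of the crawlcomplete tool and does not use any Nutch classes: the class name, the Group enum and both methods are hypothetical, and the status strings are assumed to be the usual CrawlDb status names (db_fetched, db_redir_perm, db_gone and so on). Treating db_notmodified as complete is also an assumption.

```java
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of the proposed grouping; not part of Nutch. */
public class CrawlCompleteSketch {

    /** Proposed reporting groups (names are illustrative). */
    enum Group { COMPLETE, INCOMPLETE, REDIRECTED, GONE }

    /** Maps a CrawlDb status name to one of the proposed groups. */
    static Group classify(String status) {
        switch (status) {
            case "db_fetched":
            case "db_notmodified": // assumption: unchanged pages count as done
                return Group.COMPLETE;
            case "db_redir_temp":
            case "db_redir_perm":
                // Replaced by a new URL, so reported separately.
                return Group.REDIRECTED;
            case "db_gone":
                // Crawled "successfully": we know the page has disappeared.
                return Group.GONE;
            default:
                // Everything else (e.g. db_unfetched) is lumped together
                // here to keep the sketch short.
                return Group.INCOMPLETE;
        }
    }

    /** Counts how many of the given statuses fall into each group. */
    static Map<Group, Integer> count(List<String> statuses) {
        Map<Group, Integer> counts = new EnumMap<>(Group.class);
        for (Group g : Group.values()) {
            counts.put(g, 0);
        }
        for (String s : statuses) {
            counts.merge(classify(s), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // The single-URL crawl from point 1: the seed URL redirected
        // permanently and the redirect target was fetched.
        System.out.println(count(List.of("db_redir_perm", "db_fetched")));
    }
}
```

With this grouping, the single-URL crawl from point 1 has nothing in INCOMPLETE, so it can be recognised as 100% complete, and db_gone URLs from point 2 are reported without being mistaken for pending work.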

Are there good reasons for the current behavior?

Yossi.

Re: crawlcomplete

Posted by Semyon Semyonov <se...@mail.com>.

A third question could be:
3. We now have hostdb, which stores all statistics per host, and which you can both read from and write to. Does it make sense to have both for reporting?
