Posted to user@nutch.apache.org by Mathias Conradt <ma...@gmail.com> on 2008/06/25 03:14:14 UTC

URLs not crawled in order (referring to URL list)

I created my URL list file from my Google sitemap, with all URLs in it, and
then set the crawl depth to 1, since I don't want the crawler to follow any
sublinks.
Looking at the log, I found that the crawler doesn't work through the URL list
line by line, but in a seemingly random order. Is there a reason for this?
Or do I actually have to set the depth to 0 instead of 1?

(Because the crawling process takes a while, I wanted to use the log to check
which URL the crawler is currently on, but couldn't do it.)
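[If the goal is to keep the fetch order close to the input list, one option is to run with a single fetcher thread. This is only a sketch, assuming the legacy one-shot `bin/nutch crawl` command and its `-depth`/`-threads` options, which vary between releases; and note that even with one thread the Generator may still reorder URLs, e.g. by score or by host, so strict input order is not guaranteed:]

```shell
# Sketch only: "urls" is the directory holding the seed list, "crawl" the
# output directory. -depth 1 fetches only the injected URLs; -threads 1
# uses a single fetcher thread. Check your version's usage output first.
bin/nutch crawl urls -dir crawl -depth 1 -threads 1
```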

-- 
View this message in context: http://www.nabble.com/URLs-not-crawled-in-order-%28referring-to-URL-list%29-tp18103118p18103118.html
Sent from the Nutch - User mailing list archive at Nabble.com.


Re: URLs not crawled in order (referring to URL list)

Posted by Mathias Conradt <ma...@gmail.com>.
Thanks for your quick reply.
Oh yes, I see it now: it runs with 10 threads.



Winton Davies-3 wrote:
> 
> This is probably due either to the injection process (it takes 
> the list of URLs and inserts them into the fetchlist, presumably 
> introducing some sort order other than your file's line order), 
> or to the fact that you have multiple threads (which fetch 
> asynchronously), or both. I don't know what the default number of 
> threads is, but either way, this would account for it.
> 
> 

-- 
View this message in context: http://www.nabble.com/URLs-not-crawled-in-order-%28referring-to-URL-list%29-tp18103118p18103584.html
Sent from the Nutch - User mailing list archive at Nabble.com.


Re: URLs not crawled in order (referring to URL list)

Posted by Winton Davies <wd...@cs.stanford.edu>.
This is probably due either to the injection process (it takes 
the list of URLs and inserts them into the fetchlist, presumably 
introducing some sort order other than your file's line order), 
or to the fact that you have multiple threads (which fetch 
asynchronously), or both. I don't know what the default number of 
threads is, but either way, this would account for it.
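[The thread-related half of this explanation can be illustrated with a minimal sketch in plain Python (not Nutch code, and with no real network access): several worker threads pull URLs from a shared queue, so the order in which fetches complete need not match the input order.]

```python
import threading
import queue

# Hypothetical input list standing in for the seed URLs.
urls = [f"http://example.com/page{i}" for i in range(20)]

todo = queue.Queue()
for u in urls:
    todo.put(u)

fetched = []           # completion order, as a log would record it
lock = threading.Lock()

def worker():
    """Pull URLs until the queue is empty; 'fetch' is simulated."""
    while True:
        try:
            url = todo.get_nowait()
        except queue.Empty:
            return
        # A real fetcher would do network I/O here, with variable latency,
        # which is exactly what scrambles the completion order.
        with lock:
            fetched.append(url)
        todo.task_done()

# Ten concurrent workers, matching the thread count seen in the log.
threads = [threading.Thread(target=worker) for _ in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Every URL is fetched exactly once, but the completion order is not
# guaranteed to match the input order.
print(len(fetched), len(fetched) == len(urls))
```

[Every URL is processed exactly once, so out-of-order log lines are harmless; they just make it hard to tell "how far along" the crawl is from the log alone.]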

Winton


>I created my URL list file from my Google sitemap, with all URLs in it, and
>then set the crawl depth to 1, since I don't want the crawler to follow any
>sublinks.
>Looking at the log, I found that the crawler doesn't work through the URL list
>line by line, but in a seemingly random order. Is there a reason for this?
>Or do I actually have to set the depth to 0 instead of 1?
>
>(Because the crawling process takes a while, I wanted to use the log to check
>which URL the crawler is currently on, but couldn't do it.)