Posted to user@nutch.apache.org by Marek Bachmann <m....@uni-kassel.de> on 2011/07/20 15:05:13 UTC

crawling in any depth until no new pages were found

Hi all,

does anyone have suggestions on how I could solve the following task?

I want to crawl a sub-domain of our network completely. So far I have 
always done this by running multiple fetch / parse / update cycles 
manually. After a few cycles I checked whether there were unfetched pages 
in the crawldb; if so, I started the cycle over again and repeated that 
until no new pages were discovered.
That is getting tedious, so I am looking for a way to run these steps 
automatically until no unfetched pages are left.

Any ideas?
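
For reference, one manual round of that cycle with the stock Nutch 1.x 
command line looks roughly like the sketch below; crawl/crawldb and 
crawl/segments are only example paths, not something taken from my setup.

  # one manual round (sketch only; example paths, standard Nutch 1.x commands)
  bin/nutch generate crawl/crawldb crawl/segments
  segment=$(ls -d crawl/segments/* | sort | tail -n 1)   # newest segment
  bin/nutch fetch "$segment"
  bin/nutch parse "$segment"
  bin/nutch updatedb crawl/crawldb "$segment"
  # afterwards, check whether anything is still unfetched
  bin/nutch readdb crawl/crawldb -stats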

Re: crawling in any depth until no new pages were found

Posted by Marek Bachmann <m....@uni-kassel.de>.
Hi Lewis,

On 20.07.2011 23:08, lewis john mcgibbney wrote:
> Hi Marek,
>
> As we're talking about automating the task, we're immediately looking at
> implementing a bash script. In the situation we have described, we wish
> Nutch to adopt breadth-first search (BFS) behaviour when crawling. Between
> us, can we suggest any best-practice methods relating to BFS?

Uh, I don't think I can help there offhand. I just don't have enough 
experience with Nutch yet.
I think we would get a BFS if we could always fetch, in one fetch cycle, 
all the pages that were discovered in the preceding parse step. I should 
think about that for a while; at the moment I don't believe it would be 
hard to realize.
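
One way to approximate that with the existing tools, I think, would be to 
generate each segment without a -topN limit, so that everything currently 
due ends up in a single fetch list (just a sketch, with example paths):

  # BFS-ish: no -topN, so the whole current frontier lands in one segment
  bin/nutch generate crawl/crawldb crawl/segments
  # compare a capped fetch list, which breaks the one-level-per-cycle idea:
  # bin/nutch generate crawl/crawldb crawl/segments -topN 1000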

> As you have highlighted, we can check the crawldb after every updatedb
> command to determine whether there are any URLs with db_unfetched status,
> and ideally we wish to continue until that number drops to zero, whether we
> dump the stats or read them via stdout. I would suggest that we discuss a
> method for obtaining the db_unfetched value and creating a loop based on
> whether or not it is 0. Is this possible?

I have decided to write a little script that does exactly this task. 
I'll have to see if it works without too much grepping of stdout. Maybe it 
is enough to check whether the generator fails because there are no URLs 
to be fetched, but that could cause problems once it starts generating 
URLs that are due to be refetched.
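
The rough shape I have in mind is something like the untested sketch below 
(example paths only; the stats check is exactly the kind of grep over 
stdout mentioned above):

  # untested sketch: cycle until the crawldb reports no db_unfetched URLs
  while bin/nutch generate crawl/crawldb crawl/segments; do  # non-zero exit = no fetch list
      segment=$(ls -d crawl/segments/* | sort | tail -n 1)
      bin/nutch fetch "$segment"
      bin/nutch parse "$segment"
      bin/nutch updatedb crawl/crawldb "$segment"
      unfetched=$(bin/nutch readdb crawl/crawldb -stats 2>/dev/null \
                  | grep db_unfetched | awk '{print $NF}')
      [ "${unfetched:-0}" -eq 0 ] && break
  done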

I'll do that within the next week and will keep you up to date :-)

On Wed, Jul 20, 2011 at 2:05 PM, Marek Bachmann <m....@uni-kassel.de> wrote:
>
>> Hi all,
>>
>> does anyone have suggestions on how I could solve the following task?
>>
>> I want to crawl a sub-domain of our network completely. So far I have
>> always done this by running multiple fetch / parse / update cycles
>> manually. After a few cycles I checked whether there were unfetched pages
>> in the crawldb; if so, I started the cycle over again and repeated that
>> until no new pages were discovered.
>> That is getting tedious, so I am looking for a way to run these steps
>> automatically until no unfetched pages are left.
>>
>> Any ideas?
>>
>
>
>


Re: crawling in any depth until no new pages were found

Posted by lewis john mcgibbney <le...@gmail.com>.
Hi Marek,

As we're talking about automating the task, we're immediately looking at
implementing a bash script. In the situation we have described, we wish
Nutch to adopt breadth-first search (BFS) behaviour when crawling. Between
us, can we suggest any best-practice methods relating to BFS?

As you have highlighted, we can check the crawldb after every updatedb
command to determine whether there are any URLs with db_unfetched status,
and ideally we wish to continue until that number drops to zero, whether we
dump the stats or read them via stdout. I would suggest that we discuss a
method for obtaining the db_unfetched value and creating a loop based on
whether or not it is 0. Is this possible?
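
For obtaining the db_unfetched value itself, something along these lines 
might do; it assumes the "status 1 (db_unfetched)" line that readdb -stats 
prints in 1.x, and crawl/crawldb is only an example path:

  # sketch: pull the db_unfetched count out of the crawldb statistics
  unfetched=$(bin/nutch readdb crawl/crawldb -stats 2>/dev/null \
              | grep db_unfetched | awk '{print $NF}')
  echo "unfetched URLs: ${unfetched:-0}"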

On Wed, Jul 20, 2011 at 2:05 PM, Marek Bachmann <m....@uni-kassel.de> wrote:

> Hi all,
>
> does anyone have suggestions on how I could solve the following task?
>
> I want to crawl a sub-domain of our network completely. So far I have
> always done this by running multiple fetch / parse / update cycles
> manually. After a few cycles I checked whether there were unfetched pages
> in the crawldb; if so, I started the cycle over again and repeated that
> until no new pages were discovered.
> That is getting tedious, so I am looking for a way to run these steps
> automatically until no unfetched pages are left.
>
> Any ideas?
>



-- 
*Lewis*

Re: crawling in any depth until no new pages were found

Posted by Markus Jelsma <ma...@openindex.io>.
You don't need to check manually if you use the generator's return code. It 
returns a non-zero value if no fetch-list is generated, which usually happens 
when there's nothing left to crawl at the moment.
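
In a script this boils down to checking the exit status of the generate 
step, e.g. (sketch, example paths):

  # sketch: stop the cycle when generate reports an empty fetch list (non-zero exit)
  if ! bin/nutch generate crawl/crawldb crawl/segments; then
      echo "no fetch list generated -- nothing left to crawl right now"
      exit 0
  fi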

> Hi all,
> 
> does anyone have suggestions on how I could solve the following task?
> 
> I want to crawl a sub-domain of our network completely. So far I have
> always done this by running multiple fetch / parse / update cycles
> manually. After a few cycles I checked whether there were unfetched pages
> in the crawldb; if so, I started the cycle over again and repeated that
> until no new pages were discovered.
> That is getting tedious, so I am looking for a way to run these steps
> automatically until no unfetched pages are left.
> 
> Any ideas?