Posted to user@nutch.apache.org by Sourajit Basak <so...@gmail.com> on 2012/08/12 17:55:19 UTC

limit nutch to all pages within a certain domain

How do I limit Nutch to crawl only certain domains?

For example, let's say I have 2 domains. I put the following in a text file
and inject it into the crawldb:

http://www.domain1.com
http://name.domain2.com

Now, I wish to crawl all pages only in the above 2 domains.

To do that, I added these rules to the regex URL filter (conf/regex-urlfilter.txt):

+^http://www\.domain1\.com
+^http://name\.domain2\.com

However, it seems to crawl only the topmost (home) page of each of these
domains. How do I visit all the inner pages?
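For reference, a minimal sketch of a domain-restricted conf/regex-urlfilter.txt: rules are applied first-match-wins, so the stock catch-all `+.` line at the end of the default file must be removed or replaced, otherwise every URL is accepted regardless of the rules above it.

```
# conf/regex-urlfilter.txt (sketch): accept only the two domains,
# reject everything else. Rules are evaluated top to bottom; the
# sign (+/-) of the first matching regex decides.
+^http://www\.domain1\.com
+^http://name\.domain2\.com
# reject anything that did not match above
-.
```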

Re: limit nutch to all pages within a certain domain

Posted by Sourajit Basak <so...@gmail.com>.
I think you mean generate-fetch-*parse*-update cycles. Per my
understanding, the 'parse' phase is what discovers the outlinks at each
step of the iteration.

I will try increasing the topN value at each step of the iteration.

However ....

Let's say the domains being crawled are updated frequently, as on news
websites. The home page and the hub pages (i.e. the pages for the different
sections) will change, but the individual article/story pages will not;
they are simply replaced by new links on those hub pages. So, if I set
db.fetch.interval.default to 1 day, the home and hub pages will work fine;
but won't the older story pages be fetched again as well?
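As a sketch, the 1-day interval mentioned above would be set in conf/nutch-site.xml like this (the value is in seconds; the shipped default in nutch-default.xml is 2592000, i.e. 30 days):

```xml
<!-- conf/nutch-site.xml (sketch): re-fetch pages after 1 day -->
<property>
  <name>db.fetch.interval.default</name>
  <value>86400</value> <!-- 24 * 60 * 60 seconds -->
</property>
```

Note this interval applies to every page in the crawldb alike, which is exactly why the question about the old story pages matters.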


On Mon, Aug 13, 2012 at 12:05 AM, Sebastian Nagel <wastl.nagel@googlemail.com> wrote:

Re: limit nutch to all pages within a certain domain

Posted by Sebastian Nagel <wa...@googlemail.com>.
On 08/12/2012 07:14 PM, Sourajit Basak wrote:
> Do I need to carry out this iteration several times to crawl all the
> domains satisfactorily?
Yes, you have to loop over generate-fetch-update cycles. In trunk there is
a script src/bin/crawl which does this.

> These domains may not link among themselves; this is just a way to group
> related websites together. So if I assume each domain has on average (at
> most) 100 links per page, and I have 5 domains, do I need to set
> topN = 5 * 100 during each 'generate' phase?
For large sites you can take more, because the growth is exponential:
the 100 pages of the second cycle theoretically have 10000 outlinks.
In practice many targets are shared, so you'll get far fewer outlinks.
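As a back-of-envelope sketch of those numbers, using the thread's assumed figures of 5 seed domains and ~100 outlinks per page (real counts will be far lower, since many link targets are shared):

```shell
# Theoretical upper bounds on the crawl frontier per cycle,
# assuming every outlink is unique (in practice they are not).
seeds=5                               # seed URLs, one per domain
links_per_page=100                    # assumed outlinks per page
cycle1=$((seeds * links_per_page))    # pages discoverable after cycle 1
cycle2=$((cycle1 * links_per_page))   # theoretical upper bound after cycle 2
echo "cycle 1: $cycle1  cycle 2: $cycle2"
```

Per domain that is 100 * 100 = 10000 second-cycle outlinks, matching the figure above; across 5 domains the theoretical total is 50000, which is why a topN of a few hundred needs several cycles.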



Re: limit nutch to all pages within a certain domain

Posted by Sourajit Basak <so...@gmail.com>.
Do I need to carry out this iteration several times to crawl all the
domains satisfactorily?

These domains may not link among themselves; this is just a way to group
related websites together. So if I assume each domain has on average (at
most) 100 links per page, and I have 5 domains, do I need to set
topN = 5 * 100 during each 'generate' phase?

On Sun, Aug 12, 2012 at 10:27 PM, Sebastian Nagel <wastl.nagel@googlemail.com> wrote:

Re: limit nutch to all pages within a certain domain

Posted by Sebastian Nagel <wa...@googlemail.com>.
> However, how is topN determined?
It's just the top N unfetched pages, sorted by decreasing score.
Pages will be re-fetched only after some larger amount of time,
30 days by default; see the property db.fetch.interval.default.

> If I am crawling inside a domain, there will be links from almost every
> inner page to the menu items. Wouldn't that increase the score of the
> menu/navigation items?
Yes. And that's what you expect. These pages are hubs containing many
outlinks. So you want to re-fetch them first to detect links to new pages.

>> How do I limit Nutch to crawl only certain domains?
You did it right. But you need time to get all pages fetched.

Sebastian

On 08/12/2012 06:29 PM, Sourajit Basak wrote:


Re: limit nutch to all pages within a certain domain

Posted by Sourajit Basak <so...@gmail.com>.
I proceeded like this ..

1. inject the urls
2. run generate
3. run fetch
4. run parse
5. run generate with topN 1000
.. repeat 3 & 4
...
6. run generate with topN 1000

This seems to be fetching the inner pages. However, how is topN determined?
If I am crawling inside a domain, there will be links from almost every
inner page to the menu items. Wouldn't that increase the score of the
menu/navigation items?
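For reference, the numbered steps above correspond roughly to this Nutch 1.x command sequence. This is a sketch only: the directory names, seed directory, topN value, and fixed round count are illustrative. The updatedb step (not listed in the steps above) is what feeds newly parsed outlinks back into the crawldb so that the next generate can pick them up.

```shell
# Sketch of the inject + generate/fetch/parse/updatedb loop (Nutch 1.x).
bin/nutch inject crawl/crawldb urls/    # urls/ holds the seed text file
for round in 1 2 3; do
  bin/nutch generate crawl/crawldb crawl/segments -topN 1000
  segment=$(ls -d crawl/segments/* | tail -1)   # newest segment
  bin/nutch fetch "$segment"
  bin/nutch parse "$segment"
  bin/nutch updatedb crawl/crawldb "$segment"
done
```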

On Sun, Aug 12, 2012 at 9:25 PM, Sourajit Basak <so...@gmail.com> wrote: