Posted to user@nutch.apache.org by brainstorm <br...@gmail.com> on 2008/07/02 19:42:38 UTC

Maximum links limit per domain

Hi,

Is there a configurable setting to limit the number of links to fetch
for each domain in Nutch? I'm not referring to the topN setting,
which sorts a fetchlist using the Lucene (or Nutch) scoring mechanism.

In other words, I just want to fetch 1500 links from every
"whatever.com" domain, *including* all of its subdomains.

Example:

upc.edu: 1500-link hard-limit counter
escert.upc.edu: *new* 1500-link hard-limit counter
ac.upc.edu: *new* 1500-link hard-limit counter

*or* a shared hard limit per domain:

*.upc.edu: 1500-link hard limit

Thanks in advance!

Re: Maximum links limit per domain

Posted by brainstorm <br...@gmail.com>.
Thanks for your support, Dennis :)

I've been told that db.max.outlinks.per.page is precisely what we
want, so there's no need for the other option right now.
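
For the record, a minimal nutch-site.xml override for that property
might look like the snippet below. The property name is the stock Nutch
one; the 1500 value is just the figure from my original mail. Note that,
as Dennis points out below, this caps outlinks per *page*, not per
domain.

<property>
  <name>db.max.outlinks.per.page</name>
  <!-- max outlinks processed per page; a negative value means no limit -->
  <value>1500</value>
</property>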

Roman

Re: Maximum links limit per domain

Posted by Dennis Kubes <ku...@apache.org>.
The db.max.outlinks.per.page property defines how many links are grabbed
from a given URL. There isn't a way to grab by domain, including
subdomains, without writing a custom tool.

I am almost finished with a new scoring framework that creates a
webgraphdb. Part of that process is storing outlinks. It stores only
the URL though, not the anchor text. It would be fairly simple to write
a tool that takes that output and processes it for domains and
subdomains. Let me know if you want a current patch and I will send it
to you.
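
As a rough illustration (not Nutch code -- the class name, the 1500
constant, and the naive domain heuristic are all made up for this
sketch), a post-processing pass over those stored outlink URLs could
look like:

import java.net.URI;
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical sketch: cap the number of URLs kept per domain, with
 * all subdomains sharing one counter (the "*.upc.edu" case).
 */
public class DomainLimiter {

    private static final int MAX_PER_DOMAIN = 1500;
    private final Map<String, Integer> counts = new HashMap<String, Integer>();

    /**
     * Naive registered-domain heuristic: keep the last two host labels.
     * Good enough for upc.edu, but a real tool should consult the
     * public suffix list (example.co.uk would be handled wrongly here).
     */
    static String registeredDomain(String url) throws Exception {
        String host = new URI(url).getHost();
        String[] labels = host.split("\\.");
        int n = labels.length;
        return n <= 2 ? host : labels[n - 2] + "." + labels[n - 1];
    }

    /**
     * Returns true while the URL's domain is still under the hard limit.
     * For a *separate* counter per subdomain, key on the full host
     * instead of registeredDomain(url).
     */
    public boolean accept(String url) {
        try {
            String domain = registeredDomain(url);
            Integer seen = counts.get(domain);
            int n = (seen == null) ? 1 : seen + 1;
            counts.put(domain, n);
            return n <= MAX_PER_DOMAIN;
        } catch (Exception e) {
            return false; // unparsable URL: drop it
        }
    }
}

Feeding the outlink dump through accept() in order would keep the first
1500 URLs per domain and drop the rest.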

Dennis
