Posted to user@nutch.apache.org by MilleBii <mi...@gmail.com> on 2009/08/25 23:48:07 UTC

Limiting number of URL from the same site in a fetch cycle

I'm wondering whether there is a setting that limits the number of
URLs per site in a single fetch list, as opposed to a limit on the
site as a whole.
That way I could avoid a long tail of URLs from the same site at the end
of a fetch list, which takes very long to fetch (5s per URL). I'd rather
fetch those URLs in the next cycle.

-- 
-MilleBii-

Re: Limiting number of URL from the same site in a fetch cycle

Posted by MilleBii <mi...@gmail.com>.
OK, thanks, looks great.



-- 
-MilleBii-

RE: Limiting number of URL from the same site in a fetch cycle

Posted by Fuad Efendi <fu...@efendi.ca>.
Probably this is suitable:


<property>
  <name>generate.max.per.host</name>
  <value>-1</value>
  <description>The maximum number of urls per host in a single
  fetchlist.  -1 if unlimited.</description>
</property>


The generate command also accepts this option:

[-topN N] - Number of top URLs to be selected
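
To make the effect of a per-host cap concrete, here is a rough Python sketch of the idea. It is a toy model, not Nutch's Generator code: the function name build_fetch_list, its parameters, and the example.com-style URLs are all made up for illustration. It only assumes what the property description above says, namely that -1 means unlimited and that the cap applies to a single fetch list.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def build_fetch_list(candidates, max_per_host=-1, top_n=None):
    """Split scored URLs into (selected, deferred) for one fetch cycle.

    candidates:   iterable of (url, score) pairs (hypothetical input shape).
    max_per_host: cap per host in this fetch list; -1 means unlimited.
    top_n:        overall cap on the fetch list; None means no overall cap.
    """
    per_host = defaultdict(int)
    selected, deferred = [], []
    # Highest-scoring URLs get first claim on each host's quota.
    for url, _score in sorted(candidates, key=lambda c: c[1], reverse=True):
        host = urlsplit(url).hostname
        list_full = top_n is not None and len(selected) >= top_n
        host_full = max_per_host >= 0 and per_host[host] >= max_per_host
        if list_full or host_full:
            # Not dropped: the URL is simply left for a later cycle.
            deferred.append(url)
        else:
            per_host[host] += 1
            selected.append(url)
    return selected, deferred
```

With max_per_host=2, three URLs from one host and one URL from another host, the lowest-scoring URL of the busy host ends up in deferred and the other three are selected. Nothing is discarded, which is the "fetch them on the next cycle" behaviour asked for.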






Re: Limiting number of URL from the same site in a fetch cycle

Posted by MilleBii <mi...@gmail.com>.
Won't db.max.outlinks.per.page result in missing links? I don't want that.
I would just like to defer those URLs to the next fetch cycle.






-- 
-MilleBii-

RE: Limiting number of URL from the same site in a fetch cycle

Posted by Fuad Efendi <fu...@efendi.ca>.
You can filter out some of the unnecessary "tail" with a UrlFilter; for instance, some sites have long forums which you don't need, or shopping-cart / checkout pages which they forgot to restrict via robots.txt...

Check regex-urlfilter.txt.template in /conf
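
For anyone unfamiliar with that file, here is a small Python sketch of how such rules behave, assuming the usual convention of one rule per line with a leading + (accept) or - (reject), where the first matching rule decides and a URL matching no rule is rejected. The rules and URLs below are hypothetical, and Python's re module stands in for the Java regex engine Nutch actually uses, so treat it as an approximation.

```python
import re


def parse_rules(text):
    """Parse '+regex' / '-regex' lines; blank lines and '#' comments are skipped."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in ("+", "-"):
            raise ValueError(f"rule must start with + or -: {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """The first rule whose pattern is found in the URL decides the outcome."""
    for allow, pattern in rules:
        if pattern.search(url):
            return allow
    return False  # no rule matched: the URL is ignored


# Hypothetical rules: skip forum and cart/checkout pages, accept everything else.
RULES = parse_rules(r"""
-/forum/
-/(cart|checkout)/
+.
""")
```

Note that a URL rejected this way is never fetched at all, so this only helps for pages you really don't want, not for pages you want to postpone.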


Another parameter which evens out the number of URLs per site is db.max.outlinks.per.page=100 (some sites may have 10 links per page, others 1000...)
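
As I understand this setting, it truncates the outlinks taken from each parsed page rather than postponing them, and a negative value means no limit. A minimal Python sketch of that behaviour follows; the function name limit_outlinks is made up, and this is not Nutch's parser code.

```python
def limit_outlinks(outlinks, max_outlinks=100):
    """Return (kept, dropped) for the outlinks of a single page.

    A negative max_outlinks keeps every link. Otherwise only the first
    max_outlinks links are kept and the rest are not recorded for this page.
    """
    links = list(outlinks)
    if max_outlinks < 0:
        return links, []
    return links[:max_outlinks], links[max_outlinks:]
```

The dropped links are not queued for a later cycle. They only enter the crawl db if some other page links to them, so this is a different trade-off from a per-host cap on the fetch list.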


-Fuad
http://www.linkedin.com/in/liferay
http://www.tokenizer.org


