Posted to user@nutch.apache.org by Diaa Abdallah <di...@gmail.com> on 2014/05/11 17:00:58 UTC

How to generate equal number of pages per host

When crawling multiple websites, how can I make sure that Nutch generates a
fetch list that covers multiple hosts rather than URLs from a single host?

For example:
Let's say we have 3 websites with the following numbers of discovered pages:
a.com 100
b.com 100
c.com 100

When I generate with topN 30, I'd like these 30 URLs to be split evenly
across the hosts, so that it would take:
10 from a.com
10 from b.com
10 from c.com

Rather than taking all 30 from a.com.

This happens when the pages from a.com have a better score.
The harm is that if the fetch list contains only a.com pages, the crawl has
lower throughput: because there is a delay between successive requests to
the same host, it takes longer to retrieve 30 pages from a.com alone than 10
from each of the three hosts.

Regards,
Diaa

Re: How to generate equal number of pages per host

Posted by Diaa Abdallah <di...@gmail.com>.
Hi Talat,
I think it's worth doing, since it would speed up crawling significantly.
I will see what I can do and will open a JIRA issue once I have something.

Thanks,
Diaa


On Mon, May 12, 2014 at 12:22 AM, Talat Uyarer <ta...@uyarer.com> wrote:

> Hi Diaa,
>
> Good question, but right now that is not possible. When you use the topN
> parameter, Nutch selects URLs from a list ordered by score. If you want to
> take the same number from each host, you can use different webpage tables.
> Or, if you are willing to develop this feature for Nutch, I can help you.
>
> Talat

Re: How to generate equal number of pages per host

Posted by Julien Nioche <li...@gmail.com>.
Hi

There are 2 parameters that exist exactly for this purpose
(generate.max.count and generate.count.mode). Look at nutch-default.xml
for their descriptions. They are available in both versions of Nutch
(1.x and 2.x) and allow you to set a maximum number of URLs from the same
host/domain/IP in a fetchlist.
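
As a rough sketch of what that could look like, assuming the overrides go
inside the <configuration> element of conf/nutch-site.xml (the usual place
for overriding nutch-default.xml), and using a cap of 10 purely to match the
example in this thread:

```xml
<!-- Illustrative values only; check the property names and descriptions in
     the nutch-default.xml shipped with your Nutch version. -->
<property>
  <name>generate.max.count</name>
  <!-- Assumed cap: at most 10 URLs per counted unit in one fetchlist -->
  <value>10</value>
</property>
<property>
  <name>generate.count.mode</name>
  <!-- Count URLs per host, as in the a.com / b.com / c.com example -->
  <value>host</value>
</property>
```

Note that this is a cap rather than an exact quota: with topN 30 it should
keep any one host from supplying more than 10 URLs to a fetchlist, but it
does not force every host to contribute 10.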

Julien



-- 

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble

Re: How to generate equal number of pages per host

Posted by Talat Uyarer <ta...@uyarer.com>.
Hi Diaa,

Good question, but right now that is not possible. When you use the topN
parameter, Nutch selects URLs from a list ordered by score. If you want to
take the same number from each host, you can use different webpage tables.
Or, if you are willing to develop this feature for Nutch, I can help you.

Talat



-- 
Talat UYARER
Website: http://talat.uyarer.com
Twitter: http://twitter.com/talatuyarer
Linkedin: http://tr.linkedin.com/pub/talat-uyarer/10/142/304