Posted to user@nutch.apache.org by Mateusz Zakarczemny <ma...@up2data.pl> on 2014/02/17 16:14:14 UTC

Setting different fetch interval for some pages

Hi,

I'm going to crawl a set of news sites. Pages on those sites can be
divided into two types: category pages and article pages. I would like to
fetch category pages more frequently than article pages. The list of
categories is fairly fixed, so I could mark them manually.

I know I could get similar behaviour using AdaptiveFetchSchedule, but it
requires some time to adjust the fetch time. This doesn't satisfy me because
I already know before the fetch how often pages should be recrawled.

I wonder if it is possible in Nutch to set different fetch intervals for
sites. I know that I could extend AbstractFetchSchedule and implement this
behaviour manually. This would require adding an extra field to the WebPage
object that indicates what type of page we are dealing with. Is it possible
to add such a field to the WebPage object? Maybe there is another approach?

Regards,
Mateusz
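
The idea in the question (the interval is known before the fetch, from a manually maintained list of category URLs) could be sketched roughly as below. This is a standalone illustration only: it does not use Nutch classes, and a real version would presumably live in a subclass of AbstractFetchSchedule, whose method signatures differ between Nutch 1.x and 2.x. The class name, the example URLs and both interval values are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Standalone sketch: picks a fetch interval from a manually curated set of category URLs. */
final class PageTypeIntervals {
    // Hypothetical values: category pages every hour, article pages every 30 days.
    static final int CATEGORY_INTERVAL_SECONDS = 3600;
    static final int ARTICLE_INTERVAL_SECONDS = 30 * 24 * 3600;

    private final Set<String> categoryUrls;

    PageTypeIntervals(Set<String> categoryUrls) {
        // Defensive copy so later changes to the caller's set do not leak in.
        this.categoryUrls = new HashSet<>(categoryUrls);
    }

    /** Returns the short interval for marked category URLs, the long one otherwise. */
    int intervalFor(String url) {
        return categoryUrls.contains(url)
                ? CATEGORY_INTERVAL_SECONDS
                : ARTICLE_INTERVAL_SECONDS;
    }

    public static void main(String[] args) {
        PageTypeIntervals intervals = new PageTypeIntervals(
                new HashSet<>(Arrays.asList("http://news.example.com/sport/")));
        System.out.println(intervals.intervalFor("http://news.example.com/sport/"));
        System.out.println(intervals.intervalFor("http://news.example.com/sport/some-article.html"));
    }
}
```

An exact-match set is the simplest choice when the category list is short and fixed; prefix or regex matching would be an alternative if category URLs follow a pattern.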

RE: Setting different fetch interval for some pages

Posted by Markus Jelsma <ma...@openindex.io>.
Hi - it uses a number of features, including URL length, number of list items on the page, number of paragraphs, extracted text size, number of hyperlinks, whether the URL ends with a digit (optional slash included iirc), similarity of list items and more. It works fairly well, but some pages will fool the classifier. You can also do other interesting things with the score, such as placing high-scoring pages at the end of the search result set.
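
A few of the features listed above could be computed along these lines. This is only a rough, standalone sketch and not the actual plugin code: the class and method names are made up, the tag counting uses a naive regex rather than a real HTML parser, and the SVM itself is left out.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Standalone sketch of some hub-page features; all names here are hypothetical. */
final class HubFeatures {
    // A digit as the last character of the URL, allowing one optional trailing slash.
    private static final Pattern ENDS_WITH_DIGIT = Pattern.compile(".*\\d/?$");

    static int urlLength(String url) {
        return url.length();
    }

    static boolean endsWithDigit(String url) {
        return ENDS_WITH_DIGIT.matcher(url).matches();
    }

    /** Counts opening tags with the given name, case-insensitively (naive regex, not a parser). */
    static int countTags(String html, String tagName) {
        Pattern p = Pattern.compile(
                "<" + Pattern.quote(tagName) + "(\\s[^>]*)?>", Pattern.CASE_INSENSITIVE);
        Matcher m = p.matcher(html);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String url = "http://news.example.com/article/12345/";
        String html = "<ul><li>A</li><li>B</li></ul><p>text</p><a href=\"/x\">link</a>";
        System.out.println("urlLength=" + urlLength(url));
        System.out.println("endsWithDigit=" + endsWithDigit(url));
        System.out.println("listItems=" + countTags(html, "li"));
        System.out.println("paragraphs=" + countTags(html, "p"));
        System.out.println("hyperlinks=" + countTags(html, "a"));
    }
}
```

In a real parse filter the counts would more likely come from the parsed DOM than from regexes over raw HTML.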
 

Re: Setting different fetch interval for some pages

Posted by Mateusz Zakarczemny <ma...@up2data.pl>.
Markus, as far as I can see there is no CrawlDatum in Nutch 2.1. However, it
is an interesting approach. What factors does your classifier consider to
detect hub pages? Does it parse URLs or count outlinks?



Re: Setting different fetch interval for some pages

Posted by Mateusz Zakarczemny <ma...@up2data.pl>.
As Jorge said, the interval can be set per URL in the seed file:
<URL>\tnutch.fetchInterval=86400
It is important to note that if AdaptiveFetchSchedule is used, this interval
will be overridden. In Nutch 1.6 that can be bypassed using
nutch.fetchInterval.fixed (issue NUTCH-1388), but this has not yet been
ported to Nutch 2.1 (issue NUTCH-1682).
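
For illustration, seed lines in the format above (URL, a tab, then nutch.fetchInterval in seconds) could be generated with a small helper like the one below. This is a sketch under assumptions: the class name, the example URLs and the one-hour value for category pages are hypothetical; only the line format and the 86400 value come from this thread.

```java
/** Standalone sketch: builds seed-list lines carrying a per-URL fetch interval. */
final class SeedLines {
    /** Returns "url<TAB>nutch.fetchInterval=seconds" for one seed entry. */
    static String seedLine(String url, int fetchIntervalSeconds) {
        if (fetchIntervalSeconds <= 0) {
            throw new IllegalArgumentException("fetch interval must be positive");
        }
        return url + "\t" + "nutch.fetchInterval=" + fetchIntervalSeconds;
    }

    public static void main(String[] args) {
        // Hypothetical category page, refetched hourly.
        System.out.println(seedLine("http://news.example.com/sport/", 3600));
        // Hypothetical article page, refetched daily (86400 s, as in the example above).
        System.out.println(seedLine("http://news.example.com/sport/some-article.html", 86400));
    }
}
```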




RE: Setting different fetch interval for some pages

Posted by Markus Jelsma <ma...@openindex.io>.
Hi

We do something similar using a parse filter plugin and a custom scheduler. The parse filter plugin contains an SVM classifier that gives a high score to hub pages, or pages we consider not important: no content, overviews, lists, etc. This score is passed back to the CrawlDatum and used in the scheduler to adjust the fetch time, partially based on the hub score.

Markus
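
One way a scheduler might turn such a score into a fetch interval is a simple linear interpolation, sketched below. This is not the scheduler described above, just an illustration under stated assumptions: the score is taken to lie in [0, 1], a higher hub score is assumed to mean a shorter interval, and the class name and the bounds are hypothetical.

```java
/** Standalone sketch: maps a hub score to a fetch interval by linear interpolation. */
final class HubScoreInterval {
    /**
     * Assumption: hubScore 1.0 yields minSeconds, hubScore 0.0 yields maxSeconds.
     * Scores outside [0, 1] are clamped.
     */
    static long adjustedIntervalSeconds(double hubScore, long minSeconds, long maxSeconds) {
        if (minSeconds > maxSeconds) {
            throw new IllegalArgumentException("minSeconds must not exceed maxSeconds");
        }
        double s = Math.max(0.0, Math.min(1.0, hubScore));
        return Math.round(maxSeconds - s * (maxSeconds - minSeconds));
    }

    public static void main(String[] args) {
        // Hypothetical bounds: between one hour and one day.
        System.out.println(adjustedIntervalSeconds(1.0, 3600, 86400));
        System.out.println(adjustedIntervalSeconds(0.5, 3600, 86400));
        System.out.println(adjustedIntervalSeconds(0.0, 3600, 86400));
    }
}
```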
 

Re: Setting different fetch interval for some pages

Posted by Jorge Luis Betancourt González <jl...@uci.cu>.
If I remember correctly, there was a patch on the list to accomplish this by specifying the fetch interval in the seed file. It could also serve as a base for implementing a custom plugin for your specific use case.

________________________________________________________________________________________________
III International Winter School at the UCI, 17 to 28 February 2014. See www.uci.cu