Posted to user@nutch.apache.org by Otis Gospodnetic <ot...@gmail.com> on 2013/12/10 15:55:26 UTC

New feature: Seed URL high fetch frequency

Hi,

While working for a client we came across a use case that seems like it
might not be uncommon.  We may have some code to contribute.

The use case is that we have a few seed URLs that we need to fetch at
relatively high frequency (e.g. every N minutes).  These URLs have pointers
to news-type content.  Thus, these seed URLs are used primarily for URL
discovery.  From there we do a relatively shallow crawl.  But the
important thing is that we need to make sure we get to refetching seed URLs
(depth=0) at some high frequency, while all other URLs can be refetched at
their default frequency.  In the case of news, that probably means
"fetch once and never again".

So I'm wondering if a simple custom "seed URL scheduler" would be of
interest.  Something like:

if (URL is seed)
  fetch at seed URL fetch freq
else
  fetch at standard freq

?

.... or if this can already be done without a custom scheduler, I'd love to
know how!
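
A minimal sketch of the rule above, in plain Java and independent of Nutch's
own scheduler classes. The class name SeedAwareIntervals, the method
intervalFor, and the example URLs and durations are all hypothetical; they
only illustrate the decision and are not an actual Nutch API:

```java
import java.time.Duration;
import java.util.Set;

// Hypothetical helper, not a Nutch class: picks a refetch interval
// depending on whether a URL is one of the known seed URLs.
class SeedAwareIntervals {
    private final Set<String> seedUrls;
    private final Duration seedInterval;
    private final Duration defaultInterval;

    SeedAwareIntervals(Set<String> seedUrls, Duration seedInterval, Duration defaultInterval) {
        this.seedUrls = Set.copyOf(seedUrls);
        this.seedInterval = seedInterval;
        this.defaultInterval = defaultInterval;
    }

    // Seeds get the short interval, every other URL gets the default one.
    Duration intervalFor(String url) {
        return seedUrls.contains(url) ? seedInterval : defaultInterval;
    }

    public static void main(String[] args) {
        SeedAwareIntervals intervals = new SeedAwareIntervals(
                Set.of("http://news.example.com/"),
                Duration.ofMinutes(10),
                Duration.ofDays(30));
        System.out.println(intervals.intervalFor("http://news.example.com/"));
        System.out.println(intervals.intervalFor("http://news.example.com/story/1"));
    }
}
```

In a real crawl the "is seed" test would more likely be a flag stored with the
URL at inject time than a lookup in an in-memory set.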

Thanks,
Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/

RE: New feature: Seed URL high fetch frequency

Posted by Markus Jelsma <ma...@openindex.io>.
By the way, if you don't use an adaptive scheduler but one that maintains the configured or injected interval, you can already do it by simply injecting URLs with low intervals.
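
As a hedged illustration of injecting with a low interval: to my
understanding, the Nutch 1.x Injector reads tab-separated key=value metadata
after each seed URL and treats nutch.fetchInterval (in seconds) as the per-URL
fetch interval. The sketch below only builds such seed lines. The class and
method names and the example URL are made up, and the metadata key should be
checked against the Injector documentation of the version in use:

```java
// Hypothetical helper for writing a seed list; not part of Nutch.
class SeedLines {
    // Assumed metadata key for the per-URL fetch interval, in seconds.
    static final String FETCH_INTERVAL_KEY = "nutch.fetchInterval";

    // Builds one seed-list line: the URL, a tab, then key=<seconds>.
    static String seedLine(String url, long intervalSeconds) {
        if (intervalSeconds <= 0) {
            throw new IllegalArgumentException("interval must be positive");
        }
        return url + "\t" + FETCH_INTERVAL_KEY + "=" + intervalSeconds;
    }

    public static void main(String[] args) {
        // A seed that should be refetched every 5 minutes.
        System.out.println(seedLine("http://news.example.com/", 300));
    }
}
```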
 
-----Original message-----
> From:Markus Jelsma <ma...@openindex.io>
> Sent: Tuesday 10th December 2013 16:04
> To: user@nutch.apache.org
> Subject: RE: New feature: Seed URL high fetch frequency
> 
> Already in 1.x:
> https://issues.apache.org/jira/browse/NUTCH-1388
> 
> Also see:
> https://issues.apache.org/jira/browse/NUTCH-1405
> 
> You can already inject with fetchInterval but you need a fixedFetchInterval to be added to the metadata and a FetchScheduler that supports it.
>  

RE: New feature: Seed URL high fetch frequency

Posted by Markus Jelsma <ma...@openindex.io>.
Already in 1.x:
https://issues.apache.org/jira/browse/NUTCH-1388

Also see:
https://issues.apache.org/jira/browse/NUTCH-1405

You can already inject with fetchInterval but you need a fixedFetchInterval to be added to the metadata and a FetchScheduler that supports it.
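
One way to picture the second part: a scheduler that normally adapts the
interval but leaves it alone when the URL's metadata carries a fixed interval.
The following is a simplified sketch, not the NUTCH-1388 patch itself. The
metadata key "fixedFetchInterval" is taken from the wording above and may not
match the key Nutch actually uses, and the halve/double "adaptive" rule is
only a placeholder:

```java
import java.util.Map;

// Hypothetical stand-in for a fetch scheduler; not a Nutch class.
class FixedIntervalAwareSchedule {
    // Assumed metadata key; the real key in Nutch may differ.
    static final String FIXED_KEY = "fixedFetchInterval";

    // Returns the next fetch interval in seconds.
    static long nextInterval(long currentSeconds, boolean pageChanged, Map<String, String> metadata) {
        String fixed = metadata.get(FIXED_KEY);
        if (fixed != null) {
            // A fixed interval was injected: keep it, skip any adaptation.
            return Long.parseLong(fixed);
        }
        // Placeholder adaptive rule: refetch sooner if the page changed, later if not.
        return pageChanged ? Math.max(1, currentSeconds / 2) : currentSeconds * 2;
    }

    public static void main(String[] args) {
        System.out.println(nextInterval(3600, false, Map.of(FIXED_KEY, "300")));
        System.out.println(nextInterval(3600, false, Map.of()));
    }
}
```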
 