Posted to user@nutch.apache.org by Joshua J Pavel <jp...@us.ibm.com> on 2011/12/08 17:36:12 UTC

Selective fetching without exclusion


I would like to emphasize SPEED in my re-crawl, and I know that there are
certain branches on my domain that will not be updated while others will
be.

My first attempt was to do a crawl, then add an exclusion in
regex-urlfilter.txt (-history) and perform a recrawl, but that seems to
exclude the URLs from the database as well.
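
For reference, the exclusion was plain regex-urlfilter.txt syntax,
something like this (a sketch; my real pattern is more specific):

    # skip anything whose URL contains "history"
    -history
    # accept anything else
    +.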

My next thought is that I could construct a segment including *only* those
URLs, and then copy that segment into my directory when I'm doing my
recrawls... thereby excluding it from the fetch/parse stage, but including
it when I index.  If the configuration still has -history excluded at that
point, will those URLs still be searchable in the resulting index?
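
Spelled out, the cycle I'm imagining looks roughly like this (all paths
and the segment name are made up; solrindex as in the 1.3/1.4 tutorial):

    # one-off: fetch/parse the static branch once and keep its segment
    cp -r static_crawl/segments/20111208123456 crawl/segments/

    # each recrawl: generate/fetch/parse/updatedb the changing branches
    bin/nutch generate crawl/crawldb crawl/segments
    s=`ls -d crawl/segments/2* | tail -1`
    bin/nutch fetch $s
    bin/nutch parse $s
    bin/nutch updatedb crawl/crawldb $s

    # index everything, including the copied-in static segment
    bin/nutch invertlinks crawl/linkdb -dir crawl/segments
    bin/nutch solrindex http://localhost:8983/solr/ crawl/crawldb \
        crawl/linkdb crawl/segments/*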

Or is there another way to exclude URLs manually from the fetch/parse
stage, while including them in everything else?  Or can I set the
refetch time of certain URLs in the database?  The adaptive algorithm is
rather slow for what I'm trying to accomplish.  :-)

Thanks in advance!

Re: Selective fetching without exclusion

Posted by Markus Jelsma <ma...@openindex.io>.
> I would like to emphasize SPEED in my re-crawl, and I know that there are
> certain branches on my domain that will not be updated while others will
> be.
> 
> My first attempt was to do a crawl, then add an exclusion in
> regex-urlfilter.txt (-history) and perform a recrawl, but that seems to
> exclude the URLs from the database as well.

You can use different (or no) filters for each job. It'll work, but you'll
have to find a way to switch filters, and perhaps rebuild the Hadoop job
file (in deploy mode the filter files are packed inside the .job archive).
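
One way to do that (an untested sketch: the recrawl filter file name is
made up, the segment name is a placeholder, and it assumes the tools run
through ToolRunner so the generic -D option is picked up):

    # generate with a stricter filter that drops the static branch
    bin/nutch generate -D urlfilter.regex.file=regex-urlfilter-recrawl.txt \
        crawl/crawldb crawl/segments

    # update the crawldb without -filter, so the excluded URLs stay in it
    bin/nutch updatedb crawl/crawldb crawl/segments/20111208123456

I believe the all-in-one crawl command runs updatedb with filtering
enabled, which would explain why your -history URLs dropped out of the db.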

> 
> My next thought is that I could construct a segment including *only* those
> URLs, and then copy that segment into my directory when I'm doing my
> recrawls... thereby excluding it from the fetch/parse stage, but including
> it when I index.  If the configuration still has -history excluded at that
> point, will those URLs still be searchable in the resulting index?
> 
> Or is there another way to exclude URLs manually from the fetch/parse
> stage, while including them in everything else?  Or can I set the
> refetch time of certain URLs in the database?  The adaptive algorithm is
> rather slow for what I'm trying to accomplish.  :-)

You mean the AdaptiveFetchSchedule? It's easily tuned; check the property
descriptions in nutch-default.xml. Then again, the default fetch interval
(db.fetch.interval.default) is 30 days, and unless you crawl many thousands
of domains and a huge number of pages, this is not going to be a serious
problem.
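
For example, in nutch-site.xml (property names as in nutch-default.xml;
the values here are only illustrations, not recommendations):

    <property>
      <name>db.fetch.schedule.class</name>
      <value>org.apache.nutch.crawl.AdaptiveFetchSchedule</value>
    </property>

    <!-- shrink the interval faster when a page turns out to be modified -->
    <property>
      <name>db.fetch.schedule.adaptive.dec_rate</name>
      <value>0.5</value>
    </property>

    <!-- never let the interval grow beyond 7 days (in seconds) -->
    <property>
      <name>db.fetch.schedule.adaptive.max_interval</name>
      <value>604800.0</value>
    </property>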

> 
> Thanks in advance!