Posted to user@nutch.apache.org by Matt Poff <ma...@headfirst.co.nz> on 2012/02/03 01:37:02 UTC

How to parse *only* specific URLs under a domain... -depth 1 -topN 1 does not work as desired

Hi,

I have several arbitrary URLs on a domain that I need to index, and given the paths and content hierarchy it would be complex to match them with URL filters. Ideally, I just want to specify the URLs explicitly in my urls file, but this doesn't work as desired: the first run picks up the initial seed pages, but subsequent runs pick up links from those pages and crawl them too, unless I delete the crawldb.
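For reference, the urls file is just a plain seed list, one URL per line, along these lines (illustrative paths, not my real ones):

  http://www.example.com/about/history.html
  http://www.example.com/services/special-offer.html
  http://www.example.com/news/archive.html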

To crawl a set of specific URLs and prevent subsequent crawls from following their outlinks, are my only options to do one of the following:

Set topN to 1 for the initial crawl, then to 0 for subsequent crawls, or
Delete the crawldb before every new invocation?

If so, which of these is better? I assume the former is more efficient. Will it still refresh my indexed URLs periodically?

I have Nutch running on a cron job to index some non-database-indexed pages of a CMS-driven website. I just need it to revisit a predefined set of URLs, not crawl out beyond them.
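For reference, the invocation I mean is the one from the subject line, roughly (directory names are illustrative):

  bin/nutch crawl urls -dir crawl -depth 1 -topN 1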

Thanks for your help.

Cheers,
Matt

Re: How to parse *only* specific URLs under a domain... -depth 1 -topN 1 does not work as desired

Posted by Matt Poff <ma...@gmail.com>.
> I'm not a Nutch expert, but I would try to run a crawl with -depth 0.

Tried that, but a depth of zero generates no results at all.

On 3/02/2012, at 10:07 PM, Markus Jelsma <ma...@openindex.io> wrote:

> You can inject the URLs you want and use the noAdditions switch when updating
> the crawldb.

Thanks - that sounds perfect.

Re: How to parse *only* specific URLs under a domain... -depth 1 -topN 1 does not work as desired

Posted by Markus Jelsma <ma...@openindex.io>.
You can inject the URLs you want and use the noAdditions switch when updating
the crawldb.
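Roughly, one round then looks like this (a sketch for Nutch 1.x local mode; directory names are illustrative):

  bin/nutch inject crawl/crawldb urls
  bin/nutch generate crawl/crawldb crawl/segments
  SEGMENT=`ls -d crawl/segments/* | tail -1`
  bin/nutch fetch $SEGMENT
  bin/nutch parse $SEGMENT
  # -noAdditions keeps newly discovered outlinks out of the crawldb, so
  # later runs only ever re-generate the injected URLs
  bin/nutch updatedb crawl/crawldb $SEGMENT -noAdditions

If I remember correctly, you can also make this behaviour permanent by setting db.update.additions.allowed to false in nutch-site.xml.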


Re: How to parse *only* specific URLs under a domain... -depth 1 -topN 1 does not work as desired

Posted by Adriana Farina <ad...@gmail.com>.
Hi,

I'm not a Nutch expert, but I would try to run a crawl with -depth 0. From
what I understand, topN refers to the breadth of the search, so if you set
-depth 1 Nutch will crawl all your seeds and the pages linked by them. I'm
not sure about it, but if you set -depth 0, Nutch should crawl just the
pages that are at level 0 of your "crawl-tree", that is, it should crawl
just your seeds.
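For what it's worth, my (possibly wrong) mental model of the one-shot crawl command is a loop like the one below, with -depth setting the number of rounds and -topN capping how many URLs each round fetches (illustrative and untested):

  # rough equivalent of: bin/nutch crawl urls -dir crawl -depth 2 -topN 10
  bin/nutch inject crawl/crawldb urls
  for round in 1 2; do               # one iteration per unit of -depth
    bin/nutch generate crawl/crawldb crawl/segments -topN 10
    SEGMENT=`ls -d crawl/segments/* | tail -1`
    bin/nutch fetch $SEGMENT
    bin/nutch parse $SEGMENT
    bin/nutch updatedb crawl/crawldb $SEGMENT
  done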


