Posted to user@nutch.apache.org by "Thumuluri, Sai" <Sa...@VerizonWireless.com> on 2010/09/16 13:52:44 UTC

Crawl depth

Hi,
We are using Nutch to crawl URL entry points and Solr to index the
content. I have an entry point like
http://urlentrypoint.com/searchLinks.aspx, which in turn resolves into
http://domain.com/link1.aspx, http://domain.com/link2.aspx, etc. -
basically a list of URLs that need to be indexed.

I want Nutch to crawl the http://urlentrypoint.com/searchLinks.aspx page
but create content in the Solr index only for http://domain.com/link1.aspx,
http://domain.com/link2.aspx ...

Setting the depth in the crawl command is not achieving this. Any help is
greatly appreciated.

Thanks,
Sai Thumuluri



Re: Crawl depth

Posted by Mike Baranczak <mb...@gmail.com>.
You could write a custom IndexingFilter plugin. This will allow you to control which documents are indexed.
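
To make the plugin idea concrete, here is a minimal sketch of just the decision such a filter would need to make. It is plain Java with no Nutch dependency; the class name EntryPointSkipper and the method shouldIndex are hypothetical. The assumption is that your Nutch version lets an IndexingFilter drop a document (for example by returning null from its filter method), so check the IndexingFilter Javadoc for your release before relying on that.

```java
import java.util.Set;

// Hypothetical helper: the yes/no decision a custom IndexingFilter could delegate to.
final class EntryPointSkipper {
    // Assumption: the entry-point pages are known up front.
    private static final Set<String> ENTRY_POINTS =
            Set.of("http://urlentrypoint.com/searchLinks.aspx");

    /** Returns true if the document at this URL should be sent to the index. */
    static boolean shouldIndex(String url) {
        return url != null && !ENTRY_POINTS.contains(url);
    }

    public static void main(String[] args) {
        // The entry point is crawled but not indexed; the linked pages are indexed.
        System.out.println(shouldIndex("http://urlentrypoint.com/searchLinks.aspx")); // false
        System.out.println(shouldIndex("http://domain.com/link1.aspx"));              // true
    }
}
```

Inside a real plugin, the filter would call something like shouldIndex on the document URL and skip the document when it returns false.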


There might also be a way to do what you want without writing any code. Remember that the "crawl" command actually combines several jobs, some of which can be run on their own:

1. Fetch the documents.
2. Parse them, extracting any links.
3. Merge those links into the crawl database.
4. Add the document content to the index.

Those steps are repeated several times - the crawl depth setting is the number of repetitions.
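
The loop described above can be modelled with a toy simulation. This is only a sketch of the idea, not Nutch code: the class CrawlDepthSketch, the in-memory link graph, and the crawl method are all made up for illustration. It shows why the depth matters here: links found in one round are only fetched in the next one.

```java
import java.util.*;

final class CrawlDepthSketch {
    // Toy link graph standing in for the web (assumption for illustration).
    static final Map<String, List<String>> LINKS = Map.of(
            "http://urlentrypoint.com/searchLinks.aspx",
            List.of("http://domain.com/link1.aspx", "http://domain.com/link2.aspx"));

    /** Runs `depth` fetch/parse/update rounds and returns the fetched URLs in order. */
    static List<String> crawl(String seed, int depth) {
        Set<String> crawlDb = new LinkedHashSet<>(List.of(seed)); // all known URLs
        List<String> fetched = new ArrayList<>();
        for (int round = 0; round < depth; round++) {
            // Step 1: fetch every known URL that has not been fetched yet.
            List<String> batch = new ArrayList<>(crawlDb);
            batch.removeAll(fetched);
            for (String url : batch) {
                fetched.add(url);
                // Steps 2 and 3: extract outlinks and merge them into the crawl DB.
                crawlDb.addAll(LINKS.getOrDefault(url, List.of()));
            }
        }
        return fetched; // step 4 would index these documents
    }

    public static void main(String[] args) {
        String seed = "http://urlentrypoint.com/searchLinks.aspx";
        System.out.println(crawl(seed, 1).size()); // 1: only the entry point
        System.out.println(crawl(seed, 2).size()); // 3: entry point plus both links
    }
}
```

In this model a depth of 1 reaches only the entry point, so at least two rounds are needed before the linked pages are fetched at all.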

So you'd first run the above sequence without step 4, which primes the crawl DB with links. Then run a crawl normally, but filter out http://urlentrypoint.com/searchLinks.aspx.
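
As a rough illustration of the filtering step, the sketch below evaluates "+regex" / "-regex" rules in order, where the first matching rule decides. That mirrors the style of Nutch's regex URL filter files as I understand them, but the class UrlRuleSketch and its accepts method are hypothetical, and which filter file your Nutch version actually reads is something to verify locally.

```java
import java.util.List;
import java.util.regex.Pattern;

final class UrlRuleSketch {
    // Rules in "+regex" / "-regex" form; the first matching rule decides.
    static final List<String> RULES = List.of(
            "-^http://urlentrypoint\\.com/searchLinks\\.aspx$", // skip the entry page
            "+.");                                               // accept everything else

    /** Returns true if the URL passes the rules, false if it is filtered out. */
    static boolean accepts(String url) {
        for (String rule : RULES) {
            boolean allow = rule.charAt(0) == '+';
            if (Pattern.compile(rule.substring(1)).matcher(url).find()) {
                return allow;
            }
        }
        return false; // no rule matched: reject
    }

    public static void main(String[] args) {
        System.out.println(accepts("http://urlentrypoint.com/searchLinks.aspx")); // false
        System.out.println(accepts("http://domain.com/link1.aspx"));              // true
    }
}
```

The two rule strings are what would go into the filter file (with single backslashes there); the Java around them only demonstrates how such rules behave.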


Or you could look into configuring Solr so that it ignores the http://urlentrypoint.com/searchLinks.aspx document.

-MB

