Posted to user@nutch.apache.org by Jun Zhang <ju...@gmail.com> on 2016/01/31 03:12:30 UTC

How to set up Nutch to only crawl links on designated web pages repeatedly?

Hello,
I want to crawl only the links on several designated web pages (for example A, B, and C). The links on those pages (for example A1, A2, A3, B1, B2, B3, and C1) might be deleted or added over time. How can I crawl only the links on these designated pages, and not follow the links found on the linked pages themselves (for example the links on pages A1, A2, A3, B1, B2, B3, and C1)?
I appreciate any suggestions or help very much.
Thank you.
Junqiang

Re: [MASSMAIL] How to set up Nutch to only crawl links on designated web pages repeatedly?

Posted by Junqiang Zhang <ju...@gmail.com>.
Hello Eyeris,

Thank you very much for your suggestion. Sorry for my late reply.

Using the URL filter plugins is a good option, and I am doing this for
my current crawling task. However, URL filters are not exactly what I
want. I feel there should be a better way to restrict Nutch to crawling
only the links on designated web pages. Perhaps Nutch does not
currently provide such a feature.
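One workaround is to limit the crawl depth instead of (or in addition
to) filtering: with two rounds, Nutch fetches only the seed pages and
the links discovered on them, and never follows links on the linked
pages themselves. A sketch, assuming a seed list of the designated
pages A, B, and C in urls/ (the exact bin/crawl arguments vary between
Nutch versions):

```shell
# Two rounds: round 1 fetches the seed pages A, B, C;
# round 2 fetches the links found on them. Nothing deeper is crawled.
bin/crawl urls/ crawl/ 2
```

Re-running this regularly re-fetches the seed pages, so links added to
or removed from them are picked up on the next crawl.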

Best,
Junqiang

On Sun, Jan 31, 2016 at 9:26 PM, Eyeris Rodriguez Rueda <er...@uci.cu> wrote:
> Hello Jun.
> Maybe you can use Nutch's URL filter plugins. These plugins are used to filter or restrict which links are visited.
> I need a few more details about your situation:
>
> 1. How are the links to visit on your pages (A, B, C) selected? Do they share a pattern, a subdomain, or a keyword in their URLs?

Re: [MASSMAIL] How to set up Nutch to only crawl links on designated web pages repeatedly?

Posted by Eyeris Rodriguez Rueda <er...@uci.cu>.
Hello Jun.
Maybe you can use Nutch's URL filter plugins. These plugins are used to filter or restrict which links are visited.
I need a few more details about your situation:

1. How are the links to visit on your pages (A, B, C) selected? Do they share a pattern, a subdomain, or a keyword in their URLs?
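For instance, if the links of interest share a URL pattern, the
regex-urlfilter plugin can express it in conf/regex-urlfilter.txt. A
minimal sketch, where the example.com paths are placeholders for the
real pages A, B, and C:

```
# conf/regex-urlfilter.txt (sketch; example.com paths are placeholders)

# Accept the designated seed pages themselves.
+^https?://www\.example\.com/(pageA|pageB|pageC)\.html$

# Accept links that appear on those pages, assuming they share a prefix.
+^https?://www\.example\.com/(pageA|pageB|pageC)/

# Reject everything else.
-.
```

Rules are applied top-down: the first matching + or - pattern decides
whether a URL is kept or dropped.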