Posted to user@nutch.apache.org by KRIS MUSSHORN <mu...@comcast.net> on 2016/12/12 19:54:46 UTC

config help

I'm using Nutch 1.12 and Solr 5.4.1.

I'm crawling a website and indexing it into Solr.

AFAIK, a rule in the regex-urlfilter.txt file will keep matching content from being crawled at all.

What if I have
https://XXXX/inside/default.cfm as my seed URL?
I want the links on this page to be crawled and indexed, but I do not want this page itself to be indexed into Solr.
How would I set this up?

I'm thinking that the regex-urlfilter.txt file is NOT the right place.

Re: config help

Posted by KRIS MUSSHORN <mu...@comcast.net>.

Sebastian, I am triggering Nutch with a bash script that fires crawl.
How would I set it up to use the index filtering?

Kris 

----- Original Message -----

From: "Sebastian Nagel" <wa...@googlemail.com> 
To: user@nutch.apache.org 
Sent: Tuesday, December 13, 2016 6:11:52 AM 
Subject: Re: config help 

Hi Kris, 

the indexer can also filter by URL. You can create an extra
configuration file that is used only for indexing, pass it only to the indexing job,
and add the option -filter to enable URL filtering (off by default):

bin/nutch index -Durlfilter.regex.file=regex-urlfilter-index.txt ... -filter

Make sure that the extra file is placed or packaged so that Nutch can find it.
Since most undesired URLs (.jpeg, etc.) are already filtered out during the crawl,
the file should, for better performance, contain only the rules needed to keep the
index clean. Also note that the -D... arguments must precede all other arguments.

Best, 
Sebastian 


Re: config help

Posted by Sebastian Nagel <wa...@googlemail.com>.

Hi Kris,

the indexer can also filter by URL. You can create an extra
configuration file that is used only for indexing, pass it only to the indexing job,
and add the option -filter to enable URL filtering (off by default):

  bin/nutch index -Durlfilter.regex.file=regex-urlfilter-index.txt ... -filter

Make sure that the extra file is placed or packaged so that Nutch can find it.
Since most undesired URLs (.jpeg, etc.) are already filtered out during the crawl,
the file should, for better performance, contain only the rules needed to keep the
index clean. Also note that the -D... arguments must precede all other arguments.

Best,
Sebastian
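
As a rough illustration of the advice above, an indexing-only filter file for the seed page in the question might look like the sketch below. It assumes the file belongs in a local conf directory that Nutch reads its configuration from (CONF_DIR is a hypothetical variable used only here), and it uses the rule syntax of the stock regex-urlfilter.txt: a leading - excludes, a leading + includes, and the first matching rule wins. XXXX is the placeholder host from the question, not a real value.

```shell
#!/bin/sh
# Sketch only: CONF_DIR is a hypothetical variable; point it at the conf
# directory your Nutch installation actually uses.
CONF_DIR="${CONF_DIR:-conf}"
mkdir -p "$CONF_DIR"

# Write a filter file intended for the indexing job only.
cat > "$CONF_DIR/regex-urlfilter-index.txt" <<'EOF'
# Keep the seed page itself out of the index (XXXX is a placeholder host).
-^https://XXXX/inside/default\.cfm$
# Accept every other URL that reaches the indexer.
+.
EOF

# The indexing job would then be pointed at this file as in the message above
# (the elided arguments are left as they were given there):
#   bin/nutch index -Durlfilter.regex.file=regex-urlfilter-index.txt ... -filter
```

Because the crawl itself keeps using the regular regex-urlfilter.txt, the seed page is still fetched and its outlinks are still followed; only the indexing step would skip it.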
