Posted to user@nutch.apache.org by KRIS MUSSHORN <mu...@comcast.net> on 2016/12/12 19:54:46 UTC
config help
I'm using Nutch 1.12 and Solr 5.4.1.
I'm crawling a website with Nutch and indexing it into Solr.
AFAIK the regex-urlfilter.txt file will cause content to not be crawled.
What if I have
https://XXXX/inside/default.cfm as my seed URL?
I want the links on this page to be crawled and indexed, but I do not want this page itself to be indexed into Solr.
How would I set this up?
I'm thinking that the regex-urlfilter.txt file is NOT the right place.
Re: config help
Posted by KRIS MUSSHORN <mu...@comcast.net>.
Sebastian, I am triggering Nutch with a bash script that fires crawl.
How would I set it up to use the index filtering?
Kris
----- Original Message -----
From: "Sebastian Nagel" <wa...@googlemail.com>
To: user@nutch.apache.org
Sent: Tuesday, December 13, 2016 6:11:52 AM
Subject: Re: config help
Re: config help
Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi Kris,
The indexer can also filter by URL. You can create an extra filter file that is
used only for indexing and pass it to the indexing job alone, together with the
-filter option, which enables URL filtering (off by default):
bin/nutch index -Durlfilter.regex.file=regex-urlfilter-index.txt ... -filter
Make sure that the extra file is properly placed / packed so that it is found.
Since most undesired URLs are already filtered (.jpeg, etc.), for better performance
the file should contain only those rules required to keep the index clean. Also
note that the -D... arguments must precede all other arguments.
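For illustration only, here is a minimal sketch of what such an index-only filter file might contain for the seed page from the original question. The rules themselves are an assumption, not something confirmed in this thread. They rely on the regex URL filter's documented behavior that the first matching rule decides whether a URL is accepted ("+") or rejected ("-"). The file name matches the one in the command above, and XXXX is the placeholder host from the question.

```shell
# Write a hypothetical index-only filter file into the current directory.
# It still has to be placed where Nutch can find it, as noted above.
cat > regex-urlfilter-index.txt <<'EOF'
# Keep the seed page itself out of the index (assumed rule).
-^https://XXXX/inside/default\.cfm$
# Accept everything else that reaches the indexer.
+.
EOF
```

Because this file is only passed to the indexing job, the regular regex-urlfilter.txt keeps governing the crawl, so the links on the seed page should still be followed and fetched.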
Best,
Sebastian