Posted to dev@nutch.apache.org by atul <at...@hexaware.com> on 2012/04/24 21:44:56 UTC

Skipping Root File from Indexing

Hi,

We have a Nutch-Solr combination in place to build up a web page.
We read a source index.html file which contains links to other web
pages.


Our code is working fine: the rest of the web pages get indexed
by following the URLs on index.html.

However, we don't want the index.html file itself to be indexed in Solr.
We want all the internal URLs (web pages) indexed except the root page.

Please advise how this can be achieved.

Thanks,
Atul

Re: Skipping Root File from Indexing

Posted by Markus Jelsma <ma...@openindex.io>.

 Hi,

 With Nutch 1.5 or lower you have two options: either skip the record in a
 simple custom indexing filter, or skip it in a custom UpdateProcessor
 in Solr. You can also try NUTCH-1300, which has a patch enabling filtering
 and normalizing for the indexer job. With the RegexURLFilter enabled and
 filter rules written specifically for the indexer, you can skip the URL.
 The patch hasn't been fully tested, but it should work. It's likely to be
 committed to trunk within the next month.

 https://issues.apache.org/jira/browse/NUTCH-1300
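
 To illustrate what such indexer-specific filter rules could look like, here
 is a minimal standalone sketch in Java. It is not Nutch's RegexURLFilter
 class and does not use the Nutch API; it only mimics the usual rule
 convention, where each rule is a regular expression prefixed with "+"
 (accept) or "-" (reject), the first matching rule decides, and a URL that
 matches no rule is rejected. The class name IndexerUrlRules and the host
 www.example.com are assumptions made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Standalone sketch of "+/-" regex URL rules; hypothetical, not a Nutch class. */
final class IndexerUrlRules {

    /** One rule: accept or reject, plus the regular expression to look for. */
    private static final class Rule {
        final boolean accept;
        final Pattern pattern;

        Rule(boolean accept, Pattern pattern) {
            this.accept = accept;
            this.pattern = pattern;
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    /** Parses rule lines; blank lines and lines starting with '#' are ignored. */
    IndexerUrlRules(String... lines) {
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue;
            }
            char sign = line.charAt(0);
            if (sign != '+' && sign != '-') {
                throw new IllegalArgumentException("Rule must start with + or -: " + line);
            }
            rules.add(new Rule(sign == '+', Pattern.compile(line.substring(1))));
        }
    }

    /** Returns true if the URL should be indexed; the first matching rule decides. */
    boolean accepts(String url) {
        for (Rule rule : rules) {
            if (rule.pattern.matcher(url).find()) {
                return rule.accept;
            }
        }
        return false; // no rule matched, so the URL is rejected
    }

    public static void main(String[] args) {
        IndexerUrlRules rules = new IndexerUrlRules(
            "# skip the root page (hypothetical host)",
            "-^http://www\\.example\\.com/(index\\.html)?$",
            "# accept everything else",
            "+.");
        String[] urls = {
            "http://www.example.com/index.html",
            "http://www.example.com/products/page1.html"
        };
        for (String url : urls) {
            System.out.println(url + " -> " + (rules.accepts(url) ? "index" : "skip"));
        }
    }
}
```

 The reject rule for the root page has to come before the catch-all "+."
 rule, because the first match wins. The root page is still fetched and its
 outlinks are still followed; rules like these would only apply at indexing
 time.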

 Cheers

-- 
 Markus Jelsma - CTO - Openindex
 http://www.linkedin.com/in/markus17
 050-8536600 / 06-50258350