
Posted to user@nutch.apache.org by Christian Kunz <ch...@1und1.de> on 2017/02/02 14:23:14 UTC

Tell Nutch to only crawl parts of document

Hi everybody,

we've got a problem using Nutch: On the website that has to be crawled, there is a navigation on top of each page. Nutch crawls the navigation of each page which leads to the situation that for certain queries (that are included in the navigation) every page is delivered as a result.

Is there a way to tell Nutch to only crawl parts of a page like only the main content?

Thanks in advance and regards,
Christian

AW: Tell Nutch to only crawl parts of document

Posted by André Schild <a....@aarboard.ch>.

Hello Christian,

>we've got a problem using Nutch: On the website that has to be crawled, there is 
>a navigation on top of each page. Nutch crawls the navigation of each page 
>which leads to the situation that for certain queries (that are included in the navigation) every page is delivered as a result.

We have always used the blacklist-whitelist plugin for this.
It lets you specify the tags, ids and classes in your HTML that should be whitelisted or blacklisted.

http://lucene.472066.n3.nabble.com/HTML-tag-filtering-td4116686.html

Here is a version compiled for Nutch 1.12 with Java 8.

https://aarboard.oncloud7.ch/index.php/s/MfFDlsUBWMWW5ZM
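
To illustrate the idea of blacklisting by tag, id or class, here is a minimal standalone sketch in Python. It is not the plugin's code or its configuration format; the class and function names (BlacklistTextExtractor, extract_text) and the sample page are made up for this example, and it assumes reasonably well-formed HTML in which every non-void element is closed.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not change nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class BlacklistTextExtractor(HTMLParser):
    """Collect page text, skipping every element that matches the blacklist.

    Hypothetical helper for illustration only; assumes well-formed nesting.
    """

    def __init__(self, tags=(), ids=(), classes=()):
        super().__init__()
        self.tags = set(tags)
        self.ids = set(ids)
        self.classes = set(classes)
        self.skip_depth = 0  # > 0 while inside a blacklisted element
        self.chunks = []

    def _is_blacklisted(self, tag, attrs):
        attrs = dict(attrs)
        element_classes = set((attrs.get("class") or "").split())
        return (tag in self.tags
                or attrs.get("id") in self.ids
                or bool(element_classes & self.classes))

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        # Once inside a blacklisted element, every nested element is skipped too.
        if self.skip_depth or self._is_blacklisted(tag, attrs):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.chunks.append(data.strip())


def extract_text(html, **blacklist):
    """Return the text of `html` without the blacklisted elements."""
    parser = BlacklistTextExtractor(**blacklist)
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


PAGE = """
<html><body>
  <nav id="topnav"><a href="/">Home</a> <a href="/products">Products</a></nav>
  <div class="main"><h1>Pricing</h1><p>Our plans start at 5 EUR.</p></div>
  <footer>Imprint</footer>
</body></html>
"""

print(extract_text(PAGE, tags={"nav", "footer"}))
# prints: Pricing Our plans start at 5 EUR.
```

With the navigation and footer left out, a query for "Products" would no longer match this page, which is the effect Christian is after.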


André

RE: Tell Nutch to only crawl parts of document

Posted by Mark Vega <ve...@uci.edu>.

Christian,
I am using a Nutch plugin called Extractor from BayanGroup (https://github.com/BayanGroup/nutch-custom-search) that allows you to select content elements on the page based on xpath expressions or css selectors. I've mapped all the repeating content elements (navs, headers, footers, search bars, etc.) on my sites to specific custom SOLR fields and am able to index the non-repeating content into the default 'content' field in SOLR. Only the 'content' field is used when conducting a search, thereby side-stepping the issue you've encountered of every page showing up in results for certain searches that match on repeated content.

I think the plugin may have changed somewhat since I included it in my Nutch 1.10 installation, but it was easy to set up and has worked well for several years now. I still index the repeating elements, but that information now goes into custom SOLR fields that are not searched (I index them anyway in case I have some reason to search those fields in the future).

One caveat: When I first set this up, I was indexing 7 sites that basically used the same theme but had no consistent template across sites, i.e., the main 'content' section and the repeating content sections were each given different css selectors on different sites. The only way to, say, grab all the left navs of every site and separate that content from the main searchable content was to create a very detailed Extractor config file that mapped each individual site's elements into a shared set of custom SOLR fields. Again, only the main 'content' section from each site is indexed into the default SOLR content field, and repeating content is indexed into custom global nav, left nav, global search, header, and footer fields in SOLR.

As we undertook redesigns of our public sites last year, I took special pains to make sure that each site used the same css selectors for the repeating content elements and the main content section of all pages.
Now my Extractor config file is much smaller and still works great!
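
The mapping described above can be sketched roughly as follows, in standalone Python. This is not the Extractor plugin's code or its config syntax; the names (FieldSplitter, split_fields, SITE_MAP), the field names and the sample page are assumptions for illustration, and the matcher only understands three simple selector forms ('tag', '#id', '.class'). Text inside a mapped element goes to that element's field, and everything else falls through to 'content'.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Void elements never get a closing tag, so they are kept off the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


def matches(selector, tag, attrs):
    """Support only three selector forms: 'tag', '#id' and '.class'."""
    if selector.startswith("#"):
        return attrs.get("id") == selector[1:]
    if selector.startswith("."):
        return selector[1:] in (attrs.get("class") or "").split()
    return tag == selector


class FieldSplitter(HTMLParser):
    """Route text into named fields; hypothetical helper, assumes well-formed HTML."""

    def __init__(self, field_map, default_field="content"):
        super().__init__()
        self.field_map = field_map
        self.default_field = default_field
        self.stack = []  # field in effect for each currently open element
        self.fields = defaultdict(list)

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        attrs = dict(attrs)
        # Nested elements inherit the field of their parent unless remapped.
        current = self.stack[-1] if self.stack else self.default_field
        for selector, field in self.field_map.items():
            if matches(selector, tag, attrs):
                current = field
                break
        self.stack.append(current)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        if data.strip():
            field = self.stack[-1] if self.stack else self.default_field
            self.fields[field].append(data.strip())


def split_fields(html, field_map):
    """Return a dict of field name -> text for one page."""
    parser = FieldSplitter(field_map)
    parser.feed(html)
    parser.close()
    return {name: " ".join(parts) for name, parts in parser.fields.items()}


# One map per site; with shared selectors across sites, a single map is enough.
SITE_MAP = {"#topnav": "global_nav", ".sidebar": "left_nav", "footer": "footer"}

PAGE = """
<html><body>
<div id="topnav"><a href="/">Home</a><a href="/hours">Hours</a></div>
<ul class="sidebar"><li>Databases</li></ul>
<div id="main"><h1>Library hours</h1><p>Open daily.</p></div>
<footer>Contact us</footer>
</body></html>
"""

doc = split_fields(PAGE, SITE_MAP)
print(doc["content"])  # prints: Library hours Open daily.
```

Searching only the 'content' value of each document then ignores the navigation terms, while the other fields stay available if they are ever needed.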

--
Mark F. Vega
Programmer/Analyst
UC Irvine Libraries - Web Services
vegamf@uci.edu
949.824.9872
--

