
Posted to user@nutch.apache.org by Mark Wilson <mw...@sanger.ac.uk> on 2015/06/03 17:43:46 UTC

Crawling pages but ignoring header and footer

Does anyone know of a way to crawl a website but ignore headers and footers, or to include just the main content of a site by, say, only including content inside a <div class="main">?

I have tried using https://github.com/BayanGroup/nutch-custom-search in Nutch 1.9 but I can't get it to work.

Any ideas greatly appreciated.

Thanks

Regards 

Mark Wilson
mw8@sanger.ac.uk

Re: [MASSMAIL]Crawling pages but ignoring header and footer

Posted by Mark Wilson <mw...@sanger.ac.uk>.

Thanks for the info, Jorge. I'll take a look at these. Cheers, Mark



Mark Wilson
mw8@sanger.ac.uk

Re: [MASSMAIL]Crawling pages but ignoring header and footer

Posted by Jorge Luis Betancourt González <jl...@uci.cu>.

Hi Mark, 

You can use the boilerpipe feature that comes with Tika, which tries to detect the main content (text) of the page and ignore all the noise around it. Although current versions of Tika support this, Nutch doesn't expose a configuration option to enable it. You could apply the patch in [1]; it needs updating to work with the Nutch 1.9 source code, but that shouldn't be hard. Another option is [2]: you'll also need to apply a patch and then configure the property "parser.html.NodesToExclude" in your nutch-site.xml file, setting a list of nodes separated by | that will not be indexed. The JIRA issue describes the format of this configuration.

Regards,

[1] https://issues.apache.org/jira/browse/NUTCH-961
[2] https://issues.apache.org/jira/browse/NUTCH-585
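
To make the idea in this thread concrete, here is a minimal standalone sketch of keeping only the text inside a <div class="main"> and dropping headers and footers. It is not Nutch, Tika, or boilerpipe code and does not use either patch above; it uses only Python's standard html.parser. The names MainContentExtractor and extract_main_text and the sample page are hypothetical, and the sketch assumes reasonably well-formed markup in which non-void elements have closing tags.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not change nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class MainContentExtractor(HTMLParser):
    """Collects text found inside <div class="...">, ignoring everything else.

    Hypothetical helper for illustration; assumes non-void tags are closed.
    """

    def __init__(self, wanted_class="main"):
        super().__init__()
        self.wanted_class = wanted_class
        self.depth = 0      # nesting depth inside the wanted div (0 = outside)
        self.chunks = []    # text fragments collected so far

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth > 0:
            # Already inside the wanted div: track nesting of child elements.
            self.depth += 1
        elif tag == "div":
            classes = (dict(attrs).get("class") or "").split()
            if self.wanted_class in classes:
                self.depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Keep text only while inside the wanted div.
        if self.depth > 0 and data.strip():
            self.chunks.append(data.strip())


def extract_main_text(html, wanted_class="main"):
    parser = MainContentExtractor(wanted_class)
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


page = """
<html><body>
  <div class="header">Site name | Login</div>
  <div class="main"><h1>Title</h1><p>Body text.<br>More text.</p></div>
  <div class="footer">Copyright notice</div>
</body></html>
"""
print(extract_main_text(page))  # prints: Title Body text. More text.
```

This is selector-based filtering (closer in spirit to the node-exclusion approach in [2], but inverted to an include list), not the statistical main-content detection that boilerpipe performs. Pages with unclosed tags would need a more forgiving parser.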

