Posted to user@nutch.apache.org by Lewis John Mcgibbney <le...@gmail.com> on 2012/02/28 18:30:01 UTC

Re: Query in nutch

As far as I know, Elisabeth Adler contributed a patch exactly for this on
NUTCH-585 [0].

If you wish to get cracking with it, please check out the latest trunk code
[1] and patch it using the blacklist_whitelist_plugin.patch Elisabeth attached
to the issue.
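
For anyone new to the idea, here is a minimal sketch of blacklist-style
filtering when extracting text from a page. It is NOT the code from the patch
(Nutch is Java; this is standalone Python using only the standard library),
and the names BlacklistTextExtractor and extract_text, as well as the
(tag, class) blacklist format, are assumptions made purely for illustration.
It also assumes reasonably well-formed HTML with matching closing tags.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class BlacklistTextExtractor(HTMLParser):
    """Hypothetical helper: collects text, skipping blacklisted elements."""

    def __init__(self, blacklist):
        super().__init__()
        # blacklist: set of (tag, css_class) pairs; css_class None matches any class
        self.blacklist = blacklist
        self.skip_depth = 0  # > 0 while inside a blacklisted element
        self.chunks = []

    def _is_blacklisted(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        return (tag, None) in self.blacklist or any(
            (tag, c) in self.blacklist for c in classes)

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        # Once inside a blacklisted element, every nested element deepens the skip.
        if self.skip_depth or self._is_blacklisted(tag, attrs):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are outside every blacklisted element.
        if not self.skip_depth and data.strip():
            self.chunks.append(data.strip())


def extract_text(html, blacklist):
    parser = BlacklistTextExtractor(blacklist)
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


if __name__ == "__main__":
    page = ('<body><div class="menu"><a href="/">Home</a><br>'
            '<span>About</span></div><p>Main article text.</p></body>')
    print(extract_text(page, {("div", "menu")}))
```

A whitelist would be the mirror image: keep text only while inside a listed
element instead of dropping it.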

It would be excellent if you could share your experiences using the patch
on the issue. Thank you.

[0] https://issues.apache.org/jira/browse/NUTCH-585
[1] http://svn.apache.org/repos/asf/nutch/trunk/

On Tue, Feb 28, 2012 at 6:53 AM, Geetha Venu <Ge...@infosys.com> wrote:

>
> Hi All,
>
> I have a specific requirement to crawl only specific content within the
> <body> tag of a website. The Nutch crawler picks up all the content present
> in the body, including menu items, URLs, and whatever other data is present
> in the <body> tag. I couldn't find an option in Nutch to exclude
> particular content (i.e. some content within the <body> tag of an HTML
> page) from being crawled.
>
> Can you please provide any pointers on this ASAP? Thanks in advance.
>
> Thanks and Regards,
> Geetha
>



-- 
*Lewis*