Posted to user@nutch.apache.org by karthik085 <ka...@gmail.com> on 2007/11/07 20:29:44 UTC

[HOW-TO] How to make Nutch Ignore META Tags

One problem when indexing a site is robots META tags (noindex, nofollow)
that stop Nutch from indexing pages or following their links. It is good
practice to obey the rules of the site. But if the site owner is OK with you
ignoring these tags, you can make Nutch ignore them.

In the file HtmlParser.java, located at:
src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java

comment out the following lines:
if (!metaTags.getNoIndex()) {               // okay to index
if (!metaTags.getNoFollow()) {              // okay to follow links

and, of course, the closing brace of each if block, so that the code inside
both blocks always runs. After this, just rebuild the Nutch jar and war files.
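
To illustrate what the two checks do, here is a minimal, self-contained
sketch. It is not Nutch code: the class MetaGuardSketch, the RobotsMeta
stand-in, and the ignoreMeta switch are all hypothetical names made up for
this example. Only the getNoIndex()/getNoFollow() guard pattern is taken from
the lines quoted above. It shows that bypassing the guards has the same
effect as commenting them out, while keeping the default behaviour available.

```java
// Hypothetical sketch, NOT the real HtmlParser: it only mimics the two guards.
class MetaGuardSketch {

    /** Stand-in for the robots META flags parsed from one page (assumed shape). */
    static final class RobotsMeta {
        private final boolean noIndex;
        private final boolean noFollow;

        RobotsMeta(boolean noIndex, boolean noFollow) {
            this.noIndex = noIndex;
            this.noFollow = noFollow;
        }

        boolean getNoIndex()  { return noIndex; }
        boolean getNoFollow() { return noFollow; }
    }

    /** Returns the page text, or "" when the page asks not to be indexed. */
    static String extractText(RobotsMeta meta, String pageText, boolean ignoreMeta) {
        if (ignoreMeta || !meta.getNoIndex()) {      // okay to index
            return pageText;
        }
        return "";
    }

    /** Returns how many links would be followed; 0 when the page says nofollow. */
    static int countOutlinks(RobotsMeta meta, String[] links, boolean ignoreMeta) {
        if (ignoreMeta || !meta.getNoFollow()) {     // okay to follow links
            return links.length;
        }
        return 0;
    }

    public static void main(String[] args) {
        RobotsMeta restrictive = new RobotsMeta(true, true);
        String[] links = { "http://example.com/a", "http://example.com/b" };

        // Default behaviour: the META tags are obeyed.
        System.out.println("obeyed:  text='" + extractText(restrictive, "hello", false)
                + "' links=" + countOutlinks(restrictive, links, false));

        // With the guards bypassed, as after commenting out the two if lines.
        System.out.println("ignored: text='" + extractText(restrictive, "hello", true)
                + "' links=" + countOutlinks(restrictive, links, true));
    }
}
```

A switch like this (if you choose to add one yourself) keeps the original
checks in the source, so you can go back to obeying META tags without
editing and rebuilding again.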

Why would you want to do this?
* The site owner does not want to change the site's code, but you still want
to make that site available for indexing and searching.

Any other suggestions are welcome. Thanks.

-- 
View this message in context: http://www.nabble.com/-HOW-TO--How-to-make-Nutch-Ignore-META-Tags-tf4766792.html#a13634099
Sent from the Nutch - User mailing list archive at Nabble.com.