Posted to user@nutch.apache.org by karthik085 <ka...@gmail.com> on 2007/11/07 20:29:44 UTC
[HOW-TO] How to make Nutch Ignore META Tags
One common problem when indexing a site: its robots META tags stop Nutch from
indexing pages or following links. It is good practice to obey the rules of
the site. But if the site owner is OK with you ignoring these rules, you can
make Nutch ignore them.
In the file HtmlParser.java, located at
src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java
comment out the following lines:
if (!metaTags.getNoIndex()) { // okay to index
if (!metaTags.getNoFollow()) { // okay to follow links
and, of course, the closing brace of each if block. The code inside the two
blocks stays in place, so it now runs regardless of the META tags. After
this, just rebuild the Nutch jar and war files.
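A minimal sketch of the effect of this edit, in standalone Java. RobotsFlags,
parseRespectingMeta and parseIgnoringMeta are hypothetical stand-ins, not
Nutch classes; only getNoIndex() and getNoFollow() come from the lines quoted
above, and the bodies of the two blocks are placeholders for whatever
HtmlParser actually does at those points.

```java
public class MetaTagSketch {

    // Hypothetical stand-in for the object that holds the robots META flags.
    static final class RobotsFlags {
        private final boolean noIndex;
        private final boolean noFollow;

        RobotsFlags(boolean noIndex, boolean noFollow) {
            this.noIndex = noIndex;
            this.noFollow = noFollow;
        }

        boolean getNoIndex() { return noIndex; }
        boolean getNoFollow() { return noFollow; }
    }

    // Before the edit: each step is guarded by its META flag.
    static String parseRespectingMeta(RobotsFlags metaTags) {
        StringBuilder done = new StringBuilder();
        if (!metaTags.getNoIndex()) {   // okay to index
            done.append("text ");       // placeholder for text extraction
        }
        if (!metaTags.getNoFollow()) {  // okay to follow links
            done.append("links");       // placeholder for outlink extraction
        }
        return done.toString().trim();
    }

    // After the edit: the guards and their closing braces are commented out,
    // so both steps always run.
    static String parseIgnoringMeta(RobotsFlags metaTags) {
        StringBuilder done = new StringBuilder();
        // if (!metaTags.getNoIndex()) {   // okay to index
            done.append("text ");
        // }
        // if (!metaTags.getNoFollow()) {  // okay to follow links
            done.append("links");
        // }
        return done.toString().trim();
    }

    public static void main(String[] args) {
        // A page that sets both noindex and nofollow.
        RobotsFlags restrictive = new RobotsFlags(true, true);
        System.out.println("[" + parseRespectingMeta(restrictive) + "]"); // prints "[]"
        System.out.println("[" + parseIgnoringMeta(restrictive) + "]");   // prints "[text links]"
    }
}
```

Note that only the if lines and their closing braces are commented out; the
statements between them are left untouched.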
Why would you want to do this?
* The site owner does not want to change the site's code, but you still want
to make that site available for indexing and searching.
Any other suggestions are welcome. Thanks.