Posted to user@nutch.apache.org by karthik085 <ka...@gmail.com> on 2007/04/27 20:47:46 UTC
Ignore Robots meta tag
Hi,
I am trying to index a website whose pages contain
<meta name='ROBOTS' content='NOINDEX, NOFOLLOW'> in their HTML.
To get rid of this tag, the site owners would have to remove it from all their
pages, and they don't want to regenerate these pages from the database.
I have already crawled this website. Is there any way I can make Nutch ignore
the above tag and index the pages?
One way I can think of is:
a) Retrieve HTML from segments
b) Remove that line
c) Write back
d) Re-index
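Step (b) on its own could be sketched roughly as below. This is only a minimal Python sketch of stripping the tag from an HTML string. The helper name strip_robots_meta is hypothetical and not part of Nutch, and reading the HTML from the segments and writing it back (steps a and c) is not shown.

```python
import re

# Matches a <meta> tag whose name attribute is "robots"
# (case-insensitive, single or double quotes), plus trailing whitespace.
ROBOTS_META = re.compile(
    r"<meta\b[^>]*\bname\s*=\s*['\"]?robots['\"]?[^>]*>\s*",
    re.IGNORECASE,
)


def strip_robots_meta(html: str) -> str:
    """Hypothetical helper: return html with every robots meta tag removed."""
    return ROBOTS_META.sub("", html)


# Example page containing the tag from the question.
page = (
    "<html><head>\n"
    "<meta name='ROBOTS' content='NOINDEX, NOFOLLOW'>\n"
    "<title>t</title></head></html>"
)
cleaned = strip_robots_meta(page)
```

A regular expression is enough here only because the tag is a single, regular line; for messier HTML a real parser would probably be safer.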
Does anyone have a better solution? Can I use PruneIndexTool?
If the above is the way to go, how do I do it? That is, which commands do I
need to issue, and which classes do I need to call or modify?
Any help is appreciated. Thanks.
Karthik
--
View this message in context: http://www.nabble.com/Ignore-Robots-meta-tag-tf3659247.html#a10224500
Sent from the Nutch - User mailing list archive at Nabble.com.
Re: Ignore Robots meta tag
Posted by karthik085 <ka...@gmail.com>.
OOPS...I meant IndexSegment.
PruneIndexTool prunes existing Nutch indexes of unwanted content. :-)
--
View this message in context: http://www.nabble.com/Ignore-Robots-meta-tag-tf3659247.html#a10225171