Posted to user@nutch.apache.org by charlie w <sp...@gmail.com> on 2007/12/19 01:04:41 UTC

semantics of meta noindex

I have a question about the proper interpretation of a noindex robots
directive in a meta tag (<meta name="robots" content="noindex" />).

When Nutch fetches such a page, the content, title, etc. of the page
is not indexed, but the URL itself is.  The document is searchable by
terms in the URL.  That is, if the URL of the page is
http://www.mysite.com/onepage.html, the page is returned as a hit
when searching for "onepage".

Is it correct that Nutch does not index the content but still creates
a Lucene document for a page with such a directive?  Intuitively it
seems to me as if it should not be searchable at all.

Thanks,
Charlie

Re: Anchor links

Posted by Brian Whitman <br...@variogr.am>.
On Dec 19, 2007, at 10:31 AM, Bolle, Jeffrey F. wrote:

> All,
> Is there a way to have Nutch (sorry for not being more specific in
> terms of the crawler, indexer, parser, etc.) ignore anchor links
> internal to the page (but not ignore pages internal to the site)?  I
> have some pages being indexed, archives of mailing lists, that have a
> whole ton of anchors and Nutch re-fetches and re-parses the same page
> countless times, each time for a different anchor link.  I know there
> is the property to ignore internal links, but I want other pages on
> the same host to be included, just not self-referencing links within
> a page.



In your urlnormalizer regex conf file (regex-normalize.xml) you can  
remove everything after the # symbol like so:

	<!-- remove anchors, who needs em -->
	<regex>
	   <pattern>\#(.*)</pattern>
	   <substitution></substitution>
	</regex>
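
With that rule in place, every variant of a page URL that differs only
in its fragment collapses to a single URL before fetching, e.g. (the
URL is illustrative):

	http://www.mysite.com/archive.html#msg00042
	->  http://www.mysite.com/archive.html

This assumes the urlnormalizer-regex plugin is enabled via the
plugin.includes property in nutch-site.xml, which it is in the default
configuration.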


Anchor links

Posted by "Bolle, Jeffrey F." <jb...@mitre.org>.
All,
Is there a way to have Nutch (sorry for not being more specific in
terms of the crawler, indexer, parser, etc.) ignore anchor links
internal to the page (but not ignore pages internal to the site)?  I
have some pages being indexed, archives of mailing lists, that have a
whole ton of anchors and Nutch re-fetches and re-parses the same page
countless times, each time for a different anchor link.  I know there
is the property to ignore internal links, but I want other pages on the
same host to be included, just not self-referencing links within a
page.

Any help would be appreciated.  Thanks.

Jeff

Re: semantics of meta noindex

Posted by charlie w <sp...@gmail.com>.
Heck, I'm not committed to my intuition; it gets me in trouble all the time ;-)

I was just curious as to whether this behavior was by design.  This
whole robots thing is pretty un-spec'd as it is.  Apparently the big
search engines don't agree on this either:
http://www.mattcutts.com/blog/handling-noindex-meta-tags/

In my particular case, I want to pretend the page doesn't exist at
all.  Since I already have my own parse and indexing plugins, it was
relatively trivial to cause my crawler to behave the way I want.
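
For anyone who wants to do the same: here is a minimal sketch of such
an indexing filter, written against the 0.9-era plugin API. The class
name and the "robots.noindex" parse-metadata key are illustrative
(they assume a parse filter has already recorded the meta tag there),
so treat it as a starting point rather than the code from my tree:

	import org.apache.hadoop.conf.Configuration;
	import org.apache.hadoop.io.Text;
	import org.apache.lucene.document.Document;
	import org.apache.nutch.crawl.CrawlDatum;
	import org.apache.nutch.crawl.Inlinks;
	import org.apache.nutch.indexer.IndexingException;
	import org.apache.nutch.indexer.IndexingFilter;
	import org.apache.nutch.parse.Parse;

	public class NoIndexFilter implements IndexingFilter {
	  private Configuration conf;

	  public Document filter(Document doc, Parse parse, Text url,
	                         CrawlDatum datum, Inlinks inlinks)
	      throws IndexingException {
	    // Returning null drops the document entirely, so not even
	    // the URL is searchable.
	    if ("true".equals(parse.getData().getParseMeta()
	        .get("robots.noindex"))) {
	      return null;
	    }
	    return doc;
	  }

	  public void setConf(Configuration conf) { this.conf = conf; }
	  public Configuration getConf() { return conf; }
	}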

Thanks
Charlie

On 12/19/07, Andrzej Bialecki <ab...@getopt.org> wrote:
> charlie w wrote:
> > I have a question about the proper interpretation of a noindex robots
> > directive in a meta tag (<meta name="robots" content="noindex" />).
>
> I couldn't find any unambiguous description of this tag in the official
> documents (robotstxt.org or HTML 4.01). Should a crawler completely skip
> such a page, including its URL, i.e. to pretend such a page doesn't
> exist? Or should it skip the content of the page but still recognize
> that such a page exists?
>
> Nutch does the latter, i.e. it skips the content of the page but still
> adds a page (without content) to the index.
>
> >
> > When Nutch fetches such a page, the content, title, etc. of the page
> > is not indexed, but the URL itself is.  The document is searchable by
> > terms in the URL.  That is, if the URL of the page is
> > http://www.mysite.com/onepage.html, the page is returned as a hit
> > when searching for "onepage".
> >
> > Is it correct that Nutch does not index the content but still creates
> > a Lucene document for a page with such a directive?  Intuitively it
> > seems to me as if it should not be searchable at all.
>
> Your intuition may be right, my intuition may be right too .. ;) If you
> find an official specification that unambiguously defines the expected
> behavior, we'll comply.
>
>
> --
> Best regards,
> Andrzej Bialecki     <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
>

Re: semantics of meta noindex

Posted by Andrzej Bialecki <ab...@getopt.org>.
charlie w wrote:
> I have a question about the proper interpretation of a noindex robots
> directive in a meta tag (<meta name="robots" content="noindex" />).

I couldn't find any unambiguous description of this tag in the official 
documents (robotstxt.org or HTML 4.01). Should a crawler completely skip 
such a page, including its URL, i.e. to pretend such a page doesn't 
exist? Or should it skip the content of the page but still recognize 
that such a page exists?

Nutch does the latter, i.e. it skips the content of the page but still 
adds a page (without content) to the index.
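
For reference: the parse-html plugin records the directive in an
HTMLMetaTags object, which is passed to every HtmlParseFilter, so that
is the natural place to hook in custom behavior. A sketch (the package
location of HTMLMetaTags varies between Nutch versions):

	import org.apache.nutch.parse.HTMLMetaTags;

	public class NoIndexCheck {
	  // metaTags is populated by parse-html from
	  // <meta name="robots" content="..."> directives.
	  public static boolean shouldIndexContent(HTMLMetaTags metaTags) {
	    // getNoIndex() is true when "noindex" was present; Nutch then
	    // drops the parse text but still indexes the bare URL.
	    return !metaTags.getNoIndex();
	  }
	}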

> 
> When Nutch fetches such a page, the content, title, etc. of the page
> is not indexed, but the URL itself is.  The document is searchable by
> terms in the URL.  That is, if the URL of the page is
> http://www.mysite.com/onepage.html, the page is returned as a hit
> when searching for "onepage".
> 
> Is it correct that Nutch does not index the content but still creates
> a Lucene document for a page with such a directive?  Intuitively it
> seems to me as if it should not be searchable at all.

Your intuition may be right, my intuition may be right too .. ;) If you 
find an official specification that unambiguously defines the expected 
behavior, we'll comply.


-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: semantics of meta noindex

Posted by Martin Kuen <ma...@gmail.com>.
Hi Charlie,

IMO if the maintainer doesn't want a page to be searchable at all,
the page should be excluded using robots.txt (my intuition).
Unfortunately, I cannot tell you how Nutch finally handles such a page
in its index.
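
For example, a robots.txt at the site root like the following keeps a
compliant crawler from fetching the page in the first place (the path
is illustrative):

	User-agent: *
	Disallow: /onepage.html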


My two cents,

Martin

On Dec 19, 2007 1:04 AM, charlie w <sp...@gmail.com> wrote:
> I have a question about the proper interpretation of a noindex robots
> directive in a meta tag (<meta name="robots" content="noindex" />).
>
> When Nutch fetches such a page, the content, title, etc. of the page
> is not indexed, but the URL itself is.  The document is searchable by
> terms in the URL.  That is, if the URL of the page is
> http://www.mysite.com/onepage.html, the page is returned as a hit
> when searching for "onepage".
>
> Is it correct that Nutch does not index the content but still creates
> a Lucene document for a page with such a directive?  Intuitively it
> seems to me as if it should not be searchable at all.
>
> Thanks,
> Charlie
>