Posted to user@nutch.apache.org by Joshua J Pavel <jp...@us.ibm.com> on 2011/02/07 22:41:54 UTC

Nutch not respecting a NOINDEX,FOLLOW tag

Running version 1.2.

I have a very simple page that I use to seed some URLs but don't want
returned in the index itself. It has this meta tag:
<head><META http-equiv="Content-Type" content="text/html; charset=UTF-8"><META NAME="ROBOTS" CONTENT="NOINDEX,FOLLOW"></head>

...but the page keeps showing up in my index.  Any thoughts on how I can
troubleshoot this or otherwise implement a page that I want to be crawled
but not indexed?

Re: Nutch not respecting a NOINDEX,FOLLOW tag

Posted by Joshua J Pavel <jp...@us.ibm.com>.

JIRA 966 has been opened for this issue.  And thank you, the custom
indexing filter works perfectly.

From:    ".: Abhishek :." <ab...@gmail.com>
To:      user@nutch.apache.org
Date:    02/08/2011 08:05 PM
Subject: Re: Nutch not respecting a NOINDEX,FOLLOW tag

Hi Julien,

Thanks! This actually answers the other question I asked some time back :)

Cheers,
Abi



Re: Nutch not respecting a NOINDEX,FOLLOW tag

Posted by ".: Abhishek :." <ab...@gmail.com>.

Hi Julien,

Thanks! This actually answers the other question I asked some time back :)

Cheers,
Abi



Re: Nutch not respecting a NOINDEX,FOLLOW tag

Posted by Julien Nioche <li...@gmail.com>.

Hi Joshua

You can work around that by creating a custom indexing filter, e.g. the
MetaNoIndexingFilter below:

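import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.indexer.NutchDocument;
import org.apache.nutch.parse.Parse;

/**
 * Prevents documents whose robots meta tag disallows indexing from being
 * indexed. By default Nutch simply empties the content and title fields,
 * but this is not enough to stop such documents from matching, e.g. on
 * URL or metatags.
 */
public class MetaNoIndexingFilter implements IndexingFilter {
    public static final Log LOG = LogFactory.getLog(MetaNoIndexingFilter.class);

    private Configuration conf;

    // Depending on the Nutch version, IndexingFilter may declare further
    // methods (e.g. addIndexBackendOptions(Configuration) in some 1.x
    // releases); implement them as needed.

    public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
            CrawlDatum datum, Inlinks inlinks) throws IndexingException {
        // should rely on doc or parse metadata but nothing stored
        // by the html parser
        String text = parse.getText();
        String title = parse.getData().getTitle();
        if ((text == null || text.equals(""))
                && (title == null || title.equals(""))) {
            // no text and no title -> returning null drops the document
            return null;
        }
        return doc;
    }

    public void setConf(Configuration conf) {
        this.conf = conf;
    }

    public Configuration getConf() {
        return this.conf;
    }

}
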
We should probably think about how to do that systematically, as the
current behaviour is slightly counter-intuitive. Could you please open a
JIRA for this?
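
The rule the filter applies can be tried without a Nutch install. The
following is only a minimal sketch of that rule: the class name
NoIndexCheck and the methods isEmpty and shouldIndex are hypothetical
names made up for illustration and are not part of Nutch.

```java
// Hypothetical stand-alone sketch; not part of Nutch.
class NoIndexCheck {

    /** True if the string is null or empty, the same test the filter uses. */
    static boolean isEmpty(String s) {
        return s == null || s.equals("");
    }

    /**
     * Mirrors the filter's rule: a document is skipped only when both
     * the parsed text and the title are empty.
     */
    static boolean shouldIndex(String text, String title) {
        return !(isEmpty(text) && isEmpty(title));
    }

    public static void main(String[] args) {
        System.out.println(shouldIndex("", ""));            // false
        System.out.println(shouldIndex(null, "Seed page")); // true
        System.out.println(shouldIndex("body text", null)); // true
    }
}
```

Because the rule only looks at empty text and title, a page that really has
neither would be dropped as well, whether or not it carries a NOINDEX tag.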

Thanks

Julien







-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com