Posted to user@nutch.apache.org by Joshua J Pavel <jp...@us.ibm.com> on 2011/02/07 22:41:54 UTC
Nutch not respecting a NOINDEX,FOLLOW tag
I'm running Nutch 1.2.
I have a very simple page that I use to seed some URLs, but I don't want
the page itself returned in the index. It has this meta tag:
<head><META http-equiv="Content-Type" content="text/html; charset=UTF-8"><META NAME="ROBOTS" CONTENT="NOINDEX,FOLLOW"></head>
...but the page keeps showing up in my index. Any thoughts on how I can
troubleshoot this, or otherwise set up a page that I want to be crawled
but not indexed?
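For reference, here is a minimal sketch of how the CONTENT value of a ROBOTS meta tag breaks down into directives, assuming the usual convention of comma-separated, case-insensitive tokens. The names used (RobotsMetaSketch, parse, allowsIndex, allowsFollow) are made up for illustration and are not Nutch APIs. It only shows what NOINDEX,FOLLOW is expected to mean: follow the outlinks, but keep the page itself out of the index.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Illustrative only: not part of Nutch. Splits a ROBOTS meta CONTENT value
// into lower-cased directive tokens and answers two questions about it.
final class RobotsMetaSketch {
    private final Set<String> directives = new HashSet<>();

    private RobotsMetaSketch() {}

    // Parses e.g. "NOINDEX,FOLLOW" into the tokens "noindex" and "follow".
    static RobotsMetaSketch parse(String content) {
        RobotsMetaSketch result = new RobotsMetaSketch();
        if (content == null) {
            return result;
        }
        for (String token : content.split(",")) {
            String directive = token.trim().toLowerCase(Locale.ROOT);
            if (!directive.isEmpty()) {
                result.directives.add(directive);
            }
        }
        return result;
    }

    // Indexing is allowed unless "noindex" or "none" is present.
    boolean allowsIndex() {
        return !directives.contains("noindex") && !directives.contains("none");
    }

    // Following links is allowed unless "nofollow" or "none" is present.
    boolean allowsFollow() {
        return !directives.contains("nofollow") && !directives.contains("none");
    }
}
```

With the tag above, RobotsMetaSketch.parse("NOINDEX,FOLLOW") would report that following is allowed and indexing is not, which is the behaviour I was expecting from the crawl.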
Re: Nutch not respecting a NOINDEX,FOLLOW tag
Posted by Joshua J Pavel <jp...@us.ibm.com>.
JIRA 966 has been opened for this issue. And thank you, the custom
indexing filter works perfectly.
From: ".: Abhishek :." <ab...@gmail.com>
To: user@nutch.apache.org
Date: 02/08/2011 08:05 PM
Subject: Re: Nutch not respecting a NOINDEX,FOLLOW tag
Hi Julien,
Thanks! This actually answers the other question I asked some time back :)
Cheers,
Abi
Re: Nutch not respecting a NOINDEX,FOLLOW tag
Posted by ".: Abhishek :." <ab...@gmail.com>.
Hi Julien,
Thanks! This actually answers the other question I asked some time back :)
Cheers,
Abi
Re: Nutch not respecting a NOINDEX,FOLLOW tag
Posted by Julien Nioche <li...@gmail.com>.
Hi Joshua,
You can work around that by creating a custom indexing filter, e.g.
MetaNoIndexingFilter below:
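import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.indexer.NutchDocument;
import org.apache.nutch.parse.Parse;

/**
 * Prevents documents whose robots meta tag disallows indexing from being
 * indexed. By default Nutch simply empties the content and title fields, but
 * this is not enough to stop such documents from matching, e.g. on URL or
 * metatags.
 */
public class MetaNoIndexingFilter implements IndexingFilter {
  public static final Log LOG =
      LogFactory.getLog(MetaNoIndexingFilter.class);

  private Configuration conf;

  public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
      CrawlDatum datum, Inlinks inlinks) throws IndexingException {
    // should rely on doc or parse metadata but nothing stored
    // by the html parser
    String text = parse.getText();
    String title = parse.getData().getTitle();
    if ((text == null || text.isEmpty())
        && (title == null || title.isEmpty())) {
      // no text and no title -> no indexing; returning null drops the document
      return null;
    }
    return doc;
  }

  public void setConf(Configuration conf) {
    this.conf = conf;
  }

  public Configuration getConf() {
    return this.conf;
  }
}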
We should probably think about how to do that systematically, as the
current behaviour is slightly counterintuitive. Could you please open a
JIRA for this?
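One way the more systematic approach could look, as a rough sketch only: have the parser record the robots directive in the parse metadata and let the indexing filter consult it, falling back to the empty text/title check from the filter. Nothing here is an existing Nutch API. The plain Map stands in for parse metadata, and the key "robots.noindex" and the names NoIndexDecisionSketch and shouldIndex are hypothetical.

```java
import java.util.Map;

// Illustrative only: a stand-alone version of the decision the filter makes.
final class NoIndexDecisionSketch {
    // Hypothetical metadata key; per the comment in the filter, the HTML
    // parser does not currently store anything like it.
    static final String NOINDEX_KEY = "robots.noindex";

    private NoIndexDecisionSketch() {}

    // Returns false when the document should be dropped from the index.
    static boolean shouldIndex(Map<String, String> parseMeta, String text, String title) {
        // Preferred: an explicit flag recorded at parse time.
        if (parseMeta != null && "true".equalsIgnoreCase(parseMeta.get(NOINDEX_KEY))) {
            return false;
        }
        // Fallback: the heuristic from the filter, i.e. both fields were emptied.
        boolean noText = text == null || text.isEmpty();
        boolean noTitle = title == null || title.isEmpty();
        return !(noText && noTitle);
    }
}
```

The explicit flag would avoid dropping pages that are legitimately empty, which the text/title heuristic cannot tell apart from NOINDEX pages.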
Thanks
Julien
--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com