Posted to dev@nutch.apache.org by weishenyun <wl...@yahoo.com.cn> on 2012/08/16 06:27:30 UTC

Can Nutch process rel-tag likes rel="nofollow"?

I know Nutch crawls a website according to the robots protocol if it is
configured to do so, and that it will not fetch and parse the links on a
page that contains <meta name="robots" content="nofollow">. But can Nutch
also process rel values such as rel="nofollow" in the <a> ...... </a> tags
on the page?



--
View this message in context: http://lucene.472066.n3.nabble.com/Can-Nutch-process-rel-tag-likes-rel-nofollow-tp4001541.html
Sent from the Nutch - Dev mailing list archive at Nabble.com.

Re: Can Nutch process rel-tag likes rel="nofollow"?

Posted by Lewis John Mcgibbney <le...@gmail.com>.
My bad

Thank you Markus and sorry if I caused any confusion :0|

Lewis

On Thu, Aug 16, 2012 at 9:20 AM, Markus Jelsma
<ma...@openindex.io> wrote:
> Yes, this is supported in trunk and will still be supported when switching to Tika for outlink extraction. Anchors with NOFOLLOW will simply be discarded.
>
>
> -----Original message-----
>> From:Lewis John Mcgibbney <le...@gmail.com>
>> Sent: Thu 16-Aug-2012 10:12
>> To: dev@nutch.apache.org
>> Subject: Re: Can Nutch process rel-tag likes rel="nofollow"?
>>
>> Currently it looks like we don't have full support for such
>> functionality. It is straightforward to grab the nofollow rel tag, but
>> the post-processing is not currently implemented, so you would
>> need to do this yourself.
>>
>> Lewis
>>
>> On Thu, Aug 16, 2012 at 5:27 AM, weishenyun <wl...@yahoo.com.cn> wrote:
>> > I know Nutch crawls a website according to the robots protocol if it is
>> > configured to do so, and that it will not fetch and parse the links on a
>> > page that contains <meta name="robots" content="nofollow">. But can Nutch
>> > also process rel values such as rel="nofollow" in the <a> ...... </a> tags
>> > on the page?
>>
>>
>>
>> --
>> Lewis
>>



-- 
Lewis

Re: Can Nutch process rel-tag likes rel="nofollow"?

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Yep, it's there all right.

Thanks Markus

On Thu, Aug 16, 2012 at 9:38 AM, Markus Jelsma
<ma...@openindex.io> wrote:
> Yes! It is in 1.x (for a looong time) and must be in 2.x as well. I can't find any reason why it shouldn't be, but you can always check the TikaParser.java code; it should be somewhere near the bottom of the source.
>
>
>
> -----Original message-----
>> From:weishenyun <wl...@yahoo.com.cn>
>> Sent: Thu 16-Aug-2012 10:36
>> To: dev@nutch.apache.org
>> Subject: RE: Can Nutch process rel-tag likes rel="nofollow"?
>>
>> Do you mean this feature is supported in trunk? In which Nutch version:
>> Nutch 1.5.1 or Nutch 2.0?
>>



-- 
Lewis

RE: Can Nutch process rel-tag likes rel="nofollow"?

Posted by weishenyun <wl...@yahoo.com.cn>.
I found it.
Thank you so much, Markus!



--
View this message in context: http://lucene.472066.n3.nabble.com/Can-Nutch-process-rel-tag-likes-rel-nofollow-tp4001541p4001588.html
Sent from the Nutch - Dev mailing list archive at Nabble.com.

RE: Can Nutch process rel-tag likes rel="nofollow"?

Posted by Markus Jelsma <ma...@openindex.io>.
I've checked it, the source is in DOMContentUtils. Anchors with rel="nofollow" are discarded.
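
To illustrate the per-link behaviour described above, here is a minimal, self-contained sketch of skipping anchors marked rel="nofollow" while walking a DOM. It is not the actual DOMContentUtils code: the class NofollowSketch and its methods are hypothetical names, and it uses the JDK's XML parser on well-formed markup purely for demonstration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical illustration only; not taken from the Nutch source tree.
class NofollowSketch {

  // True if the anchor's rel attribute contains the token "nofollow".
  // rel is a space-separated token list and is matched case-insensitively.
  static boolean isNofollow(Element anchor) {
    String rel = anchor.getAttribute("rel"); // "" when the attribute is absent
    for (String token : rel.trim().split("\\s+")) {
      if (token.equalsIgnoreCase("nofollow")) {
        return true;
      }
    }
    return false;
  }

  // Depth-first walk that collects href values, discarding nofollow anchors.
  static void collectLinks(Node node, List<String> out) {
    if (node.getNodeType() == Node.ELEMENT_NODE
        && "a".equalsIgnoreCase(node.getNodeName())) {
      Element anchor = (Element) node;
      String href = anchor.getAttribute("href");
      if (!href.isEmpty() && !isNofollow(anchor)) {
        out.add(href);
      }
    }
    NodeList children = node.getChildNodes();
    for (int i = 0; i < children.getLength(); i++) {
      collectLinks(children.item(i), out);
    }
  }

  // Parses well-formed (X)HTML and returns the links that may be followed.
  static List<String> extract(String xhtml) {
    try {
      Document doc = DocumentBuilderFactory.newInstance()
          .newDocumentBuilder()
          .parse(new InputSource(new StringReader(xhtml)));
      List<String> links = new ArrayList<>();
      collectLinks(doc.getDocumentElement(), links);
      return links;
    } catch (Exception e) {
      throw new IllegalArgumentException("could not parse markup", e);
    }
  }

  public static void main(String[] args) {
    String page = "<html><body>"
        + "<a href=\"http://www.google.com\" rel=\"nofollow\">skipped</a>"
        + "<a href=\"http://nutch.apache.org/\">kept</a>"
        + "</body></html>";
    System.out.println(extract(page)); // prints [http://nutch.apache.org/]
  }
}
```

The check is made per anchor, so other links on the same page are still collected.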
 
 
-----Original message-----
> From:weishenyun <wl...@yahoo.com.cn>
> Sent: Thu 16-Aug-2012 11:09
> To: dev@nutch.apache.org
> Subject: RE: Can Nutch process rel-tag likes rel="nofollow"?
> 
> Well, I have read the TikaParser.java code in Nutch 1.x and Nutch 2.0, and
> I can easily find code like the following.
> 
> if (!metaTags.getNoFollow()) { // okay to follow links
>   ArrayList<Outlink> l = new ArrayList<Outlink>(); // extract outlinks
>   URL baseTag = utils.getBase(root);
>   if (LOG.isTraceEnabled()) {
>     LOG.trace("Getting links...");
>   }
>   utils.getOutlinks(baseTag != null ? baseTag : base, l, root);
>   outlinks = l.toArray(new Outlink[l.size()]);
>   if (LOG.isTraceEnabled()) {
>     LOG.trace("found " + outlinks.length + " outlinks in " + base);
>   }
> }
> 
> But I think this code handles nofollow or noindex in the robots meta
> tags, for example <meta name="robots" content="nofollow"> or <meta
> name="robots" content="noindex">. Those tags control all the links on
> that page.
> 
> My problem concerns a single link on a page, such as <a
> href="http://www.google.com" rel="nofollow">. In this case, will Nutch
> discard the link because of its rel="nofollow" attribute?
> Thanks Markus. 
> 

RE: Can Nutch process rel-tag likes rel="nofollow"?

Posted by weishenyun <wl...@yahoo.com.cn>.
Well, I have read the TikaParser.java code in Nutch 1.x and Nutch 2.0, and
I can easily find code like the following.

if (!metaTags.getNoFollow()) { // okay to follow links
  ArrayList<Outlink> l = new ArrayList<Outlink>(); // extract outlinks
  URL baseTag = utils.getBase(root);
  if (LOG.isTraceEnabled()) {
    LOG.trace("Getting links...");
  }
  utils.getOutlinks(baseTag != null ? baseTag : base, l, root);
  outlinks = l.toArray(new Outlink[l.size()]);
  if (LOG.isTraceEnabled()) {
    LOG.trace("found " + outlinks.length + " outlinks in " + base);
  }
}

But I think this code handles nofollow or noindex in the robots meta
tags, for example <meta name="robots" content="nofollow"> or <meta
name="robots" content="noindex">. Those tags control all the links on
that page.
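
As a rough illustration of the page-level case, the sketch below parses the content attribute of a robots meta tag into noindex/nofollow flags. The class RobotsMetaSketch is a hypothetical name and is not Nutch's own meta tag handling; it only shows why a single such tag affects every link on the page.

```java
import java.util.Locale;

// Hypothetical illustration only; not taken from the Nutch source tree.
class RobotsMetaSketch {
  final boolean noIndex;
  final boolean noFollow;

  // content is the value of <meta name="robots" content="...">,
  // a comma-separated, case-insensitive list of directives.
  RobotsMetaSketch(String content) {
    boolean index = false;
    boolean follow = false;
    for (String part : content.toLowerCase(Locale.ROOT).split(",")) {
      String directive = part.trim();
      if (directive.equals("noindex")) {
        index = true;
      } else if (directive.equals("nofollow")) {
        follow = true;
      } else if (directive.equals("none")) { // shorthand for noindex, nofollow
        index = true;
        follow = true;
      }
    }
    this.noIndex = index;
    this.noFollow = follow;
  }

  public static void main(String[] args) {
    RobotsMetaSketch meta = new RobotsMetaSketch("nofollow");
    // One page-level flag gates outlink extraction for the whole page.
    if (!meta.noFollow) {
      System.out.println("extract outlinks");
    } else {
      System.out.println("skip all outlinks on this page");
    }
  }
}
```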

My problem concerns a single link on a page, such as <a
href="http://www.google.com" rel="nofollow">. In this case, will Nutch
discard the link because of its rel="nofollow" attribute?
Thanks Markus. 



--
View this message in context: http://lucene.472066.n3.nabble.com/Can-Nutch-process-rel-tag-likes-rel-nofollow-tp4001541p4001582.html
Sent from the Nutch - Dev mailing list archive at Nabble.com.

RE: Can Nutch process rel-tag likes rel="nofollow"?

Posted by Markus Jelsma <ma...@openindex.io>.
Yes! It is in 1.x (for a looong time) and must be in 2.x as well. I can't find any reason why it shouldn't be, but you can always check the TikaParser.java code; it should be somewhere near the bottom of the source.

 
 
-----Original message-----
> From:weishenyun <wl...@yahoo.com.cn>
> Sent: Thu 16-Aug-2012 10:36
> To: dev@nutch.apache.org
> Subject: RE: Can Nutch process rel-tag likes rel="nofollow"?
> 
> Do you mean this feature is supported in trunk? In which Nutch version:
> Nutch 1.5.1 or Nutch 2.0?
> 

RE: Can Nutch process rel-tag likes rel="nofollow"?

Posted by weishenyun <wl...@yahoo.com.cn>.
Do you mean this feature is supported in trunk? In which Nutch version:
Nutch 1.5.1 or Nutch 2.0?



--
View this message in context: http://lucene.472066.n3.nabble.com/Can-Nutch-process-rel-tag-likes-rel-nofollow-tp4001541p4001570.html
Sent from the Nutch - Dev mailing list archive at Nabble.com.

RE: Can Nutch process rel-tag likes rel="nofollow"?

Posted by Markus Jelsma <ma...@openindex.io>.
Yes, this is supported in trunk and will still be supported when switching to Tika for outlink extraction. Anchors with NOFOLLOW will simply be discarded.
 
 
-----Original message-----
> From:Lewis John Mcgibbney <le...@gmail.com>
> Sent: Thu 16-Aug-2012 10:12
> To: dev@nutch.apache.org
> Subject: Re: Can Nutch process rel-tag likes rel="nofollow"?
> 
> Currently it looks like we don't have full support for such
> functionality. It is straightforward to grab the nofollow rel tag, but
> the post-processing is not currently implemented, so you would
> need to do this yourself.
> 
> Lewis
> 
> On Thu, Aug 16, 2012 at 5:27 AM, weishenyun <wl...@yahoo.com.cn> wrote:
> > I know Nutch crawls a website according to the robots protocol if it is
> > configured to do so, and that it will not fetch and parse the links on a
> > page that contains <meta name="robots" content="nofollow">. But can Nutch
> > also process rel values such as rel="nofollow" in the <a> ...... </a> tags
> > on the page?
> 
> 
> 
> -- 
> Lewis
> 

Re: Can Nutch process rel-tag likes rel="nofollow"?

Posted by weishenyun <wl...@yahoo.com.cn>.
Are there any plugin extension points related to this problem? Or should I
modify the Nutch source code, perhaps in ParserJob? Thanks very much!



--
View this message in context: http://lucene.472066.n3.nabble.com/Can-Nutch-process-rel-tag-likes-rel-nofollow-tp4001541p4001566.html
Sent from the Nutch - Dev mailing list archive at Nabble.com.

Re: Can Nutch process rel-tag likes rel="nofollow"?

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Currently it looks like we don't have full support for such
functionality. It is straightforward to grab the nofollow rel tag, but
the post-processing is not currently implemented, so you would
need to do this yourself.

Lewis

On Thu, Aug 16, 2012 at 5:27 AM, weishenyun <wl...@yahoo.com.cn> wrote:
> I know Nutch crawls a website according to the robots protocol if it is
> configured to do so, and that it will not fetch and parse the links on a
> page that contains <meta name="robots" content="nofollow">. But can Nutch
> also process rel values such as rel="nofollow" in the <a> ...... </a> tags
> on the page?



-- 
Lewis