Posted to user@nutch.apache.org by Eyeris Rodriguez Rueda <er...@uci.cu> on 2016/10/26 20:34:04 UTC

Re: [MASSMAIL]RE: about canonical pages to avoid duplicates pages

Thanks Markus.
I think I will write a parse filter to add this logic. You are right, there are excellent comments there.
This is an interesting topic.
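For reference, the core of such a parse filter is extracting the canonical URL from the page markup. Below is a minimal standalone sketch of just that extraction step; it is not a Nutch plugin (a real one would implement Nutch's HtmlParseFilter and walk the parsed DOM rather than use a regex), and the class name and the assumption that rel appears before href are illustrative:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CanonicalExtractor {

    // Matches <link ... rel="canonical" ... href="..."> where rel precedes href.
    // A DOM-based filter would handle attribute order and whitespace robustly.
    private static final Pattern CANONICAL = Pattern.compile(
        "<link[^>]*rel=[\"']canonical[\"'][^>]*href=[\"']([^\"']+)[\"']",
        Pattern.CASE_INSENSITIVE);

    /** Returns the canonical URL declared in the HTML, or null if absent. */
    public static String extractCanonical(String html) {
        Matcher m = CANONICAL.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String html = "<html><head>"
            + "<link rel=\"canonical\" href=\"http://example.com/page\"/>"
            + "</head><body>duplicate content</body></html>";
        System.out.println(extractCanonical(html));
        // prints http://example.com/page
    }
}
```

In a plugin, the extracted URL could be stored in the parse metadata so an indexing filter can later swap it in as the document URL, or mark the page as a duplicate.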


----- Mensaje original -----
From: "Markus Jelsma" <ma...@openindex.io>
To: "user" <us...@nutch.apache.org>
Sent: Wednesday, 26 October 2016 16:27:06
Asunto: [MASSMAIL]RE: about canonical pages to avoid duplicates pages

Hello Eyeris - there is no such thing in Nutch right now, although I do seem to remember having a plugin that provides support for it, as well as support for it via HTTP headers and og:url, of course with normalizing and filtering, and using robots=noindex to prevent indexing duplicates.
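The HTTP header variant mentioned above reads the canonical URL from a Link response header (RFC 6596 / RFC 8288) instead of the HTML body. A hedged sketch of just that parsing step (the class name is illustrative; real plugin code would read the header from the fetched content's metadata):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkHeaderCanonical {

    // Matches <url>; rel="canonical" in an HTTP Link header value,
    // tolerating optional quotes and whitespace around the semicolon.
    private static final Pattern LINK = Pattern.compile(
        "<([^>]+)>\\s*;\\s*rel=\"?canonical\"?",
        Pattern.CASE_INSENSITIVE);

    /** Returns the canonical URL from a Link header value, or null. */
    public static String fromLinkHeader(String headerValue) {
        if (headerValue == null) return null;
        Matcher m = LINK.matcher(headerValue);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String header = "<http://example.com/canon>; rel=\"canonical\"";
        System.out.println(fromLinkHeader(header));
        // prints http://example.com/canon
    }
}
```

Checking the header first is cheap, since it avoids touching the HTML at all when the server already declares the canonical URL.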

You can also try to improve on the patch attached to NUTCH-710. There are excellent comments there for guidance.

M.
 
-----Original message-----
> From:Eyeris Rodriguez Rueda <er...@uci.cu>
> Sent: Wednesday 26th October 2016 22:01
> To: user@nutch.apache.org
> Subject: about canonical pages to avoid duplicates pages
> 
> Hi all.
> I'm using Nutch 1.12 and Solr 4.10.3 in local mode.
> I have detected a lot of duplicate pages in the crawldb. Maybe by using the canonical attribute I can reduce duplicate pages in the crawldb.
> I have read an old post (see below); it is an interesting topic.
> https://issues.apache.org/jira/browse/NUTCH-710 
> 
> Is this feature supported by Nutch or not?
> 
> 
>