Posted to user@nutch.apache.org by Eyeris Rodriguez Rueda <er...@uci.cu> on 2015/10/19 22:02:23 UTC

How to avoid duplicate pages in Nutch and Solr?

Hello all.
I am using Nutch 1.9 (local mode) and Solr 4.10.3.
I have found that some pages appear as duplicates in Solr: the URLs are different but the content is the same.
Here are two example URLs:

http://www.cubadebate.cu/noticias/2012/07/06/cientificos-espanoles-trabajan-en-gel-para-prevenir-el-sida/
http://www.cubadebate.cu/noticias/2012/07/06/cientificos-espanoles-trabajan-en-gel-para-prevenir-el-sida/comment-page-1/
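
The two URLs above differ only by a trailing comment-page-1/ segment. As a rough illustration of that relationship (this is not Nutch or Solr code), the sketch below collapses such URLs to a single key. The function name canonical_url and the regular expression are assumptions made for this example, and it only covers the comment-page pattern seen here.

```python
import re

# Assumed pattern: a trailing "comment-page-N/" path segment, as in the
# second example URL. Other kinds of duplicates are not handled.
COMMENT_PAGE = re.compile(r"comment-page-\d+/?$")


def canonical_url(url: str) -> str:
    """Strip a trailing comment-page-N segment (hypothetical helper)."""
    return COMMENT_PAGE.sub("", url)


urls = [
    "http://www.cubadebate.cu/noticias/2012/07/06/cientificos-espanoles-trabajan-en-gel-para-prevenir-el-sida/",
    "http://www.cubadebate.cu/noticias/2012/07/06/cientificos-espanoles-trabajan-en-gel-para-prevenir-el-sida/comment-page-1/",
]

# Both URLs reduce to the same article URL, so the set holds one entry.
unique = {canonical_url(u) for u in urls}
print(len(unique))  # prints 1
```

Whether a rule like this belongs in the crawler's URL normalization or in the index is exactly the open question of this post; the sketch only shows that the two URLs point at the same article.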

How does Nutch handle duplicate pages?
Should the solution be in Nutch or in Solr?
Can anybody suggest a way to avoid or solve this problem?