Posted to user@nutch.apache.org by Eyeris Rodriguez Rueda <er...@uci.cu> on 2015/10/22 17:28:52 UTC

Re: [MASSMAIL]RE: how to avoid duplicate pages in nutch and solr?

Thanks a lot, Markus, for your answer. It was very useful to me.
The problem with using DeduplicationJob in Nutch is that I have sometimes deleted the crawldb, so the duplicate pages now exist only in Solr.
I think the best solution for me has to be in Solr.
I was reading about deduplication in Solr, and the page below was very useful; it explains exactly what I need.

https://cwiki.apache.org/confluence/display/solr/De-Duplication

I have used TextProfileSignature (the fuzzy hashing implementation from Nutch for near-duplicate detection).
I will wait for Tika Boilerpipe to avoid repetitive page content.
I have detected two pages that have an identical signature.
Do you know of a mechanism to delete the older of these duplicate documents in Solr?
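
To make the question concrete, here is a minimal Python sketch of the selection step, not a tested solution. It assumes each document fetched from Solr carries a unique `id`, a signature field called `signature`, and an ISO 8601 UTC timestamp field called `tstamp`. Those field names and both function names are assumptions for illustration and should be adjusted to the actual schema.

```python
import json
from collections import defaultdict


def ids_of_older_duplicates(docs, sig_field="signature", time_field="tstamp"):
    """Return the ids of all documents except the newest one per signature."""
    groups = defaultdict(list)
    for doc in docs:
        sig = doc.get(sig_field)
        if sig:  # documents without a signature are left alone
            groups[sig].append(doc)

    to_delete = []
    for group in groups.values():
        # Newest first. ISO 8601 UTC strings in the same format sort chronologically.
        group.sort(key=lambda d: d[time_field], reverse=True)
        to_delete.extend(d["id"] for d in group[1:])
    return to_delete


def delete_payload(ids):
    """Build a JSON delete-by-id body for Solr's update handler."""
    return json.dumps({"delete": ids})
```

The resulting body would then be sent as an HTTP POST with Content-Type application/json to the core's /update handler, followed by a commit. The documents passed in would come from a query that returns the id, signature, and timestamp fields.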
