Posted to solr-user@lucene.apache.org by "kunhu0404@gmail.com" <ku...@gmail.com> on 2018/08/29 12:02:07 UTC

Solr indexing Duplicate URL's ending with /

Team,

I need a suggestion on how to remove duplicate entries while indexing to
Solr. Below are sample entries I see in the Solr collection; I need to
remove the one ending with /

https://www.abc.com/2018/test.html
https://www.abc.com/2018/test.html/


Thank you



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html

Re: Solr indexing Duplicate URL's ending with /

Posted by Jan Høydahl <ja...@cominvent.com>.
Hi,

You would have to direct this question to the crawler you are using, since it is the crawler that decides the document ID to send to Solr. Most crawlers will have configuration options to normalize the URL for each document.
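
To illustrate the kind of normalization meant here, below is a minimal sketch in Python of stripping the trailing slash before a URL is used as the document ID. It is not taken from any particular crawler: the function name normalize_url is hypothetical, and it assumes that a path with and without a trailing slash always refers to the same page, which is not guaranteed on every site.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Hypothetical helper: drop trailing slashes from the URL path.

    Assumption: "/2018/test.html" and "/2018/test.html/" are the same page.
    """
    parts = urlsplit(url)
    # Strip trailing slashes, but keep a bare "/" for the site root.
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))


urls = [
    "https://www.abc.com/2018/test.html",
    "https://www.abc.com/2018/test.html/",
]

# Both variants collapse to a single ID after normalization.
unique_ids = sorted({normalize_url(u) for u in urls})
print(unique_ids)
```

If the normalized value is used as the uniqueKey, the second variant would overwrite the first instead of creating a new document.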

However, you could also try to clean the URL after it arrives in Solr. See URLClassifyProcessor at https://lucene.apache.org/solr/guide/7_2/update-request-processors.html#general-use-updateprocessorfactories, which may help.

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
