Posted to solr-user@lucene.apache.org by Furkan KAMACI <fu...@gmail.com> on 2013/05/02 22:29:38 UTC

Pros and Cons of Using Deduplication of Solr at Huge Data Indexing

I use Solr 4.2.1 in SolrCloud mode. I crawl a large amount of data with
Nutch and index it with SolrCloud. I am wondering about Solr's deduplication
mechanism. What exactly does it do? Does it slow down indexing, or would it
be beneficial in my situation?

RE: Pros and Cons of Using Deduplication of Solr at Huge Data Indexing

Posted by Markus Jelsma <ma...@openindex.io>.

Distributed deduplication does not work right now:
https://issues.apache.org/jira/browse/SOLR-3473

We've chosen not to use update processors for deduplication anymore. Instead, we rely on several custom MapReduce jobs in Nutch and some custom collectors in Solr to do on-demand online deduplication.

If SOLR-3473 gets fixed, you can get very decent deduplication.
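
On the "what exactly does it do" part: update-processor deduplication computes a signature (a hash) over a configured set of fields and treats documents with the same signature as duplicates. Below is a minimal, single-process sketch of that idea in Python. It is not Solr's code; the names signature, ToyIndex and the choice of the title and content fields are assumptions made up for illustration, and it says nothing about the distributed case tracked in SOLR-3473.

```python
import hashlib


def signature(doc, fields):
    """Hash the chosen fields of a document into one hex digest.

    MD5 is used purely for illustration; the field list is an assumption.
    """
    h = hashlib.md5()
    for name in fields:
        # Separate name and value with NUL bytes so that adjacent
        # fields cannot run together and collide by accident.
        h.update(name.encode("utf-8"))
        h.update(b"\x00")
        h.update(str(doc.get(name, "")).encode("utf-8"))
        h.update(b"\x00")
    return h.hexdigest()


class ToyIndex:
    """Hypothetical in-memory index keyed by signature (not a Solr API)."""

    def __init__(self, fields):
        self.fields = fields
        self.by_signature = {}

    def add(self, doc):
        """Store the document; return True if it replaced a duplicate."""
        sig = signature(doc, self.fields)
        is_duplicate = sig in self.by_signature
        # The later document overwrites the earlier one with the same signature.
        self.by_signature[sig] = doc
        return is_duplicate


index = ToyIndex(fields=["title", "content"])
index.add({"id": "http://a.example/1", "title": "Hello", "content": "Same body"})
dup = index.add({"id": "http://b.example/1", "title": "Hello", "content": "Same body"})
```

In this sketch the extra work per document is one hash over the selected fields plus one lookup, so the cost grows with the amount of text that goes into the signature. Whether that is noticeable at your indexing volume is something you would have to measure.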

-----Original message-----
> From:Furkan KAMACI <fu...@gmail.com>
> Sent: Thu 02-May-2013 22:30
> To: solr-user@lucene.apache.org
> Subject: Pros and Cons of Using Deduplication of Solr at Huge Data Indexing
> 
> I use Solr 4.2.1 as SolrCloud. I crawl huge data with Nutch and index them
> with SolrCloud. I wonder about Solr's deduplication mechanism. What exactly
> it does and does it results with a slow indexing or is it beneficial for my
> situation?
>