Posted to solr-user@lucene.apache.org by Marcin Rzewucki <mr...@gmail.com> on 2013/03/13 22:34:56 UTC

Rejecting document already existing in different shard.

Hi there,

Let's say we use a custom hashing algorithm and a document is already
indexed in "shard1". Some time later the same document changes and, because
of the routing rules used in our indexing program, it should now be indexed
to "shard2". It is indexed without issues, and as a result two nearly
identical documents end up in different shards. In my case, they are
duplicates for the end user. Is it possible to reject a document if it
already exists in a different shard? It would be even easier to handle such
cases before the new document with the same ID is added.

Regards.

Re: Rejecting document already existing in different shard.

Posted by Dmitry Kan <so...@gmail.com>.
Hi,

Although we use logical sharding, we also run into the cases you describe in
our environment. We handle them manually:

0. prepare the new version of the document
1. remove the old version of the document
2. post the new version and commit

With logical sharding it is relatively easy, but we do need to store
location metadata in a DB.
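
The manual steps above, combined with a location lookup, could be sketched
roughly as follows. This is only an illustration under assumptions: the
doc_location table, the open_registry and plan_reindex helpers, and the shard
names are all hypothetical, and an in-memory SQLite database stands in for the
real DB. The code only builds the JSON update bodies (delete by id, then add);
sending them to each shard's update handler and committing is left out.

```python
import json
import sqlite3


def open_registry():
    """Create an in-memory table mapping document id -> shard name (hypothetical schema)."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE doc_location (id TEXT PRIMARY KEY, shard TEXT NOT NULL)"
    )
    return db


def plan_reindex(db, doc, new_shard):
    """Return the ordered (shard, json_body) update requests for one document.

    If the registry says the document currently lives in a different shard,
    a delete for that shard is planned before the add, so only one copy remains.
    """
    row = db.execute(
        "SELECT shard FROM doc_location WHERE id = ?", (doc["id"],)
    ).fetchone()
    old_shard = row[0] if row else None

    requests = []
    if old_shard is not None and old_shard != new_shard:
        # Step 1: remove the old version from the shard that still holds it.
        requests.append((old_shard, json.dumps({"delete": {"id": doc["id"]}})))
    # Step 2: post the new version to the shard chosen by the routing rules.
    requests.append((new_shard, json.dumps({"add": {"doc": doc}})))

    # Record the new location. A real indexer would probably do this only
    # after the update requests have succeeded and been committed.
    db.execute(
        "INSERT OR REPLACE INTO doc_location (id, shard) VALUES (?, ?)",
        (doc["id"], new_shard),
    )
    db.commit()
    return requests


db = open_registry()
plan_reindex(db, {"id": "doc-1", "title": "first version"}, "shard1")
for shard, body in plan_reindex(db, {"id": "doc-1", "title": "changed"}, "shard2"):
    print(shard, body)
```

The second call plans a delete against "shard1" followed by an add against
"shard2", which mirrors steps 1 and 2.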

In your case, have you had a look at this:

http://wiki.apache.org/solr/Deduplication

Another thing that comes to mind: store the hashing parameters used for each
document, then use them to link a new version to the previously indexed
"same" document.
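
One possible reading of that idea, as a minimal sketch: keep the routing key
that was hashed for each document, and compare the shard it maps to with the
shard the new key maps to. Everything here is an assumption for illustration:
shard_for is a made-up MD5-modulo hash rather than the custom algorithm from
the original question, and stored_params is a plain dict standing in for
wherever the parameters would really be kept.

```python
import hashlib


def shard_for(routing_key, num_shards):
    """Hypothetical custom hash: map a routing key to a shard name."""
    digest = hashlib.md5(routing_key.encode("utf-8")).hexdigest()
    return "shard%d" % (int(digest, 16) % num_shards + 1)


def find_stale_shard(stored_params, doc_id, new_routing_key, num_shards):
    """Return the shard holding an outdated copy of doc_id, or None.

    stored_params maps a document id to the routing key that was used the
    last time the document was indexed; it is updated with the new key.
    """
    old_key = stored_params.get(doc_id)
    stored_params[doc_id] = new_routing_key  # remember for the next update
    if old_key is None:
        return None  # first time we see this document
    old_shard = shard_for(old_key, num_shards)
    new_shard = shard_for(new_routing_key, num_shards)
    # Only report a stale copy when the document actually moved.
    return old_shard if old_shard != new_shard else None


params = {}
find_stale_shard(params, "doc-1", "customerA", 4)
stale = find_stale_shard(params, "doc-1", "customerB", 4)
print("delete old copy from:", stale)
```

If a shard name comes back, the old copy there can be deleted before the new
version is posted.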

Dmitry


On Wed, Mar 13, 2013 at 11:34 PM, Marcin Rzewucki <mr...@gmail.com> wrote:

> Hi there,
>
> Let's say we use custom hashing algorithm and there is a document already
> indexed in "shard1". After some time the same document has changed and
> should be indexed to "shard2" (because of routing rules used in indexing
> program). It has been indexed without issues and as a result 2 "almost" the
> same documents are in different shards. In my case, they are duplicates for
> the end user. Is it possible to reject a document if it already exists in
> different shard ? It would be even easier to handle such cases prior to
> adding new with the same ID.
>
> Regards.
>