Posted to solr-user@lucene.apache.org by philippa griggs <ph...@hotmail.co.uk> on 2015/12/02 11:59:08 UTC

Protect against duplicates with the Migrate statement

Hello,


I'm using Solr 5.2.1 and Zookeeper 3.4.6.


I'm implementing two collections, HotDocuments and ColdDocuments. New documents will only be written to HotDocuments, and every night I will migrate a chunk of documents into ColdDocuments.


In the test environment, I have the Collection API migrate statement working fine. I know this won't handle duplicates ending up in the ColdDocuments collection. I don't expect to have duplicate documents, but I would like to protect against them, just in case.
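
For readers following along, here is a minimal sketch of how such a nightly migrate request might be assembled. It only builds the URL and sends nothing. The host, the helper name build_migrate_url and the split key "2015-12-01!" are assumptions for illustration; the thread does not say which routing prefix is used, and a split key only makes sense if the source collection uses composite-id routing.

```python
from urllib.parse import urlencode


def build_migrate_url(base_url, source, target, split_key, forward_timeout=None):
    """Build a Collections API MIGRATE request URL (hypothetical helper; nothing is sent)."""
    params = {
        "action": "MIGRATE",
        "collection": source,         # collection the documents are split from
        "target.collection": target,  # collection the documents are merged into
        "split.key": split_key,       # routing key prefix selecting what to move (assumed value)
        "wt": "json",
    }
    if forward_timeout is not None:
        # Seconds for which writes to the migrated key keep being forwarded to the target.
        params["forward.timeout"] = forward_timeout
    return base_url.rstrip("/") + "/admin/collections?" + urlencode(params)


# Assumed host and split key, for illustration only.
url = build_migrate_url(
    "http://localhost:8983/solr", "HotDocuments", "ColdDocuments", "2015-12-01!"
)
print(url)
```

Building the URL separately from sending it makes it easy to log or review the exact request before the nightly job runs it.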


We have a unique key and I've tried to implement de-duplication (https://cwiki.apache.org/confluence/display/solr/De-Duplication) but I still end up with duplicates in the ColdDocuments collection.



Does anyone have any suggestions on how I can protect against duplicates with the migrate statement?  Any ideas would be greatly appreciated.


Many thanks

Philippa

Re: Protect against duplicates with the Migrate statement

Posted by philippa griggs <ph...@hotmail.co.uk>.
I used two fields to set up the signature: the unique id and a timestamp field.

As it's in test, I set it up, cleared all the data out of both collections and reloaded it. I could see the signature that was created. I then migrated into the cold collection, which already had documents with the same unique id and signature.
I ended up with duplicates in the cold collection.

Thanks for your help,

Philippa

________________________________________
From: Zheng Lin Edwin Yeo <ed...@gmail.com>
Sent: 03 December 2015 02:30:31
To: solr-user@lucene.apache.org
Subject: Re: Protect against duplicates with the Migrate statement

Hi Philippa,

Which field did you set as the signatureField in your ColdDocuments
collection when you implemented the de-duplication?

Regards,
Edwin



Re: Protect against duplicates with the Migrate statement

Posted by Zheng Lin Edwin Yeo <ed...@gmail.com>.
Hi Philippa,

Which field did you set as the signatureField in your ColdDocuments
collection when you implemented the de-duplication?

Regards,
Edwin



Re: Protect against duplicates with the Migrate statement

Posted by Shalin Shekhar Mangar <sh...@gmail.com>.
Hi Philippa,

The migrate command actually splits the Lucene index from the source
and merges it into the target collection, whereas de-duplication is
applied only to incoming updates. So migrate works at a lower level
than de-duplication, and the two cannot work together. If you want
de-duplication, you have no option but to index the documents into the
target collection instead of using the migrate command.
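
The difference can be illustrated with a toy model. This is not Solr code: ToyCollection and its methods are hypothetical names, and the sketch assumes only what is described above, namely that the update path checks the unique key while an index merge copies documents in without any check.

```python
class ToyCollection:
    """Toy stand-in for a collection: just a list of stored documents."""

    def __init__(self, unique_key="id"):
        self.unique_key = unique_key
        self.docs = []

    def add(self, doc):
        """Update path: replace any existing document with the same unique key."""
        key = doc[self.unique_key]
        self.docs = [d for d in self.docs if d[self.unique_key] != key]
        self.docs.append(dict(doc))

    def merge_from(self, other):
        """Index-merge path: copy documents in as-is, with no key check."""
        self.docs.extend(dict(d) for d in other.docs)

    def count(self, key):
        """Number of stored documents carrying the given unique key."""
        return sum(1 for d in self.docs if d[self.unique_key] == key)


hot = ToyCollection()
hot.add({"id": "doc-1", "body": "newer copy"})

# Merging (what migrate does in this model) leaves both copies in place.
cold_merged = ToyCollection()
cold_merged.add({"id": "doc-1", "body": "older copy"})
cold_merged.merge_from(hot)

# Re-indexing through the update path overwrites the existing copy.
cold_reindexed = ToyCollection()
cold_reindexed.add({"id": "doc-1", "body": "older copy"})
for doc in hot.docs:
    cold_reindexed.add(doc)

print(cold_merged.count("doc-1"), cold_reindexed.count("doc-1"))
```

In this model the merged collection ends up with two copies of doc-1 and the re-indexed one with a single copy, which matches the duplicates seen after migrating into a collection that already held the same unique id.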




-- 
Regards,
Shalin Shekhar Mangar.