Posted to solr-user@lucene.apache.org by Bernd Fehling <be...@uni-bielefeld.de> on 2020/05/13 08:26:40 UTC

unique key across collections within datacenter

Dear list,

in my SolrCloud 6.6 I have a huge collection, and I will now get
much more data to index from a different source.
So I'm thinking about creating a new collection and combining both,
the existing one and the new one, with an alias.
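For reference, such a combined alias can be created with the Collections API's CREATEALIAS action. A minimal sketch of the request, assuming hypothetical host and collection names:

```python
from urllib.parse import urlencode

# Hypothetical host and collection names.
# CREATEALIAS is part of the Solr Collections API (present in 6.6);
# queries sent to the alias fan out over all member collections.
solr = "http://localhost:8983/solr"
params = urlencode({
    "action": "CREATEALIAS",
    "name": "all_docs",                              # query this alias
    "collections": "old_collection,new_collection",  # member collections
})
create_alias_url = solr + "/admin/collections?" + params
print(create_alias_url)
```

An HTTP GET on the resulting URL creates (or re-points) the alias; re-running it with a different collections list atomically swaps what the alias covers.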

But how do I handle the unique key across collections within a datacenter?
Is it possible at all?

I don't see any problems with add, update, and delete of documents, because
these operations do not use the alias.

But searching across collections via the alias and then fetching documents
by id from the result may lead to results where the same id exists in both collections?
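One way to tell which collection a hit came from is the [shard] document transformer: requesting it in fl= adds the URL of the shard that returned each document, and in SolrCloud that URL contains the collection name. A sketch of such a query, assuming a hypothetical alias name:

```python
from urllib.parse import urlencode

# Hypothetical alias name "all_docs". "[shard]" is a documented field
# transformer: listing it in fl= attaches the shard URL to each hit,
# which in SolrCloud embeds the collection name.
params = urlencode({
    "q": "*:*",
    "fl": "id,score,[shard]",
    "wt": "json",
})
query_url = "http://localhost:8983/solr/all_docs/select?" + params
print(query_url)
```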

I have no idea, but there are SolrClouds with a lot of collections out there.
How do they handle uniqueness across collections within a datacenter?

Regards
Bernd

Re: unique key across collections within datacenter

Posted by Erick Erickson <er...@gmail.com>.
So a doc in your new collection is expected to supersede a doc
with the same ID in the old one, right? 

What I’d do is delete the IDs from my old collection as they were added to
the new one; there’s not much use in keeping both if you always want
the new one.
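A sketch of that delete step, assuming hypothetical ids and collection names; Solr’s JSON update format accepts a "delete" key holding a list of ids:

```python
import json

# Hypothetical ids of documents that were just indexed into the new
# collection and should therefore be removed from the old one.
ids_added_to_new = ["doc-1", "doc-2", "doc-3"]

# Solr's JSON update syntax: {"delete": [id, id, ...]} deletes by uniqueKey.
# POST this body (Content-Type: application/json) to
# http://<host>:8983/solr/old_collection/update?commit=true
payload = json.dumps({"delete": ids_added_to_new})
print(payload)
```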

Let’s assume you do this. The next issue is making sure all of your docs in
the new collection are deleted from the old one, and your process will
inevitably have a hiccough or two. You could periodically use streaming to
produce a list of IDs common to both collections, and run an occasional
cleanup process to make up for any glitches in the normal
delete-from-the-old-collection process; see:
https://lucene.apache.org/solr/guide/6_6/stream-decorators.html#stream-decorators
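A sketch of such a cleanup query, with hypothetical collection names: an innerJoin of two id-sorted search streams emits exactly the ids present in both collections.

```python
# Hypothetical collection names. search() and innerJoin() are documented
# stream sources/decorators in the Solr 6.6 Reference Guide; both inputs
# to innerJoin must be sorted on the join key. POST the expression to
# http://<host>:8983/solr/old_collection/stream as the "expr" parameter.
expr = (
    'innerJoin('
    'search(old_collection, q="*:*", fl="id", sort="id asc", qt="/export"),'
    'search(new_collection, q="*:*", fl="id", sort="id asc", qt="/export"),'
    'on="id")'
)
print(expr)
```

The ids it returns are the ones that slipped through the normal delete step and still exist in both collections.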

If that’s not the case, then having the same id in the different collections
doesn’t matter. Solr doesn’t use the ID for combining results, just routing and
then updating.

This is illustrated by the fact that, through user error, you can even get the same
document repeated in a result set if it gets indexed to two different shards.

And if neither of those is on target, what do you think might go wrong
about “handling” unique IDs across the two collections?

Best,
Erick



Re: unique key across collections within datacenter

Posted by Bernd Fehling <be...@uni-bielefeld.de>.
Thanks Erick for your answer.

I was overthinking this and seeing problems that aren't there.

I have your second scenario. The first huge collection remains
and will grow further, while the second will start with the same schema but
content from a new source. Sure, I could also load the content
from the new source into the first huge collection, but I want to
keep the source, loading, and maintenance handling separated.
Maybe I'll also start the new collection on a new instance.

Regards
Bernd
