You are viewing a plain text version of this content. The canonical link for it is here.

Posted to users@solr.apache.org by Hasmik Sarkezians <ha...@zoominfo.com.INVALID> on 2022/05/18 16:56:29 UTC

Re: [External Email] Re: Shard Split and composite id

Thanks for the reply.

It doesn't matter to me which shard the document ends up in, just matters
how many shards the document ends up with:

And seems like I wouldn't have control over that as the number of shards
grows.

thanks,
hasmik



On Wed, May 18, 2022 at 11:38 AM Shawn Heisey <ap...@elyograg.org> wrote:

> On 5/18/22 08:42, Hasmik Sarkezians wrote:
> > Have a question about shard splitting and compositeId usage. We are
> > starting a solr collection with X number of shards for our multi-tenant
> > application. We are assuming that the number of shards will increase over
> > time as the number of customers grows as well as the customer data.
> >
> > We are thinking of using the <customerId>/num!docId format to specify
> > multiple shards for my tenants depending on the number of records that we
> > will index. We will start with 4 shards and then my assumption is that we
> > use the shard split to add more shards to the collection.
> >
> > customer size X = 1 shard and as such the compositeId would be
> > customer1!docId
> > customer size 5*X = 2 shards and as such the compositeId would be
> > customer2/1!docId
> >
> > And now if I split the shards and the number of shards becomes 5, 6, 7, 8
> > what happens to the data? The point is I don't want the customer2 endup
> in
> > 4 shards when we get to have 8 shards. If someone can shed some light
> here
> > I would appreciate it.
>
> I wonder if you have a good understanding of how a compositeId works.
>
> The prefix does not directly dictate what shard a document will end up
> in.  It determines how many bits of the full 32-bit ID hash will be
> computed from the prefix and how many from the rest of the ID.
>
>
> https://solr.apache.org/guide/8_11/shards-and-indexing-data-in-solrcloud.html#document-routing
> <https://solr.apache.org/guide/8_11/shards-and-indexing-data-in-solrcloud.html#document-routing>
>
> Something not stated there is how many bits are used if the number is
> not specified.  Looking at the code, the default appears to be 16 if the
> number of parts in the ID is 2, and 8 if the number of parts in the ID
> is 3.  I don't think it supports more than 3 parts.
>
> When you split a shard, the hash range for the shard will be split, and
> the range for the new shards will be smaller than any other shards that
> were not split.  So it may not be completely predictable which shards a
> composite ID will be stored in when you split them.  If you split ALL
> shards in half, then a prefix that limited the number of shards to 2
> could result in those documents being split across 4 shards, but
> depending on how many documents there are with that prefix and EXACTLY
> how the hashes end up being divided, it could be as low as 2 shards and
> as high as 4.  If there are a lot of documents with that prefix, chances
> are that it would be 4 shards.
>
> If you want explicit control over which shard a document ends up in, you
> cannot use compositeId.  You'll have to use the implicit router and
> designate a field where the name of the shard will go.  I don't think
> splitting shards is possible with the implicit router.
>
> Thanks,
> Shawn
>


-- 

Hasmik Sarkezians

VP, Applications

M:

O:

E: hasmik.sarkezians@zoominfo.com

805 Broadway Street, Suite 900
Vancouver, WA 98660

www.zoominfo.com




[image: Start Learning with Zoominfo!]
<https://signatures.zoominfo.com/uc/6228bfdaf1c5381fbaa4708b/c_604aad8859ca88009e007950/b_607da8e5f7d89f0025e881fa>

Re: [External Email] Re: Shard Split and composite id

Posted by Shawn Heisey <ap...@elyograg.org>.

On 5/18/22 10:56, Hasmik Sarkezians wrote:
> Thanks for the reply.
>
> It doesn't matter to me which shard the document ends up in, just matters
> how many shards the document ends up with:
>
> And seems like I wouldn't have control over that as the number of shards
> grows.

I've been thinking about some documentation to provide more detail about 
how this works.

Here's some of what I would include in that documentation:

Lets say that the prefix is IBM, and we're using composite IDs like 
IBM!12345in the index.  This means that the prefix will define 16 bits 
of the hash and the rest of the ID will define the other 16 bits.  Each 
shard has a defined hash range, and the combined hash will determine 
which shard the document ends up on.

The effective result of this if there are a lot of documents means that 
each prefix will land on 1/65536 (65536 being 2 to the power of 16) of 
the shards.  So if you have 65536 or fewer shards, then every document 
with that prefix will end up on only one shard. But you can't directly 
control which shard gets a given prefix -- that will be decided by what 
hash value the prefix generates. Unless there a lot of prefixes and a 
lot of documents in each prefix, there could be an imbalance of 
documents across the different shards, and that imbalance could be very 
extreme.

If you specify the number of bits with something like "IBM/3!12345" then 
for that example each prefix will end up on 1/8 of your shards.  (8 
being 2 to the power of 3)

Shard splitting can make things very complicated, unless you split ALL 
shards to the same number of target shards.  If you do a completely 
uniform split, then the same rules I just mentioned will apply to the 
new shards.

Thanks,
Shawn