Posted to users@solr.apache.org by "Natarajan, Rajeswari" <ra...@sap.com> on 2021/04/09 22:38:48 UTC

Document routing with composite router

I am trying to understand how Solr co-locates documents that share a prefix when using the compositeId router.

I created a collection with 2 shards using the compositeId router and published 3 docs: 2 docs with the prefix "tenant1!" in the docId field and 1 doc with the prefix "tenant2!" in the docId.
I then queried the collection with the shards=shard1 and shards=shard2 parameters.

All 3 documents were placed in shard1, and shard2 has no documents. Is there a threshold number of docs that must be present in shard1 before shard2 is considered?

According to https://sematext.com/blog/solrcloud-large-tenants-and-routing/, documents with a first-level prefix are routed to one shard. Is it possible, with the compositeId router, to make the documents of one tenant occupy one shard of the collection?


Thanks,
Rajeswari

On 4/7/21, 2:07 PM, "Natarajan, Rajeswari" <ra...@sap.com> wrote:

    Thanks much for your reply.
    Thanks,
    Rajeswari

    On 4/7/21, 1:16 PM, "Shawn Heisey" <ap...@elyograg.org> wrote:

        On 4/7/2021 1:41 PM, Natarajan, Rajeswari wrote:
        > If there is any way to get the index size of a tenant in a collection where multiple tenants co-exist under the composite id router scheme, let me know.
        > We need to somehow track each tenant's index size to see if it grows too big, and document count is not proportional to index size in our case.

        There isn't any way to do that.  The way that Lucene's indexes are 
        designed, obtaining that information is currently impossible, and it 
        would likely take a VERY large amount of development effort to make it 
        possible.  I would guess that even if it were possible, obtaining that 
        information would be very expensive in terms of system resources and time.

        The best you can do with current technology is estimate the size based 
        on document count compared to the whole index.  But if each tenant has 
        very different kinds of data in the index, that method would probably 
        give you inaccurate information.

        One thing you could do to have each one be its own collection is set up 
        multiple cloud installs, which can share one zookeeper ensemble by using 
        different chroot values for each one, and only put a few hundred 
        collections in each cloud.  This would probably require a lot of 
        additional hardware, and because of Lucene's economies of scale that 
        Walter was talking about, multiple collections WILL be larger on disk 
        than multiple tenants in one collection.

        Thanks,
        Shawn



Re: Document routing with composite router

Posted by Shawn Heisey <ap...@elyograg.org>.
On 4/9/2021 4:38 PM, Natarajan, Rajeswari wrote:
> I am trying to understand how Solr co-locates documents that share a prefix when using the compositeId router.
> 
> I created a collection with 2 shards using the compositeId router and published 3 docs: 2 docs with the prefix "tenant1!" in the docId field and 1 doc with the prefix "tenant2!" in the docId.
> I then queried the collection with the shards=shard1 and shards=shard2 parameters.
> 
> All 3 documents were placed in shard1, and shard2 has no documents. Is there a threshold number of docs that must be present in shard1 before shard2 is considered?
> 
> According to https://sematext.com/blog/solrcloud-large-tenants-and-routing/, documents with a first-level prefix are routed to one shard. Is it possible, with the compositeId router, to make the documents of one tenant occupy one shard of the collection?

Composite routing like that does not exactly let you choose which shards 
will be used.

Here's a relevant quote from the reference guide:

'So "IBM/3!12345" will take 3 bits from the shard key and 29 bits from 
the unique doc id, spreading the tenant over 1/8th of the shards in the 
collection. Likewise if the num value was 2 it would spread the 
documents across 1/4th the number of shards. At query time, you include 
the prefix(es) along with the number of bits into your query with the 
_route_ parameter (i.e., q=solr&_route_=IBM/3!) to direct queries to 
specific shards.'
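
As an illustration of the query-time part of that quote, a request URL carrying the _route_ parameter could be built like this in Python. The host, port, collection name, and helper function are hypothetical; only the q and _route_ values come from the quote.

```python
from urllib.parse import parse_qs, urlencode, urlparse


def routed_query_url(base_url: str, collection: str, query: str, route: str) -> str:
    """Build a select URL that asks Solr to route the query by a shard key prefix.

    base_url and collection are placeholders for a real deployment.
    """
    # urlencode percent-encodes characters such as "/" and "!" in the route value.
    params = urlencode({"q": query, "_route_": route})
    return f"{base_url}/{collection}/select?{params}"


# Hypothetical host and collection name, for illustration only.
url = routed_query_url("http://localhost:8983/solr", "mycollection", "solr", "IBM/3!")
print(url)

# Decoding the query string gives back the original parameter values.
print(parse_qs(urlparse(url).query))
```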

The part before the ! is hashed, as is the part after it. Bits from 
the two hashes are then combined, and that combined hash decides which 
shard will get the document.
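
A minimal Python sketch of that combination follows. It rests on assumptions not stated above: that the router uses 32-bit MurmurHash3 with seed 0, that it takes 16 bits from the prefix when no /N is given, and that shard1..shardN split the signed 32-bit hash range evenly and in order. The function names and document ids are hypothetical, and three-part ids are not handled.

```python
def _rotl32(x: int, r: int) -> int:
    """Rotate a 32-bit value left by r bits."""
    return ((x << r) | (x >> (32 - r))) & 0xFFFFFFFF


def murmur3_32(data: bytes, seed: int = 0) -> int:
    """32-bit MurmurHash3 (x86 variant), returned as an unsigned int."""
    c1, c2 = 0xCC9E2D51, 0x1B873593
    h = seed & 0xFFFFFFFF
    n = len(data)
    body_end = n - n % 4
    # Process the input four bytes at a time, little-endian.
    for i in range(0, body_end, 4):
        k = int.from_bytes(data[i:i + 4], "little")
        k = (_rotl32((k * c1) & 0xFFFFFFFF, 15) * c2) & 0xFFFFFFFF
        h = (_rotl32(h ^ k, 13) * 5 + 0xE6546B64) & 0xFFFFFFFF
    # Mix in the remaining one to three bytes, if any.
    tail = data[body_end:]
    if tail:
        k = int.from_bytes(tail, "little")
        h ^= (_rotl32((k * c1) & 0xFFFFFFFF, 15) * c2) & 0xFFFFFFFF
    # Final avalanche.
    h ^= n
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & 0xFFFFFFFF
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & 0xFFFFFFFF
    h ^= h >> 16
    return h


def composite_hash(doc_id: str) -> int:
    """Combine the prefix hash and the suffix hash of 'prefix!suffix' or 'prefix/N!suffix'."""
    if "!" not in doc_id:
        return murmur3_32(doc_id.encode("utf-8"))
    prefix, suffix = doc_id.split("!", 1)
    bits = 16  # assumed default number of bits taken from the prefix
    if "/" in prefix:
        prefix, num = prefix.split("/", 1)
        bits = int(num)
    prefix_mask = (0xFFFFFFFF << (32 - bits)) & 0xFFFFFFFF  # top `bits` bits
    suffix_mask = 0xFFFFFFFF >> bits                        # remaining low bits
    return ((murmur3_32(prefix.encode("utf-8")) & prefix_mask)
            | (murmur3_32(suffix.encode("utf-8")) & suffix_mask))


def shard_for(doc_id: str, num_shards: int) -> str:
    """Map a document id to a shard name, assuming an even split of the hash range."""
    # Flipping the top bit orders hashes as signed 32-bit integers would be ordered,
    # so shard1 covers the most negative values.
    ordered = composite_hash(doc_id) ^ 0x80000000
    return "shard%d" % (ordered * num_shards // 2**32 + 1)


# Hypothetical ids using the two prefixes from the question.
for doc_id in ("tenant1!doc1", "tenant1!doc2", "tenant2!doc1"):
    print(doc_id, "->", shard_for(doc_id, 2))
```

Under these assumptions, with 2 shards only the top bit of the combined hash matters, and that bit comes from the tenant prefix. Every document of a tenant therefore lands on the same shard, and two tenants end up sharing a shard about half the time, with no document-count threshold involved.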

You can't say "use these specific shards" with that capability.  The 
tenant part just tells Solr to only use a certain reduced number of 
shards, but because it utilizes hashing to figure out which shards to 
use, there's never any guarantee that tenant1 will choose different 
shards from tenant2.  So you cannot use this to accomplish your original 
goal of determining the index size of a single tenant within a collection.

Thanks,
Shawn