Posted to solr-user@lucene.apache.org by Shamik Bandopadhyay <sh...@gmail.com> on 2015/03/26 18:26:55 UTC

Uneven index distribution using composite router

Hi,

   I'm using a three-level composite router in a SolrCloud environment,
primarily for multi-tenancy and field collapsing. The format is as follows.

*language!topic!url*.

An example would be :

ENU!12345!www.testurl.com/enu/doc1
GER!12345!www.testurl.com/ger/doc2
CHS!67890!www.testurl.com/chs/doc3

The SolrCloud cluster contains 2 shards, each having 3 replicas. After
indexing around 10 million documents, I'm observing that the index size in
shard 1 is around 60 GB while shard 2 is 15 GB. So the bulk of the data is
getting indexed in shard 1. Since 60% of the documents are English, I expect
the index size to be higher on one shard, but the difference seems a little
too high.

The idea is to make sure that all ENU!12345 documents are routed to one
shard so that distributed field collapsing works. Is there something I can
do differently here to get a more even distribution?
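
For reference, here is a rough sketch of how a three-level compositeId key
is turned into a hash. As far as I understand the documented default, 8 bits
come from the first component, 8 from the second, and 16 from the third.
The sketch makes two simplifying assumptions: zlib.crc32 stands in for the
MurmurHash3 function Solr actually uses, and the shards are assumed to cover
equal, contiguous hash ranges. Real shard numbers will therefore differ; the
point is only that with 2 shards the top bit decides, and that bit comes
from the language alone.

```python
import zlib

def h32(s: str) -> int:
    # Stand-in 32-bit hash. Solr uses MurmurHash3; crc32 is illustrative only.
    return zlib.crc32(s.encode("utf-8")) & 0xFFFFFFFF

def composite_hash(doc_id: str) -> int:
    # Split "language!topic!url" and keep 8 / 8 / 16 bits per component.
    lang, topic, url = doc_id.split("!", 2)
    return ((h32(lang) & 0xFF000000)
            | (h32(topic) & 0x00FF0000)
            | (h32(url) & 0x0000FFFF))

def shard_for(doc_id: str, num_shards: int = 2) -> int:
    # Assumes equal-width contiguous hash ranges (hypothetical layout),
    # so the highest bits of the hash pick the shard.
    return (composite_hash(doc_id) * num_shards) >> 32

if __name__ == "__main__":
    for key in ("ENU!12345!www.testurl.com/enu/doc1",
                "GER!12345!www.testurl.com/ger/doc2",
                "CHS!67890!www.testurl.com/chs/doc3"):
        print(key, "-> shard", shard_for(key))
```

Under these assumptions every ENU document lands on the same shard no matter
what the topic or url is, which would explain a skew that tracks the language
mix.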

Any pointers will be appreciated.

Regards,
Shamik

Re: Uneven index distribution using composite router

Posted by shamik <sh...@gmail.com>.
Thanks for your reply, Erick.

In my case, I have 14 languages, and 50% of the documents are in English.
German and CHS will probably constitute another 25%. I'm not using
copyField; rather, each language has its dedicated fields such as title_enu,
text_enu, title_ger, text_ger, etc. Since I know the language prior to index
time, this works for me.

I've added one more sample key in the example. 

ENU!12345!www.testurl.com/enu/doc1 
ENU!12345!www.testurl.com/enu/doc10 
GER!12345!www.testurl.com/ger/doc2 
CHS!67890!www.testurl.com/chs/doc3 

As you can see, there are 2 documents in English having the same topic id
(12345). I added the topic id as part of the key to make sure that they
reside in the same shard, so that field collapsing works on topic id. I can
perhaps drop the topic id from the composite key and keep only language and
url, something like,

ENU!www.testurl.com/enu/doc1

But that probably won't solve the distribution issue. You mentioned "when
you take over routing, making sure the distribution is even is now your
responsibility." I'm wondering, what's the best practice to make that happen?
I can move away from the composite router and manually assign a group of
languages to a dedicated shard, both at index and at query time. But I'm not
sure keeping a map is an efficient way of dealing with it.
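
To make the map idea concrete, here is a minimal sketch of how such a map
could be built: sort the language groups by their share of documents and
always place the next one on the least-loaded shard. The function name and
the grouping are hypothetical. The shares are the rough figures above (50%
English, 25% German plus CHS, the remainder for the other 11 languages);
they are grouped because I only have the aggregate numbers.

```python
from typing import Dict, List

def assign_languages(shares: Dict[str, float], num_shards: int) -> Dict[str, int]:
    # Greedy balancing: largest share first, each onto the lightest shard.
    loads: List[float] = [0.0] * num_shards
    mapping: Dict[str, int] = {}
    for lang, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
        target = loads.index(min(loads))
        mapping[lang] = target
        loads[target] += share
    return mapping

if __name__ == "__main__":
    # Rough shares from this thread; the group labels are illustrative.
    shares = {"ENU": 0.50, "GER+CHS": 0.25, "OTHERS": 0.25}
    print(assign_languages(shares, 2))
```

Since a whole language stays on one shard, all documents sharing a language
and topic id would still be co-located for field collapsing. The same map
would have to be consulted at query time as well.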




Re: Uneven index distribution using composite router

Posted by Erick Erickson <er...@gmail.com>.
Right, when you take over routing, making sure the distribution is
even is now your responsibility.

Your assumption is that the amount of _text_ in each doc is roughly
the same across your three languages. Have you verified this? And are
you doing anything like copyFields that kick in on one shard
but not the others (e.g. if you have text_en fields, you might be
copying them to text_en_all but not copying text_ger to
text_ger_all)? That's totally a shot in the dark, though.

Best,
Erick
