Posted to solr-user@lucene.apache.org by Thomas Egense <th...@gmail.com> on 2013/10/01 15:28:50 UTC

SolrCloud. Scale-test by duplicating the same index to the shards and making it behave as if each index is different (uniqueId).

Hello everyone,
I have a small challenge performance-testing a SolrCloud setup. I have 10
shards, and each shard is supposed to hold an index of ~200GB. However, I
only have a single 200GB index, because building another index with
different data would take too long. I hope to somehow use this one index on
all 10 shards and make it behave as if the documents were different on each
shard. So building more indexes from new data is not an option.

A query to a SolrCloud is a two-phase operation. First, all shards
receive the query and return document IDs and rankings. The merger then
removes duplicate IDs, after which the full documents are retrieved.

When I copy this index to all shards and issue a request, the following
happens. Phase one: all shards receive the query and return IDs plus
rankings (actually the same set from every shard). This part is realistic
enough. Phase two: the IDs are merged into one set, so retrieving the
documents is not realistic compared to documents actually spread across
the shards (IO-wise).
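A minimal, self-contained sketch (plain Java, not Solr's actual merge code) of why identical copies collapse in phase one: when every shard returns the same ID set, the merger dedups them down to a single shard's worth of results.

```java
import java.util.*;

public class MergeSketch {
    // Phase-one merge: collapse duplicate IDs across shards,
    // keeping the best score seen for each ID.
    static Map<String, Double> merge(List<Map<String, Double>> shardResults) {
        Map<String, Double> merged = new LinkedHashMap<>();
        for (Map<String, Double> hits : shardResults) {
            for (Map.Entry<String, Double> e : hits.entrySet()) {
                merged.merge(e.getKey(), e.getValue(), Math::max);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        // Simulate 3 shards that each hold a copy of the same index,
        // so every shard returns the identical (id, score) set.
        List<Map<String, Double>> shards = new ArrayList<>();
        for (int i = 0; i < 3; i++) {
            Map<String, Double> hits = new LinkedHashMap<>();
            hits.put("doc1", 4.2);
            hits.put("doc2", 3.1);
            shards.add(hits);
        }
        // Only 2 unique documents survive the merge, not 6.
        System.out.println(merge(shards).size()); // prints 2
    }
}
```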

Is there any way I can 'fake' this somehow and have each shard return a
prefixed ID in phase one, with the prefix stripped again when
retrieving the documents in phase two? I have tried making the hack in
org.apache.solr.handler.component.QueryComponent and a few other classes,
but without success (the result set is always empty). I do not need to
index any new documents, which would also be a challenge with this hack
because of the ID hash ranges assigned to the shards.
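For what it's worth, the string logic behind the prefixing idea can be sketched in isolation. The names and the separator below are hypothetical, and this is deliberately not the QueryComponent API; wiring this into Solr's distributed phases is the part that would have to be hacked in.

```java
public class IdPrefixSketch {
    // Arbitrary separator; just pick something unlikely to occur in real IDs.
    static final String SEP = "::";

    // Phase one: make each shard's result set look distinct by
    // prefixing every returned ID with that shard's name.
    static String prefixId(String shardName, String id) {
        return shardName + SEP + id;
    }

    // Phase two: strip the prefix again so the document can be fetched
    // from the underlying (identical) index by its real ID.
    static String stripPrefix(String prefixedId) {
        int sep = prefixedId.indexOf(SEP);
        return sep < 0 ? prefixedId : prefixedId.substring(sep + SEP.length());
    }

    public static void main(String[] args) {
        String p = prefixId("shard3", "doc42");
        System.out.println(p);              // shard3::doc42
        System.out.println(stripPrefix(p)); // doc42
    }
}
```

With IDs prefixed this way, 10 identical shards would present 10x unique IDs to the merger in phase one, while phase-two lookups still resolve to real documents once the prefix is removed.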

Anyone has a good idea how to make this hack work?

From,
Thomas Egense

Re: SolrCloud. Scale-test by duplicating the same index to the shards and making it behave as if each index is different (uniqueId).

Posted by Otis Gospodnetic <ot...@gmail.com>.
Hi,

I don't know. But unless something outside Solr is the bottleneck, it may
be wise to see if you can speed up indexing instead. Maybe we can help here...

Otis
Solr & ElasticSearch Support
http://sematext.com/
On Oct 1, 2013 9:29 AM, "Thomas Egense" <th...@gmail.com> wrote:
