Posted to solr-user@lucene.apache.org by Phil Hoy <ph...@friendsreunited.co.uk> on 2012/03/07 19:37:31 UTC

Custom Sharding on solrcloud

Hi,

We have a large index and would like to shard by a particular field value, in our case surname. This way we can scale out to multiple machines and, since most queries filter on surname, we can use some application logic to hit just the one core that holds the results we need.

Furthermore, as we anticipate the index will grow over time, it makes sense (to us) to host a number of shards on a single machine until they get too big, at which point we can move them to another machine.

We are using SolrCloud, set up with one Solr core per shard; that way we can direct both queries and updates to the appropriate core/shard. To do this our solr.xml looks a bit like this:

<cores defaultCoreName="default" adminPath="/admin/cores" zkClientTimeout="10000" hostPort="8983">
  <core shard="default" name="aaa-ava" instanceDir="/data/recordsets/shards/aaa-ava" collection="recordsets" />
  <core shard="aaa-ava" name="aaa-ava" instanceDir="/data/recordsets/shards/aaa-ava" collection="recordsets" />
  <core shard="avb-bel" name="avb-bel" instanceDir="/data/recordsets/shards/avb-bel" collection="recordsets" />
  .......

Directed updates via:
http://server/solr/aaa-ava/update/json  [{"surname": "adams"}]

Directed queries via:
http://server/solr/select?q=surname:adams&shards=aaa-ava

This setup used to work in version apache-solr-4.0-2011-12-12_09-14-13, before the more recent SolrCloud changes, but now the update is no longer directed to the appropriate core. Is there a better way to achieve what we need?

Phil


Re: Custom Sharding on solrcloud

Posted by Mark Miller <ma...@gmail.com>.
Hmm... let me think. At a minimum we intend to make the hashing mechanism pluggable... I need to think about whether there is something else you could try now...

On Mar 8, 2012, at 4:28 AM, Phil Hoy wrote:

> Hi,
> 
> If I remove the DistributedUpdateProcessorFactory, I will have to manage a master/slave setup myself, updating only the master and replicating to any slaves. Is it possible to have distributed updates, but confined to the subset of cores and replicas within a collection that share the same name?
> 
> Phil

- Mark Miller
lucidimagination.com

RE: Custom Sharding on solrcloud

Posted by Phil Hoy <ph...@friendsreunited.co.uk>.
Hi,

If I remove the DistributedUpdateProcessorFactory, I will have to manage a master/slave setup myself, updating only the master and replicating to any slaves. Is it possible to have distributed updates, but confined to the subset of cores and replicas within a collection that share the same name?

Phil

-----Original Message-----
From: Mark Miller [mailto:markrmiller@gmail.com] 
Sent: 08 March 2012 01:02
To: solr-user@lucene.apache.org
Subject: Re: Custom Sharding on solrcloud

Re: Custom Sharding on solrcloud

Posted by Mark Miller <ma...@gmail.com>.
Hi Phil - 

The default update chain now includes the distributed update processor, and in SolrCloud mode it will be active.

What you probably want to do is define your own update chain (see the wiki). Then you can set that update chain as the default for your JSON update handler in solrconfig.xml.

 <!-- referencing it in an update handler -->
 <requestHandler name="/update/json" class="solr.JsonUpdateRequestHandler" >
   <lst name="defaults">
     <str name="update.chain">mychain</str>
   </lst>
 </requestHandler>

The default chain is: 

              new LogUpdateProcessorFactory(),
              new DistributedUpdateProcessorFactory(),
              new RunUpdateProcessorFactory()

So just use Log and Run instead to get your old behavior.
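
For reference, a chain with only the Log and Run processors could be declared in solrconfig.xml roughly as follows. This is a minimal sketch, assuming the chain name "mychain" from the handler snippet above and the two stock processor factories named in this message; other Solr builds may treat a chain without the distributed processor differently, so check it against the version you run.

```xml
<!-- Sketch only: "mychain" is the name assumed by the /update/json handler above. -->
<updateRequestProcessorChain name="mychain">
  <!-- Logs each update command as it passes through the chain. -->
  <processor class="solr.LogUpdateProcessorFactory" />
  <!-- Applies the update to the local core; no DistributedUpdateProcessorFactory is listed. -->
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
```

With this chain selected through update.chain, an update posted to a given core's /update/json handler should be applied to that core only, which is the old behavior described above.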

- Mark

- Mark Miller
lucidimagination.com