Posted to solr-user@lucene.apache.org by Andre Bois-Crettez <an...@kelkoo.com> on 2012/02/28 11:40:21 UTC

Re: SolrCloud on Trunk

Consistent hashing seems like a solution for reducing the shuffling of keys
when adding/deleting shards:
http://www.tomkleinpeter.com/2008/03/17/programmers-toolbox-part-3-consistent-hashing/
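
To make that concrete, here is a rough sketch of a consistent hash ring in
Java (a hypothetical HashRing class, not anything taken from Solr). Unlike
hashing the id modulo numShards, where changing numShards remaps almost every
document, the ring only remaps the keys adjacent to the shard that was added
or removed:

    import java.security.MessageDigest;
    import java.util.SortedMap;
    import java.util.TreeMap;

    public class HashRing {
        private final TreeMap<Long, String> ring = new TreeMap<Long, String>();
        private static final int VIRTUAL_NODES = 100; // points per shard, smooths the distribution

        public void addShard(String shard) throws Exception {
            for (int i = 0; i < VIRTUAL_NODES; i++) {
                ring.put(hash(shard + "#" + i), shard);
            }
        }

        public void removeShard(String shard) throws Exception {
            for (int i = 0; i < VIRTUAL_NODES; i++) {
                ring.remove(hash(shard + "#" + i));
            }
        }

        // A document goes to the first shard point at or after its own hash,
        // wrapping around at the end of the ring.
        public String shardFor(String docId) throws Exception {
            SortedMap<Long, String> tail = ring.tailMap(hash(docId));
            return tail.isEmpty() ? ring.firstEntry().getValue() : tail.get(tail.firstKey());
        }

        private static long hash(String key) throws Exception {
            byte[] d = MessageDigest.getInstance("MD5").digest(key.getBytes("UTF-8"));
            // use the first 4 bytes as an unsigned 32-bit position on the ring
            return ((long) (d[0] & 0xFF) << 24) | ((d[1] & 0xFF) << 16)
                    | ((d[2] & 0xFF) << 8) | (d[3] & 0xFF);
        }
    }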

Twitter describes a more flexible sharding scheme in the section "Gizzard
handles partitioning through a forwarding table":
https://github.com/twitter/gizzard
An explicit mapping would make it possible to take advantage of
heterogeneous servers while still reducing the shuffling of documents
when expanding or shrinking the cluster.
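
A minimal sketch of the forwarding-table idea, again purely hypothetical and
not anything that exists in Solr today: ranges are assigned explicitly, so a
bigger server can simply own more or wider ranges, and growing the cluster
means splitting one range so that only the documents in the reassigned part
have to move:

    import java.util.TreeMap;

    public class ForwardingTable {
        // lower bound of a hash range (inclusive) -> name of the shard that owns it;
        // the table should start at Integer.MIN_VALUE so every hash value is covered
        private final TreeMap<Integer, String> ranges = new TreeMap<Integer, String>();

        public void assign(int lowerBound, String shard) {
            ranges.put(lowerBound, shard);
        }

        public String shardFor(int docHash) {
            // the owner is the entry with the greatest lower bound <= docHash
            return ranges.floorEntry(docHash).getValue();
        }
    }

For example, after assign(Integer.MIN_VALUE, "shard1") and assign(0, "shard2"),
a later assign(1 << 30, "shard3") takes over only the upper half of shard2's
range, so only those documents need to move to the new shard.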

Are there any ideas or progress in this direction, be it in a branch or
in JIRA issues?


Andre


Jamie Johnson wrote:
> The case is actually any time you need to add another shard.  With the
> current implementation, the hashing approach breaks down whenever you
> add a new shard.  Even with many small shards I think you
> still have this issue when you're adding/updating/deleting docs.  I'm
> definitely interested in hearing other approaches that would work,
> though, if there are any.
>
> On Sat, Jan 28, 2012 at 7:53 PM, Lance Norskog <go...@gmail.com> wrote:
>
>> If this is to do load balancing, the usual solution is to use many
>> small shards, so you can just move one or two without doing any
>> surgery on indexes.
>>
>> On Sat, Jan 28, 2012 at 2:46 PM, Yonik Seeley
>> <yo...@lucidimagination.com> wrote:
>>
>>> On Sat, Jan 28, 2012 at 3:45 PM, Jamie Johnson <je...@gmail.com> wrote:
>>>
>>>> Second question, I know there are discussions about storing the shard
>>>> assignments in ZK (i.e. shard 1 is responsible for hashed values
>>>> between 0 and 10, shard 2 is responsible for hashed values between 11
>>>> and 20, etc.), this isn't done yet, right?  So currently the hashing is
>>>> based on the number of shards instead of having the assignments
>>>> calculated the first time you start the cluster (i.e. based on
>>>> numShards) so it could be adjusted later, right?
>>>>
>>> Right.  Storing the hash range for each shard/node is something we'll
>>> need in order to dynamically change the number of shards (as opposed to
>>> replicas), so we'll need to start doing it sooner or later.
>>>
>>> -Yonik
>>> http://www.lucidimagination.com
>>>
>>
>> --
>> Lance Norskog
>> goksron@gmail.com
>>
>
>

--
André Bois-Crettez

Search technology, Kelkoo
http://www.kelkoo.com/


Re: SolrCloud on Trunk

Posted by Yonik Seeley <yo...@lucidimagination.com>.
On Thu, Mar 1, 2012 at 12:27 AM, Jamie Johnson <je...@gmail.com> wrote:
> Is there a ticket around doing this?

Around splitting shards?

The easiest thing to consider is just splitting a single shard in two,
reusing some of the existing buffering/replication mechanisms we have
(a rough sketch of the range arithmetic follows the list):
1) create two new shards to represent each half of the old index
2) make sure leaders are forwarding updates to them and that the
shards are buffering them
3) do a commit and split the current index
4) proceed with recovery as normal on the two new shards (replicate
the halves, apply the buffered updates)
5) some unresolved details, such as how to transition leadership from the
single big shard to the smaller shards; maybe just handle it like leader
failure.
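
For step 1, assuming each shard owns a contiguous hash range (as discussed
earlier in the thread), the range arithmetic itself is simple; a hypothetical
sketch with made-up names:

    public class ShardSplit {
        // split [lower, upper] into two contiguous halves for the two new sub-shards
        public static int[][] splitRange(int lower, int upper) {
            int mid = (int) (((long) lower + (long) upper) / 2); // long math avoids int overflow
            return new int[][] { { lower, mid }, { mid + 1, upper } };
        }

        public static void main(String[] args) {
            int[][] halves = splitRange(0, 20);
            // prints 0-10 and 11-20, matching the example ranges earlier in the thread
            System.out.println(halves[0][0] + "-" + halves[0][1] + " and "
                    + halves[1][0] + "-" + halves[1][1]);
        }
    }

The rest of the steps are the hard part: splitting the actual Lucene index
along that boundary, replaying the buffered updates, and handing off
leadership.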

-Yonik
lucenerevolution.com - Lucene/Solr Open Source Search Conference.
Boston May 7-10

Re: SolrCloud on Trunk

Posted by Jamie Johnson <je...@gmail.com>.
Mark,

Is there a ticket around doing this?  If the work/design were written
down somewhere, the community might have a better idea of how exactly
we could help.

On Wed, Feb 29, 2012 at 11:21 PM, Mark Miller <ma...@gmail.com> wrote:
>
> On Feb 28, 2012, at 9:33 AM, Jamie Johnson wrote:
>
>> where specifically this is on the roadmap for SolrCloud.  Anyone
>> else have those details?
>
> I think we would like to do this sometime in the near future, but I don't know exactly what time frame it fits into yet. There is a lot to do still, and we also need to get a 4.0 release of both Lucene and Solr out to users soon. It could be in a point release later - but it's open source - it really just depends on people stepping up to do it and getting it done. I will say it's something I'd like to see done.
>
> With what we have now, one option we have talked about in the past was to just install multiple shards on a single machine - later you can start up a replica on a new machine when you are ready to grow, and then kill the original shard.
>
> i.e. you could start up 15 shards on a single machine, and then over time migrate shards off nodes and onto new hardware. It's as simple as starting up a new replica on the new hardware and removing the core on the machines you want to stop serving that shard from. This would let you expand to a 15 shard/machine cluster with N replicas (scaling replicas is as simple as starting a new node or stopping an old one).
>
> - Mark Miller
> lucidimagination.com

Re: SolrCloud on Trunk

Posted by Mark Miller <ma...@gmail.com>.
On Feb 28, 2012, at 9:33 AM, Jamie Johnson wrote:

> where specifically this is on the roadmap for SolrCloud.  Anyone
> else have those details?

I think we would like to do this sometime in the near future, but I don't know exactly what time frame it fits into yet. There is a lot to do still, and we also need to get a 4.0 release of both Lucene and Solr out to users soon. It could be in a point release later - but it's open source - it really just depends on people stepping up to do it and getting it done. I will say it's something I'd like to see done.

With what we have now, one option we have talked about in the past was to just install multiple shards on a single machine - later you can start up a replica on a new machine when you are ready to grow, and then kill the original shard.

i.e. you could start up 15 shards on a single machine, and then over time migrate shards off nodes and onto new hardware. It's as simple as starting up a new replica on the new hardware and removing the core on the machines you want to stop serving that shard from. This would let you expand to a 15 shard/machine cluster with N replicas (scaling replicas is as simple as starting a new node or stopping an old one).
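
To make that concrete, a rough sketch of the migration using the CoreAdmin
HTTP API (host, core, and shard names here are made up, and the exact
parameters are worth double-checking against the wiki):

    import java.io.InputStream;
    import java.net.URL;

    public class MoveShard {
        static void coreAdmin(String url) throws Exception {
            InputStream in = new URL(url).openStream(); // fire the CoreAdmin command, ignore the response body
            in.close();
        }

        public static void main(String[] args) throws Exception {
            // 1) create a replica of shard3 on the new machine; it syncs from the current leader
            coreAdmin("http://newhost:8983/solr/admin/cores?action=CREATE"
                    + "&name=shard3_replica2&collection=collection1&shard=shard3");

            // 2) once the new replica is active, unload the old core on the original machine
            coreAdmin("http://oldhost:8983/solr/admin/cores?action=UNLOAD&core=shard3_replica1");
        }
    }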

- Mark Miller
lucidimagination.com

Re: SolrCloud on Trunk

Posted by Jamie Johnson <je...@gmail.com>.
Very interesting, Andre.  I believe this is in line with the larger
vision: specifically, you'd use the hashing algorithm to create the
initial splits in the forwarding table, and then if you needed to add a
new shard you'd split/merge an existing range.  I think creating the
algorithm is probably the easier part (maybe I'm wrong?); the harder
part to me appears to be splitting the index based on the new ranges
and then moving that split to a new core.  I'm aware of the index
splitter contrib, which could be used for this, but I am unaware of
where specifically this is on the roadmap for SolrCloud.  Anyone else
have those details?
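
For reference, the splitter mentioned above lives in Lucene's contrib/misc as
org.apache.lucene.index.MultiPassIndexSplitter. A hypothetical invocation
follows (paths and part count are made up, and the flags may differ between
versions, so check the tool's usage output); note that it splits by document
id rather than by hash range, which is part of why more work is needed here:

    public class SplitIndex {
        public static void main(String[] args) throws Exception {
            org.apache.lucene.index.MultiPassIndexSplitter.main(new String[] {
                "-out", "/data/split-out",              // directory that receives the part indexes
                "-num", "2",                            // number of parts to produce
                "/data/solr/collection1/data/index"     // existing index to split
            });
        }
    }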

On Tue, Feb 28, 2012 at 5:40 AM, Andre Bois-Crettez
<an...@kelkoo.com> wrote:
> Consistent hashing seems like a solution for reducing the shuffling of keys
> when adding/deleting shards:
> http://www.tomkleinpeter.com/2008/03/17/programmers-toolbox-part-3-consistent-hashing/
>
> Twitter describes a more flexible sharding scheme in the section "Gizzard
> handles partitioning through a forwarding table":
> https://github.com/twitter/gizzard
> An explicit mapping would make it possible to take advantage of
> heterogeneous servers while still reducing the shuffling of documents
> when expanding or shrinking the cluster.
>
> Are there any ideas or progress in this direction, be it in a branch or
> in JIRA issues?
>
>
> Andre
>
>
>
> Jamie Johnson wrote:
>>
>> The case is actually any time you need to add another shard.  With the
>> current implementation, the hashing approach breaks down whenever you
>> add a new shard.  Even with many small shards I think you
>> still have this issue when you're adding/updating/deleting docs.  I'm
>> definitely interested in hearing other approaches that would work,
>> though, if there are any.
>>
>> On Sat, Jan 28, 2012 at 7:53 PM, Lance Norskog <go...@gmail.com> wrote:
>>
>>> If this is to do load balancing, the usual solution is to use many
>>> small shards, so you can just move one or two without doing any
>>> surgery on indexes.
>>>
>>> On Sat, Jan 28, 2012 at 2:46 PM, Yonik Seeley
>>> <yo...@lucidimagination.com> wrote:
>>>
>>>> On Sat, Jan 28, 2012 at 3:45 PM, Jamie Johnson <je...@gmail.com>
>>>> wrote:
>>>>
>>>>> Second question, I know there are discussions about storing the shard
>>>>> assignments in ZK (i.e. shard 1 is responsible for hashed values
>>>>> between 0 and 10, shard 2 is responsible for hashed values between 11
>>>>> and 20, etc.), this isn't done yet, right?  So currently the hashing is
>>>>> based on the number of shards instead of having the assignments
>>>>> calculated the first time you start the cluster (i.e. based on
>>>>> numShards) so it could be adjusted later, right?
>>>>>
>>>> Right.  Storing the hash range for each shard/node is something we'll
>>>> need in order to dynamically change the number of shards (as opposed to
>>>> replicas), so we'll need to start doing it sooner or later.
>>>>
>>>> -Yonik
>>>> http://www.lucidimagination.com
>>>>
>>>
>>> --
>>> Lance Norskog
>>> goksron@gmail.com
>>>
>>
>>
>
> --
> André Bois-Crettez
>
> Search technology, Kelkoo
> http://www.kelkoo.com/
>