Posted to solr-dev@lucene.apache.org by Jason Rutherglen <ja...@gmail.com> on 2010/01/15 22:12:16 UTC

Solr Cloud wiki and branch notes

Here are some rough notes after running the unit tests, reviewing
some of the code (though not understanding it), and reviewing
the wiki page http://wiki.apache.org/solr/SolrCloud


We need a protocol in the URL; otherwise it's inflexible

I'm overwhelmed by all the ?? question areas of the document.

The page is huge, which signals to me maybe we're trying to do
too much

Revamping distributed search could be in a different branch
(this includes partial results)

Having a single solrconfig and schema for each core/shard in a
collection won't work for me. I need to define each core
externally, and I don't want Solr-Cloud to manage this. How will
this scenario work?

A host is about the same as a node; I don't see the difference, or
enough of one

Cluster resizing and rebalancing can and should be built
externally and hopefully after an initial release that does the
basics well

Collection is a group of cores?

I like the model -> reality system. However, how does the
versioning work? We need to know what the conversion progress
is. How will the queuing of in-progress alterations work? (This
seems hard; I'd rather focus on this and make it work well than
mess with other things like load balancing in the first release.
i.e. if this doesn't work well, Solr-Cloud isn't production
ready for me.)

Shard Identification: this falls under "too ambitious" right now,
IMO

I think we need a wiki page of just the basics of core/shard
management, implement that, then build all the rest of the features on top...
Otherwise this thing feels like it's going to be a nightmare to
test and deploy in production.

Re: Solr Cloud wiki and branch notes

Posted by Jason Rutherglen <ja...@gmail.com>.
> This is really about doing not-so-much in the very near term,
> while thinking ahead to the longer term.

Let's have a page dedicated to release 1.0 of cloud? I feel
uncomfortable editing the existing wiki because I don't know
what the plans are for the first release.

I need to revisit Katta as my short term plans include using
Zookeeper not for failover, but simply for deploying
shards/cores to servers, and nothing else. I can use the core
admin interface to bring them online, update them, etc. Or I'll
just implement something and make a patch to Solr... Thinking
out loud:

/anyname/shardlist-v1.txt
/anyname/shardlist-v2.txt

where shardlist-v1.txt contains:
corename,coredownloadpath,instanceDir

Where coredownloadpath can be any URL including hftp, hdfs, ftp, http, https.

Where the system automagically uninstalls cores that should no
longer exist on a given server. Cores with the same name
deployed to the same server would use the reload command,
otherwise the create command.

Where there's a ZK listener on the /anyname directory for new
files whose version is greater than the last known installed
shardlist.txt.
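
A rough sketch of what such a listener could look like against the plain
ZooKeeper client API (parseVersion() and installShardList() are hypothetical
helpers here, and error handling/retry is left out):

  import java.util.List;
  import org.apache.zookeeper.WatchedEvent;
  import org.apache.zookeeper.Watcher;
  import org.apache.zookeeper.ZooKeeper;

  public class ShardListWatcher implements Watcher {
    private final ZooKeeper zk;
    private int lastInstalledVersion = -1;

    public ShardListWatcher(ZooKeeper zk) { this.zk = zk; }

    // Re-registers the watch and installs any shardlist newer than the last one seen.
    public void check() throws Exception {
      List<String> files = zk.getChildren("/anyname", this);
      for (String file : files) {                      // e.g. "shardlist-v2.txt"
        int version = parseVersion(file);
        if (version > lastInstalledVersion) {
          byte[] data = zk.getData("/anyname/" + file, false, null);
          // each line: corename,coredownloadpath,instanceDir
          installShardList(new String(data, "UTF-8")); // CREATE or RELOAD cores, UNLOAD removed ones
          lastInstalledVersion = version;
        }
      }
    }

    public void process(WatchedEvent event) {          // called by ZK when /anyname changes
      try { check(); } catch (Exception e) { /* log and re-check later */ }
    }

    private int parseVersion(String fileName) {
      // "shardlist-v2.txt" -> 2
      return Integer.parseInt(fileName.replaceAll("\\D+", ""));
    }

    private void installShardList(String contents) {
      // parse the corename,coredownloadpath,instanceDir lines and drive the core admin API
    }
  }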

Alternatively, an even simpler design would be uploading a
solr.xml file per server, something like:
/anyname/solr-prod01.solr.xml

Which a directory listener on each server parses and makes the
necessary changes (without restarting Tomcat).
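
For illustration, the uploaded file would just be an ordinary solr.xml core
list for that server, something along these lines:

  <?xml version="1.0" encoding="UTF-8" ?>
  <solr persistent="true">
    <cores adminPath="/admin/cores">
      <core name="shard1" instanceDir="shard1" />
      <core name="shard2" instanceDir="shard2" />
    </cores>
  </solr>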

On the search side in this system, I'd need to wait for the
cores to complete their install, then swap in a new core on the
search proxy that represents the new version of the corelist,
then the old cores could go away. This isn't very different from
the segmentinfos system used in Lucene, IMO.

On Fri, Jan 15, 2010 at 1:53 PM, Yonik Seeley <ys...@gmail.com> wrote:
> On Fri, Jan 15, 2010 at 4:12 PM, Jason Rutherglen
> <ja...@gmail.com> wrote:
>> The page is huge, which signals to me maybe we're trying to do
>> too much
>
> This is really about doing not-so-much in the very near term, while
> thinking ahead to the longer term.
>
>> Revamping distributed search could be in a different branch
>> (this includes partial results)
>
> That could just be a separate patch - its scope is not that broad (I
> think there may already be a JIRA issue open for it).
>
>> Having a single solrconfig and schema for each core/shard in a
>> collection won't work for me. I need to define each core
>> externally, and I don't want Solr-Cloud to manage this, how will
>> this scenario work?
>
> We do plan on each core being able to have its own schema (so one
> could try out a version of a schema and gradually migrate the
> cluster).
>
> It could also be possible to define a schema as "local" (i.e. use the
> one on the local file system)
>
>> A host is about the same as a node; I don't see the difference, or
>> enough of one
>
> A host is the hardware. It will have limited disk, limited CPU, etc.
> At some point we will want to model this... multiple nodes could be
> launched on one box.  We're not doing anything with it now, and won't
> in the near future.
>
>> Cluster resizing and rebalancing can and should be built
>> externally and hopefully after an initial release that does the
>> basics well
>
> The initial release will certainly not be doing any resizing or rebalancing.
> We should allow this to be done externally.  In the future, we
> shouldn't require that this be done externally though (i.e. we should
> somehow allow the cluster to grow w/o people having to write code).
>
>> Collection is a group of cores?
>
> A collection of documents - the complete search index.  It has a
> single schema, etc.
>
> -Yonik
> http://www.lucidimagination.com
>

Re: Solr Cloud wiki and branch notes

Posted by Ted Dunning <te...@gmail.com>.
+1

Hadoop still calls it a copy of a block if you have a replication factor of
1.  Why not?

(for that matter, I still call it an integer if it has a value of 1)

On Sun, Jan 17, 2010 at 6:06 AM, Andrzej Bialecki <ab...@getopt.org> wrote:

> I originally started off with "replica" too... but there may only be
>> one copy of a physical shard; it seemed strange to call it a replica.
>>
>
> Yeah .. it's a replica with a replication factor of 1 :)




-- 
Ted Dunning, CTO
DeepDyve

Re: Solr Cloud wiki and branch notes

Posted by Ted Dunning <te...@gmail.com>.
Control is easily retained if you make pluggable the selection of shards to
which you want to do the horizontal broadcast.  The shard management layer
shouldn't know or care what query you are doing and in most cases it should
just use the trivial all-shards selection policy.

On Sun, Jan 17, 2010 at 7:34 AM, Yonik Seeley <yo...@lucidimagination.com>wrote:

> > I would argue that the current model has been adopted out of necessity,
> and
> > not because of the users' preference.
>
> I think it's both - I've seen quite a few people who really wanted to
> partition by time for example (and they made some compelling cases for
> doing so).  Seems like a good goal would be to support the customer
> having various levels of control.




-- 
Ted Dunning, CTO
DeepDyve

Re: Solr Cloud wiki and branch notes

Posted by Yonik Seeley <yo...@lucidimagination.com>.
On Sun, Jan 17, 2010 at 9:06 AM, Andrzej Bialecki <ab...@getopt.org> wrote:
> On 2010-01-16 21:11, Yonik Seeley wrote:
>> If we were building from scratch perhaps - but it seems like if we can
>> just model what people do today with Solr (but just make it a lot
>> easier), that's a good start.  The opaque model is what we have today,
>> and it's conceptually simple... the complete collection consists of
>> all the unique shard ids (or slices) you know about.
>
> I would argue that the current model has been adopted out of necessity, and
> not because of the users' preference.

I think it's both - I've seen quite a few people who really wanted to
partition by time for example (and they made some compelling cases for
doing so).  Seems like a good goal would be to support the customer
having various levels of control.

> Unless you want expert-level total
> control over what node runs what part of the index, isn't it much more
> convenient to delegate all the partitioning and deployment to your "search
> cluster" instead of managing the partitioning and deployment yourself?

Certainly - we do want to get to the "just handle everything for me"
phase.  It just feels like there is a lot more development work to do
before we can make that happen.  Reliably supporting near realtime
updates in a replicated environment is hard and will take some time.

-Yonik
http://www.lucidimagination.com

Re: Solr Cloud wiki and branch notes

Posted by Ted Dunning <te...@gmail.com>.
Jason V and Jason R have done just that.

Great idea.  Cool work.  But a unified management interface would *really*
be nice.

On Sun, Jan 17, 2010 at 6:06 AM, Andrzej Bialecki <ab...@getopt.org> wrote:

> Well, then if we don't intend to support updates in this iteration then
> perhaps there is no need to change anything in Solr, just extend Katta to
> run Solr searchers ... :P
>



-- 
Ted Dunning, CTO
DeepDyve

Re: Solr Cloud wiki and branch notes

Posted by Andrzej Bialecki <ab...@getopt.org>.
On 2010-01-16 21:11, Yonik Seeley wrote:

>> Agreed - but it could be as simple as qualifying this with "from shardX on
>> node2".
>
> Right - it's pretty clear there are both physical and logical
> shards... but it's less clear to me at this point if distinguishing
> them in the vocabulary helps or hurts.

You _are_ distinguishing them - you just use "physical" and "logical" :)
I'm in favor of using "shard" for the logical entity, and "copy" or
"replica" for the physical one. Whichever term we choose, we need to be
clear about this distinction, because multiple physical copies (replicas)
may be deployed to multiple nodes even though they constitute only one
logical shard.

>
>> The opaque model means it's more difficult to support updates.
>> IMHO it makes
>> sense to start with a set of stricter assumptions
>
> If we were building from scratch perhaps - but it seems like if we can
> just model what people do today with Solr (but just make it a lot
> easier), that's a good start.  The opaque model is what we have today,
> and it's conceptually simple... the complete collection consists of
> all the unique shard ids (or slices) you know about.

I would argue that the current model has been adopted out of necessity, 
and not because of the users' preference. Unless you want
expert-level total control over what node runs what part of the index, 
isn't it much more convenient to delegate all the partitioning and 
deployment to your "search cluster" instead of managing the partitioning 
and deployment yourself? Users have to do it now because Solr has no 
mechanism for this.

>
> And we don't need to support everything in this model - I think we
> should and will also support shards where Solr does all the
> partitioning and mapping of the ID space (pluggable of course) and
> then we can offer more services based on that knowledge.

Well, then if we don't intend to support updates in this iteration then 
perhaps there is no need to change anything in Solr, just extend Katta 
to run Solr searchers ... :P

>
>>> You've also used some slightly new terminology... "shard ID" as
>>> opposed to just shard, which reinforces the need for different
>>> terminology for the physical vs the logical.
>>
>> You got me ;) yes, when I say "shard" I mean the logical entity, as defined
>> by a set of documents - physical shard I would call a replica.
>
> I originally started off with "replica" too... but there may only be
> one copy of a physical shard; it seemed strange to call it a replica.

Yeah .. it's a replica with a replication factor of 1 :)

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: Solr Cloud wiki and branch notes

Posted by Yonik Seeley <yo...@lucidimagination.com>.
On Sat, Jan 16, 2010 at 2:40 PM, Andrzej Bialecki <ab...@getopt.org> wrote:
> I avoided the word "collection", because Solr deploys various cores under
> "collectionX" names, leading users to assume that core == collection.

For distributed search, it's already common to name the cores the same
thing for shards of the same collection on different boxes.  In fact,
we're currently using the core name as a default for the collection
name when bootstrapping.

>> Even the statement "what shard did that response come from" becomes
>> ambiguous since we could be talking a part of the index (ShardX) or we
>> could be talking about the specific physical shard/server (it came
>> from node2).
>
> Agreed - but it could be as simple as qualifying this with "from shardX on
> node2".

Right - it's pretty clear there are both physical and logical
shards... but it's less clear to me at this point if distinguishing
them in the vocabulary helps or hurts.

> The opaque model means it's more difficult to support updates.
> IMHO it makes
> sense to start with a set of stricter assumptions

If we were building from scratch perhaps - but it seems like if we can
just model what people do today with Solr (but just make it a lot
easier), that's a good start.  The opaque model is what we have today,
and it's conceptually simple... the complete collection consists of
all the unique shard ids (or slices) you know about.

And we don't need to support everything in this model - I think we
should and will also support shards where Solr does all the
partitioning and mapping of the ID space (pluggable of course) and
then we can offer more services based on that knowledge.

>> You've also used some slightly new terminology... "shard ID" as
>> opposed to just shard, which reinforces the need for different
>> terminology for the physical vs the logical.
>
> You got me ;) yes, when I say "shard" I mean the logical entity, as defined
> by a set of documents - physical shard I would call a replica.

I originally started off with "replica" too... but there may only be
one copy of a physical shard; it seemed strange to call it a replica.

-Yonik
http://www.lucidimagination.com

Re: Solr Cloud wiki and branch notes

Posted by Ted Dunning <te...@gmail.com>.
My experience with Katta is that very quickly my developers adopted "index" as
the aggregate of all the shards, which is exactly what Andrzej is proposing.
Confusion with the "index contains shards", "nodes host shards" terminology
has been minimal.

On Sat, Jan 16, 2010 at 11:40 AM, Andrzej Bialecki <ab...@getopt.org> wrote:

> We're currently using "collection".  Notice how you had to add
>> (global) to clarify what you meant.  I fear that a sentence like "what
>> index are you querying" would need constant clarification.
>>
>
> I avoided the word "collection", because Solr deploys various cores under
> "collectionX" names, leading users to assume that core == collection.
> "Global index" is two words but it's unambiguous. I'm fine with the
> "collection" if we clarify the definition and avoid using this term for
> other stuff.




-- 
Ted Dunning, CTO
DeepDyve

Re: Solr Cloud wiki and branch notes

Posted by Mark Miller <ma...@gmail.com>.
Andrzej Bialecki wrote:
>
> I avoided the word "collection", because Solr deploys various cores
> under "collectionX" names, leading users to assume that core ==
> collection. "Global index" is two words but it's unambiguous. I'm fine
> with the "collection" if we clarify the definition and avoid using
> this term for other stuff.
No, currently it does not. There is a proposal for this to make certain
bootstrapping "stuff" with cloud easier, but Solr uses no such
collectionX convention at the moment (that I have ever seen).

-- 
- Mark

http://www.lucidimagination.com




Re: Solr Cloud wiki and branch notes

Posted by Andrzej Bialecki <ab...@getopt.org>.
On 2010-01-16 18:18, Yonik Seeley wrote:
> On Fri, Jan 15, 2010 at 7:36 PM, Andrzej Bialecki<ab...@getopt.org>  wrote:
>> Hi,
>>
>> My 0.02 PLN on the subject ...
>>
>> Terminology
>> -----------
>> First the terminology: reading your emails I have a feeling that my head is
>> about to explode. We have to agree on the vocabulary, otherwise we have no
>> hope of reaching any consensus.
>
> We not only need more standardized terminology for email, but for
> exact strings to put in zookeeper.

Indeed.

>> I propose the following vocabulary that has
>> been in use and is generally understood:
>>
>> * (global) search index: a complete collection of all indexed documents.
>>  From a conceptual point of view, this is our complete search space.
>
> We're currently using "collection".  Notice how you had to add
> (global) to clarify what you meant.  I fear that a sentence like "what
> index are you querying" would need constant clarification.

I avoided the word "collection", because Solr deploys various cores 
under "collectionX" names, leading users to assume that core == 
collection. "Global index" is two words but it's unambiguous. I'm fine 
with the "collection" if we clarify the definition and avoid using this 
term for other stuff.

>
>> * index shard: a non-overlapping part of the search index.
>
> When you get down to modeling it, this gets a little squishy, and it is
> hard to avoid using two words.
> Say the complete collection is covered by ShardX and ShardY.
>
> A way to model this is like so:
>
> /collection
>    /ShardX
>      /node1 [url=..., version=...]
>      /node2 [url=..., version=...]
>      /node3 [url=..., version=...]
>
> It becomes clearer that there are logical shards and physical shards.
> If shards are updateable, they may have different versions at
> different times.

Yes, but they are supposed to be ultimately consistent - that's where 
the replication comes in.

> It may also be that all the physical shards go down,
> but the logical "ShardX" remains.

Yes, as a missing piece of the global index not served currently by any 
node, thus leading to incomplete results.

>
> Even the statement "what shard did that response come from" becomes
> ambiguous since we could be talking a part of the index (ShardX) or we
> could be talking about the specific physical shard/server (it came
> from node2).

Agreed - but it could be as simple as qualifying this with "from shardX 
on node2".

This would be quite natural if you consider that even the same query 
submitted again could be answered by a different set of nodes that 
manage the same set of shards. E.g. with two nodes {n1, n2} and 2 shards 
{s1,s2}, and the replication factor of 2, the selection of what shard on 
what node contributes to the list of results could look like this (time 
in the Y axis):

q1 {n1:s1,n2:s2}
q2 {n1:s2,n2:s1}
...


>
>> All shards in the
>> system form together the complete search space of the search index. E.g.
>> having initially one big index I could divide it into multiple shards using
>> MultiPassIndexSplitter, and if I combined all the shards again, using
>> IndexMerger, I should obtain the original complete search index (modulo
>> changed Lucene docids .. doesn't matter). I strongly believe in
>> micro-sharding, because small shards are much easier to handle and replicate. Also,
>> since we control the shards we don't have to deal with overlapping shards,
>> which is the curse of P2P search.
>
> Prohibiting overlapping shards effectively prohibits ever merging or
> splitting shards online (it could only be an offline or blocking
> operation).  Anyway, in the opaque shard model (where clients create
> shards, and we don't know how they partitioned them), shards would
> have to be non-overlapping.

The opaque model means it's more difficult to support updates. IMHO it 
makes sense to start with a set of stricter assumptions in order to 
build something workable, and then relax them as we gain experience.

>
> As far as the future (allocation and rebalancing), I'm happy with a
> small-shard approach that avoids merging and splitting.  It carries
> some other nice little side benefits as well.
>
>> * partitioning: a method whereby we can determine the target shard ID based
>> on a doc ID.
>
> I think we're all using partitioning the same way, but that's a
> narrower definition than needed.
> A user may partition the index, and Solr may not have the mapping of
> docid to shard.

See above - of course this would be cool and extra convenient to users, 
but much more difficult to implement so that it supports updates.

>
> You've also used some slightly new terminology... "shard ID" as
> opposed to just shard, which reinforces the need for different
> terminology for the physical vs the logical.

You got me ;) yes, when I say "shard" I mean the logical entity, as 
defined by a set of documents - physical shard I would call a replica.


>> Now, to translate this into Solr-speak: depending on the details of the
>> design, and the evolution of Solr, one search node could be one Solr
>> instance that manages one shard per core.
>
> A solr core is a bit too heavyweight for a microshard though.
> I think a single solr core really needs to be able to handle multiple
> shards for this to become practical.

Ok. This is actually related to the issue below (witness SOLR-1366).

>
>> Let's forget here about the
>> current distributed search component, and the current replication
>
> Heh.  I think this is what is causing some of the mismatches...
> different starting points and different assumptions.

They work well with the current assumptions, and are known to work 
poorly with the design that we are discussing.

>
>> - they
>> could be useful in this design as a raw transport mechanism, but someone
>> else would be calling the shots (see below).
>
> Seems like we need to be flexible in allowing customers to call the
> shots to varying degrees.

Eventually, yes - but initially I fear we won't be able to come up with 
a model that allows this much flexibility and is still implementable in 
a reasonable time ...

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: Solr Cloud wiki and branch notes

Posted by Yonik Seeley <yo...@lucidimagination.com>.
On Fri, Jan 15, 2010 at 7:36 PM, Andrzej Bialecki <ab...@getopt.org> wrote:
> Hi,
>
> My 0.02 PLN on the subject ...
>
> Terminology
> -----------
> First the terminology: reading your emails I have a feeling that my head is
> about to explode. We have to agree on the vocabulary, otherwise we have no
> hope of reaching any consensus.

We not only need more standardized terminology for email, but for
exact strings to put in zookeeper.

> I propose the following vocabulary that has
> been in use and is generally understood:
>
> * (global) search index: a complete collection of all indexed documents.
> From a conceptual point of view, this is our complete search space.

We're currently using "collection".  Notice how you had to add
(global) to clarify what you meant.  I fear that a sentence like "what
index are you querying" would need constant clarification.

> * index shard: a non-overlapping part of the search index.

When you get down to modeling it, this gets a little squishy, and it is
hard to avoid using two words.
Say the complete collection is covered by ShardX and ShardY.

A way to model this is like so:

/collection
  /ShardX
    /node1 [url=..., version=...]
    /node2 [url=..., version=...]
    /node3 [url=..., version=...]

It becomes clearer that there are logical shards and physical shards.
If shards are updateable, they may have different versions at
different times.  It may also be that all the physical shards go down,
but the logical "ShardX" remains.
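
To make that concrete, registering a physical shard under its logical shard
could be as simple as creating an ephemeral znode - a rough sketch only; the
path layout and the url=/version= data format are just illustrative:

  import org.apache.zookeeper.CreateMode;
  import org.apache.zookeeper.ZooDefs;
  import org.apache.zookeeper.ZooKeeper;

  public class ShardRegistration {
    // e.g. creates /collection/ShardX/node1 with data "url=...,version=..."
    public static void register(ZooKeeper zk, String shard, String node,
                                String url, long version) throws Exception {
      String path = "/collection/" + shard + "/" + node;
      byte[] data = ("url=" + url + ",version=" + version).getBytes("UTF-8");
      // EPHEMERAL: the physical entry vanishes automatically if the node dies,
      // while the logical /collection/ShardX parent remains.
      zk.create(path, data, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
    }
  }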

Even the statement "what shard did that response come from" becomes
ambiguous since we could be talking a part of the index (ShardX) or we
could be talking about the specific physical shard/server (it came
from node2).

> All shards in the
> system form together the complete search space of the search index. E.g.
> having initially one big index I could divide it into multiple shards using
> MultiPassIndexSplitter, and if I combined all the shards again, using
> IndexMerger, I should obtain the original complete search index (modulo
> changed Lucene docids .. doesn't matter). I strongly believe in
> micro-sharding, because small shards are much easier to handle and replicate. Also,
> since we control the shards we don't have to deal with overlapping shards,
> which is the curse of P2P search.

Prohibiting overlapping shards effectively prohibits ever merging or
splitting shards online (it could only be an offline or blocking
operation).  Anyway, in the opaque shard model (where clients create
shards, and we don't know how they partitioned them), shards would
have to be non-overlapping.

As far as the future (allocation and rebalancing), I'm happy with a
small-shard approach that avoids merging and splitting.  It carries
some other nice little side benefits as well.

> * partitioning: a method whereby we can determine the target shard ID based
> on a doc ID.

I think we're all using partitioning the same way, but that's a
narrower definition than needed.
A user may partition the index, and Solr may not have the mapping of
docid to shard.

You've also used some slightly new terminology... "shard ID" as
opposed to just shard, which reinforces the need for different
terminology for the physical vs the logical.

> * search node: an application that provides search and update to one or more
> shards.
>
> * search host: a machine that may run 1 or more search nodes.
>
> * Shard Manager: a component that keeps track of allocation of shards to
> nodes (plus more, see below).
>
> Now, to translate this into Solr-speak: depending on the details of the
> design, and the evolution of Solr, one search node could be one Solr
> instance that manages one shard per core.

A solr core is a bit too heavyweight for a microshard though.
I think a single solr core really needs to be able to handle multiple
shards for this to become practical.

> Let's forget here about the
> current distributed search component, and the current replication

Heh.  I think this is what is causing some of the mismatches...
different starting points and different assumptions.

> - they
> could be useful in this design as a raw transport mechanism, but someone
> else would be calling the shots (see below).

Seems like we need to be flexible in allowing customers to call the
shots to varying degrees.

-Yonik
http://www.lucidimagination.com

> Architecture
> ------------
> The replication and load balancing is a problem with many existing
> solutions, and this one in particular reminds me strongly of the Hadoop
> HDFS. In fact, early on during the development of Hadoop [1] I wondered
> whether we could reuse HDFS to manage Lucene indexes instead of opaque
> blocks of fixed size. It turned out to be infeasible, but the model of
> Namenode/Datanode still looks useful in our case, too.
>
> I believe there are many useful lessons lurking in Hadoop/HBase/Zookeeper
> that we could reuse in our design. The following is just a straightforward
> port of the Namenode/Datanode concept.
>
> Let's imagine a component called ShardManager that is responsible for
> managing the following data:
>
> * list of shard ID-s that together form the complete search index,
> * for each shard ID, list of search nodes that serve this shard.
> * issuing replication requests
> * maintaining the partitioning function (see below), so that updates are
> directed to correct shards
> * maintaining heartbeat to check for dead nodes
> * providing search clients with a list of nodes to query in order to obtain
> all results from the search index.
>
> Whenever a new search node comes up, it reports its local shard ID-s
> (versioned) to the ShardManager. Based on these reports from the currently
> active nodes, the ShardManager builds this mapping of shards to nodes, and
> requests replication if some shards are too old, or if the replication count
> is too low, allocating these shards to selected nodes (based on a policy of
> some kind).
>
> I believe most of the above functionality could be facilitated by Zookeeper,
> including the election of the node that runs the ShardManager.
>
> Updates
> -------
> We need a partitioning schema that splits documents more or less evenly
> among shards, and at the same time allows us to split or merge unbalanced
> shards. The simplest function that we could imagine is the following:
>
>        hash(docId) % numShards
>
> though this has the disadvantage that any larger update will affect multiple
> shards, thus creating an avalanche of replication requests ... so a
> sequential model would be probably better, where ranges of docIds are
> assigned to shards.
>
> Now, if any particular shard is too unbalanced, e.g. too large, it could be
> further split in two halves, and the ShardManager would have to record this
> exception. This is a very similar process to a region split in HBase, or a
> page split in btree DBs. Conversely, shards that are too small could be
> joined. This is the icing on the cake, so we can leave it for later.
>
> After commit, a node contacts the ShardManager to report a new version of
> the shard. ShardManager issues replication requests to other nodes that hold
> a replica of this shard.
>
> Search
> ------
> There should be a component sometimes referred to as query integrator (or
> search front-end) that is the entry and exit point for user search requests.
> On receiving a search request this component gets a list of randomly
> selected nodes from SearchManager to contact (the list containing all shards
> that form the global index), sends the query and integrates partial results
> (under a configurable policy for timeouts/early termination), and sends back
> the assembled results to the user.
>
> Again, somewhere in the background the knowledge of who to contact should be
> handled by Zookeeper.
>
> That's it for now from the top of my head ...
>
> -----------
>
> [1]
> http://www.mail-archive.com/nutch-developers@lists.sourceforge.net/msg02273.html
>
> --
> Best regards,
> Andrzej Bialecki     <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com

Re: Solr Cloud wiki and branch notes

Posted by Ted Dunning <te...@gmail.com>.
On Fri, Jan 15, 2010 at 4:36 PM, Andrzej Bialecki <ab...@getopt.org> wrote:

> My 0.02 PLN on the subject ...
>

Polish currency seems pretty strong lately.  There are a lot of good ideas
for this small sum.


>
> Terminology
>
> * (global) search index
> * index shard:
> * partitioning:
> * search node:
> * search host:
> * Shard Manager:
>

I think that these terms are excellent.


> The replication and load balancing is a problem with many existing
> solutions, and this one in particular reminds me strongly of the Hadoop
> HDFS. In fact, early on during the development of Hadoop [1] I wondered
> whether we could reuse HDFS to manage Lucene indexes instead of opaque
> blocks of fixed size. It turned out to be infeasible, but the model of
> Namenode/Datanode still looks useful in our case, too.
>

I have seen the analogy with hadoop in managing a Katta cluster.  The
randomized assignment provides very many of the same robustness benefits as
a map-reduce architecture provides for parallel computing.


> I believe there are many useful lessons lurking in Hadoop/HBase/Zookeeper
> that we could reuse in our design. The following is just a straightforward
> port of the Namenode/Datanode concept.
>
> Let's imagine a component called ShardManager that is responsible for
> managing the following data:
>
> * list of shard ID-s that together form the complete search index,
> * for each shard ID, list of search nodes that serve this shard.
> * issuing replication requests
> * maintaining the partitioning function (see below), so that updates are
> directed to correct shards
> * maintaining heartbeat to check for dead nodes
> * providing search clients with a list of nodes to query in order to obtain
> all results from the search index.
>

I think that this is close.

I think that the list of search nodes that serve each shard should be
maintained by the nodes themselves.  Moreover, ZK provides the ability to
have this list magically update if the node dies.

This means that the need for heartbeats virtually disappears.

In addition, I think that a substrate like ZK should be used to provide
search clients with the information about which nodes have which shards and
the clients should themselves decide how to cover the set of shards with a
list of nodes.  This means that the ShardManager is *completely* out of the
real-time pathway.
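
As a rough sketch of that split of responsibilities, a search client could
build the whole shard-to-nodes map straight from ZooKeeper and cache it (the
/collection/<shard>/<node> layout is an assumption here, not settled):

  import java.util.HashMap;
  import java.util.List;
  import java.util.Map;
  import org.apache.zookeeper.ZooKeeper;

  public class ClusterStateReader {
    // Builds shard -> [nodes currently serving it], registering watches so the
    // cached copy can be refreshed when nodes come and go.
    public static Map<String, List<String>> nodesByShard(ZooKeeper zk) throws Exception {
      Map<String, List<String>> map = new HashMap<String, List<String>>();
      for (String shard : zk.getChildren("/collection", true)) {
        map.put(shard, zk.getChildren("/collection/" + shard, true));
      }
      return map;
    }
  }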



> ... I believe most of the above functionality could be facilitated by
> Zookeeper, including the election of the node that runs the ShardManager.
>

Absolutely.


> Updates
> -------
> We need a partitioning schema that splits documents more or less evenly
> among shards, and at the same time allows us to split or merge unbalanced
> shards. The simplest function that we could imagine is the following:
>
>        hash(docId) % numShards
>
> though this has the disadvantage that any larger update will affect
> multiple shards, thus creating an avalanche of replication requests ... so a
> sequential model would be probably better, where ranges of docIds are
> assigned to shards.
>

A hybrid is quite possible:

hash(floor(docId / sequence-size)) % numShards

this gives sequential assignment of sequence-size documents at a time.
The sequence size should be small to distribute query results and update loads
across all nodes, and large to avoid replication of all
shards after a focused update.  Balance is necessary.
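
In code, that hybrid is just the following sketch (the mixing function is
arbitrary, and docId is assumed to be numeric):

  public class SequencePartitioner {
    // shard = hash(floor(docId / sequenceSize)) % numShards
    public static int shardFor(long docId, long sequenceSize, int numShards) {
      long block = docId / sequenceSize;              // floor for non-negative ids
      return (int) ((mix(block) & Long.MAX_VALUE) % numShards);
    }

    // Any decent mixing function will do; this is the 64-bit murmur3 finalizer.
    private static long mix(long x) {
      x ^= (x >>> 33); x *= 0xff51afd7ed558ccdL;
      x ^= (x >>> 33); x *= 0xc4ceb9fe1a85ec53L;
      x ^= (x >>> 33);
      return x;
    }
  }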


> Now, if any particular shard is too unbalanced, e.g. too large, it could be
> further split in two halves, and the ShardManager would have to record this
> exception. This is a very similar process to a region split in HBase, or a
> page split in btree DBs. Conversely, shards that are too small could be
> joined. This is the icing on the cake, so we can leave it for later.
>

Leaving for later is a great idea.  With relatively small shards, I am able
to parallelize indexing to the point that a terabyte or so of documents
indexes in a few hours.  Combined with a small sequence-size in the shard
distribution function so that all shards grow together, it is easy to plan
for 3x growth or more without the need for shard splitting.  With a complete
index being so cheap, I can afford to simply reindex from scratch with a
different shard count if I feel like it.



> Search
> ------
> There should be a component sometimes referred to as query integrator (or
> search front-end) that is the entry and exit point for user search requests.
> On receiving a search request this component gets a list of randomly
> selected nodes from SearchManager to contact (the list containing all shards
> that form the global index), sends the query and integrates partial results
> (under a configurable policy for timeouts/early termination), and sends back
> the assembled results to the user.
>

Yes in outline.

A few details:

I think that the shard cover computation should, in fact, be done on the
client side.  One reason is that the node/shard state is relatively static
and if all clients retrieve the full state, this is cacheable and simple.

Another detail is that however the cover of all shards is computed, the
result is a distinct list of all shards grouped by one of the nodes that
shard resides on.  Each group should represent exactly one query to the
corresponding node.  The node should perform the query on each of the requested
shards (but NOT all shards on the node) and return the results unmerged.
Additionally, it should return an exception that may have occurred in place
of a result list.
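
A rough sketch of that cover computation (node selection is just random here;
in practice the selection policy would be pluggable):

  import java.util.ArrayList;
  import java.util.HashMap;
  import java.util.List;
  import java.util.Map;
  import java.util.Random;

  public class ShardCover {
    // Input: shard -> nodes serving it.  Output: node -> shards to ask that node
    // for, so that each group becomes exactly one query to the corresponding node.
    public static Map<String, List<String>> compute(Map<String, List<String>> nodesByShard) {
      Random random = new Random();
      Map<String, List<String>> shardsByNode = new HashMap<String, List<String>>();
      for (Map.Entry<String, List<String>> e : nodesByShard.entrySet()) {
        List<String> nodes = e.getValue();
        if (nodes.isEmpty()) continue;            // no live copy: record an error for this shard
        String node = nodes.get(random.nextInt(nodes.size()));
        List<String> shards = shardsByNode.get(node);
        if (shards == null) {
          shards = new ArrayList<String>();
          shardsByNode.put(node, shards);
        }
        shards.add(e.getKey());
      }
      return shardsByNode;
    }
  }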

As results are returned, the query integrator needs to do several things:

a) if a node cannot be reached, the shards on that node should be found on
other nodes and the query repeated for those nodes.  If no other nodes can
be found for any of those shards, an error result should be recorded for the
missing shards.

b) if an exception is returned as a result for a shard, the query integrator
can try the query for the failing shards analogously to (a) or it can
immediately report an error.  My experience is that reporting the error is
generally better because shards tend to be replicated cleanly and errors in one
place will just occur elsewhere.  Mileage may vary on this point.

c) as results are recorded per shard the query integrator needs to decide
whether to return partial results or to wait for more results.  I have found
it very useful to have a pluggable policy here and have typically used a
policy with dual deadlines.  The first deadline is how long the query
integrator will wait for full results for every shard to be recorded.  The
second deadline is one after which whatever results are available will be
returned.  I also like a minimum percentage at the second deadline which
determines whether to mark the partial results as success or failure.

These three points give robustness to the process in a fashion very similar
to the way that map-reduce with task retry gives robustness to a parallel
program.  The result is very nice.
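
The dual-deadline policy in (c) could be captured by something as small as
this sketch (all names and thresholds are made up for illustration):

  public class DualDeadlinePolicy {
    private final long fullResultsDeadlineMs;  // wait this long for every shard to answer
    private final long hardDeadlineMs;         // after this, return whatever has arrived
    private final double minSuccessFraction;   // below this at the hard deadline, mark as failed

    public DualDeadlinePolicy(long fullResultsDeadlineMs, long hardDeadlineMs,
                              double minSuccessFraction) {
      this.fullResultsDeadlineMs = fullResultsDeadlineMs;
      this.hardDeadlineMs = hardDeadlineMs;
      this.minSuccessFraction = minSuccessFraction;
    }

    // Return now, or keep waiting for more shard results?
    public boolean shouldReturn(long elapsedMs, int shardsAnswered, int totalShards) {
      if (shardsAnswered == totalShards) return true;  // complete result set
      return elapsedMs >= hardDeadlineMs;              // second deadline: return partial results
    }

    // Before the first deadline it is still worth retrying a failed shard on another node.
    public boolean retryWorthwhile(long elapsedMs) {
      return elapsedMs < fullResultsDeadlineMs;
    }

    // At return time, does a partial answer still count as a success?
    public boolean isSuccess(int shardsAnswered, int totalShards) {
      return shardsAnswered >= minSuccessFraction * totalShards;
    }
  }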

That's it for now from the top of my head ...
>

One additional concept that I find useful is the idea of different kinds of
broadcast.  For search, we want to do what I call "horizontal broadcast" of
the query to exactly one copy of each shard.  For update, we often want to
do what I call "vertical broadcast" to each copy of a particular shard.

With a little bit of functional programming, these two kinds of broadcast
are nearly the complete repertoire of functions that need to be exposed to
the caller.  In particular, this shard management layer does not need to
know at all what kinds of queries or updates are being done.  Instead, it
just needs to know what the query object is, what arguments it wants and
whether to do a horizontal or vertical broadcast.   This makes the parallel
with map-reduce even stronger.  Map-reduce is a way of taking a few
functions that define a map-reduce program and passing the data from one
function to another in a highly ritualized fashion.  Shard management for
retrieval should be very much the same.  The shard management framework
doesn't need to care what the functions that implement the queries DO, it
just has to invoke them next to the right shards and assemble the results
and failures.  Of course, with retrieval the set of functions being invoked
is far more static than with, say, hadoop map functions, but the principle
of isolation should be the same.
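
A minimal sketch of that interface, with all names invented here; the point is
that the shard layer only ever sees an opaque function to run next to each shard:

  import java.util.List;
  import java.util.Map;

  // Work to run against one shard on one node; the shard layer never looks inside it.
  interface ShardFunction<T> {
    T apply(String node, String shard) throws Exception;
  }

  interface ShardBroadcast {
    // Horizontal broadcast: exactly one copy of every shard (searches).
    <T> Map<String, T> horizontal(ShardFunction<T> query) throws Exception;

    // Vertical broadcast: every copy of one particular shard (updates).
    <T> List<T> vertical(String shardId, ShardFunction<T> update) throws Exception;
  }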

All of these shard management concepts are present and well worked out in
Katta.  Katta provides at the least a useful proof of concept for these
ideas even if the specific implementation isn't what is desired for
Solr/Cloud.

Re: Solr Cloud wiki and branch notes

Posted by Andrzej Bialecki <ab...@getopt.org>.
Hi,

My 0.02 PLN on the subject ...

Terminology
-----------
First the terminology: reading your emails I have a feeling that my head 
is about to explode. We have to agree on the vocabulary, otherwise we 
have no hope of reaching any consensus. I propose the following 
vocabulary that has been in use and is generally understood:

* (global) search index: a complete collection of all indexed documents. 
From a conceptual point of view, this is our complete search space.

* index shard: a non-overlapping part of the search index. All shards in 
the system form together the complete search space of the search index. 
E.g. having initially one big index I could divide it into multiple 
shards using MultiPassIndexSplitter, and if I combined all the shards 
again, using IndexMerger, I should obtain the original complete search 
index (modulo changed Lucene docids .. doesn't matter). I strongly 
believe in micro-sharding, because small shards are much easier to handle 
and replicate. Also, since we control the shards we don't have to deal with 
overlapping shards, which is the curse of P2P search.

* partitioning: a method whereby we can determine the target shard ID 
based on a doc ID.

* search node: an application that provides search and update to one or 
more shards.

* search host: a machine that may run 1 or more search nodes.

* Shard Manager: a component that keeps track of allocation of shards to 
nodes (plus more, see below).

Now, to translate this into Solr-speak: depending on the details of the 
design, and the evolution of Solr, one search node could be one Solr 
instance that manages one shard per core. Let's forget here about the 
current distributed search component, and the current replication - they 
could be useful in this design as a raw transport mechanism, but someone 
else would be calling the shots (see below).

Architecture
------------
The replication and load balancing is a problem with many existing 
solutions, and this one in particular reminds me strongly of the Hadoop 
HDFS. In fact, early on during the development of Hadoop [1] I wondered 
whether we could reuse HDFS to manage Lucene indexes instead of opaque 
blocks of fixed size. It turned out to be infeasible, but the model of 
Namenode/Datanode still looks useful in our case, too.

I believe there are many useful lessons lurking in 
Hadoop/HBase/Zookeeper that we could reuse in our design. The following 
is just a straightforward port of the Namenode/Datanode concept.

Let's imagine a component called ShardManager that is responsible for 
managing the following data:

* list of shard ID-s that together form the complete search index,
* for each shard ID, list of search nodes that serve this shard.
* issuing replication requests
* maintaining the partitioning function (see below), so that updates are 
directed to correct shards
* maintaining heartbeat to check for dead nodes
* providing search clients with a list of nodes to query in order to 
obtain all results from the search index.

Whenever a new search node comes up, it reports its local shard ID-s 
(versioned) to the ShardManager. Based on these reports from the 
currently active nodes, the ShardManager builds this mapping of shards 
to nodes, and requests replication if some shards are too old, or if the 
replication count is too low, allocating these shards to selected nodes 
(based on a policy of some kind).

I believe most of the above functionality could be facilitated by 
Zookeeper, including the election of the node that runs the ShardManager.

Updates
-------
We need a partitioning schema that splits documents more or less evenly 
among shards, and at the same time allows us to split or merge 
unbalanced shards. The simplest function that we could imagine is the 
following:

	hash(docId) % numShards

though this has the disadvantage that any larger update will affect 
multiple shards, thus creating an avalanche of replication requests ... 
so a sequential model would probably be better, where ranges of docIds 
are assigned to shards.

Now, if any particular shard is too unbalanced, e.g. too large, it could 
be further split in two halves, and the ShardManager would have to 
record this exception. This is a very similar process to a region split 
in HBase, or a page split in btree DBs. Conversely, shards that are too 
small could be joined. This is the icing on the cake, so we can leave it 
for later.

After commit, a node contacts the ShardManager to report a new version 
of the shard. ShardManager issues replication requests to other nodes 
that hold a replica of this shard.

Search
------
There should be a component sometimes referred to as query integrator 
(or search front-end) that is the entry and exit point for user search 
requests. On receiving a search request this component gets a list of 
randomly selected nodes from the ShardManager to contact (the list 
containing all shards that form the global index), sends the query and 
integrates partial results (under a configurable policy for 
timeouts/early termination), and sends back the assembled results to the 
user.

Again, somewhere in the background the knowledge of who to contact 
should be handled by Zookeeper.

That's it for now from the top of my head ...

-----------

[1] 
http://www.mail-archive.com/nutch-developers@lists.sourceforge.net/msg02273.html

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: Solr Cloud wiki and branch notes

Posted by Yonik Seeley <ys...@gmail.com>.
On Fri, Jan 15, 2010 at 4:12 PM, Jason Rutherglen
<ja...@gmail.com> wrote:
> The page is huge, which signals to me maybe we're trying to do
> too much

This is really about doing not-so-much in the very near term, while
thinking ahead to the longer term.

> Revamping distributed search could be in a different branch
> (this includes partial results)

That could just be a separate patch - its scope is not that broad (I
think there may already be a JIRA issue open for it).

> Having a single solrconfig and schema for each core/shard in a
> collection won't work for me. I need to define each core
> externally, and I don't want Solr-Cloud to manage this, how will
> this scenario work?

We do plan on each core being able to have its own schema (so one
could try out a version of a schema and gradually migrate the
cluster).

It could also be possible to define a schema as "local" (i.e. use the
one on the local file system)

> A host is about the same as a node; I don't see the difference, or
> enough of one

A host is the hardware. It will have limited disk, limited CPU, etc.
At some point we will want to model this... multiple nodes could be
launched on one box.  We're not doing anything with it now, and won't
in the near future.

> Cluster resizing and rebalancing can and should be built
> externally and hopefully after an initial release that does the
> basics well

The initial release will certainly not be doing any resizing or rebalancing.
We should allow this to be done externally.  In the future, we
shouldn't require that this be done externally though (i.e. we should
somehow allow the cluster to grow w/o people having to write code).

> Collection is a group of cores?

A collection of documents - the complete search index.  It has a
single schema, etc.

-Yonik
http://www.lucidimagination.com