Posted to solr-dev@lucene.apache.org by Jon Gifford <jo...@gmail.com> on 2010/02/10 06:02:39 UTC

SolrCloud - Using collections, slices and shards in the wild

I've been following the progress of the SolrCloud branch closely, and
wanted to explain how I intend to use it, and what that means for how
the collections, slices and shards could work.

I should say up front that Mark, Yonik and I have exchanged a few
emails on this already, and Mark suggested that I switch to this list
to "drum up more interest in others playing with the branch and
chiming in with thoughts." I realize the code is still at a very early
stage, so this is really intended to be more grist for the mill, not a
criticism of the current implementation, which seems to me to be a very
nice base for what I need to do. Also, I'll apologize up front for the
length of this email, but I wanted to paint as clear a picture as I
could of how I intend to use this stuff.

The system I'll be building needs to be able to:

1) Support one index per customer, and many customers (thus, many
independent indices)

2) Share the same schema across all indices

3) Allow for time-based shards within a single customer's index.

4) As an added twist, some customers will be sending data faster than
can be indexed in a single core, so we'll also need to split the input
stream to multiple cores. Thus, for a given time-based shard, we're
likely to have multiple parallel indexers building independent shards.
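
To make requirement 4 concrete, here's a rough sketch of how I'd split
an incoming stream across N parallel indexers (the helper is
hypothetical, not a SolrCloud API -- just stable hashing on the
document id):

```python
import hashlib

def pick_indexer_core(doc_id, num_cores):
    """Route a document to one of num_cores parallel indexers.

    A stable hash of the document id means the same document always
    lands on the same core. Each core then builds an independent
    shard for the same time-based slice, e.g.
    customer_1_20100209_core0, _core1, ...
    (hypothetical naming, not SolrCloud code).
    """
    digest = hashlib.md5(doc_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_cores
```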

Mapping these requirements to the current state of SolrCloud, I could
use a single collection (i.e. a single schema) that all customer
indices are part of, then create slices of that collection to
represent an individual customer's index, each made up of a set of
time-based shards, which may themselves be built in parallel on
independent cores.

Alternatively, I could create a collection per customer, which removes
the need for slices, but means duplicating the schema many times. From
an operational standpoint, a single collection makes more sense to me.

The current state of the branch allows me to do some, but not all, of
what I need to do, and I wanted to walk through how I could see myself
using it.

Firstly, I'd like to be able to use the REST interface to create new
cores/shards - I'm not going to bet that this is what the final system
will do, but for the stage I'm at now, it's the simplest, quickest way
to get going. The current code uses the core name as the collection
name, which won't work for me if I use a single collection. For
example, if I want to create a new core for customer_1 for today's
index, I'd do the following:

    http://localhost:8983/solr/admin/cores?action=CREATE&instanceDir=.&name=collection_1&dataDir=data/customer_1_20100209

This approach is going to lead to a lot of Solr instances ;-)

Revising the code to use the core name as a slice, I'd get:

    http://localhost:8983/solr/admin/cores?action=CREATE&instanceDir=.&name=customer_1&dataDir=data/customer_1_20100209

but would need to explicitly add a collection=collection_1 parameter
to the call to make sure it uses the correct collection. The problem
with this approach is that I'm now limited to only being able to deliver
one shard per customer from each Solr instance.

Revising again, to use the core name as the shard name, I'd get:

    http://localhost:8983/solr/admin/cores?action=CREATE&instanceDir=.&name=customer_1_20100209&dataDir=data/customer_1_20100209

and would need explicit collection= and slice= parameters. This is the
ideal situation, because I can run as many shards from the same
customer as I like on a single Solr instance.

So, essentially what I'm saying is that cores and shards really are
identical, and when a core is created, we should be able to specify
the collection and slice it belongs to via the REST interface.
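
As a sketch of what that could look like from a client's point of view
(the collection= and slice= parameters are the proposal here, not
something the current branch supports):

```python
from urllib.parse import urlencode

def core_create_url(host, name, data_dir, collection=None, slice_name=None):
    """Build a CoreAdmin CREATE URL.

    The collection/slice parameters are the proposed (hypothetical)
    additions; everything else matches the existing CREATE call.
    """
    params = {"action": "CREATE", "instanceDir": ".",
              "name": name, "dataDir": data_dir}
    if collection:
        params["collection"] = collection
    if slice_name:
        params["slice"] = slice_name
    return "http://%s/solr/admin/cores?%s" % (host, urlencode(params))
```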

Here are Mark's comments on this...

> I think we simply haven't thought much about creating cores dynamically
> with http requests yet. You can set a custom shard id initially in the
> solr.xml, or using the CloudDescriptor on the CoreDescriptor when doing
> it programmatically.
>
> It's a good issue to bring up - I think we will want support to handle
> this stuff with the core admin handler. I can add the basics pretty soon
> I think.
>
> The way things default now (core name is collection) is really only for
> simple bootstrap situations.

and

> Yeah, I think you can do quite a bit with it now, but there is def still
> a lot planned. We are actually working on polishing off what we have as
> a first plateau now. We have mostly been working from either static
> configuration and/or java code in building it up though, so personally
> it hadn't even yet hit me to take care of the HTTP CoreAdmin side of
> things. From a dev side, I just haven't had to use it much, so when I
> think dynamic cores I'm usually thinking java code style.

The second part of what I need is to be able to search a single
customer's index, which I'm assuming will be a slice. Something like:

    http://localhost:8983/solr/collection1/select?distrib=true&slice=customer_1

would do the trick, assuming we have the slice => shards mapping available.
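
Assuming that mapping exists (sketched here as a plain dict -- where
SolrCloud would actually keep it, e.g. ZooKeeper, is an open
question), resolving a slice into an explicit shards list might look
like:

```python
from urllib.parse import urlencode

# Hypothetical slice => shards mapping; in practice this would live
# in shared cluster state, not client code.
SLICE_TO_SHARDS = {
    "customer_1": ["customer_1_20100208", "customer_1_20100209"],
}

def slice_query_url(host, collection, slice_name):
    """Expand a slice name into the shards= parameter of a
    distributed query (a sketch, not the SolrCloud API)."""
    shards = SLICE_TO_SHARDS[slice_name]
    params = {"distrib": "true", "shards": ",".join(shards)}
    return "http://%s/solr/%s/select?%s" % (host, collection,
                                            urlencode(params))
```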

Reading over some of the previous discussions, slices seem to be
somewhat contentious, and I wanted to chime in on them a bit here. It
seems to me that slices are loosely defined, and I think that's a good
thing. If you think of slices as being similar to tags, then it's easy
to imagine that any given shard can belong to many different slices.
For example,

*  I'm going to create a slice per customer, but I don't think there
should be anything stopping me from creating subslices of that slice -
maybe I have a "customer_1_mostRecentWeek" slice, which is the last 7
days' worth of daily shards from the customer_1 slice.

*  I may want to build combined indices for some set of customers
(customer_2 through customer_10, say). If I do that, it may still be
desirable (to simplify the front-end system that is using Solr) to
create individual customer_2, customer_3, ..., customer_10 slices,
even though they all map to the same shards.

* I may want to combine results from two different customers. It may
be easier to just define a third slice that is the union of the two
customers slices.

One advantage of using slices in this fairly promiscuous fashion is
that it may simplify some of the more interesting use-cases. If we
know that there is an automatically maintained "mostRecentWeek" slice
available, for example, then some of the more complex shard selection
logic can be moved out of the query-time processing entirely.
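
Treating slices as tags, the bookkeeping is just a many-to-many
mapping. A sketch (all names hypothetical, not SolrCloud code):

```python
from datetime import date, timedelta

# A shard can carry many slice tags; a slice is just the set of
# shards tagged with it.
shard_tags = {
    "customer_1_20100201": {"customer_1"},
    "customer_1_20100209": {"customer_1"},
    "customer_2_20100209": {"customer_2"},
}

def shards_in_slice(slice_name):
    return {s for s, tags in shard_tags.items() if slice_name in tags}

def tag_most_recent_week(today):
    """Auto-maintained slice: tag customer_1's daily shards from
    the last 7 days."""
    recent = {(today - timedelta(days=n)).strftime("%Y%m%d")
              for n in range(7)}
    for shard, tags in shard_tags.items():
        if "customer_1" in tags and shard.rsplit("_", 1)[-1] in recent:
            tags.add("customer_1_mostRecentWeek")

def define_union_slice(name, *slices):
    """Define a slice as the union of other slices (e.g. combined
    results for two customers)."""
    for tags in shard_tags.values():
        if tags & set(slices):
            tags.add(name)
```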

Enough writing for now, on to some experimentation :-)

Re: SolrCloud - Using collections, slices and shards in the wild

Posted by Jon Gifford <jo...@gmail.com>.
By the way, and in case it's not obvious, I don't mean to suggest that
we should remove the ability to specify a set of shards in the search
interface. What I'm saying is that using something like "group"
instead is simpler in most cases.

Jon

On Wed, Feb 10, 2010 at 9:14 AM, Jon Gifford <jo...@gmail.com> wrote:
> [...]

Re: SolrCloud - Using collections, slices and shards in the wild

Posted by Jon Gifford <jo...@gmail.com>.
On Wed, Feb 10, 2010 at 8:04 AM, Yonik Seeley
<yo...@lucidimagination.com> wrote:
> On Wed, Feb 10, 2010 at 12:02 AM, Jon Gifford <jo...@gmail.com> wrote:
>> Alternatively, I could create a collection per customer, which removes
>> the need for slices, but means duplicating the schema many times.
>
> Multiple collections should be able to share a single config (schema
> and related config files).

OK, this solves the top-level problem (how do I manage a single
customer's index, while guaranteeing that all indices have the same
schema), which is good.

> Note: I've backed off of the use of "slice" in the public APIs since
> it was contentious (although I still think it's a useful concept and
> it does remain in some of the code).  "shard" is kind of ambiguous,
> but people are pretty good at dealing with ambiguity (and removing
> that ambiguity by introducing another term seemed to add more
> perceived complexity).

I agree that it's a very useful concept, and wonder how much of the
contention is just a terminology issue? If we used subcollection
instead, the intent becomes clearer for some use-cases. If we used
tag, or taggroup, then a slightly different (more powerful?) intent is
suggested.

>> The second part of what I need is to be able to search a single
>> customers index, which I'm assuming will be a slice. Something like:
>>
>>    http://localhost:8983/solr/collection1/select?distrib=true&slice=customer_1
>
> The URLs on the SolrCloud page have been updated - this would now be
> http://localhost:8983/solr/collection1/select?distrib=true&shards=customer_1
>
> This will work as long as no customer becomes bigger than a shard.  If
> that's not the case, you could query the entire collection and filter
> on customer_1, or create a collection per customer (or do both, if you
> have many small customers that you want to pack in a single shard).

Right. I'd most likely default to using a collection per customer
(assuming that collections can share a single config) because a single
customer's index will be larger than a single shard.

>
> http://localhost:8983/solr/collection1/select?distrib=true&collection=customer_1
>
>> Reading over some of the previous discussions, slices seem to be
>> somewhat contentious, and I wanted to chime in on them a bit here. It
>> seems to me that slices are loosely defined, and I think thats a good
>> thing. If you think of slices as being similar to tags, then its easy
>> to imagine that any given shard can belong to many different slices.
>
> I wouldn't call it a "slice" but I've also been thinking about how to
> select groups of nodes.
> Extending that to shards would also make sense.

I think the important points here are that if there is the concept of
a group (or slice or subcollection or tag - whatever terminology we
end up using), then

1) The client (typically some front-end code) can use a simpler
interface, which I think is a good thing. Solr doesn't need to expose
how many shards there really are, or what they're named, and the FE
doesn't have to try and generate a list of shard IDs just to do a
search.

2) Some piece of code has to decide what shards to actually search,
and that piece of code has to know exactly what shards actually exist.
If that decision is made in the client, then it has to be made in
every client (your customer-facing search interface, any and all
background tasks you have running, any ad-hoc searches you do for
analysis or spot checking or...). For the sake of simplicity and
sanity, you don't want to have to replicate that decision-making code
across multiple apps or languages.

3) The collection and shard entities are at opposite ends of a fairly
wide divide, and there are cases where you need something
"in-between".

In most cases, a simple collection search will suffice, but in those
cases where you want to limit the search to particular shards, it
makes more sense to me to manage that set of shards within Solr, and
expose only the fact that the "groups" are available.

Here's another example:

Let's say you're generating hourly shards, to limit the maximum size of
the shard that is taking updates, for performance reasons. Let's also
assume that you want to roll those hourlies up into daily or weekly or
maximum-size shards once they become less active, so Solr isn't trying
to search 24 shards to get a single day's worth of results. If the
"group" concept exists, then you can hide all of the mechanics of how
and when that happens from the client, while still allowing it to have
some control over how far back it can search, by exposing "groups"
that limit it to the last day or week or whatever makes sense for your
app.
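
A sketch of how such a group could hide the rollup mechanics (all
names and the group definitions are hypothetical):

```python
def shards_for_group(group, hourly, daily):
    """Resolve a time-window group into concrete shards.

    hourly: hourly shard names still taking updates
    daily:  rolled-up daily shards, oldest first
    The client only ever sees the group name; whether a given hour
    has been rolled up yet is invisible to it.
    """
    if group == "lastDay":
        # Hours not yet rolled up, plus the most recent daily shard.
        return hourly + daily[-1:]
    if group == "lastWeek":
        return hourly + daily[-7:]
    raise KeyError(group)
```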

cheers

Jon


>
> -Yonik
> http://www.lucidimagination.com
>

Re: SolrCloud - Using collections, slices and shards in the wild

Posted by Yonik Seeley <yo...@lucidimagination.com>.
On Wed, Feb 10, 2010 at 12:02 AM, Jon Gifford <jo...@gmail.com> wrote:
> Alternatively, I could create a collection per customer, which removes
> the need for slices, but means duplicating the schema many times.

Multiple collections should be able to share a single config (schema
and related config files).

[...]
> when a core is created, we should be able to specify
> the collection and slice that they belong to, via the REST interface.

Yep.

Note: I've backed off of the use of "slice" in the public APIs since
it was contentious (although I still think it's a useful concept and
it does remain in some of the code).  "shard" is kind of ambiguous,
but people are pretty good at dealing with ambiguity (and removing
that ambiguity by introducing another term seemed to add more
perceived complexity).

> The second part of what I need is to be able to search a single
> customers index, which I'm assuming will be a slice. Something like:
>
>    http://localhost:8983/solr/collection1/select?distrib=true&slice=customer_1

The URLs on the SolrCloud page have been updated - this would now be
http://localhost:8983/solr/collection1/select?distrib=true&shards=customer_1

This will work as long as no customer becomes bigger than a shard.  If
that's not the case, you could query the entire collection and filter
on customer_1, or create a collection per customer (or do both, if you
have many small customers that you want to pack in a single shard).

http://localhost:8983/solr/collection1/select?distrib=true&collection=customer_1

> Reading over some of the previous discussions, slices seem to be
> somewhat contentious, and I wanted to chime in on them a bit here. It
> seems to me that slices are loosely defined, and I think thats a good
> thing. If you think of slices as being similar to tags, then its easy
> to imagine that any given shard can belong to many different slices.

I wouldn't call it a "slice" but I've also been thinking about how to
select groups of nodes.
Extending that to shards would also make sense.

-Yonik
http://www.lucidimagination.com

Re: SolrCloud - Using collections, slices and shards in the wild

Posted by Ted Dunning <te...@gmail.com>.
Yeah.... my decision likely would have been different if it weren't for the
fact that Katta is industrial strength already.

On Wed, Feb 10, 2010 at 12:59 PM, Jon Gifford <jo...@gmail.com> wrote:

> > We currently use Katta and this system works really, really well.
>
> Katta does look nice, but the SolrCloud stuff seems simpler and closer
> to what I need. We shall see :-)




-- 
Ted Dunning, CTO
DeepDyve

Re: SolrCloud - Using collections, slices and shards in the wild

Posted by Jon Gifford <jo...@gmail.com>.
On Wed, Feb 10, 2010 at 10:47 AM, Ted Dunning <te...@gmail.com> wrote:
> This is analogous to the multiple data sources that we have at
> Deepdyve<http://www.deepdyve.com>.
> In a fully sharded and balanced environment, I have found it *much* more
> efficient to put all data sources into a single collection and use a filter
> to select one or the other.

I suspect I'll be forced into the position of combining some subsets
of customers if (when?) multi-core performance becomes an issue, but
even if that's true, there are still likely to be customers that
(because of input volume) require their own collection. So I think the
capability to do what I'm suggesting is still key to me. To give you
an idea of what I mean by "input volume", we're talking to potential
customers who have streams of 15-20k "documents" per second, coming in
24/7, and the hardware we're going to be using can handle something
like 4-5k/second per core. So for a customer like that, we'll create a
collection and split the input 4 or 5 or 6 ways to make sure we're
keeping up to date.
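
The sizing arithmetic is simple enough to sketch (the headroom factor
is my own assumption, not a Solr number):

```python
import math

def indexer_cores_needed(docs_per_sec, per_core_capacity, headroom=1.2):
    """How many parallel indexer cores are needed to keep up with
    an input stream, with a safety margin so a core never runs at
    100% of its indexing capacity."""
    return math.ceil(docs_per_sec * headroom / per_core_capacity)
```

With a 20k/sec customer and 4k/sec cores, that gives the 6-way split
mentioned above.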


> The rationale is that data sources are
> distributed in size according to a rough long-tail distribution.  For the
> largest ones, the filters are about as efficient as a separate index because
> they are such a large fraction of the index.  For the small ones, the
> filtered query is so fast that other issues form the bottleneck anyway.  The
> operational economies of not managing hundreds of indexes and the much
> better load balancing makes the integrated solution massively better for
> me.

Yeah, I'm a little concerned, to be honest, about the idea of having
hundreds or thousands of collections floating around. No matter how
you look at it, that's some serious overhead. Combining the "long-tail"
customers (i.e. low data rate) into one or more combo-indices makes a
fair amount of sense, and I've been pondering that question for a
while now.

> We currently use Katta and this system works really, really well.

Katta does look nice, but the SolrCloud stuff seems simpler and closer
to what I need. We shall see :-)

> One big difference in our environments is that for me, the dominant query
> pattern involves most data sources while for you, the dominant pattern will
> likely involve a single data source.

Yeah. In fact, we have a hard requirement that customer X can't see
customer Y's data. Other differences are that (1) most customers are
interested in the newest data, so we're likely to have a very hot core
with the most recent data, and a bunch of cooler ones with historical
data that is accessed infrequently, and (2) we don't expect that
search load will be either high or continuous, since we're primarily
serving as a forensics tool, so will be very bursty for each customer.

>
> On Tue, Feb 9, 2010 at 9:02 PM, Jon Gifford <jo...@gmail.com> wrote:
>
>> 1) Support one index per customer, and many customers (thus, many
>> independent indices)
>>
>
>
>
> --
> Ted Dunning, CTO
> DeepDyve
>

Re: SolrCloud - Using collections, slices and shards in the wild

Posted by Ted Dunning <te...@gmail.com>.
This is analogous to the multiple data sources that we have at
Deepdyve<http://www.deepdyve.com>.
In a fully sharded and balanced environment, I have found it *much* more
efficient to put all data sources into a single collection and use a filter
to select one or the other.  The rationale is that data sources are
distributed in size according to a rough long-tail distribution.  For the
largest ones, the filters are about as efficient as a separate index because
they are such a large fraction of the index.  For the small ones, the
filtered query is so fast that other issues form the bottleneck anyway.  The
operational economies of not managing hundreds of indexes and the much
better load balancing makes the integrated solution massively better for
me.  We currently use Katta and this system works really, really well.
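
In Solr terms, selecting one data source out of the shared collection
is just a filter query; a sketch (the "source" field name is
hypothetical):

```python
from urllib.parse import urlencode

def source_query_url(host, collection, query, source):
    """Query one data source inside a shared collection via an fq
    filter on a hypothetical source field. A cached filter is about
    as efficient as a separate index for large sources, and fast
    enough not to matter for small ones."""
    params = {"q": query, "fq": "source:%s" % source}
    return "http://%s/solr/%s/select?%s" % (host, collection,
                                            urlencode(params))
```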

One big difference in our environments is that for me, the dominant query
pattern involves most data sources while for you, the dominant pattern will
likely involve a single data source.

On Tue, Feb 9, 2010 at 9:02 PM, Jon Gifford <jo...@gmail.com> wrote:

> 1) Support one index per customer, and many customers (thus, many
> independent indices)
>



-- 
Ted Dunning, CTO
DeepDyve