Posted to solr-user@lucene.apache.org by Chris Troullis <cp...@gmail.com> on 2017/04/15 13:08:21 UTC

Sharding strategy for optimal NRT performance

Hi!

I am looking for some advice on a sharding strategy that will produce
optimal performance in the NRT search case for my setup. I have come up
with a strategy that I think will work based on my experience, testing, and
reading of similar questions on the mailing list, but I was hoping to run
my idea by some experts to see if I am on the right track or am completely
off base.

*Let's start off with some background info on my use case*:

We are currently using Solr (5.5.2) with the classic Master/Slave setup.
Because of our NRT requirements, the slave is pretty much only used for
failover; all writes/reads go to the master (which I know is not ideal, but
that's what we're working with!). We have 6 different indexes with
completely different schemas for various searches in our application. We
have just over 300 tenants, which currently all reside within the same
index for each of our indexes. We separate our tenants at query time via a
filter query with a tenant identifier (which works fine). None of the indexes
is tremendously large; they range from 1M documents to the largest at around
12M documents. Our load is not huge, as search is not the core functionality
of our application, but merely a tool our users rely on to get to what they
are looking for in the app. I believe our peak load barely goes over 1 QPS.
Even though our number of documents isn't super high, we do some pretty
complex faceting, and block joins in some cases, which along with crappy
hardware in our data center (no SSDs) initially led to some pretty poor
query times for our customers. This was compounded by the fact that we are
constantly indexing throughout the day (a job that runs once per minute) and
we auto soft commit (openSearcher=true) every 1 minute. Because of the
nature of our application, NRT updates are necessary. As we all know,
opening searchers this frequently has the drawback of invalidating all of
our searcher-based caches, causing query times to be erratic and slower on
average. With our current setup, we have solved our query performance problems
by setting up autowarming, both on the filter cache and via static warming
queries.
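
To make our current setup concrete, here is roughly what one of our
tenant-filtered queries looks like via SolrJ (the core name, URL, and tenant
field are made up for illustration, and the Builder-style client assumes a
reasonably recent SolrJ):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class TenantQueryExample {
        public static void main(String[] args) throws Exception {
            // All reads (and writes) currently go to the master in our classic
            // master/slave setup; the slave is only there for failover.
            try (HttpSolrClient solr =
                     new HttpSolrClient.Builder("http://solr-master:8983/solr/orders").build()) {
                SolrQuery q = new SolrQuery("whatever the user typed");
                // Tenant separation happens purely at query time, via a filter
                // query on a tenant identifier field.
                q.addFilterQuery("tenant_id:42");
                q.setRows(10);
                QueryResponse rsp = solr.query(q);
                System.out.println("hits for tenant 42: " + rsp.getResults().getNumFound());
            }
        }
    }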

*The problem:*

So now for the problem. While we are running great from a performance
perspective, we are receiving complaints from customers saying that the
changes they are making are slow to be reflected in search. Because of the
nature of our application, this has a significant impact on their user
experience, and is an issue we need to solve. Overall, we would like to
reduce our NRT visibility delay from the minutes we have now down to
seconds. The problem is doing this in a way that won't significantly affect
our query performance. We are already seeing maxWarmingSearchers warnings
in our logs occasionally with our current setup, so just indexing more
frequently is not a viable solution. In addition to this, autowarming in
itself is problematic for the NRT use case, as the new searcher won't start
serving requests until it is fully warmed anyway, which is sort of counter
to the goal of decreasing the time it takes for new documents to be visible
in search. And so this is the predicament we find ourselves in. We can
index more frequently (and soft commit more frequently), but we will have
to remove (or greatly decrease) our autowarming, which will destroy our
search performance. Obviously there is some give and take here; we can't
have true NRT search with optimal query performance, but I am hoping to
find a solution that will provide acceptable results for both.

*Proposed solution:*

I have done a lot of research and experimentation on this issue and have
started coming up with what I believe will be a decent solution to the
aforementioned problem. First off, I would like to make the move over to
SolrCloud. We had been contemplating this for a while anyway, as we
currently have no load balancing at all (since our slave is just used for
failover), but I am also thinking that by using the right sharding strategy
we can improve our NRT experience as well. I first started looking at the
standard compositeId routing, and while it would ensure that all of a single
tenant's data is located on the same shard, the large discrepancy between the
amounts of data our tenants have means our shards would end up very unevenly
distributed in terms of number of documents. Ideally, we
would like all of our tenants to be isolated from a performance perspective
(from a security perspective we are not really concerned, as all of our
queries have a tenant identifier filter query already). Basically, we don't
want tiny tenant A to be screwed over because they were unlucky enough to
land on huge tenant B's shard. We do know the footprint of each tenant in
terms of number of documents, so technically we could work out a sharding
strategy manually which would evenly distribute the tenants based on number
of documents, but since we have 6 different indexes, and the tenants' document
distribution is different in each of them, this would become a headache to
manage. And so the conclusion I have come to is that for our NRT use case, we
would get the best results by having a separate shard (and replica) per tenant
for each index (collection). The benefit to this approach is that each tenant
is isolated
from a performance perspective (in terms of # of documents and commits,
obviously the resources on the node are still shared). With this approach,
each individual shard should be small enough that caching/autowarming
becomes less important, and in addition to this, even though we will be
indexing more frequently and auto soft committing more frequently (say
every second), new searchers shouldn't be opened as frequently on the
individual shards, as they will only have to be reopened if a document is
re-indexed on that particular shard. In other words, tenant A could be
making a bunch of changes and having to open new searchers, but it
shouldn't affect tenant B at all (since searchers are tied to shards, at
least that is my understanding).
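
To make the idea concrete, here is a minimal sketch of what I have in mind,
using SolrJ's Collections API helpers with the implicit router and a routing
field (the collection name, configset, shard names, and the tenant_id field
are all hypothetical, and the helper methods assume a 6.x/7.x-era SolrJ):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.client.solrj.request.CollectionAdminRequest;
    import org.apache.solr.common.SolrInputDocument;

    public class ShardPerTenantSketch {
        public static void main(String[] args) throws Exception {
            try (CloudSolrClient cloud = new CloudSolrClient.Builder()
                    .withZkHost("zk1:2181,zk2:2181,zk3:2181")
                    .build()) {

                // One named shard per tenant: the implicit router places each document
                // on the shard whose name matches the value of the routing field,
                // instead of hashing the document id across shards.
                CollectionAdminRequest
                        .createCollectionWithImplicitRouter(
                                "orders",                     // one of our 6 indexes
                                "orders_config",              // configset in ZooKeeper
                                "tenant_a,tenant_b,tenant_c", // one shard per tenant
                                1)                            // replicas per shard
                        .setRouterField("tenant_id")
                        .process(cloud);

                // A document carries the routing field, so it lands on its tenant's
                // shard, and (as far as I understand) only that shard's searcher
                // should need a real reopen on the next soft commit.
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", "12345");
                doc.addField("tenant_id", "tenant_b");
                cloud.add("orders", doc);

                // Once a (soft) commit has happened, queries can be pinned to a single
                // tenant's shard with _route_, so other tenants' shards and their
                // caches/searchers are never touched.
                SolrQuery q = new SolrQuery("*:*");
                q.addFilterQuery("tenant_id:tenant_b");
                q.set("_route_", "tenant_b");
                System.out.println(cloud.query("orders", q).getResults().getNumFound());
            }
        }
    }

With ~300 tenants the shard list would obviously be generated programmatically
rather than typed out; the point is just that routing by tenant should keep
each tenant's commits and searcher reopens confined to its own shard, which is
exactly the assumption I want to validate.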

*Questions:*

Hopefully I have done a good enough job explaining our current setup, the
problem I am trying to solve, and the solution I have come up with. I would
love to hear any feedback on my proposed solution, or corrections to any of
the assumptions I am making. As far as specific questions go, other than "Is
this viable?", I did have a couple of questions around separating concerns at
a collection level vs. a shard level:

1. I have read other discussions on the mailing list around strategies for
implementing multi-tenancy, but most of them seem to revolve around having
a *collection* per tenant, as opposed to what I am thinking about doing in
having a single collection but with a *shard* per tenant. As I understand
it, each shard is essentially a separate Lucene index, and there is a base
overhead cost associated with that. This cost obviously applies whether you
use a collection per tenant or a shard per tenant, as each collection has to
have at least 1 shard. Basically, I am wondering what the pros and cons would
be between the two approaches. Is there additional overhead in creating
additional collections? I.e., is there more overhead using the
collection-per-tenant approach than there would be having a single collection
with 1 shard per tenant? (See the sketch after these questions for the two
layouts I mean.)

2. Regarding the overhead of collections/shards per tenant, I have read in
multiple places that you should be wary about creating thousands of
collections with SolrCloud. Does this also apply to shards? For example,
for my proposed solution, I would have, say, 300 shards per collection x 6
collections x 2 (1 replica per shard), which would put me at 3600
shards/replicas across the 6 collections. Obviously this would be split over
multiple nodes, but is this likely to cause problems? Is the issue caused
by having too many shards on one node (i.e., would adding nodes help), or is
it a broader issue with ZooKeeper coordination?
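
For reference, here is the rough sketch of the two layouts I am comparing in
question 1, again via SolrJ's Collections API helpers (names are hypothetical;
option B just repeats the implicit-router call from the earlier sketch for
contrast):

    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.client.solrj.request.CollectionAdminRequest;

    public class TenantLayoutOptions {

        // Option A: one single-shard collection per tenant.
        static void collectionPerTenant(CloudSolrClient cloud, String tenant) throws Exception {
            CollectionAdminRequest
                    .createCollection("orders_" + tenant, "orders_config",
                            1 /* shards */, 1 /* replicas */)
                    .process(cloud);
        }

        // Option B: one collection with a named shard per tenant (implicit router),
        // as in the earlier sketch.
        static void shardPerTenant(CloudSolrClient cloud, String tenantShardsCsv) throws Exception {
            CollectionAdminRequest
                    .createCollectionWithImplicitRouter("orders", "orders_config",
                            tenantShardsCsv, 1 /* replicas */)
                    .setRouterField("tenant_id")
                    .process(cloud);
        }
    }

Either way it seems to come out to roughly one Lucene core per tenant per
index, which is why I am unsure how much practical difference there is between
the two approaches.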


I apologize for the giant wall of text; I just wanted to make sure I gave a
thorough explanation of my problem. Again, any advice anyone could provide
would be greatly appreciated!

Thanks!