You are viewing a plain text version of this content. The canonical link for it is here.

Posted to solr-user@lucene.apache.org by Jie Sun <js...@yahoo.com> on 2012/11/05 22:26:20 UTC

load balance with SolrCloud

we are using solr 3.5 in production and we deal with customers data of
terabytes.

we are using shards for large customers and write our own replica management
in our software.

Now with the rapid growth of data, we are looking into solrcloud for its
robustness of sharding and replications.

I understand by read some documents on line that there is no SPOF using
solrcloud, so any instance in the cluster can server the query/index.
However, is it true that we need to write our own load balancer in front of
solrCloud? 

For example if we want to implement a model similar to Loggly, i.e. each
customer start indexing into the small shard of its own, then if any of the
customers grow more than the small shard's limit, we switch to index into
another small shard (we call it front end shard), meanwhile merge the just
released small shard to next level larger shard. 

Since the merge can happen between two instances on different servers, we
probably end up with synch the index files for the merging shards and then
use solr merge.

I am curious if there is anything solr provide to help on these kind of
strategy dealing with unevenly grow big customer data (a core)? or do we
have to write these in our software layer from scratch?

thanks
Jie



--
View this message in context: http://lucene.472066.n3.nabble.com/load-balance-with-SolrCloud-tp4018367.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: load balance with SolrCloud

Posted by Jie Sun <js...@yahoo.com>.

thanks for your feedback Erick.

I am also aware of the current limitation of shard number in a collection is
fixed. changing the number will need re-config and re-index. Let's say if
the limitation gets levitated in near future release, I would then consider
setup collection for each customer, which will include varies number of
shards and their replicas (depend on the customer size and it should grow
dynamically).

 so this will lead to having multiple collections on one solr server
instance... I assume setup n collections on one server is not an issue? or
is it? I am skeptical, see example on solr wiki below, it seems it is
starting a solr instance with a specific collection and its config:
cd example
java -Dbootstrap_confdir=./solr/collection1/conf
-Dcollection.configName=myconf -DzkRun -DnumShards=2 -jar start.jar

thanks
Jie



--
View this message in context: http://lucene.472066.n3.nabble.com/load-balance-with-SolrCloud-tp4018367p4018659.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: load balance with SolrCloud

Posted by Erick Erickson <er...@gmail.com>.

This is a complex setup, all right.

A pluggable sharding strategy is definitely something that is on the
roadmap for SolrCloud, but hasn't made it into the code base yet.

Keep in mind, though, that all the SolrCloud goodness centers around the
idea of a single index that may be sharded. I don't think SolrCloud has had
time to really think about handling the situation in which you have a bunch
of cores that may or may not be sharded but are running on the same server.
I don't know that it _doesn't_ work, mind you, but that scenario doesn't
seem like the prime use-case for SolrCloud.

That said, I don't know that such a situation is  _not_ do-able in
SolrCloud. Mostly I haven't explored that kind of functionality yet.

Not much help, I know. I suspect that this is one of those cases where _we_
will learn from _you_ if you try to meld SolrCloud with your setup. Sounds
like a great Wiki page if you do pursue this!


Best
Erick


On Tue, Nov 6, 2012 at 4:58 PM, Jie Sun <js...@yahoo.com> wrote:

> Hi Eric,
> thanks for your information. I read all the related issues with SOLR-1293
> as
> your just pointed me to.
>
> It seems they are not very suitable for our scenario.
>
> We do have couple of hundreds cores (you are right each customer will be
> corresponded to a core) typically on one solr instance. and all of them
> need
> to be actively working with indexing and queries. So we are not having like
> 10s of thousands of cores that only part of them need to be loaded.
>
> Our issues are on some servers that host very large customers, it runs out
> of disk space after some time due to the large among of index data. I have
> written a restful service that is being deployed with solr on tomcat to
> identify the large customer (core) indexing requests and consult with a dns
> service, it then off loads the indexing requests to additional solr
> servers,
> and support queries using solr shards on these servers going forward.
>
> We also have replicas for each shard, managed by our own software using
> peer
> model (I am thinking about using solr replications after 1.4).
>
> to me, SolrCould is like sharding+replication+zookeeper. I could be wrong.
> But if I am right, with very big existing data in our service, and we
> already have a lot of software in place working pretty well utilizing solr
> 1.4, I am just trying to figure out if it will worth it to migrate the
> production system to use SolrCloud.
>
> The problem we need to fix is in one area : I need to automate the off-load
> (sharding) process. Right now we use some monitor system to watch for the
> growth on each server. When we find a fast growing large core(customer), we
> will start to manually configure our dns directory and start adding
> shard(s)
> to it (basically we create a same core name on a different solr
> server/instance). my restful service going forward will then direct the
> queries for the customer onto these sharded cores using solr shards.
>
> If SolrCloud can not really help me automate this process, it is not very
> attractive to me right now. I have read some of the topics, I looked into
> distributing indexing, distributed update processor ... none of them can
> help the way I have been looking for. So I guess using solrcloud or not, I
> will need to write my own kind of 'load balancer' for indexing, unless I am
> wrong.
>
> I did come across Jon's white paper on Loggly, I have designed a model
> based
> on what he has done. The solution should be able to automatically creating
> shards, but it will need rsych index files for a core to different server
> and use solr merge to merge small core into larger cores, or use core admin
> to add new core on the fly.
>
> is this approach sounds like someone is already familiar with and had
> out-of-box solution? When I looked into solrcloud, I was expecting some
> pluggable index distributing policy factory I can customize.
> The closest thing I found was  SOLR-2593 (A new core admin action 'split'
> for splitting index ) but not exactly what I wanted.  Let me know if you
> can
> advice me on this more.
>
> thanks
> Jie
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/load-balance-with-SolrCloud-tp4018367p4018609.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>

Re: load balance with SolrCloud

Posted by Jie Sun <js...@yahoo.com>.

Hi Eric,
thanks for your information. I read all the related issues with SOLR-1293 as
your just pointed me to.

It seems they are not very suitable for our scenario.

We do have couple of hundreds cores (you are right each customer will be
corresponded to a core) typically on one solr instance. and all of them need
to be actively working with indexing and queries. So we are not having like
10s of thousands of cores that only part of them need to be loaded.

Our issues are on some servers that host very large customers, it runs out
of disk space after some time due to the large among of index data. I have
written a restful service that is being deployed with solr on tomcat to
identify the large customer (core) indexing requests and consult with a dns
service, it then off loads the indexing requests to additional solr servers,
and support queries using solr shards on these servers going forward.

We also have replicas for each shard, managed by our own software using peer
model (I am thinking about using solr replications after 1.4).

to me, SolrCould is like sharding+replication+zookeeper. I could be wrong.
But if I am right, with very big existing data in our service, and we
already have a lot of software in place working pretty well utilizing solr
1.4, I am just trying to figure out if it will worth it to migrate the
production system to use SolrCloud.

The problem we need to fix is in one area : I need to automate the off-load
(sharding) process. Right now we use some monitor system to watch for the
growth on each server. When we find a fast growing large core(customer), we
will start to manually configure our dns directory and start adding shard(s)
to it (basically we create a same core name on a different solr
server/instance). my restful service going forward will then direct the
queries for the customer onto these sharded cores using solr shards.

If SolrCloud can not really help me automate this process, it is not very
attractive to me right now. I have read some of the topics, I looked into
distributing indexing, distributed update processor ... none of them can
help the way I have been looking for. So I guess using solrcloud or not, I
will need to write my own kind of 'load balancer' for indexing, unless I am
wrong.

I did come across Jon's white paper on Loggly, I have designed a model based
on what he has done. The solution should be able to automatically creating
shards, but it will need rsych index files for a core to different server
and use solr merge to merge small core into larger cores, or use core admin
to add new core on the fly.

is this approach sounds like someone is already familiar with and had
out-of-box solution? When I looked into solrcloud, I was expecting some
pluggable index distributing policy factory I can customize.
The closest thing I found was SOLR-2593 (A new core admin action 'split'
for splitting index ) but not exactly what I wanted. Let me know if you can
advice me on this more.

thanks
Jie

--
View this message in context: http://lucene.472066.n3.nabble.com/load-balance-with-SolrCloud-tp4018367p4018609.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: load balance with SolrCloud

Posted by Erick Erickson <er...@gmail.com>.

I think you're conflating shards and cores. Shards are physical slices of a
singe logical index. An incoming query is sent to each and every shard and
the results tallied.

The case you're talking about seems to be more you have N separate indexes
(cores), where each core is for a specific user. This is vastly different
from SolrCloud, which puts all the data into one huge logical index!

Furthermore, presently there's no way to direct specific documents to
specific shards in SolrCloud (although a pluggable sharding mechanism is
under development).

You might be interested in SOLR-1293 (under development) for managing lots
of cores.

On Mon, Nov 5, 2012 at 4:26 PM, Jie Sun <js...@yahoo.com> wrote:

> we are using solr 3.5 in production and we deal with customers data of
> terabytes.
>
> we are using shards for large customers and write our own replica
> management
> in our software.
>
> Now with the rapid growth of data, we are looking into solrcloud for its
> robustness of sharding and replications.
>
> I understand by read some documents on line that there is no SPOF using
> solrcloud, so any instance in the cluster can server the query/index.
> However, is it true that we need to write our own load balancer in front of
> solrCloud?
>
> For example if we want to implement a model similar to Loggly, i.e. each
> customer start indexing into the small shard of its own, then if any of the
> customers grow more than the small shard's limit, we switch to index into
> another small shard (we call it front end shard), meanwhile merge the just
> released small shard to next level larger shard.
>
> Since the merge can happen between two instances on different servers, we
> probably end up with synch the index files for the merging shards and then
> use solr merge.
>
> I am curious if there is anything solr provide to help on these kind of
> strategy dealing with unevenly grow big customer data (a core)? or do we
> have to write these in our software layer from scratch?
>
> thanks
> Jie
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/load-balance-with-SolrCloud-tp4018367.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>