Posted to solr-user@lucene.apache.org by Arnon Yogev <AR...@il.ibm.com> on 2015/06/14 13:31:33 UTC

Limitation on Collections Number

We're running some tests on Solr and would like to have a deeper
understanding of its limitations.

Specifically, we have tens of millions of documents (say 50M) and are
comparing several "#collections X #docs_per_collection" configurations.
For example, we could have a single collection with 50M docs, or 5000
collections with 10K docs each.
When trying to create the 5000 collections, we start getting frequent
errors after 1000-1500 collections have been created. It feels like some
limit has been reached.
These tests are done on a single node plus an additional node for replicas.

Can someone elaborate on what might limit Solr at a high number of
collections (if anything)?
That is, if we wanted to have 5K or 10K (or 100K) collections, is there
anything in Solr that would prevent it? Where would it break?
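
For illustration, a minimal SolrJ sketch of this kind of bulk creation
(we may equally be using plain HTTP; the collection names, config name,
and ZK address here are hypothetical, and exact SolrJ builder/method
names vary across Solr versions):

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.request.CollectionAdminRequest;

public class CreateManyCollections {
  public static void main(String[] args) throws Exception {
    try (CloudSolrClient client =
             new CloudSolrClient.Builder().withZkHost("localhost:2181").build()) {
      for (int i = 0; i < 5000; i++) {
        // 1 shard, replicationFactor 2: the node plus its replica node
        CollectionAdminRequest
            .createCollection("owner_" + i, "conf1", 1, 2)
            .process(client);
      }
    }
  }
}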

Thanks,
Arnon

Re: Limitation on Collections Number

Posted by Erick Erickson <er...@gmail.com>.
To my knowledge there's nothing built into Solr to limit the number
of collections. There's nothing explicitly in place to handle
many hundreds of collections either, so you're really in uncharted,
certainly untested waters. Anecdotally, we've heard of the problem
you're describing.

You say you start seeing errors. What are they? OOMs? Deadlocks?

If you are _not_ in SolrCloud, then there's the "Lots of cores" solution,
see: http://wiki.apache.org/solr/LotsOfCores. Pay attention to the
warning at the top: NOT FOR SOLRCLOUD!

Also note that the "lots of cores" option really is built for the pattern
where a particular core is searched sporadically. Indexing Dropbox
files is a good example: a user may sign on and search her documents
just a few times a day, for a few minutes at a time. Because cores
are loaded/unloaded on demand, supporting many hundreds of simultaneous
users would cause a lot of core loading/unloading and hurt performance.
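
For reference, the LotsOfCores knobs look like this (a minimal sketch
with hypothetical values; see the wiki page above for details). Each
sporadically-used core gets, in its core.properties:

name=user_core_0001
transient=true
loadOnStartup=false

and solr.xml caps how many transient cores stay loaded at once:

<solr>
  <int name="transientCacheSize">100</int>
</solr>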

Best,
Erick

Re: Limitation on Collections Number

Posted by Jack Krupansky <ja...@gmail.com>.
My answer remains the same - a large number of collections (cores) in a
single Solr instance is not one of the ways in which Solr is designed to
scale. To repeat, there are only two ways to scale Solr: number of
documents and number of nodes.

-- Jack Krupansky

Re: Limitation on Collections Number

Posted by Arnon Yogev <AR...@il.ibm.com>.
Thank you for the replies.

The shard-per-user approach is interesting. We will look into it as well.

The errors we're getting when we have ~1500 collections vary depending on
the action (restarting the server, creating a new collection, etc.).
The frequent ones are:

1. Connection refused when starting Solr (happens when Solr fails to
start):

java.net.ConnectException: Connection refused
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:806)
        at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:350)
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1068)
453171 [main-SendThread(localhost.localdomain:2181)] WARN org.apache.zookeeper.ClientCnxn - Session 0x14df5cd0f900008 for server null, unexpected error, closing socket connection and attempting reconnect


2. "Error getting leader" when starting Solr (happens when solr does 
start):
:org.apache.solr.common.SolrException: Error getting leader from zk for 
shard shard1
                 at 
org.apache.solr.cloud.ZkController.getLeader(ZkController.java:871)
                 at 
org.apache.solr.cloud.ZkController.register(ZkController.java:783)
                 at 
org.apache.solr.cloud.ZkController.register(ZkController.java:731)
                 at 
org.apache.solr.core.ZkContainer$2.run(ZkContainer.java:262)
                 at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1157)
                 at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:627)
                 at java.lang.Thread.run(Thread.java:809)
Caused by: org.apache.solr.common.SolrException: No registered leader was 
found after waiting for 1560000ms , collection: owner_234409 slice: shard1
                 at 
org.apache.solr.common.cloud.ZkStateReader.getLeaderRetry(ZkStateReader.java:531)
                 at 
org.apache.solr.common.cloud.ZkStateReader.getLeaderUrl(ZkStateReader.java:505)
                 at 
org.apache.solr.cloud.ZkController.getLeader(ZkController.java:850)
                 ... 6 more

3. "Collection already exists" (though it does not) when trying to create a
collection:

15/06/2015, 11:35:41 WARN  OverseerCollectionProcessor
OverseerCollectionProcessor.processMessage : createcollection , {

15/06/2015, 11:35:41 ERROR OverseerCollectionProcessor
Collection createcollection of createcollection failed:org.apache.solr.common.SolrException: collection already exists: owner_484011
        at org.apache.solr.cloud.OverseerCollectionProcessor.createCollection(OverseerCollectionProcessor.java:1545)
        at org.apache.solr.cloud.OverseerCollectionProcessor.processMessage(OverseerCollectionProcessor.java:385)
        at org.apache.solr.cloud.OverseerCollectionProcessor.run(OverseerCollectionProcessor.java:198)
        at java.lang.Thread.run(Thread.java:809)

Re: Limitation on Collections Number

Posted by Erick Erickson <er...@gmail.com>.
re: hybrid approach.

Hmmm, _assuming_ that no single user has a really huge number of
documents, you might be able to use a single collection (or a much
smaller group of collections) by using custom routing. That allows
you to send all the docs for a particular user to a particular shard.
There are some obvious issues here with the long-tail users: most of
your users have +/- X docs on average, and three of them have 100,000X
docs. There are probably some not-so-obvious gotchas too....

True, for user X you'd send sub-requests to all shards, but all but
one of them wouldn't find anything, so they would _probably_ be close to
no-ops. Conceptually, each shard then becomes N of your current
collections. Maybe there's a sweet spot performance-wise here where
you're hosting some number of users per shard (or an aggregate of N docs
per shard, or...).

Of course there's more maintenance here; in particular, you have to
manage the size of the shards yourself, since the possibility of them
getting lopsided is higher.
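
For reference, a minimal sketch of one way to realize this with the
compositeId router (all names here are hypothetical): the "user!doc" id
prefix co-locates a user's docs on one shard, and the _route_ parameter
limits a query to that shard.

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;

class UserRoutingSketch {
  // assumes an already-built CloudSolrClient pointed at your ZK ensemble
  static void indexAndQueryOneUser(CloudSolrClient client) throws Exception {
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "user_42!doc_1001");   // "user_42!" is the routing prefix
    doc.addField("text", "quarterly report");
    client.add("big_collection", doc);
    client.commit("big_collection");

    SolrQuery q = new SolrQuery("text:report");
    q.set("_route_", "user_42!");             // query only user_42's shard
    client.query("big_collection", q);
  }
}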

FWIW,
Erick

Re: Limitation on Collections Number

Posted by Shai Erera <se...@gmail.com>.
>
> My answer remains the same - a large number of collections (cores) in a
> single Solr instance is not one of the ways in which Solr is designed to
> scale. To repeat, there are only two ways to scale Solr, number of
> documents and number of nodes.
>

Jack, I understand that, but I still feel you're missing the point. We
didn't ask about scaling Solr at all - it's a question about indexing
strategy when you need to index multiple disparate collections of documents
-- one collection w/ a collectionID field, or a Solr collection per set of
documents.

If you are _not_ in SolrCloud, then there's the "Lots of cores" solution,
> see: http://wiki.apache.org/solr/LotsOfCores. Pay attention to the
> warning at the top: NOT FOR SOLRCLOUD!
>

Thanks Erick. We did read this a while ago. We are in SolrCloud mode because
we want to keep a replica per collection, and SolrCloud makes that easy for
us. However, we aren't in a typical SolrCloud setup, where we just need
to index 1B documents and sharding + replication come to our aid.

If we were not in a SolrCloud mode, I imagine we'd need to manage the
replicas ourselves and also index a document to both replicas manually?
That is, there is no way in _non_ SolrCloud mode to tell two cores that
they are replicas of one another - correct?

A user may sign on and search her documents
> just a few times a day, for a few minutes at a time.
>

This is almost true -- you may visit your Dropbox once an hour (or it may
be open in the background on your computer), but the server still receives
documents (e.g. shares) from other users frequently, and needs to index them
into your collection. Not saying this isn't a good fit, just mentioning that
it's not only the user who can update his/her collection, and therefore
one's collection may be constantly active. Eventually this needs to be
benchmarked.

Our benchmarks show that with 1000 such collections, we achieve significantly
better response times from the multi-collection setup (one Solr collection
per user) than from the single-collection setup (one Solr collection for *all*
users, with a collectionID field added to all documents). Our next step is
to try a hybrid mode where we store groups of users in the same
Solr collection, but not all of them in the same Solr collection. So maybe,
if Solr works well with 1000 collections, we will index 10 users in one such
collection ... we'll give it a try.
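
For reference, a minimal sketch of the grouping we have in mind (the
modulo-hash mapping and all names are assumptions, not a settled design):

// Map each user to one of numGroups Solr collections, so that roughly
// numUsers / numGroups users end up sharing a collection.
static String collectionFor(String userId, int numGroups) {
  int group = Math.floorMod(userId.hashCode(), numGroups);
  return "users_group_" + group;
}

// e.g. with 100 groups: client.add(collectionFor("user_42", 100), doc);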

I think SOLR-7191 may solve the general use case, though I haven't yet read
through it thoroughly.

Shai

Re: Limitation on Collections Number

Posted by Shalin Shekhar Mangar <sh...@gmail.com>.
Yes, there are some known problems when scaling to a large number of
collections, say 1000 or more. See
https://issues.apache.org/jira/browse/SOLR-7191

-- 
Regards,
Shalin Shekhar Mangar.

Re: Limitation on Collections Number

Posted by Shai Erera <se...@gmail.com>.
Thanks Jack for your response, but I think Arnon's question was different.

If you need to index 10,000 different collections of documents in Solr (say
a collection denotes someone's Dropbox files), then you have two options:
index all of them in one Solr collection and add a field like
collectionID to each document and query, or index each user's private
collection in a separate Solr collection.
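
For reference, a minimal SolrJ sketch of the two options (the field, user,
and collection names are all hypothetical):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrClient;

class TwoIndexingOptions {
  // Option 1: one shared collection; every query must carry an ownership filter
  static void querySharedCollection(CloudSolrClient client) throws Exception {
    SolrQuery q = new SolrQuery("body:contract");
    q.addFilterQuery("collectionID:user_42");   // mandatory on every request
    client.query("all_users", q);
  }

  // Option 2: one collection per user; the collection itself is the filter
  static void queryDedicatedCollection(CloudSolrClient client) throws Exception {
    client.query("user_42", new SolrQuery("body:contract"));
  }
}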

The pros of the latter are that you don't need to add a collectionID filter
to each query. Also, from a security/privacy standpoint (and search quality),
a user can only ever search what he has access to -- e.g. he cannot get a
spelling correction for words he never saw in his documents, nor document
suggestions (even though the 'context' in some of Lucene's suggesters allows
one to do that too). From a quality standpoint you don't mix different term
statistics, etc.

So from a single node's point of view, you can either index 100M documents
in one index (collection, shard, replica -- whatever -- a single Solr
core), or in 10,000 such cores. From a node-capacity perspective the two are
the same -- the same amount of documents will be indexed overall, the same
query workload served, etc.

So the question is purely about Solr and its collections management -- is
there anything in that process that can prevent one from managing thousands
of collections on a single node, or within a single SolrCloud instance? If
so, what is it -- is it the ZK watchers? Is there a thread per
collection at work? Something else?

Shai

Re: Limitation on Collections Number

Posted by Jack Krupansky <ja...@gmail.com>.
As a general rule, there are only two ways that Solr scales to large
numbers: a large number of documents and a moderate number of nodes (shards
and replicas). All other parameters should be kept relatively small, like
dozens or low hundreds. Even shards and replicas should probably be kept
down to that same guidance of dozens or low hundreds.

Tens of millions of documents should be no problem. I recommend 100 million
as the rough limit of documents per node. Of course, it all depends on your
particular data model, data, hardware, and network, so that number could be
smaller or larger.

The main guidance has always been to simply do a proof-of-concept
implementation to test for your particular data model and data values.

-- Jack Krupansky
