Posted to solr-user@lucene.apache.org by shushuai zhu <ss...@yahoo.com> on 2014/03/21 00:51:41 UTC

Best approach to handle large volume of documents with constantly high incoming rate?

Hi, 

I am looking for some advice on handling a large volume of documents with a very high incoming rate. Each document is about 0.5 KB, the incoming rate could be more than 20K per second, and we want to store about one year's worth of documents in Solr for near real-time searching. The goal is to achieve acceptable indexing and querying performance.

We will use techniques like soft commit, dedicated indexing servers, etc. My main question is how to structure the collections/shards/cores to achieve these goals. Since the incoming rate is very high, we do not want the incoming documents to affect the existing older indexes. One thought is to create a "latest" index to hold the incoming documents (say the latest half hour's data, about 36M docs) so queries on older data could be faster, since the old indexes are not affected. There seem to be three ways to grow the time dimension, by adding/splitting/creating one of the objects listed below every half hour:

collection
shard
core

Which is the best way to grow the time dimension? Are there limitations in that direction? Or is there a better approach?
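
To make the "collection" option concrete, here is a minimal sketch of what a half-hourly rollover could look like using the Collections API over HTTP. The collection name, shard count, replication factor, and config set below are placeholders for illustration only, not a recommendation:

import urllib.request, urllib.parse
from datetime import datetime, timezone

def create_window_collection(solr_base="http://localhost:8983/solr",
                             config_set="events_conf",
                             num_shards=4, replication_factor=1):
    # Name the collection after the half-hour window it will hold,
    # e.g. events_20140321_0030 (names and sizes are hypothetical).
    now = datetime.now(timezone.utc)
    window = "00" if now.minute < 30 else "30"
    name = now.strftime("events_%Y%m%d_%H") + window
    params = urllib.parse.urlencode({
        "action": "CREATE",
        "name": name,
        "numShards": num_shards,
        "replicationFactor": replication_factor,
        "collection.configName": config_set,
        "wt": "json",
    })
    # The Collections API CREATE call; any node in the cluster can accept it.
    with urllib.request.urlopen(solr_base + "/admin/collections?" + params) as resp:
        return name, resp.read().decode("utf-8")

A scheduler would call something like this every half hour; older collections stay untouched, and queries over the most recent data would go only to the newest collection(s).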

As an example, I am thinking about having 4 nodes with the following configuration to set up a Solr Cloud:

Memory: 128 GB
Storage: 4 TB

How should the collections/shards/cores be set up to deal with this use case?

Thanks in advance.

Shushuai 

Re: Best approach to handle large volume of documents with constantly high incoming rate?

Posted by shushuai zhu <ss...@yahoo.com>.
Jack, thanks. 

Actually, the 20K events/sec is a low-end rate we estimated. It is not necessarily tied to sensor data; when you want to centralize data from many sources, many events per second have to be handled even for a single tenant, regardless of multi-tenancy.

I have a question regarding the size of the nodes used in a Solr Cloud: what are the general pros and cons of using big versus small nodes for cases like the one I described? For example, mainly considering memory:

256 (GB) x 4
vs.
32 (GB) x 32

or a little extreme:
256 (GB) x 4
vs.
8 (GB) x 128

Is it better to set up a Solr Cloud with fewer, bigger nodes, or with more, smaller nodes? In the latter (somewhat extreme) example, multiple Solr Clouds could be considered, as Erick mentioned.

Regards.

Shushuai

 

________________________________
 From: Jack Krupansky <ja...@basetechnology.com>
To: solr-user@lucene.apache.org 
Sent: Sunday, March 23, 2014 1:03 AM
Subject: Re: Best approach to handle large volume of documents with constantly high incoming rate?
  

I defer to Erick on this level of detail and experience.

Let's continue the discussion - some of it will be a matter of how to 
configure and tune Solr, how to select, configure, and tune hardware, the 
need for further Lucene/Solr improvements, and how much further we have to 
go to get to the next level with Big Data. I mean, 20K events/sec is not 
necessarily beyond the realm of reality these days with sensor data (20K/sec 
= 1 event every 50 microseconds)

-- Jack Krupansky


-----Original Message----- 
From: Erick Erickson
Sent: Saturday, March 22, 2014 11:02 PM
To: solr-user@lucene.apache.org ; shushuai zhu
Subject: Re: Best approach to handle large volume of documents with 
constantly high incoming rate?

Well, the "commonsense limits" Jack is referring to in that post are
more (IMO) scales you should count on having to do some _serious_
prototyping/configuring/etc. As you scale out, you'll run into edge
cases that aren't the common variety, aren't reliably tested every
night, etc. I mean how would you set up a test bed that had 1,000
nodes? Sure, it can be done, but nobody's volunteered yet to provide
the Apache Solr project that much hardware. I suspect that it would
make Uwe's week if someone did though.

In the "practical limit" vein, one example: You'll run up against "the
laggard problem". Let's assume that you successfully put up 2,000
nodes, for simplicity's sake, no replicas, just leaders and they all
stay up all the time. To successfully do a search, you need to send
out a request to all 2,000 nodes. The chance that one of them is slow
for _any_ reason (GC, high CPU load, it's just tired) increases the
more nodes you have. And since you have to wait until the slowest node
responds, your query rate will suffer correspondingly.

I've seen 4 node clusters handle 5,000 docs/sec update rate FWIW. YMMV
of course.

However, you say "...dedicated indexing servers...". There's no such
thing in SolrCloud. Every document gets sent to every member of the
slice it belongs to. How else could NRT be supported? When I saw that
comment I wondered how well you understand SolrCloud. I flat guarantee
you'll understand SolrCloud really, really well if you try to scale as
you indicate :). There'll be a whole bunch of "learning experiences"
along the way, some will be painful. I guarantee that too.

Responding to your points

1) Yes, no, and maybe. For relatively small docs on relatively modern
hardware, it's a good place to start. Then you have to push it until
it falls over to determine your _real_ rates. See:
http://searchhub.org/dev/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/

2) Nobody knows. There's no theoretical reason why SolrCloud
shouldn't; no a-priori hard limits. I _strongly_ suspect you'll be on
the bleeding edge of size, though. Expect some things to be "learning
experiences".

3) No, it doesn't mean that at all. 64 is an arbitrary number that
means, IMO, "here there be dragons". As you start to scale out beyond
this you'll run into pesky issues I expect. Your network won't be as
reliable as you think. You'll find one of your VMs (which I expect
you'll be running on) has some glitches. Someone loaded a very CPU
intensive program on three of your machines and your Solrs on those
machines are being starved. Etc.

4) I've personally seen 1,000 node clusters. You ought to see the very
cool SolrCloud admin graph I recently saw... But I expect you'll
actually be in for some kind of divide-and-conquer strategy whereby
you have a bunch of clusters that are significantly smaller. You
could, for instance, determine that the use-case you support is
searching across small ranges, say a week at a time and have 52
clusters of 128 machines or so. You could have 365 clusters of < 20
machines. It all depends on how the index will be used.

5) Not at all. See above, I've seen 5K/sec on 4 nodes, also supporting
simultaneous searching.

6) N/A

I really can't give you advice. You haven't, for instance, said
anything about searches. What kind of SLA are you aiming for? What
kind of queries? Faceting? Grouping? I can change the memory footprint
of Solr by firing off really ugly queries. In essence, you absolutely
must prototype, see the link above. And do a lot of homework defining
how you will search the corpus. Otherwise you're guessing. But at this
kind of scale, expect to do something other than throw all the docs at
a 64 node cluster and expect it to just work. It'll be a lot of work.

On Sat, Mar 22, 2014 at 6:48 PM, shushuai zhu <ss...@yahoo.com> wrote:
> Jack, thanks for your reply.
>
> Sorry for the confusion about 4 nodes. What I meant was to use 4 nodes to 
> do some POC, mainly focusing on handling the high incoming rate in a few 
> days instead of storing data over one year.
>
> You estimated the required nodes (6,308) and storage (322TB) based on the 
> incoming rate and doc size. I have a few questions regarding to them:
>
> 1) Is "100 million docs/node" some general capacity guideline for a Solr 
> node?
> 2) Assuming we can provide 6,308 nodes, can Solr Cloud really scale to 
> that level? I found you indicated some "common sense limits" of Solr 
> Cluster size of 64 nodes in the following mail thread 
> http://find.searchhub.org/document/d823643e65fe2015#84f0c89df2426990
> 3) If 64 nodes are something we know Solr Cloud can scale up to, then does 
> it mean I can only be sure that 1% of the mentioned workload can be handled 
> by Solr Cloud? (64 is about 1% of 6,308 nodes)
> 4) The above mentioned "Solr Limitations" mail thread did mention some 
> cluster with 512 nodes but not really verified whether it worked or not; 
> assuming it worked, it just means we may be able to handle a little less 
> than 10% of the desired workload.
> 5) Given above simple deduction, it seems 2K docs/sec (10% of the 
> mentioned incoming rate) is the practical limitation of Solr Cloud we can 
> guess for our use case?
> 6) If the incoming rate is controlled to be around 1k or 2k docs/sec and 
> we want to use Solr Cluster with 64 nodes (or more if it still works), 
> what kind of collection/shard/core structure should be?
>
> I am more looking for architectural advice regarding to Solr Cloud 
> structure to handle high incoming rate of relatively small docs.
>
> Regards.
>
> Shushuai
>
>
>
> On Saturday, March 22, 2014 2:17 PM, Jack Krupansky 
> <ja...@basetechnology.com> wrote:
>
> 20K docs/sec = 20,000 * 60 * 60 * 24 = 1,728,000,000 = 1.7 billion 
> docs/day
> * 365 = 630,720,000,000 = 631 billion docs/yr
>
> At 100 million docs/node = 6,308 nodes!
>
> And you think you can do it with 4 nodes?
>
> Oh, and that's before replication!
>
> 0.5K/doc * 631 billion docs = 322 TB.
>
> -- Jack Krupansky
>
>
> -----Original Message-----
> From: shushuai zhu
> Sent: Saturday, March 22, 2014 11:32 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Best approach to handle large volume of documents with
> constantly high incoming rate?
>
> Any thoughts? Can Solr Cloud support such use case with acceptable
> performance?
>
>
>
> On Thursday, March 20, 2014 7:51 PM, shushuai zhu <ss...@yahoo.com> wrote:
>
> Hi,
>
> I am looking for some advice to handle large volume of documents with a 
> very
> high incoming rate. The size of each document is about 0.5 KB and the
> incoming rate could be more than 20K per second and we want to store about
> one year's documents in Solr for near real-time searching. The goal is to
> achieve acceptable indexing and querying performance.
>
> We will use techniques like soft commit, dedicated indexing servers, etc. 
> My
> main question is about how to structure the collection/shard/core to 
> achieve
> the goals. Since the incoming rate is very high, we do not want the 
> incoming
> documents to affect the existing older indexes. Some thoughts are to 
> create
> a latest index to hold the incoming documents (say latest half hour's 
> data,
> about 36M docs) so queries on older data could be faster since the old
> indexes are not affected. There seem three ways to grow the time dimension
> by adding/splitting/creating a new object listed below every half hour:
>
> collection
> shard
> core
>
> Which is the best way to grow the time dimension? Any limitation in that
> direction? Or there is some better approach?
>
> As an example, I am thinking about having 4 nodes with the following
> configuration to setup a Solr Cloud:
>
> Memory: 128 GB
> Storage: 4 TB
>
> How to set the collection/shard/core to deal with the use case?
>
> Thanks in advance.
>
> Shushuai 

Re: Best approach to handle large volume of documents with constantly high incoming rate?

Posted by Jack Krupansky <ja...@basetechnology.com>.
I defer to Erick on this level of detail and experience.

Let's continue the discussion - some of it will be a matter of how to 
configure and tune Solr, how to select, configure, and tune hardware, the 
need for further Lucene/Solr improvements, and how much further we have to 
go to get to the next level with Big Data. I mean, 20K events/sec is not 
necessarily beyond the realm of reality these days with sensor data (20K/sec 
= 1 event every 50 microseconds).

-- Jack Krupansky

-----Original Message----- 
From: Erick Erickson
Sent: Saturday, March 22, 2014 11:02 PM
To: solr-user@lucene.apache.org ; shushuai zhu
Subject: Re: Best approach to handle large volume of documents with 
constantly high incoming rate?

Well, the "commonsense limits" Jack is referring to in that post are
more (IMO) scales you should count on having to do some _serious_
prototyping/configuring/etc. As you scale out, you'll run into edge
cases that aren't the common variety, aren't reliably tested every
night, etc. I mean how would you set up a test bed that had 1,000
nodes? Sure, it can be done, but nobody's volunteered yet to provide
the Apache Solr project that much hardware. I suspect that it would
make Uwe's week if someone did though.

In the "practical limit" vein, one example: You'll run up against "the
laggard problem". Let's assume that you successfully put up 2,000
nodes, for simplicity's sake, no replicas, just leaders and they all
stay up all the time. To successfully do a search, you need to send
out a request to all 2,000 nodes. The chance that one of them is slow
for _any_ reason (GC, high CPU load, it's just tired) increases the
more nodes you have. And since you have to wait until the slowest node
responds, your query rate will suffer correspondingly.

I've seen 4 node clusters handle 5,000 docs/sec update rate FWIW. YMMV
of course.

However, you say "...dedicated indexing servers...". There's no such
thing in SolrCloud. Every document gets sent to every member of the
slice it belongs to. How else could NRT be supported? When I saw that
comment I wondered how well you understand SolrCloud. I flat guarantee
you'll understand SolrCloud really, really well if you try to scale as
you indicate :). There'll be a whole bunch of "learning experiences"
along the way, some will be painful. I guarantee that too.

Responding to your points

1) Yes, no, and maybe. For relatively small docs on relatively modern
hardware, it's a good place to start. Then you have to push it until
it falls over to determine your _real_ rates. See:
http://searchhub.org/dev/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/

2) Nobody knows. There's no theoretical reason why SolrCloud
shouldn't; no a-priori hard limits. I _strongly_ suspect you'll be on
the bleeding edge of size, though. Expect some things to be "learning
experiences".

3) No, it doesn't mean that at all. 64 is an arbitrary number that
means, IMO, "here there be dragons". As you start to scale out beyond
this you'll run into pesky issues I expect. Your network won't be as
reliable as you think. You'll find one of your VMs (which I expect
you'll be running on) has some glitches. Someone loaded a very CPU
intensive program on three of your machines and your Solrs on those
machines are being starved. Etc.

4) I've personally seen 1,000 node clusters. You ought to see the very
cool SolrCloud admin graph I recently saw... But I expect you'll
actually be in for some kind of divide-and-conquer strategy whereby
you have a bunch of clusters that are significantly smaller. You
could, for instance, determine that the use-case you support is
searching across small ranges, say a week at a time and have 52
clusters of 128 machines or so. You could have 365 clusters of < 20
machines. It all depends on how the index will be used.

5) Not at all. See above, I've seen 5K/sec on 4 nodes, also supporting
simultaneous searching.

6) N/A

I really can't give you advice. You haven't, for instance, said
anything about searches. What kind of SLA are you aiming for? What
kind of queries? Faceting? Grouping? I can change the memory footprint
of Solr by firing off really ugly queries. In essence, you absolutely
must prototype, see the link above. And do a lot of homework defining
how you will search the corpus. Otherwise you're guessing. But at this
kind of scale, expect to do something other than throw all the docs at
a 64 node cluster and expect it to just work. It'll be a lot of work.

On Sat, Mar 22, 2014 at 6:48 PM, shushuai zhu <ss...@yahoo.com> wrote:
> Jack, thanks for your reply.
>
> Sorry for the confusion about 4 nodes. What I meant was to use 4 nodes to 
> do some POC, mainly focusing on handling the high incoming rate in a few 
> days instead of storing data over one year.
>
> You estimated the required nodes (6,308) and storage (322TB) based on the 
> incoming rate and doc size. I have a few questions regarding to them:
>
> 1) Is "100 million docs/node" some general capacity guideline for a Solr 
> node?
> 2) Assuming we can provide 6,308 nodes, can Solr Cloud really scale to 
> that level? I found you indicated some "common sense limits" of Solr 
> Cluster size of 64 nodes in the following mail thread 
> http://find.searchhub.org/document/d823643e65fe2015#84f0c89df2426990
> 3) If 64 nodes are something we know Solr Cloud can scale up to, then does 
> it mean I can only be sure that 1% of the mentioned workload can be handled 
> by Solr Cloud? (64 is about 1% of 6,308 nodes)
> 4) The above mentioned "Solr Limitations" mail thread did mention some 
> cluster with 512 nodes but not really verified whether it worked or not; 
> assuming it worked, it just means we may be able to handle a little less 
> than 10% of the desired workload.
> 5) Given above simple deduction, it seems 2K docs/sec (10% of the 
> mentioned incoming rate) is the practical limitation of Solr Cloud we can 
> guess for our use case?
> 6) If the incoming rate is controlled to be around 1k or 2k docs/sec and 
> we want to use Solr Cluster with 64 nodes (or more if it still works), 
> what kind of collection/shard/core structure should be?
>
> I am more looking for architectural advice regarding to Solr Cloud 
> structure to handle high incoming rate of relatively small docs.
>
> Regards.
>
> Shushuai
>
>
>
> On Saturday, March 22, 2014 2:17 PM, Jack Krupansky 
> <ja...@basetechnology.com> wrote:
>
> 20K docs/sec = 20,000 * 60 * 60 * 24 = 1,728,000,000 = 1.7 billion 
> docs/day
> * 365 = 630,720,000,000 = 631 billion docs/yr
>
> At 100 million docs/node = 6,308 nodes!
>
> And you think you can do it with 4 nodes?
>
> Oh, and that's before replication!
>
> 0.5K/doc * 631 billion docs = 322 TB.
>
> -- Jack Krupansky
>
>
> -----Original Message-----
> From: shushuai zhu
> Sent: Saturday, March 22, 2014 11:32 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Best approach to handle large volume of documents with
> constantly high incoming rate?
>
> Any thoughts? Can Solr Cloud support such use case with acceptable
> performance?
>
>
>
> On Thursday, March 20, 2014 7:51 PM, shushuai zhu <ss...@yahoo.com> wrote:
>
> Hi,
>
> I am looking for some advice to handle large volume of documents with a 
> very
> high incoming rate. The size of each document is about 0.5 KB and the
> incoming rate could be more than 20K per second and we want to store about
> one year's documents in Solr for near real-time searching. The goal is to
> achieve acceptable indexing and querying performance.
>
> We will use techniques like soft commit, dedicated indexing servers, etc. 
> My
> main question is about how to structure the collection/shard/core to 
> achieve
> the goals. Since the incoming rate is very high, we do not want the 
> incoming
> documents to affect the existing older indexes. Some thoughts are to 
> create
> a latest index to hold the incoming documents (say latest half hour's 
> data,
> about 36M docs) so queries on older data could be faster since the old
> indexes are not affected. There seem three ways to grow the time dimension
> by adding/splitting/creating a new object listed below every half hour:
>
> collection
> shard
> core
>
> Which is the best way to grow the time dimension? Any limitation in that
> direction? Or there is some better approach?
>
> As an example, I am thinking about having 4 nodes with the following
> configuration to setup a Solr Cloud:
>
> Memory: 128 GB
> Storage: 4 TB
>
> How to set the collection/shard/core to deal with the use case?
>
> Thanks in advance.
>
> Shushuai 


Re: Best approach to handle large volume of documents with constantly high incoming rate?

Posted by shushuai zhu <ss...@yahoo.com>.
Erick,

Thanks a lot for the detailed answers. They are very helpful and gave me some good ideas.

As for our searches, we will mainly do term and field (AND/OR) searches, histograms, and faceting. Generally the queries are bounded by time (e.g., last hour, last day, last week, or even last month). The queries are not complicated, but the challenge is to support near real-time queries under a high incoming rate.

I have to admit that I just started looking into Solr, although I used ElasticSearch for a little while. By default, ElasticSearch creates daily indices to scale along the time dimension, which is why I asked whether Solr could be customized to do something similar. I did notice the following slides talking about scaling over time via multiple collections:

http://www.slideshare.net/sematext/solr-for-indexing-and-searching-logs

but I also found discussions about the limited number of collections Solr can support (not more than 1,000, mainly due to ZooKeeper's 1 MB znode limit?). So I feel it might be better to scale over time via multiple shards or cores (Solr has the lotsOfCores feature).
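
For what it's worth, if the data were split into per-window collections, a time-bounded faceted query could still hit several of them in one request by listing them in the collection parameter (or by querying through a collection alias). A rough sketch with hypothetical collection and field names:

import json, urllib.request, urllib.parse

def search_last_day(solr_base="http://localhost:8983/solr"):
    # Hypothetical layout: one collection per window, with fields
    # "message", "level", "host", and "timestamp"; adjust to the real schema.
    params = urllib.parse.urlencode({
        "q": "message:error AND level:WARN",
        "fq": "timestamp:[NOW-1DAY TO NOW]",               # bound the query by time
        "collection": "events_20140321,events_20140322",   # or a collection alias
        "facet": "true",
        "facet.field": "host",
        "rows": 10,
        "wt": "json",
    })
    url = solr_base + "/events_20140322/select?" + params
    with urllib.request.urlopen(url) as resp:
        return json.loads(resp.read().decode("utf-8"))

A collection alias (the CREATEALIAS action in the Collections API) can hide the growing list of window collections from the query side.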

I would very much appreciate any further architectural advice for this use case.

Regards.

Shushuai 



On Saturday, March 22, 2014 11:10 PM, Erick Erickson <er...@gmail.com> wrote:
  
Well, the "commonsense limits" Jack is referring to in that post are
more (IMO) scales you should count on having to do some _serious_
prototyping/configuring/etc. As you scale out, you'll run into edge
cases that aren't the common variety, aren't reliably tested every
night, etc. I mean how would you set up a test bed that had 1,000
nodes? Sure, it can be done, but nobody's volunteered yet to provide
the Apache Solr project that much hardware. I suspect that it would
make Uwe's week if someone did though.

In the "practical limit" vein, one example: You'll run up against "the
laggard problem". Let's assume that you successfully put up 2,000
nodes, for simplicity's sake, no replicas, just leaders and they all
stay up all the time. To successfully do a search, you need to send
out a request to all 2,000 nodes. The chance that one of them is slow
for _any_ reason (GC, high CPU load, it's just tired) increases the
more nodes you have. And since you have to wait until the slowest node
responds, your query rate will suffer correspondingly.

I've seen 4 node clusters handle 5,000 docs/sec update rate FWIW. YMMV
of course.

However, you say "...dedicated indexing servers...". There's no such
thing in SolrCloud. Every document gets sent to every member of the
slice it belongs to. How else could NRT be supported? When I saw that
comment I wondered how well you understand SolrCloud. I flat guarantee
you'll understand SolrCloud really, really well if you try to scale as
you indicate :). There'll be a whole bunch of "learning experiences"
along the way, some will be painful. I guarantee that too.

Responding to your points

1) Yes, no, and maybe. For relatively small docs on relatively modern
hardware, it's a good place to start. Then you have to push it until
it falls over to determine your _real_ rates. See:
http://searchhub.org/dev/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/

2) Nobody knows. There's no theoretical reason why SolrCloud
shouldn't; no a-priori hard limits. I _strongly_ suspect you'll be on
the bleeding edge of size, though. Expect some things to be "learning
experiences".

3) No, it doesn't mean that at all. 64 is an arbitrary number that
means, IMO, "here there be dragons". As you start to scale out beyond
this you'll run into pesky issues I expect. Your network won't be as
reliable as you think. You'll find one of your VMs (which I expect
you'll be running on) has some glitches. Someone loaded a very CPU
intensive program on three of your machines and your Solrs on those
machines are being starved. Etc.

4) I've personally seen 1,000 node clusters. You ought to see the very
cool SolrCloud admin graph I recently saw... But I expect you'll
actually be in for some kind of divide-and-conquer strategy whereby
you have a bunch of clusters that are significantly smaller. You
could, for instance, determine that the use-case you support is
searching across small ranges, say a week at a time and have 52
clusters of 128 machines or so. You could have 365 clusters of < 20
machines. It all depends on how the index will be used.

5) Not at all. See above, I've seen 5K/sec on 4 nodes, also supporting
simultaneous searching.

6) N/A

I really can't give you advice. You haven't, for instance, said
anything about searches. What kind of SLA are you aiming for? What
kind of queries? Faceting? Grouping? I can change the memory footprint
of Solr by firing off really ugly queries. In essence, you absolutely
must prototype, see the link above. And do a lot of homework defining
how you will search the corpus. Otherwise you're guessing. But at this
kind of scale, expect to do something other than throw all the docs at
a 64 node cluster and expect it to just work. It'll be a lot of work.


On Sat, Mar 22, 2014 at 6:48 PM, shushuai zhu <ss...@yahoo.com> wrote:
> Jack, thanks for your reply.
>
> Sorry for the confusion about 4 nodes. What I meant was to use 4 nodes to do some POC, mainly focusing on handling the high incoming rate in a few days instead of storing data over one year.
>
> You estimated the required nodes (6,308) and storage (322TB) based on the incoming rate and doc size. I have a few questions regarding to them:
>
> 1) Is "100 million docs/node" some general capacity guideline for a Solr node?
> 2) Assuming we can provide 6,308 nodes, can Solr Cloud really scale to that level? I found you indicated some "common sense limits" of Solr Cluster size of 64 nodes in the following mail thread http://find.searchhub.org/document/d823643e65fe2015#84f0c89df2426990
> 3) If 64 nodes are something we know Solr Cloud can scale up to, then does it mean I can only be sure that 1% of the mentioned workload can be handled by Solr Cloud? (64 is about 1% of 6,308 nodes)
> 4) The above mentioned "Solr Limitations" mail thread did mention some cluster with 512 nodes but not really verified whether it worked or not; assuming it worked, it just means we may be able to handle a little less than 10% of the desired workload.
> 5) Given above simple deduction, it seems 2K docs/sec (10% of the mentioned incoming rate) is the practical limitation of Solr Cloud we can guess for our use case?
> 6) If the incoming rate is controlled to be around 1k or 2k docs/sec and we want to use Solr Cluster with 64 nodes (or more if it still works), what kind of collection/shard/core structure should be?
>
> I am more looking for architectural advice regarding to Solr Cloud structure to handle high incoming rate of relatively small docs.
>
> Regards.
>
> Shushuai
>
>
>
> On Saturday, March 22, 2014 2:17 PM, Jack Krupansky <ja...@basetechnology.com> wrote:
>
> 20K docs/sec = 20,000 * 60 * 60 * 24 = 1,728,000,000 = 1.7 billion docs/day
> * 365 = 630,720,000,000 = 631 billion docs/yr
>
> At 100 million docs/node = 6,308 nodes!
>
> And you think you can do it with 4 nodes?
>
> Oh, and that's before replication!
>
> 0.5K/doc * 631 billion docs = 322 TB.
>
> -- Jack Krupansky
>
>
> -----Original Message-----
> From: shushuai zhu
> Sent: Saturday, March 22, 2014 11:32 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Best approach to handle large volume of documents with
> constantly high incoming rate?
>
> Any thoughts? Can Solr Cloud support such use case with acceptable
> performance?
>
>
>
> On Thursday, March 20, 2014 7:51 PM, shushuai zhu <ss...@yahoo.com> wrote:
>
> Hi,
>
> I am looking for some advice to handle large volume of documents with a very
> high incoming rate. The size of each document is about 0.5 KB and the
> incoming rate could be more than 20K per second and we want to store about
> one year's documents in Solr for near real=time searching. The goal is to
> achieve acceptable indexing and querying performance.
>
> We will use techniques like soft commit, dedicated indexing servers, etc. My
> main question is about how to structure the collection/shard/core to achieve
> the goals. Since the incoming rate is very high, we do not want the incoming
> documents to affect the existing older indexes. Some thoughts are to create
> a latest index to hold the incoming documents (say latest half hour's data,
> about 36M docs) so queries on older data could be faster since the old
> indexes are not affected. There seem three ways to grow the time dimension
> by adding/splitting/creating a new object listed below every half hour:
>
> collection
> shard
> core
>
> Which is the best way to grow the time dimension? Any limitation in that
> direction? Or there is some better approach?
>
> As an example, I am thinking about having 4 nodes with the following
> configuration to setup a Solr Cloud:
>
> Memory: 128 GB
> Storage: 4 TB
>
> How to set the collection/shard/core to deal with the use case?
>
> Thanks in advance.
>
> Shushuai

Re: Best approach to handle large volume of documents with constantly high incoming rate?

Posted by Erick Erickson <er...@gmail.com>.
Well, the "commonsense limits" Jack is referring to in that post are
more (IMO) scales you should count on having to do some _serious_
prototyping/configuring/etc. As you scale out, you'll run into edge
cases that aren't the common variety, aren't reliably tested every
night, etc. I mean how would you set up a test bed that had 1,000
nodes? Sure, it can be done, but nobody's volunteered yet to provide
the Apache Solr project that much hardware. I suspect that it would
make Uwe's week if someone did though.

In the "practical limit" vein, one example: You'll run up against "the
laggard problem". Let's assume that you successfully put up 2,000
nodes, for simplicity's sake, no replicas, just leaders and they all
stay up all the time. To successfully do a search, you need to send
out a request to all 2,000 nodes. The chance that one of them is slow
for _any_ reason (GC, high CPU load, it's just tired) increases the
more nodes you have. And since you have to wait until the slowest node
responds, your query rate will suffer correspondingly.
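
The effect is easy to quantify: if each node independently has some small chance p of being slow at any moment, the chance that at least one of N nodes is slow on a given request is 1 - (1 - p)^N, which races toward 1 as N grows. A quick illustration (the per-node probability is made up):

# Chance that at least one of N shards responds slowly, assuming each node
# is independently "slow" with probability p (the 1% figure is invented).
p = 0.01
for n in (4, 64, 512, 2000):
    print(n, round(1 - (1 - p) ** n, 3))
# 4 -> 0.039, 64 -> 0.474, 512 -> 0.994, 2000 -> ~1.0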

I've seen 4 node clusters handle 5,000 docs/sec update rate FWIW. YMMV
of course.

However, you say "...dedicated indexing servers...". There's no such
thing in SolrCloud. Every document gets sent to every member of the
slice it belongs to. How else could NRT be supported? When I saw that
comment I wondered how well you understand SolrCloud. I flat guarantee
you'll understand SolrCloud really, really well if you try to scale as
you indicate :). There'll be a whole bunch of "learning experiences"
along the way, some will be painful. I guarantee that too.
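
To illustrate: an indexing client just batches documents and posts them to any node (or a load balancer in front of the cluster); with the default routing, SolrCloud hashes each document's id to a shard and forwards the doc to that shard's leader, which in turn sends it to the replicas, so every node hosting that slice does indexing work. A minimal sketch of such a feeder, with made-up host, collection, and field names, and a commitWithin so documents become searchable via soft commits:

import json, urllib.request

def send_batch(docs, node="http://solr-node1:8983/solr", collection="events_current"):
    # Post a JSON batch to /update; commitWithin (ms) asks Solr to make the
    # docs visible soon without an explicit commit per batch.
    url = node + "/" + collection + "/update?commitWithin=10000"
    body = json.dumps(docs).encode("utf-8")
    req = urllib.request.Request(url, data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8")

# Example: one 1,000-doc batch of ~0.5 KB events (field names are hypothetical).
batch = [{"id": "evt-%d" % i, "timestamp": "2014-03-21T00:00:00Z",
          "message": "sample event payload"} for i in range(1000)]
send_batch(batch)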

Responding to your points

1) Yes, no, and maybe. For relatively small docs on relatively modern
hardware, it's a good place to start. Then you have to push it until
it falls over to determine your _real_ rates. See:
http://searchhub.org/dev/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/

2) Nobody knows. There's no theoretical reason why SolrCloud
shouldn't; no a-priori hard limits. I _strongly_ suspect you'll be on
the bleeding edge of size, though. Expect some things to be "learning
experiences".

3) No, it doesn't mean that at all. 64 is an arbitrary number that
means, IMO, "here there be dragons". As you start to scale out beyond
this you'll run into pesky issues I expect. Your network won't be as
reliable as you think. You'll find one of your VMs (which I expect
you'll be running on) has some glitches. Someone loaded a very CPU
intensive program on three of your machines and your Solrs on those
machines are being starved. Etc.

4) I've personally seen 1,000 node clusters. You ought to see the very
cool SolrCloud admin graph I recently saw... But I expect you'll
actually be in for some kind of divide-and-conquer strategy whereby
you have a bunch of clusters that are significantly smaller. You
could, for instance, determine that the use-case you support is
searching across small ranges, say a week at a time and have 52
clusters of 128 machines or so. You could have 365 clusters of < 20
machines. It all depends on how the index will be used.

5) Not at all. See above, I've seen 5K/sec on 4 nodes, also supporting
simultaneous searching.

6) N/A

I really can't give you advice. You haven't, for instance, said
anything about searches. What kind of SLA are you aiming for? What
kind of queries? Faceting? Grouping? I can change the memory footprint
of Solr by firing off really ugly queries. In essence, you absolutely
must prototype, see the link above. And do a lot of homework defining
how you will search the corpus. Otherwise you're guessing. But at this
kind of scale, expect to do something other than throw all the docs at
a 64 node cluster and expect it to just work. It'll be a lot of work.

On Sat, Mar 22, 2014 at 6:48 PM, shushuai zhu <ss...@yahoo.com> wrote:
> Jack, thanks for your reply.
>
> Sorry for the confusion about 4 nodes. What I meant was to use 4 nodes to do some POC, mainly focusing on handling the high incoming rate in a few days instead of storing data over one year.
>
> You estimated the required nodes (6,308) and storage (322TB) based on the incoming rate and doc size. I have a few questions regarding to them:
>
> 1) Is "100 million docs/node" some general capacity guideline for a Solr node?
> 2) Assuming we can provide 6,308 nodes, can Solr Cloud really scale to that level? I found you indicated some "common sense limits" of Solr Cluster size of 64 nodes in the following mail thread http://find.searchhub.org/document/d823643e65fe2015#84f0c89df2426990
> 3) If 64 nodes are something we know Solr Cloud can scale up to, then does it mean I can only be sure that 1% of the mentioned workload can be handled by Solr Cloud? (64 is about 1% of 6,308 nodes)
> 4) The above mentioned "Solr Limitations" mail thread did mention some cluster with 512 nodes but not really verified whether it worked or not; assuming it worked, it just means we may be able to handle a little less than 10% of the desired workload.
> 5) Given above simple deduction, it seems 2K docs/sec (10% of the mentioned incoming rate) is the practical limitation of Solr Cloud we can guess for our use case?
> 6) If the incoming rate is controlled to be around 1k or 2k docs/sec and we want to use Solr Cluster with 64 nodes (or more if it still works), what kind of collection/shard/core structure should be?
>
> I am more looking for architectural advice regarding to Solr Cloud structure to handle high incoming rate of relatively small docs.
>
> Regards.
>
> Shushuai
>
>
>
> On Saturday, March 22, 2014 2:17 PM, Jack Krupansky <ja...@basetechnology.com> wrote:
>
> 20K docs/sec = 20,000 * 60 * 60 * 24 = 1,728,000,000 = 1.7 billion docs/day
> * 365 = 630,720,000,000 = 631 billion docs/yr
>
> At 100 million docs/node = 6,308 nodes!
>
> And you think you can do it with 4 nodes?
>
> Oh, and that's before replication!
>
> 0.5K/doc * 631 billion docs = 322 TB.
>
> -- Jack Krupansky
>
>
> -----Original Message-----
> From: shushuai zhu
> Sent: Saturday, March 22, 2014 11:32 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Best approach to handle large volume of documents with
> constantly high incoming rate?
>
> Any thoughts? Can Solr Cloud support such use case with acceptable
> performance?
>
>
>
> On Thursday, March 20, 2014 7:51 PM, shushuai zhu <ss...@yahoo.com> wrote:
>
> Hi,
>
> I am looking for some advice to handle large volume of documents with a very
> high incoming rate. The size of each document is about 0.5 KB and the
> incoming rate could be more than 20K per second and we want to store about
> one year's documents in Solr for near real-time searching. The goal is to
> achieve acceptable indexing and querying performance.
>
> We will use techniques like soft commit, dedicated indexing servers, etc. My
> main question is about how to structure the collection/shard/core to achieve
> the goals. Since the incoming rate is very high, we do not want the incoming
> documents to affect the existing older indexes. Some thoughts are to create
> a latest index to hold the incoming documents (say latest half hour's data,
> about 36M docs) so queries on older data could be faster since the old
> indexes are not affected. There seem three ways to grow the time dimension
> by adding/splitting/creating a new object listed below every half hour:
>
> collection
> shard
> core
>
> Which is the best way to grow the time dimension? Any limitation in that
> direction? Or there is some better approach?
>
> As an example, I am thinking about having 4 nodes with the following
> configuration to setup a Solr Cloud:
>
> Memory: 128 GB
> Storage: 4 TB
>
> How to set the collection/shard/core to deal with the use case?
>
> Thanks in advance.
>
> Shushuai

Re: Best approach to handle large volume of documents with constantly high incoming rate?

Posted by shushuai zhu <ss...@yahoo.com>.
Jack, thanks for your reply.

Sorry for the confusion about the 4 nodes. What I meant was to use 4 nodes to do some POC, mainly focusing on handling the high incoming rate over a few days rather than storing data for a whole year.

You estimated the required nodes (6,308) and storage (322 TB) based on the incoming rate and doc size. I have a few questions regarding them:

1) Is "100 million docs/node" some general capacity guideline for a Solr node?
2) Assuming we can provide 6,308 nodes, can Solr Cloud really scale to that level? I found you indicated a "common sense limit" on Solr cluster size of 64 nodes in the following mail thread: http://find.searchhub.org/document/d823643e65fe2015#84f0c89df2426990
3) If 64 nodes are something we know Solr Cloud can scale up to, does that mean I can only be sure that 1% of the mentioned workload can be handled by Solr Cloud? (64 is about 1% of 6,308 nodes)
4) The above-mentioned "Solr Limitations" mail thread did mention a cluster with 512 nodes, but whether it worked was not really verified; assuming it did, that just means we may be able to handle a little less than 10% of the desired workload.
5) Given the simple deduction above, it seems 2K docs/sec (10% of the mentioned incoming rate) is the practical limit of Solr Cloud we can assume for our use case?
6) If the incoming rate is controlled to be around 1K or 2K docs/sec and we want to use a Solr cluster with 64 nodes (or more if that still works), what should the collection/shard/core structure be?

I am mainly looking for architectural advice on a Solr Cloud structure that can handle a high incoming rate of relatively small docs.

Regards.

Shushuai 



On Saturday, March 22, 2014 2:17 PM, Jack Krupansky <ja...@basetechnology.com> wrote:
  
20K docs/sec = 20,000 * 60 * 60 * 24 = 1,728,000,000 = 1.7 billion docs/day 
* 365 = 630,720,000,000 = 631 billion docs/yr

At 100 million docs/node = 6,308 nodes!

And you think you can do it with 4 nodes?

Oh, and that's before replication!

0.5K/doc * 631 billion docs = 322 TB.

-- Jack Krupansky


-----Original Message----- 
From: shushuai zhu
Sent: Saturday, March 22, 2014 11:32 AM
To: solr-user@lucene.apache.org
Subject: Re: Best approach to handle large volume of documents with 
constantly high incoming rate?

Any thoughts? Can Solr Cloud support such use case with acceptable 
performance?



On Thursday, March 20, 2014 7:51 PM, shushuai zhu <ss...@yahoo.com> wrote:

Hi,

I am looking for some advice to handle large volume of documents with a very 
high incoming rate. The size of each document is about 0.5 KB and the 
incoming rate could be more than 20K per second and we want to store about 
one year's documents in Solr for near real-time searching. The goal is to 
achieve acceptable indexing and querying performance.

We will use techniques like soft commit, dedicated indexing servers, etc. My 
main question is about how to structure the collection/shard/core to achieve 
the goals. Since the incoming rate is very high, we do not want the incoming 
documents to affect the existing older indexes. Some thoughts are to create 
a latest index to hold the incoming documents (say latest half hour's data, 
about 36M docs) so queries on older data could be faster since the old 
indexes are not affected. There seem three ways to grow the time dimension 
by adding/splitting/creating a new object listed below every half hour:

collection
shard
core

Which is the best way to grow the time dimension? Any limitation in that 
direction? Or there is some better approach?

As an example, I am thinking about having 4 nodes with the following 
configuration to setup a Solr Cloud:

Memory: 128 GB
Storage: 4 TB

How to set the collection/shard/core to deal with the use case?

Thanks in advance.

Shushuai 

Re: Best approach to handle large volume of documents with constantly high incoming rate?

Posted by Jack Krupansky <ja...@basetechnology.com>.
20K docs/sec = 20,000 * 60 * 60 * 24 = 1,728,000,000 = 1.7 billion docs/day 
* 365 = 630,720,000,000 = 631 billion docs/yr

At 100 million docs/node = 6,308 nodes!

And you think you can do it with 4 nodes?

Oh, and that's before replication!

0.5K/doc * 631 billion docs = 322 TB.
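
For reference, those figures can be reproduced with a few lines (assuming 0.5 KB = 512 bytes per doc and the 100 million docs/node rule of thumb used above):

import math

rate = 20000                                    # docs per second
docs_per_day = rate * 60 * 60 * 24              # 1,728,000,000
docs_per_year = docs_per_day * 365              # 630,720,000,000
nodes = math.ceil(docs_per_year / 100e6)        # 6,308 nodes at 100M docs/node
storage_tb = docs_per_year * 512 / 1e12         # ~322 TB, before replication
print(docs_per_day, docs_per_year, nodes, round(storage_tb, 1))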

-- Jack Krupansky

-----Original Message----- 
From: shushuai zhu
Sent: Saturday, March 22, 2014 11:32 AM
To: solr-user@lucene.apache.org
Subject: Re: Best approach to handle large volume of documents with 
constantly high incoming rate?

Any thoughts? Can Solr Cloud support such use case with acceptable 
performance?



On Thursday, March 20, 2014 7:51 PM, shushuai zhu <ss...@yahoo.com> wrote:

Hi,

I am looking for some advice to handle large volume of documents with a very 
high incoming rate. The size of each document is about 0.5 KB and the 
incoming rate could be more than 20K per second and we want to store about 
one year's documents in Solr for near real-time searching. The goal is to 
achieve acceptable indexing and querying performance.

We will use techniques like soft commit, dedicated indexing servers, etc. My 
main question is about how to structure the collection/shard/core to achieve 
the goals. Since the incoming rate is very high, we do not want the incoming 
documents to affect the existing older indexes. Some thoughts are to create 
a latest index to hold the incoming documents (say latest half hour's data, 
about 36M docs) so queries on older data could be faster since the old 
indexes are not affected. There seem three ways to grow the time dimension 
by adding/splitting/creating a new object listed below every half hour:

collection
shard
core

Which is the best way to grow the time dimension? Any limitation in that 
direction? Or there is some better approach?

As an example, I am thinking about having 4 nodes with the following 
configuration to setup a Solr Cloud:

Memory: 128 GB
Storage: 4 TB

How to set the collection/shard/core to deal with the use case?

Thanks in advance.

Shushuai 


Re: Best approach to handle large volume of documents with constantly high incoming rate?

Posted by shushuai zhu <ss...@yahoo.com>.
Any thoughts? Can Solr Cloud support such a use case with acceptable performance?



On Thursday, March 20, 2014 7:51 PM, shushuai zhu <ss...@yahoo.com> wrote:
  
Hi, 

I am looking for some advice to handle large volume of documents with a very high incoming rate. The size of each document is about 0.5 KB and the incoming rate could be more than 20K per second and we want to store about one year's documents in Solr for near real-time searching. The goal is to achieve acceptable indexing and querying performance.

We will use techniques like soft commit, dedicated indexing servers, etc. My main question is about how to structure the collection/shard/core to achieve the goals. Since the incoming rate is very high, we do not want the incoming documents to affect the existing older indexes. Some thoughts are to create a latest index to hold the incoming documents (say latest half hour's data, about 36M docs) so queries on older data could be faster since the old indexes are not affected. There seem three ways to grow the time dimension by adding/splitting/creating a new object listed below every half hour:

collection
shard
core

Which is the best way to grow the time dimension? Any limitation in that direction? Or there is some better approach?

As an example, I am thinking about having 4 nodes with the following configuration to setup a Solr Cloud:

Memory: 128 GB
Storage: 4 TB

How to set the collection/shard/core to deal with the use case?

Thanks in advance.

Shushuai