Posted to solr-user@lucene.apache.org by Mugeesh Husain <mu...@gmail.com> on 2015/12/08 13:18:56 UTC

capacity of storage a single core

Two simple questions regarding capacity:

1.) How many documents can we store in a single core (core storage
capacity)?
2.) How many cores can we create on a single server (single-node cluster)?


Thanks,
Mugeesh




Re: capacity of storage a single core

Posted by Mugeesh Husain <mu...@gmail.com>.
@Upayavira,

Could you provide a link showing that this issue has been resolved?

>> So long as your joined-to collection is replicated across every box
Where can I find a related link or example for this?




Re: capacity of storage a single core

Posted by Upayavira <uv...@odoko.co.uk>.
I understood that in later Solr releases those join issues have been
(partially) resolved. So long as your joined-to collection is replicated
across every box, you should be good.
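
For anyone who wants to see what that looks like in practice, here is a
minimal sketch (plain JDK, no SolrJ) of a cross-collection join query. The
host, the collection names ("employees", "companies") and the field names
are hypothetical; the assumption, per the above, is that "companies" is a
single-shard collection with a replica on every node hosting "employees".

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.net.URLEncoder;

    public class CrossCollectionJoin {
        public static void main(String[] args) throws Exception {
            // Join from the single-shard, fully replicated "companies"
            // collection onto the sharded "employees" collection.
            String join = "{!join from=company_id to=company_id "
                    + "fromIndex=companies}country:DK";
            String url = "http://localhost:8983/solr/employees/select"
                    + "?q=" + URLEncoder.encode(join, "UTF-8")
                    + "&fq=" + URLEncoder.encode("active:true", "UTF-8")
                    + "&wt=json";

            HttpURLConnection conn =
                    (HttpURLConnection) new URL(url).openConnection();
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
                in.lines().forEach(System.out::println); // raw JSON response
            }
        }
    }

The same q parameter can of course be pasted into the admin UI or curl;
only the {!join ...} prefix matters here.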

Upayavira

On Tue, Dec 8, 2015, at 04:17 PM, Mugeesh Husain wrote:
> Thanks Toke Eskildsen,
> 
> Actually i need to join on my core, that why i am going to solrlcoud(join
> does not support in solrlcoud)
> 
> Is there any alternate way to doing it ?
> 
> 
> 

Re: capacity of storage a single core

Posted by Mugeesh Husain <mu...@gmail.com>.
Thanks, Toke Eskildsen.

Actually I need to join on my core, which is a problem as I am moving to
SolrCloud (join is not supported in SolrCloud).

Is there any alternative way of doing it?




Re: capacity of storage a single core

Posted by Susheel Kumar <su...@gmail.com>.
Thanks, Alessandro. We can attempt to come up with such a blog, and I can
volunteer for the bullets/headings to start with. I also agree that we
can't come up with a definitive answer, as mentioned elsewhere, but we can
at least attempt to consolidate all this knowledge in one place. As of now
I see a few sources that can be referred to for that:

https://wiki.apache.org/solr/SolrPerformanceProblems
http://lucidworks.com/blog/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/
Uwe's article on MMap
Erick's and others' valuable posts



On Fri, Dec 11, 2015 at 6:20 AM, Alessandro Benedetti <abenedetti@apache.org
> wrote:

> Susheel, this is a very good idea.
> I am a little bit busy this period, so I doubt I can contribute with a blog
> post, but it would be great if anyone has time.
> If not I will add it to my backlog and sooner or later I will do it :)
>
> Furthermore latest observations from Erick are pure gold, and I agree
> completely.
> I have only a question related this :
>
> 1>  "the entire index". What's this? The size on disk?
> > 90% of that size on disk may be stored data which
> > uses very little memory, which is limited by the
> > documentCache in Solr. OTOH, only 10% of the on-disk
> > size might be stored data.
>
>
> If I am correct the documentCache in Solr is a map that relates the Lucene
> document ordinal to the stored fields for that document.
> We have control on that and we can assign our preferred values.
> First question :
> 1) Is this using the JVM memory to store this cache ? I assume yes.
> So we need to take care of our JVM memory if we want to store in memory big
> chunks of the stored index.
>
> 2) MMap index segments are actually only the segments used for searching ?
> Is not the Lucene directory memory mapping the stored segments as well ?
> This was my understanding but maybe I am wrong.
> In the case we first memory map the stored segments and then potentially
> store them on the Solr cache as well, right ?
>
> Cheers
>
>
> On 10 December 2015 at 19:43, Susheel Kumar <su...@gmail.com> wrote:
>
> > Like the details here Eric how you broke memory into different parts. I
> > feel if we can combine lot of this knowledge from your various posts,
> above
> > sizing blog, Solr wiki pages, Uwe article on MMap/heap,  consolidate and
> > present in at single place which may help lot of new folks/folks
> struggling
> > with memory/heap/sizing issues questions etc.
> >
> > Thanks,
> > Susheel
> >
> > On Wed, Dec 9, 2015 at 12:40 PM, Erick Erickson <erickerickson@gmail.com
> >
> > wrote:
> >
> > > I object to the question. And the advice. And... ;).
> > >
> > > Practically, IMO guidance that "the entire index should
> > > fit into memory" is misleading, especially for newbies.
> > > Let's break it down:
> > >
> > > 1>  "the entire index". What's this? The size on disk?
> > > 90% of that size on disk may be stored data which
> > > uses very little memory, which is limited by the
> > > documentCache in Solr. OTOH, only 10% of the on-disk
> > > size might be stored data.
> > >
> > > 2> "fit into memory". What memory? Certainly not
> > > the JVM as much of the Lucene-level data is in
> > > MMapDirectory which uses the OS memory. So
> > > this _probably_ means JVM + OS memory, and OS
> > > memory is shared amongst other processes as well.
> > >
> > > 3> Solr and Lucene build in-memory structures that
> > > aren't reflected in the index size on disk. I've seen
> > > filterCaches for instance that have been (mis) configured
> > > that could grow to 100s of G. This is totally not reflected in
> > > the "index size".
> > >
> > > 4> Try faceting on a text field with lots of unique
> > > values. Bad Practice, but you'll see just how quickly
> > > the _query_ can change the memory requirements.
> > >
> > > 5> Sure, with modern hardware we can create huge JVM
> > > heaps... that hit GC pauses that'll drive performance
> > > down, sometimes radically.
> > >
> > > I've seen 350M docs, 200-300 fields (aggregate) fit into 12G
> > > of JVM. I've seen 25M docs (really big ones) strain 48G
> > > JVM heaps.
> > >
> > > Jack's approach is what I use; pick a number and test with it.
> > > Here's an approach:
> > >
> > >
> >
> https://lucidworks.com/blog/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/
> > >
> > > Best,
> > > Erick
> > >
> > > On Wed, Dec 9, 2015 at 8:54 AM, Susheel Kumar <su...@gmail.com>
> > > wrote:
> > > > Thanks, Jack for quick reply.  With Replica / Shard I mean to say on
> a
> > > > given machine there may be two/more replicas and all of them may not
> > fit
> > > > into memory.
> > > >
> > > > On Wed, Dec 9, 2015 at 11:00 AM, Jack Krupansky <
> > > jack.krupansky@gmail.com>
> > > > wrote:
> > > >
> > > >> Yes, there are nuances to any general rule. It's just a starting
> > point,
> > > and
> > > >> your own testing will confirm specific details for your specific app
> > and
> > > >> data. For example, maybe you don't query all fields commonly, so
> each
> > > >> field-specific index may not require memory or not require it so
> > > commonly.
> > > >> And, yes, each app has its own latency requirements. The purpose of
> a
> > > >> general rule is to generally avoid unhappiness, but if you have an
> > > appetite
> > > >> and tolerance for unhappiness, then go for it.
> > > >>
> > > >> Replica vs. shard? They're basically the same - a replica is a copy
> > of a
> > > >> shard.
> > > >>
> > > >> -- Jack Krupansky
> > > >>
> > > >> On Wed, Dec 9, 2015 at 10:36 AM, Susheel Kumar <
> susheel2777@gmail.com
> > >
> > > >> wrote:
> > > >>
> > > >> > Hi Jack,
> > > >> >
> > > >> > Just to add, OS Disk Cache will still make query performant even
> > > though
> > > >> > entire index can't be loaded into memory. How much more latency
> > > compare
> > > >> to
> > > >> > if index gets completely loaded into memory may vary depending to
> > > index
> > > >> > size etc.  I am trying to clarify this here because lot of folks
> > takes
> > > >> this
> > > >> > as a hard guideline (to fit index into memory)  and try to come up
> > > with
> > > >> > hardware/machines (100's of machines) just for the sake of fitting
> > > index
> > > >> > into memory even though there may not be much load/qps on the
> > cluster.
> > > >> For
> > > >> > e.g. this may vary and needs to be tested on case by case basis
> but
> > a
> > > >> > machine with 64GB  should still provide good performance (not the
> > > best)
> > > >> for
> > > >> > 100G index on that machine.  Do you agree / any thoughts?
> > > >> >
> > > >> > Same i believe is the case with Replicas,   as on a single machine
> > you
> > > >> have
> > > >> > replicas which itself may not fit into memory as well along with
> > shard
> > > >> > index.
> > > >> >
> > > >> > Thanks,
> > > >> > Susheel
> > > >> >
> > > >> > On Tue, Dec 8, 2015 at 11:31 AM, Jack Krupansky <
> > > >> jack.krupansky@gmail.com>
> > > >> > wrote:
> > > >> >
> > > >> > > Generally, you will be resource limited (memory, cpu) rather
> than
> > by
> > > >> some
> > > >> > > arbitrary numeric limit (like 2 billion.)
> > > >> > >
> > > >> > > My personal general recommendation is for a practical limit is
> 100
> > > >> > million
> > > >> > > documents on a machine/node. Depending on your data model and
> > actual
> > > >> data
> > > >> > > that number could be higher or lower. A proof of concept test
> will
> > > >> allow
> > > >> > > you to determine the actual number for your particular use case,
> > > but a
> > > >> > > presumed limit of 100 million is not a bad start.
> > > >> > >
> > > >> > > You should have enough memory to hold the entire index in system
> > > >> memory.
> > > >> > If
> > > >> > > not, your query latency will suffer due to I/O required to
> > > constantly
> > > >> > > re-read portions of the index into memory.
> > > >> > >
> > > >> > > The practical limit for documents is not per core or number of
> > cores
> > > >> but
> > > >> > > across all cores on the node since it is mostly a memory limit
> and
> > > the
> > > >> > > available CPU resources for accessing that memory.
> > > >> > >
> > > >> > > -- Jack Krupansky
> > > >> > >
> > > >> > > On Tue, Dec 8, 2015 at 8:57 AM, Toke Eskildsen <
> > > te@statsbiblioteket.dk
> > > >> >
> > > >> > > wrote:
> > > >> > >
> > > >> > > > On Tue, 2015-12-08 at 05:18 -0700, Mugeesh Husain wrote:
> > > >> > > > > Capacity regarding 2 simple question:
> > > >> > > > >
> > > >> > > > > 1.) How many document we could store in single core(capacity
> > of
> > > >> core
> > > >> > > > > storage)
> > > >> > > >
> > > >> > > > There is hard limit of 2 billion documents.
> > > >> > > >
> > > >> > > > > 2.) How many core we could create in a single server(single
> > node
> > > >> > > cluster)
> > > >> > > >
> > > >> > > > There is no hard limit. Except for 2 billion cores, I guess.
> But
> > > at
> > > >> > this
> > > >> > > > point in time that is a ridiculously high number of cores.
> > > >> > > >
> > > >> > > > It is hard to give a suggestion for real-world limits as
> indexes
> > > >> vary a
> > > >> > > > lot and the rules of thumb tend to be quite poor when scaling
> > up.
> > > >> > > >
> > > >> > > >
> > > >> > >
> > > >> >
> > > >>
> > >
> >
> http://lucidworks.com/blog/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/
> > > >> > > >
> > > >> > > > People generally seems to run into problems with more than
> 1000
> > > >> > > > not-too-large cores. If the cores are large, there will
> probably
> > > be
> > > >> > > > performance problems long before that.
> > > >> > > >
> > > >> > > > You will have to build a prototype and test.
> > > >> > > >
> > > >> > > > - Toke Eskildsen, State and University Library, Denmark
> > > >> > > >
> > > >> > > >
> > > >> > > >
> > > >> > >
> > > >> >
> > > >>
> > >
> >
>
>
>
> --
> --------------------------
>
> Benedetti Alessandro
> Visiting card : http://about.me/alessandro_benedetti
>
> "Tyger, tyger burning bright
> In the forests of the night,
> What immortal hand or eye
> Could frame thy fearful symmetry?"
>
> William Blake - Songs of Experience -1794 England
>

Re: capacity of storage a single core

Posted by Alessandro Benedetti <ab...@apache.org>.
Susheel, this is a very good idea.
I am a little bit busy at the moment, so I doubt I can contribute a blog
post, but it would be great if anyone has time.
If not, I will add it to my backlog and sooner or later I will do it :)

Furthermore, the latest observations from Erick are pure gold, and I agree
completely.
I have only one question related to this:

1>  "the entire index". What's this? The size on disk?
> 90% of that size on disk may be stored data which
> uses very little memory, which is limited by the
> documentCache in Solr. OTOH, only 10% of the on-disk
> size might be stored data.


If I am correct, the documentCache in Solr is a map that relates the
Lucene document ordinal to the stored fields for that document.
We have control over that and can assign our preferred values.
First question:
1) Does this cache use JVM memory? I assume yes.
So we need to take care of our JVM heap if we want to keep big chunks of
the stored index in memory.

2) Are the memory-mapped index segments only the segments used for
searching? Doesn't the Lucene directory memory-map the stored segments as
well? This was my understanding, but maybe I am wrong.
In that case we first memory-map the stored segments and then potentially
store them in the Solr cache as well, right?

Cheers


On 10 December 2015 at 19:43, Susheel Kumar <su...@gmail.com> wrote:

> Like the details here Eric how you broke memory into different parts. I
> feel if we can combine lot of this knowledge from your various posts, above
> sizing blog, Solr wiki pages, Uwe article on MMap/heap,  consolidate and
> present in at single place which may help lot of new folks/folks struggling
> with memory/heap/sizing issues questions etc.
>
> Thanks,
> Susheel
>
> On Wed, Dec 9, 2015 at 12:40 PM, Erick Erickson <er...@gmail.com>
> wrote:
>
> > I object to the question. And the advice. And... ;).
> >
> > Practically, IMO guidance that "the entire index should
> > fit into memory" is misleading, especially for newbies.
> > Let's break it down:
> >
> > 1>  "the entire index". What's this? The size on disk?
> > 90% of that size on disk may be stored data which
> > uses very little memory, which is limited by the
> > documentCache in Solr. OTOH, only 10% of the on-disk
> > size might be stored data.
> >
> > 2> "fit into memory". What memory? Certainly not
> > the JVM as much of the Lucene-level data is in
> > MMapDirectory which uses the OS memory. So
> > this _probably_ means JVM + OS memory, and OS
> > memory is shared amongst other processes as well.
> >
> > 3> Solr and Lucene build in-memory structures that
> > aren't reflected in the index size on disk. I've seen
> > filterCaches for instance that have been (mis) configured
> > that could grow to 100s of G. This is totally not reflected in
> > the "index size".
> >
> > 4> Try faceting on a text field with lots of unique
> > values. Bad Practice, but you'll see just how quickly
> > the _query_ can change the memory requirements.
> >
> > 5> Sure, with modern hardware we can create huge JVM
> > heaps... that hit GC pauses that'll drive performance
> > down, sometimes radically.
> >
> > I've seen 350M docs, 200-300 fields (aggregate) fit into 12G
> > of JVM. I've seen 25M docs (really big ones) strain 48G
> > JVM heaps.
> >
> > Jack's approach is what I use; pick a number and test with it.
> > Here's an approach:
> >
> >
> https://lucidworks.com/blog/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/
> >
> > Best,
> > Erick
> >
> > On Wed, Dec 9, 2015 at 8:54 AM, Susheel Kumar <su...@gmail.com>
> > wrote:
> > > Thanks, Jack for quick reply.  With Replica / Shard I mean to say on a
> > > given machine there may be two/more replicas and all of them may not
> fit
> > > into memory.
> > >
> > > On Wed, Dec 9, 2015 at 11:00 AM, Jack Krupansky <
> > jack.krupansky@gmail.com>
> > > wrote:
> > >
> > >> Yes, there are nuances to any general rule. It's just a starting
> point,
> > and
> > >> your own testing will confirm specific details for your specific app
> and
> > >> data. For example, maybe you don't query all fields commonly, so each
> > >> field-specific index may not require memory or not require it so
> > commonly.
> > >> And, yes, each app has its own latency requirements. The purpose of a
> > >> general rule is to generally avoid unhappiness, but if you have an
> > appetite
> > >> and tolerance for unhappiness, then go for it.
> > >>
> > >> Replica vs. shard? They're basically the same - a replica is a copy
> of a
> > >> shard.
> > >>
> > >> -- Jack Krupansky
> > >>
> > >> On Wed, Dec 9, 2015 at 10:36 AM, Susheel Kumar <susheel2777@gmail.com
> >
> > >> wrote:
> > >>
> > >> > Hi Jack,
> > >> >
> > >> > Just to add, OS Disk Cache will still make query performant even
> > though
> > >> > entire index can't be loaded into memory. How much more latency
> > compare
> > >> to
> > >> > if index gets completely loaded into memory may vary depending to
> > index
> > >> > size etc.  I am trying to clarify this here because lot of folks
> takes
> > >> this
> > >> > as a hard guideline (to fit index into memory)  and try to come up
> > with
> > >> > hardware/machines (100's of machines) just for the sake of fitting
> > index
> > >> > into memory even though there may not be much load/qps on the
> cluster.
> > >> For
> > >> > e.g. this may vary and needs to be tested on case by case basis but
> a
> > >> > machine with 64GB  should still provide good performance (not the
> > best)
> > >> for
> > >> > 100G index on that machine.  Do you agree / any thoughts?
> > >> >
> > >> > Same i believe is the case with Replicas,   as on a single machine
> you
> > >> have
> > >> > replicas which itself may not fit into memory as well along with
> shard
> > >> > index.
> > >> >
> > >> > Thanks,
> > >> > Susheel
> > >> >
> > >> > On Tue, Dec 8, 2015 at 11:31 AM, Jack Krupansky <
> > >> jack.krupansky@gmail.com>
> > >> > wrote:
> > >> >
> > >> > > Generally, you will be resource limited (memory, cpu) rather than
> by
> > >> some
> > >> > > arbitrary numeric limit (like 2 billion.)
> > >> > >
> > >> > > My personal general recommendation is for a practical limit is 100
> > >> > million
> > >> > > documents on a machine/node. Depending on your data model and
> actual
> > >> data
> > >> > > that number could be higher or lower. A proof of concept test will
> > >> allow
> > >> > > you to determine the actual number for your particular use case,
> > but a
> > >> > > presumed limit of 100 million is not a bad start.
> > >> > >
> > >> > > You should have enough memory to hold the entire index in system
> > >> memory.
> > >> > If
> > >> > > not, your query latency will suffer due to I/O required to
> > constantly
> > >> > > re-read portions of the index into memory.
> > >> > >
> > >> > > The practical limit for documents is not per core or number of
> cores
> > >> but
> > >> > > across all cores on the node since it is mostly a memory limit and
> > the
> > >> > > available CPU resources for accessing that memory.
> > >> > >
> > >> > > -- Jack Krupansky
> > >> > >
> > >> > > On Tue, Dec 8, 2015 at 8:57 AM, Toke Eskildsen <
> > te@statsbiblioteket.dk
> > >> >
> > >> > > wrote:
> > >> > >
> > >> > > > On Tue, 2015-12-08 at 05:18 -0700, Mugeesh Husain wrote:
> > >> > > > > Capacity regarding 2 simple question:
> > >> > > > >
> > >> > > > > 1.) How many document we could store in single core(capacity
> of
> > >> core
> > >> > > > > storage)
> > >> > > >
> > >> > > > There is hard limit of 2 billion documents.
> > >> > > >
> > >> > > > > 2.) How many core we could create in a single server(single
> node
> > >> > > cluster)
> > >> > > >
> > >> > > > There is no hard limit. Except for 2 billion cores, I guess. But
> > at
> > >> > this
> > >> > > > point in time that is a ridiculously high number of cores.
> > >> > > >
> > >> > > > It is hard to give a suggestion for real-world limits as indexes
> > >> vary a
> > >> > > > lot and the rules of thumb tend to be quite poor when scaling
> up.
> > >> > > >
> > >> > > >
> > >> > >
> > >> >
> > >>
> >
> http://lucidworks.com/blog/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/
> > >> > > >
> > >> > > > People generally seems to run into problems with more than 1000
> > >> > > > not-too-large cores. If the cores are large, there will probably
> > be
> > >> > > > performance problems long before that.
> > >> > > >
> > >> > > > You will have to build a prototype and test.
> > >> > > >
> > >> > > > - Toke Eskildsen, State and University Library, Denmark
> > >> > > >
> > >> > > >
> > >> > > >
> > >> > >
> > >> >
> > >>
> >
>



-- 
--------------------------

Benedetti Alessandro
Visiting card : http://about.me/alessandro_benedetti

"Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?"

William Blake - Songs of Experience -1794 England

Re: capacity of storage a single core

Posted by Toke Eskildsen <te...@statsbiblioteket.dk>.
On Thu, 2015-12-10 at 14:43 -0500, Susheel Kumar wrote:
> Like the details here Eric how you broke memory into different parts. I
> feel if we can combine lot of this knowledge from your various posts, above
> sizing blog, Solr wiki pages, Uwe article on MMap/heap,  consolidate and
> present in at single place which may help lot of new folks/folks struggling
> with memory/heap/sizing issues questions etc.

To demonstrate part of the problem:

Say we have an index with documents representing employees, with three
defined fields: name, company and the dynamic *_custom. Each company
uses 3 dynamic fields with custom names as they see fit.

Let's say we keep track of 1K companies, each with 10K employees.

The full index is now

  total documents: 10M (1K*10K)
  name: 10M unique values (or less due to names not being unique)
  company: Exactly 1K unique values
  *_custom: 3K unique fields, each with 1K unique values

We do our math-math-thing and arrive at an approximate index size of 5GB
(just an extremely loose guess here). Heap is nothing to speak of for
basic search on this, so let's set that to 1GB. We estimate that a
machine with 8GB of physical RAM is more than fine for this - halving
that to 4GB would probably also work well.

Say we want to group on company. The "company" field is UnInverted, so
there is an array of 10M pointers to 1K values. That is about 50MB
overhead. No change needed to heap allocation.

Say we want to filter on company and cache the filters. Each filter
takes ~1MB, so that is 1000*1MB = 1GB of heap. Okay, so we bump the heap
from 1 to 2GB. The 4GB machine might be a bit small here, depending on
storage, but the 8GB one will work just fine.

Say each company wants to facet on their custom fields. There are 3K of
those fields, each one requiring ~50MB (like the company grouping) for
UnInversion. That is 150GB of heap. Yes, 150GB.


What about DocValues? Well, if we just use standard String faceting, we
need a map from segment-ordinals to global-ordinals for each facet field
or in other words a map with 1K entries for each facet. Such a map can
be represented with < 20 bits/entry (finely packed), so that is ~3KB of
heap for each field or 9GB (3K*3KB) for the full range of custom fields.
Still way too much for our 8GB machine.

Say we change the custom fields to fixed fields named "custom1",
"custom2" & "custom3" and do some name-mapping in the front-end so it
just looks as if the companies chooses the names themselves.
Suddenly there are only 3 larger fields to facet on instead of 3K small
ones. That is 3*50MB of heap required, even without using DocValues.
And we're back to our 4GB machine.

But wait, the index is used quite a lot: 200 concurrent requests. Each
facet request requires a counter, and for the three custom fields there
are 1M unique values (1000 for each company). Those counters take up
4 bytes*1M = 4MB each, and for 200 concurrent requests that is 800MB +
overhead. Better bump the heap with 1GB extra.

Except that someone turned on threaded faceting, so we do that for the 3
custom fields at the same time, so we better bump with 2GB more. Whoops,
even the 8GB machine is too small.
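
For anyone who wants to play with the arithmetic above, a small
back-of-the-envelope sketch (plain Java; it assumes 4 bytes per UnInverted
ordinal pointer, one bit per document per cached filter and 4 bytes per
facet counter, so the figures land slightly below the rounded ~50MB/150GB
numbers used above):

    public class FacetHeapEstimate {
        public static void main(String[] args) {
            long docs = 10_000_000L;      // 1K companies * 10K employees
            long companies = 1_000L;      // also the number of cached filters
            long customFields = 3_000L;   // distinct dynamic facet fields
            int concurrentRequests = 200;

            // UnInverting one field: one ordinal pointer (~4 bytes) per doc.
            long unInvertOneField = docs * 4;
            // One cached filter is a bitset with one bit per document.
            long filterCacheTotal = companies * (docs / 8);
            // Faceting on all 3K dynamic fields: one UnInverted structure each.
            long dynamicFieldFacets = customFields * unInvertOneField;
            // The same data remapped onto 3 fixed fields (custom1..custom3).
            long fixedFieldFacets = 3 * unInvertOneField;
            // Facet counters: ~1M unique values per fixed field, 4 bytes each.
            long counters = (long) concurrentRequests * 1_000_000L * 4;

            System.out.printf("UnInvert one field         : %,d MB%n", unInvertOneField >> 20);
            System.out.printf("1000 cached filters        : %,d MB%n", filterCacheTotal >> 20);
            System.out.printf("Facet on 3K dynamic fields : %,d GB%n", dynamicFieldFacets >> 30);
            System.out.printf("Facet on 3 fixed fields    : %,d MB%n", fixedFieldFacets >> 20);
            System.out.printf("Counters for 200 requests  : %,d MB%n", counters >> 20);
        }
    }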



Not sure I follow all of the above myself, but the moral should be
clear: seemingly innocuous changes to requirements or setup can easily
result in huge changes to hardware requirements. If I were to describe such
things thoroughly enough for another person (without previous in-depth
knowledge of this field) to make educated guesses, it would be a massive
amount of text with a lot of hard-to-grasp parts. I have tried twice and
scrapped it both times, as it quickly became apparent that it would be much
too unwieldy.

Trying not to be a wet blanket, this could also be because I have my
head too far down in these things. Skipping some details and making some
clearly stated choices up front could work. There is no doubt that there
are a lot of people who ask for estimates, and "we cannot say anything"
is quite a raw deal.


- Toke Eskildsen, State and University Library, Denmark



Re: capacity of storage a single core

Posted by Susheel Kumar <su...@gmail.com>.
I like the details here, Erick, on how you broke memory down into different
parts. I feel that if we can combine a lot of this knowledge from your
various posts, the sizing blog above, the Solr wiki pages, and Uwe's article
on MMap/heap, and consolidate and present it in a single place, it may help
a lot of new folks and folks struggling with memory/heap/sizing questions.

Thanks,
Susheel

On Wed, Dec 9, 2015 at 12:40 PM, Erick Erickson <er...@gmail.com>
wrote:

> I object to the question. And the advice. And... ;).
>
> Practically, IMO guidance that "the entire index should
> fit into memory" is misleading, especially for newbies.
> Let's break it down:
>
> 1>  "the entire index". What's this? The size on disk?
> 90% of that size on disk may be stored data which
> uses very little memory, which is limited by the
> documentCache in Solr. OTOH, only 10% of the on-disk
> size might be stored data.
>
> 2> "fit into memory". What memory? Certainly not
> the JVM as much of the Lucene-level data is in
> MMapDirectory which uses the OS memory. So
> this _probably_ means JVM + OS memory, and OS
> memory is shared amongst other processes as well.
>
> 3> Solr and Lucene build in-memory structures that
> aren't reflected in the index size on disk. I've seen
> filterCaches for instance that have been (mis) configured
> that could grow to 100s of G. This is totally not reflected in
> the "index size".
>
> 4> Try faceting on a text field with lots of unique
> values. Bad Practice, but you'll see just how quickly
> the _query_ can change the memory requirements.
>
> 5> Sure, with modern hardware we can create huge JVM
> heaps... that hit GC pauses that'll drive performance
> down, sometimes radically.
>
> I've seen 350M docs, 200-300 fields (aggregate) fit into 12G
> of JVM. I've seen 25M docs (really big ones) strain 48G
> JVM heaps.
>
> Jack's approach is what I use; pick a number and test with it.
> Here's an approach:
>
> https://lucidworks.com/blog/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/
>
> Best,
> Erick
>
> On Wed, Dec 9, 2015 at 8:54 AM, Susheel Kumar <su...@gmail.com>
> wrote:
> > Thanks, Jack for quick reply.  With Replica / Shard I mean to say on a
> > given machine there may be two/more replicas and all of them may not fit
> > into memory.
> >
> > On Wed, Dec 9, 2015 at 11:00 AM, Jack Krupansky <
> jack.krupansky@gmail.com>
> > wrote:
> >
> >> Yes, there are nuances to any general rule. It's just a starting point,
> and
> >> your own testing will confirm specific details for your specific app and
> >> data. For example, maybe you don't query all fields commonly, so each
> >> field-specific index may not require memory or not require it so
> commonly.
> >> And, yes, each app has its own latency requirements. The purpose of a
> >> general rule is to generally avoid unhappiness, but if you have an
> appetite
> >> and tolerance for unhappiness, then go for it.
> >>
> >> Replica vs. shard? They're basically the same - a replica is a copy of a
> >> shard.
> >>
> >> -- Jack Krupansky
> >>
> >> On Wed, Dec 9, 2015 at 10:36 AM, Susheel Kumar <su...@gmail.com>
> >> wrote:
> >>
> >> > Hi Jack,
> >> >
> >> > Just to add, OS Disk Cache will still make query performant even
> though
> >> > entire index can't be loaded into memory. How much more latency
> compare
> >> to
> >> > if index gets completely loaded into memory may vary depending to
> index
> >> > size etc.  I am trying to clarify this here because lot of folks takes
> >> this
> >> > as a hard guideline (to fit index into memory)  and try to come up
> with
> >> > hardware/machines (100's of machines) just for the sake of fitting
> index
> >> > into memory even though there may not be much load/qps on the cluster.
> >> For
> >> > e.g. this may vary and needs to be tested on case by case basis but a
> >> > machine with 64GB  should still provide good performance (not the
> best)
> >> for
> >> > 100G index on that machine.  Do you agree / any thoughts?
> >> >
> >> > Same i believe is the case with Replicas,   as on a single machine you
> >> have
> >> > replicas which itself may not fit into memory as well along with shard
> >> > index.
> >> >
> >> > Thanks,
> >> > Susheel
> >> >
> >> > On Tue, Dec 8, 2015 at 11:31 AM, Jack Krupansky <
> >> jack.krupansky@gmail.com>
> >> > wrote:
> >> >
> >> > > Generally, you will be resource limited (memory, cpu) rather than by
> >> some
> >> > > arbitrary numeric limit (like 2 billion.)
> >> > >
> >> > > My personal general recommendation is for a practical limit is 100
> >> > million
> >> > > documents on a machine/node. Depending on your data model and actual
> >> data
> >> > > that number could be higher or lower. A proof of concept test will
> >> allow
> >> > > you to determine the actual number for your particular use case,
> but a
> >> > > presumed limit of 100 million is not a bad start.
> >> > >
> >> > > You should have enough memory to hold the entire index in system
> >> memory.
> >> > If
> >> > > not, your query latency will suffer due to I/O required to
> constantly
> >> > > re-read portions of the index into memory.
> >> > >
> >> > > The practical limit for documents is not per core or number of cores
> >> but
> >> > > across all cores on the node since it is mostly a memory limit and
> the
> >> > > available CPU resources for accessing that memory.
> >> > >
> >> > > -- Jack Krupansky
> >> > >
> >> > > On Tue, Dec 8, 2015 at 8:57 AM, Toke Eskildsen <
> te@statsbiblioteket.dk
> >> >
> >> > > wrote:
> >> > >
> >> > > > On Tue, 2015-12-08 at 05:18 -0700, Mugeesh Husain wrote:
> >> > > > > Capacity regarding 2 simple question:
> >> > > > >
> >> > > > > 1.) How many document we could store in single core(capacity of
> >> core
> >> > > > > storage)
> >> > > >
> >> > > > There is hard limit of 2 billion documents.
> >> > > >
> >> > > > > 2.) How many core we could create in a single server(single node
> >> > > cluster)
> >> > > >
> >> > > > There is no hard limit. Except for 2 billion cores, I guess. But
> at
> >> > this
> >> > > > point in time that is a ridiculously high number of cores.
> >> > > >
> >> > > > It is hard to give a suggestion for real-world limits as indexes
> >> vary a
> >> > > > lot and the rules of thumb tend to be quite poor when scaling up.
> >> > > >
> >> > > >
> >> > >
> >> >
> >>
> http://lucidworks.com/blog/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/
> >> > > >
> >> > > > People generally seems to run into problems with more than 1000
> >> > > > not-too-large cores. If the cores are large, there will probably
> be
> >> > > > performance problems long before that.
> >> > > >
> >> > > > You will have to build a prototype and test.
> >> > > >
> >> > > > - Toke Eskildsen, State and University Library, Denmark
> >> > > >
> >> > > >
> >> > > >
> >> > >
> >> >
> >>
>

Re: capacity of storage a single core

Posted by Erick Erickson <er...@gmail.com>.
I object to the question. And the advice. And... ;).

Practically, IMO guidance that "the entire index should
fit into memory" is misleading, especially for newbies.
Let's break it down:

1>  "the entire index". What's this? The size on disk?
90% of that size on disk may be stored data which
uses very little memory, which is limited by the
documentCache in Solr. OTOH, only 10% of the on-disk
size might be stored data.

2> "fit into memory". What memory? Certainly not
the JVM as much of the Lucene-level data is in
MMapDirectory which uses the OS memory. So
this _probably_ means JVM + OS memory, and OS
memory is shared amongst other processes as well.

3> Solr and Lucene build in-memory structures that
aren't reflected in the index size on disk. I've seen
filterCaches, for instance, that were (mis)configured
such that they could grow to 100s of GB. This is totally not
reflected in the "index size" (see the sketch after this list).

4> Try faceting on a text field with lots of unique
values. Bad Practice, but you'll see just how quickly
the _query_ can change the memory requirements.

5> Sure, with modern hardware we can create huge JVM
heaps... that hit GC pauses that'll drive performance
down, sometimes radically.
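
To make point 3 concrete, here is a rough sketch of how filterCache heap
use scales with index size and cache size (plain arithmetic; it assumes
the common one-bit-per-document bitset representation, and real entries
can be smaller for sparse filters):

    public class FilterCacheEstimate {
        public static void main(String[] args) {
            long maxDoc = 350_000_000L; // documents in the core (hypothetical)
            int cacheSize = 512;        // filterCache "size" in solrconfig.xml

            // A fully populated filter is one bit per document in the core.
            long bytesPerEntry = maxDoc / 8;
            long worstCaseHeap = bytesPerEntry * cacheSize;

            System.out.printf("Per cached filter: %,d MB%n", bytesPerEntry >> 20);
            System.out.printf("Cache filled up  : %,d GB%n", worstCaseHeap >> 30);
        }
    }

Bump the cache size to a few thousand entries on an index that large and
you are in the "100s of GB" territory, entirely independent of the on-disk
index size.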

I've seen 350M docs, 200-300 fields (aggregate) fit into 12G
of JVM. I've seen 25M docs (really big ones) strain 48G
JVM heaps.

Jack's approach is what I use: pick a number and test with it.
Here's a starting point:
https://lucidworks.com/blog/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/

Best,
Erick

On Wed, Dec 9, 2015 at 8:54 AM, Susheel Kumar <su...@gmail.com> wrote:
> Thanks, Jack for quick reply.  With Replica / Shard I mean to say on a
> given machine there may be two/more replicas and all of them may not fit
> into memory.
>
> On Wed, Dec 9, 2015 at 11:00 AM, Jack Krupansky <ja...@gmail.com>
> wrote:
>
>> Yes, there are nuances to any general rule. It's just a starting point, and
>> your own testing will confirm specific details for your specific app and
>> data. For example, maybe you don't query all fields commonly, so each
>> field-specific index may not require memory or not require it so commonly.
>> And, yes, each app has its own latency requirements. The purpose of a
>> general rule is to generally avoid unhappiness, but if you have an appetite
>> and tolerance for unhappiness, then go for it.
>>
>> Replica vs. shard? They're basically the same - a replica is a copy of a
>> shard.
>>
>> -- Jack Krupansky
>>
>> On Wed, Dec 9, 2015 at 10:36 AM, Susheel Kumar <su...@gmail.com>
>> wrote:
>>
>> > Hi Jack,
>> >
>> > Just to add, OS Disk Cache will still make query performant even though
>> > entire index can't be loaded into memory. How much more latency compare
>> to
>> > if index gets completely loaded into memory may vary depending to index
>> > size etc.  I am trying to clarify this here because lot of folks takes
>> this
>> > as a hard guideline (to fit index into memory)  and try to come up with
>> > hardware/machines (100's of machines) just for the sake of fitting index
>> > into memory even though there may not be much load/qps on the cluster.
>> For
>> > e.g. this may vary and needs to be tested on case by case basis but a
>> > machine with 64GB  should still provide good performance (not the best)
>> for
>> > 100G index on that machine.  Do you agree / any thoughts?
>> >
>> > Same i believe is the case with Replicas,   as on a single machine you
>> have
>> > replicas which itself may not fit into memory as well along with shard
>> > index.
>> >
>> > Thanks,
>> > Susheel
>> >
>> > On Tue, Dec 8, 2015 at 11:31 AM, Jack Krupansky <
>> jack.krupansky@gmail.com>
>> > wrote:
>> >
>> > > Generally, you will be resource limited (memory, cpu) rather than by
>> some
>> > > arbitrary numeric limit (like 2 billion.)
>> > >
>> > > My personal general recommendation is for a practical limit is 100
>> > million
>> > > documents on a machine/node. Depending on your data model and actual
>> data
>> > > that number could be higher or lower. A proof of concept test will
>> allow
>> > > you to determine the actual number for your particular use case, but a
>> > > presumed limit of 100 million is not a bad start.
>> > >
>> > > You should have enough memory to hold the entire index in system
>> memory.
>> > If
>> > > not, your query latency will suffer due to I/O required to constantly
>> > > re-read portions of the index into memory.
>> > >
>> > > The practical limit for documents is not per core or number of cores
>> but
>> > > across all cores on the node since it is mostly a memory limit and the
>> > > available CPU resources for accessing that memory.
>> > >
>> > > -- Jack Krupansky
>> > >
>> > > On Tue, Dec 8, 2015 at 8:57 AM, Toke Eskildsen <te@statsbiblioteket.dk
>> >
>> > > wrote:
>> > >
>> > > > On Tue, 2015-12-08 at 05:18 -0700, Mugeesh Husain wrote:
>> > > > > Capacity regarding 2 simple question:
>> > > > >
>> > > > > 1.) How many document we could store in single core(capacity of
>> core
>> > > > > storage)
>> > > >
>> > > > There is hard limit of 2 billion documents.
>> > > >
>> > > > > 2.) How many core we could create in a single server(single node
>> > > cluster)
>> > > >
>> > > > There is no hard limit. Except for 2 billion cores, I guess. But at
>> > this
>> > > > point in time that is a ridiculously high number of cores.
>> > > >
>> > > > It is hard to give a suggestion for real-world limits as indexes
>> vary a
>> > > > lot and the rules of thumb tend to be quite poor when scaling up.
>> > > >
>> > > >
>> > >
>> >
>> http://lucidworks.com/blog/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/
>> > > >
>> > > > People generally seems to run into problems with more than 1000
>> > > > not-too-large cores. If the cores are large, there will probably be
>> > > > performance problems long before that.
>> > > >
>> > > > You will have to build a prototype and test.
>> > > >
>> > > > - Toke Eskildsen, State and University Library, Denmark
>> > > >
>> > > >
>> > > >
>> > >
>> >
>>

Re: capacity of storage a single core

Posted by Susheel Kumar <su...@gmail.com>.
Thanks, Jack, for the quick reply. With replica/shard I mean that on a
given machine there may be two or more replicas, and all of them together
may not fit into memory.

On Wed, Dec 9, 2015 at 11:00 AM, Jack Krupansky <ja...@gmail.com>
wrote:

> Yes, there are nuances to any general rule. It's just a starting point, and
> your own testing will confirm specific details for your specific app and
> data. For example, maybe you don't query all fields commonly, so each
> field-specific index may not require memory or not require it so commonly.
> And, yes, each app has its own latency requirements. The purpose of a
> general rule is to generally avoid unhappiness, but if you have an appetite
> and tolerance for unhappiness, then go for it.
>
> Replica vs. shard? They're basically the same - a replica is a copy of a
> shard.
>
> -- Jack Krupansky
>
> On Wed, Dec 9, 2015 at 10:36 AM, Susheel Kumar <su...@gmail.com>
> wrote:
>
> > Hi Jack,
> >
> > Just to add, OS Disk Cache will still make query performant even though
> > entire index can't be loaded into memory. How much more latency compare
> to
> > if index gets completely loaded into memory may vary depending to index
> > size etc.  I am trying to clarify this here because lot of folks takes
> this
> > as a hard guideline (to fit index into memory)  and try to come up with
> > hardware/machines (100's of machines) just for the sake of fitting index
> > into memory even though there may not be much load/qps on the cluster.
> For
> > e.g. this may vary and needs to be tested on case by case basis but a
> > machine with 64GB  should still provide good performance (not the best)
> for
> > 100G index on that machine.  Do you agree / any thoughts?
> >
> > Same i believe is the case with Replicas,   as on a single machine you
> have
> > replicas which itself may not fit into memory as well along with shard
> > index.
> >
> > Thanks,
> > Susheel
> >
> > On Tue, Dec 8, 2015 at 11:31 AM, Jack Krupansky <
> jack.krupansky@gmail.com>
> > wrote:
> >
> > > Generally, you will be resource limited (memory, cpu) rather than by
> some
> > > arbitrary numeric limit (like 2 billion.)
> > >
> > > My personal general recommendation is for a practical limit is 100
> > million
> > > documents on a machine/node. Depending on your data model and actual
> data
> > > that number could be higher or lower. A proof of concept test will
> allow
> > > you to determine the actual number for your particular use case, but a
> > > presumed limit of 100 million is not a bad start.
> > >
> > > You should have enough memory to hold the entire index in system
> memory.
> > If
> > > not, your query latency will suffer due to I/O required to constantly
> > > re-read portions of the index into memory.
> > >
> > > The practical limit for documents is not per core or number of cores
> but
> > > across all cores on the node since it is mostly a memory limit and the
> > > available CPU resources for accessing that memory.
> > >
> > > -- Jack Krupansky
> > >
> > > On Tue, Dec 8, 2015 at 8:57 AM, Toke Eskildsen <te@statsbiblioteket.dk
> >
> > > wrote:
> > >
> > > > On Tue, 2015-12-08 at 05:18 -0700, Mugeesh Husain wrote:
> > > > > Capacity regarding 2 simple question:
> > > > >
> > > > > 1.) How many document we could store in single core(capacity of
> core
> > > > > storage)
> > > >
> > > > There is hard limit of 2 billion documents.
> > > >
> > > > > 2.) How many core we could create in a single server(single node
> > > cluster)
> > > >
> > > > There is no hard limit. Except for 2 billion cores, I guess. But at
> > this
> > > > point in time that is a ridiculously high number of cores.
> > > >
> > > > It is hard to give a suggestion for real-world limits as indexes
> vary a
> > > > lot and the rules of thumb tend to be quite poor when scaling up.
> > > >
> > > >
> > >
> >
> http://lucidworks.com/blog/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/
> > > >
> > > > People generally seems to run into problems with more than 1000
> > > > not-too-large cores. If the cores are large, there will probably be
> > > > performance problems long before that.
> > > >
> > > > You will have to build a prototype and test.
> > > >
> > > > - Toke Eskildsen, State and University Library, Denmark
> > > >
> > > >
> > > >
> > >
> >
>

Re: capacity of storage a single core

Posted by Jack Krupansky <ja...@gmail.com>.
Yes, there are nuances to any general rule. It's just a starting point, and
your own testing will confirm specific details for your specific app and
data. For example, maybe you don't query all fields commonly, so each
field-specific index may not require memory or not require it so commonly.
And, yes, each app has its own latency requirements. The purpose of a
general rule is to generally avoid unhappiness, but if you have an appetite
and tolerance for unhappiness, then go for it.

Replica vs. shard? They're basically the same - a replica is a copy of a
shard.

-- Jack Krupansky

On Wed, Dec 9, 2015 at 10:36 AM, Susheel Kumar <su...@gmail.com>
wrote:

> Hi Jack,
>
> Just to add, OS Disk Cache will still make query performant even though
> entire index can't be loaded into memory. How much more latency compare to
> if index gets completely loaded into memory may vary depending to index
> size etc.  I am trying to clarify this here because lot of folks takes this
> as a hard guideline (to fit index into memory)  and try to come up with
> hardware/machines (100's of machines) just for the sake of fitting index
> into memory even though there may not be much load/qps on the cluster.  For
> e.g. this may vary and needs to be tested on case by case basis but a
> machine with 64GB  should still provide good performance (not the best) for
> 100G index on that machine.  Do you agree / any thoughts?
>
> Same i believe is the case with Replicas,   as on a single machine you have
> replicas which itself may not fit into memory as well along with shard
> index.
>
> Thanks,
> Susheel
>
> On Tue, Dec 8, 2015 at 11:31 AM, Jack Krupansky <ja...@gmail.com>
> wrote:
>
> > Generally, you will be resource limited (memory, cpu) rather than by some
> > arbitrary numeric limit (like 2 billion.)
> >
> > My personal general recommendation is for a practical limit is 100
> million
> > documents on a machine/node. Depending on your data model and actual data
> > that number could be higher or lower. A proof of concept test will allow
> > you to determine the actual number for your particular use case, but a
> > presumed limit of 100 million is not a bad start.
> >
> > You should have enough memory to hold the entire index in system memory.
> If
> > not, your query latency will suffer due to I/O required to constantly
> > re-read portions of the index into memory.
> >
> > The practical limit for documents is not per core or number of cores but
> > across all cores on the node since it is mostly a memory limit and the
> > available CPU resources for accessing that memory.
> >
> > -- Jack Krupansky
> >
> > On Tue, Dec 8, 2015 at 8:57 AM, Toke Eskildsen <te...@statsbiblioteket.dk>
> > wrote:
> >
> > > On Tue, 2015-12-08 at 05:18 -0700, Mugeesh Husain wrote:
> > > > Capacity regarding 2 simple question:
> > > >
> > > > 1.) How many document we could store in single core(capacity of core
> > > > storage)
> > >
> > > There is hard limit of 2 billion documents.
> > >
> > > > 2.) How many core we could create in a single server(single node
> > cluster)
> > >
> > > There is no hard limit. Except for 2 billion cores, I guess. But at
> this
> > > point in time that is a ridiculously high number of cores.
> > >
> > > It is hard to give a suggestion for real-world limits as indexes vary a
> > > lot and the rules of thumb tend to be quite poor when scaling up.
> > >
> > >
> >
> http://lucidworks.com/blog/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/
> > >
> > > People generally seems to run into problems with more than 1000
> > > not-too-large cores. If the cores are large, there will probably be
> > > performance problems long before that.
> > >
> > > You will have to build a prototype and test.
> > >
> > > - Toke Eskildsen, State and University Library, Denmark
> > >
> > >
> > >
> >
>

Re: capacity of storage a single core

Posted by Susheel Kumar <su...@gmail.com>.
Hi Jack,

Just to add: the OS disk cache will still make queries performant even
though the entire index can't be loaded into memory. How much extra latency
there is compared to the index being completely loaded into memory will
vary with index size etc. I am trying to clarify this here because a lot of
folks take it as a hard guideline (fit the index into memory) and try to
come up with hardware/machines (hundreds of machines) just for the sake of
fitting the index into memory, even though there may not be much load/QPS
on the cluster. For example, this varies and needs to be tested case by
case, but a machine with 64GB should still provide good (if not the best)
performance for a 100GB index on that machine. Do you agree / any thoughts?

I believe the same is the case with replicas, as on a single machine you
may have replicas which themselves do not fit into memory along with the
shard index.

Thanks,
Susheel

On Tue, Dec 8, 2015 at 11:31 AM, Jack Krupansky <ja...@gmail.com>
wrote:

> Generally, you will be resource limited (memory, cpu) rather than by some
> arbitrary numeric limit (like 2 billion.)
>
> My personal general recommendation is for a practical limit is 100 million
> documents on a machine/node. Depending on your data model and actual data
> that number could be higher or lower. A proof of concept test will allow
> you to determine the actual number for your particular use case, but a
> presumed limit of 100 million is not a bad start.
>
> You should have enough memory to hold the entire index in system memory. If
> not, your query latency will suffer due to I/O required to constantly
> re-read portions of the index into memory.
>
> The practical limit for documents is not per core or number of cores but
> across all cores on the node since it is mostly a memory limit and the
> available CPU resources for accessing that memory.
>
> -- Jack Krupansky
>
> On Tue, Dec 8, 2015 at 8:57 AM, Toke Eskildsen <te...@statsbiblioteket.dk>
> wrote:
>
> > On Tue, 2015-12-08 at 05:18 -0700, Mugeesh Husain wrote:
> > > Capacity regarding 2 simple question:
> > >
> > > 1.) How many document we could store in single core(capacity of core
> > > storage)
> >
> > There is hard limit of 2 billion documents.
> >
> > > 2.) How many core we could create in a single server(single node
> cluster)
> >
> > There is no hard limit. Except for 2 billion cores, I guess. But at this
> > point in time that is a ridiculously high number of cores.
> >
> > It is hard to give a suggestion for real-world limits as indexes vary a
> > lot and the rules of thumb tend to be quite poor when scaling up.
> >
> >
> http://lucidworks.com/blog/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/
> >
> > People generally seems to run into problems with more than 1000
> > not-too-large cores. If the cores are large, there will probably be
> > performance problems long before that.
> >
> > You will have to build a prototype and test.
> >
> > - Toke Eskildsen, State and University Library, Denmark
> >
> >
> >
>

Re: capacity of storage a single core

Posted by Jack Krupansky <ja...@gmail.com>.
Generally, you will be resource limited (memory, cpu) rather than by some
arbitrary numeric limit (like 2 billion.)

My personal general recommendation is a practical limit of 100 million
documents per machine/node. Depending on your data model and actual data
that number could be higher or lower. A proof of concept test will allow
you to determine the actual number for your particular use case, but a
presumed limit of 100 million is not a bad start.

You should have enough memory to hold the entire index in system memory. If
not, your query latency will suffer due to I/O required to constantly
re-read portions of the index into memory.

The practical limit for documents is not per core or number of cores but
across all cores on the node since it is mostly a memory limit and the
available CPU resources for accessing that memory.
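
As a rough illustration of how that starting point translates into a first
hardware guess (all numbers below are hypothetical; the proof-of-concept
test is what produces the real figures):

    public class SizingStartingPoint {
        public static void main(String[] args) {
            long totalDocs = 500_000_000L;    // hypothetical corpus size
            long docsPerNode = 100_000_000L;  // practical starting point above
            long indexBytesPerDoc = 500L;     // measured from a PoC index

            long nodes = (totalDocs + docsPerNode - 1) / docsPerNode;
            long indexBytesPerNode = (totalDocs / nodes) * indexBytesPerDoc;

            System.out.println("Nodes to start testing with: " + nodes);
            System.out.printf("Index size per node        : ~%,d GB%n",
                    indexBytesPerNode >> 30);
            // RAM per node ~ JVM heap + enough OS memory to cache that index.
        }
    }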

-- Jack Krupansky

On Tue, Dec 8, 2015 at 8:57 AM, Toke Eskildsen <te...@statsbiblioteket.dk>
wrote:

> On Tue, 2015-12-08 at 05:18 -0700, Mugeesh Husain wrote:
> > Capacity regarding 2 simple question:
> >
> > 1.) How many document we could store in single core(capacity of core
> > storage)
>
> There is hard limit of 2 billion documents.
>
> > 2.) How many core we could create in a single server(single node cluster)
>
> There is no hard limit. Except for 2 billion cores, I guess. But at this
> point in time that is a ridiculously high number of cores.
>
> It is hard to give a suggestion for real-world limits as indexes vary a
> lot and the rules of thumb tend to be quite poor when scaling up.
>
> http://lucidworks.com/blog/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/
>
> People generally seems to run into problems with more than 1000
> not-too-large cores. If the cores are large, there will probably be
> performance problems long before that.
>
> You will have to build a prototype and test.
>
> - Toke Eskildsen, State and University Library, Denmark
>
>
>

Re: capacity of storage a single core

Posted by Toke Eskildsen <te...@statsbiblioteket.dk>.
On Tue, 2015-12-08 at 05:18 -0700, Mugeesh Husain wrote:
> Capacity regarding 2 simple question:
> 
> 1.) How many document we could store in single core(capacity of core
> storage)

There is a hard limit of 2 billion documents per core: Lucene addresses
documents within an index with signed 32-bit integers.
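
For context, a tiny sketch of what that ceiling means for a single core
(the current maxDoc of a real core can be read from its Luke handler at
/admin/luke; the 1.5 billion figure below is just a hypothetical
placeholder):

    public class CoreHeadroom {
        public static void main(String[] args) {
            // Lucene's actual ceiling is IndexWriter.MAX_DOCS, slightly below
            // Integer.MAX_VALUE; the JDK constant is used here so the snippet
            // runs without lucene-core on the classpath.
            long perCoreCeiling = Integer.MAX_VALUE;  // ~2.147 billion docs
            long currentMaxDoc = 1_500_000_000L;      // hypothetical core size

            System.out.printf("Docs in core : %,d%n", currentMaxDoc);
            System.out.printf("Hard ceiling : %,d%n", perCoreCeiling);
            System.out.printf("Headroom     : %,d docs (%.0f%% used)%n",
                    perCoreCeiling - currentMaxDoc,
                    100.0 * currentMaxDoc / perCoreCeiling);
        }
    }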

> 2.) How many core we could create in a single server(single node cluster)

There is no hard limit. Except for 2 billion cores, I guess. But at this
point in time that is a ridiculously high number of cores.

It is hard to give a suggestion for real-world limits as indexes vary a
lot and the rules of thumb tend to be quite poor when scaling up.
http://lucidworks.com/blog/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/

People generally seem to run into problems with more than 1000
not-too-large cores. If the cores are large, there will probably be
performance problems long before that.

You will have to build a prototype and test.

- Toke Eskildsen, State and University Library, Denmark