Posted to solr-user@lucene.apache.org by Yury Kats <yu...@yahoo.com> on 2011/12/15 18:58:28 UTC

Core overhead

Does anybody have an idea, or better yet, measured data,
showing what the overhead of a core is, both in memory and in speed?

For example, what would be the difference between having 1 core
with 100M documents versus having 10 cores with 10M documents?

Re: Core overhead

Posted by Ted Dunning <te...@gmail.com>.
We still disagree.

On Fri, Dec 16, 2011 at 12:29 PM, Jason Rutherglen <
jason.rutherglen@gmail.com> wrote:

> Ted,
>
> The list would be unreadable if everyone spammed at the bottom their
> email like Otis'.  It's just bad form.
>
> Jason

Re: Core overhead

Posted by Chris Hostetter <ho...@fucit.org>.
: The list would be unreadable if everyone spammed at the bottom their
: email like Otis'.  It's just bad form.

If you'd like to debate project policy on what is/isn't acceptable on any 
of the Lucene mailing lists, please start a new thread on general@lucene 
(the list that exists precisely for the purpose of discussing meta-issues 
related to the Project/Community) instead of spamming the substantial 
solr-user@lucene subscriber base, who probably subscribed to this list 
because they were interested in getting emails about using Solr, not 
debating email etiquette.



-Hoss

Re: Core overhead

Posted by Jason Rutherglen <ja...@gmail.com>.
Ted,

The list would be unreadable if everyone spammed at the bottom their
email like Otis'.  It's just bad form.

Jason

On Fri, Dec 16, 2011 at 12:00 PM, Ted Dunning <te...@gmail.com> wrote:
> Sounds like we disagree.

Re: Core overhead

Posted by Ted Dunning <te...@gmail.com>.
Sounds like we disagree.

On Fri, Dec 16, 2011 at 11:56 AM, Jason Rutherglen <
jason.rutherglen@gmail.com> wrote:

> Ted,
>
> "...- FREE!" is stupid idiot spam.  It's annoying and not suitable.

Re: Core overhead

Posted by Jason Rutherglen <ja...@gmail.com>.
Ted,

"...- FREE!" is stupid idiot spam.  It's annoying and not suitable.

On Fri, Dec 16, 2011 at 11:45 AM, Ted Dunning <te...@gmail.com> wrote:
> I thought it was slightly clumsy, but it was informative.  It seemed like a
> fine thing to say.  Effectively it was "I/we have developed a tool that
> will help you solve your problem".  That is responsive to the OP and it is
> clear that it is a commercial deal.

Re: Core overhead

Posted by Ted Dunning <te...@gmail.com>.
I thought it was slightly clumsy, but it was informative.  It seemed like a
fine thing to say.  Effectively it was "I/we have developed a tool that
will help you solve your problem".  That is responsive to the OP and it is
clear that it is a commercial deal.

On Fri, Dec 16, 2011 at 10:02 AM, Jason Rutherglen <
jason.rutherglen@gmail.com> wrote:

> Wow, the shameless plugging of product (footer) has hit a new low, Otis.

Re: Core overhead

Posted by Jason Rutherglen <ja...@gmail.com>.
Wow, the shameless plugging of product (footer) has hit a new low, Otis.

On Fri, Dec 16, 2011 at 7:32 AM, Otis Gospodnetic
<ot...@yahoo.com> wrote:
> Hi Yury,
>
> Not sure if this was already covered in this thread, but with N smaller cores on a single N-CPU-core box you could run N queries in parallel over smaller indices, which may be faster than a single query going against a single big index, depending on how many concurrent query requests the box is handling (i.e. how busy or idle the CPU cores are).
>
> Otis
> ----
>
> Performance Monitoring SaaS for Solr - http://sematext.com/spm/solr-performance-monitoring/index.html

Re: Core overhead

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Hi Yury,

Not sure if this was already covered in this thread, but with N smaller cores on a single N-CPU-core box you could run N queries in parallel over smaller indices, which may be faster than a single query going against a single big index, depending on how many concurrent query requests the box is handling (i.e. how busy or idle the CPU cores are).
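
A minimal sketch of that idea with SolrJ (core names, thread count, and
the query are all made up; on older SolrJ releases the client class is
CommonsHttpSolrServer rather than HttpSolrServer):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class ParallelCoreSearch {
      public static void main(String[] args) throws Exception {
        final int n = 4;  // one sub-query per CPU core
        ExecutorService pool = Executors.newFixedThreadPool(n);
        List<Future<QueryResponse>> pending =
            new ArrayList<Future<QueryResponse>>();
        for (int i = 0; i < n; i++) {
          // made-up core names core0..core3
          final HttpSolrServer core =
              new HttpSolrServer("http://localhost:8983/solr/core" + i);
          pending.add(pool.submit(new Callable<QueryResponse>() {
            public QueryResponse call() throws Exception {
              // same query, but against a smaller index
              return core.query(new SolrQuery("ipod"));
            }
          }));
        }
        long hits = 0;
        for (Future<QueryResponse> f : pending) {
          hits += f.get().getResults().getNumFound();  // client-side merge
        }
        System.out.println("total hits across cores: " + hits);
        pool.shutdown();
      }
    }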

Otis
----

Performance Monitoring SaaS for Solr - http://sematext.com/spm/solr-performance-monitoring/index.html




Re: Core overhead

Posted by Ted Dunning <te...@gmail.com>.
Here is a talk I did on this topic at HPTS a few years ago.

On Thu, Dec 15, 2011 at 4:28 PM, Robert Petersen <ro...@buy.com> wrote:

> I see there is a lot of discussion about "micro-sharding"; I'll have to
> read up on it.  I'm on an older version of Solr and just use a master
> index replicating out to a farm of slaves.  Sharding always seemed to
> cause a lot of background traffic when I read about it, but I never
> tried it out.  Thanks for the heads-up on that topic...  :)

RE: Core overhead

Posted by Robert Petersen <ro...@buy.com>.
I see there is a lot of discussion about "micro-sharding"; I'll have to
read up on it.  I'm on an older version of Solr and just use a master
index replicating out to a farm of slaves.  Sharding always seemed to
cause a lot of background traffic when I read about it, but I never
tried it out.  Thanks for the heads-up on that topic...  :)

-----Original Message-----
From: Yury Kats [mailto:yurykats@yahoo.com] 
Sent: Thursday, December 15, 2011 2:16 PM
To: solr-user@lucene.apache.org
Subject: Re: Core overhead

On 12/15/2011 4:46 PM, Robert Petersen wrote:
> Sure, that is possible, but doesn't that defeat the purpose of sharding?
> Why distribute across one machine?  My thought is to just keep it all in
> one index in that case...

To be able to scale w/o re-indexing. Also often referred to as
"micro-sharding".

Re: Core overhead

Posted by Yury Kats <yu...@yahoo.com>.
On 12/15/2011 4:46 PM, Robert Petersen wrote:
> Sure, that is possible, but doesn't that defeat the purpose of sharding?
> Why distribute across one machine?  My thought is to just keep it all in
> one index in that case...

To be able to scale w/o re-indexing. Also often referred to as "micro-sharding".
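
Each shard is then just another core; as a rough sketch, a multi-core
node is declared in solr.xml along these lines (core names and paths are
made up):

    <solr persistent="true">
      <cores adminPath="/admin/cores">
        <core name="shard0" instanceDir="shard0" />
        <core name="shard1" instanceDir="shard1" />
      </cores>
    </solr>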

Re: Replication file become very very big

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Hi,

Hm, I don't know what this could be caused by.  But if you want to get rid of it, take that Linux server out of the load balancer pool, stop Solr, remove the index, and restart Solr.  Then force replication and put the server back in the load balancer pool.
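
One way to force replication is through the slave's ReplicationHandler,
e.g. (host name is a placeholder):

    curl 'http://slave-host:8983/solr/replication?command=fetchindex'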

If you use SPM (see link in my signature below) you will see how your indices grow (and shrink!) over time and will catch this problem when it happens next time by looking at the graph that shows info about your index - size on FS, # of segments, documents, etc.

Otis 
----
Performance Monitoring SaaS for Solr - http://sematext.com/spm/solr-performance-monitoring/index.html




Replication file become very very big

Posted by ZiLi <da...@163.com>.
Hi all,
    I've run into a very strange problem.
    We use a Windows server as the master, serving 5 Windows slaves and 3 Linux slaves.
    It has worked normally for 2 months, but today we found that one of the Linux slaves' index has become very, very big (150G! The others are around 300M). And we can't find the index folder under the data folder; there are just four entries: index.20111203090855 (150G), index.properties, replication.properties, and spellchecker. By the way, although the index is 150G, the service is normal and queries are very fast.
    Also, the Linux slaves poll the index from the server every 40 minutes, and every 15 minutes our program updates the server's Solr index.
    We disabled autoCommit in solrconfig.xml. Could this have caused the problem via some big transaction?
    Any suggestion will be appreciated.


RE: Core overhead

Posted by Robert Petersen <ro...@buy.com>.
Sure, that is possible, but doesn't that defeat the purpose of sharding?
Why distribute across one machine?  My thought is to just keep it all in
one index in that case...

-----Original Message-----
From: Yury Kats [mailto:yurykats@yahoo.com] 
Sent: Thursday, December 15, 2011 11:47 AM
To: solr-user@lucene.apache.org
Subject: Re: Core overhead

On 12/15/2011 1:41 PM, Robert Petersen wrote:
> loading.  Try it out, but make sure that the functionality you are
> actually looking for isn't sharding instead of multiple cores...  

Yes, but the way to achieve sharding is to have multiple cores.
The question then becomes: how many cores (shards)?

Re: Core overhead

Posted by Yury Kats <yu...@yahoo.com>.
On 12/15/2011 1:41 PM, Robert Petersen wrote:
> loading.  Try it out, but make sure that the functionality you are
> actually looking for isn't sharding instead of multiple cores...  

Yes, but the way to achieve sharding is to have multiple cores.
The question then becomes: how many cores (shards)?

RE: Core overhead

Posted by Robert Petersen <ro...@buy.com>.
I am running eight cores; each core serves up a different type of
search, so there is no overlap in their function.  Some cores have
millions of documents.  My search times are quite fast.  I don't see any
real slowdown from multiple cores, but you just have to have enough
memory for them; memory simply has to be big enough to hold what you are
loading.  Try it out, but make sure that the functionality you are
actually looking for isn't sharding instead of multiple cores...  

http://wiki.apache.org/solr/DistributedSearch
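
A distributed query then just lists the participating cores in the
shards parameter, along these lines (hosts and core names illustrative):

    http://localhost:8983/solr/shard0/select?q=ipod&shards=localhost:8983/solr/shard0,localhost:8983/solr/shard1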


-----Original Message-----
From: Yury Kats [mailto:yurykats@yahoo.com] 
Sent: Thursday, December 15, 2011 10:31 AM
To: solr-user@lucene.apache.org
Subject: Re: Core overhead

On 12/15/2011 1:07 PM, Robert Stewart wrote:

> I think overall memory usage would be close to the same.

Is this really so? I suspect that the consumed memory is in direct
proportion to the number of terms in the index. I also suspect that
if I divided 1 core with N terms into 10 smaller cores, each smaller
core would have much more than N/10 terms. Let's say I'm indexing
English texts, it's likely that all smaller cores would have almost
the same number of terms, close to the original N. Not so?

Re: Core overhead

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Hi,

I used to think this, too, but have learned that it's not entirely true.  We had a customer with a query rate of a few hundred QPS and 32 or 64 GB RAM (don't recall which any more) and a pretty large JVM heap.  Most queries were very fast, but once in a while a query would be very slow.  GC, we thought!  So the initial thinking was - must be that big heap of theirs.  But... long story short, instead of making the heap smaller we just tuned the JVM and took care of those slow queries.  Using SPM (link in sig) and seeing GC info (collection counts, times, heap size, etc.) was invaluable!
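
Tuning here means things like collector choice, heap sizing, and GC
logging so you can actually see the pauses; an illustrative command line
(values made up, not a recommendation):

    java -Xms8g -Xmx8g -XX:+UseConcMarkSweepGC -XX:+UseParNewGC \
         -verbose:gc -XX:+PrintGCDetails -Xloggc:gc.log -jar start.jar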

Otis
----

Performance Monitoring SaaS for Solr - http://sematext.com/spm/solr-performance-monitoring/index.html - FREE!



>________________________________
> From: Robert Stewart <bs...@gmail.com>
>To: solr-user@lucene.apache.org 
>Sent: Thursday, December 15, 2011 2:16 PM
>Subject: Re: Core overhead
> 
>One other thing I did not mention is GC pauses.  If you have smaller
>heap sizes, you will have fewer very long GC pauses, so that can be an
>advantage of having many cores (if the cores are distributed into
>separate SOLR instances, as separate processes).  I think you can expect
>a 1-second pause for each GB of heap size in the worst case.

Re: Core overhead

Posted by Robert Stewart <bs...@gmail.com>.
One other thing I did not mention is GC pauses.  If you have smaller
heap sizes, you will have fewer very long GC pauses, so that can be an
advantage of having many cores (if the cores are distributed into
separate SOLR instances, as separate processes).  I think you can expect
a 1-second pause for each GB of heap size in the worst case.



On Thu, Dec 15, 2011 at 2:14 PM, Robert Stewart <bs...@gmail.com> wrote:
> It is true that the number of terms may be much more than N/10 (or even
> close to N for each core), but it is the number of docs per term that
> really matters.  So you can have N terms in each core, but each term
> has 1/10 the number of docs on average.

Re: Core overhead

Posted by Robert Stewart <bs...@gmail.com>.
It is true that the number of terms may be much more than N/10 (or even
close to N for each core), but it is the number of docs per term that
really matters.  So you can have N terms in each core, but each term
has 1/10 the number of docs on average.




2011/12/15 Yury Kats <yu...@yahoo.com>:
> On 12/15/2011 1:07 PM, Robert Stewart wrote:
>
>> I think overall memory usage would be close to the same.
>
> Is this really so? I suspect that the consumed memory is in direct
> proportion to the number of terms in the index. I also suspect that
> if I divided 1 core with N terms into 10 smaller cores, each smaller
> core would have much more than N/10 terms. Let's say I'm indexing
> English texts, it's likely that all smaller cores would have almost
> the same number of terms, close to the original N. Not so?

Re: Core overhead

Posted by Yury Kats <yu...@yahoo.com>.
On 12/15/2011 1:07 PM, Robert Stewart wrote:

> I think overall memory usage would be close to the same.

Is this really so? I suspect that the consumed memory is in direct
proportion to the number of terms in the index. I also suspect that
if I divided 1 core with N terms into 10 smaller cores, each smaller
core would have much more than N/10 terms. Let's say I'm indexing
English texts, it's likely that all smaller cores would have almost
the same number of terms, close to the original N. Not so?
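
(A rough way to quantify this is Heaps' law, V(n) ~ K * n^b, with b
commonly around 0.4-0.6 for English text.  Under that assumption each
1/10-size core keeps much more than 1/10 of the vocabulary, though not
quite all of it:

    V(n/10) / V(n) = (1/10)^b ~ 0.25-0.40   for b in 0.4-0.6

so the 10 cores together would hold roughly 2.5-4x the unique terms of
the single big index.)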

Re: Core overhead

Posted by Robert Stewart <bs...@gmail.com>.
I don't have any measured data, but here are my thoughts.

I think overall memory usage would be close to the same.
Speed will be slower in general: if search speed is approximately log(n),
then 10 * log(n/10) > log(n).  There is also overhead in the merge step
when combining results, and again when fetching results beyond the first
page, since you generally need page_size * page_number candidates from
each core.  Of course, if you search many cores in parallel over many
CPU cores, you can mitigate that overhead.  There are other
considerations, such as caching - for example, if you are adding new
documents to one core only, the other cores get to keep their filter
caches, etc. in RAM much longer than if you are always committing to one
single large core.  And if you have some client logic to pick a subset
of cores based on query data (such as only searching newer cores), you
could end up with faster search over many cores.
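
To put rough numbers on the log(n) point (base-2 logs, cores searched
sequentially in the worst case):

    1 core   x 100M docs:  log2(100M)     ~ 26.6
    10 cores x 10M docs:   10 * log2(10M) ~ 10 * 23.3 = 233

That is roughly 9x the comparisons, unless the cores are searched in
parallel.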


2011/12/15 Yury Kats <yu...@yahoo.com>:
> Does anybody have an idea, or better yet, measured data,
> showing what the overhead of a core is, both in memory and in speed?
>
> For example, what would be the difference between having 1 core
> with 100M documents versus having 10 cores with 10M documents?