Posted to solr-user@lucene.apache.org by Kevin Osborn <os...@yahoo.com> on 2007/03/28 00:08:02 UTC

maximum index size

I know there are a bunch of variables here (RAM, number of fields, hits, etc.), but I am trying to get a sense of how big an index, in terms of number of documents, Solr can reasonably handle. I have heard of indexes of 3-4 million documents running fine. But I have no idea what a reasonable upper limit might be.

I have a large number of documents, and about 200-300 customers would have access to varying subsets of those documents. So, one possible strategy is to have everything in one large index, but duplicate each document for every customer that has access to it. But that would really make the total number of documents huge. So, I am trying to get a sense of how big is too big. Each document will probably have about 30 fields. Most of them will be strings, but there will be some text, ints, and floats.

An extension to this strategy is to segment the customers among various instances of Solr.
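The segmentation idea above, routing each customer to one of several Solr instances, could be sketched roughly like this (the instance URLs and customer IDs are hypothetical, not from the thread):

```python
import hashlib

# Hypothetical searcher URLs; replace with your real instances.
SOLR_INSTANCES = [
    "http://solr1:8983/solr",
    "http://solr2:8983/solr",
    "http://solr3:8983/solr",
]

def instance_for_customer(customer_id: str) -> str:
    """Map a customer to a Solr instance with a stable hash,
    so the same customer always queries the same index."""
    digest = hashlib.md5(customer_id.encode("utf-8")).hexdigest()
    return SOLR_INSTANCES[int(digest, 16) % len(SOLR_INSTANCES)]
```

A stable hash avoids keeping an explicit customer-to-instance table, at the cost of making rebalancing harder when instances are added.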


Re: maximum index size

Posted by Venkatesh Seetharam <vs...@gmail.com>.
Hi Andre,

Comments are inline.

> What hardware are you running?
4 Dual-proc 64 GB blades for each searcher and a broker that merges results
on 64 bit SUSE linux running JDK 1.6 with 8GB Heap.

> Do you use collection distribution?
Nope. I use hadoop to index the documents.

Thanks,
Venkatesh
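The broker-merges-results setup described above might look roughly like this (hypothetical data shapes; each shard is assumed to return hits already sorted by descending score):

```python
import heapq

def merge_results(shard_results, top_k=10):
    """Broker-side merge: each shard returns (doc_id, score) hits
    sorted by descending score; interleave them into one top-k list."""
    merged = heapq.merge(*shard_results, key=lambda hit: -hit[1])
    return list(merged)[:top_k]
```

Because each shard's list is already sorted, the merge is linear in the number of hits inspected rather than requiring a full re-sort.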

On 3/27/07, Andre Basse <AB...@theage.com.au> wrote:
>
> >I've 50 million documents each about 10K in size and I've 4 index
> partitions each consisting of 12.5 million documents. Each index
> partition is about 80GB. A search typically takes about 3-5 seconds.
> Single word searches are faster than multi-word searches. I'm still
> working on finding the ideal index size that Solr can handle well
> within a second.
>
> Hi Venkatesh,
>
> I'm looking at a similar size of archive. What hardware are you running?
> Do you use collection distribution?
>
>
> Thanks,
>
> Andre
>

RE: maximum index size

Posted by Andre Basse <AB...@theage.com.au>.
>I've 50 million documents each about 10K in size and I've 4 index
partitions each consisting of 12.5 million documents. Each index
partition is about 80GB. A search typically takes about 3-5 seconds.
Single word searches are faster than multi-word searches. I'm still
working on finding the ideal index size that Solr can handle well
within a second.

Hi Venkatesh,

I'm looking at a similar size of archive. What hardware are you running?
Do you use collection distribution?


Thanks,

Andre


The information contained in this e-mail message and any accompanying files is or may be confidential. If you are not the intended recipient, any use, dissemination, reliance, forwarding, printing or copying of this e-mail or any attached files is unauthorised. This e-mail is subject to copyright. No part of it should be reproduced, adapted or communicated without the written consent of the copyright owner. If you have received this e-mail in error please advise the sender immediately by return e-mail or telephone and delete all copies. Fairfax does not guarantee the accuracy or completeness of any information contained in this e-mail or attached files. Internet communications are not secure, therefore Fairfax does not accept legal responsibility for the contents of this message or attached files.

Re: maximum index size

Posted by Venkatesh Seetharam <vs...@gmail.com>.
I've 50 million documents each about 10K in size and I've 4 index partitions
each consisting of 12.5 million documents. Each index partition is about
80GB. A search typically takes about 3-5 seconds. Single word searches are
faster than multi-word searches. I'm still working on finding the ideal
index size that Solr can handle well within a second.

Thanks,
Venkatesh

On 3/27/07, Kevin Osborn <os...@yahoo.com> wrote:
>
> I know there are a bunch of variables here (RAM, number of fields, hits,
> etc.), but I am trying to get a sense of how big of an index in terms of
> number of documents Solr can reasonably handle. I have heard of indexes of 3-4
> million documents running fine. But, I have no idea what a reasonable upper
> limit might be.
>
> I have a large number of documents and about 200-300 customers would have
> access to varying subsets of those documents. So, one possible strategy is
> to have everything in a large index, but duplicate the documents for each
> customer that has access to that document. But, that would really make the
> total number of documents huge. So, I am trying to get a sense of how big is
> too big. Each document will probably have about 30 fields. Most of them will
> be strings, but there will be some text, ints, and floats.
>
> An extension to this strategy is to segment the customers among various
> instances of Solr.
>
>

Re: maximum index size

Posted by James liu <li...@gmail.com>.
If you want to do this with a single index, I think you will already know
how by the time you read this mail.

I think you could divide the index into several shards (I am not sure of
the exact term). Each shard would have one master and several slaves if
you use Solr, and one request would become several queries, one per shard.
That reduces the index file size and the indexing time.
But it loses some search performance... maybe that can be fixed with more
PCs or servers.

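The sharding idea above depends on assigning each document to exactly one shard; a minimal sketch (hypothetical function, assuming documents are routed by a stable hash of their IDs):

```python
import zlib

def shard_for_doc(doc_id: str, num_shards: int) -> int:
    """Assign a document to a shard by a stable hash of its ID,
    keeping each partition roughly 1/num_shards of the corpus."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards
```

Stable routing matters for updates: re-indexing a document must land it on the same shard, or the old copy lingers elsewhere.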

2007/3/30, Chris Hostetter <ho...@fucit.org>:
>
>
> : I'd be interested to know what is the ideal size for an index to achieve
> 1
> : sec response time for queries. I'd appreciate if you can share any
> numbers.
>
> that's a fairly impossible question to answer ... the lucene email
> archives have lots of discussion about how the number of documents isn't
> really the biggest factor when considering raw search performance ... the
> number of unique terms in the index and the average number of terms per
> document are typically more significant factors.
>
> there's also the question of what you mean by a "query" .. a simple term
> query is a lot cheaper/faster than a complex boolean query or a phrase
> query.
>
>
>
>
> -Hoss
>
>


-- 
regards
jl

Re: maximum index size

Posted by Chris Hostetter <ho...@fucit.org>.
: I'd be interested to know what is the ideal size for an index to achieve 1
: sec response time for queries. I'd appreciate if you can share any numbers.

that's a fairly impossible question to answer ... the lucene email
archives have lots of discussion about how the number of documents isn't
really the biggest factor when considering raw search performance ... the
number of unique terms in the index and the average number of terms per
document are typically more significant factors.

there's also the question of what you mean by a "query" .. a simple term
query is a lot cheaper/faster than a complex boolean query or a phrase
query.




-Hoss
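The point above, that unique terms and average terms per document matter more than raw document count, suggests measuring those statistics for your own corpus; a naive sketch (whitespace tokenization, illustrative only):

```python
from collections import Counter

def term_stats(docs):
    """Return (unique_term_count, avg_terms_per_doc) for a corpus,
    using naive lowercase whitespace tokenization."""
    vocab = Counter()
    total_terms = 0
    for doc in docs:
        terms = doc.lower().split()
        vocab.update(terms)
        total_terms += len(terms)
    return len(vocab), total_terms / len(docs)
```

A real estimate should use the same analyzer chain as the index, since stemming and stop words change both numbers substantially.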


Re: maximum index size

Posted by Venkatesh Seetharam <vs...@gmail.com>.
Hi Mike,

I'd be interested to know what is the ideal size for an index to achieve 1
sec response time for queries. I'd appreciate if you can share any numbers.

Thanks,
Venkatesh

On 3/27/07, Mike Klaas <mi...@gmail.com> wrote:
>
> On 3/27/07, Kevin Osborn <os...@yahoo.com> wrote:
> > I know there are a bunch of variables here (RAM, number of fields, hits,
> etc.), but I am trying to get a sense of how big of an index in terms of
> number of documents Solr can reasonably handle. I have heard of indexes of 3-4
> million documents running fine. But, I have no idea what a reasonable upper
> limit might be.
>
> People have constructed (lucene) indices with over a billion
> documents.  But if "reasonable" means something like "<1s query time
> for a medium-complexity query on non-astronomical hardware", I
> wouldn't go much higher than the figure you quote.
>
> > I have a large number of documents and about 200-300 customers would
> have access to varying subsets of those documents. So, one possible strategy
> is to have everything in a large index, but duplicate the documents for each
> customer that has access to that document. But, that would really make the
> total number of documents huge. So, I am trying to get a sense of how big is
> too big. Each document will probably have about 30 fields. Most of them will
> be strings, but there will be some text, ints, and floats.
>
> If you are going to store a document for each customer then some field
> must indicate to which customer the document instance belongs.  In
> that case, why not index a single copy of each document, with a field
> containing a list of customers having access?
>
> -Mike
>

Re: maximum index size

Posted by Mike Klaas <mi...@gmail.com>.
On 3/27/07, Kevin Osborn <os...@yahoo.com> wrote:
> I know there are a bunch of variables here (RAM, number of fields, hits, etc.), but I am trying to get a sense of how big an index, in terms of number of documents, Solr can reasonably handle. I have heard of indexes of 3-4 million documents running fine. But I have no idea what a reasonable upper limit might be.

People have constructed (lucene) indices with over a billion
documents.  But if "reasonable" means something like "<1s query time
for a medium-complexity query on non-astronomical hardware", I
wouldn't go much higher than the figure you quote.

> I have a large number of documents, and about 200-300 customers would have access to varying subsets of those documents. So, one possible strategy is to have everything in one large index, but duplicate each document for every customer that has access to it. But that would really make the total number of documents huge. So, I am trying to get a sense of how big is too big. Each document will probably have about 30 fields. Most of them will be strings, but there will be some text, ints, and floats.

If you are going to store a document for each customer then some field
must indicate to which customer the document instance belongs.  In
that case, why not index a single copy of each document, with a field
containing a list of customers having access?

-Mike
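The single-copy approach above maps naturally onto a Solr filter query; a minimal sketch (the multivalued `customers` field name is hypothetical, though `q` and `fq` are real Solr request parameters):

```python
def solr_query_params(user_query: str, customer_id: str) -> dict:
    """Build Solr request parameters that restrict hits to documents
    visible to one customer. 'fq' is Solr's filter-query parameter;
    the multivalued 'customers' field is a hypothetical schema field."""
    return {"q": user_query, "fq": f"customers:{customer_id}"}
```

Using `fq` rather than folding the restriction into `q` also lets Solr cache the per-customer filter independently of the user's query.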