Posted to solr-user@lucene.apache.org by Willie Wong <Wi...@METALOGIC-INC.COM> on 2008/07/01 17:37:12 UTC

Solr Capabilities/Limitations

Hi y’all,

I’m a newbie to Solr, and was looking for advice on whether Solr is the 
best choice for this project. 

I need to be able to search through terabytes of existing data.  Documents 
may vary in size from 20 KB to 10 MB.  Also, at some point I’ll need to 
feed in approximately 1-5 million new documents a day. 

With this in mind…

Has anyone used Solr to conduct searches over terabytes of data?  If so, 
are there any configuration parameters I should pay particular attention 
to, such as JVM size, mergeFactor, etc.?

Is there a limit to the number of shards Solr is capable of?  I don’t 
think there’s any way I can do this without some sort of distributed 
search.

I’ve read that Solr indexes can go into the millions if not billions of 
documents… however, at what point does the index size become impractical?  
I know this is a bit open-ended, but does Solr have a limit on the number 
of documents that can be in a single index? 

Has anyone looked into any of these other search engines, and are there any 
other search engines that would be better suited, such as Fast or Autonomy:
http://mg4j.dsi.unimi.it/
http://www.egothor.org/performance.shtml

I know I asked quite a bit in this post, but any help/suggestions would be 
much appreciated.



Regards,

Willie

Re: Solr Capabilities/Limitations

Posted by Erik Hatcher <er...@ehatchersolutions.com>.

Willie,

Yes, Solr has "multi core" support: <http://wiki.apache.org/solr/MultiCore>

	Erik
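
A minimal sketch of how a client might address separate cores, assuming a multi-core Solr at a hypothetical base URL with two hypothetical core names (`archive` and `daily`). Each core keeps its own index and is queried through its own path under the base URL; the helper name is made up for illustration.

```python
from urllib.parse import urlencode


def core_select_url(base_url, core, query, rows=10):
    """Build a select URL for one named core (all names here are illustrative)."""
    params = urlencode({"q": query, "rows": rows})
    return f"{base_url.rstrip('/')}/{core}/select?{params}"


# Same Solr instance, two different indexes, no restart in between.
archive_url = core_select_url("http://localhost:8983/solr", "archive", "title:lucene")
daily_url = core_select_url("http://localhost:8983/solr", "daily", "title:lucene")
print(archive_url)
print(daily_url)
```

Where each core's index lives on disk is set in that core's configuration, so the two cores can sit on different drives.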



On Jul 2, 2008, at 1:15 PM, Willie Wong wrote:

> Thanks, Mike, for your quick response; it was very informative and
> useful.
>
> I have one final question if you don't mind: is it possible for a
> single Solr instance to switch between multiple indexes?  For example,
> can Solr search in one index on one server partition, then use another
> index located on another drive, without requiring a restart?  This
> differs slightly from the distributed search examples I've read in the
> documentation, where you have another server running Solr with the
> distributed index.
>
>
> Thanks,
>
> Willie

Re: Solr Capabilities/Limitations

Posted by Willie Wong <Wi...@METALOGIC-INC.COM>.

Thanks, Mike, for your quick response; it was very informative and 
useful.

I have one final question if you don't mind: is it possible for a 
single Solr instance to switch between multiple indexes?  For example, can 
Solr search in one index on one server partition, then use another index 
located on another drive, without requiring a restart?  This differs 
slightly from the distributed search examples I've read in the 
documentation, where you have another server running Solr with the 
distributed index. 

Thanks,

Willie


Re: Solr Capabilities/Limitations

Posted by Mike Klaas <mi...@gmail.com>.

On 1-Jul-08, at 8:37 AM, Willie Wong wrote:

> I need to be able to search through terabytes of existing data.
> Documents may vary in size from 20 KB to 10 MB.  Also, at some point
> I’ll need to feed in approximately 1-5 million new documents a day.

This depends greatly on what kind of searching you want to do and  
what the desired response times are.  I'm using Solr to full-text  
search about 10 TB of data at the moment.  Response times are around  
1s, including dynamic snippet generation.  The queries themselves are  
relatively complicated by Lucene standards, including a custom  
word-proximity boosting query and link-analysis factors.

Of course, this is distributed over dozens of machines, and is a  
mostly static index.  There are about 10 million docs per server.

> Has anyone used Solr to conduct searches over terabytes of data?  If
> so, are there any configuration parameters I should pay particular
> attention to, such as JVM size, mergeFactor, etc.?

JVM size will depend mostly on your sorting/faceting requirements.   
Just remember to leave gobs of memory for the OS disk cache.  Memory  
is key to serving large indices (consequently, things won't be fast  
until a decent amount of warming up is done).  mergeFactor?  You  
should only be searching optimized indices of this size, so it isn't  
terribly relevant.  The daily new docs should probably be added in  
their own index, which is then searched in parallel with the existing  
indices.
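
A rough sketch of the "search the daily index in parallel with the existing indices" idea, assuming Solr's distributed search is used and each index is exposed as its own shard. The `shards` request parameter is Solr's; the host names and the helper are hypothetical.

```python
from urllib.parse import urlencode


def distributed_select_url(coordinator, shards, query, rows=10):
    """Build a distributed query URL; `shards` lists host:port/path entries."""
    params = urlencode(
        {"q": query, "rows": rows, "shards": ",".join(shards)},
        safe=":/,",  # keep the shard list readable instead of percent-encoding it
    )
    return f"http://{coordinator}/select?{params}"


# One optimized, mostly static archive shard plus a separate shard for new docs.
url = distributed_select_url(
    "search1:8983/solr",
    ["archive1:8983/solr", "daily1:8983/solr"],
    "solr",
)
print(url)
```

The node that receives the request fans the query out to every listed shard and merges the results, so new documents can go into the small daily index without touching the optimized ones.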

> Is there a limit to the number of shards Solr is capable of?  I don’t
> think there’s any way I can do this without some sort of distributed
> search.

Not really, though you will want to move to a 2-level hierarchy  
eventually.  I can't speak for the distributed search implementation  
in trunk (we built our own before this was available), but it should  
be exactly what you need.
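
One possible reading of a "2-level hierarchy", sketched under assumptions: leaf shards return hits sorted by descending score, mid-tier aggregators each merge a group of leaves, and a root node merges the aggregators. This is an illustration only; it is neither the custom system described above nor Solr's implementation, and all names are hypothetical.

```python
import heapq
from itertools import islice


def merge_top_k(result_lists, k):
    """Merge lists of (score, doc_id) hits, each sorted by descending score."""
    merged = heapq.merge(*result_lists, key=lambda hit: -hit[0])
    return list(islice(merged, k))


def two_level_search(groups, k):
    """groups: one entry per mid-tier aggregator, each a list of leaf result lists."""
    # Level 1: every aggregator merges only its own leaf shards.
    partials = [merge_top_k(leaf_results, k) for leaf_results in groups]
    # Level 2: the root merges the already-merged partial lists.
    return merge_top_k(partials, k)


groups = [
    [[(0.9, "a"), (0.4, "b")], [(0.8, "c")]],
    [[(0.95, "d"), (0.1, "e")], [(0.5, "f")]],
]
print(two_level_search(groups, 3))
```

The point of the extra level is that no single node has to talk to every shard: the root handles a few aggregators, and each aggregator handles a few leaves.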

> I’ve read that Solr indexes can go into the millions if not billions
> of documents… however, at what point does the index size become
> impractical?  I know this is a bit open-ended, but does Solr have a
> limit on the number of documents that can be in a single index?

Depends on query composition and document size.  But for web docs,  
about 10 million docs per index seems practical.
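
Back-of-the-envelope arithmetic using only the figures in this thread (about 10 million docs per index, 1-5 million new docs a day). The function names are hypothetical, and the one-year horizon is an assumption chosen for illustration.

```python
import math


def days_to_fill_index(docs_per_index, docs_per_day):
    """Days until one index reaches the practical per-index document count."""
    return docs_per_index / docs_per_day


def indices_needed(total_docs, docs_per_index):
    """Number of indices required to hold total_docs at the given per-index cap."""
    return math.ceil(total_docs / docs_per_index)


DOCS_PER_INDEX = 10_000_000

print(days_to_fill_index(DOCS_PER_INDEX, 5_000_000))  # high end of the feed rate
print(days_to_fill_index(DOCS_PER_INDEX, 1_000_000))  # low end of the feed rate
print(indices_needed(365 * 5_000_000, DOCS_PER_INDEX))  # one year at the high end
```

At these rates a fresh index fills within days, which is why the number of shards, rather than the size of any single index, ends up being the planning variable.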

> Has anyone looked into any of these other search engines, and are
> there any other search engines that would be better suited, such as
> Fast or Autonomy:
> http://mg4j.dsi.unimi.it/
> http://www.egothor.org/performance.shtml

I haven't, but it should be possible to build a system based on those  
engines.  For a system this size, the distributed architecture will be  
more important than the underlying index engine (though it sure helps  
to use an engine as optimized as Lucene).

-Mike