Posted to solr-user@lucene.apache.org by souravm <SO...@infosys.com> on 2008/12/07 06:40:34 UTC

Limitations of Distributed Search ....

Hi,

We are planning to use Solr to process a large volume of application log files (around 10 billion documents, 5-6 TB in total).

One approach we are considering is to use Distributed Search extensively.

What we have in mind is distributing the log files across multiple boxes on a monthly or weekly basis. Even on a weekly basis, the volume can reach 200 million documents per box. A search query can span all weeks (e.g. the number of occurrences of a given txn in the first 6 months of a year).
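
To make the week-per-box layout concrete, here is a minimal sketch of how a query spanning several weeks could be addressed to the matching boxes through Solr's `shards` request parameter (a comma-separated list of `host:port/path` entries). The host naming scheme (`logs-<year>-w<week>`), the port, the `coordinator` address and the helper names are all assumptions made up for illustration; they are not from the original message.

```python
from datetime import date, timedelta
from urllib.parse import urlencode


def week_shard(d: date) -> str:
    """Map a date to its weekly shard address.

    Hypothetical naming scheme: one host per ISO week,
    e.g. logs-2008-w49:8983/solr.
    """
    iso_year, iso_week, _ = d.isocalendar()
    return f"logs-{iso_year}-w{iso_week:02d}:8983/solr"


def shards_for_range(start: date, end: date) -> list[str]:
    """Return the distinct weekly shards covering start..end (inclusive)."""
    shards: list[str] = []
    d = start
    while d <= end:
        shard = week_shard(d)
        if shard not in shards:
            shards.append(shard)
        d += timedelta(days=1)
    return shards


def distributed_query_url(coordinator: str, query: str,
                          start: date, end: date) -> str:
    """Build a select URL that fans the query out to the weekly shards.

    rows=0 asks only for the hit count, which is enough for a
    "how many times did this txn occur" style of question.
    """
    params = {
        "q": query,
        "shards": ",".join(shards_for_range(start, end)),
        "rows": 0,
    }
    return f"http://{coordinator}/select?{urlencode(params)}"


# Example: a query over the first six months of 2008.
url = distributed_query_url("coordinator:8983/solr", "txn:ABC123",
                            date(2008, 1, 1), date(2008, 6, 30))
```

With this scheme a six-month query touches roughly 27 boxes, so the length of the `shards` list grows directly with the time range being searched.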

However, we are not sure how well distributed search would scale when we use around 50-60 boxes to distribute the indexed documents on a weekly basis. The specific questions I have in mind are:

a) What would the performance impact be when a query spreads over 50 boxes?
b) Is there any hard limit on the number of slaves that can be contacted from the master server?
c) How much load will this type of approach create on the master server for merging data and keeping track of whether a slave is down?
d) Are there any other manageability issues with so many slaves?

If any of you have deployed Solr in such an environment, it would be great if you could share your experience.

Thanks in advance.

Regards,
Sourav




Re: Limitations of Distributed Search ....

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Hi,

I have not worked with a 50-node Solr cluster, but I've worked with pure Lucene clusters of that size, with very high query and data volumes.  I don't imagine a distributed search involving 50 nodes will be a problem for Solr.  As for handling query slave failures, I'm sure you'll want to involve a load balancer (LB) that can detect those, and have multiple replicas of each query node behind it for fail-over.
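
As a rough illustration of the fail-over idea above, the sketch below picks one live replica per shard before a query is sent. It only stands in for what a real load balancer would do. The function names, the replica addresses and the `is_alive` probe are hypothetical, and how liveness is actually checked is left to the caller.

```python
from typing import Callable, Optional, Sequence


def pick_replica(replicas: Sequence[str],
                 is_alive: Callable[[str], bool]) -> Optional[str]:
    """Return the first replica that passes the health probe, else None."""
    for replica in replicas:
        if is_alive(replica):
            return replica
    return None


def live_shards(shard_replicas: dict[str, Sequence[str]],
                is_alive: Callable[[str], bool]) -> tuple[list[str], list[str]]:
    """Choose one live replica per shard.

    Returns (chosen replica addresses, names of shards with no live replica).
    """
    chosen: list[str] = []
    unavailable: list[str] = []
    for shard, replicas in shard_replicas.items():
        replica = pick_replica(replicas, is_alive)
        if replica is None:
            unavailable.append(shard)
        else:
            chosen.append(replica)
    return chosen, unavailable


# Example with a stubbed health probe: two hosts are marked as down.
down = {"week49-a:8983/solr", "week50-a:8983/solr", "week50-b:8983/solr"}
replicas_by_shard = {
    "week49": ["week49-a:8983/solr", "week49-b:8983/solr"],
    "week50": ["week50-a:8983/solr", "week50-b:8983/solr"],
}
chosen, unavailable = live_shards(replicas_by_shard, lambda r: r not in down)
```

Here `week49` falls back to its second replica, while `week50` has no live replica and is reported as unavailable, so the caller can decide whether to fail the query or return partial results.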

As for the manageability, I think you'll find that management is really mostly on you - Solr doesn't provide tools for cluster / shard management.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


