Posted to solr-user@lucene.apache.org by Rahul Warawdekar <ra...@gmail.com> on 2011/11/21 21:05:31 UTC

Re: Architecture and Capacity planning for large Solr index

Thanks !

My business requirements have changed a bit.
We need one year rolling data in Production.
The index size for the same comes to approximately 200 - 220 GB.
I am planning to address this using Solr distributed search as follows.

1. Whole index to be split up between 3 shards, with 3 masters and 6 slaves
(load balanced)
2. Master configuration
 will be 4 CPU


On Tue, Oct 11, 2011 at 2:05 PM, Otis Gospodnetic <
otis_gospodnetic@yahoo.com> wrote:

> Hi Rahul,
>
> This is unfortunately not enough information for anyone to give you very
> precise answers, so I'll just give some rough ones:
>
> * best disk - SSD :)
> * CPU - multicore, depends on query complexity, concurrency, etc.
> * sharded search and failover - start with SolrCloud, there are a couple
> of pages about it on the Wiki and
> http://blog.sematext.com/2011/09/14/solr-digest-spring-summer-2011-part-2-solr-cloud-and-near-real-time-search/
>
> Otis
> ----
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> Lucene ecosystem search :: http://search-lucene.com/
>
>
> >________________________________
> >From: Rahul Warawdekar <ra...@gmail.com>
> >To: solr-user <so...@lucene.apache.org>
> >Sent: Tuesday, October 11, 2011 11:47 AM
> >Subject: Architecture and Capacity planning for large Solr index
> >
> >Hi All,
> >
> >I am working on a Solr search based project, and would highly appreciate
> >help/suggestions from you all regarding Solr architecture and capacity
> >planning.
> >Details of the project are as follows
> >
> >1. There are 2 databases from which data needs to be indexed and made
> >searchable:
> >                - Production
> >                - Archive
> >2. The Production database will retain 6 months of data, archiving data
> >every month.
> >3. The Archive database will retain 3 years of data.
> >4. The database is SQL Server 2008 and the Solr version is 3.1.
> >
> >Data to be indexed contains a huge volume of attachments (PDF, Word,
> >Excel, etc.), approximately 200 GB per month.
> >We are planning to do a full index every month (multithreaded) and
> >incremental indexing on a daily basis.
> >The Solr index size is coming to approximately 25 GB per month.
> >
> >If we were to use distributed search, what would be the best configuration
> >for Production as well as Archive indexes ?
> >What would be the best CPU/RAM/Disk configuration ?
> >How can I implement failover mechanism for sharded searches ?
> >
> >Please let me know in case I need to share more information.
> >
> >
> >--
> >Thanks and Regards
> >Rahul A. Warawdekar
> >
> >
> >
>



-- 
Thanks and Regards
Rahul A. Warawdekar

Re: Architecture and Capacity planning for large Solr index

Posted by Erick Erickson <er...@gmail.com>.
Whether three shards will give you adequate throughput is not an
answerable question. Here's what I suggest. Get a single box
of the size you expect your servers to be and index 1/3 of your
documents on it. Run stress tests. That's really the only way to
be fairly sure your hardware is adequate.
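A stress run against that single box can be scripted along these lines; this is only a sketch, and the endpoint, queries, and concurrency figures are placeholders to adjust for your own setup:

```python
# Minimal sketch of a query stress test against one Solr box.
# The URL below is a placeholder; point it at your test server.
import time
import concurrent.futures
import urllib.parse
import urllib.request

SOLR_URL = "http://localhost:8983/solr/select"  # hypothetical endpoint

def run_query(q):
    """Issue one query and return its latency in seconds."""
    params = urllib.parse.urlencode({"q": q, "rows": 10})
    start = time.time()
    with urllib.request.urlopen(f"{SOLR_URL}?{params}") as resp:
        resp.read()
    return time.time() - start

def stress(queries, concurrency=10):
    """Fire the queries with the given concurrency; return sorted latencies."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=concurrency) as ex:
        return sorted(ex.map(run_query, queries))

def percentile(latencies, pct):
    """Latency at the given percentile of a sorted list (e.g. pct=95)."""
    idx = min(len(latencies) - 1, int(len(latencies) * pct / 100))
    return latencies[idx]
```

Feed it a realistic query log rather than synthetic queries, and watch the 95th-percentile latency as you raise the concurrency toward your expected peak load.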

As far as SANs are concerned, local storage is almost always
better. I'd advise against trying to share the index amongst
slaves, SAN or not. And using the SAN for each slave's copy
seems unnecessary with storage as cheap as it is; what
advantage do you see in this scenario?

Best
Erick

On Mon, Nov 21, 2011 at 3:18 PM, Rahul Warawdekar
<ra...@gmail.com> wrote:
> Thanks Otis !
> Please ignore my earlier email which does not have all the information.
>
> My business requirements have changed a bit.
> We now need one year of rolling data in Production, with the following details:
>    - Number of records -> 1.2 million
>    - Solr index size for these records comes to approximately 200 - 220
> GB. (includes large attachments)
>    - Approx 250 users who will be searching the application with a peak of
> 1 search request every 40 seconds.
>
> I am planning to address this using Solr distributed search on a VMWare
> virtualized environment as follows.
>
> 1. Whole index to be split up between 3 shards, with 3 masters and 6 slaves
> (load balanced)
>
> 2. Master configuration for each server is as follows
>    - 4 CPUs
>    - 16 GB RAM
>    - 300 GB disk space
>
> 3. Slave configuration for each server is as follows
>    - 4 CPUs
>    - 16 GB RAM
>    - 150 GB disk space
>
> 4. I am planning to use SAN instead of local storage to store Solr index.
>
> And my questions are as follows:
> Will 3 shards serve the purpose here ?
> Is SAN a good option for storing the Solr index, given the high index volume?
>
>
>
>
> --
> Thanks and Regards
> Rahul A. Warawdekar
>

Re: Architecture and Capacity planning for large Solr index

Posted by Rahul Warawdekar <ra...@gmail.com>.
Thanks Otis!
Please ignore my earlier email; it does not have all the information.

My business requirements have changed a bit.
We now need one year of rolling data in Production, with the following details:
    - Number of records -> 1.2 million
    - Solr index size for these records comes to approximately 200 - 220
GB. (includes large attachments)
    - Approx 250 users who will be searching the application with a peak of
1 search request every 40 seconds.

I am planning to address this using Solr distributed search on a VMWare
virtualized environment as follows.

1. Whole index to be split up between 3 shards, with 3 masters and 6 slaves
(load balanced)

2. Master configuration for each server is as follows
    - 4 CPUs
    - 16 GB RAM
    - 300 GB disk space

3. Slave configuration for each server is as follows
    - 4 CPUs
    - 16 GB RAM
    - 150 GB disk space

4. I am planning to use SAN instead of local storage to store Solr index.
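For the master/slave layout above, Solr's built-in replication (the ReplicationHandler in solrconfig.xml) is the usual way to keep the six slaves in sync with their masters. A sketch, with host names, poll interval, and conf file names as example values only:

```xml
<!-- On each master: publish the index after every commit. -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <str name="replicateAfter">commit</str>
    <str name="confFiles">schema.xml,stopwords.txt</str>
  </lst>
</requestHandler>

<!-- On each slave: poll its master for new index versions. -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <str name="masterUrl">http://master1:8983/solr/replication</str>
    <str name="pollInterval">00:05:00</str>
  </lst>
</requestHandler>
```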

And my questions are as follows:
Will 3 shards serve the purpose here ?
Is SAN a good option for storing the Solr index, given the high index volume?
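For reference, distributed search in Solr 3.x is driven by the client: each request carries a `shards` parameter listing the cores to fan out to. A minimal sketch of building such a request (the host names here are placeholders, not our real topology):

```python
# Sketch: building a Solr 3.x distributed-search URL with the "shards"
# parameter. Host names are hypothetical placeholders.
import urllib.parse

def sharded_query_url(base, shards, q, rows=10):
    """Build a distributed-search URL fanning out over the given shards."""
    params = urllib.parse.urlencode({
        "q": q,
        "rows": rows,
        # Comma-separated list of host:port/core entries, no scheme prefix.
        "shards": ",".join(shards),
    })
    return f"{base}/select?{params}"

url = sharded_query_url(
    "http://lb-host:8983/solr",
    ["slave1:8983/solr", "slave2:8983/solr", "slave3:8983/solr"],
    "contract attachment",
)
```

With a load balancer in front of each shard's slaves, the entries in the `shards` list would point at the balanced VIPs rather than individual slaves, which is also where failover for sharded searches comes from.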






-- 
Thanks and Regards
Rahul A. Warawdekar