Posted to solr-user@lucene.apache.org by Travis Low <tl...@4centurion.com> on 2011/10/11 14:36:25 UTC

capacity planning

Greetings.  I have a paltry 23,000 database records that point to a
voluminous 300GB worth of PDF, Word, Excel, and other documents.  We are
planning on indexing the records and the documents they point to.  I have no
clue on how we can calculate what kind of server we need for this.  I
imagine the index isn't going to be bigger than the documents (is it?) so I
suppose 1TB is a starting point for disk space.  But what kind of processing
power and memory might we need?  Can anyone please point me in the right
direction?

cheers,

Travis

-- 

Travis Low, Director of Development <tl...@4centurion.com>
Centurion Research Solutions, LLC
14048 ParkEast Circle • Suite 100 • Chantilly, VA 20151
703-956-6276 • 703-378-4474 (fax)
http://www.centurionresearch.com


Re: capacity planning

Posted by Travis Low <tl...@4centurion.com>.
Thanks, Erik!  We probably won't use highlighting.  Also, documents are
added but *never* deleted.

Does anyone have comments about memory and CPU resources required for
indexing the 300GB of documents in a "reasonable" amount of time?  It's okay
if the initial indexing takes hours or maybe even days, but not too many
days.  Do we need 16GB of memory?  32GB?  8-core processor?  I have zero
sense of server requirements and I would appreciate any guidance.

Do I need to be concerned about performance/resources later, when adding
documents to an existing (large) index?

cheers,

Travis

On Tue, Oct 11, 2011 at 9:49 AM, Erik Hatcher <er...@gmail.com> wrote:

> Travis -
>
> Whether the index is bigger than the original content depends on what you
> need to do with it in Solr.  One of the primary deciding factors is if you
> need to use highlighting, which currently requires the fields to be
> highlighted be stored.  Stored fields will take up about the same space as
> the original documents (text-wise, likely a bit smaller than, say, the
> actual Word doc itself).  If you don't need highlighting or the contents
> stored for other purposes, then you'll have a dramatically smaller index
> than the original (roughly 35% the size, generally).
>
>        Erik
>
>
> On Oct 11, 2011, at 08:36 , Travis Low wrote:
>
> > Greetings.  I have a paltry 23,000 database records that point to a
> > voluminous 300GB worth of PDF, Word, Excel, and other documents.  We are
> > planning on indexing the records and the documents they point to.  I have
> no
> > clue on how we can calculate what kind of server we need for this.  I
> > imagine the index isn't going to be bigger than the documents (is it?) so
> I
> > suppose 1TB is a starting point for disk space.  But what kind of
> processing
> > power and memory might we need?  Can anyone please point me in the right
> > direction?
>
>



Re: capacity planning

Posted by Paul Libbrecht <pa...@hoplahup.net>.
My experience was 10% of the size.

On 11 Oct 2011, at 15:49, Erik Hatcher wrote:

> (roughly 35% the size, generally).


Re: capacity planning

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Hi,

To add to what Erik wrote: keep in mind that you can compress data before indexing/storing it in Solr. So, assuming those PDFs are not compressed under the hood, the resulting index may be smaller than the raw PDFs even if you store your fields for highlighting or other purposes, provided you compress the fields.
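
Just to illustrate the idea (the field names are hypothetical, and this assumes you run the text extraction yourself on the client and define a stored binary field in schema.xml for the compressed copy), a rough SolrJ sketch:

import java.io.ByteArrayOutputStream;
import java.util.zip.GZIPOutputStream;
import org.apache.solr.common.SolrInputDocument;

public class CompressedStoreExample {

    // "content" is assumed indexed but not stored; "content_gz" is assumed to be
    // a stored binary field in schema.xml.  Both names are made up for this sketch.
    static SolrInputDocument buildDoc(String id, String extractedText) throws Exception {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        GZIPOutputStream gzip = new GZIPOutputStream(buf);
        gzip.write(extractedText.getBytes("UTF-8"));
        gzip.close();

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", id);
        doc.addField("content", extractedText);         // indexed for search only
        doc.addField("content_gz", buf.toByteArray());  // stored, compressed copy
        return doc;
    }
}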

Otis
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/


>________________________________
>From: Erik Hatcher <er...@gmail.com>
>To: solr-user@lucene.apache.org
>Sent: Tuesday, October 11, 2011 9:49 AM
>Subject: Re: capacity planning
>
>Travis -
>
>Whether the index is bigger than the original content depends on what you need to do with it in Solr.  One of the primary deciding factors is if you need to use highlighting, which currently requires the fields to be highlighted be stored.  Stored fields will take up about the same space as the original documents (text-wise, likely a bit smaller than, say, the actual Word doc itself).  If you don't need highlighting or the contents stored for other purposes, then you'll have a dramatically smaller index than the original (roughly 35% the size, generally).
>
>    Erik
>
>
>On Oct 11, 2011, at 08:36 , Travis Low wrote:
>
>> Greetings.  I have a paltry 23,000 database records that point to a
>> voluminous 300GB worth of PDF, Word, Excel, and other documents.  We are
>> planning on indexing the records and the documents they point to.  I have no
>> clue on how we can calculate what kind of server we need for this.  I
>> imagine the index isn't going to be bigger than the documents (is it?) so I
>> suppose 1TB is a starting point for disk space.  But what kind of processing
>> power and memory might we need?  Can anyone please point me in the right
>> direction?
>
>
>
>

Re: capacity planning

Posted by Erik Hatcher <er...@gmail.com>.
Travis -

Whether the index is bigger than the original content depends on what you need to do with it in Solr.  One of the primary deciding factors is if you need to use highlighting, which currently requires the fields to be highlighted be stored.  Stored fields will take up about the same space as the original documents (text-wise, likely a bit smaller than, say, the actual Word doc itself).  If you don't need highlighting or the contents stored for other purposes, then you'll have a dramatically smaller index than the original (roughly 35% the size, generally).
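
As a rough illustration of the query side with SolrJ (the URL and the "content" field are placeholders; whichever field you highlight has to be stored="true" in schema.xml):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class HighlightQueryExample {
    public static void main(String[] args) throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

        SolrQuery query = new SolrQuery("contract terms");
        query.setHighlight(true);             // highlighting needs a stored field
        query.addHighlightField("content");   // hypothetical stored text field
        query.setHighlightSnippets(2);

        QueryResponse response = server.query(query);
        System.out.println(response.getHighlighting());
    }
}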

	Erik


On Oct 11, 2011, at 08:36 , Travis Low wrote:

> Greetings.  I have a paltry 23,000 database records that point to a
> voluminous 300GB worth of PDF, Word, Excel, and other documents.  We are
> planning on indexing the records and the documents they point to.  I have no
> clue on how we can calculate what kind of server we need for this.  I
> imagine the index isn't going to be bigger than the documents (is it?) so I
> suppose 1TB is a starting point for disk space.  But what kind of processing
> power and memory might we need?  Can anyone please point me in the right
> direction?


Re: capacity planning

Posted by Shawn Heisey <so...@elyograg.org>.
On 10/11/2011 11:49 AM, Toke Eskildsen wrote:
> Inline or top-posting? Long discussion, but for mailing lists I 
> clearly prefer the former.

Ditto. ;)

> I have little experience with VM servers for search. Although we use a 
> lot of virtual machines, we use dedicated machines for our searchers, 
> primarily to ensure low latency for I/O. They might be fine for that 
> too, but we haven't tried it yet.
>
> Glad to be of help,
> Toke

We've been running a production Solr installation for over a year (first 
1.4.1 and now 3.2) on virtual machines using Xen, with CentOS 5 guests on 
CentOS 5 hosts.  
Each shard lives in a virtual machine.  We have a pair of virtual 
machines (on separate hardware) to act as search brokers.  Another pair 
of VMs acts as a heartbeat/haproxy load balancer.  Each physical machine 
hosts three of the six large shards that make up our index, and we have 
two copies of the index, requiring four physical hosts.
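
Roughly speaking, a broker core just fans the query out over the shards with 
the shards parameter; done from the client side it would look something like 
this with SolrJ (hostnames and core names below are made up):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

public class DistributedQueryExample {
    public static void main(String[] args) throws Exception {
        // Query one core and let it fan the request out over all six shards.
        SolrServer broker = new CommonsHttpSolrServer("http://broker1:8983/solr/core0");

        SolrQuery query = new SolrQuery("some terms");
        query.set("shards",
              "shard1:8983/solr/core1,shard2:8983/solr/core2,shard3:8983/solr/core3,"
            + "shard4:8983/solr/core4,shard5:8983/solr/core5,shard6:8983/solr/core6");

        System.out.println(broker.query(query).getResults().getNumFound());
    }
}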

Now I'm doing a migration where we use the same physical hardware with 
multiple production cores rather than virtualization, upgrading Solr to 
3.4 and CentOS to 6.0 with ext4 at the same time.  The hosts that have 
been migrated have 32GB of RAM, the hosts that are still using Xen have 
64GB.  There is not enough RAM in either case for the index to fit 
fully.  Despite having less memory, the cores on the upgraded hosts are 
showing average query times 20-25% lower than the others.  Before, the 
hosts with less memory had higher average query times.  I expect that 
when I get the larger hosts migrated, query times will drop yet again.

My opinion: Virtualization can be very effective, but you'll get better 
results without it.  It requires a more complex build system, because 
you can't assume every machine has cores with the same names.  I also 
had to change Jetty's port number because when your load balancer is 
running on the same OS, you can't bind to the same port.

Thanks,
Shawn


RE: capacity planning

Posted by "Jaeger, Jay - DOT" <Ja...@dot.wi.gov>.
We have used a VMware VM for testing our index (currently around 3GB) and it has been just fine - at most maybe a 10 to 20% penalty, if that, even when CPU-bound.  We also plan to use a VM for production.

What hypervisor one uses matters - sometimes a lot.

-----Original Message-----
From: eksdev@googlemail.com [mailto:eksdev@googlemail.com] On Behalf Of eks dev
Sent: Tuesday, October 11, 2011 1:20 PM
To: solr-user@lucene.apache.org
Subject: Re: capacity planning

Re. "I have little experience with VM servers for search."

We had a huge performance penalty on VMs; the CPU was the bottleneck.
We couldn't freely run measurements to figure out what the problem really
was (hosting was contracted by the customer...), but it was something pretty
scary, roughly 8-10 times slower than the advertised dedicated equivalent.
For whatever it's worth, if you can afford it, keep Lucene away from VMs.
Lucene is a highly optimized machine, and someone twiddling with context
switches is not welcome there.

Of course, if you are I/O bound, it makes no big difference anyhow.

This is just my singular experience; it might be that the hosting team did
not configure it right, or something has changed in the meantime (this is
~4-year-old experience), but we burnt our fingers badly enough that I still
remember it.




On Tue, Oct 11, 2011 at 7:49 PM, Toke Eskildsen <te...@statsbiblioteket.dk> wrote:

> Travis Low [tlow@4centurion.com] wrote:
> > Toke, thanks.  Comments embedded (hope that's okay):
>
> Inline or top-posting? Long discussion, but for mailing lists I clearly
> prefer the former.
>
> [Toke: Estimate characters]
>
> > Yes.  We estimate each of the 23K DB records has 600 pages of text for
> the
> > combined documents, 300 words per page, 5 characters per word.  Which
> > coincidentally works out to about 21GB, so good guessing there. :)
>
> Heh. Lucky Guess indeed, although the factors were off. Anyway, 21GB does
> not sound scary at all.
>
> > The way it works is we have researchers modifying the DB records during
> the
> > day, and they may upload documents at that time.  We estimate 50-60
> uploads
> > throughout the day.  If possible, we'd like to index them as they are
> > uploaded, but if that would negatively affect the search, then we can
> > rebuild the index nightly.
> >
> > Which is better?
>
> The analyzing part is only CPU and you're running multi-core so as long as
> you only analyze using one thread you're safe there. That leaves us with
> I/O: Even for spinning drives, a daily load of just 60 updates of 1MB of
> extracted text each shouldn't have any real effect - with the usual caveat
> that large merges should be avoided by either optimizing at night or
> tweaking merge policy to avoid large segments. With such a relatively small
> index, (re)opening and warm up should be painless too.
>
> Summary: 300GB is a fair amount of data and takes some power to crunch.
> However, in the Solr/Lucene end your index size and your update rates are
> nothing to worry about. Usual caveat for advanced use and all that applies.
>
> [Toke: i7, 8GB, 1TB spinning, 256GB SSD]
>
> > We have a very beefy VM server that we will use for benchmarking, but
> your
> > specs provide a starting point.  Thanks very much for that.
>
> I have little experience with VM servers for search. Although we use a lot
> of virtual machines, we use dedicated machines for our searchers, primarily
> to ensure low latency for I/O. They might be fine for that too, but we
> haven't tried it yet.
>
> Glad to be of help,
> Toke

Re: capacity planning

Posted by Travis Low <tl...@4centurion.com>.
Our plan for the VM is just benchmarking, not production.  We will turn off
all guest machines, then configure a Solr VM.  Then we'll tweak memory and
see what effect it has on indexing and searching.  Then we'll reconfigure
the number of processors used and see what that does.  Then again with more
disk space.  And so on.  We'll try to start with a reasonable configuration
and then make intelligent guesses for our changes so we don't spend a year
on this.

What we are trying to avoid is configuring a brand new box at the hosting
provider, only to find we need a bigger and better box, or paying too much
for something we don't need.

Thanks everyone for your input, it was very helpful.

cheers,
Travis

On Tue, Oct 11, 2011 at 2:19 PM, eks dev <ek...@yahoo.co.uk> wrote:

> Re. "I have little experience with VM servers for search."
>
> We had a huge performance penalty on VMs; the CPU was the bottleneck.
> We couldn't freely run measurements to figure out what the problem really
> was (hosting was contracted by the customer...), but it was something pretty
> scary, roughly 8-10 times slower than the advertised dedicated equivalent.
> For whatever it's worth, if you can afford it, keep Lucene away from VMs.
> Lucene is a highly optimized machine, and someone twiddling with context
> switches is not welcome there.
>
> Of course, if you are I/O bound, it makes no big difference anyhow.
>
> This is just my singular experience; it might be that the hosting team did
> not configure it right, or something has changed in the meantime (this is
> ~4-year-old experience), but we burnt our fingers badly enough that I still
> remember it.
>
>
>
>
> On Tue, Oct 11, 2011 at 7:49 PM, Toke Eskildsen <te@statsbiblioteket.dk
> >wrote:
>
> > Travis Low [tlow@4centurion.com] wrote:
> > > Toke, thanks.  Comments embedded (hope that's okay):
> >
> > Inline or top-posting? Long discussion, but for mailing lists I clearly
> > prefer the former.
> >
> > [Toke: Estimate characters]
> >
> > > Yes.  We estimate each of the 23K DB records has 600 pages of text for
> > the
> > > combined documents, 300 words per page, 5 characters per word.  Which
> > > coincidentally works out to about 21GB, so good guessing there. :)
> >
> > Heh. Lucky Guess indeed, although the factors were off. Anyway, 21GB does
> > not sound scary at all.
> >
> > > The way it works is we have researchers modifying the DB records during
> > the
> > > day, and they may upload documents at that time.  We estimate 50-60
> > uploads
> > > throughout the day.  If possible, we'd like to index them as they are
> > > uploaded, but if that would negatively affect the search, then we can
> > > rebuild the index nightly.
> > >
> > > Which is better?
> >
> > The analyzing part is only CPU and you're running multi-core so as long
> as
> > you only analyze using one thread you're safe there. That leaves us with
> > I/O: Even for spinning drives, a daily load of just 60 updates of 1MB of
> > extracted text each shouldn't have any real effect - with the usual
> caveat
> > that large merges should be avoided by either optimizing at night or
> > tweaking merge policy to avoid large segments. With such a relatively
> small
> > index, (re)opening and warm up should be painless too.
> >
> > Summary: 300GB is a fair amount of data and takes some power to crunch.
> > However, in the Solr/Lucene end your index size and your update rates are
> > nothing to worry about. Usual caveat for advanced use and all that
> applies.
> >
> > [Toke: i7, 8GB, 1TB spinning, 256GB SSD]
> >
> > > We have a very beefy VM server that we will use for benchmarking, but
> > your
> > > specs provide a starting point.  Thanks very much for that.
> >
> > I have little experience with VM servers for search. Although we use a
> lot
> > of virtual machines, we use dedicated machines for our searchers,
> primarily
> > to ensure low latency for I/O. They might be fine for that too, but we
> > haven't tried it yet.
> >
> > Glad to be of help,
> > Toke
>




Re: capacity planning

Posted by eks dev <ek...@yahoo.co.uk>.
Re. "I have little experience with VM servers for search."

We had a huge performance penalty on VMs; the CPU was the bottleneck.
We couldn't freely run measurements to figure out what the problem really
was (hosting was contracted by the customer...), but it was something pretty
scary, roughly 8-10 times slower than the advertised dedicated equivalent.
For whatever it's worth, if you can afford it, keep Lucene away from VMs.
Lucene is a highly optimized machine, and someone twiddling with context
switches is not welcome there.

Of course, if you are I/O bound, it makes no big difference anyhow.

This is just my singular experience; it might be that the hosting team did
not configure it right, or something has changed in the meantime (this is
~4-year-old experience), but we burnt our fingers badly enough that I still
remember it.




On Tue, Oct 11, 2011 at 7:49 PM, Toke Eskildsen <te...@statsbiblioteket.dk> wrote:

> Travis Low [tlow@4centurion.com] wrote:
> > Toke, thanks.  Comments embedded (hope that's okay):
>
> Inline or top-posting? Long discussion, but for mailing lists I clearly
> prefer the former.
>
> [Toke: Estimate characters]
>
> > Yes.  We estimate each of the 23K DB records has 600 pages of text for
> the
> > combined documents, 300 words per page, 5 characters per word.  Which
> > coincidentally works out to about 21GB, so good guessing there. :)
>
> Heh. Lucky Guess indeed, although the factors were off. Anyway, 21GB does
> not sound scary at all.
>
> > The way it works is we have researchers modifying the DB records during
> the
> > day, and they may upload documents at that time.  We estimate 50-60
> uploads
> > throughout the day.  If possible, we'd like to index them as they are
> > uploaded, but if that would negatively affect the search, then we can
> > rebuild the index nightly.
> >
> > Which is better?
>
> The analyzing part is only CPU and you're running multi-core so as long as
> you only analyze using one thread you're safe there. That leaves us with
> I/O: Even for spinning drives, a daily load of just 60 updates of 1MB of
> extracted text each shouldn't have any real effect - with the usual caveat
> that large merges should be avoided by either optimizing at night or
> tweaking merge policy to avoid large segments. With such a relatively small
> index, (re)opening and warm up should be painless too.
>
> Summary: 300GB is a fair amount of data and takes some power to crunch.
> However, in the Solr/Lucene end your index size and your update rates are
> nothing to worry about. Usual caveat for advanced use and all that applies.
>
> [Toke: i7, 8GB, 1TB spinning, 256GB SSD]
>
> > We have a very beefy VM server that we will use for benchmarking, but
> your
> > specs provide a starting point.  Thanks very much for that.
>
> I have little experience with VM servers for search. Although we use a lot
> of virtual machines, we use dedicated machines for our searchers, primarily
> to ensure low latency for I/O. They might be fine for that too, but we
> haven't tried it yet.
>
> Glad to be of help,
> Toke

Re: capacity planning

Posted by Toke Eskildsen <te...@statsbiblioteket.dk>.
Travis Low [tlow@4centurion.com] wrote:
> Toke, thanks.  Comments embedded (hope that's okay):

Inline or top-posting? Long discussion, but for mailing lists I clearly prefer the former.

[Toke: Estimate characters]

> Yes.  We estimate each of the 23K DB records has 600 pages of text for the
> combined documents, 300 words per page, 5 characters per word.  Which
> coincidentally works out to about 21GB, so good guessing there. :)

Heh. Lucky Guess indeed, although the factors were off. Anyway, 21GB does not sound scary at all.

> The way it works is we have researchers modifying the DB records during the
> day, and they may upload documents at that time.  We estimate 50-60 uploads
> throughout the day.  If possible, we'd like to index them as they are
> uploaded, but if that would negatively affect the search, then we can
> rebuild the index nightly.
>
> Which is better?

The analyzing part is CPU only, and you're running a multi-core machine, so as long as you analyze using only one thread you're safe there. That leaves us with I/O: even for spinning drives, a daily load of just 60 updates of 1MB of extracted text each shouldn't have any real effect - with the usual caveat that large merges should be avoided, either by optimizing at night or by tweaking the merge policy to avoid large segments. With such a relatively small index, (re)opening and warm-up should be painless too.

Summary: 300GB is a fair amount of data and takes some power to crunch. However, in the Solr/Lucene end your index size and your update rates are nothing to worry about. Usual caveat for advanced use and all that applies.
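
If you do index them as they are uploaded, a rough SolrJ sketch of pushing one upload through the extracting handler would look something like this (the handler path, field names and file path are assumptions, not a description of your setup):

import java.io.File;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

public class UploadIndexerExample {
    public static void main(String[] args) throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

        // Push one uploaded document through Tika via the extracting handler.
        ContentStreamUpdateRequest req =
                new ContentStreamUpdateRequest("/update/extract");
        req.addFile(new File("/uploads/report-1234.pdf"));
        req.setParam("literal.id", "1234");       // key of the DB record it belongs to
        req.setParam("fmap.content", "content");  // map Tika's text into our field
        req.setParam("commitWithin", "60000");    // let Solr batch commits (~1 minute)
        server.request(req);
    }
}

That way commits (and the segment churn mentioned above) stay bounded instead of happening on every single upload.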

[Toke: i7, 8GB, 1TB spinning, 256GB SSD]

> We have a very beefy VM server that we will use for benchmarking, but your
> specs provide a starting point.  Thanks very much for that.

I have little experience with VM servers for search. Although we use a lot of virtual machines, we use dedicated machines for our searchers, primarily to ensure low latency for I/O. They might be fine for that too, but we haven't tried it yet.

Glad to be of help,
Toke

Re: capacity planning

Posted by Travis Low <tl...@4centurion.com>.
Toke, thanks.  Comments embedded (hope that's okay):

On Tue, Oct 11, 2011 at 10:52 AM, Toke Eskildsen <te...@statsbiblioteket.dk> wrote:

> > Greetings.  I have a paltry 23,000 database records that point to a
> > voluminous 300GB worth of PDF, Word, Excel, and other documents.  We are
> > planning on indexing the records and the documents they point to.  I have
> no
> > clue on how we can calculate what kind of server we need for this.  I
> > imagine the index isn't going to be bigger than the documents (is it?)
>
> Sanity check: Let's say your average document is 200 pages with 1000
> words of 5 characters each. That gives you 200 * 1000 * 5 * 23,000 ~=
> 21GB of raw text, which is a far cry from the 300GB.
>
> Either your documents are extremely text heavy or they contain
> illustrations and other elements that are not to be indexed. Is it
> possible for you to estimate the number of characters in your corpus?
>

Yes.  We estimate each of the 23K DB records has 600 pages of text for the
combined documents, 300 words per page, 5 characters per word.  Which
coincidentally works out to about 21GB, so good guessing there. :)

>  But what kind of processing power and memory might we need?

> I am not well-versed in Tika and other PDF/Word/etc analyzing
> frameworks, so I'll just focus on the search part here. Guessing wildly,
> you're aiming for a low number of running updates or even just a nightly
> batch update. Response times should be below 200 ms and the number of
> concurrent searches is 2 to 4 at most.
>

The way it works is we have researchers modifying the DB records during the
day, and they may upload documents at that time.  We estimate 50-60 uploads
throughout the day.  If possible, we'd like to index them as they are
uploaded, but if that would negatively affect the search, then we can
rebuild the index nightly.

Which is better?


> Bold claim: Assuming that your corpus is more like 20GB of raw text than
> 300GB, you'll get by just fine with an i7 machine with 8GB of RAM, a 1TB
> 7200 RPM drive for storage and a 256GB consumer SSD for search. That is
> more or less what we use for our 10M documents/60GB+ index, with a load
> as I described above.
>
> I've always been wary of having to dictate hardware up front for such
> projects. It is a lot easier and cheaper to just build the software,
> then measure and buy hardware after that.
>

We have a very beefy VM server that we will use for benchmarking, but your
specs provide a starting point.  Thanks very much for that.

cheers,

Travis

Re: capacity planning

Posted by Toke Eskildsen <te...@statsbiblioteket.dk>.
On Tue, 2011-10-11 at 14:36 +0200, Travis Low wrote:
> Greetings.  I have a paltry 23,000 database records that point to a
> voluminous 300GB worth of PDF, Word, Excel, and other documents.  We are
> planning on indexing the records and the documents they point to.  I have no
> clue on how we can calculate what kind of server we need for this.  I
> imagine the index isn't going to be bigger than the documents (is it?)

Sanity check: Let's say your average document is 200 pages with 1000
words of 5 characters each. That gives you 200 * 1000 * 5 * 23,000 ~=
21GB of raw text, which is a far cry from the 300GB.
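
The same back-of-the-envelope estimate as a few lines of Java, so the factors are easy to adjust (assuming roughly one byte per character):

public class SizeEstimate {
    public static void main(String[] args) {
        long records      = 23000;
        long pagesPerDoc  = 200;   // adjust to your real averages
        long wordsPerPage = 1000;
        long charsPerWord = 5;

        long chars = records * pagesPerDoc * wordsPerPage * charsPerWord;
        System.out.printf("~%.1f GB of raw text%n", chars / (1024.0 * 1024 * 1024));
    }
}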

Either your documents are extremely text heavy or they contain
illustrations and other elements that are not to be indexed. Is it
possible for you to estimate the number of characters in your corpus?

>  But what kind of processing power and memory might we need?

I am not well-versed in Tika and other PDF/Word/etc analyzing
frameworks, so I'll just focus on the search part here. Guessing wildly,
you're aiming for a low number of running updates or even just a nightly
batch update. Response times should be below 200 ms and the number of
concurrent searches is 2 to 4 at most.

Bold claim: Assuming that your corpus is more like 20GB of raw text than
300GB, you'll get by just fine with an i7 machine with 8GB of RAM, a 1TB
7200 RPM drive for storage and a 256GB consumer SSD for search. That is
more or less what we use for our 10M documents/60GB+ index, with a load
as I described above.

I've always been wary of having to dictate hardware up front for such
projects. It is a lot easier and cheaper to just build the software,
then measure and buy hardware after that.

Regards,
Toke Eskildsen