Posted to java-user@lucene.apache.org by Marcus Herou <ma...@tailsweep.com> on 2009/06/27 00:00:40 UTC

Scaling out/up or a mix

Hi.

I currently have an index which is 16GB per machine (8 machines = 128GB)
(data is stored externally, not in the index) and is growing like crazy (we are
indexing blogs, which are crazy by nature), and I have only allocated 2GB per
machine to the Lucene app since we are running some other stuff there in
parallel.

Each doc should be roughly the size of a blog post, no more than 20k.

We currently have about 90M documents and the count is increasing rapidly, so
getting into the 1G+ (billion) document range is not too far away.

Now, due to search performance, I think I need to move these instances to
dedicated index/search machines (or index on some machines and search on
others). Anyway, I would like to get some feedback about two things:

1. What is the most important hardware aspect when it comes to adding documents
to the index and optimizing it?
1.1 Is it disk I/O write throughput? (sequential or random I/O?)
1.2 Is it RAM?
1.3 Is it CPU?

My guess would be disk I/O. Right or wrong?

2. What is the most important hardware aspect when it comes to searching
documents in my setup? (The result set is limited to return only the top 10
matches, with paging.)
2.1 Is it disk read throughput? (sequential or random I/O?)
2.2 Is it RAM?
2.3 Is it CPU?

I have no clue, since the data might not fit into memory. What is the most
important factor then? Read performance while scanning the index? CPU while
comparing fields and collecting results?

What I'm trying to find out is what I can do to get the most bang for the buck
with a limited (aren't we all limited?) budget.

Kindly

//Marcus





-- 
Marcus Herou CTO and co-founder Tailsweep AB
+46702561312
marcus.herou@tailsweep.com
http://www.tailsweep.com/

Re: Scaling out/up or a mix

Posted by Toke Eskildsen <te...@statsbiblioteket.dk>.
On Sat, 2009-06-27 at 00:00 +0200, Marcus Herou wrote:
> We currently have about 90M documents and it is increasing rapidly so
> getting into the G+ document range is not going to be too far away.

We've performed fairly extensive tests regarding hardware for searches
and some minor tests on hardware for indexing. The tests primarily
concerned cores, RAM and storage (so no focus on CPU speed or
bus speed). Our "standard" index was 37GB with 9 million documents,
although we did try our hand at running 40 million documents on a
single machine.

You might want to take a look at some unordered notes and graphs from
our tests: http://wiki.statsbiblioteket.dk/summa/Hardware

> 2. What is the most important hardware aspect when it comes to searching
> documents in my setup ? (result-set is limited to return only the top 10
> matches with page handling)
> 2.1 Is it disk read throughput ? (sequential or random-io ?)
> 2.2 Is it RAM ?
> 2.3 Is is CPU ?

For searches, random access is king, so go for Solid State Drives.
As there is a lot of crap out there, be sure to read some reviews.
The Intel X25 seems like a safe bet right now.

While not quite on par with holding the full index in RAM, SSDs come
quite close (744 searches/second vs. 951 searches/second in one of our
tests with a standard RAMDirectory). The same test for 2 * 15,000 RPM
conventional harddisks in RAID 1 gave us ~200 searches/second. This is
of course highly dependent on the index.

As opposed to conventional harddisks, SSDs aren't nearly as reliant on
RAM for caching. On the other hand, SSDs are capable of serving larger
indexes than conventional harddisks and as such, more RAM will be needed
for the JVM with the Lucene searcher.

Our pick for the 50 million documents / 150-200GB of index per machine
range was 4-core Intel Xeons, 16GB RAM, 4 * 64GB SSDs for the index
(RAID0ing them does not change the speed significantly; we just do it to
get a single volume) and conventional harddisks for storage.

Just as Eric Bowman discovered, processing power easily becomes the
bottleneck when switching to SSDs. This happened for us too and
triggered a great deal of profiling (VisualVM is free, very easy to use
and helps tremendously with this) to pinpoint where the CPUs spent their
time.

Regards,
Toke Eskildsen




Re: Scaling out/up or a mix

Posted by Marcus Herou <ma...@tailsweep.com>.
Hi, thanks for your answer; comments inline.

On Mon, Jun 29, 2009 at 10:06 AM, eks dev <ek...@yahoo.co.uk> wrote:

>
> It depends on your architecture. Will you partition your index? What is the max
> expected size of your index (you said 128G and growing...)? What do you mean by
> growing? In both of your options you have enough memory to load it into RAM...

Yes, we partition the index with a simple round-robin algorithm.
The options were just to give the reader some visibility into what kind of
hardware you get depending on which path you choose. I do not really have
that amount of money to spend right now. More like 1/6th of that, really.
We crawl blogs... The number of blogs we find is still increasing and we are
not nearly indexing all languages => it will grow at least linearly. Let's say
at least 10-20GB a month or so?
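
For illustration, the assignment is essentially just this (a minimal sketch,
not our actual code):

import java.util.concurrent.atomic.AtomicLong;

public class RoundRobinPartitioner {
    private final int numShards;
    private final AtomicLong counter = new AtomicLong();

    public RoundRobinPartitioner(int numShards) {
        this.numShards = numShards;
    }

    // Shard that the next incoming document should be indexed into.
    public int nextShard() {
        return (int) (counter.getAndIncrement() % numShards);
    }
}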

>
>
> I would definitely try to have fewer machines and a lot of memory, so that
> your index fits into RAM comfortably...

OK, so you mean that one should aim for fitting the shard into RAM...

>
>
> IMO, 8 gigs per machine is rather smallish, but it depends heavily on your access
> patterns... how many documents do you need to load from disk per query? If this
> does not create huge IO load, you could try to load everything but stored
> fields into RAM

We store no fields in the index besides the actual DB id. We load no more
than 50 docs at a time.

>
>
> What are your requirements on the indexing side (once a day, week, 15 minutes)?
> How do you distribute the index to all these machines...

We index during all non-office hours.

>
>
> Your question: IO or CPU bound? It depends. If you load it into RAM it becomes
> memory-bus/CPU bound; if it is mainly on disk it will be IO bound

OK, as I suspected. That answers my previous question(s).


Final question:

Based on your findings, what is the most challenging part to tune? Sorting,
querying, or something else?

//Marcus





-- 
Marcus Herou CTO and co-founder Tailsweep AB
+46702561312
marcus.herou@tailsweep.com
http://www.tailsweep.com/

Re: Scaling out/up or a mix

Posted by eks dev <ek...@yahoo.co.uk>.
It depends on your architecture. Will you partition your index? What is the max expected size of your index (you said 128G and growing...)? What do you mean by growing? In both of your options you have enough memory to load it into RAM...

I would definitely try to have fewer machines and a lot of memory, so that your index fits into RAM comfortably...

IMO, 8 gigs per machine is rather smallish, but it depends heavily on your access patterns... how many documents do you need to load from disk per query? If this does not create huge IO load, you could try to load everything but stored fields into RAM

What are your requirements on the indexing side (once a day, week, 15 minutes)? How do you distribute the index to all these machines...
Your question: IO or CPU bound? It depends. If you load it into RAM it becomes memory-bus/CPU bound; if it is mainly on disk it will be IO bound.








RE: Scaling out/up or a mix

Posted by Toke Eskildsen <te...@statsbiblioteket.dk>.
On Tue, 2009-06-30 at 11:29 +0200, Uwe Schindler wrote:
> So the simple answer is always:
> If 64 bit platform with lots of RAM, use MMapDirectory.

Fair enough. That makes the RAM-focused solution much more scalable.
My point still stands, though, as Marcus is currently examining his
hardware options and a lot of RAM is not a given: when we don't know the
performance goal, it is hard to balance a machine.




Re: Scaling out/up or a mix

Posted by Marcus Herou <ma...@tailsweep.com>.
Hi, I like the sound of this.

What I am not familiar with in terms of Lucene is how the index gets
swapped in and out of memory. With database tables (non-partitionable
tables at least) I know that one should have enough memory to fit the
entire index into memory, to avoid file sorts for instance, which are of
course slow.

If one operates mostly on the newest entries, will those always be
mapped into a cache or something?

How should I think about this?

Hmmm... I now think that I need to create a rotation-partition scheme where
I separate cold (old) entries from warm entries...
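
Conceptually something like this (a purely hypothetical sketch, not real code):

// Route a query to the "warm" shard group if it only touches recent entries,
// otherwise fall back to the "cold" group as well.
public class AgePartitionRouter {
    private static final long WARM_WINDOW_MS = 30L * 24 * 60 * 60 * 1000; // ~30 days

    public String routeByAge(long oldestRequestedTimestamp) {
        long cutoff = System.currentTimeMillis() - WARM_WINDOW_MS;
        return oldestRequestedTimestamp >= cutoff ? "warm" : "cold";
    }
}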

/M




-- 
Marcus Herou CTO and co-founder Tailsweep AB
+46702561312
marcus.herou@tailsweep.com
http://www.tailsweep.com/

RE: Scaling out/up or a mix

Posted by Uwe Schindler <us...@pangaea.de>.

I would never say "copy the index into a RAMDirectory" or something like
that. I would buy enough RAM to fit as much as possible into RAM and (as we
are surely on a 64-bit platform) use MMapDirectory instead of
SimpleFSDirectory or NIOFSDirectory (I am talking in Lucene 2.9 class
names, where FSDirs can be instantiated directly, as you may have noticed).
MMapDirectory uses the index like a swap file that is mapped into address
space. The OS kernel will then treat the index like RAM and map it into real
RAM as needed. We have had this discussion many times on this mailing list
(search for MMapDirectory in the archives). So the simple answer is always:
if on a 64-bit platform with lots of RAM, use MMapDirectory. On Windows this
is still buggy (but in 2.9 there is a workaround in MMapDirectory). If you
warm your searchers beforehand (I think you will...), the operating system
kernel will "swap" in as much of the index as possible.
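
Opening the index that way is trivial; a minimal example (Lucene 2.9 API,
with a made-up index path):

import java.io.File;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.MMapDirectory;

public class MMapSearchExample {
    public static void main(String[] args) throws Exception {
        // Map the index into virtual address space; the OS pages it into RAM on demand.
        Directory dir = new MMapDirectory(new File("/path/to/index"));
        IndexReader reader = IndexReader.open(dir, true); // read-only reader
        IndexSearcher searcher = new IndexSearcher(reader);
        // ... warm the searcher with a few typical queries so the hot pages get cached ...
    }
}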

Uwe


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org



Re: Scaling out/up or a mix

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Disclaimer: I only skimmed the thread.

RAM.  If you can get the OS to buffer the hot pages of your index you'll be good.  The more RAM, the more is cached and the faster the queries.  More cores/CPUs means more concurrency, and if things are fast because the data is cached, you need fewer CPUs/cores.

 Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch





Re: Scaling out/up or a mix

Posted by Marcus Herou <ma...@tailsweep.com>.
Hi. I agree that faceting might be the thing that defines this app. The app is
mostly snappy during the daytime since we optimize the index around 7.00 GMT.
However, faceting is never snappy.

We sped things up a whole bunch by creating various lower-cardinality
fields from the originating publishedDate, which is a timestamp (very unique
= very many distinct values). At query time I do not see any apparent signs
of high CPU load, but it is hard to tell since we run Hadoop on the same
machines. This is one of the reasons for this topic: I need to buy hardware
to separate the architecture.
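
The idea is just to index rounded copies of the timestamp next to the full
one; a minimal sketch (the rounded field names are made up):

import java.util.Date;
import org.apache.lucene.document.DateTools;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class DateFacetFields {
    // Adds the full-resolution date plus two low-cardinality rounded copies.
    public static void addDateFields(Document doc, Date published) {
        doc.add(new Field("publishedDate",
            DateTools.dateToString(published, DateTools.Resolution.SECOND),
            Field.Store.NO, Field.Index.NOT_ANALYZED)); // high cardinality, costly to facet on
        doc.add(new Field("publishedDay",
            DateTools.dateToString(published, DateTools.Resolution.DAY),
            Field.Store.NO, Field.Index.NOT_ANALYZED)); // few distinct values per year
        doc.add(new Field("publishedMonth",
            DateTools.dateToString(published, DateTools.Resolution.MONTH),
            Field.Store.NO, Field.Index.NOT_ANALYZED)); // even fewer
    }
}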

About the index size: I seem to have been confused when I wrote that.
Facts:
16GB per machine
8 machines
which leads to a total index size of 128GB (oops, as we speak it is
actually 20GB per machine = a 160GB index).

Thinking about buying a few machines with 48GB RAM (4GB modules), 2 x 2.0GHz
quad-core, 4 disks RAIDed 1+0. I can get these for about $4,700. Or do you
think I should go for fewer 2U boxes with 8 disks?

I know that it is hard to give a straight answer without a thorough
benchmark; I am basically just asking about gut feeling here.

Cheers

//Marcus






-- 
Marcus Herou CTO and co-founder Tailsweep AB
+46702561312
marcus.herou@tailsweep.com
http://www.tailsweep.com/


Re: Scaling out/up or a mix

Posted by Toke Eskildsen <te...@statsbiblioteket.dk>.
On Tue, 2009-06-30 at 22:59 +0200, Marcus Herou wrote:
> The number of concurrent users today is insignificant, but once we push
> for the service we will get into trouble... I know that, since even one
> simple faceting query (which we will use to display trend graphs) can
> take forever (talking about SOLR, btw).

Ah, faceting. That could very well be the defining requirement for your
selection of hardware. As far as I remember, Solr supports two different
ways of faceting (depending on whether there are few or many tags in a
facet), where at least one of them uses counters corresponding to the
number of documents in the index. That scheme is similar to the approach
we're taking and in our experience this quickly moves the bottleneck to
RAM access speed. Now, I'm not at all a Solr expert so they might have
done something clever in that area; I'd recommend that you also state
your question on the Solr mailing list and mention what kind of faceting
you'll be performing.
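
For reference, the counter-based approach is essentially just this (a rough
sketch with made-up names; the point is that every query sweeps arrays sized
to the index, so RAM bandwidth dominates):

public class FacetCounter {
    private final int[] tagOrdinals;   // docId -> tag ordinal, built once per index
    private final int numUniqueTags;

    public FacetCounter(int[] tagOrdinals, int numUniqueTags) {
        this.tagOrdinals = tagOrdinals;
        this.numUniqueTags = numUniqueTags;
    }

    // Count tag occurrences over the matching docIds for one query.
    public int[] count(int[] matchingDocIds) {
        int[] counts = new int[numUniqueTags];
        for (int docId : matchingDocIds) {
            counts[tagOrdinals[docId]]++;
        }
        return counts;
    }
}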

> "Normal" Lucene queries (title:blah OR description:blah) have timings that
> are reasonable for the current hardware but not good (currently 8 machines
> with 2GB RAM each, serving a 130G index). They always complete in under
> 10 secs, which of course is still a very bad user experience.

In your first post you stated "I currently have an index which is 16GB
per machine (8 machines = 128GB)" so it's a bit confusing to me what you
have?

> Example of a public query (no sorting on publishedDate but rather on
> relevance = faster):
> http://blogsearch.tailsweep.com/search.do?wa=test&la=all

I tried a few searches and that seemed snappy enough. Not anywhere near
10 seconds?

> Sorry not meaning to advertise but I could not help it :) 

No problem. The BlogSpace thingy was nice eye-candy.




Re: Scaling out/up or a mix

Posted by Marcus Herou <ma...@tailsweep.com>.
Hi.

The number of concurrent users today is insignificant, but once we push for
the service we will get into trouble... I know that, since even one simple
faceting query (which we will use to display trend graphs) can take forever
(talking about SOLR, btw). "Normal" Lucene queries (title:blah OR
description:blah) have timings that are reasonable for the current hardware
but not good (currently 8 machines with 2GB RAM each, serving a 130G index).
They always complete in under 10 secs, which of course is still a very bad
user experience.

If someone needs to understand more about the nature of this app: I think we
are quite like Technorati (if we showed all the bling-bling) or twingly.com.
Basically a blog search app.

Example of a public query (no sorting on publishedDate but rather on
relevance = faster):
http://blogsearch.tailsweep.com/search.do?wa=test&la=all

And while you are at it, look at our cool BlogSpace:
http://blogsearch.tailsweep.com/showFeed.do?feedId=114799

Sorry not meaning to advertise but I could not help it :)


//Marcus






-- 
Marcus Herou CTO and co-founder Tailsweep AB
+46702561312
marcus.herou@tailsweep.com
http://www.tailsweep.com/

Re: Scaling out/up or a mix

Posted by Toke Eskildsen <te...@statsbiblioteket.dk>.
On Mon, 2009-06-29 at 09:47 +0200, Marcus Herou wrote: 
> Index size (and growing): 16G x 8 = 128G
> Doc size (data): 20k
> Num docs: 90M
> Num users: Few hundred but most critical is that the admin staff which is
> using the index all day long.
> Query types: Example: title:"Iphone" OR description:"Iphone" sorted by
> publishedDate... = Very simple, no fuzzy searches etc. However since the
> dataset is large it will consume memory on sorting I guess.
> 
> Could not one draw any conclusions about best-practice in terms of hardware
> given the above "specs" ?

Can you give us an estimate of the number of concurrent searches in
prime time and in what range a satisfactory response time would be?

Going for a fully RAM-based search on a corpus of this size would mean
that each machine holds about 30GB of index (taken from your hardware
suggestion). I would expect that such a machine would be able to serve
something like 500-1000 searches/second (highly dependent on the index
and the searches, but what you're describing sounds simple enough) if we
just measure the raw search time and lookup of one or two fields for the
first 20 hits. Is that what you're aiming for?

Wrapping in web services and such lowers the number of searches that can
be performed, which makes the RAM-option even more expensive relative to
a harddisk or SSD solution.

> I mean it is very simple: Let's say someone gives me a budget of 50 000 USD
> and I then want to get the most bang for the buck for my workload.

I am a bit unclear on your overall goal. Do you expect the number of
users to grow significantly?




Re: Scaling out/up or a mix

Posted by Andy Goodell <go...@gmail.com>.
I have improved date-sorted search performance pretty dramatically by
replacing the two-step "search then sort" operation with a one-step "use the
date as the score" algorithm.  The main gotcha was making sure not to affect
which results get counted as hits in boolean searches, but overall I only
spent about a week on the project and got a 60x speed improvement on the
target set (from minutes to seconds).  YMMV, however, since the app requires
collecting the complete set of results for analysis.
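
In spirit it is just a custom Collector that uses the date as the score; a
rough sketch against the Lucene 2.9 Collector API (the "publishedDate" field
name and the top-N plumbing are made up):

import java.io.IOException;
import java.util.Comparator;
import java.util.PriorityQueue;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Collector;
import org.apache.lucene.search.FieldCache;
import org.apache.lucene.search.Scorer;

public class DateAsScoreCollector extends Collector {
    private final int topN;
    // Min-heap of [date, globalDocId]; the smallest date sits on top for eviction.
    private final PriorityQueue<long[]> queue;
    private long[] dates;   // per-segment publishedDate values from the FieldCache
    private int docBase;

    public DateAsScoreCollector(int topN) {
        this.topN = topN;
        this.queue = new PriorityQueue<long[]>(topN, new Comparator<long[]>() {
            public int compare(long[] a, long[] b) {
                return a[0] < b[0] ? -1 : (a[0] > b[0] ? 1 : 0);
            }
        });
    }

    public void setScorer(Scorer scorer) {
        // The relevance score is ignored entirely; the date is the score.
    }

    public void setNextReader(IndexReader reader, int docBase) throws IOException {
        this.dates = FieldCache.DEFAULT.getLongs(reader, "publishedDate");
        this.docBase = docBase;
    }

    public void collect(int doc) {
        long date = dates[doc];
        if (queue.size() < topN) {
            queue.add(new long[] { date, docBase + doc });
        } else if (queue.peek()[0] < date) {
            queue.poll(); // evict the oldest of the current top-N
            queue.add(new long[] { date, docBase + doc });
        }
    }

    public boolean acceptsDocsOutOfOrder() {
        return true; // hit order doesn't matter, we only compare dates
    }
}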

- andy g


Re: Scaling out/up or a mix

Posted by Marcus Herou <ma...@tailsweep.com>.
Thanks for the answer.

Don't you think that part 1 of the email gives you a hint of the nature of
the index?

Index size (and growing): 16G x 8 = 128G
Doc size (data): 20k
Num docs: 90M
Num users: A few hundred, but most critical is the admin staff, who use the
index all day long.
Query types: Example: title:"Iphone" OR description:"Iphone" sorted by
publishedDate... = very simple, no fuzzy searches etc. However, since the
dataset is large, sorting will consume memory, I guess.

Could one not draw any conclusions about best practice in terms of hardware,
given the above "specs"?

Basically I would like to know if I really need 8 cores, since machines with
dual-CPU support are the most expensive and I would like not to throw away
money, so getting it right is a matter of economy.

I mean, it is very simple: let's say someone gives me a budget of 50,000 USD
and I want to get the most bang for the buck for my workload.
Should I go for
X machines with a quad-core 3.0GHz CPU, 4 disks RAID1+0, 8G RAM, costing me
1,200 USD apiece (giving me 40 machines: 160 disks, 160 cores, 320G RAM)
or
X machines with dual quad-core 2.0GHz CPUs, 4 disks RAID1+0, 36G RAM, costing
me 3,400 USD apiece (giving me 15 machines: 60 disks, 120 cores, 540G RAM)?

Basically I would like to know which factors make the workload IO bound vs
CPU bound.

//Marcus








-- 
Marcus Herou CTO and co-founder Tailsweep AB
+46702561312
marcus.herou@tailsweep.com
http://www.tailsweep.com/

Re: Scaling out/up or a mix

Posted by Eric Bowman <eb...@boboco.ie>.
There is no single answer -- this is always application specific.

Without knowing anything about what you are doing:

1. Disk I/O is probably the most critical. Go SSD or even RAM disk if
you can, if performance is absolutely critical.
2. Sometimes CPU can become an issue, but 8 cores is probably enough
unless you are doing especially CPU-bound searches.

Unless you have hard performance requirements, or are doing something
really quite unusual, buying "good" kit is probably good enough, and you
won't really know for sure until you measure. Lucene is a general
enough tool that there isn't a universal answer to this. We were a bit
surprised to end up CPU-bound instead of disk I/O-bound, for instance,
but we ended up taking an unusual path. YMMV.
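
For what it's worth, measuring can be as cheap as replaying a few thousand
representative queries against a production-sized index while watching
iostat and the CPU. A rough single-threaded sketch (Lucene 2.4-era API; the
path, field names and sample queries are placeholders):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.store.FSDirectory;

public class SearchBench {
    public static void main(String[] args) throws Exception {
        // Point this at a production-sized shard. To approximate the
        // RAM-disk case, wrap the directory in a RAMDirectory instead.
        IndexSearcher searcher = new IndexSearcher(
                FSDirectory.getDirectory("/index/shard0"));
        QueryParser parser = new QueryParser("title", new StandardAnalyzer());
        Sort byDate = new Sort(
                new SortField("publishedDate", SortField.LONG, true));

        String[] samples = { "title:iphone OR description:iphone",
                             "title:lucene OR description:lucene" };
        int n = 1000;
        long start = System.currentTimeMillis();
        for (int i = 0; i < n; i++) {
            Query q = parser.parse(samples[i % samples.length]);
            searcher.search(q, null, 10, byDate); // top 10, like the real app
        }
        long ms = Math.max(1, System.currentTimeMillis() - start);
        System.out.println((n * 1000.0 / ms) + " searches/sec");
        searcher.close();
    }
}

Run a few of these in parallel to see where the box saturates -- if iostat
shows the disks pegged you are I/O-bound; if the cores peg first, you are
CPU-bound.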

Marcus Herou wrote:
> Hi. I think I need to be more specific.


-- 
Eric Bowman
Boboco Ltd
ebowman@boboco.ie
http://www.boboco.ie/ebowman/pubkey.pgp
+35318394189/+353872801532




Re: Scaling out/up or a mix

Posted by Marcus Herou <ma...@tailsweep.com>.
Hi. I think I need to be more specific.

What I am trying to find out is whether I should aim for:

CPU: 2x4 cores at 2.0-3.0GHz? Or is a single quad-core enough?
Fast disk IO: 8 disks in RAID1+0? Or are 2 disks enough?
RAM: if the index does not fit into RAM, how much RAM should I then buy?

Any hints would be appreciated, since I am going to invest soon.
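
My own back-of-envelope on the RAM question so far (our numbers, my guess on
the reasoning): each box holds a 16G shard, so with 8G of RAM the OS cache
can hold at most half the index, while a 36G box could cache the whole shard
and still leave room for the JVM. Roughly 16G index + 2G heap + headroom for
the OS says that north of ~20G per box buys fully cached searches; anything
less and some queries will hit disk.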

//Marcus



-- 
Marcus Herou CTO and co-founder Tailsweep AB
+46702561312
marcus.herou@tailsweep.com
http://www.tailsweep.com/