Posted to solr-user@lucene.apache.org by Marcus Herou <ma...@tailsweep.com> on 2009/07/01 16:31:28 UTC

Re: Scaling out/up or a mix

Hi. I agree that faceting might be the thing that defines this app. The app
is mostly snappy during the daytime, since we optimize the index around
7.00 GMT. However, faceting is never snappy.

We sped things up a whole bunch by deriving various "less cardinal"
fields from the original publishedDate, which is a timestamp (nearly
unique per document = very many distinct terms). At query time I do not
see any obvious signs of high CPU load, but it is hard to tell since we
run Hadoop on the same machines. That is one of the reasons for this
thread: I need to buy hardware so I can separate the architecture.
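
For concreteness, a minimal sketch of that bucketing idea, using plain
Lucene field APIs (the field names and formats here are made up for
illustration):

    import java.text.SimpleDateFormat;
    import java.util.Date;

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    public class DateBucketFields {
        // Derive coarse date buckets from the exact timestamp so that
        // faceting sweeps thousands of distinct terms instead of millions.
        public static void addDateFields(Document doc, Date publishedDate) {
            // Exact timestamp: nearly one distinct term per document,
            // which is what makes faceting on it so expensive.
            doc.add(new Field("publishedDate",
                    Long.toString(publishedDate.getTime()),
                    Field.Store.YES, Field.Index.NOT_ANALYZED));
            // Coarse buckets (SimpleDateFormat is not thread-safe, hence
            // the per-call instances in this sketch).
            String day = new SimpleDateFormat("yyyyMMdd").format(publishedDate);
            String month = new SimpleDateFormat("yyyyMM").format(publishedDate);
            doc.add(new Field("publishedDay", day,
                    Field.Store.NO, Field.Index.NOT_ANALYZED));
            doc.add(new Field("publishedMonth", month,
                    Field.Store.NO, Field.Index.NOT_ANALYZED));
        }
    }

Faceting on publishedDay or publishedMonth then only has to count a few
thousand distinct values at most.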

About the index size: I seem to have been confused when I wrote that.
The facts:
16 GB per machine
8 machines
which gives a total index size of 128 GB (oops, as we speak it is actually
20 GB per machine = 160 GB total).

Thinking about buying a few machines with 48 GB RAM (4 GB modules),
2x 2.0 GHz quad-core CPUs, and 4 disks in RAID 1+0. I can get these for
about $4700. Or do you think I should go for fewer 2U boxes with 8 disks
each?

I know it is hard to give a straight answer without a thorough benchmark;
I am basically just asking for gut feelings here.

Cheers

//Marcus




On Wed, Jul 1, 2009 at 1:31 PM, Toke Eskildsen <te...@statsbiblioteket.dk> wrote:

> On Tue, 2009-06-30 at 22:59 +0200, Marcus Herou wrote:
> > The number of concurrent users today is insignificant, but once we push
> > for the service we will get into trouble... I know that since even one
> > simple faceting query (which we will use to display trend graphs) can
> > take forever (talking about Solr, btw).
>
> Ah, faceting. That could very well be the defining requirement for your
> selection of hardware. As far as I remember, Solr supports two different
> ways of faceting (depending on whether there are few or many tags in a
> facet), where at least one of them uses counters corresponding to the
> number of documents in the index. That scheme is similar to the approach
> we're taking and in our experience this quickly moves the bottleneck to
> RAM access speed. Now, I'm not at all a Solr expert so they might have
> done something clever in that area; I'd recommend that you also state
> your question on the Solr mailing list and mention what kind of faceting
> you'll be performing.
>
> > "Normal" Lucene queries (title:blah OR description:blah) timing is
> > reasonable for the current hardware but not good (Currently 8 machines
> > 2GB RAM each serving 130G index). It takes less than 10 secs at all
> > times which of course is very bad user experience.
>
> In your first post you stated "I currently have an index which is 16GB
> per machine (8 machines = 128GB)" so it's a bit confusing to me what you
> have?
>
> > Example of a public query (no sorting on publisheddate but rather on
> > relevance = faster):
> > http://blogsearch.tailsweep.com/search.do?wa=test&la=all
>
> I tried a few searches and that seemed snappy enough. Not anywhere near
> 10 seconds?
>
> > Sorry not meaning to advertise but I could not help it :)
>
> No problem. The BlogSpace thingy was nice eye-candy.
>
>
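
To make the counting scheme Toke describes concrete, here is a toy sketch
(purely illustrative, not Solr's actual implementation): one counter slot
per distinct facet value, incremented while sweeping the documents that
matched the query. With millions of documents, the loop is bounded by how
fast RAM can serve the ordinals, which is the bottleneck he mentions.

    public class FacetCountSketch {
        // termOrdinals[docId] holds the ordinal of that document's facet
        // value; matchingDocs are the doc ids produced by the main query.
        public static int[] countFacets(int[] termOrdinals,
                                        int[] matchingDocs,
                                        int numDistinctValues) {
            int[] counts = new int[numDistinctValues];
            for (int docId : matchingDocs) {
                counts[termOrdinals[docId]]++;
            }
            return counts;
        }
    }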


-- 
Marcus Herou CTO and co-founder Tailsweep AB
+46702561312
marcus.herou@tailsweep.com
http://www.tailsweep.com/

Re: Scaling out/up or a mix

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Disclaimer: I only skimmed the thread.

RAM. If you can get the OS to buffer the hot pages of your index, you'll
be good: the more RAM, the faster the queries. More cores/CPUs means more
concurrency, but if things are fast because the data is cached, you need
fewer CPUs/cores.
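
A sketch of one way to lean on the page cache (assuming Lucene 2.9+ and a
local index path; adjust to your setup):

    import java.io.File;
    import java.io.IOException;

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.store.MMapDirectory;

    public class MMapOpen {
        // Memory-map the index so hot pages are served straight from the
        // kernel page cache instead of going through explicit read() calls.
        public static IndexReader open(String indexPath) throws IOException {
            MMapDirectory dir = new MMapDirectory(new File(indexPath));
            return IndexReader.open(dir, true); // read-only reader
        }
    }

Whether mmap beats plain NIO reads varies by OS and JVM; the point is just
that with enough RAM the kernel keeps the hot parts of the index resident.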

 Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



----- Original Message ----
> From: Marcus Herou <ma...@tailsweep.com>
> To: solr-user@lucene.apache.org; java-user@lucene.apache.org
> Sent: Wednesday, July 1, 2009 10:31:28 AM
> Subject: Re: Scaling out/up or a mix
