Posted to java-user@lucene.apache.org by Chris Were <ch...@gmail.com> on 2009/10/07 00:57:02 UTC

How to set up a scalable deployment?

Hi,
I've been using lucene for a project and it works great on a single dev
machine. The next step is to investigate the best way to deploy lucene so
that multiple web servers can access the lucene index directory.

I see four potential options:

1) Each web server indexes the content separately. This will potentially
cause different web servers to have slightly different indexes at any given
time, and it also duplicates the indexing workload.
2) Using rsync (or a similar tool) to regularly update a local lucene index
directory on each web server. I imagine there will be locking issues that
need to be resolved here.
3) Using a network file system that all the web servers can access. I don't
have much experience in this area, but potentially latency on searches will
be high?
4) Some alternative lucene-specific solution that I haven't found in the
wiki / lucene documentation.

The indexes aim to be as real-time as possible; I currently update my
IndexReaders in a background thread every 20 seconds.
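For context, the update loop is just the standard reopen pattern, roughly
like this (a rough sketch with illustrative names, assuming Lucene 2.x's
IndexReader.reopen()):

import org.apache.lucene.index.IndexReader;

// Background refresh loop (sketch). In production the old reader should
// be reference-counted so in-flight searches don't lose it.
public class ReaderRefresher implements Runnable {
    private volatile IndexReader reader;

    public ReaderRefresher(IndexReader initial) {
        this.reader = initial;
    }

    public IndexReader currentReader() {
        return reader; // searchers grab the latest reader from here
    }

    public void run() {
        try {
            while (!Thread.currentThread().isInterrupted()) {
                IndexReader newReader = reader.reopen();
                if (newReader != reader) { // reopen() returns the same
                    reader.close();        // instance if nothing changed
                    reader = newReader;
                }
                Thread.sleep(20 * 1000L);  // every 20 seconds
            }
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}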

Does anyone have any recommendations on the best approach?

Cheers,
Chris

Re: How to set up a scalable deployment?

Posted by no spam <mr...@gmail.com>.
Have you investigated using Terracotta / Compass?  We need real-time updates
to the index across multiple web servers.  I recently got this up and
running and we're going to be doing some performance testing.  It's very
easy: essentially you just replace your FSDirectoryProvider with a
TerracottaDirectoryProvider that wraps the Compass TerracottaDirectory.

http://relation.to/Bloggers/HibernateSearchClusteringWithTerracotta
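In Hibernate Search terms it's roughly a one-property change, something
like this (a sketch only: the property key is standard Hibernate Search,
but the provider class name below is a guess - see the post above for the
real one):

# hypothetical provider class, check the blog post for the actual FQN
hibernate.search.default.directory_provider = org.compass.needle.terracotta.TerracottaDirectoryProvider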

Re: How to set up a scalable deployment?

Posted by Jason Rutherglen <ja...@gmail.com>.
Chris,

It sounds like you're on the right track. Have you looked at
Solr, which uses the rsync/Java replication method you mentioned?
Replication and near-realtime search in Solr aren't quite there yet;
however, it wouldn't be too hard to add.

-J

On Tue, Oct 6, 2009 at 3:57 PM, Chris Were <ch...@gmail.com> wrote:
> Hi,
> I've been using lucene for a project and it works great on a single dev
> machine. The next step is to investigate the best way to deploy lucene so
> that multiple web servers can access the lucene index directory.
>
> I see four potential options:
>
> 1) Each web server indexes the content separately. This will potentially
> cause different web servers to have slightly different indexes at any given
> time, and it also duplicates the indexing workload.
> 2) Using rsync (or a similar tool) to regularly update a local lucene index
> directory on each web server. I imagine there will be locking issues that
> need to be resolved here.
> 3) Using a network file system that all the web servers can access. I don't
> have much experience in this area, but potentially latency on searches will
> be high?
> 4) Some alternative lucene-specific solution that I haven't found in the
> wiki / lucene documentation.
>
> The indexes aim to be as real-time as possible; I currently update my
> IndexReaders in a background thread every 20 seconds.
>
> Does anyone have any recommendations on the best approach?
>
> Cheers,
> Chris
>



Re: How to set up a scalable deployment?

Posted by Chris Were <ch...@gmail.com>.
Thanks for all the excellent replies.
Lots of great software mentioned that I'd never heard of -- and I thought
I'd Googled this subject to death already!

Cheers,
Chris.

Re: How to set up a scalable deployment?

Posted by Chris Were <ch...@gmail.com>.
Hi Jake,
Thanks for the great insight and suggestions.

I will investigate different optimize() levels and see if that helps my
particular use case -- if not, I'll consider the Zoie route and let you
know how I get on.

Cheers,
Chris

On Fri, Oct 9, 2009 at 3:40 PM, Jake Mannix <ja...@gmail.com> wrote:

>
>
> On Thu, Oct 8, 2009 at 9:32 PM, Chris Were <ch...@gmail.com> wrote:
>
>> Zoie looks very close to what I'm after; however, my whole app is written
>> in Python and uses PyLucene, so there is a non-trivial amount of work to
>> make things work with Zoie.
>>
>
> I've never used PyLucene before, but since it's a wrapper, plugging in zoie
> should be doable - instead of accessing the raw Lucene API via PyLucene,
> you could instantiate a proj.zoie.impl.indexing.ZoieSystem; it would then
> take care of the IndexReader management for you, and you'd just ask it
> for IndexReaders and pass it indexing events.
>
> Not sure how well the JVM interacts with Python.  Whenever I use other
> languages, I try to turn that paradigm upside down - use Jython, JRuby,
> and Scala, and live *inside* of the JVM.  Then everything is nice and
> happy. :)
>
>> I'm currently experiencing a bottleneck when optimising my index. How is
>> this handled / solved with Zoie?
>
> In a realtime environment, optimizing your index isn't always the best
> thing to do - optimize down to a small number of segments (with the
> under-used IndexWriter.optimize(int maxNumSegments) call) if you need to,
> but optimizing down to 1 segment means you completely blow away your OS
> disk cache, as all of your old files are gone and you have one huge new
> ginormous one.  Keeping a balanced set of segments, in which only a couple
> of the big ones merge together occasionally (and the small ones are
> continuously being integrated into bigger ones), means you only blow away
> parts of the IO cache at a time.
>
> But to answer your question more directly: with Zoie, the disk index is
> only refreshed every 15 minutes or so (this is configurable), because you
> have a RAMDirectory serving realtime results as well.  Any background
> segment merging and optimizing can be done on the disk index during this
> 15 minute interval, and typical use cases keep the target number of
> segments at a fixed, moderately small number (5 or 10 segments or so).
>
> The optimize() call takes progressively more time the fewer segments you
> have left, and much of the time is in the final 2-to-1 segment merge, so
> if you never do that last step, things go a lot faster.  You can try this
> without zoie - replace your optimize() calls with optimize(2), optimize(5),
> or optimize(10), and see
>   a) how long this takes instead, and
>   b) what effect this has on your latency (it will cost you something -
> but how much: check and see!)
>
>   If you end up trying zoie in PyLucene somehow, please post about it;
> we'd love to hear about it used in different sorts of environments.
>
>   -jake
>

Re: How to set up a scalable deployment?

Posted by Jake Mannix <ja...@gmail.com>.
On Thu, Oct 8, 2009 at 9:32 PM, Chris Were <ch...@gmail.com> wrote:

> Zoie looks very close to what I'm after; however, my whole app is written
> in Python and uses PyLucene, so there is a non-trivial amount of work to
> make things work with Zoie.
>

I've never used PyLucene before, but since it's a wrapper, plugging in zoie
should be doable - instead of accessing the raw Lucene API via PyLucene,
you could instantiate a proj.zoie.impl.indexing.ZoieSystem; it would then
take care of the IndexReader management for you, and you'd just ask it
for IndexReaders and pass it indexing events.
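To be clear, that's not zoie's literal API - just the shape of the
contract. As a sketch (the interface below is illustrative, not from
zoie):

import java.util.List;
import org.apache.lucene.index.IndexReader;

// Illustrative only - not zoie's real interface. The contract: push
// indexing events in, ask for up-to-date IndexReaders back.
public interface RealtimeIndexingSystem<E> {
    void consume(List<E> events);        // hand indexing events to the system
    List<IndexReader> getIndexReaders(); // current readers for searching
}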

Not sure how well the JVM interacts with Python.  Whenever I use other
languages, I try to turn that paradigm upside down - use Jython, JRuby, and
Scala, and live *inside* of the JVM.  Then everything is nice and happy. :)

> I'm currently experiencing a bottleneck when optimising my index. How is
> this handled / solved with Zoie?

In a realtime environment, optimizing your index isn't always the best
thing to do - optimize down to a small number of segments (with the
under-used IndexWriter.optimize(int maxNumSegments) call) if you need to,
but optimizing down to 1 segment means you completely blow away your OS
disk cache, as all of your old files are gone and you have one huge new
ginormous one.  Keeping a balanced set of segments, in which only a couple
of the big ones merge together occasionally (and the small ones are
continuously being integrated into bigger ones), means you only blow away
parts of the IO cache at a time.

But to answer your question more directly: with Zoie, the disk index is
only refreshed every 15 minutes or so (this is configurable), because you
have a RAMDirectory serving realtime results as well.  Any background
segment merging and optimizing can be done on the disk index during this
15 minute interval, and typical use cases keep the target number of
segments at a fixed, moderately small number (5 or 10 segments or so).

The optimize() call takes progressively more time the fewer segments you
have left, and much of the time is in the final 2-to-1 segment merge, so
if you never do that last step, things go a lot faster.  You can try this
without zoie - replace your optimize() calls with optimize(2), optimize(5),
or optimize(10), and see
  a) how long this takes instead, and
  b) what effect this has on your latency (it will cost you something -
but how much: check and see!)
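For example (a minimal sketch against the Lucene 2.9 API; the index path
is a placeholder):

import java.io.File;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class PartialOptimize {
    public static void main(String[] args) throws Exception {
        IndexWriter writer = new IndexWriter(
            FSDirectory.open(new File("/path/to/index")), // placeholder
            new StandardAnalyzer(Version.LUCENE_29),
            IndexWriter.MaxFieldLength.UNLIMITED);

        long start = System.currentTimeMillis();
        // writer.optimize();  // full merge to 1 segment: slow, and it
                               // blows away the OS disk cache
        writer.optimize(5);    // merge down to at most 5 segments instead
        System.out.println("took " +
            (System.currentTimeMillis() - start) + " ms");
        writer.close();
    }
}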

  If you end up trying zoie in PyLucene somehow, please post about it;
we'd love to hear about it used in different sorts of environments.

  -jake

Re: How to set up a scalable deployment?

Posted by Chris Were <ch...@gmail.com>.
>
> In this case, I'd say that if you have a reliable, scalable queueing
> system for getting indexing events distributed to all of your servers,
> then indexing on all replicas simultaneously can be the best way to have
> maximally realtime search: either by using the very new "near realtime
> search" feature in Lucene 2.9, by using something home-brewed which
> indexes in memory and on disk simultaneously, or by using Zoie
> ( http://zoie.googlecode.com ), an open-source realtime search library
> built on top of Lucene, which we built at LinkedIn and have been using in
> production for the past year, serving tens of millions of queries a day
> in real time (meaning milliseconds, even under fairly high indexing load).


Zoie looks very close to what I'm after; however, my whole app is written in
Python and uses PyLucene, so there is a non-trivial amount of work to make
things work with Zoie.

I'm currently experiencing a bottleneck when optimising my index. How is
this handled / solved with Zoie?

Cheers,
Chris

Re: How to set up a scalable deployment?

Posted by Jake Mannix <ja...@gmail.com>.
Hi Chris,

  Answering your question depends in part on whether your kind of
scalability requires sharding (your index size is expected to grow very
large) or just replication (your query load is large, and you need
failover).  It sounds like you're mostly thinking about the latter.

> 1) Each web server indexes the content separately. This will potentially
> cause different web servers to have slightly different indexes at any given
> time, and it also duplicates the indexing workload.
>

  If your indexing throughput is small enough, this can be a perfectly
simple way to do it.  Just make sure you're not DOS'ing your DB if you're
indexing via direct DB queries - i.e. it's better to have a message queue
or something else firing off indexing events than to have all web servers
firing off simultaneous, identical DB queries from different places.  DB
caching will deal with this pretty well, but you still need to be careful.


> 2) Using rsync (or a similar tool) to regularly update a local lucene index
> directory on each web server. I imagine there will be locking issues that
> need to be resolved here.
>

  rsync can work great, and as Jason said, it is how Solr works, and it
scales well.  Locking isn't really a worry, because in this setup the
slaves are read-only, so they won't compete with rsync for write access.
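On the slave side you can make that explicit by opening readers read-only
(a minimal sketch assuming the Lucene 2.4+ readOnly flag; the path is a
placeholder):

import java.io.File;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.FSDirectory;

public class ReadOnlySlave {
    public static void main(String[] args) throws Exception {
        // readOnly = true: no write lock is taken, so the reader never
        // competes with rsync for write access to the copied index
        IndexReader reader = IndexReader.open(
            FSDirectory.open(new File("/var/search/index")), // placeholder
            true);
        IndexSearcher searcher = new IndexSearcher(reader);
        System.out.println("searchable docs: " + searcher.maxDoc());
        searcher.close();
        reader.close();
    }
}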


> 3) Using a network file system that all the web servers can access. I don't
> have much experience in this area, but potentially latency on searches will
> be high?
>

  Generally this is a really bad idea, and it can lead to very
hard-to-debug performance problems.


> 4) Some alternative lucene-specific solution that I haven't found in the
> wiki / lucene documentation.


> The indexes aim to be as real-time as possible; I currently update my
> IndexReaders in a background thread every 20 seconds.
>

This is where things diverge from common practice, especially if at some
point you decide to lower that interval to 10 or 5 seconds.

In this case, I'd say that if you have a reliable, scalable queueing
system for getting indexing events distributed to all of your servers,
then indexing on all replicas simultaneously can be the best way to have
maximally realtime search: either by using the very new "near realtime
search" feature in Lucene 2.9, by using something home-brewed which
indexes in memory and on disk simultaneously, or by using Zoie
( http://zoie.googlecode.com ), an open-source realtime search library
built on top of Lucene, which we built at LinkedIn and have been using in
production for the past year, serving tens of millions of queries a day
in real time (meaning milliseconds, even under fairly high indexing load).
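For the Lucene 2.9 route, the near-realtime path looks roughly like this
(a minimal sketch; the field name and index path are placeholders):

import java.io.File;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class NearRealtimeSketch {
    public static void main(String[] args) throws Exception {
        IndexWriter writer = new IndexWriter(
            FSDirectory.open(new File("/path/to/index")), // placeholder
            new StandardAnalyzer(Version.LUCENE_29),
            IndexWriter.MaxFieldLength.UNLIMITED);

        Document doc = new Document();
        doc.add(new Field("body", "hello realtime world",
                Field.Store.NO, Field.Index.ANALYZED));
        writer.addDocument(doc);

        // getReader() returns a reader that sees the just-added doc
        // without a full commit-and-reopen cycle
        IndexReader reader = writer.getReader();
        IndexSearcher searcher = new IndexSearcher(reader);
        System.out.println("docs visible: " + searcher.maxDoc());

        searcher.close();
        reader.close();
        writer.close();
    }
}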

  -jake mannix