Posted to user@lucenenet.apache.org by Patric Forsgard <pa...@tasteful.se> on 2011/06/07 21:01:46 UTC

[Lucene.Net] Best practices for index on multiple servers.

Hi.

What are the best practices for setting up an index across multiple servers?

I have found the following alternatives. Today I use alternative 1, but it
is difficult to keep every index up to date across multiple instances of the
application.

1, Keep an index on each machine and also update/rebuild it on each
machine.
2, Run the index-writing task on one machine and replicate the committed
content to the other machines.
3, Let all machines share one index location, with one machine responsible
for updating the index. Would it be a problem to run this over a Samba share?
4, Let all machines share one index location, with every machine
responsible for updating the index with its local changes.
5, Let all machines share one index location, with every machine sending its
changes to a "change service" that commits them to the index.

Which is the best choice to try first?

I found the link http://wiki.apache.org/lucene-java/NearRealtimeSearch. Will
that work for an index that is updated from multiple sources?

// Patric
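
As an illustration of alternative 5, here is a minimal sketch of a "change
service" in Python. It is not Lucene.Net code: the `ChangeService` class, its
`submit` and `run_once` methods, and the plain dictionary standing in for the
index are all assumptions made for the example. The only point it shows is
that many front ends can queue changes while a single component applies and
commits them.

```python
import queue
import threading


class ChangeService:
    """Hypothetical single writer: the only component that touches the index."""

    def __init__(self, batch_size=100):
        self._changes = queue.Queue()   # thread-safe queue shared by all front ends
        self._batch_size = batch_size
        self.index = {}                 # stand-in for a real index: doc_id -> text
        self.commits = 0                # number of commits performed so far

    def submit(self, doc_id, text):
        """Called by any front end; text=None means 'delete this document'."""
        self._changes.put((doc_id, text))

    def run_once(self):
        """Apply up to one batch of queued changes and commit them together."""
        applied = 0
        while applied < self._batch_size:
            try:
                doc_id, text = self._changes.get_nowait()
            except queue.Empty:
                break
            if text is None:
                self.index.pop(doc_id, None)
            else:
                self.index[doc_id] = text
            applied += 1
        if applied:
            self.commits += 1           # a real service would commit its writer here
        return applied


def front_end(service, name, count):
    """Simulates one web front end reporting its local changes."""
    for i in range(count):
        service.submit(f"{name}-{i}", f"document {i} from {name}")


service = ChangeService(batch_size=50)
workers = [threading.Thread(target=front_end, args=(service, f"web{n}", 10))
           for n in range(4)]
for worker in workers:
    worker.start()
for worker in workers:
    worker.join()
while service.run_once():               # drain the queue batch by batch
    pass
```

Batching the queued changes means the (expensive) commit happens once per
batch rather than once per document, and no front end ever needs write access
to the index location.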

RE: [Lucene.Net] Best practices for index on multiple servers.

Posted by Moray McConnachie <mm...@oxford-analytica.com>.
In your scenario I would build a search server, and have your local apps
call the search server for results. 

But if you don't want to do that, and the indices are meant to be the
same on all machines (I don't think you make that clear below), could
you have a configuration which when a rebuild is triggered copies a new
version of the index from another instance?

Then, when you need to rebuild, you could take one machine out of the
farm, rebuild the index on it, progressively trigger rebuilds on the
other machines so that they copy the new index files from it, and then
add it back into the farm.
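
A rough sketch of that copy step, in Python rather than .NET, might look like
the following. Everything named here is an assumption for illustration: the
`publish_index` and `live_index_dir` helpers, the versioned directory layout,
and the `CURRENT` pointer file are not part of Lucene.Net. The idea is to copy
the rebuilt index into a new versioned folder and only then switch a small
pointer file, so searchers never see a half-copied index.

```python
import os
import shutil
import tempfile

CURRENT = "CURRENT"  # hypothetical pointer file naming the live index version


def publish_index(source_dir, replica_root, version):
    """Copy a finished index into replica_root/<version>, then switch to it."""
    target = os.path.join(replica_root, version)
    if os.path.exists(target):
        raise FileExistsError(f"version {version!r} was already published")
    staging = target + ".tmp"
    if os.path.exists(staging):
        shutil.rmtree(staging)              # leftover from an interrupted copy
    shutil.copytree(source_dir, staging)    # the slow part happens out of sight
    os.rename(staging, target)
    # Write the new pointer next to the old one, then swap it in with a single
    # os.replace call so readers see either the old version or the new one.
    pointer_tmp = os.path.join(replica_root, CURRENT + ".tmp")
    with open(pointer_tmp, "w", encoding="utf-8") as f:
        f.write(version)
    os.replace(pointer_tmp, os.path.join(replica_root, CURRENT))
    return target


def live_index_dir(replica_root):
    """Return the directory a searcher should open right now."""
    with open(os.path.join(replica_root, CURRENT), encoding="utf-8") as f:
        return os.path.join(replica_root, f.read().strip())


# Small demonstration with throwaway directories.
with tempfile.TemporaryDirectory() as work:
    master = os.path.join(work, "master")
    replica = os.path.join(work, "replica")
    os.makedirs(master)
    os.makedirs(replica)
    with open(os.path.join(master, "segments"), "w", encoding="utf-8") as f:
        f.write("generation 1")
    publish_index(master, replica, "v1")
    live_after_first = os.path.basename(live_index_dir(replica))
```

Old version folders are left in place here, so a searcher that still has one
open is not disturbed. Cleaning them up later is left out of the sketch.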

Yours,
Moray

-------------------------------------
Moray McConnachie
Director of IT    +44 1865 261 600
Oxford Analytica  http://www.oxan.com

---------------------------------------------------------
Disclaimer 

This message and any attachments are confidential and/or privileged. If this has been sent to you in error, please do not use, retain or disclose them, and contact the sender as soon as possible.

Oxford Analytica Ltd
Registered in England: No. 1196703
5 Alfred Street, Oxford
United Kingdom, OX1 4EH
---------------------------------------------------------


Re: [Lucene.Net] Best practices for index on multiple servers.

Posted by Patric Forsgard <pa...@tasteful.se>.
Thanks to both of you for your answers. I will add a little more information
about our hardware and today's setup.

We use a standard Windows environment today. The application is an ASP.NET
application hosted in IIS on MS Windows Server 2008 R2, and for the DB we use
MS SQL 2008 R2. 90% of all searches are paged, with a page size between 10
and 50.

At the moment we have 27 different indexes, split by content and by the
language of the content (one index for each language). I'm not sure whether
this is the best way, or whether it would be better to have a single index
with all the information in it and filter the results by content and
language. The biggest index is around 100 MB, and because we have a separate
index for each language, some language-independent information is stored in
every index.

Today each machine in the farm builds and keeps its own index locally. The
problem is that if one of the machines goes down, we need to rebuild the
index on that machine to make sure it is up to date for the client. Also,
after an upgrade of the application we sometimes need to rebuild the index,
and it's a pain in the ass to synchronize that across multiple front ends
(we use 4 front ends today...).

Inside the application that uses the index, we listen to various .NET events
to hook in and rebuild the necessary index documents. The events are fired
on all machines in the farm.

We also tried storing the index on a network share used by all machines. The
problem there was deciding how to update the index: either one machine is
responsible for all updates, or every machine is responsible for storing its
own changes in the index.

We also had problems with the network share for searching and index writing.
The Kerberos token used for network authentication to the share needed
revalidation, and Lucene.Net threw exceptions because the index folder was
not accessible, so we had to recreate the searcher/writer to get it working
again. This is probably the same as Justin's failure point 1.
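
The "recreate the searcher" workaround can be wrapped so callers do not have
to handle it themselves. The following is only a sketch in Python, not
Lucene.Net code: `ReopeningSearcher` and the `open_searcher` factory are
hypothetical names, and `OSError` stands in for whatever I/O exception the
real searcher throws when the share is unreachable.

```python
class ReopeningSearcher:
    """Retries a search with a freshly opened searcher after an I/O failure."""

    def __init__(self, open_searcher, max_attempts=2):
        if max_attempts < 1:
            raise ValueError("max_attempts must be at least 1")
        # open_searcher is a hypothetical factory: it takes no arguments and
        # returns an object with a search(query) method.
        self._open = open_searcher
        self._max_attempts = max_attempts
        self._searcher = None

    def search(self, query):
        last_error = None
        for _ in range(self._max_attempts):
            try:
                if self._searcher is None:
                    self._searcher = self._open()
                return self._searcher.search(query)
            except OSError as error:
                # Drop the broken searcher so the next attempt opens a new one.
                self._searcher = None
                last_error = error
        raise last_error
```

This hides a short outage from the caller but does not fix the underlying
share problem. If every attempt fails, the last error is still raised.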

Does anyone else have suggestions for how we can change our setup to make
the index easier to maintain and still fast to search?

// Patric

RE: [Lucene.Net] Best practices for index on multiple servers.

Posted by Moray McConnachie <mm...@oxford-analytica.com>.
Check out the following thread from last month, "Server farm sharing
Lucene":

http://mail-archives.apache.org/mod_mbox/lucene-lucene-net-user/201105.mbox/browser

Ben West pointed out that

'The Lucene FAQ
(http://wiki.apache.org/lucene-java/ImproveSearchingSpeed) specifically
warns against using remote file systems. Depending on what you mean by
"network-accessible", it could be a lot slower.'

I would agree with this: it is not a good design to use a remote file
system unless you have small amounts of data or blisteringly quick (say
10G or virtual) networking.

Obviously one way round it is to have the index, indexing apps and
searching apps locally on all machines.

Another way to deal with this, and the one we use, is to run a search
server providing a wrapper to your Lucene-based functionality. This way
you have one indexing application and one searching application.
Depending on the volume of queries, this can be a highly effective way
to do it, particularly if most of your searches return paged results (a
typical web scenario, but not just a web scenario), as the amount of
data returned in each result set is fairly small and therefore highly
transmissible. It also simplifies the deployment of code.
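
To show why paged results keep the traffic small, here is a minimal Python
sketch of the response such a search server might return. The `Page` type and
the `search_page` function are hypothetical, and the already-ranked list of
document ids is an assumption standing in for a real Lucene query result.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Page:
    total_hits: int      # lets the client draw its pager without more data
    page: int            # 1-based page number that was requested
    page_size: int
    doc_ids: List[str]   # only the hits that belong on this page


def search_page(ranked_doc_ids, page, page_size):
    """Return only the requested slice of an already-ranked hit list."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be >= 1")
    start = (page - 1) * page_size
    return Page(
        total_hits=len(ranked_doc_ids),
        page=page,
        page_size=page_size,
        doc_ids=ranked_doc_ids[start:start + page_size],
    )


# 95 hits on the server, but only the 5 on the last page cross the network.
hits = [f"doc{i}" for i in range(95)]
last_page = search_page(hits, page=10, page_size=10)
```

However many documents match, each response carries at most `page_size` ids
plus a few counters.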

You can find out more about how we do things in my response to the above
thread at
http://mail-archives.apache.org/mod_mbox/lucene-lucene-net-user/201105.mbox/%3C71B5A2C57E0AD04F97E997AA244D455E01C90129@BEDIVERE.alfredst.oxford-analytica.local%3E

Moray


-------------------------------------
Moray McConnachie
Director of IT    +44 1865 261 600
Oxford Analytica  http://www.oxan.com


Re: [Lucene.Net] Best practices for index on multiple servers.

Posted by Justin Crossman <ju...@deliriousvisions.com>.
Patric

We have a setup with one DB server, one Index, and 16 front-end web servers. We have a service that runs once every minute to keep the DB in sync with the Index. Each web front end has a Windows share UNC path pointing at the Index. This works for us, but I appreciate the question you've posed: I can give you my story, but I'm very interested in any professional feedback anyone else can offer. We found that while this works, we have a case where we would have liked to hit the Index WAY more often, but the system just wouldn't hold up under the load.

Our configuration works because:

1. Our network is close enough and fast enough

2. We have one single point of data storage accessible by a Windows service

This configuration fails us:

1. When Windows shares fail (I could use some advice on this). I'll have one front end suddenly throw me 1000s of errors because the network connection was no longer available. On review, the connection is always found to be working fine, and if left alone the problem goes away on its own. At the same time, no other front ends are having this issue.

2. In our case, a truly heavy load on the Index isn't possible due to network constraints. To achieve what we need, we'll need an even faster connection/interface (>1GB Ethernet) or to manage an Index on each front end. Or something else?


It sounds like you might be in a (partial) Linux environment, so my experience falls short if that's true. However, insofar as I can offer advice currently, I recommend against 1 because replication is a real pain; against 2 because, while replication isn't as complicated, you still need some way to know when replication has failed; against 4 because I don't know if it'll work (it may just be beyond me at the moment); and against 3 because we found that managing the whole Index (if large) remotely was unreasonably resource-intensive. We never did figure out how to do it successfully, so we always manage our Index on the same server it's stored on.

For clarity, our infrastructure is completely Windows-based as it pertains to this area. Windows Server 2008, MS SQL 2008 R2, IIS 7, etc.

I'm very interested to see what comes of this discussion as I would love to be doing this as well as I can be (no errors, high load capacity, ease of administration, etc).

Best,

:: Justin
