Posted to java-user@lucene.apache.org by Nigel <ni...@gmail.com> on 2009/10/01 23:15:14 UTC

Efficiently reopening remotely-distributed indexes in 2.9?

I have a question about the reopen functionality in Lucene 2.9.  As I
understand it, since FieldCaches are now per-segment, it can avoid reloading
everything when the index is reopened, and instead just load the new
segments.

For background, like many people we have a distributed architecture where
indexes are created on one server and copied to multiple other servers.  The
way that copying works now is something like the following:

   1. Let's say the current index is in /indexes/a and is open
   2. An empty directory for the updated index is created, let's say
   /indexes/b
   3. Hard links for the files in /indexes/a are created in /indexes/b
   4. We rsync the current index on the server with /indexes/b, thus copying
   over new cfs files and deleting hard links to files no longer in use
   5. A new IndexReader is opened for /indexes/b and warmed up
   6. The application starts using the new reader instead of the old one
   7. The old IndexReader is closed and /indexes/a is deleted

I'm simplifying a few steps, but I think this is familiar to many people,
and it's my impression that Solr implements something similar.
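The snapshot-and-swap scheme above can be sketched as a script. This is a minimal illustration, not the real replication code: paths and segment file names are invented, and plain file operations stand in for the remote rsync in step 4.

```shell
#!/bin/sh
# Sketch of the hard-link snapshot scheme described above. Paths and
# segment file names are illustrative; in the real setup step 4 is an
# rsync from the indexing server, simulated here with plain file ops.
set -e
BASE=$(mktemp -d)

# Step 1: the current live index in indexes/a
OLD="$BASE/indexes/a"
mkdir -p "$OLD"
printf 'data' > "$OLD/_0.cfs"
printf 'gen1' > "$OLD/segments_1"

# Steps 2-3: empty indexes/b, then hard links (no data is copied)
NEW="$BASE/indexes/b"
mkdir -p "$NEW"
ln "$OLD"/* "$NEW"/

# Step 4: rsync would add new files and drop links to obsolete ones,
# e.g. rsync -a --delete indexer:/index/ "$NEW"/ -- simulated here:
printf 'data' > "$NEW/_1.cfs"       # a newly arrived segment
printf 'gen2' > "$NEW/segments_2"   # the new commit point, written last
rm "$NEW/segments_1"                # hard link to a file no longer in use

# Steps 5-6 happen in the application: open and warm an IndexReader on
# $NEW, then swap it in for the old reader.

# Step 7: close the old reader, then delete the old directory. On Linux
# the old reader's open file descriptors keep its files alive until close.
rm -rf "$OLD"
ls "$NEW"
```

Because step 3 uses hard links, /indexes/b starts out costing almost no disk space or copy time; only files that actually changed are transferred in step 4.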

The point is, the updated index lives in a new directory in this scheme, and
so we don't actually reopen the existing IndexReader; we open a new one with
a different FSDirectory.

Before Lucene 2.9, I don't think this made any difference, as (I think) the
only advantage to calling reopen vs. just creating another IndexReader was
having reopen figure out whether the index had actually changed.  (And we
have a different way to figure that out, so it was a non-issue.)

With Lucene 2.9, there's now a big difference, namely the per-segment
caching mentioned above.  So the question is how to make use of reopen with
our distribution scheme.  Is there an informal best practice for handling
this case?  For example, should step #5 above rename /indexes/b to
/indexes/a so the index can be reopened in the same physical location?  Or
should rsync operate on the existing directory in-place, updating the
segments* files last and relying on the fact that deleted files will not
really be deleted (on Linux, at least) as long as the app is still holding
them open?

I guess the answer may depend on how exactly reopen knows which files are
the "same" (e.g. does it look at filenames, or file descriptors, etc.).

Thanks,
Chris

Re: Efficiently reopening remotely-distributed indexes in 2.9?

Posted by Michael Busch <bu...@gmail.com>.

On 10/5/09 5:30 PM, Nigel wrote:
>> Before Lucene 2.9, I don't think this made any difference, as (I think) the
>> only advantage to calling reopen vs. just creating another IndexReader was
>> having reopen figure out whether the index had actually changed.  (And we
>> have a different way to figure that out, so it was a non-issue.)

There was a big difference before too: reopen() in 2.4.x only loads 
internal data structures from new segments, like the terms dictionary 
and norms. The performance improvements were significant already (see 
https://issues.apache.org/jira/browse/LUCENE-743?focusedCommentId=12532585&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12532585). 
Now in 2.9 this also works for field caches.

  Michael

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Efficiently reopening remotely-distributed indexes in 2.9?

Posted by Mark Miller <ma...@gmail.com>.
I keep considering a full response to this, but I just can't get over
the hump and spend the time writing something up. Figured someone else
would get to it - perhaps they still will.

I will make a comment here though:

>Before Lucene 2.9, I don't think this made any difference, as (I think) the
>only advantage to calling reopen vs. just creating another IndexReader was
>having reopen figure out whether the index had actually changed.  (And we
>have a different way to figure that out, so it was a non-issue.)

That's not quite right. Reopen did not just check if the index had
changed - it also only reloaded the segments that changed. The big
change allowed by per segment searching is that now only the *FieldCache
pieces that have changed* are also reloaded, rather than the whole
FieldCache. So reopen was nice and advantageous before - but now it is
more so if you are using FieldCaches.




-- 
- Mark

http://www.lucidimagination.com






Re: Efficiently reopening remotely-distributed indexes in 2.9?

Posted by Nigel <ni...@gmail.com>.
Got it -- thanks, Mark!  (Recently I read elsewhere in the archives of this
list about the value or lack thereof of segments.gen, so skipping that file
was in the back of my mind as well.)

Chris

On Thu, Oct 8, 2009 at 3:04 PM, Mark Miller <ma...@gmail.com> wrote:


Re: Efficiently reopening remotely-distributed indexes in 2.9?

Posted by Mark Miller <ma...@gmail.com>.
Nigel wrote:
> Thanks, Mark.  That makes sense.  I guess if you do it in the right order,
> you're guaranteed to have the files in a consistent state, since the only
> thing that's actually overwritten is the segments.gen file at the end.
>   
The main thing to do is to copy the segments_N files last - that's what a
reader will use to see that there is a new index version. The segments.gen
file is a backup of last resort that shouldn't be needed unless you're on
NFS, from what I know.
> What about the technique of creating a copy of the directory with hard links
> and rsyncing changes into that copy?  Is that only necessary if you want to
> be using the old and updated versions of the index simultaneously?
>   
I think this was only necessary before IndexDeletionPolicies - you didn't
want the IndexWriter removing the files before you were done copying
them out. You can manage that with a delete policy now though.


-- 
- Mark

http://www.lucidimagination.com






Re: Efficiently reopening remotely-distributed indexes in 2.9?

Posted by Nigel <ni...@gmail.com>.
Thanks, Mark.  That makes sense.  I guess if you do it in the right order,
you're guaranteed to have the files in a consistent state, since the only
thing that's actually overwritten is the segments.gen file at the end.

What about the technique of creating a copy of the directory with hard links
and rsyncing changes into that copy?  Is that only necessary if you want to
be using the old and updated versions of the index simultaneously?

Thanks,
Chris

On Wed, Oct 7, 2009 at 4:02 PM, Mark Miller <ma...@gmail.com> wrote:


Re: Efficiently reopening remotely-distributed indexes in 2.9?

Posted by Mark Miller <ma...@gmail.com>.
Solr just copies them into the same directory - Lucene files are write
once, so it's not much different than what happens locally.



-- 
- Mark

http://www.lucidimagination.com






Re: Efficiently reopening remotely-distributed indexes in 2.9?

Posted by Nigel <ni...@gmail.com>.
Right now we logically re-open an index by making an updated copy of the
index in a new directory (using rsync etc.), opening the new copy, and
closing the old one.  We don't use IndexReader.reopen() because the updated
index is in a different directory (as opposed to being updated in-place).

(Reading about some of the 2.9 changes motivated me to look into actually
using reopen().  And Michael Busch and Mark Miller both pointed out that I
was incorrect in saying that pre-2.9 reopen() wasn't more efficient than
just opening a new index -- I've read through that code now so I have at
least a basic understanding of what's happening there.  Anyway, it seems
like reopen() is a Good Thing, so I'd like to use it. (-:)

So, my real question was whether there is a "recommended" way to update an
index in-place with files copied from a separate indexing server.

For example, do you simply rsync in the new cfs files, overwrite the
segments.gen and segments_XX files, and call reopen()?  Or create an updated
copy in a new directory, then rename the new directory to the old name once
you're sure you've copied everything successfully, then call reopen()?  What
does Solr do?

Thanks,
Chris
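
The filesystem side of that rename variant can be sketched like this (paths are illustrative; on a single filesystem mv is atomic, and an already-open reader keeps its old files alive through its open file descriptors):

```shell
#!/bin/sh
# Sketch of the rename-into-place variant: build the updated index in
# indexes/b, then swap it to the old name so the reader's path is stable.
set -e
BASE=$(mktemp -d)
mkdir -p "$BASE/indexes/a" "$BASE/indexes/b"
printf 'gen1' > "$BASE/indexes/a/segments_1"   # live index
printf 'gen2' > "$BASE/indexes/b/segments_2"   # freshly synced copy

mv "$BASE/indexes/a" "$BASE/indexes/a.old"  # move the live dir aside
mv "$BASE/indexes/b" "$BASE/indexes/a"      # updated index takes its place
# The application would now reopen against /indexes/a, then close the
# old reader before cleaning up:
rm -rf "$BASE/indexes/a.old"
ls "$BASE/indexes/a"
```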

On Mon, Oct 5, 2009 at 8:39 PM, Jason Rutherglen <jason.rutherglen@gmail.com> wrote:


Re: Efficiently reopening remotely-distributed indexes in 2.9?

Posted by Jason Rutherglen <ja...@gmail.com>.
I'm not sure I understand the question. You're trying to reopen
the segments that you've replicated and you're wondering what's
changed in Lucene?

On Mon, Oct 5, 2009 at 5:30 PM, Nigel <ni...@gmail.com> wrote:



Re: Efficiently reopening remotely-distributed indexes in 2.9?

Posted by Nigel <ni...@gmail.com>.
Anyone have any ideas here?  I imagine a lot of other people will have a
similar question when trying to take advantage of the reopen improvements in
2.9.

Thanks,
Chris

On Thu, Oct 1, 2009 at 5:15 PM, Nigel <ni...@gmail.com> wrote:
