Posted to dev@lucy.apache.org by Marvin Humphrey <ma...@rectangular.com> on 2010/11/10 16:49:34 UTC

[lucy-dev] Index modernizer

Greets,

As the index format changes, we accumulate cruft in our codebase to support
old indexes and old segments.  At some point, we need to purge such cruft and
abandon support for old indexes.  But if you are a user, it's hard to know
whether your index has old segments in it, and whether you can upgrade safely
to a given version of the library.

In theory, you can launch an Indexer, call Optimize(), and force it to rewrite
your index as one large segment.  But that hasn't always worked reliably, in
either Lucene or KinoSearch, because modernizing is orthogonal to optimizing
for search speed.  Both libraries, at one time or another, have detected the
case of an index with a single segment with no deletions, at which point they
decide that the index is already optimized and bail out.
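To make the failure mode concrete, here is a minimal Python sketch. It is not code from Lucene or KinoSearch: `Segment`, `is_optimized`, `needs_modernizing`, and the integer format versions are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass

CURRENT_FORMAT = 3  # hypothetical: the newest segment format version


@dataclass
class Segment:
    name: str
    format_version: int
    deletions: int = 0


def is_optimized(segments):
    """Search-speed test: exactly one segment and no deletions."""
    return len(segments) == 1 and segments[0].deletions == 0


def needs_modernizing(segments):
    """Format test: at least one segment was written in an older format."""
    return any(s.format_version < CURRENT_FORMAT for s in segments)


# A single old-format segment with no deletions: an optimizer that only
# consults is_optimized() bails out, yet the segment is still stale.
index = [Segment("seg_1", format_version=1)]
print(is_optimized(index), needs_modernizing(index))
```

The two predicates answer different questions, which is why an optimize pass cannot be relied on to modernize.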

I think a strategy dedicated specifically to modernization of an index is
called for.  For Lucy, it can be achieved with an application combining a
BackgroundMerger and an IndexManager which implements a custom merge policy.
Instead of rewriting to one large segment, this modernizer app should launch a
BackgroundMerger once for each segment, rewriting them one at a time.  Once
all segments are brought up to date, the app exits.

If possible, the modernizer should not rewrite segments that already use the
most up-to-date format.  This will be possible so long as the user has not
subclassed Architecture to plug in custom index components.  Under the default
Architecture, the stack of writers is known and finite, and we can easily
determine whether a given segment uses the most modern format for each
component.  

If, on the other hand, a user has subclassed Architecture, we have to punt and
rewrite all segments.  Even that may not be sufficient, depending on whether
custom components operate outside of the segment system -- but that's a
far-off theoretical case, and I don't think adding an abstract Modernize()
method to DataWriter which all components must implement is justified.
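The selection logic described above might be sketched as follows. This is Python for illustration only, not Lucy's API: the `LATEST` table, the component names, and the `rewrite` callback (standing in for one BackgroundMerger run) are assumptions.

```python
from dataclasses import dataclass, field

# Hypothetical: newest format version per component under the default
# Architecture.  The component names and numbers are made up.
LATEST = {"lexicon": 3, "postings": 2, "documents": 1}


@dataclass
class Segment:
    name: str
    formats: dict = field(default_factory=dict)  # component -> format version


def segments_to_rewrite(segments, custom_architecture=False):
    """Return the segments a modernizer pass should rewrite."""
    if custom_architecture:
        # Unknown components: formats can't be inspected, so rewrite everything.
        return list(segments)
    # A segment is stale if any known component is missing or not at the
    # latest version.
    return [
        seg for seg in segments
        if any(seg.formats.get(comp) != ver for comp, ver in LATEST.items())
    ]


def modernize(segments, rewrite, custom_architecture=False):
    """Rewrite stale segments one at a time via the `rewrite` callback."""
    for seg in segments_to_rewrite(segments, custom_architecture):
        rewrite(seg)
```

Passing `custom_architecture=True` models the "punt and rewrite all segments" case; it still cannot cover components that operate outside the segment system.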

I'm torn as to where to implement this functionality.  Since it may be
necessary to load custom classes, e.g. FieldType or Schema subclasses, that
suggests a Cookbook/sample app which the user might modify.  On the other
hand, if we are going to require that users run this app in order to upgrade
-- and we will, sooner or later -- maybe there ought to be a core class,
Lucy::Index::Modernizer...  Probably best to start with Cookbook/sample code
which makes no public API promises, methinks...

Marvin Humphrey

Re: [lucy-dev] Index modernizer

Posted by Nathan Kurz <na...@verse.com>.

On Thu, Nov 18, 2010 at 1:37 PM, Marvin Humphrey <ma...@rectangular.com> wrote:
>> But if you were to make it automatic and easy, I'd concentrate more on
>> automatic imports from Lucene than from KS. :)
>
> I doubt that Lucene imports will ever happen.  Certainly I'm never going to
> write that code.  The Lucene file format is much too complicated.

Perhaps an insidious inside job then: export rather than import.
Presumably it would be fairly easy to write a plugin that would be
capable of writing out a Lucy file from within Lucene.  This might be
a fine project for a Java programmer with Lucene experience to gain
familiarity with the Lucy file format.   Any lurkers fit this
description? :)

In general, I do think it would be useful to be able to benchmark
against Lucene.  One doesn't need to have full compatibility, but the
ability to go back and forth for the simple stuff seems like it would
be a boon.  If the Lucy format is simple enough, one could even make
things fair by adding a Lucy import option to Lucene, so someone can
easily see which one fits best for their needs.

--nate

Re: [lucy-dev] Index modernizer

Posted by WARC <vo...@gmail.com>.

Hi List,

>> But if you were to make it automatic and easy, I'd concentrate more on
>> automatic imports from Lucene than from KS. :)
>
> I doubt that Lucene imports will ever happen.  Certainly I'm never going to
> write that code.  The Lucene file format is much too complicated.

Being able to import Lucene indexes will attract people!

N.B: is the Lucy file format documented somewhere?

cheers
V.

Re: [lucy-dev] Index modernizer

Posted by Marvin Humphrey <ma...@rectangular.com>.

On Thu, Nov 18, 2010 at 10:48:54AM -0800, Nathan Kurz wrote:
> It feels like generous severance pay for fired executives.  In theory,
> it helps you hire the next person if they see how well you treated the
> outgoing one. In practice, it tends to look like wasted money that
> could be better spent.  Yes, you want to treat your existing users
> well, but be careful about spending too much effort on them and not
> enough on improving the product.   Concentrate most of your efforts on
> the 10x as many new users you hope to have and provide them with the
> best experience you can.

I agree that we should privilege active developers.
 
> It's open source software.  Perhaps the automated solution could be
> provided to the community by someone who needs it, rather than by the
> main developer?  At least you could wait for their plaintive wailing.
> Unless of course you need it internally, in which case it would be a
> great thing for you to be working on. 

I've already written the patch which allows Lucy to read indexes written by
recent versions of KinoSearch.  :)  All it does is alias a few class names so
that deserialization of schema files doesn't blow up.

> But if you were to make it automatic and easy, I'd concentrate more on
> automatic imports from Lucene than from KS. :)

I doubt that Lucene imports will ever happen.  Certainly I'm never going to
write that code.  The Lucene file format is much too complicated.

Marvin Humphrey

Re: [lucy-dev] Index modernizer

Posted by Nathan Kurz <na...@verse.com>.

On Thu, Nov 11, 2010 at 8:18 AM, Marvin Humphrey <ma...@rectangular.com> wrote:
>> Maybe write the cookbook recipe for upgrading KS to Lucy, and then we can
>> see if it needs to be formalized into a part of the core?
>
> OK, sounds like the cookbook approach is feasible and prudent.  We don't
> technically *need* the modernizer until we decide to drop support for an old
> format, though.

Since I'm catching up on this list today:  Even without the italics,
you don't need the modernizer *at all*.  No one has to make the
transition; they can keep using the software they've been using, so
you're not hurting them by not making it automatic.

It feels like generous severance pay for fired executives.  In theory,
it helps you hire the next person if they see how well you treated the
outgoing one. In practice, it tends to look like wasted money that
could be better spent.  Yes, you want to treat your existing users
well, but be careful about spending too much effort on them and not
enough on improving the product.   Concentrate most of your efforts on
the 10x as many new users you hope to have and provide them with the
best experience you can.

It's open source software.  Perhaps the automated solution could be
provided to the community by someone who needs it, rather than by the
main developer?  At least you could wait for their plaintive wailing.
Unless of course you need it internally, in which case it would be a
great thing for you to be working on.  But if you were to make it
automatic and easy, I'd concentrate more on automatic imports from
Lucene than from KS. :)

--nate

Re: [lucy-dev] Index modernizer

Posted by Marvin Humphrey <ma...@rectangular.com>.

On Thu, Nov 11, 2010 at 09:22:00AM -0600, Peter Karman wrote:
> > As the index format changes, we accumulate cruft in our codebase to support
> > old indexes and old segments.  At some point, we need to purge such cruft and
> > abandon support for old indexes.  But if you are a user, it's hard to know
> > whether your index has old segments in it, and whether you can upgrade safely
> > to a given version of the library.
> 
> You're describing the back compat path for KS users switching to Lucy, yes?

The "modernizer" approach addresses a general problem for the Lucy/Lucene
segmented index design, and it will be useful at every major index format
break going forward.  But yes, I'm thinking that the first use case would be
dropping support for segments originally written under KinoSearch.

I've recently whipped up a patch that allows Lucy to read KinoSearch indexes.
All that we need to do is alias a few class names so that deserializing a
Schema works properly -- i.e. when the Schema JSON file contains a serialized
"KinoSearch::Analysis::CaseFolder", the object that emerges from the
deserializer is a Lucy::Analysis::CaseFolder.
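As a rough illustration of the aliasing idea (in Python, and not the actual patch), one could rewrite class names in the decoded Schema dump before handing it to the deserializer. The `_class` key, the blanket prefix swap, and the sample fragment are assumptions; the real patch aliases only a few specific class names.

```python
import json

# Assumption: every aliased class differs only by its namespace prefix.
OLD_PREFIX = "KinoSearch::"
NEW_PREFIX = "Lucy::"


def alias_class_names(node, class_key="_class"):
    """Return a copy of a decoded schema dump with old class names aliased."""
    if isinstance(node, dict):
        out = {}
        for key, value in node.items():
            if (key == class_key and isinstance(value, str)
                    and value.startswith(OLD_PREFIX)):
                # Swap the namespace prefix, keep the rest of the class name.
                out[key] = NEW_PREFIX + value[len(OLD_PREFIX):]
            else:
                out[key] = alias_class_names(value, class_key)
        return out
    if isinstance(node, list):
        return [alias_class_names(item, class_key) for item in node]
    return node  # strings, numbers, booleans, None pass through unchanged


# Hypothetical schema fragment, not copied from a real index.
dump = json.loads('{"analyzer": {"_class": "KinoSearch::Analysis::CaseFolder"}}')
print(alias_class_names(dump))
```

The function builds a new structure rather than mutating its input, so the on-disk schema is left untouched and only the in-memory view is aliased.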

However, the Lucy codebase currently supports a number of obsolete KinoSearch
segment formats, and it would be nice to drop that support and clean out the
cruft at some time in the future.  For whatever Lucy release we decide to do
that on, we would announce that users who had migrated indexes from KinoSearch
need to run the modernizer.  (Indexes initialized under Lucy would not need
modernization, as we could guarantee that they were written in a recent
format.)

Providing a clean migration path for KinoSearch users allows us to put
KinoSearch into maintenance mode and focus exclusively on developing Lucy.  At
the same time, having the modernizer in reserve holds the promise that we
won't be burdened forever by the backwards compatibility requirements of old
index formats.

> Maybe write the cookbook recipe for upgrading KS to Lucy, and then we can
> see if it needs to be formalized into a part of the core?

OK, sounds like the cookbook approach is feasible and prudent.  We don't
technically *need* the modernizer until we decide to drop support for an old
format, though.

Marvin Humphrey

Re: [lucy-dev] Index modernizer

Posted by Peter Karman <pe...@peknet.com>.

Marvin Humphrey wrote on 11/10/2010 09:49 AM:
> Greets,
> 
> As the index format changes, we accumulate cruft in our codebase to support
> old indexes and old segments.  At some point, we need to purge such cruft and
> abandon support for old indexes.  But if you are a user, it's hard to know
> whether your index has old segments in it, and whether you can upgrade safely
> to a given version of the library.
> 

You're describing the back compat path for KS users switching to Lucy, yes?

> I'm torn as to where to implement this functionality.  Since it may be
> necessary to load custom classes, e.g. FieldType or Schema subclasses, that
> suggests a Cookbook/sample app which the user might modify.  On the other
> hand, if we are going to require that users run this app in order to upgrade
> -- and we will, sooner or later -- maybe there ought to be a core class,
> Lucy::Index::Modernizer...  Probably best to start with Cookbook/sample code
> which makes no public API promises, methinks...
> 

Yes. Maybe write the cookbook recipe for upgrading KS to Lucy, and then
we can see if it needs to be formalized into a part of the core?

-- 
Peter Karman  .  http://peknet.com/  .  peter@peknet.com