Posted to dev@lucy.apache.org by Marvin Humphrey <ma...@rectangular.com> on 2010/01/13 22:56:33 UTC

[KinoSearch] Release strategies (was "fields and swish3")

cc to lucy-dev...

On Tue, Jan 12, 2010 at 07:50:46PM -0600, Peter Karman wrote:
> one-to-one field names is cool as long as FullText is sortable.

OK, glad that'll work.

> Of course, my next question is predictable: when will 0.30 be fully cooked? 

The goal has been to get the KS dev branch file format and API stable before
displacing the stable branch.  However, there are still difficult file format
problems to solve, particularly with regards to term dictionaries and posting
lists -- such as support for multi-stream posting formats and indexing of
non-text field types.

I haven't really been working on those problems too much lately.  After
reaching our goals for index opening speed and integration of memory mapped
sort caches last year, I could have gone back to that -- but instead, I've
gone to work on Lucy.  Lucy isn't that far off at this point: N months.  

There are some people who are using stable branch KS and who would be
disrupted if we simply clobber the stable branch by releasing the dev branch
on top of it, e.g. the MojoMojo folks
(<http://mojomojo.org/features#Searching>).  I'm reluctant to do that, since
we haven't reached our goals for file format and API stability.  Yeah, they
were warned by the "alpha" label, but KS has also been promising a level of
stability which we have yet to deliver.  A one-time painful switch might have
been OK, but forcing them back into an ongoing dev cycle isn't.

To avoid disrupting such users, we could take one of two paths:

  * Fork the current stable release under "KinoSearch0" and expect existing
    users to switch.
  * Move the dev branch (svn trunk) under "KinoSearch2" and release it as an
    alpha.  (I lean towards this option because it sets a precedent for how I
    think we'll need to handle versioning in Lucy.)

If we'd managed to launch Lucy by now, this question would be academic,
because Lucy would have become the successor to the KS dev branch.  And I've
kind of been working on Lucy with that in mind.

Lucy remains my main goal.  From a marketing perspective, I'm not sure that
it's ideal to launch "KinoSearch2" as an alpha, then deprecate it in favor of
Lucy a few months later.  And once Lucy is launched in earnest and people
outside our small circle start contributing, KS will have to be deprecated
because licensing issues will eventually prevent us from backporting some
important chunk of Lucy code to KS.

So that's why I've been kind of keeping my head down and working feverishly on
Lucy.  I figured we'd get Lucy out as an alpha, grow its user base by
releasing Ruby and Python bindings, then harness the excitement from that to
work on the difficult problems that have held back KS.  Designing a pluggable
indexing framework is hard; it's almost impossible without a large user base,
since only a small subset of users will be in a situation where they can test
drive the pluggability features and help us refine the API.

> And, how can I help?

You and Nate have been very helpful with regards to code and API review.  If
we go down the current path towards Lucy, I'd ask you to continue exploring
new areas and providing feedback about how it went.

If we decide to make a formal CPAN release of dev branch somehow, there will
be some mechanical work to do.  If you wanted to do that, you could -- but I'm
under the impression that you don't have that much time (compared with the 60
or so hours I've been putting in each week) and I don't want to squander a
limited resource.

If I could go back in time, I would have released the KS 0.20 branch under the
namespace "KinoSearch2" and the 0.30 branch under "KinoSearch3".  Maybe that
points the way forward.  Whatever we do, though, I'm determined not to let
progress towards Lucy flag again.

Marvin Humphrey

Re: Release strategies

Posted by Peter Karman <pe...@peknet.com>.

Marvin Humphrey wrote on 01/21/2010 11:01 PM:

> So that's one item checked off the TODO list.
> 

\o/


-- 
Peter Karman  .  http://peknet.com/  .  peter@peknet.com

Re: [KinoSearch] Release strategies

Posted by Marvin Humphrey <ma...@rectangular.com>.

On Thu, Jan 14, 2010 at 08:22:32PM -0800, Marvin Humphrey wrote:

> Then there's one long-standing file format bug: 
> 
>   * Skipping on SegPostingList is disabled because the skip files are broken.
> 
> I'd like to get that one taken care of before we make a non-dev release, as it
> will help performance for a variety of queries.

I believe I have finally squashed this bug with KS commit r5729, which
re-enables the skipping optimization on SegPostingList.  It turns out to have
been a search-time problem in SegPostingList after all, rather than something
hiding in PostingsWriter's main loop -- but that didn't become clear until
after I'd cleaned out most of the
PostingsWriter/PostingPool/PostingPoolQueue/SortExternal/SortExRun rat's nest.

So that's one item checked off the TODO list.

Marvin Humphrey

Re: Release strategies

Posted by Peter Karman <pe...@peknet.com>.

Marvin Humphrey wrote on 1/22/10 6:32 PM:
> On Thu, Jan 14, 2010 at 08:22:32PM -0800, Marvin Humphrey wrote:
>>> How can I help move toward a KS 0.30 release?
>> I'll draw up a todo list.
> 
> I've updated the TODO list on the KinoSearch wiki:
> 
>   http://www.rectangular.com/kinosearch/wiki/ToDoList
> 
> To summarize, I think we need to move in three stages:

I'm still not sure that it's really necessary to fork a KinoSearch2, and I have 
some worry that it might confuse more than help. But I'll defer to you, Marvin, 
on this since this is your baby.

Thanks for responding to my original plea in such a gracious way.

-- 
Peter Karman  .  http://peknet.com/  .  peter@peknet.com

Re: [KinoSearch] Release strategies

Posted by Marvin Humphrey <ma...@rectangular.com>.

On Thu, Jan 14, 2010 at 08:22:32PM -0800, Marvin Humphrey wrote:
> > How can I help move toward a KS 0.30 release?
> 
> I'll draw up a todo list.

I've updated the TODO list on the KinoSearch wiki:

  http://www.rectangular.com/kinosearch/wiki/ToDoList

To summarize, I think we need to move in three stages:

First, we need a KinoSearch 0.30_08 release, which must be index-compatible
with 0.30_072.  There has been a lot of churn in the code base since 0.30_07
came out, and it would be nice to give our users the option of downgrading
without needing to reindex if 0.30_08 turns out to have problems.

Next, we need a KinoSearch 0.30_09 release, where we make some changes to the
index format which earlier releases will not be able to read.

After that, we can release KinoSearch2.  The main thing I think we should do
in between KinoSearch 0.30_09 and KinoSearch2 is finish the class
reorganization we discussed a little while back (moving Schema and FieldType
under KinoSearch2::Plan, Searcher to KinoSearch2::Search::IndexSearcher, etc.).

One thing that is omitted from the TODO list for KinoSearch2 is opening up the
APIs for classes under Store -- InStream, OutStream, FileHandle, DirHandle --
which have been substantially improved since 0.30_07 came out and are close to
ready for public use.  I had originally planned to make them available for
0.30_08, but they shouldn't block.

Marvin Humphrey

Re: [KinoSearch] Release strategies

Posted by Marvin Humphrey <ma...@rectangular.com>.

On Thu, Jan 14, 2010 at 10:20:34AM -0600, Peter Karman wrote:
> Are the file format problems actually bugs in the current format, or features 
> you would like to see added?

Both.  With regards to features, there are two:

  * Make term dictionaries pluggable to work with non-text field types.
  * Make posting formats able to work with multiple streams.

We can probably handle those without hard compatibility breaks; it will just
be more of a pain. 

Then there's one long-standing file format bug: 

  * Skipping on SegPostingList is disabled because the skip files are broken.

I'd like to get that one taken care of before we make a non-dev release, as it
will help performance for a variety of queries.
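
As a rough illustration of what skipping buys at search time, here is a
minimal sketch in Python.  It is not KinoSearch's SegPostingList or its skip
file format; the class and its fields are hypothetical, and the skip data is
held in memory rather than read from a file.

```python
from bisect import bisect_right


class SkippingPostingList:
    """Ascending doc ids plus a sparse skip index (hypothetical sketch)."""

    def __init__(self, doc_ids, interval=4):
        self.doc_ids = list(doc_ids)       # must be in ascending order
        self.interval = interval
        # One skip entry per `interval` postings: the doc id at that position.
        self.skip_docs = self.doc_ids[::interval]
        self.pos = 0                       # index of the next candidate
        self.steps = 0                     # postings actually visited

    def advance(self, target):
        """Move to the first doc id >= target; return it, or None if done."""
        # Last skip entry whose doc id is <= target.  Every posting before
        # it is smaller than target, so it is safe to jump straight there.
        block = bisect_right(self.skip_docs, target) - 1
        if block >= 0:
            self.pos = max(self.pos, block * self.interval)
        # Finish with a short linear scan inside the block.
        while self.pos < len(self.doc_ids):
            self.steps += 1
            doc = self.doc_ids[self.pos]
            if doc >= target:
                return doc
            self.pos += 1
        return None


postings = SkippingPostingList([2, 5, 9, 14, 20, 27, 33, 41, 48, 56, 63, 70])
print(postings.advance(50), postings.steps)
```

On that 12-entry list, reaching the first doc id >= 50 visits 2 postings
instead of the 10 a plain linear scan would touch.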

> CPAN's versioning is not ideal in that regard.

I'll be less kind.  It's grievously flawed.

But these problems are also just hard to avoid by nature when dealing with
dynamic dependencies and global namespaces.  We have the same problem with C.

> However, there are already checks in Build.PL for incompatible index formats, 

If we release into a new "KinoSearch2" namespace, those checks can be
discarded.  They're pretty unfriendly to the CPAN toolchain -- you can get
yourself on a CPAN Testers blacklist by hanging on manual user input.

> Rebuilding an index is not the end of the world. We (and by we I mean search 
> developers) do it all the time, even with big doc corpora.

If you have a lot of fast moving indexes, and an expectation of uninterrupted
up-time, it can be difficult to schedule swaps.  That's our situation here at
Eventful.  We could do it, but it would be a major PITA.

Ironically, this pushes in the direction of a release, because managing a hard
compat break would probably cost us more than having me write bridge code.

> Small, stable, incremental and frequent releases to CPAN. I've been converted to 
> that idea.

I've also seen the benefits of date-driven release schedules, for instance as
now practiced by the Perl 5 Porters.  Jesse Vincent has managed to get a bunch
of good devs more highly involved by separating the roles of release manager
and pumpking.

In the abstract, I'd like to try that.  It would require changes to my personal
development routines, and Git is better suited to it than Subversion, but I
think we could make it work.  

> What about #3: stabilize svn trunk and release it as KS 0.30.
> 
> When there's another index compat change, release it as KS 0.40, etc.

I don't want to screw over the MojoMojo people, the Socialtext people, etc. By
releasing into a new namespace we give our users a lot more options for
transitioning.

I actually think we might be able to go a long way without hard compat breaks
in the file format.  Maybe even from here on out.  Now that all metadata is in
JSON, the sort cache issue is solved, and we have a provisional implementation
for pluggable index components, it should be a lot easier to keep compat
across a transitional release at least, and usually longer.  

> How can I help move toward a KS 0.30 release?

I'll draw up a todo list.

Marvin Humphrey

Re: Release strategies (was "fields and swish3")

Posted by Peter Karman <pe...@peknet.com>.

Marvin Humphrey wrote on 1/13/10 3:56 PM:

>> Of course, my next question is predictable: when will 0.30 be fully cooked? 
> 
> The goal has been to get the KS dev branch file format and API stable before
> displacing the stable branch.  However, there are still difficult file format
> problems to solve, particularly with regards to term dictionaries and posting
> lists -- such as support for multi-stream posting formats and indexing of
> non-text field types.

Are the file format problems actually bugs in the current format, or features 
you would like to see added? IMO there's a big difference.

I understand your long-time aversion to changing the index format between 
releases, and how that can break existing indexes in the wild. I've been bitten 
by that situation with other Perl projects from CPAN where ModuleA gets 
installed as part of a regular sysadmin upgrade because it is pulled in as a 
dependency by ModuleB, and then my code that depends on ModuleA suddenly stops 
working. CPAN's versioning is not ideal in that regard.

However, there are already checks in Build.PL for incompatible index formats, 
and KS is not likely to be pulled in blindly as a dependency. That is, if 
someone is installing KS from CPAN, they are doing it intentionally and Build.PL 
will (or could be made to) help prevent them from shooting themselves.

Rebuilding an index is not the end of the world. We (and by we I mean search 
developers) do it all the time, even with big doc corpora.

The perfect is the enemy of the good. I.e., if we wait until the perfect, 
ultimate file format is finished, a stable KS 0.30 release might never see CPAN 
till KS is made obsolete by Lucy. That would be too bad, I think, because there 
are sooo many good improvements in the .30 branch, stable and trunk, that people 
could be taking advantage of without having to install a dev release or keep up 
with svn trunk.

Small, stable, incremental and frequent releases to CPAN. I've been converted to 
that idea. I'm trying now to convince you, Marvin. How am I doing? :)


> 
> I haven't really been working on those problems too much lately.  After
> reaching our goals for index opening speed and integration of memory mapped
> sort caches last year, I could have gone back to that -- but instead, I've
> gone to work on Lucy.  Lucy isn't that far off at this point: N months.  
> 

That's great. Really. And as one of your faithful commit list readers, I applaud 
everything you've achieved so far. It's monumental. It's good.

I just worry about the perfect being the enemy of the good. It's something I've 
struggled with myself wrt Swish3, which has been gestating about as long as KS has.

> There are some people who are using stable branch KS and who would be
> disrupted if we simply clobber the stable branch by releasing the dev branch
> on top of it, e.g. the MojoMojo folks
> (<http://mojomojo.org/features#Searching>).  I'm reluctant to do that, since
> we haven't reached our goals for file format and API stability.  Yeah, they
> were warned by the "alpha" label, but KS has also been promising a level of
> stability which we have yet to deliver.  A one-time painful switch might have
> been OK, but forcing them back into an ongoing dev cycle isn't.

It's only painful and disruptive if existing users install the newest KS. They 
don't have to. And as above, we could come up with a reasonable system to help 
them be very intentional about it. It could be as simple as changing the magic 
version number in Build.PL (which is currently 0.20) whenever the index format 
changes in some backwards incompatible way.
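
To make the idea concrete, here is a minimal sketch of such a check.  It is
written in Python for illustration (Build.PL itself is Perl), and the constant
and function names are hypothetical, not taken from KinoSearch's Build.PL.  It
also assumes that the underscore in a dev version can be ignored when
comparing.

```python
from decimal import Decimal

# Hypothetical "magic" number: the oldest release whose indexes the code
# being installed can still read. Bump it only on an incompatible change.
INDEX_FORMAT_FLOOR = "0.20"


def as_number(version):
    """Make a CPAN-style decimal version comparable.

    Assumption: the underscore in a dev release such as '0.30_07' is
    ignored for comparison, so it is treated as 0.3007.
    """
    return Decimal(version.replace("_", ""))


def needs_reindex(installed, floor=INDEX_FORMAT_FLOOR):
    """Return True if indexes written by `installed` predate `floor`."""
    return as_number(installed) < as_number(floor)


def check_upgrade(installed, floor=INDEX_FORMAT_FLOOR):
    """Refuse the upgrade with a clear message instead of prompting."""
    if installed is not None and needs_reindex(installed, floor):
        raise RuntimeError(
            f"Installed version {installed} wrote indexes older than format "
            f"floor {floor}; reindex after upgrading or stay on {installed}."
        )
```

With the floor left at 0.20, check_upgrade("0.30_07") passes silently; raising
the floor to "0.30_09" for a format-breaking release would make the same call
fail.  Raising an error rather than asking a question keeps the check
non-interactive.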

> 
> To avoid disrupting such users, we could take one of two paths:
> 
>   * Fork the current stable release under "KinoSearch0" and expect existing
>     users to switch.
>   * Move the dev branch (svn trunk) under "KinoSearch2" and release it as an
>     alpha.  (I lean towards this option because it sets a precedent for how I
>     think we'll need to handle versioning in Lucy.)
> 

What about #3: stabilize svn trunk and release it as KS 0.30.

When there's another index compat change, release it as KS 0.40, etc.

> If we'd managed to launch Lucy by now, this question would be academic,
> because Lucy would have become the successor to the KS dev branch.  And I've
> kind of been working on Lucy with that in mind.
> 
> Lucy remains my main goal.  From a marketing perspective, I'm not sure that
> it's ideal to launch "KinoSearch2" as an alpha, then deprecate it in favor of
> Lucy a few months later.  And once Lucy is launched in earnest and people
> outside our small circle start contributing, KS will have to be deprecated
> because licensing issues will eventually prevent us from backporting some
> important chunk of Lucy code to KS.

Deprecating KS in the future is fine and good. Between now and then, though, 
let's get 0.30 released.

> 
> So that's why I've been kind of keeping my head down and working feverishly on
> Lucy.  I figured we'd get Lucy out as an alpha, grow its user base by
> releasing Ruby and Python bindings, then harness the excitement from that to
> work on the difficult problems that have held back KS.  Designing a pluggable
> indexing framework is hard; it's almost impossible without a large user base,
> since only a small subset of users will be in a situation where they can test
> drive the pluggability features and help us refine the API.

That makes total sense to me. Lucy is the future. And its viability to date 
depends upon ideas worked out in the past years in KS. I expect that kind of 
cross-fertilization to continue as long as the IP issues remain compatible. And 
I expect to continue to help as I can.

> 
>> And, how can I help?
> 
> You and Nate have been very helpful with regards to code and API review.  If
> we go down the current path towards Lucy, I'd ask you to continue exploring
> new areas and providing feedback about how it went.

Will do.

> 
> If we decide to make a formal CPAN release of dev branch somehow, there will
> be some mechanical work to do.  If you wanted to do that, you could -- but I'm
> under the impression that you don't have that much time (compared with the 60
> or so hours I've been putting in each week) and I don't want to squander a
> limited resource.

I wouldn't call it squandering. I would call it sharing. :)

Regarding time, I have some at the moment because I want to use KS at $work.

> 
> If I could go back in time, I would have released the KS 0.20 branch under the
> namespace "KinoSearch2" and the 0.30 branch under "KinoSearch3".  Maybe that
> points the way forward.  Whatever we do, though, I'm determined not to let
> progress towards Lucy flag again.
> 

I agree. Lucy should remain your primary concern.

How can I help move toward a KS 0.30 release?


-- 
Peter Karman  .  http://peknet.com/  .  peter@peknet.com

Re: [KinoSearch] Release strategies (was "fields and swish3")

Posted by Marvin Humphrey <ma...@rectangular.com>.

On Thu, Jan 14, 2010 at 05:07:14PM +0000, Dermot wrote:
> Are you saying that you want to get Lucy up-to-date with Lucene? 

Nope.  Lucy is a "loose C" port of Lucene.  It's not intended to be compatible.

> I'm curious because that would suggest that the index formats would eventually
> be cross-compatible? 

The Lucene file format is waaaaaayyyyy too complicated to support.

    http://rectangular.com/pipermail/kinosearch/2009-December/007174.html

The Lucy file format will be much simpler.  It's possible that at some point,
people will write Java modules to read it.  That's the extent of the
cross-compatibility I'd expect to see in my lifetime.  But we're speculating
about vaporware, so take that FWIW...

> Won't that have a negative impact on KinoSearch usage because there is
> already a Lucene module on CPAN that uses Lucy (I know it's out-of-date) but
> someone is bound to create those bindings again. 

There is a "Lucene" CPAN distro, last updated in 2007, which binds to the C++
port CLucene -- not Lucy.  (There's also a "CLucene" distro, last updated in
2005, which also binds to CLucene.)  If other people want to give their
modules misleading names, I can't do anything about that.  However, I've
reserved the CPAN namespace for "Lucy", so nobody else will be able to
cybersquat it.

> I'm also curious about what this will mean for my current project. I
> took the 0.30_072 release and I'm a bit nervous about continued support for
> KinoSearch.

If and when we deprecate KinoSearch, or KinoSearch2 or whatever it's called at
that point, you'll be expected to switch to Lucy.  But note that the
KinoSearch dev branch (svn trunk) and Lucy are nearly identical.  Unless the
licensing change from GPL/Artistic to Apache affects you, having to switch
won't be that different from the kind of adjustments you have to make from
release to release when using the KinoSearch dev branch.

Marvin Humphrey