Posted to dev@lucy.apache.org by Peter Karman <pe...@peknet.com> on 2010/01/14 17:20:34 UTC

Re: Release strategies (was "fields and swish3")

Marvin Humphrey wrote on 1/13/10 3:56 PM:

>> Of course, my next question is predictable: when will 0.30 be fully cooked? 
> 
> The goal has been to get the KS dev branch file format and API stable before
> displacing the stable branch.  However, there are still difficult file format
> problems to solve, particularly with regards to term dictionaries and posting
> lists -- such as support for multi-stream posting formats and indexing of
> non-text field types.

Are the file format problems actually bugs in the current format, or features 
you would like to see added? IMO there's a big difference.

I understand your long-time aversion to changing the index format between 
releases, and how that can break existing indexes in the wild. I've been bitten 
by that situation with other Perl projects from CPAN where ModuleA gets 
installed as part of a regular sysadmin upgrade because it is pulled in as a 
dependency by ModuleB, and then my code that depends on ModuleA suddenly stops 
working. CPAN's versioning is not ideal in that regard.

However, there are already checks in Build.PL for incompatible index formats, 
and KS is not likely to be pulled in blindly as a dependency. That is, if 
someone is installing KS from CPAN, they are doing it intentionally and Build.PL 
will (or could be made to) help prevent them from shooting themselves.

Rebuilding an index is not the end of the world. We (and by we I mean search 
developers) do it all the time, even with big doc corpora.

The perfect is the enemy of the good. I.e., if we wait until the perfect, 
ultimate file format is finished, a stable KS 0.30 release might never see CPAN 
till KS is made obsolete by Lucy. That would be too bad, I think, because there 
are sooo many good improvements in the .30 branch, stable and trunk, that people 
could be taking advantage of without having to install a dev release or keep up 
with svn trunk.

Small, stable, incremental and frequent releases to CPAN. I've been converted to 
that idea. I'm trying now to convince you, Marvin. How am I doing? :)


> 
> I haven't really been working on those problems too much lately.  After
> reaching our goals for index opening speed and integration of memory mapped
> sort caches last year, I could have gone back to that -- but instead, I've
> gone to work on Lucy.  Lucy isn't that far off at this point: N months.  
> 

That's great. Really. And as one of your faithful commit list readers, I applaud 
everything you've achieved so far. It's monumental. It's good.

I just worry about the perfect being the enemy of the good. It's something I've 
struggled with myself wrt Swish3, which has been gestating about as long as KS has.

> There are some people who are using stable branch KS and who would be
> disrupted if we simply clobber the stable branch by releasing the dev branch
> on top of it, e.g. the MojoMojo folks
> (<http://mojomojo.org/features#Searching>).  I'm reluctant to do that, since
> we haven't reached our goals for file format and API stability.  Yeah, they
> were warned by the "alpha" label, but KS has also been promising a level of
> stability which we have yet to deliver.  A one-time painful switch might have
> been OK, but forcing them back into an ongoing dev cycle isn't.

It's only painful and disruptive if existing users install the newest KS. They 
don't have to. And as above, we could come up with a reasonable system to help 
them be very intentional about it. It could be as simple as changing the magic 
version number in Build.PL (which is currently 0.20) whenever the index format 
changes in some backwards incompatible way.
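
To make that concrete, here is a minimal sketch of such a guard. It is written
in Python purely for illustration (Build.PL itself is Perl), and every name in
it is hypothetical: INDEX_FORMAT_BREAKS, parse_version, breaks_between and
check_upgrade are not part of KS. The "0.30" entry assumes the next format
break lands in that release. The check fails with an error instead of asking
for input, so an unattended install never waits on a prompt.

```python
# Hypothetical sketch: none of these names come from KinoSearch's Build.PL.
# Versions at which the on-disk index format changed incompatibly.
INDEX_FORMAT_BREAKS = ("0.20", "0.30")


def parse_version(text):
    """Turn '0.30_07' into a comparable tuple, ignoring the dev suffix."""
    stable = text.split("_", 1)[0]
    return tuple(int(part) for part in stable.split("."))


def breaks_between(installed, candidate):
    """List the format-breaking versions in the range (installed, candidate]."""
    low, high = parse_version(installed), parse_version(candidate)
    return [v for v in INDEX_FORMAT_BREAKS if low < parse_version(v) <= high]


def check_upgrade(installed, candidate, allow_reindex=False):
    """Refuse an upgrade across a format break unless explicitly allowed.

    Raises instead of prompting, so automated installs do not hang.
    """
    crossed = breaks_between(installed, candidate)
    if crossed and not allow_reindex:
        raise RuntimeError(
            "index format changed in %s; reindex required" % ", ".join(crossed)
        )
    return crossed
```

With this, check_upgrade("0.20", "0.30") raises unless the caller passes
allow_reindex=True, while an upgrade between two 0.30 dev releases passes
silently.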

> 
> To avoid disrupting such users, we could take one of two paths:
> 
>   * Fork the current stable release under "KinoSearch0" and expect existing
>     users to switch.
>   * Move the dev branch (svn trunk) under "KinoSearch2" and release it as an
>     alpha.  (I lean towards this option because it sets a precedent for how I
>     think we'll need to handle versioning in Lucy.)
> 

What about #3: stabilize svn trunk and release it as KS 0.30.

When there's another index compat change, release it as KS 0.40, etc.

> If we'd managed to launch Lucy by now, this question would be academic,
> because Lucy would have become the successor to the KS dev branch.  And I've
> kind of been working on Lucy with that in mind.
> 
> Lucy remains my main goal.  From a marketing perspective, I'm not sure that
> it's ideal to launch "KinoSearch2" as an alpha, then deprecate it in favor of
> Lucy a few months later.  And once Lucy is launched in earnest and people
> outside our small circle start contributing, KS will have to be deprecated
> because licensing issues will eventually prevent us from backporting some
> important chunk of Lucy code to KS.

Deprecating KS in the future is fine and good. Between now and then, though, 
let's get 0.30 released.

> 
> So that's why I've been kind of keeping my head down and working feverishly on
> Lucy.  I figured we'd get Lucy out as an alpha, grow its user base by
> releasing Ruby and Python bindings, then harness the excitement from that to
> work on the difficult problems that have held back KS.  Designing a pluggable
> indexing framework is hard; it's almost impossible without a large user base,
> since only a small subset of users will be in a situation where they can test
> drive the pluggability features and help us refine the API.

That makes total sense to me. Lucy is the future. And its viability to date 
depends upon ideas worked out in the past years in KS. I expect that kind of 
cross-fertilization to continue as long as the IP issues remain compatible. And 
I expect to continue to help as I can.

> 
>> And, how can I help?
> 
> You and Nate have been very helpful with regards to code and API review.  If
> we go down the current path towards Lucy, I'd ask you to continue exploring
> new areas and providing feedback about how it went.

Will do.

> 
> If we decide to make a formal CPAN release of dev branch somehow, there will
> be some mechanical work to do.  If you wanted to do that, you could -- but I'm
> under the impression that you don't have that much time (compared with the 60
> or so hours I've been putting in each week) and I don't want to squander a
> limited resource.

I wouldn't call it squandering. I would call it sharing. :)

Regarding time, I have some at the moment because I want to use KS at $work.

> 
> If I could go back in time, I would have released the KS 0.20 branch under the
> namespace "KinoSearch2" and the 0.30 branch under "KinoSearch3".  Maybe that
> points the way forward.  Whatever we do, though, I'm determined not to let
> progress towards Lucy flag again.
> 

I agree. Lucy should remain your primary concern.

How can I help move toward a KS 0.30 release?


-- 
Peter Karman  .  http://peknet.com/  .  peter@peknet.com

Re: Release strategies

Posted by Peter Karman <pe...@peknet.com>.
Marvin Humphrey wrote on 01/21/2010 11:01 PM:

> So that's one item checked off the TODO list.
> 

\o/


-- 
Peter Karman  .  http://peknet.com/  .  peter@peknet.com

Re: [KinoSearch] Release strategies

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Thu, Jan 14, 2010 at 08:22:32PM -0800, Marvin Humphrey wrote:

> Then there's one long standing file format bug: 
> 
>   * Skipping on SegPostingList is disabled because the skip files are broken.
> 
> I'd like to get that one taken care of before we make a non-dev release, as it
> will help performance for a variety of queries.

I believe I have finally squashed this bug with KS commit r5729, which
re-enables the skipping optimization on SegPostingList.  It turns out to have
been a search-time problem in SegPostingList after all, rather than something
hiding in PostingsWriter's main loop -- but that didn't become clear until
after I'd cleaned out most of the
PostingsWriter/PostingPool/PostingPoolQueue/SortExternal/SortExRun rat's nest.
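
For readers unfamiliar with the optimization, here is a rough sketch of what
skipping buys at search time. It is a generic illustration in Python, not the
KS implementation or its skip file format: the class SkippingPostingList, the
skip_interval parameter and the steps counter are all made up for this
example. The idea is that a sparse table of skip entries lets advance() jump
over whole blocks of postings instead of scanning them one at a time.

```python
import bisect


class SkippingPostingList:
    """Sorted doc ids plus a sparse table of skip entries (illustrative only)."""

    def __init__(self, doc_ids, skip_interval=4):
        self.doc_ids = list(doc_ids)  # must be sorted ascending and unique
        self.skip_interval = skip_interval
        # One skip entry per block: the doc id that starts the block.
        self.skip_docs = self.doc_ids[::skip_interval]
        self.pos = -1   # index of the current posting; -1 means before the first
        self.steps = 0  # postings examined, to show what skipping saves

    def advance(self, target):
        """Move to the next posting with doc id >= target, or return None."""
        # Find the last block whose first doc id is <= target.
        block = bisect.bisect_right(self.skip_docs, target) - 1
        if block >= 0:
            # Jump to just before that block, but never move backwards.
            self.pos = max(self.pos, block * self.skip_interval - 1)
        # Finish with a short linear scan inside the block.
        while True:
            self.pos += 1
            if self.pos >= len(self.doc_ids):
                return None
            self.steps += 1
            if self.doc_ids[self.pos] >= target:
                return self.doc_ids[self.pos]
```

On a list of ten postings with a skip interval of four, advancing to a doc id
near the end examines only the postings in the final block rather than all
ten.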

So that's one item checked off the TODO list.

Marvin Humphrey


Re: Release strategies

Posted by Peter Karman <pe...@peknet.com>.
Marvin Humphrey wrote on 1/22/10 6:32 PM:
> On Thu, Jan 14, 2010 at 08:22:32PM -0800, Marvin Humphrey wrote:
>>> How can I help move toward a KS 0.30 release?
>> I'll draw up a todo list.
> 
> I've updated the TODO list on the KinoSearch wiki:
> 
>   http://www.rectangular.com/kinosearch/wiki/ToDoList
> 
> To summarize, I think we need to move in three stages:

I'm still not sure that it's really necessary to fork a KinoSearch2, and I have 
some worry that it might confuse more than help. But I'll defer to you, Marvin, 
on this since this is your baby.

Thanks for responding to my original plea in such a gracious way.

-- 
Peter Karman  .  http://peknet.com/  .  peter@peknet.com

Re: [KinoSearch] Release strategies

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Thu, Jan 14, 2010 at 08:22:32PM -0800, Marvin Humphrey wrote:
> > How can I help move toward a KS 0.30 release?
> 
> I'll draw up a todo list.

I've updated the TODO list on the KinoSearch wiki:

  http://www.rectangular.com/kinosearch/wiki/ToDoList

To summarize, I think we need to move in three stages:

First, we need a KinoSearch 0.30_08 release, which must be index-compatible
with 0.30_072.  There has been a lot of churn in the code base since 0.30_07
came out, and it would be nice to give our users the option of downgrading
without needing to reindex if 0.30_08 turns out to have problems.

Next, we need a KinoSearch 0.30_09 release, where we make some changes to the
index format which earlier releases will not be able to read.

After that, we can release KinoSearch2.  The main thing I think we should do
in between KinoSearch 0.30_09 and KinoSearch2 is finish the class
reorganization we discussed a little while back (moving Schema and FieldType
under KinoSearch2::Plan, Searcher to KinoSearch2::Search::IndexSearcher, etc.).

One thing that is omitted from the TODO list for KinoSearch2 is opening up the
APIs for classes under Store -- InStream, OutStream, FileHandle, DirHandle --
which have been substantially improved since 0.30_07 came out and are close to
ready for public use.  I had originally planned to make them available for
0.30_08, but they shouldn't block.

Marvin Humphrey


Re: [KinoSearch] Release strategies

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Thu, Jan 14, 2010 at 10:20:34AM -0600, Peter Karman wrote:
> Are the file format problems actually bugs in the current format, or features 
> you would like to see added?

Both.  With regards to features, there are two:

  * Make term dictionaries pluggable to work with non-text field types.
  * Make posting formats able to work with multiple streams.

We can probably handle those without hard compatibility breaks; it will just
be more of a pain. 

Then there's one long standing file format bug: 

  * Skipping on SegPostingList is disabled because the skip files are broken.

I'd like to get that one taken care of before we make a non-dev release, as it
will help performance for a variety of queries.

> CPAN's versioning is not ideal in that regard.

I'll be less kind.  It's grievously flawed.

But these problems are also just hard to avoid by nature when dealing with
dynamic dependencies and global namespaces.  We have the same problem with C.

> However, there are already checks in Build.PL for incompatible index formats, 

If we release into a new "KinoSearch2" namespace, those checks can be
discarded.  They're pretty unfriendly to the CPAN toolchain -- you can get
yourself on a CPAN Testers blacklist by hanging on manual user input.

> Rebuilding an index is not the end of the world. We (and by we I mean search 
> developers) do it all the time, even with big doc corpora.

If you have a lot of fast moving indexes, and an expectation of uninterrupted
up-time, it can be difficult to schedule swaps.  That's our situation here at
Eventful.  We could do it, but it would be a major PITA.

Ironically, this pushes in the direction of release, because managing a hard
compat break would probably cost us more than it costs us to have me write
bridge code.

> Small, stable, incremental and frequent releases to CPAN. I've been converted to 
> that idea.

I've also seen the benefits of date-driven release schedules, for instance as
now practiced by the Perl 5 Porters.  Jesse Vincent has managed to get a bunch
of good devs more highly involved by separating the roles of release manager
and pumpking.

In the abstract, I'd like to try that.  It would require changes to my personal
development routines, and Git is better suited to it than Subversion, but I
think we could make it work.  

> What about #3: stabilize svn trunk and release it as KS 0.30.
> 
> When there's another index compat change, release it as KS 0.40, etc.

I don't want to screw over the MojoMojo people, the Socialtext people, etc. By
releasing into a new namespace we give our users a lot more options for
transitioning.

I actually think we might be able to go a long way without hard compat breaks
in the file format.  Maybe even from here on out.  Now that all metadata is in
JSON, the sort cache issue is solved, and we have a provisional implementation
for pluggable index components, it should be a lot easier to keep compat
across a transitional release at least, and usually longer.  
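
As a rough illustration of why JSON metadata helps here, consider a reader
that accepts a range of format numbers, ignores keys it does not know, and
fills in defaults for keys that older writers never emitted. This is a sketch
in Python under invented assumptions: the key names "format" and
"sort_cache", the format numbers, and the function read_metadata are all
hypothetical and do not describe the actual KS file layout.

```python
import json

# Hypothetical: the format numbers and key names below are illustrative only.
OLDEST_READABLE = 1
NEWEST_READABLE = 3
DEFAULTS = {"sort_cache": "none"}  # keys that older writers never emitted


def read_metadata(text):
    """Parse JSON metadata, tolerating unknown keys and filling defaults."""
    meta = json.loads(text)
    fmt = meta.get("format", OLDEST_READABLE)
    if not OLDEST_READABLE <= fmt <= NEWEST_READABLE:
        raise ValueError("unsupported index format %r" % fmt)
    merged = dict(DEFAULTS)
    merged.update(meta)  # values from the file win over the defaults
    merged["format"] = fmt
    return merged
```

A file written by an older release simply lacks the newer keys and picks up
the defaults; a file from a newer release within the readable range carries
extra keys that this reader passes through untouched. Only a format number
outside the range is a hard break.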

> How can I help move toward a KS 0.30 release?

I'll draw up a todo list.

Marvin Humphrey