You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@lucene.apache.org by Ryan Ernst <ry...@iernst.net> on 2014/08/02 01:47:12 UTC

Lucene versioning logic

There has been a lot of heated discussion recently about version
tracking in Lucene [1] [2].  I wanted to have a fresh discussion
outside of jira to give a full description of the current state of
things, the problems I have heard, and a proposed solution.

CURRENT

We have 2 pieces of code that handle “versioning.”  The first is
Constants.LUCENE_MAIN_VERSION, which is written to the SegmentsInfo
for each segment.  This is a string version which is used to detect
when the current version of lucene is newer than the version that
wrote the segment (and how/if an upgrade to to a newer codec should be
done). There is some complication with the “display” version and
non-display version, which are distinguished by whether the version of
lucene was an official release, or an alpha/beta version (which was
added specifically for the 4.0.0 release ramp up).  This string
version also has its own parsing and comparison methods.

The second piece of versioning code is in Version.java, which is an
enum used by analyzers to maintain backwards compatible behavior given
a specific version of lucene.  The enum only contains values for dot
releases of lucene, not bug fixes (which was what spurred the recent
discussions over version). Analyzers’ constructors take a required
Version parameter, which is only actually used by the few analyzers
that have changed behavior recently.  Version.java contains a separate
version parsing and comparison methods.


CONCERNS

* Having 2 different pieces of code that do very similar things is
confusing for development.  Very few developers appear to really
understand the current system (especially when trying to understand
the alpha/beta setup).

* Users are generally confused by the Version passed to analyzers: I
know I was when I first started working with Lucene, and
Version.CURRENT_VERSION was deprecated because users used that without
understanding the implications.

* Most analyzers currently have dead code constructors, since they
never make use of Version.  There are also a lot of classes used by
analyzers which contain similar dead code.

* Backwards compatibility needs to be handled in some fashion, to
ensure users have a path to upgrade from one version of lucene to
another, without requiring immediate re-indexing.


PROPOSAL

I propose the following:

* Consolidate all version related enumeration, including reading and
writing string versions, into Version.java.  Have a static method that
returns the current lucene version (replacing
Constants.LUCENE_MAIN_VERSION).

* Make bug fix releases first class in the enumeration, so that they
can be distinguished for any compatibility issues that come up.

* Remove all snapshot/alpha/beta versioning logic.  Alpha/beta was
really only necessary for 4.0 because of the extreme changes that were
being made.  The system is much more stable now, and 5.0 should not
require preview releases, IMO.  I don’t think snapshots should be a
concern because any user building an index from an unreleased build
(which they built themselves) is just asking for trouble.  They do so
at their own risk (of figuring out how to upgrade their indexes if
they are not trash-able).  Backwards compatibility can be handled by
adding the alpha/beta/final versions of 4.0 to the enum (and special
parsing logic for this).  If lucene changes so much that we need
alpha/beta type discrimination in the future, we can revisit the
system if simply having extra versions in the enum won't work.

* Analyzers constructors should have Version removed, and a setter
should be added which allows production users to set the version used.
This way any analyzers can still use version if it is set to something
other than current (which would be the default), but users simply
prototyping do not need to worry about it.

* Classes that analyzers use, which take Version, should have Version
removed, and the analyzers should choose which settings/variants of
those classes to use based on the version they have set. In other
words, all version variant logic should be contained within the
analyzers.  For example, Lucene47WordDelimiterFilter, or
StandardAnalyzer can take the unicode version.
Factories could still take Version (e.g. TokenizerFactory,
TokenFilterFactory, etc) to produce the correct component (so nothing
will change for solr in this regard).

I’m sure not everyone will be happy with what I have proposed, but I’m
hoping we can work out a solution together, and then implement in a
team-like fashion, the way I have seen the community work in the past,
and I hope to see again in the future.

Thanks
Ryan

[1] https://issues.apache.org/jira/browse/LUCENE-5850
[2] https://issues.apache.org/jira/browse/LUCENE-5859

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: Lucene versioning logic

Posted by Shai Erera <se...@gmail.com>.

Right, we can also use a SetOnce wrapper to restrict that.

Shai


On Mon, Aug 4, 2014 at 4:05 PM, Steve Molloy <sm...@opentext.com> wrote:

>  I think this proposal is what makes the most sense since this discussion
> started. As for making an instance not modifiable, the setVersion could
> check if it was already called and error-out if it was. Then you could go
> from default to version 123, but at least you couldn't hop from version to
> version while the analyzer is live. This should mostly be handled by
> factories anyhow, so if factories explicitly calls setVersion all the time
> before returning an instance, the instances wouldn't be modifiable.
>
> My 2 cents. :)
>
> Steve
>  ------------------------------
> *From:* Shai Erera [serera@gmail.com]
> *Sent:* August 3, 2014 8:51 AM
> *To:* dev@lucene.apache.org
> *Subject:* Re: Lucene versioning logic
>
>   OK I see, it's the Tokenizers and Filters that usually change, the
> Analyzer only needs Version to determine which TokenStream chain to return,
> and so we achieve that w/ Ryan's proposal of setVersion().
>
>  I'd still feel better if Version was a final setting on an Analyzer,
> i.e. that a single Analyzer instance always behaves consistently, and
> cannot alternate behavior down-stream if someone called setVersion(). But
> this is a really stupid thing to do. Maybe setVersion() can return an
> Analyzer, so you're sure that instance is not modifiable. But maybe this is
> just over engineering...
>
>  I'm +0.5 to that. I prefer Version to be mandatory somehow (class name,
> ctor argument), but I can live with setVersion as well...
>
> Shai
>
>
> On Sun, Aug 3, 2014 at 3:30 PM, Robert Muir <rc...@gmail.com> wrote:
>
>> No, I didn't say any of this, please read it again :)
>>
>> Old back-compat *Tokenizer/Tokenfilter* are named this way.
>> *Tokenizer/Tokenfilter*
>> Only the old ones.
>> Just to be clear: *Tokenizer/Tokenfilter*
>> Their Factories still use Version to produce the right one (as we
>> can't remove version from there, or we will have complaints from solr
>> developers).
>>
>> So users who want the version-style back compat can just use the
>> factories. really.
>> On the other hand new users can do 'new LowerCaseFilter()' without the
>> bullshit.
>>
>> For Analyzers, there is a setter. Users who want to use *OUR
>> ANALYZERS* with back compat, call the setter. But its not
>> mandatory-in-your-face-ctor.
>>
>> I am +1 to Ryan's proposal, so please look for more elaboration there.
>> I am -1 to putting Versions in the name of Analyzers.
>>
>> On Sun, Aug 3, 2014 at 8:21 AM, Shai Erera <se...@gmail.com> wrote:
>> > Oh, I misread this part "I do think its ok to name ..." -- replaced "do"
>> > with "don't" :).
>> >
>> > So you say that if we have a FooAnalyzer in 4.5 and change its behavior
>> in
>> > 4.9, then we add a Foo45Analyzer as a back-compat support, and
>> FooAnalyzer
>> > in 4.9 keeps its name, but with different behavior?
>> >
>> > That means that an app who didn't read CHANGES will be broken upon
>> upgrade,
>> > but if it does read CHANGES, it at least has a way to retain desired
>> > behavior.
>> >
>> > So the thing now is whether FooAnalyzer is always _current_ and an app
>> > should choose a backwards version of it (if it wants to), vs if
>> FooAnalyzer
>> > is _always the same_, and if you want to move forward you have to
>> explicitly
>> > use a NewFooAnalyzer?
>> >
>> > Of course, when FooAnalyzer takes a Version, then an app only needs to
>> > change its Version CONSTANT, to get best behavior ... but as you point
>> out,
>> > seems like we failed to implement that approach in our code already,
>> which
>> > suggests this approach is not intuitive to our committers, so why do we
>> > expect our users to understand it ...
>> >
>> > I am +1 on either of the approaches (both get rid of Version.java). I
>> don't
>> > feel bad with asking users to read CHANGES before they upgrade, and it
>> does
>> > mean that FooAnalyzer always gives you the best behavior, which is
>> important
>> > for new users or if you always re-index. Vs the second approach which
>> always
>> > prefers backwards compatibility, and telling users to read the javadocs
>> (and
>> > CHANGES)  in order to find the best version of FooAnalyzer.
>> >
>> > There is another issue w/ a global Version CONSTANT, which today we
>> > encourage apps to use -- if you use two analyzers, but you want to work
>> with
>> > a different Version of each (because of all sorts of reasons), having a
>> > global constant is bad. The explicit Foo45Analyzer (or Foo49Analyzer,
>> > whichever) lets you mix whichever versions that you want.
>> >
>> > Shai
>> >
>> > On Sun, Aug 3, 2014 at 3:02 PM, Robert Muir <rc...@gmail.com> wrote:
>> >>
>> >> You don't read what i wrote.
>> >>
>> >> Read it again.
>> >>
>> >>
>> >> On Sun, Aug 3, 2014 at 7:49 AM, Shai Erera <se...@gmail.com> wrote:
>> >> > Yes, I agree that Foo49Analyzer is an odd name. Better if it was
>> named
>> >> > FooAnalyzerWithNoApostrophe, and I'm fine if that Analyzer chose to
>> name
>> >> > its
>> >> > different versions like that. But in the absence of better naming
>> ideas,
>> >> > I
>> >> > proposed the Foo49Analyzer.
>> >> >
>> >> > If we already have such Analyzers, then we are in fact implementing
>> that
>> >> > approach, only didn't make that decision globally. So whether it's
>> odd
>> >> > or
>> >> > not, let's first agree if we are willing to have these analyzers in
>> our
>> >> > code
>> >> > base (i.e. w/ the back-compat support). If we do, we can let each
>> >> > Analyzer
>> >> > decide on its naming.
>> >> >
>> >> > Analyzers aren't Codecs, I agree, and sticking the Lucene version in
>> >> > their
>> >> > name is probably not the best thing to do, as the Lucene version is
>> more
>> >> > associated with the index format. But if a fixed Analyzer cannot
>> come up
>> >> > w/
>> >> > a better name, I think the Lucene version there is not that horrible.
>> >> >
>> >> > And, it lets us easily remove Version.java.
>> >> >
>> >> > Shai
>> >> >
>> >> >
>> >> > On Sun, Aug 3, 2014 at 2:43 PM, Robert Muir <rc...@gmail.com>
>> wrote:
>> >> >>
>> >> >> On Sat, Aug 2, 2014 at 12:41 PM, Shai Erera <se...@gmail.com>
>> wrote:
>> >> >> > Another proposal that I made on LUCENE-5859 is to get rid of
>> Version
>> >> >> > (for
>> >> >> > Analyzers) and follow the solution we have with Codecs. If an
>> >> >> > Analyzer
>> >> >> > changes its runtime behavior, and e.g not marked @experimental, it
>> >> >> > can
>> >> >> > create a Foo49Analyzer with the new behavior. That way, apps are
>> >> >> > still
>> >> >> > safe
>> >> >> > when they upgrade, since their Foo45Analyzer still exists (but
>> >> >> > deprecated).
>> >> >> > And they can always copy a Foo45Analyzer when they upgrade to
>> Lucene
>> >> >> > 6.0
>> >> >> > where it no longer exists... with this approach, there's no single
>> >> >> > version
>> >> >> > across the app - it just uses the specific Analyzer impls.
>> >> >>
>> >> >> But the usability here would be really bad.
>> >> >>
>> >> >> For codecs there isn't much a better thing to name it anyway, and
>> >> >> codecs are super-expert to change.
>> >> >>
>> >> >> For analyzers usability is paramount.
>> >> >>
>> >> >> I do think its ok to name _backwards_ compat tokenizer/tokenfilter
>> >> >> classes this way. In fact its already this way in trunk for any back
>> >> >> compat *actually doing something*: Lucene43NgramTokenizer,
>> >> >> Lucene47WordDelimiterFilter. The Version parameters are just for
>> show,
>> >> >> not doing anything!
>> >> >>
>> >> >>
>> ---------------------------------------------------------------------
>> >> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> >> >> For additional commands, e-mail: dev-help@lucene.apache.org
>> >> >>
>> >> >
>> >>
>> >> ---------------------------------------------------------------------
>> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> >> For additional commands, e-mail: dev-help@lucene.apache.org
>> >>
>> >
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: dev-help@lucene.apache.org
>>
>>
>    ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org For additional
> commands, e-mail: dev-help@lucene.apache.org
>

Re: Lucene versioning logic

Posted by Shai Erera <se...@gmail.com>.

OK I see, it's the Tokenizers and Filters that usually change, the Analyzer
only needs Version to determine which TokenStream chain to return, and so
we achieve that w/ Ryan's proposal of setVersion().

I'd still feel better if Version was a final setting on an Analyzer, i.e.
that a single Analyzer instance always behaves consistently, and cannot
alternate behavior down-stream if someone called setVersion(). But this is
a really stupid thing to do. Maybe setVersion() can return an Analyzer, so
you're sure that instance is not modifiable. But maybe this is just over
engineering...

I'm +0.5 to that. I prefer Version to be mandatory somehow (class name,
ctor argument), but I can live with setVersion as well...

Shai


On Sun, Aug 3, 2014 at 3:30 PM, Robert Muir <rc...@gmail.com> wrote:

> No, I didn't say any of this, please read it again :)
>
> Old back-compat *Tokenizer/Tokenfilter* are named this way.
> *Tokenizer/Tokenfilter*
> Only the old ones.
> Just to be clear: *Tokenizer/Tokenfilter*
> Their Factories still use Version to produce the right one (as we
> can't remove version from there, or we will have complaints from solr
> developers).
>
> So users who want the version-style back compat can just use the
> factories. really.
> On the other hand new users can do 'new LowerCaseFilter()' without the
> bullshit.
>
> For Analyzers, there is a setter. Users who want to use *OUR
> ANALYZERS* with back compat, call the setter. But its not
> mandatory-in-your-face-ctor.
>
> I am +1 to Ryan's proposal, so please look for more elaboration there.
> I am -1 to putting Versions in the name of Analyzers.
>
> On Sun, Aug 3, 2014 at 8:21 AM, Shai Erera <se...@gmail.com> wrote:
> > Oh, I misread this part "I do think its ok to name ..." -- replaced "do"
> > with "don't" :).
> >
> > So you say that if we have a FooAnalyzer in 4.5 and change its behavior
> in
> > 4.9, then we add a Foo45Analyzer as a back-compat support, and
> FooAnalyzer
> > in 4.9 keeps its name, but with different behavior?
> >
> > That means that an app who didn't read CHANGES will be broken upon
> upgrade,
> > but if it does read CHANGES, it at least has a way to retain desired
> > behavior.
> >
> > So the thing now is whether FooAnalyzer is always _current_ and an app
> > should choose a backwards version of it (if it wants to), vs if
> FooAnalyzer
> > is _always the same_, and if you want to move forward you have to
> explicitly
> > use a NewFooAnalyzer?
> >
> > Of course, when FooAnalyzer takes a Version, then an app only needs to
> > change its Version CONSTANT, to get best behavior ... but as you point
> out,
> > seems like we failed to implement that approach in our code already,
> which
> > suggests this approach is not intuitive to our committers, so why do we
> > expect our users to understand it ...
> >
> > I am +1 on either of the approaches (both get rid of Version.java). I
> don't
> > feel bad with asking users to read CHANGES before they upgrade, and it
> does
> > mean that FooAnalyzer always gives you the best behavior, which is
> important
> > for new users or if you always re-index. Vs the second approach which
> always
> > prefers backwards compatibility, and telling users to read the javadocs
> (and
> > CHANGES)  in order to find the best version of FooAnalyzer.
> >
> > There is another issue w/ a global Version CONSTANT, which today we
> > encourage apps to use -- if you use two analyzers, but you want to work
> with
> > a different Version of each (because of all sorts of reasons), having a
> > global constant is bad. The explicit Foo45Analyzer (or Foo49Analyzer,
> > whichever) lets you mix whichever versions that you want.
> >
> > Shai
> >
> > On Sun, Aug 3, 2014 at 3:02 PM, Robert Muir <rc...@gmail.com> wrote:
> >>
> >> You don't read what i wrote.
> >>
> >> Read it again.
> >>
> >>
> >> On Sun, Aug 3, 2014 at 7:49 AM, Shai Erera <se...@gmail.com> wrote:
> >> > Yes, I agree that Foo49Analyzer is an odd name. Better if it was named
> >> > FooAnalyzerWithNoApostrophe, and I'm fine if that Analyzer chose to
> name
> >> > its
> >> > different versions like that. But in the absence of better naming
> ideas,
> >> > I
> >> > proposed the Foo49Analyzer.
> >> >
> >> > If we already have such Analyzers, then we are in fact implementing
> that
> >> > approach, only didn't make that decision globally. So whether it's odd
> >> > or
> >> > not, let's first agree if we are willing to have these analyzers in
> our
> >> > code
> >> > base (i.e. w/ the back-compat support). If we do, we can let each
> >> > Analyzer
> >> > decide on its naming.
> >> >
> >> > Analyzers aren't Codecs, I agree, and sticking the Lucene version in
> >> > their
> >> > name is probably not the best thing to do, as the Lucene version is
> more
> >> > associated with the index format. But if a fixed Analyzer cannot come
> up
> >> > w/
> >> > a better name, I think the Lucene version there is not that horrible.
> >> >
> >> > And, it lets us easily remove Version.java.
> >> >
> >> > Shai
> >> >
> >> >
> >> > On Sun, Aug 3, 2014 at 2:43 PM, Robert Muir <rc...@gmail.com> wrote:
> >> >>
> >> >> On Sat, Aug 2, 2014 at 12:41 PM, Shai Erera <se...@gmail.com>
> wrote:
> >> >> > Another proposal that I made on LUCENE-5859 is to get rid of
> Version
> >> >> > (for
> >> >> > Analyzers) and follow the solution we have with Codecs. If an
> >> >> > Analyzer
> >> >> > changes its runtime behavior, and e.g not marked @experimental, it
> >> >> > can
> >> >> > create a Foo49Analyzer with the new behavior. That way, apps are
> >> >> > still
> >> >> > safe
> >> >> > when they upgrade, since their Foo45Analyzer still exists (but
> >> >> > deprecated).
> >> >> > And they can always copy a Foo45Analyzer when they upgrade to
> Lucene
> >> >> > 6.0
> >> >> > where it no longer exists... with this approach, there's no single
> >> >> > version
> >> >> > across the app - it just uses the specific Analyzer impls.
> >> >>
> >> >> But the usability here would be really bad.
> >> >>
> >> >> For codecs there isn't much a better thing to name it anyway, and
> >> >> codecs are super-expert to change.
> >> >>
> >> >> For analyzers usability is paramount.
> >> >>
> >> >> I do think its ok to name _backwards_ compat tokenizer/tokenfilter
> >> >> classes this way. In fact its already this way in trunk for any back
> >> >> compat *actually doing something*: Lucene43NgramTokenizer,
> >> >> Lucene47WordDelimiterFilter. The Version parameters are just for
> show,
> >> >> not doing anything!
> >> >>
> >> >> ---------------------------------------------------------------------
> >> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> >> >> For additional commands, e-mail: dev-help@lucene.apache.org
> >> >>
> >> >
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> >> For additional commands, e-mail: dev-help@lucene.apache.org
> >>
> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>

Re: Lucene versioning logic

Posted by Robert Muir <rc...@gmail.com>.

No, I didn't say any of this, please read it again :)

Old back-compat *Tokenizer/Tokenfilter* are named this way.
*Tokenizer/Tokenfilter*
Only the old ones.
Just to be clear: *Tokenizer/Tokenfilter*
Their Factories still use Version to produce the right one (as we
can't remove version from there, or we will have complaints from solr
developers).

So users who want the version-style back compat can just use the
factories. really.
On the other hand new users can do 'new LowerCaseFilter()' without the bullshit.

For Analyzers, there is a setter. Users who want to use *OUR
ANALYZERS* with back compat, call the setter. But its not
mandatory-in-your-face-ctor.

I am +1 to Ryan's proposal, so please look for more elaboration there.
I am -1 to putting Versions in the name of Analyzers.

On Sun, Aug 3, 2014 at 8:21 AM, Shai Erera <se...@gmail.com> wrote:
> Oh, I misread this part "I do think its ok to name ..." -- replaced "do"
> with "don't" :).
>
> So you say that if we have a FooAnalyzer in 4.5 and change its behavior in
> 4.9, then we add a Foo45Analyzer as a back-compat support, and FooAnalyzer
> in 4.9 keeps its name, but with different behavior?
>
> That means that an app who didn't read CHANGES will be broken upon upgrade,
> but if it does read CHANGES, it at least has a way to retain desired
> behavior.
>
> So the thing now is whether FooAnalyzer is always _current_ and an app
> should choose a backwards version of it (if it wants to), vs if FooAnalyzer
> is _always the same_, and if you want to move forward you have to explicitly
> use a NewFooAnalyzer?
>
> Of course, when FooAnalyzer takes a Version, then an app only needs to
> change its Version CONSTANT, to get best behavior ... but as you point out,
> seems like we failed to implement that approach in our code already, which
> suggests this approach is not intuitive to our committers, so why do we
> expect our users to understand it ...
>
> I am +1 on either of the approaches (both get rid of Version.java). I don't
> feel bad with asking users to read CHANGES before they upgrade, and it does
> mean that FooAnalyzer always gives you the best behavior, which is important
> for new users or if you always re-index. Vs the second approach which always
> prefers backwards compatibility, and telling users to read the javadocs (and
> CHANGES)  in order to find the best version of FooAnalyzer.
>
> There is another issue w/ a global Version CONSTANT, which today we
> encourage apps to use -- if you use two analyzers, but you want to work with
> a different Version of each (because of all sorts of reasons), having a
> global constant is bad. The explicit Foo45Analyzer (or Foo49Analyzer,
> whichever) lets you mix whichever versions that you want.
>
> Shai
>
> On Sun, Aug 3, 2014 at 3:02 PM, Robert Muir <rc...@gmail.com> wrote:
>>
>> You don't read what i wrote.
>>
>> Read it again.
>>
>>
>> On Sun, Aug 3, 2014 at 7:49 AM, Shai Erera <se...@gmail.com> wrote:
>> > Yes, I agree that Foo49Analyzer is an odd name. Better if it was named
>> > FooAnalyzerWithNoApostrophe, and I'm fine if that Analyzer chose to name
>> > its
>> > different versions like that. But in the absence of better naming ideas,
>> > I
>> > proposed the Foo49Analyzer.
>> >
>> > If we already have such Analyzers, then we are in fact implementing that
>> > approach, only didn't make that decision globally. So whether it's odd
>> > or
>> > not, let's first agree if we are willing to have these analyzers in our
>> > code
>> > base (i.e. w/ the back-compat support). If we do, we can let each
>> > Analyzer
>> > decide on its naming.
>> >
>> > Analyzers aren't Codecs, I agree, and sticking the Lucene version in
>> > their
>> > name is probably not the best thing to do, as the Lucene version is more
>> > associated with the index format. But if a fixed Analyzer cannot come up
>> > w/
>> > a better name, I think the Lucene version there is not that horrible.
>> >
>> > And, it lets us easily remove Version.java.
>> >
>> > Shai
>> >
>> >
>> > On Sun, Aug 3, 2014 at 2:43 PM, Robert Muir <rc...@gmail.com> wrote:
>> >>
>> >> On Sat, Aug 2, 2014 at 12:41 PM, Shai Erera <se...@gmail.com> wrote:
>> >> > Another proposal that I made on LUCENE-5859 is to get rid of Version
>> >> > (for
>> >> > Analyzers) and follow the solution we have with Codecs. If an
>> >> > Analyzer
>> >> > changes its runtime behavior, and e.g not marked @experimental, it
>> >> > can
>> >> > create a Foo49Analyzer with the new behavior. That way, apps are
>> >> > still
>> >> > safe
>> >> > when they upgrade, since their Foo45Analyzer still exists (but
>> >> > deprecated).
>> >> > And they can always copy a Foo45Analyzer when they upgrade to Lucene
>> >> > 6.0
>> >> > where it no longer exists... with this approach, there's no single
>> >> > version
>> >> > across the app - it just uses the specific Analyzer impls.
>> >>
>> >> But the usability here would be really bad.
>> >>
>> >> For codecs there isn't much a better thing to name it anyway, and
>> >> codecs are super-expert to change.
>> >>
>> >> For analyzers usability is paramount.
>> >>
>> >> I do think its ok to name _backwards_ compat tokenizer/tokenfilter
>> >> classes this way. In fact its already this way in trunk for any back
>> >> compat *actually doing something*: Lucene43NgramTokenizer,
>> >> Lucene47WordDelimiterFilter. The Version parameters are just for show,
>> >> not doing anything!
>> >>
>> >> ---------------------------------------------------------------------
>> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> >> For additional commands, e-mail: dev-help@lucene.apache.org
>> >>
>> >
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: dev-help@lucene.apache.org
>>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: Lucene versioning logic

Posted by Shai Erera <se...@gmail.com>.

Oh, I misread this part "I do think its ok to name ..." -- replaced "do"
with "don't" :).

So you say that if we have a FooAnalyzer in 4.5 and change its behavior in
4.9, then we add a Foo45Analyzer as a back-compat support, and FooAnalyzer
in 4.9 keeps its name, but with different behavior?

That means that an app who didn't read CHANGES will be broken upon upgrade,
but if it does read CHANGES, it at least has a way to retain desired
behavior.

So the thing now is whether FooAnalyzer is always _current_ and an app
should choose a backwards version of it (if it wants to), vs if FooAnalyzer
is _always the same_, and if you want to move forward you have to
explicitly use a NewFooAnalyzer?

Of course, when FooAnalyzer takes a Version, then an app only needs to
change its Version CONSTANT, to get best behavior ... but as you point out,
seems like we failed to implement that approach in our code already, which
suggests this approach is not intuitive to our committers, so why do we
expect our users to understand it ...

I am +1 on either of the approaches (both get rid of Version.java). I don't
feel bad with asking users to read CHANGES before they upgrade, and it does
mean that FooAnalyzer always gives you the best behavior, which is
important for new users or if you always re-index. Vs the second approach
which always prefers backwards compatibility, and telling users to read the
javadocs (and CHANGES)  in order to find the best version of FooAnalyzer.

There is another issue w/ a global Version CONSTANT, which today we
encourage apps to use -- if you use two analyzers, but you want to work
with a different Version of each (because of all sorts of reasons), having
a global constant is bad. The explicit Foo45Analyzer (or Foo49Analyzer,
whichever) lets you mix whichever versions that you want.

Shai

On Sun, Aug 3, 2014 at 3:02 PM, Robert Muir <rc...@gmail.com> wrote:

> You don't read what i wrote.
>
> Read it again.
>
>
> On Sun, Aug 3, 2014 at 7:49 AM, Shai Erera <se...@gmail.com> wrote:
> > Yes, I agree that Foo49Analyzer is an odd name. Better if it was named
> > FooAnalyzerWithNoApostrophe, and I'm fine if that Analyzer chose to name
> its
> > different versions like that. But in the absence of better naming ideas,
> I
> > proposed the Foo49Analyzer.
> >
> > If we already have such Analyzers, then we are in fact implementing that
> > approach, only didn't make that decision globally. So whether it's odd or
> > not, let's first agree if we are willing to have these analyzers in our
> code
> > base (i.e. w/ the back-compat support). If we do, we can let each
> Analyzer
> > decide on its naming.
> >
> > Analyzers aren't Codecs, I agree, and sticking the Lucene version in
> their
> > name is probably not the best thing to do, as the Lucene version is more
> > associated with the index format. But if a fixed Analyzer cannot come up
> w/
> > a better name, I think the Lucene version there is not that horrible.
> >
> > And, it lets us easily remove Version.java.
> >
> > Shai
> >
> >
> > On Sun, Aug 3, 2014 at 2:43 PM, Robert Muir <rc...@gmail.com> wrote:
> >>
> >> On Sat, Aug 2, 2014 at 12:41 PM, Shai Erera <se...@gmail.com> wrote:
> >> > Another proposal that I made on LUCENE-5859 is to get rid of Version
> >> > (for
> >> > Analyzers) and follow the solution we have with Codecs. If an Analyzer
> >> > changes its runtime behavior, and e.g not marked @experimental, it can
> >> > create a Foo49Analyzer with the new behavior. That way, apps are still
> >> > safe
> >> > when they upgrade, since their Foo45Analyzer still exists (but
> >> > deprecated).
> >> > And they can always copy a Foo45Analyzer when they upgrade to Lucene
> 6.0
> >> > where it no longer exists... with this approach, there's no single
> >> > version
> >> > across the app - it just uses the specific Analyzer impls.
> >>
> >> But the usability here would be really bad.
> >>
> >> For codecs there isn't much a better thing to name it anyway, and
> >> codecs are super-expert to change.
> >>
> >> For analyzers usability is paramount.
> >>
> >> I do think its ok to name _backwards_ compat tokenizer/tokenfilter
> >> classes this way. In fact its already this way in trunk for any back
> >> compat *actually doing something*: Lucene43NgramTokenizer,
> >> Lucene47WordDelimiterFilter. The Version parameters are just for show,
> >> not doing anything!
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> >> For additional commands, e-mail: dev-help@lucene.apache.org
> >>
> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>

Re: Lucene versioning logic

Posted by Robert Muir <rc...@gmail.com>.

You don't read what i wrote.

Read it again.


On Sun, Aug 3, 2014 at 7:49 AM, Shai Erera <se...@gmail.com> wrote:
> Yes, I agree that Foo49Analyzer is an odd name. Better if it was named
> FooAnalyzerWithNoApostrophe, and I'm fine if that Analyzer chose to name its
> different versions like that. But in the absence of better naming ideas, I
> proposed the Foo49Analyzer.
>
> If we already have such Analyzers, then we are in fact implementing that
> approach, only didn't make that decision globally. So whether it's odd or
> not, let's first agree if we are willing to have these analyzers in our code
> base (i.e. w/ the back-compat support). If we do, we can let each Analyzer
> decide on its naming.
>
> Analyzers aren't Codecs, I agree, and sticking the Lucene version in their
> name is probably not the best thing to do, as the Lucene version is more
> associated with the index format. But if a fixed Analyzer cannot come up w/
> a better name, I think the Lucene version there is not that horrible.
>
> And, it lets us easily remove Version.java.
>
> Shai
>
>
> On Sun, Aug 3, 2014 at 2:43 PM, Robert Muir <rc...@gmail.com> wrote:
>>
>> On Sat, Aug 2, 2014 at 12:41 PM, Shai Erera <se...@gmail.com> wrote:
>> > Another proposal that I made on LUCENE-5859 is to get rid of Version
>> > (for
>> > Analyzers) and follow the solution we have with Codecs. If an Analyzer
>> > changes its runtime behavior, and e.g not marked @experimental, it can
>> > create a Foo49Analyzer with the new behavior. That way, apps are still
>> > safe
>> > when they upgrade, since their Foo45Analyzer still exists (but
>> > deprecated).
>> > And they can always copy a Foo45Analyzer when they upgrade to Lucene 6.0
>> > where it no longer exists... with this approach, there's no single
>> > version
>> > across the app - it just uses the specific Analyzer impls.
>>
>> But the usability here would be really bad.
>>
>> For codecs there isn't much a better thing to name it anyway, and
>> codecs are super-expert to change.
>>
>> For analyzers usability is paramount.
>>
>> I do think its ok to name _backwards_ compat tokenizer/tokenfilter
>> classes this way. In fact its already this way in trunk for any back
>> compat *actually doing something*: Lucene43NgramTokenizer,
>> Lucene47WordDelimiterFilter. The Version parameters are just for show,
>> not doing anything!
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: dev-help@lucene.apache.org
>>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: Lucene versioning logic

Posted by Shai Erera <se...@gmail.com>.

Yes, I agree that Foo49Analyzer is an odd name. Better if it was named
FooAnalyzerWithNoApostrophe, and I'm fine if that Analyzer chose to name
its different versions like that. But in the absence of better naming
ideas, I proposed the Foo49Analyzer.

If we already have such Analyzers, then we are in fact implementing that
approach, only didn't make that decision globally. So whether it's odd or
not, let's first agree if we are willing to have these analyzers in our
code base (i.e. w/ the back-compat support). If we do, we can let each
Analyzer decide on its naming.

Analyzers aren't Codecs, I agree, and sticking the Lucene version in their
name is probably not the best thing to do, as the Lucene version is more
associated with the index format. But if a fixed Analyzer cannot come up w/
a better name, I think the Lucene version there is not that horrible.

And, it lets us easily remove Version.java.

Shai

On Sun, Aug 3, 2014 at 2:43 PM, Robert Muir <rc...@gmail.com> wrote:

> On Sat, Aug 2, 2014 at 12:41 PM, Shai Erera <se...@gmail.com> wrote:
> > Another proposal that I made on LUCENE-5859 is to get rid of Version (for
> > Analyzers) and follow the solution we have with Codecs. If an Analyzer
> > changes its runtime behavior, and e.g not marked @experimental, it can
> > create a Foo49Analyzer with the new behavior. That way, apps are still
> safe
> > when they upgrade, since their Foo45Analyzer still exists (but
> deprecated).
> > And they can always copy a Foo45Analyzer when they upgrade to Lucene 6.0
> > where it no longer exists... with this approach, there's no single
> version
> > across the app - it just uses the specific Analyzer impls.
>
> But the usability here would be really bad.
>
> For codecs there isn't much a better thing to name it anyway, and
> codecs are super-expert to change.
>
> For analyzers usability is paramount.
>
> I do think its ok to name _backwards_ compat tokenizer/tokenfilter
> classes this way. In fact its already this way in trunk for any back
> compat *actually doing something*: Lucene43NgramTokenizer,
> Lucene47WordDelimiterFilter. The Version parameters are just for show,
> not doing anything!
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>

Re: Lucene versioning logic

Posted by Robert Muir <rc...@gmail.com>.

On Sat, Aug 2, 2014 at 12:41 PM, Shai Erera <se...@gmail.com> wrote:
> Another proposal that I made on LUCENE-5859 is to get rid of Version (for
> Analyzers) and follow the solution we have with Codecs. If an Analyzer
> changes its runtime behavior, and e.g not marked @experimental, it can
> create a Foo49Analyzer with the new behavior. That way, apps are still safe
> when they upgrade, since their Foo45Analyzer still exists (but deprecated).
> And they can always copy a Foo45Analyzer when they upgrade to Lucene 6.0
> where it no longer exists... with this approach, there's no single version
> across the app - it just uses the specific Analyzer impls.

But the usability here would be really bad.

For codecs there isn't much a better thing to name it anyway, and
codecs are super-expert to change.

For analyzers usability is paramount.

I do think its ok to name _backwards_ compat tokenizer/tokenfilter
classes this way. In fact its already this way in trunk for any back
compat *actually doing something*: Lucene43NgramTokenizer,
Lucene47WordDelimiterFilter. The Version parameters are just for show,
not doing anything!

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: Lucene versioning logic

Posted by Shai Erera <se...@gmail.com>.

Another proposal that I made on LUCENE-5859 is to get rid of Version (for
Analyzers) and follow the solution we have with Codecs. If an Analyzer
changes its runtime behavior, and e.g not marked @experimental, it can
create a Foo49Analyzer with the new behavior. That way, apps are still safe
when they upgrade, since their Foo45Analyzer still exists (but deprecated).
And they can always copy a Foo45Analyzer when they upgrade to Lucene 6.0
where it no longer exists... with this approach, there's no single version
across the app - it just uses the specific Analyzer impls.

Anyway, +1 to make bugfix release first class citizens. That's also the
only way to make sure we support back compat if a bug is fixed in an
Analyzer in e.g. 4.8.2.

Shai


On Sat, Aug 2, 2014 at 11:54 AM, Michael McCandless <
lucene@mikemccandless.com> wrote:

> +1, this sounds like a great solution.  It simplifies the APIs (no
> more required Version to Analyzer), it consolidates the version logic
> to a "single source", dot releases are first class.
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Fri, Aug 1, 2014 at 7:47 PM, Ryan Ernst <ry...@iernst.net> wrote:
> > There has been a lot of heated discussion recently about version
> > tracking in Lucene [1] [2].  I wanted to have a fresh discussion
> > outside of jira to give a full description of the current state of
> > things, the problems I have heard, and a proposed solution.
> >
> > CURRENT
> >
> > We have 2 pieces of code that handle “versioning.”  The first is
> > Constants.LUCENE_MAIN_VERSION, which is written to the SegmentsInfo
> > for each segment.  This is a string version which is used to detect
> > when the current version of lucene is newer than the version that
> > wrote the segment (and how/if an upgrade to to a newer codec should be
> > done). There is some complication with the “display” version and
> > non-display version, which are distinguished by whether the version of
> > lucene was an official release, or an alpha/beta version (which was
> > added specifically for the 4.0.0 release ramp up).  This string
> > version also has its own parsing and comparison methods.
> >
> > The second piece of versioning code is in Version.java, which is an
> > enum used by analyzers to maintain backwards compatible behavior given
> > a specific version of lucene.  The enum only contains values for dot
> > releases of lucene, not bug fixes (which was what spurred the recent
> > discussions over version). Analyzers’ constructors take a required
> > Version parameter, which is only actually used by the few analyzers
> > that have changed behavior recently.  Version.java contains a separate
> > version parsing and comparison methods.
> >
> >
> > CONCERNS
> >
> > * Having 2 different pieces of code that do very similar things is
> > confusing for development.  Very few developers appear to really
> > understand the current system (especially when trying to understand
> > the alpha/beta setup).
> >
> > * Users are generally confused by the Version passed to analyzers: I
> > know I was when I first started working with Lucene, and
> > Version.CURRENT_VERSION was deprecated because users used that without
> > understanding the implications.
> >
> > * Most analyzers currently have dead code constructors, since they
> > never make use of Version.  There are also a lot of classes used by
> > analyzers which contain similar dead code.
> >
> > * Backwards compatibility needs to be handled in some fashion, to
> > ensure users have a path to upgrade from one version of lucene to
> > another, without requiring immediate re-indexing.
> >
> >
> > PROPOSAL
> >
> > I propose the following:
> >
> > * Consolidate all version related enumeration, including reading and
> > writing string versions, into Version.java.  Have a static method that
> > returns the current lucene version (replacing
> > Constants.LUCENE_MAIN_VERSION).
> >
> > * Make bug fix releases first class in the enumeration, so that they
> > can be distinguished for any compatibility issues that come up.
> >
> > * Remove all snapshot/alpha/beta versioning logic.  Alpha/beta was
> > really only necessary for 4.0 because of the extreme changes that were
> > being made.  The system is much more stable now, and 5.0 should not
> > require preview releases, IMO.  I don’t think snapshots should be a
> > concern because any user building an index from an unreleased build
> > (which they built themselves) is just asking for trouble.  They do so
> > at their own risk (of figuring out how to upgrade their indexes if
> > they are not trash-able).  Backwards compatibility can be handled by
> > adding the alpha/beta/final versions of 4.0 to the enum (and special
> > parsing logic for this).  If lucene changes so much that we need
> > alpha/beta type discrimination in the future, we can revisit the
> > system if simply having extra versions in the enum won't work.
> >
> > * Analyzers constructors should have Version removed, and a setter
> > should be added which allows production users to set the version used.
> > This way any analyzers can still use version if it is set to something
> > other than current (which would be the default), but users simply
> > prototyping do not need to worry about it.
> >
> > * Classes that analyzers use, which take Version, should have Version
> > removed, and the analyzers should choose which settings/variants of
> > those classes to use based on the version they have set. In other
> > words, all version variant logic should be contained within the
> > analyzers.  For example, Lucene47WordDelimiterFilter, or
> > StandardAnalyzer can take the unicode version.
> > Factories could still take Version (e.g. TokenizerFactory,
> > TokenFilterFactory, etc) to produce the correct component (so nothing
> > will change for solr in this regard).
> >
> > I’m sure not everyone will be happy with what I have proposed, but I’m
> > hoping we can work out a solution together, and then implement in a
> > team-like fashion, the way I have seen the community work in the past,
> > and I hope to see again in the future.
> >
> > Thanks
> > Ryan
> >
> > [1] https://issues.apache.org/jira/browse/LUCENE-5850
> > [2] https://issues.apache.org/jira/browse/LUCENE-5859
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: dev-help@lucene.apache.org
> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>

Re: Lucene versioning logic

Posted by Michael McCandless <lu...@mikemccandless.com>.

+1, this sounds like a great solution.  It simplifies the APIs (no
more required Version to Analyzer), it consolidates the version logic
to a "single source", dot releases are first class.

Mike McCandless

http://blog.mikemccandless.com


On Fri, Aug 1, 2014 at 7:47 PM, Ryan Ernst <ry...@iernst.net> wrote:
> There has been a lot of heated discussion recently about version
> tracking in Lucene [1] [2].  I wanted to have a fresh discussion
> outside of jira to give a full description of the current state of
> things, the problems I have heard, and a proposed solution.
>
> CURRENT
>
> We have 2 pieces of code that handle “versioning.”  The first is
> Constants.LUCENE_MAIN_VERSION, which is written to the SegmentsInfo
> for each segment.  This is a string version which is used to detect
> when the current version of lucene is newer than the version that
> wrote the segment (and how/if an upgrade to to a newer codec should be
> done). There is some complication with the “display” version and
> non-display version, which are distinguished by whether the version of
> lucene was an official release, or an alpha/beta version (which was
> added specifically for the 4.0.0 release ramp up).  This string
> version also has its own parsing and comparison methods.
>
> The second piece of versioning code is in Version.java, which is an
> enum used by analyzers to maintain backwards compatible behavior given
> a specific version of lucene.  The enum only contains values for dot
> releases of lucene, not bug fixes (which was what spurred the recent
> discussions over version). Analyzers’ constructors take a required
> Version parameter, which is only actually used by the few analyzers
> that have changed behavior recently.  Version.java contains a separate
> version parsing and comparison methods.
>
>
> CONCERNS
>
> * Having 2 different pieces of code that do very similar things is
> confusing for development.  Very few developers appear to really
> understand the current system (especially when trying to understand
> the alpha/beta setup).
>
> * Users are generally confused by the Version passed to analyzers: I
> know I was when I first started working with Lucene, and
> Version.CURRENT_VERSION was deprecated because users used that without
> understanding the implications.
>
> * Most analyzers currently have dead code constructors, since they
> never make use of Version.  There are also a lot of classes used by
> analyzers which contain similar dead code.
>
> * Backwards compatibility needs to be handled in some fashion, to
> ensure users have a path to upgrade from one version of lucene to
> another, without requiring immediate re-indexing.
>
>
> PROPOSAL
>
> I propose the following:
>
> * Consolidate all version related enumeration, including reading and
> writing string versions, into Version.java.  Have a static method that
> returns the current lucene version (replacing
> Constants.LUCENE_MAIN_VERSION).
>
> * Make bug fix releases first class in the enumeration, so that they
> can be distinguished for any compatibility issues that come up.
>
> * Remove all snapshot/alpha/beta versioning logic.  Alpha/beta was
> really only necessary for 4.0 because of the extreme changes that were
> being made.  The system is much more stable now, and 5.0 should not
> require preview releases, IMO.  I don’t think snapshots should be a
> concern because any user building an index from an unreleased build
> (which they built themselves) is just asking for trouble.  They do so
> at their own risk (of figuring out how to upgrade their indexes if
> they are not trash-able).  Backwards compatibility can be handled by
> adding the alpha/beta/final versions of 4.0 to the enum (and special
> parsing logic for this).  If lucene changes so much that we need
> alpha/beta type discrimination in the future, we can revisit the
> system if simply having extra versions in the enum won't work.
>
> * Analyzers constructors should have Version removed, and a setter
> should be added which allows production users to set the version used.
> This way any analyzers can still use version if it is set to something
> other than current (which would be the default), but users simply
> prototyping do not need to worry about it.
>
> * Classes that analyzers use, which take Version, should have Version
> removed, and the analyzers should choose which settings/variants of
> those classes to use based on the version they have set. In other
> words, all version variant logic should be contained within the
> analyzers.  For example, Lucene47WordDelimiterFilter, or
> StandardAnalyzer can take the unicode version.
> Factories could still take Version (e.g. TokenizerFactory,
> TokenFilterFactory, etc) to produce the correct component (so nothing
> will change for solr in this regard).
>
> I’m sure not everyone will be happy with what I have proposed, but I’m
> hoping we can work out a solution together, and then implement in a
> team-like fashion, the way I have seen the community work in the past,
> and I hope to see again in the future.
>
> Thanks
> Ryan
>
> [1] https://issues.apache.org/jira/browse/LUCENE-5850
> [2] https://issues.apache.org/jira/browse/LUCENE-5859
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org