Posted to dev@lucene.apache.org by Earwin Burrfoot <ea...@gmail.com> on 2010/11/27 21:44:54 UTC

deprecating Versions

I think we should deprecate and remove Version constants as Lucene progresses?

Imagine there's a number of features in 4.x that get deprecated and
un-defaulted in 5.x, then removed in 6.x.
Our user compiled with Version.4_0; it was fine in 4.x, and it still
worked in 5.x, as we preserved index compatibility, but then it silently
broke in 6.x -> not good.
If we deprecated Version.4_0 @ 5.x, he'd get a warning if he tried
recompiling. Then if we removed Version.4_0 @ 6.x, his app wouldn't start
anymore, even without recompiling -> fail fast.

Going with this, we should deprecate the 3.x constants in trunk and delete
the 2.x ones. In the 3.x branch, we should deprecate the 2.x constants.
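
For context, this is the usage pattern in question: applications pin
analysis behavior by passing a Version constant at construction time.
A minimal sketch against the 3.x API (org.apache.lucene.util.Version;
"body" is just an example field name):

    // The application hard-codes the Version it was developed against.
    Version matchVersion = Version.LUCENE_30;

    // Analyzers and the query parser key their behavior off that constant,
    // so existing indexes keep getting the old token streams.
    Analyzer analyzer = new StandardAnalyzer(matchVersion);
    QueryParser parser = new QueryParser(matchVersion, "body", analyzer);

    // If LUCENE_30 were deprecated in 5.x, recompiling would produce a
    // warning; if it were removed in 6.x, the already-compiled app would
    // fail at startup with a missing constant instead of silently
    // changing behavior.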

-- 
Kirill Zakharenko/Кирилл Захаренко (earwin@gmail.com)
Phone: +7 (495) 683-567-4
ICQ: 104465785



Re: deprecating Versions

Posted by Michael McCandless <lu...@mikemccandless.com>.
On Sat, Nov 27, 2010 at 3:44 PM, Earwin Burrfoot <ea...@gmail.com> wrote:
> I think we should deprecate and remove Version constants as Lucene progresses?
>
> Imagine there's a number of features in 4.x that get deprecated and
> un-defaulted in 5.x, then removed in 6.x.
> Our user compiled with Version.4_0; it was fine in 4.x, and it still
> worked in 5.x, as we preserved index compatibility, but then it silently
> broke in 6.x -> not good.
> If we deprecated Version.4_0 @ 5.x, he'd get a warning if he tried
> recompiling. Then if we removed Version.4_0 @ 6.x, his app wouldn't start
> anymore, even without recompiling -> fail fast.
>
> Going with this, we should deprecate the 3.x constants in trunk and delete
> the 2.x ones. In the 3.x branch, we should deprecate the 2.x constants.

+1, this makes complete sense.

I.e., it simply "matches" our index back-compat policy, which is exactly
when these version constants are intended to be used.

Mike



Re: deprecating Versions

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Mon, Nov 29, 2010 at 05:34:27AM -0500, Robert Muir wrote:
> Is it somehow possible I could convince everyone that all the analyzers we
> provide are simply examples?  This way we could really make this a bit more
> reasonable and clean up a lot of stuff.

I understand what you're getting at.  We don't really expect people to fork an
analyzer code base, though -- so we need to draw a line between e.g. the code
that implements StopFilter and stoplist content.   We want the low-level code
to be part of the library, but maybe we want stoplist content to be considered
example code.

> Seems like we really want to move towards a more declarative model where
> these are just config files... so only then will it be OK for us to change
> them because they suddenly aren't suffixed with .java?!

Consider how this might work with e.g. RussianAnalyzer.  The
declaratively-expressed sample analyzer config could contain a hard-coded list
of Russian stop words, and as this hard-coded stoplist would travel with the
index in a config file, there would be no index compatibility problems upon
upgrading Lucene.  The stoplist in the sample config could change, even on
bugfix releases.

Config file syntax would potentially be affected by a Lucene upgrade, but that
doesn't affect index content and maintaining back compat is straightforward.

Things are more difficult with versioning e.g. stemmers, but I think the
stoplist example illustrates the potential of declarative analyzer
specification.  Maybe specifying Version in a sample file and dispatching to
different revs of a Snowball stemmer is less painful than forcing a user to
figure out Version from API documentation?
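
As a straw man, such a sample config could look something like Solr's
existing declarative analyzer specs (purely illustrative; the field type
name and stopword file are made up, and the stoplist data would ship
alongside the index rather than inside a .java file):

    <!-- Example only: a Russian analysis chain expressed as configuration,
         so the stoplist and chain can change without an API break. -->
    <fieldType name="text_ru" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.StopFilterFactory" words="stopwords_ru.txt"
                ignoreCase="true"/>
        <filter class="solr.SnowballPorterFilterFactory" language="Russian"/>
      </analyzer>
    </fieldType>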

Having to extract an Analyzer from an index directory does present the
potential for Analyzer mismatches in a multi-node setup where e.g. the machine
that parses the query string and the machine which executes matching are not
the same.

Marvin Humphrey




Re: Document aware analyzers was Re: deprecating Versions

Posted by Robert Muir <rc...@gmail.com>.
On Wed, Dec 1, 2010 at 3:44 PM, Grant Ingersoll <gs...@apache.org> wrote:

>> Well i have trouble with a few of your examples: "want to use
>> Tee/Sink" doesn't work for me... its a description of an XY problem to
>> me... i've never needed to use it, and its rarely discussed on the
>> user list...
>
> Shrugs.  In my experiments, it can really speed things up when analyzing the same content, but with different outcomes, or at least it did back before the new API.

<snip>

> For instance, the typical copy field scenario where one has two fields containing the same content analyzed in slightly different ways.  In many cases, most of the work is exactly the same (tokenize, lowercase, stopword, stem or not) and yet we have to pass around the string twice and do almost all of the same work twice all so that we can change one little thing on the token.
>

But didn't you just answer your own question? Sounds like you just need
to implement copyField in Solr with Tee/Sink.
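
For reference, a rough sketch of the Tee/Sink pattern under discussion,
along the lines of the 3.x TeeSinkTokenFilter javadoc (the field names,
the stemming filter, and the reader/writer variables are illustrative):

    // Tokenize the content a single time...
    TeeSinkTokenFilter source =
        new TeeSinkTokenFilter(new StandardTokenizer(Version.LUCENE_30, reader));
    // ...and let a sink replay those tokens for a second field.
    TeeSinkTokenFilter.SinkTokenStream sink = source.newSinkTokenStream();

    Document doc = new Document();
    // The source field must be added (and thus consumed) before the sink.
    doc.add(new Field("text", new PorterStemFilter(source)));  // stemmed copy
    doc.add(new Field("text_exact", sink));                    // unstemmed copy
    writer.addDocument(doc);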



Re: Document aware analyzers was Re: deprecating Versions

Posted by Earwin Burrfoot <ea...@gmail.com>.
I agree with Robert that minimizing the analysis <-> indexer interface is
the way to go.
For me, one of Lucene's problems is that it wants to do too much stuff
out of the box and is tightly coupled, so you can't drop many of the
things you never need.

Having a minimal interface for the indexer allows us to experiment with
various analysis approaches without touching core functionality at
all.
E.g., I'd like to make the analysis chain non-streaming. This would
greatly simplify many of my filters, and for my use case likely yield
more performance.
At the same time I understand that many people can't afford to keep
their docs completely in memory while indexing.

My ideal API is unusable for them, and wrapping my token buffers to look
like Document+Fieldables+TokenStreams is uuugly.

Having the lowest possible common denominator as the indexer interface is
best for both parties.

On Wed, Dec 1, 2010 at 23:44, Grant Ingersoll <gs...@apache.org> wrote:
>
> On Dec 1, 2010, at 2:40 PM, Robert Muir wrote:
>
>> On Wed, Dec 1, 2010 at 2:25 PM, Grant Ingersoll <gs...@apache.org> wrote:
>>>
>>> Nah, I just meant analysis would often benefit from having knowledge of the document as a whole instead of just the individual field.
>>>
>>
>> and analysis would suffer from this too, because right now these
>> things are independent and we have a fast simple reusable model.
>> I'd prefer to keep the TokenStream analysis api... but as we have
>> discussed on the list, it would be nice to minimize the interface
>> between analysis components and indexer/queryparser so you can use an
>> *alternative* API... we are working in this direction already.
>
> I think the existing TokenStream API still works, at least in my mind.
>
>>
>>>>
>>>> Maybe if you give a concrete example then I would have a better
>>>> understanding of the problem you think this might solve.
>>>
>>> Let me see if I can put some flesh on the bones.  I'm assuming the raw document has already been parsed and that we are still basically dealing with strings and that we have a document which contains one or more fields.
>>>
>>> If we step back and look at our analysis process, there are some things that are easy and some things that are hard that maybe shouldn't be because even though we talk like we are indexing and search documents, we are really indexing and searching fields and everything is Field centric.  That works fine for the easy analysis things like tokenization, stemming, lowercasing, etc. when all the content is in one language.  It doesn't work well when you have multiple languages in a single document or if you want to do things like Tee/Sink or even something as simple as Solr's copy field semantics.
>>
>> Well i have trouble with a few of your examples: "want to use
>> Tee/Sink" doesn't work for me... its a description of an XY problem to
>> me... i've never needed to use it, and its rarely discussed on the
>> user list...
>
> Shrugs.  In my experiments, it can really speed things up when analyzing the same content, but with different outcomes, or at least it did back before the new API.  My bigger point is things like that and the PerFieldAnalyzerWrapper are symptoms of treating documents as second class citizens.
>
>>
>> As far as working with a lot of languages, i understand this issue
>> much more... but i've never much had a desire for this, especially
>> given the fact that "Query is a document too"... I'm personally not a
>> fan of language detection,
>> and I don't think it belongs in our analysis API: like encoding
>> detection and other similar heuristics, its part of document parsing
>> to me!
>
> I didn't say it did, I just said it is an example of the types of things where we pretend like we are document-centric, but we are actually field centric.
>
>>
>> As I said before, I think our TokenStream analysis API is already
>> quite complicated and I dont think we should make it more complicated
>> for these reasons (especially since these examples are quite vague and
>> i'm still not sure you cannot solve them easier in another way.
>
> I never said you couldn't solve them in other ways, but I always find they are kludgy.  For instance, how many times, in a complex environment, must one tokenize the same text over and over again just to get it in the index?
>
>>
>> If you want to use a more complicated analysis API that doesnt work
>> like TokenStreams but instead incorporates things that are document
>> parsing or whatever, i guess you should be able to do that. I'm not
>> sure Lucene should provide such an API, but we shouldn't force you to
>> use the TokenStreams API either.
>
> You keep going back to document parsing, even though I have never mentioned it.  All I am proposing/_wanting to discuss_ is the notion that Analysis might benefit from a more document centric view of analysis.  You're presupposing I want to change TokenStreams, etc. when all I'm wanting to do is take a step back and discuss the bigger picture of how a user actually does analysis in the real world and whether we can make it easier for them.  I don't even have an implementation in mind yet.
>
> For instance, the typical copy field scenario where one has two fields containing the same content analyzed in slightly different ways.  In many cases, most of the work is exactly the same (tokenize, lowercase, stopword, stem or not) and yet we have to pass around the string twice and do almost all of the same work twice all so that we can change one little thing on the token.
>
> -Grant



-- 
Kirill Zakharenko/Кирилл Захаренко (earwin@gmail.com)
Phone: +7 (495) 683-567-4
ICQ: 104465785



Re: Document aware analyzers was Re: deprecating Versions

Posted by Grant Ingersoll <gs...@apache.org>.
On Dec 1, 2010, at 2:40 PM, Robert Muir wrote:

> On Wed, Dec 1, 2010 at 2:25 PM, Grant Ingersoll <gs...@apache.org> wrote:
>> 
>> Nah, I just meant analysis would often benefit from having knowledge of the document as a whole instead of just the individual field.
>> 
> 
> and analysis would suffer from this too, because right now these
> things are independent and we have a fast simple reusable model.
> I'd prefer to keep the TokenStream analysis api... but as we have
> discussed on the list, it would be nice to minimize the interface
> between analysis components and indexer/queryparser so you can use an
> *alternative* API... we are working in this direction already.

I think the existing TokenStream API still works, at least in my mind.  

> 
>>> 
>>> Maybe if you give a concrete example then I would have a better
>>> understanding of the problem you think this might solve.
>> 
>> Let me see if I can put some flesh on the bones.  I'm assuming the raw document has already been parsed and that we are still basically dealing with strings and that we have a document which contains one or more fields.
>> 
>> If we step back and look at our analysis process, there are some things that are easy and some things that are hard that maybe shouldn't be because even though we talk like we are indexing and search documents, we are really indexing and searching fields and everything is Field centric.  That works fine for the easy analysis things like tokenization, stemming, lowercasing, etc. when all the content is in one language.  It doesn't work well when you have multiple languages in a single document or if you want to do things like Tee/Sink or even something as simple as Solr's copy field semantics.
> 
> Well i have trouble with a few of your examples: "want to use
> Tee/Sink" doesn't work for me... its a description of an XY problem to
> me... i've never needed to use it, and its rarely discussed on the
> user list...

Shrugs.  In my experiments, it can really speed things up when analyzing the same content, but with different outcomes, or at least it did back before the new API.  My bigger point is things like that and the PerFieldAnalyzerWrapper are symptoms of treating documents as second class citizens.

> 
> As far as working with a lot of languages, i understand this issue
> much more... but i've never much had a desire for this, especially
> given the fact that "Query is a document too"... I'm personally not a
> fan of language detection,
> and I don't think it belongs in our analysis API: like encoding
> detection and other similar heuristics, its part of document parsing
> to me!

I didn't say it did, I just said it is an example of the types of things where we pretend like we are document-centric, but we are actually field centric.

> 
> As I said before, I think our TokenStream analysis API is already
> quite complicated and I dont think we should make it more complicated
> for these reasons (especially since these examples are quite vague and
> i'm still not sure you cannot solve them easier in another way.

I never said you couldn't solve them in other ways, but I always find they are kludgy.  For instance, how many times, in a complex environment, must one tokenize the same text over and over again just to get it in the index?

> 
> If you want to use a more complicated analysis API that doesnt work
> like TokenStreams but instead incorporates things that are document
> parsing or whatever, i guess you should be able to do that. I'm not
> sure Lucene should provide such an API, but we shouldn't force you to
> use the TokenStreams API either.

You keep going back to document parsing, even though I have never mentioned it.  All I am proposing/_wanting to discuss_ is the notion that Analysis might benefit from a more document centric view of analysis.  You're presupposing I want to change TokenStreams, etc. when all I'm wanting to do is take a step back and discuss the bigger picture of how a user actually does analysis in the real world and whether we can make it easier for them.  I don't even have an implementation in mind yet.

For instance, the typical copy field scenario where one has two fields containing the same content analyzed in slightly different ways.  In many cases, most of the work is exactly the same (tokenize, lowercase, stopword, stem or not) and yet we have to pass around the string twice and do almost all of the same work twice all so that we can change one little thing on the token.  

-Grant


Re: Document aware analyzers was Re: deprecating Versions

Posted by Robert Muir <rc...@gmail.com>.
On Wed, Dec 1, 2010 at 2:25 PM, Grant Ingersoll <gs...@apache.org> wrote:
>
> Nah, I just meant analysis would often benefit from having knowledge of the document as a whole instead of just the individual field.
>

and analysis would suffer from this too, because right now these
things are independent and we have a fast, simple, reusable model.
I'd prefer to keep the TokenStream analysis API... but as we have
discussed on the list, it would be nice to minimize the interface
between analysis components and the indexer/query parser so you can use
an *alternative* API... we are working in this direction already.

>>
>> Maybe if you give a concrete example then I would have a better
>> understanding of the problem you think this might solve.
>
> Let me see if I can put some flesh on the bones.  I'm assuming the raw document has already been parsed and that we are still basically dealing with strings and that we have a document which contains one or more fields.
>
> If we step back and look at our analysis process, there are some things that are easy and some things that are hard that maybe shouldn't be because even though we talk like we are indexing and search documents, we are really indexing and searching fields and everything is Field centric.  That works fine for the easy analysis things like tokenization, stemming, lowercasing, etc. when all the content is in one language.  It doesn't work well when you have multiple languages in a single document or if you want to do things like Tee/Sink or even something as simple as Solr's copy field semantics.

Well, I have trouble with a few of your examples: "want to use
Tee/Sink" doesn't work for me... it's a description of an XY problem to
me... I've never needed to use it, and it's rarely discussed on the
user list...

As far as working with a lot of languages, I understand this issue
much more... but I've never much had a desire for this, especially
given the fact that "Query is a document too"... I'm personally not a
fan of language detection,
and I don't think it belongs in our analysis API: like encoding
detection and other similar heuristics, it's part of document parsing
to me!

As I said before, I think our TokenStream analysis API is already
quite complicated and I don't think we should make it more complicated
for these reasons (especially since these examples are quite vague and
I'm still not sure you can't solve them more easily in another way).

If you want to use a more complicated analysis API that doesn't work
like TokenStreams but instead incorporates things that are document
parsing or whatever, I guess you should be able to do that. I'm not
sure Lucene should provide such an API, but we shouldn't force you to
use the TokenStreams API either.



Re: Document aware analyzers was Re: deprecating Versions

Posted by Grant Ingersoll <gs...@apache.org>.
On Dec 1, 2010, at 8:07 AM, Robert Muir wrote:

> On Wed, Dec 1, 2010 at 8:01 AM, Grant Ingersoll <gs...@apache.org> wrote:
> 
>> While we are at it, how about we make the Analysis process document aware instead of Field aware?  The PerFieldAnalyzerWrapper, while doing exactly what it says it does, is just silly.  If you had an analysis process that was aware, if it chooses to be, of the document as a whole then you open up a whole lot more opportunity for doing interesting analysis while losing nothing towards the individual treatment of fields.  The TeeSink stuff is an attempt at this, but it is not sufficient.
>> 
> 
> I'm not sure I like this: traditionally we let the user application
> deal with "document parsing" (how do you take your content and define
> it as documents/fields).

Nah, I just meant analysis would often benefit from having knowledge of the document as a whole instead of just the individual field.  

> 
> If we want to change lucene to start dealing with this "document
> parsing" aspect, thats pretty scary in itself, but in my opinion the
> very last choice of where we would want to add something like that is
> analysis! So personally I really like analysis being separate from
> document parsing: our analysis API is already way too complicated.

Yes, I agree.


> 
> Maybe if you give a concrete example then I would have a better
> understanding of the problem you think this might solve.

Let me see if I can put some flesh on the bones.  I'm assuming the raw document has already been parsed and that we are still basically dealing with strings and that we have a document which contains one or more fields.

If we step back and look at our analysis process, there are some things that are easy and some things that are hard that maybe shouldn't be, because even though we talk like we are indexing and searching documents, we are really indexing and searching fields, and everything is field-centric.  That works fine for the easy analysis things like tokenization, stemming, lowercasing, etc. when all the content is in one language.  It doesn't work well when you have multiple languages in a single document, or if you want to do things like Tee/Sink, or even something as simple as Solr's copy field semantics.  The fact that we have PerFieldAnalyzerWrapper is a symptom of this.  The clunkiness of the TeeSinkTokenFilter is another.  Handling automatic language identification is another.  The end result of all of these things is that you often have to do analysis work twice (or more) for the same piece of content.  I believe that an analysis process that knew a document had multiple fields (which seems like a given) might lead to more efficiencies, because repeated analysis work could be shared, and because work that inherently crosses multiple fields on the same document, or selects a particular field out of a choice of several, can be handled more cleanly.

So, you as the developer would still need to define what your fields are and what analysis you want done for each of those fields, but we, as Lucene developers, might be able to make things more efficient if we can recognize commonalities, etc., as well as offer users tools that make it easy to work across fields.

At any rate, this is all just food for thought.  I don't have any proposed API changes at this point.


Re: Document aware analyzers was Re: deprecating Versions

Posted by Robert Muir <rc...@gmail.com>.
On Wed, Dec 1, 2010 at 8:01 AM, Grant Ingersoll <gs...@apache.org> wrote:

> While we are at it, how about we make the Analysis process document aware instead of Field aware?  The PerFieldAnalyzerWrapper, while doing exactly what it says it does, is just silly.  If you had an analysis process that was aware, if it chooses to be, of the document as a whole then you open up a whole lot more opportunity for doing interesting analysis while losing nothing towards the individual treatment of fields.  The TeeSink stuff is an attempt at this, but it is not sufficient.
>

I'm not sure I like this: traditionally we let the user application
deal with "document parsing" (how do you take your content and define
it as documents/fields).

If we want to change Lucene to start dealing with this "document
parsing" aspect, that's pretty scary in itself, but in my opinion the
very last choice of where we would want to add something like that is
analysis! So personally I really like analysis being separate from
document parsing: our analysis API is already way too complicated.

Maybe if you give a concrete example then I would have a better
understanding of the problem you think this might solve.



Document aware analyzers was Re: deprecating Versions

Posted by Grant Ingersoll <gs...@apache.org>.
On Nov 29, 2010, at 5:34 AM, Robert Muir wrote:

> On Mon, Nov 29, 2010 at 2:50 AM, Earwin Burrfoot <ea...@gmail.com> wrote:
>> And for indexes:
>> * Index compatibility is guaranteed across two adjacent major
>> releases. eg 2.x -> 3.x, 3.x -> 4.x.
>>  That includes both binary compat - codecs, and semantic compat -
>> analyzers (if appropriate Version is used).
>> * Older releases are most probably unsupported.
>>  e.g. 4.x still supports shared docstores for reading, though never
>> writes them. 5.x won't read them either, so you'll have to at least
>> fully optimize your 3.x indexes when going through 4.x to 5.x.
>> 
> 
> Is it somehow possible i could convince everyone that all the
> analyzers we provide are simply examples?
> This way we could really make this a bit more reasonable and clean up
> a lot of stuff.
> 
> Seems like we really want to move towards a more declarative model
> where these are just config files... so only then will it be OK for us to
> change them because they suddenly aren't suffixed with .java?!

While we are at it, how about we make the Analysis process document aware instead of Field aware?  The PerFieldAnalyzerWrapper, while doing exactly what it says it does, is just silly.  If you had an analysis process that was aware, if it chooses to be, of the document as a whole then you open up a whole lot more opportunity for doing interesting analysis while losing nothing towards the individual treatment of fields.  The TeeSink stuff is an attempt at this, but it is not sufficient.
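
(For concreteness, the field-at-a-time pattern in question looks roughly like this against the 3.x API; the field names are made up:)

    // Everything hangs off individual fields -- the wrapper has no view of
    // the document as a whole, only a per-field lookup table.
    PerFieldAnalyzerWrapper analyzer =
        new PerFieldAnalyzerWrapper(new StandardAnalyzer(Version.LUCENE_30));
    analyzer.addAnalyzer("id", new KeywordAnalyzer());
    analyzer.addAnalyzer("tags", new WhitespaceAnalyzer());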

Just a thought,
Grant


Re: deprecating Versions

Posted by DM Smith <dm...@gmail.com>.
On 11/29/2010 03:43 PM, Earwin Burrfoot wrote:
> On Mon, Nov 29, 2010 at 20:51, DM Smith<dm...@gmail.com>  wrote:
>> The other thing I'd like is for the spec to be saved alongside the index
>> as a manifest. From earlier threads, I can see that there might need to be
>> one for writing and another for reading. I'm not interested in using it to
>> construct an analyzer, but to determine whether the index is invalid with
>> respect to the analyzer currently in use.
> You can already implement such behaviour with the 3.x branch of Lucene.
> It has an IW.commit(Map<String, String> userdata) method that allows you
> to commit with an arbitrary payload, which binds to the segment and can be
> read back later.

Cool. I forgot entirely about that.

>> I think there is a problem with deprecating and removing constants too.
>> In trunk, which will be 4.0, it needs to be able to read and/or upgrade 2.x
>> indexes. From an analyzer perspective, an index is invalid if the analyzer
>> would produce a different token stream for the same input. If the 2.x
>> version constants are gone, then the index built with 2.x version
>> constants is no longer valid. (It might be valid, but how can one have any
>> confidence of that?) Upgrading the index to the new internal format
>> cannot change this. A buggy lowercase Turkish word will still be buggy
>> after upgrade. (This is a 3.0 version constant that in 5.0 will still need to be around).
> I think it was declared that Lucene does not provide index
> compatibility across more than a single major revision.
> Thus, we don't guarantee reading 2.x index with 4.0 Lucene. So, we can
> drop 2.x constants and compatibility.
> But we still have to support 3.x. In version 5.0 then we're dropping
> 3.x constants and support for bugs/deprecated
> features of 3.x.

Yes, you are correct that 4.0 may, but is not guaranteed to, read 2.x. My 
bad, yet again. I went back to the threads on this from around May 25, 
and it was also decided that 4.x might not be able to read 3.x, but would 
provide a migration tool in such a case.

That said, my point still stands. The 3.0 version constant, which is used 
by an analyzer to preserve 3.0 behavior, will need to be retained for the 
sake of analyzers in 5.0, or the index will need to be rebuilt from the 
original input. (I'm referencing 3.0 rather than 2.x because of the 
example I have in mind.)

The tokens in a 3.0 index that is migrated to a 4.0 index were still 
produced by an analyzer that was buggy. For example, a Turkish index 
with the wrong lower case i: prior to LUCENE-2101, I ("regular" upper 
case I) would lowercase to i ("regular" lower case i); after the fix, 
İ (dotted capital I) => i and I => ı (dotless lower case i). This very 
commonly occurs in Turkish text. So the 4.0 index, still using the 3.0 
version constant to get the expected behavior, works as it always did.

Now in 5.0, there might be a migration tool, or it will be able to read a 
4.x index. If the 3.0 constant is gone, none of these tokens are 
reachable: search requests will have the correct lower case i and will 
not be able to find those with the wrong one. It will be very obvious.

Regarding this analyzer: if the 2.x constants are removed, code that uses 
a 2.x version constant for it will need to change to a 3.0 version 
constant in order for the index to be usable in the 4.x series.

I don't think this is an isolated example.

With what's happening, every application that uses a deprecated version 
constant will have one very long major release cycle in which to rebuild 
its indexes from scratch.

And as I said at the bottom of my last email, I'm going to re-index 
because I am able and because I want correct behavior. So whatever is 
decided won't affect my application of Lucene.

-- DM








Re: deprecating Versions

Posted by Earwin Burrfoot <ea...@gmail.com>.
On Mon, Nov 29, 2010 at 20:51, DM Smith <dm...@gmail.com> wrote:
> The other thing I'd like is for the spec to be saved alongside the index
> as a manifest. From earlier threads, I can see that there might need to be
> one for writing and another for reading. I'm not interested in using it to
> construct an analyzer, but to determine whether the index is invalid with
> respect to the analyzer currently in use.
You can already implement such behaviour with the 3.x branch of Lucene.
It has an IW.commit(Map<String, String> userdata) method that allows you
to commit with an arbitrary payload, which binds to the segment and can
be read back later.
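
A rough sketch of that approach against the 3.x API (the key names, the
spec string, and the writer/directory variables are just illustration):

    // At commit time, record whatever describes the analysis chain in use.
    Map<String, String> userData = new HashMap<String, String>();
    userData.put("analyzer.spec", "StandardTokenizer|LowerCaseFilter|StopFilter");
    userData.put("lucene.match.version", Version.LUCENE_30.toString());
    writer.commit(userData);

    // Later, read it back and compare against what the application expects
    // before deciding whether the index needs to be rebuilt.
    Map<String, String> stored = IndexReader.getCommitUserData(directory);
    boolean needsRebuild =
        !"StandardTokenizer|LowerCaseFilter|StopFilter".equals(stored.get("analyzer.spec"));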

> I think there is a problem with deprecating and removing constants too.
> In trunk, which will be 4.0, it needs to be able to read and/or upgrade 2.x
> indexes. From an analyzer perspective, an index is invalid if the analyzer
> would produce a different token stream for the same input. If the 2.x
> version constants are gone, then the index built with 2.x version
> constants is no longer valid. (It might be valid, but how can one have any
> confidence of that?) Upgrading the index to the new internal format
> cannot change this. A buggy lowercase Turkish word will still be buggy
> after upgrade. (This is a 3.0 version constant that in 5.0 will still need to be around).
I think it was declared that Lucene does not provide index
compatibility across more than a single major revision.
Thus, we don't guarantee reading a 2.x index with 4.0 Lucene, so we can
drop the 2.x constants and compatibility.
But we still have to support 3.x. In version 5.0 we would then drop the
3.x constants and support for the bugs/deprecated
features of 3.x.

-- 
Kirill Zakharenko/Кирилл Захаренко (earwin@gmail.com)
Phone: +7 (495) 683-567-4
ICQ: 104465785



Re: deprecating Versions

Posted by DM Smith <dm...@gmail.com>.
On 11/29/2010 01:43 PM, Robert Muir wrote:
> On Mon, Nov 29, 2010 at 12:51 PM, DM Smith<dm...@gmail.com>  wrote:
>>> Instead, you should use a Tokenizer that respects canonical
>>> equivalence (tokenizes text that is canonically equivalent in the same
>>> way), such as UAX29Tokenizer/StandardTokenizer in branch_3x. Ideally
>>> your filters too, will respect this equivalence, and you can finally
>>> normalize a single time at the *end* of processing.
>> Should it be normalized at all before using these? NFKC?
>>
> Sorry, i wanted to answer this one too :)

Thanks!

> NFKC is definitely a case where its likely what you want for search,
> but you don't want to normalize your documents to this... it removes
> certain distinctions important to display.
I have found that for everything but Hebrew, NFC is really good for 
display. For some reason, Hebrew does better with NFD. But since I can't 
see some of the nuances of some scripts, e.g. farsi/arabic (parochial 
vision at work), it's not saying much. I agree, the K forms are terrible 
for display.

In the context of my app, the document is accepted as is and is not 
stored in the index. As we are not on 3.x yet, and I've not backported 
your tokenizers, I'm stuck with a poor 2.x implementation. And at this 
time we do not normalize the stream as it is indexed or searched. The 
result is terrible. For example, the user can copy displayed Farsi text 
and then search it, but when they compose it from the keyboard, it 
doesn't work. Normalizing the text as it is passed to index and to 
search improves the situation greatly. While the results do vary by 
form, they eclipse the bad, unnormalized results.

I appreciate your input as I'm working on making the change and the 
upgrade to 3.x/Java 5.

> If you are going to normalize to NFK[CD], thats a good reason to to
> deal with normalization in the analysis process, instead of
> normalizing your docs to these destructive lossy forms. (I do, however
> think its ok to normalize the docs to NFC for display, this is
> probably a good thing, because many rendering engines+fonts will
> display it better).
>
> The ICUTokenizer/UAX29Tokenizer/StandardTokenizer only respects
> canonical equivalence, not compatibility equivalence, but I think this
> is actually good. Have a look at the examples in
> http://unicode.org/reports/tr15/, such as fractions and subscripts.
> Its sorta up to the app to determine how it wants to deal with these,
> so treating 2⁵ the same as "25" by default (thats what NFKC will do!)
> early in the analysis process is dangerous. An app might want to
> normalize this to "32".
I don't know if it is still there but IBM had a web form where one could 
submit input and have it transformed to the various forms. I found it 
very educational.

> So it can be better to normalize towards the end of your analysis
> process, e.g. have a look at ICUNormalizer2Filter: which supports the
> NFKC_CaseFold normal form (NFKC + CaseFold + removing Ignorables) in
> additional to the standard ones, and ICUFoldingFilter, which is just
> like that, except it does additional folding for search (like removing
> diacritics). These foldings are computed recursively up front so they
> give a stable result.
Many thanks. This is very helpful.

-- DM



Re: deprecating Versions

Posted by Robert Muir <rc...@gmail.com>.
On Mon, Nov 29, 2010 at 12:51 PM, DM Smith <dm...@gmail.com> wrote:
>> Instead, you should use a Tokenizer that respects canonical
>> equivalence (tokenizes text that is canonically equivalent in the same
>> way), such as UAX29Tokenizer/StandardTokenizer in branch_3x. Ideally
>> your filters too, will respect this equivalence, and you can finally
>> normalize a single time at the *end* of processing.
>
> Should it be normalized at all before using these? NFKC?
>

Sorry, I wanted to answer this one too :)
NFKC is definitely a case where it's likely what you want for search,
but you don't want to normalize your documents to this... it removes
certain distinctions important for display.

If you are going to normalize to NFK[CD], that's a good reason to
deal with normalization in the analysis process, instead of
normalizing your docs to these destructive, lossy forms. (I do, however,
think it's OK to normalize the docs to NFC for display; this is
probably a good thing, because many rendering engines+fonts will
display it better.)

The ICUTokenizer/UAX29Tokenizer/StandardTokenizer only respect
canonical equivalence, not compatibility equivalence, but I think this
is actually good. Have a look at the examples in
http://unicode.org/reports/tr15/, such as fractions and subscripts.
It's sort of up to the app to determine how it wants to deal with these,
so treating 2⁵ the same as "25" by default (that's what NFKC will do!)
early in the analysis process is dangerous. An app might want to
normalize this to "32".

So it can be better to normalize towards the end of your analysis
process. E.g., have a look at ICUNormalizer2Filter, which supports the
NFKC_CaseFold normal form (NFKC + CaseFold + removing ignorables) in
addition to the standard ones, and ICUFoldingFilter, which is just
like that, except it does additional folding for search (like removing
diacritics). These foldings are computed recursively up front so they
give a stable result.
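
A sketch of what that late-normalizing chain might look like with the
icu contrib module (the Version constant, the tokenizer choice, and the
reader variable are placeholders):

    // Tokenize first, with a tokenizer that respects canonical equivalence...
    TokenStream ts = new StandardTokenizer(Version.LUCENE_31, reader);
    // ...then fold case, accents and compatibility forms once, at the end,
    // instead of normalizing the raw documents up front.
    ts = new ICUFoldingFilter(ts);

    // Or, to apply only a normal form (this one defaults to NFKC_CaseFold):
    // ts = new ICUNormalizer2Filter(ts);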



Re: deprecating Versions

Posted by DM Smith <dm...@gmail.com>.
On 11/29/2010 01:03 PM, Robert Muir wrote:
> On Mon, Nov 29, 2010 at 12:51 PM, DM Smith<dm...@gmail.com>  wrote:
>> I'd have to look to be sure: IIRC, Turkish was one. The treatment of 'i' was
>> buggy. Russian had it's own encoding that was replaced with UTF-8. The
>> QueryParser had bug fixes. There is some effort to migrate away from stemmer
>> to snowball, but at least the Dutch one is not "identical".
>>
> but none of these broke backwards compatibility, they all respect the
> Version constant!
> The SnowballAnalyzer respects the version constant for the buggy
> turkish lowercasing! If you use VERSION.LUCENE_30 (or less) it wrongly
> lowercases so you get your old buggy behavior.
>
> Even the old buggy Dutch stemmer is still there, and if you use
> DutchAnalyzer(Version.LUCENE_30) (or less) it stems incorrectly so you
> get your old buggy behavior!
>
> The russian was the same way, same with the QueryParser.
>
> So I'm sorry, I am left confused about where the backwards breaks are?
Strictly speaking there are none, in the present. The user of Lucene can 
choose to break compatibility and retain old (and in these cases, buggy) 
behavior. This maintains Lucene's bw-compat policy.

This thread talked about removing the Version constants in the future? I 
went back and re-read the thread. Perhaps I misunderstood. I saw several 
thoughts:
a) Deprecate version constants one version back and remove those two 
versions back.
b) Remove all version constants and use versioned jars instead.

If there is no way to select a prior behavior except to select a single 
jar that had lots of analyzers (or analyzer parts) in it, then I'm stuck 
with older code that is perhaps buggy. I can't pick a later analyzer for 
English and an earlier, buggy analyzer for Turkish. I have to get all of 
them from one jar. (Unless we get into renaming packages and/or 
classes). So I can't get some improvements while ignoring others.

I think there is a problem with deprecating and removing constants too. 
In trunk, which will be 4.0, it needs to be able to read and/or upgrade 
2.x indexes. From an analyzer perspective, an index is invalid if the 
analyzer would produce a different token stream for the same input. If 
the 2.x version constants are gone, then the index built with 2.x 
version constants is no longer valid. (It might be valid, but how can 
one have any confidence of that?) Upgrading the index to the new 
internal format cannot change this. A buggy lowercase Turkish word will 
still be buggy after upgrade. (This is a 3.0 version constant that in 
5.0 will still need to be around).

We either need more frequent releases (forcing the issue earlier and 
eliminating stale code earlier) or something's gotta give.

That said. As a user, I don't care any more. I'll give. The benefit of a 
better index outweighs backward compatibility for me.

-- DM




Re: deprecating Versions

Posted by Robert Muir <rc...@gmail.com>.
On Mon, Nov 29, 2010 at 12:51 PM, DM Smith <dm...@gmail.com> wrote:
>
> I'd have to look to be sure: IIRC, Turkish was one. The treatment of 'i' was
> buggy. Russian had it's own encoding that was replaced with UTF-8. The
> QueryParser had bug fixes. There is some effort to migrate away from stemmer
> to snowball, but at least the Dutch one is not "identical".
>

But none of these broke backwards compatibility; they all respect the
Version constant!
The SnowballAnalyzer respects the version constant for the buggy
Turkish lowercasing! If you use Version.LUCENE_30 (or less), it wrongly
lowercases, so you get your old buggy behavior.

Even the old buggy Dutch stemmer is still there, and if you use
DutchAnalyzer(Version.LUCENE_30) (or less) it stems incorrectly so you
get your old buggy behavior!

The Russian analyzer was the same way, and the same goes for the QueryParser.

So I'm sorry, I am left confused about where the backwards breaks are?
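
In code, the mechanism described here is just the constructor argument
(a sketch using contrib's SnowballAnalyzer, and assuming the Turkish fix
is keyed to the 3.1 constant as described above):

    // Old, buggy Turkish lowercasing is preserved on purpose:
    Analyzer old = new SnowballAnalyzer(Version.LUCENE_30, "Turkish");

    // Asking for current behavior opts in to the fix (and to reindexing):
    Analyzer current = new SnowballAnalyzer(Version.LUCENE_31, "Turkish");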



Re: deprecating Versions

Posted by DM Smith <dm...@gmail.com>.
On 11/29/2010 09:40 AM, Robert Muir wrote:
> On Mon, Nov 29, 2010 at 9:05 AM, DM Smith<dm...@gmail.com>  wrote:
>> In my project, I don't use any of the Analyzers that Lucene provides, but I have variants of them. (Mine allow take flags indicating whether to filter stop words and whether to do stemming). The effort recently has been to change these analyzers to follow the new reuse pattern to improve performance.
>>
>> Having a declarative mechanism and I wouldn't have needed to make the changes.
> Right, this is I think what we want?
It's what I want. I think non-power users would like it as well.

The other thing I'd like is for the spec to be saved alongside the 
index as a manifest. From earlier threads, I can see that there might 
need to be one for writing and another for reading. I'm not interested 
in using it to construct an analyzer, but to determine whether the index 
is invalid with respect to the analyzer currently in use.

>   To just provide examples so the
> user can make what they need to suit their application.
>
>> WRT to an analyzer, if any of the following changes, all bets are off:
>>     Tokenizer (i.e. which tokenizer is used)
>>     The rules that a tokenizer uses to break into tokens. (E.g. query parser, break iterator, ...)
>>     The type associated with each token (e.g. word, number, url, .... )
>>     Presence/Absence of a particular filter
>>     Order of filters
>>     Tables that a filter uses
>>     Rules that a filter encodes
>>     The version and implementation of Unicode being used (whether via ICU, Lucene and/or Java)
>>     Bugs fixed in these components.
>> (This list is adapted from an email I wrote to a user's group explaining why texts need to be re-indexed.)
>>
> Right, i agree, and some of these things (such as JVM unicode version)
> are completely outside of our control.
> But for the things inside our control, where are the breaks that
> caused you any reindexing?
The JVM version is not entirely out of our control: 3.x requires a Java 5 
JVM. So going from 2.9.x to 3.1 (I can skip 3.0) requires a different 
Unicode version. I bet most desktop applications using Lucene 2.9.x are 
using Java 5 or Java 6, so upgrading to 3.1 won't be an issue for them. 
This issue really only affects Mac OS X.

But this is also a problem today outside of our control. A user of a 
desktop application under 2.x can have an index with Java 1.4.2 and then 
upgrade to Java 5 or 6. Unless the desktop application knew to look for 
this and "invalidate" the index, tough.

I'd have to look to be sure: IIRC, Turkish was one. The treatment of 'i' 
was buggy. Russian had its own encoding that was replaced with UTF-8. 
The QueryParser had bug fixes. There is some effort to migrate away from 
the old stemmers to Snowball, but at least the Dutch one is not "identical".

Maybe, I'm getting confused by lurking as to what is in which release 
and everything is just fine.

>> Additionally, it is the user's responsibility to normalize the text, probably to NFC or NFKC, before index and search. (It may need to precede the Tokenizer if it is not Unicode aware. E.g. what does a LetterTokenizer do if input is NFD and it encounters an accent?)
> I would not recommend this approach: NFC doesnt mean its going to take
> letter+accent combinations and compose them into a 'composed'
> character with the letter property... especially for non-latin
> scripts!
>
> In some cases, NFC will even cause the codepoint to be expanded: the
> NFC form of 0958 (QA) is 0915 + 093C (KA+NUKTA)... of course if you
> use LetterTokenizer with any language in this script, you are screwed
> anyway :)
>
> But even for latin scripts this won't work... not all combinations
> have a composed form and i think composed forms are in general not
> being added anymore.
I knew that NFC does not have a single codepoint for some glyphs.

I'm also seeing the trend you mention.

I'm always fighting my personal, parochial bias toward English;)

As an aside, my daughter is a linguist, who in summer 2009, worked on 
the development and completion of alphabets for 3 African languages. 
This was not an academic exercise but an effort to develop literacy 
among those people groups. Some of the letters in these languages are 
composed of multiple glyphs and some of the glyphs have decorations. 
It'd be interesting to see how these would be handled in Unicode (if 
they get added).

> For example, see the lithuanian sequences in
> http://www.unicode.org/Public/6.0.0/ucd/NamedSequences.txt:
>
> LATIN SMALL LETTER A WITH OGONEK AND TILDE;0105 0303
>
> You can normalize this all you want, but there is no single composed
> form, in NFC its gonna be 0105 0303.
>
> Instead, you should use a Tokenizer that respects canonical
> equivalence (tokenizes text that is canonically equivalent in the same
> way), such as UAX29Tokenizer/StandardTokenizer in branch_3x. Ideally
> your filters too, will respect this equivalence, and you can finally
> normalize a single time at the *end* of processing.
Should it be normalized at all before using these? NFKC?

> For example, don't
> use LowerCaseFilter + ASCIIFoldingFilter or something like that to
> lowercanse&  remove accents, but use ICUFoldingFilter instead, which
> handles all this stuff consistently, even if your text doesnt conform
> to any unicode normalization form...
Sigh. This is my point. The old contrib analyzers, which had no backward 
compatibility guarantee except on an individual contribution basis 
(though they were treated with care), were weak for non-English text. 
Most of my texts are non-Latinate, let alone non-English.

The result is that Lucene sort-of works for them. The biggest hurdle has 
been my lack of knowledge, but second to that, the input to index and to 
search doesn't treat canonical equivalences as equivalent. By 
normalizing to NFC before index and search, I have found the results to 
be far better.

I have learned a lot from lurking on this list about handling 
non-English/Latinate text with care. And as a result, with each release 
of Lucene, I want to work those fixes/improvements into my application.

My understanding is that indexes built with them will result in problems 
that might not be readily apparent. You have done great work in 
alternative tokenizers and filters.
>> Recently, we've seen that there is some mistrust here in JVMs at the same version level from different vendors (Sun, Harmony, IBM) in producing the same results. (IIRC: Thai break iterator. Random tests.)
> Right, Sun JDK 7 will be a new unicode version. Harmony uses a
> different unicode version than Sun. There's nothing we can do about
> this except document it?
I don't know if it would make sense to have a JVM/Unicode map, e.g. 
(vendor + JVM version => Unicode version), and to note that tuple when 
an index is created. Upon opening an index for reading or writing, the 
current value could be compared to the stored value. If they don't 
match, something could be done (warning? error?).

This could be optional, where the default is to do no check and to store 
nothing.

But documentation is always good.

> Whether or not a special customized break iterator for Thai locale
> exists, and how it works, is just a jvm "feature". There's nothing we
> can do about this except document it?

>> Within a release of Lucene, a small handful of analyzers may have changed sufficiently to warrant re-index of indexes built with them.
> which ones changed in a backwards-incompatible way that forced you to reindex?

Other than the change to the JVM, which really is out of my control. 
Maybe I'm not reading it correctly, but the QueryParser is changing in 
3.1. If one wants the old QueryParser, one has to have their own 
Analyzer implementations.


>> So basically, I have given up on Lucene being backward compatible where it matters the most to me: Stable analyzer components. The gain I get from this admission is far better. YMMV.
>>
> which ones changed in a backwards-incompatible way that forced you to reindex?
Basically the ones in contrib. Because of the lack of a strong bw-compat 
guarantee, I am only fairly confident that nothing changed. I know the 
tests have been improving, but when I started contributing small changes 
to them, I thought they were rudimentary. It didn't give me a lot of 
confidence that any contrib analyzer is stable. But ultimately, it's 
because I want the best analysis for each language's text that I use the 
improvements.

I wish I had more time to help.

If needed, I can do a review of the code to give an exact answer.

I could have and should have made it clearer.

-- DM



Re: deprecating Versions

Posted by Robert Muir <rc...@gmail.com>.
On Mon, Nov 29, 2010 at 9:05 AM, DM Smith <dm...@gmail.com> wrote:
>
> In my project, I don't use any of the Analyzers that Lucene provides, but I have variants of them. (Mine allow take flags indicating whether to filter stop words and whether to do stemming). The effort recently has been to change these analyzers to follow the new reuse pattern to improve performance.
>
> Having a declarative mechanism and I wouldn't have needed to make the changes.

Right, this is I think what we want? To just provide examples so the
user can make what they need to suit their application.

>
> WRT to an analyzer, if any of the following changes, all bets are off:
>    Tokenizer (i.e. which tokenizer is used)
>    The rules that a tokenizer uses to break into tokens. (E.g. query parser, break iterator, ...)
>    The type associated with each token (e.g. word, number, url, .... )
>    Presence/Absence of a particular filter
>    Order of filters
>    Tables that a filter uses
>    Rules that a filter encodes
>    The version and implementation of Unicode being used (whether via ICU, Lucene and/or Java)
>    Bugs fixed in these components.
> (This list is adapted from an email I wrote to a user's group explaining why texts need to be re-indexed.)
>

Right, I agree, and some of these things (such as the JVM's Unicode
version) are completely outside of our control.
But for the things inside our control, where are the breaks that
caused you any reindexing?

> Additionally, it is the user's responsibility to normalize the text, probably to NFC or NFKC, before index and search. (It may need to precede the Tokenizer if it is not Unicode aware. E.g. what does a LetterTokenizer do if input is NFD and it encounters an accent?)

I would not recommend this approach: NFC doesn't mean it's going to take
letter+accent combinations and compose them into a 'composed'
character with the letter property... especially for non-Latin
scripts!

In some cases, NFC will even cause the codepoint to be expanded: the
NFC form of 0958 (QA) is 0915 + 093C (KA+NUKTA)... of course, if you
use LetterTokenizer with any language in this script, you are screwed
anyway :)

But even for Latin scripts this won't work... not all combinations
have a composed form, and I think composed forms are in general not
being added anymore. For example, see the Lithuanian sequences in
http://www.unicode.org/Public/6.0.0/ucd/NamedSequences.txt:

LATIN SMALL LETTER A WITH OGONEK AND TILDE;0105 0303

You can normalize this all you want, but there is no single composed
form; in NFC it's gonna be 0105 0303.
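
(A quick way to see this, using java.text.Normalizer from Java 6 purely
for illustration:)

    // a + combining ogonek + combining tilde
    String s = "a\u0328\u0303";
    String nfc = Normalizer.normalize(s, Normalizer.Form.NFC);
    // nfc is "\u0105\u0303": the ogonek composes into U+0105, but there is
    // no precomposed "a with ogonek and tilde", so the tilde stays combining.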

Instead, you should use a Tokenizer that respects canonical
equivalence (tokenizes text that is canonically equivalent in the same
way), such as UAX29Tokenizer/StandardTokenizer in branch_3x. Ideally
your filters, too, will respect this equivalence, and you can finally
normalize a single time at the *end* of processing. For example, don't
use LowerCaseFilter + ASCIIFoldingFilter or something like that to
lowercase & remove accents, but use ICUFoldingFilter instead, which
handles all this stuff consistently, even if your text doesn't conform
to any Unicode normalization form...

>
> Recently, we've seen that there is some mistrust here in JVMs at the same version level from different vendors (Sun, Harmony, IBM) in producing the same results. (IIRC: Thai break iterator. Random tests.)

Right, Sun JDK 7 will be a new unicode version. Harmony uses a
different unicode version than Sun. There's nothing we can do about
this except document it?
Whether or not a special customized break iterator for Thai locale
exists, and how it works, is just a jvm "feature". There's nothing we
can do about this except document it?

> Within a release of Lucene, a small handful of analyzers may have changed sufficiently to warrant re-index of indexes built with them.

which ones changed in a backwards-incompatible way that forced you to reindex?

> So basically, I have given up on Lucene being backward compatible where it matters the most to me: Stable analyzer components. The gain I get from this admission is far better. YMMV.
>

which ones changed in a backwards-incompatible way that forced you to reindex?



Re: deprecating Versions

Posted by DM Smith <dm...@gmail.com>.
On Nov 29, 2010, at 5:34 AM, Robert Muir wrote:

> On Mon, Nov 29, 2010 at 2:50 AM, Earwin Burrfoot <ea...@gmail.com> wrote:
>> And for indexes:
>> * Index compatibility is guaranteed across two adjacent major
>> releases. eg 2.x -> 3.x, 3.x -> 4.x.
>>  That includes both binary compat - codecs, and semantic compat -
>> analyzers (if appropriate Version is used).
>> * Older releases are most probably unsupported.
>>  e.g. 4.x still supports shared docstores for reading, though never
>> writes them. 5.x won't read them either, so you'll have to at least
>> fully optimize your 3.x indexes when going through 4.x to 5.x.
>> 
> 
> Is it somehow possible i could convince everyone that all the
> analyzers we provide are simply examples?

It really doesn't solve the problem. Analyzers are not much more than a tokenizer and zero or more filters chained in an ordered manner. Right now, the "more" is the special code regarding reuse.

In my project, I don't use any of the Analyzers that Lucene provides, but I have variants of them. (Mine take flags indicating whether to filter stop words and whether to do stemming). The effort recently has been to change these analyzers to follow the new reuse pattern to improve performance.

With a declarative mechanism, I wouldn't have needed to make those changes.
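
Schematically, the kind of thing I mean looks something like this (a toy
sketch against a 3.1-style API, not my real code; the class name and flag
names are made up, and the reuse plumbing is left out for brevity):

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.PorterStemFilter;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

// A tokenizer plus filters chained in order, with two flags controlling
// which filters get added. (reusableTokenStream support omitted.)
public final class FlaggedAnalyzer extends Analyzer {
  private final boolean filterStopWords;
  private final boolean stem;

  public FlaggedAnalyzer(boolean filterStopWords, boolean stem) {
    this.filterStopWords = filterStopWords;
    this.stem = stem;
  }

  @Override
  public TokenStream tokenStream(String fieldName, Reader reader) {
    TokenStream result = new StandardTokenizer(Version.LUCENE_31, reader);
    if (filterStopWords) {
      result = new StopFilter(Version.LUCENE_31, result, StopAnalyzer.ENGLISH_STOP_WORDS_SET);
    }
    if (stem) {
      result = new PorterStemFilter(result);
    }
    return result;
  }
}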

WRT an analyzer, if any of the following changes, all bets are off:
    Tokenizer (i.e. which tokenizer is used)
    The rules that a tokenizer uses to break into tokens. (E.g. query parser, break iterator, ...)
    The type associated with each token (e.g. word, number, url, .... )
    Presence/Absence of a particular filter
    Order of filters
    Tables that a filter uses
    Rules that a filter encodes
    The version and implementation of Unicode being used (whether via ICU, Lucene and/or Java)
    Bugs fixed in these components.
(This list is adapted from an email I wrote to a user's group explaining why texts need to be re-indexed.)

Additionally, it is the user's responsibility to normalize the text, probably to NFC or NFKC, before indexing and searching. (Normalization may need to precede the Tokenizer if the Tokenizer is not Unicode aware. E.g. what does a LetterTokenizer do if the input is NFD and it encounters an accent?)
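
To make the LetterTokenizer question concrete (a toy test against the
3.x-era API; the class name is made up):

import java.io.StringReader;

import org.apache.lucene.analysis.LetterTokenizer;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

public class LetterTokenizerNfd {
  public static void main(String[] args) throws Exception {
    // "café" in NFD is 'c' 'a' 'f' 'e' followed by U+0301 COMBINING ACUTE ACCENT.
    // The combining mark is not a letter, so LetterTokenizer treats it as a
    // token boundary and the accent silently disappears from the token.
    LetterTokenizer tokenizer = new LetterTokenizer(new StringReader("cafe\u0301"));
    TermAttribute term = tokenizer.addAttribute(TermAttribute.class);
    while (tokenizer.incrementToken()) {
      System.out.println(term.term()); // prints "cafe", not "café"
    }
    tokenizer.close();
  }
}

NFC input would have produced "café" as a single token, so the same text
indexed one way and searched the other will not match.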

Recently, we've seen that there is some mistrust here that JVMs at the same version level from different vendors (Sun, Harmony, IBM) will produce the same results. (IIRC: the Thai break iterator; random tests.)

For the most part, searching the index will seem to be fine. It may only be edge cases that cause problems.

Adding documents to an index with a changed Analyzer might not be a good thing. It might result in the question "Why does my search find this Document, but not that Document? Both should be returned."

Within a release of Lucene, a small handful of analyzers may have changed sufficiently to warrant re-indexing of indexes built with them.

For me the bigger problem is that the parts of an analyzer are not separately versioned. It is not simply a matter of using a lucene-analyzers-XX.YY.jar. That is too coarse grained. Each release has new goodness regarding analysis of non-English texts and performance regarding all texts. If I want any or all of that, I have three choices:
a) Upgrade and rebuild every index. Since the desktop application does not know if a change requires a rebuild, everything must be rebuilt.
or
b) Fork all the components I use. (To me this is just wrong, but perhaps necessary/expedient.)
or
c) Version the names of the packages and/or classes. (I don't like this idea either, but it works.)

Given that the releases of Lucene and my application are infrequent (so much for the release-often mantra), forcing a rebuild is not such a horrible thing for me.

So basically, I have given up on Lucene being backward compatible where it matters the most to me: Stable analyzer components. The gain I get from this admission is far better. YMMV.

Hope this helps,
	DM


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Re: deprecating Versions

Posted by Earwin Burrfoot <ea...@gmail.com>.
On Mon, Nov 29, 2010 at 15:28, Robert Muir <rc...@gmail.com> wrote:
> On Mon, Nov 29, 2010 at 7:21 AM, Earwin Burrfoot <ea...@gmail.com> wrote:
>> There's no reason, no advantage towards using .xml files for
>> configuration, when said configuration can easily be expressed
>> programmatically. It just causes problems :)
>>
>
> but the former is java code, so subject to backwards compatibility
> policy, right? :)

Could we make a special exception for these /configuration/ .java files?
Or can we name them .cava (leaving source code intact)?
Or maybe we try linking against common sense with our next release?

-- 
Kirill Zakharenko/Кирилл Захаренко (earwin@gmail.com)
Phone: +7 (495) 683-567-4
ICQ: 104465785

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Re: deprecating Versions

Posted by Robert Muir <rc...@gmail.com>.
On Mon, Nov 29, 2010 at 7:21 AM, Earwin Burrfoot <ea...@gmail.com> wrote:
> There's no reason, no advantage towards using .xml files for
> configuration, when said configuration can easily be expressed
> programmatically. It just causes problems :)
>

but the former is Java code, so subject to the backwards compatibility
policy, right? :)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Re: deprecating Versions

Posted by Earwin Burrfoot <ea...@gmail.com>.
> I'm talking about the analyzers we provide in lucene itself.
> there is no reason, no advantage towards these being .java code: it
> just causes problems.

I see little difference between

public class StockAnalyzers {
  public static final Analyzer STANDARD_30 = new AnalyzerBuilder().
    add(new WhitespaceTokenizer(LUCENE_30)).
    build();
}

and

<stock-analyzers>
  <analyzer name = "Standard_30">
    <tokenizer class = "lucene.WhitespaceTokenizer">
      <param name = "version" value = "3.0" />
    </tokenizer>
  </analyzer>
</stock-analyzers>

except the latter is more verbose and the former is more easily
understandable, copypaste-and-tweakable, and refactorable.

There's no reason, no advantage to using .xml files for
configuration, when said configuration can easily be expressed
programmatically. It just causes problems :)

-- 
Kirill Zakharenko/Кирилл Захаренко (earwin@gmail.com)
Phone: +7 (495) 683-567-4
ICQ: 104465785

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Re: deprecating Versions

Posted by Robert Muir <rc...@gmail.com>.
On Mon, Nov 29, 2010 at 6:45 AM, Earwin Burrfoot <ea...@gmail.com> wrote:
> I agree current Analyzers are a heap of bad copypaste.
> But I'd rather have an ability to compose a number of CharFilters,
> Tokenizers and TokenFilters programmatically (without writing a new
> Analyzer), instead of using config-files.

Right, we shouldn't take this away. If you want to make your own
Analyzer this way, you should still be able to.

I'm talking about the analyzers we provide in Lucene itself.
There is no reason, no advantage to these being .java code: it
just causes problems.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Re: deprecating Versions

Posted by Earwin Burrfoot <ea...@gmail.com>.
I agree current Analyzers are a heap of bad copy-paste.
But I'd rather have the ability to compose a number of CharFilters,
Tokenizers and TokenFilters programmatically (without writing a new
Analyzer), instead of using config files.

Something that roughly looks like:
Analyzer a = new AnalyzerBuilder().
  filterStreamWith(charFilterA, charFilterB).
  tokenizeWith(new MyFluffyTokenizer()).
  filter(new StopWordsFilter(..)).
  filter(whatever).
  build();

Configgy stuff can then appear as a layer over such an API.

Building Analyzers programmatically has a number of benefits:
1. Easier tests. Everything being tested is in your test method, not
smeared across a bunch of config files (wink@Solr).
2. You can play around in a REPL.
3. You might have slightly different variations of the same Analyzer.
And you don't have to write a bunch of almost-identical config files
for that.
  - i.e. in my code I have Index-mode analyzer, Index-mode
analyzer+html handling, Search-mode analyzer, that differ only in
parameters to a couple of filters.
4. Type safety, anyone?

On Mon, Nov 29, 2010 at 13:59, Uwe Schindler <uw...@thetaphi.de> wrote:
> I think with declarative model, he means more something like a "generic" Analyzer class, where you pass in a config file that lists all CharFilters, Tokenizers, TokenFilters. You can put this xml file or whatever into a jar file and then you have the same like hardcoded analyzers. We have simply stupid code duplication. And using these config files you can even supply variants for backwards compatibility.
>
> For this to implement, the factories from solr need to be moved to Lucene. Which would be a good thing, as e.g. Hibernate Search only references Solr jars to have a declarative (annotation-based) analyzer configuration. And for that the factories are needed.
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: uwe@thetaphi.de
>
>
>> -----Original Message-----
>> From: Earwin Burrfoot [mailto:earwin@gmail.com]
>> Sent: Monday, November 29, 2010 11:53 AM
>> To: dev@lucene.apache.org
>> Subject: Re: deprecating Versions
>>
>> On Mon, Nov 29, 2010 at 13:34, Robert Muir <rc...@gmail.com> wrote:
>> > On Mon, Nov 29, 2010 at 2:50 AM, Earwin Burrfoot <ea...@gmail.com>
>> wrote:
>> >> And for indexes:
>> >> * Index compatibility is guaranteed across two adjacent major
>> >> releases. eg 2.x -> 3.x, 3.x -> 4.x.
>> >>  That includes both binary compat - codecs, and semantic compat -
>> >> analyzers (if appropriate Version is used).
>> >> * Older releases are most probably unsupported.
>> >>  e.g. 4.x still supports shared docstores for reading, though never
>> >> writes them. 5.x won't read them either, so you'll have to at least
>> >> fully optimize your 3.x indexes when going through 4.x to 5.x.
>> >>
>> >
>> > Is it somehow possible i could convince everyone that all the
>> > analyzers we provide are simply examples?
>> > This way we could really make this a bit more reasonable and clean up
>> > a lot of stuff.
>> At the very least, you don't have to convince me. :)
>>
>> > Seems like we really want to move towards a more declarative model
>> > where these are just config files... so only then it will ok for us to
>> > change them because they suddenly aren't suffixed with .java?!
>> No freakin' declarative models! That's the domain of Solr.
>> Though others might disagree and then happily store these declarations
>> within index, and then per-segment, making the mess even more messy for
>> the glory of backasswards compatibility.
>>
>>
>> --
>> Kirill Zakharenko/Кирилл Захаренко (earwin@gmail.com)
>> Phone: +7 (495) 683-567-4
>> ICQ: 104465785
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org For additional
>> commands, e-mail: dev-help@lucene.apache.org
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>



-- 
Kirill Zakharenko/Кирилл Захаренко (earwin@gmail.com)
Phone: +7 (495) 683-567-4
ICQ: 104465785

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


RE: deprecating Versions

Posted by Uwe Schindler <uw...@thetaphi.de>.
I think by "declarative model" he means more something like a "generic" Analyzer class, where you pass in a config file that lists all CharFilters, Tokenizers, and TokenFilters. You can put this XML file or whatever into a jar file and then you have the same thing as hardcoded analyzers. We simply have stupid code duplication. And using these config files you can even supply variants for backwards compatibility.

For this to be implemented, the factories from Solr need to be moved to Lucene. That would be a good thing, as e.g. Hibernate Search only references Solr jars to have a declarative (annotation-based) analyzer configuration. And for that the factories are needed.
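
Very roughly, such a generic analyzer is just the wiring below (a sketch
only; essentially what Solr's TokenizerChain already does, minus
CharFilters and reuse, and the class name here is made up). The factories
and their parameters would come from the config file:

import java.io.Reader;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.solr.analysis.TokenFilterFactory;
import org.apache.solr.analysis.TokenizerFactory;

public final class DeclarativeAnalyzer extends Analyzer {
  private final TokenizerFactory tokenizerFactory;
  private final List<TokenFilterFactory> filterFactories;

  // Factories are assumed to be already instantiated and init()'d from config.
  public DeclarativeAnalyzer(TokenizerFactory tokenizerFactory,
                             List<TokenFilterFactory> filterFactories) {
    this.tokenizerFactory = tokenizerFactory;
    this.filterFactories = filterFactories;
  }

  @Override
  public TokenStream tokenStream(String fieldName, Reader reader) {
    // Chain: tokenizer first, then each filter in the configured order.
    TokenStream stream = tokenizerFactory.create(reader);
    for (TokenFilterFactory factory : filterFactories) {
      stream = factory.create(stream);
    }
    return stream;
  }
}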

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de


> -----Original Message-----
> From: Earwin Burrfoot [mailto:earwin@gmail.com]
> Sent: Monday, November 29, 2010 11:53 AM
> To: dev@lucene.apache.org
> Subject: Re: deprecating Versions
> 
> On Mon, Nov 29, 2010 at 13:34, Robert Muir <rc...@gmail.com> wrote:
> > On Mon, Nov 29, 2010 at 2:50 AM, Earwin Burrfoot <ea...@gmail.com>
> wrote:
> >> And for indexes:
> >> * Index compatibility is guaranteed across two adjacent major
> >> releases. eg 2.x -> 3.x, 3.x -> 4.x.
> >>  That includes both binary compat - codecs, and semantic compat -
> >> analyzers (if appropriate Version is used).
> >> * Older releases are most probably unsupported.
> >>  e.g. 4.x still supports shared docstores for reading, though never
> >> writes them. 5.x won't read them either, so you'll have to at least
> >> fully optimize your 3.x indexes when going through 4.x to 5.x.
> >>
> >
> > Is it somehow possible i could convince everyone that all the
> > analyzers we provide are simply examples?
> > This way we could really make this a bit more reasonable and clean up
> > a lot of stuff.
> At the very least, you don't have to convince me. :)
> 
> > Seems like we really want to move towards a more declarative model
> > where these are just config files... so only then it will ok for us to
> > change them because they suddenly aren't suffixed with .java?!
> No freakin' declarative models! That's the domain of Solr.
> Though others might disagree and then happily store these declarations
> within index, and then per-segment, making the mess even more messy for
> the glory of backasswards compatibility.
> 
> 
> --
> Kirill Zakharenko/Кирилл Захаренко (earwin@gmail.com)
> Phone: +7 (495) 683-567-4
> ICQ: 104465785
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org For additional
> commands, e-mail: dev-help@lucene.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Re: deprecating Versions

Posted by Robert Muir <rc...@gmail.com>.
On Mon, Nov 29, 2010 at 5:53 AM, Earwin Burrfoot <ea...@gmail.com> wrote:
>> Seems like we really want to move towards a more declarative model
>> where these are just config files... so only then it will ok for us to
>> change them because they suddenly aren't suffixed with .java?!
> No freakin' declarative models! That's the domain of Solr.
> Though others might disagree and then happily store these declarations
> within index, and then per-segment, making the mess even more messy
> for the glory of backasswards compatibility.
>

Heh, I don't think we should store them within the index though!
Just a way to make an analyzer from a list of TokenStreams without
writing code...

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Re: deprecating Versions

Posted by Earwin Burrfoot <ea...@gmail.com>.
On Mon, Nov 29, 2010 at 13:34, Robert Muir <rc...@gmail.com> wrote:
> On Mon, Nov 29, 2010 at 2:50 AM, Earwin Burrfoot <ea...@gmail.com> wrote:
>> And for indexes:
>> * Index compatibility is guaranteed across two adjacent major
>> releases. eg 2.x -> 3.x, 3.x -> 4.x.
>>  That includes both binary compat - codecs, and semantic compat -
>> analyzers (if appropriate Version is used).
>> * Older releases are most probably unsupported.
>>  e.g. 4.x still supports shared docstores for reading, though never
>> writes them. 5.x won't read them either, so you'll have to at least
>> fully optimize your 3.x indexes when going through 4.x to 5.x.
>>
>
> Is it somehow possible i could convince everyone that all the
> analyzers we provide are simply examples?
> This way we could really make this a bit more reasonable and clean up
> a lot of stuff.
At the very least, you don't have to convince me. :)

> Seems like we really want to move towards a more declarative model
> where these are just config files... so only then it will ok for us to
> change them because they suddenly aren't suffixed with .java?!
No freakin' declarative models! That's the domain of Solr.
Though others might disagree and then happily store these declarations
within the index, and then per-segment, making the mess even messier
for the glory of backasswards compatibility.


-- 
Kirill Zakharenko/Кирилл Захаренко (earwin@gmail.com)
Phone: +7 (495) 683-567-4
ICQ: 104465785

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Re: deprecating Versions

Posted by Robert Muir <rc...@gmail.com>.
On Mon, Nov 29, 2010 at 2:50 AM, Earwin Burrfoot <ea...@gmail.com> wrote:
> And for indexes:
> * Index compatibility is guaranteed across two adjacent major
> releases. eg 2.x -> 3.x, 3.x -> 4.x.
>  That includes both binary compat - codecs, and semantic compat -
> analyzers (if appropriate Version is used).
> * Older releases are most probably unsupported.
>  e.g. 4.x still supports shared docstores for reading, though never
> writes them. 5.x won't read them either, so you'll have to at least
> fully optimize your 3.x indexes when going through 4.x to 5.x.
>

Is it somehow possible I could convince everyone that all the
analyzers we provide are simply examples?
This way we could really make this a bit more reasonable and clean up
a lot of stuff.

Seems like we really want to move towards a more declarative model
where these are just config files... so only then will it be OK for us
to change them, because they suddenly aren't suffixed with .java?!

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Re: deprecating Versions

Posted by Earwin Burrfoot <ea...@gmail.com>.
bq. not true. at some point you have to re-index, this isn't database
software. i suggest between major versions!

API/index compatibility was discussed some time ago, and there was a
kind of consensus, I believe.
For APIs:
* Compatibility is guaranteed across minor releases, e.g. within the
2.x, 3.x, 4.x branches.
* We reserve the right to break it for major releases, e.g. 4.x is a
break from 3.x. Though we still try to forewarn people with
deprecations in the previous branch, if possible.

And for indexes:
* Index compatibility is guaranteed across two adjacent major
releases, e.g. 2.x -> 3.x, 3.x -> 4.x.
 That includes both binary compat (codecs) and semantic compat
(analyzers, if the appropriate Version is used).
* Older releases are most probably unsupported.
 e.g. 4.x still supports shared docstores for reading, though it never
writes them. 5.x won't read them either, so you'll have to at least
fully optimize your 3.x indexes when going through 4.x to 5.x (see the
sketch below).
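
By "fully optimize" I mean something like this (just a sketch against the
trunk IndexWriterConfig API; the class and method names here are made up):

import java.io.File;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class UpgradeByOptimize {
  // Rewrites every segment of an existing index in the running release's
  // format, so the *next* major release only has to read the newer format.
  public static void fullyOptimize(File indexDir, Analyzer analyzer) throws Exception {
    Directory dir = FSDirectory.open(indexDir);
    IndexWriter writer = new IndexWriter(dir,
        new IndexWriterConfig(Version.LUCENE_40, analyzer));
    writer.optimize(); // merge everything down to a single, freshly written segment
    writer.close();
    dir.close();
  }
}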



On Mon, Nov 29, 2010 at 05:35, Robert Muir <rc...@gmail.com> wrote:
> On Sat, Nov 27, 2010 at 3:44 PM, Earwin Burrfoot <ea...@gmail.com> wrote:
>> I think we should deprecate and remove Version constants as Lucene progresses?
>
> well one idea was that we would release analyzers with their own
> version numbers... e.g. instead of Version 4.0 you use
> analyzers-4.0.jar.
> This way you could upgrade lucene-core.jar to say, 4.5 or whatever,
> and still have your exact same index compatibility (same bytecode!)
>
>> Going with this, we should deprecate 3x in trunk and delete 2x. In 3x
>> branch, we should deprecate 2x.
>
> the problem as i understand it, is that people never want to reindex.
> e.g. they want to upgrade to 2.x, upgrade to 3.x, and then upgrade to
> 4.x and never re-analyze text.
> with the rest of the index (lists of integers), things like this can
> be converted losslessly, but analyzers do a lossy conversion...
> so it seems some people think we have this 'perpetual' backwards
> compatibility at the moment... not true. at some point you have to
> re-index, this isn't database software. i suggest between major
> versions!
>
> bottom line: i agree with you that we really need to clean house in
> trunk, except to say that Version constants should be removed
> completely too and replaced with 'real versions' if possible.
> the other major user of Version is QueryParser, perhaps if it gets
> yanked out of lucene core, we would do the same.
>
> we just have to figure out how the module releasing will work: should we:
> 1. do nothing yet: analyzers-4.0.jar have all the cruft, and then we
> can finally remove Version in analyzers-4.1?
> 2. backport the whole analyzers module to 3.x, keep the cruft there,
> remove Version in trunk now?
> 3. just say screw it and clean house in trunk, like we did for the
> rest of the code?
> 4. <other ideas>
>
> and how long should an analyzers-jar file be "valid" for anyway?
> eventually, no matter how clean the separation is with trunk, software
> interfaces are going to have to break.
> the same situation will happen if we try to modularize lucene in other
> ways so we need to figure these things out. again i suggest for the
> long term we should look at breaking in major versions,
> its what people are used to with other software.
>
> but for now, I think we at least need to deprecate Version_3.x in
> trunk, and Version_2.x in branch_3x, and remove Version_2.x in trunk
> completely as you suggest.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>



-- 
Kirill Zakharenko/Кирилл Захаренко (earwin@gmail.com)
Phone: +7 (495) 683-567-4
ICQ: 104465785

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Re: deprecating Versions

Posted by Robert Muir <rc...@gmail.com>.
On Sat, Nov 27, 2010 at 3:44 PM, Earwin Burrfoot <ea...@gmail.com> wrote:
> I think we should deprecate and remove Version constants as Lucene progresses?

Well, one idea was that we would release analyzers with their own
version numbers... e.g. instead of Version 4.0 you use
analyzers-4.0.jar.
This way you could upgrade lucene-core.jar to, say, 4.5 or whatever,
and still have your exact same index compatibility (same bytecode!)

> Going with this, we should deprecate 3x in trunk and delete 2x. In 3x
> branch, we should deprecate 2x.

The problem, as I understand it, is that people never want to reindex.
E.g. they want to upgrade to 2.x, upgrade to 3.x, and then upgrade to
4.x and never re-analyze text.
With the rest of the index (lists of integers), things like this can
be converted losslessly, but analyzers do a lossy conversion...
So it seems some people think we have this 'perpetual' backwards
compatibility at the moment... not true. At some point you have to
re-index; this isn't database software. I suggest between major
versions!

Bottom line: I agree with you that we really need to clean house in
trunk, except to say that Version constants should be removed
completely too and replaced with 'real versions' if possible.
The other major user of Version is QueryParser; perhaps if it gets
yanked out of Lucene core, we would do the same.

We just have to figure out how the module releasing will work. Should we:
1. Do nothing yet: analyzers-4.0.jar has all the cruft, and then we
can finally remove Version in analyzers-4.1?
2. Backport the whole analyzers module to 3.x, keep the cruft there,
and remove Version in trunk now?
3. Just say screw it and clean house in trunk, like we did for the
rest of the code?
4. <other ideas>

And how long should an analyzers-jar file be "valid" for, anyway?
Eventually, no matter how clean the separation is with trunk, software
interfaces are going to have to break.
The same situation will happen if we try to modularize Lucene in other
ways, so we need to figure these things out. Again, I suggest that for
the long term we should look at breaking in major versions;
it's what people are used to with other software.

But for now, I think we at least need to deprecate Version_3.x in
trunk, and Version_2.x in branch_3x, and remove Version_2.x in trunk
completely, as you suggest.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org