Posted to solr-dev@lucene.apache.org by "Uwe Schindler (JIRA)" <ji...@apache.org> on 2010/01/05 23:29:54 UTC

[jira] Issue Comment Edited: (SOLR-1677) Add support for o.a.lucene.util.Version for BaseTokenizerFactory and BaseTokenFilterFactory

    [ https://issues.apache.org/jira/browse/SOLR-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12796872#action_12796872 ] 

Uwe Schindler edited comment on SOLR-1677 at 1/5/10 10:29 PM:
--------------------------------------------------------------

In my opinion, it should be possible to set a default version in solrconfig.xml, because there is currently no requirement to set a version on every TokenStream component. In the shipped solrconfig.xml, this default would be the version of the shipped Lucene release, so new users can start from the default config and extend it the way all the courses and books about Solr teach. They do not need to care about the version.

If they upgrade their Lucene version, their config stays pinned to the previous setting and they are fine. If they want to change the version for some of the components (like query parser, index writer, index reader -- flex!!!), they can do it locally. So Bob could change it the way Ernest proposed.

If we do not have a default, all users will stay stuck on Lucene 2.4 behaviour, because they do not care about the version (it is not required, because it defaults to 2.4 for BW compatibility). So lots of configs will never use the new Unicode features of Lucene 3.1. Then suddenly Lucene 4.0 comes out, all support for Lucene < 3 is removed, and all users cry. With a default version of 2.4, they will then get a runtime error in Lucene 4.0, saying that Version.LUCENE_24 is no longer available as an enum constant.
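
To make that failure mode concrete, here is a minimal sketch. FutureVersion is a hypothetical stand-in for a later o.a.lucene.util.Version from which LUCENE_24 has been dropped, and StaleDefaultDemo is an illustrative name, not part of any patch. It assumes the version is read from a config string and decoded with valueOf.

```java
// Hypothetical stand-in for a future o.a.lucene.util.Version in which
// the LUCENE_24 constant has been removed (assumption for illustration).
enum FutureVersion { LUCENE_30, LUCENE_31, LUCENE_40 }

class StaleDefaultDemo {
    /** Resolves a configured version name; returns the error text when the constant is gone. */
    static String resolve(String configured) {
        try {
            return "ok: " + FutureVersion.valueOf(configured);
        } catch (IllegalArgumentException e) {
            // Enum.valueOf rejects names that are no longer declared.
            return "error: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(resolve("LUCENE_31")); // still declared, resolves fine
        System.out.println(resolve("LUCENE_24")); // removed constant, reports an error
    }
}
```

A config that silently relied on the 2.4 default would hit the second case only after the upgrade, which is the surprise described above.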

If you really do not want a default version in the config (config, not schema, because it applies to *all* Lucene components), then you should go the same way as Lucene 3.0: require a matchVersion for all components. As there may be TokenStream components that are not from Lucene, make this attribute in the schema mandatory only for Lucene streams. This can be done with my initial patch, too: if the matchVersion property is missing, matchVersion is set to null and the factory should throw an IAE (IllegalArgumentException) if it requires one. In my original patch, only the parsing code would need to move out of the factory into a util class in Solr. It might also be possible to parse "x.y"-style versions.
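
A rough sketch of what such a util class could look like, under stated assumptions: MatchVersion is a hypothetical stand-in for o.a.lucene.util.Version so the code compiles without Lucene, and MatchVersionParser, parse and require are illustrative names that do not come from the attached patch.

```java
import java.util.Locale;

// Hypothetical stand-in for o.a.lucene.util.Version (assumption for illustration).
enum MatchVersion { LUCENE_24, LUCENE_29, LUCENE_30, LUCENE_31 }

class MatchVersionParser {
    /** Parses "LUCENE_29" or "2.9"; returns null when the attribute is absent. */
    static MatchVersion parse(String raw) {
        if (raw == null || raw.trim().isEmpty()) {
            return null; // missing attribute: let the factory decide whether that is an error
        }
        String name = raw.trim().toUpperCase(Locale.ROOT);
        // Accept "x.y" by rewriting it to the constant name, e.g. "2.9" -> "LUCENE_29".
        if (name.matches("\\d+\\.\\d+")) {
            name = "LUCENE_" + name.replace(".", "");
        }
        try {
            return MatchVersion.valueOf(name);
        } catch (IllegalArgumentException e) {
            throw new IllegalArgumentException("Unknown matchVersion '" + raw + "'", e);
        }
    }

    /** For factories wrapping Lucene streams: a missing version is an error. */
    static MatchVersion require(String raw, String factoryName) {
        MatchVersion version = parse(raw);
        if (version == null) {
            throw new IllegalArgumentException(factoryName + " requires a matchVersion attribute");
        }
        return version;
    }
}
```

Factories for non-Lucene streams would call parse and tolerate null, while factories for Lucene streams would call require, which keeps the mandatory check out of the shared parsing code.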

The problem here: users upgrading from Solr 1.4 will suddenly get errors, because their configs become invalid. Ahh, and because they are stupid they add LUCENE_29 (how should they know that Solr 1.4 used Lucene 2.4 compatibility?). And then the mailing list gets flooded with questions because the configs suddenly fail to produce results with old indexes.

      was (Author: thetaphi):
    In my opinion, it should be possible to set a default version in solrconfig.xml, because there is currently no requirement to set a version on every TokenStream component. In the shipped solrconfig.xml, this default would be the version of the shipped Lucene release, so new users can start from the default config and extend it the way all the courses and books about Solr teach. They do not need to care about the version.

If they upgrade their Lucene version, their config stays pinned to the previous setting and they are fine. If they want to change the version for some of the components (like query parser, index writer, index reader -- flex!!!), they can do it locally. So Bob could change it the way Ernest proposed.

If we do not have a default, all users will stay stuck on Lucene 2.4 behaviour, because they do not care about the version (it is not required, because it defaults to 2.4 for BW compatibility). So lots of configs will never use the new Unicode features of Lucene 3.1. Then suddenly Lucene 4.0 comes out, all support for Lucene < 3 is removed, and all users cry. With a default version of 2.4, they will then get a runtime error in Lucene 4.0, saying that Version.LUCENE_24 is no longer available as an enum constant.

If you really do not want a default version in the config (config, not schema, because it applies to *all* Lucene components), then you should go the same way as Lucene 3.0: require a matchVersion for all components. As there may be TokenStream components that are not from Lucene, make this attribute in the schema mandatory only for Lucene streams. This can be done with my initial patch, too: if the matchVersion property is missing, matchVersion is set to null and the factory should throw an IAE (IllegalArgumentException) if it requires one. In my original patch, only the parsing code would need to move out of the factory into a util class in Solr. It might also be possible to parse "x.y"-style versions.

The problem here: users upgrading from Solr 1.4 will suddenly get errors, because their configs become invalid.
  
> Add support for o.a.lucene.util.Version for BaseTokenizerFactory and BaseTokenFilterFactory
> -------------------------------------------------------------------------------------------
>
>                 Key: SOLR-1677
>                 URL: https://issues.apache.org/jira/browse/SOLR-1677
>             Project: Solr
>          Issue Type: Sub-task
>          Components: Schema and Analysis
>            Reporter: Uwe Schindler
>         Attachments: SOLR-1677.patch, SOLR-1677.patch, SOLR-1677.patch, SOLR-1677.patch
>
>
> Since Lucene 2.9, a lot of analyzers use a Version constant to keep backwards compatibility with old indexes created using older versions of Lucene. The most important example is StandardTokenizer, which changed its behaviour with posIncr and incorrect host token types in 2.4 and also in 2.9.
> In Lucene 3.0 this matchVersion ctor parameter is mandatory and in 3.1, with much more Unicode support, almost every Tokenizer/TokenFilter needs this Version parameter. In 2.9, the deprecated old ctors without Version take LUCENE_24 as default to mimic the old behaviour, e.g. in StandardTokenizer.
> This patch adds basic support for the Lucene Version property to the base factories. Subclasses can then use the decoded luceneMatchVersion enum (in 3.0) / parameter (in 2.9) for constructing TokenStreams. The code currently contains a helper map to decode the version strings, but in 3.0 it can be replaced by Version.valueOf(String), as Version is a Java 5 enum there. The default value is Version.LUCENE_24 (as this is the default for the no-version ctors in Lucene).
> This patch also removes unneeded conversions to CharArraySet from StopFilterFactory (now done by Lucene since 2.9). The generics are also fixed to match Lucene 3.0.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.