
Posted to dev@lucene.apache.org by "David Smiley (JIRA)" <ji...@apache.org> on 2011/08/09 00:18:27 UTC

[jira] [Created] (LUCENE-3366) StandardFilter only works with ClassicTokenizer and only when version < 3.1

StandardFilter only works with ClassicTokenizer and only when version < 3.1
---------------------------------------------------------------------------

                 Key: LUCENE-3366
                 URL: https://issues.apache.org/jira/browse/LUCENE-3366
             Project: Lucene - Java
          Issue Type: Improvement
          Components: modules/analysis
    Affects Versions: 3.3
            Reporter: David Smiley


The StandardFilter used to remove periods from acronyms and trailing apostrophe-s ('s) from possessives where they occurred. And it used to work in conjunction with the StandardTokenizer. Presently, it only does this with ClassicTokenizer, and only when the Lucene match version is before 3.1. Here is an excerpt from the code:
{code:lang=java}
  public final boolean incrementToken() throws IOException {
    if (matchVersion.onOrAfter(Version.LUCENE_31))
      return input.incrementToken(); // TODO: add some niceties for the new grammar
    else
      return incrementTokenClassic();
  }
{code}

It seems to me that in the great refactor of the standard tokenizer, LUCENE-2167, something was forgotten here. I think that if someone uses the ClassicTokenizer then no matter what the version is, this filter should do what it used to do. And the TODO suggests someone forgot to make this filter do something useful for the StandardTokenizer.  Or perhaps that idea should be discarded and this class should be named ClassicTokenFilter.

In any event, the javadocs for this class appear out of date as there is no mention of ClassicTokenizer, and the wiki is out of date too.
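
For illustration, the old normalization described above can be sketched in plain Java with no Lucene dependency. This is only a minimal sketch, not Lucene's implementation: the class name ClassicNormalizationSketch and the method normalize are hypothetical, and the type strings "<ACRONYM>" and "<APOSTROPHE>" are assumed to match the token types that ClassicTokenizer assigns.

```java
/**
 * Hypothetical, Lucene-free sketch of the classic normalization:
 * strip dots from acronyms and a trailing 's from possessives.
 */
final class ClassicNormalizationSketch {
    // Assumed to mirror the type tags assigned by ClassicTokenizer.
    static final String ACRONYM_TYPE = "<ACRONYM>";
    static final String APOSTROPHE_TYPE = "<APOSTROPHE>";

    private ClassicNormalizationSketch() {}

    /** Returns the normalized term; terms of any other type pass through unchanged. */
    static String normalize(String term, String type) {
        if (ACRONYM_TYPE.equals(type)) {
            // Literal replacement: remove every period from the acronym.
            return term.replace(".", "");
        }
        if (APOSTROPHE_TYPE.equals(type)
                && (term.endsWith("'s") || term.endsWith("'S"))) {
            // Drop the two trailing characters ('s or 'S).
            return term.substring(0, term.length() - 2);
        }
        return term;
    }

    public static void main(String[] args) {
        System.out.println(normalize("I.B.M.", ACRONYM_TYPE));
        System.out.println(normalize("O'Reilly's", APOSTROPHE_TYPE));
        System.out.println(normalize("I.B.M.", "<ALPHANUM>"));
    }
}
```

Note that the sketch keys on the token type rather than on the text alone, which is why a tokenizer that never emits these types gives such a filter nothing to act on.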

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Resolved] (LUCENE-3366) StandardFilter only works with ClassicTokenizer and only when version < 3.1

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-3366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir resolved LUCENE-3366.
---------------------------------

    Resolution: Not A Problem

Use ClassicFilter if you want this behavior.


[jira] [Commented] (LUCENE-3366) StandardFilter only works with ClassicTokenizer and only when version < 3.1

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13081256#comment-13081256 ] 

Robert Muir commented on LUCENE-3366:
-------------------------------------

Hi David, I think you want to use ClassicFilter.


[jira] [Commented] (LUCENE-3366) StandardFilter only works with ClassicTokenizer and only when version < 3.1

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13081374#comment-13081374 ] 

Robert Muir commented on LUCENE-3366:
-------------------------------------

The purpose of the filter is "Normalizes tokens extracted with StandardTokenizer".

Currently this is a no-op, but we can always improve it, in keeping with the spirit of the standard it implements.

The TODO currently refers to this statement:
"For Thai, Lao, Khmer, Myanmar, and other scripts that do not typically use spaces between words, a good implementation should not depend on the default word boundary specification. It should use a more sophisticated mechanism ... Ideographic scripts such as Japanese and Chinese are even more complex"

There is no problem having a TODO in this filter; we don't need to do a rush job for any reason...

Some of the preparation for this (e.g. improving the default behavior for CJK) was already done in LUCENE-2911. We now tag all these special types,
so in the meantime if someone wants to do their own downstream processing they can do this themselves.
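
As a rough illustration of the kind of downstream processing that type tags make possible, here is a Lucene-free sketch in plain Java. Everything in it is an assumption made for the example: the Token class is a hypothetical stand-in for a token stream, the "<IDEOGRAPHIC>" string is assumed to match the tag StandardTokenizer puts on CJK ideographs, and forming bigrams is just one possible treatment chosen here, not something this issue prescribes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch: post-process tokens according to their type tag. */
final class TypedTokenSketch {
    // Assumed to mirror the tag StandardTokenizer assigns to CJK ideographs.
    static final String IDEOGRAPHIC = "<IDEOGRAPHIC>";

    /** Minimal stand-in for a token that carries a term and a type tag. */
    static final class Token {
        final String term;
        final String type;

        Token(String term, String type) {
            this.term = term;
            this.type = type;
        }
    }

    private TypedTokenSketch() {}

    /**
     * Joins each run of adjacent ideographic tokens into overlapping bigrams.
     * A lone ideograph is kept as-is; all other tokens pass through unchanged.
     */
    static List<String> bigramIdeographs(List<Token> tokens) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            Token t = tokens.get(i);
            if (!IDEOGRAPHIC.equals(t.type)) {
                out.add(t.term);
                i++;
                continue;
            }
            // Find the end (exclusive) of the current run of ideographs.
            int end = i;
            while (end < tokens.size() && IDEOGRAPHIC.equals(tokens.get(end).type)) {
                end++;
            }
            if (end - i == 1) {
                out.add(t.term);
            } else {
                for (int j = i; j + 1 < end; j++) {
                    out.add(tokens.get(j).term + tokens.get(j + 1).term);
                }
            }
            i = end;
        }
        return out;
    }

    public static void main(String[] args) {
        List<Token> tokens = Arrays.asList(
                new Token("lucene", "<ALPHANUM>"),
                new Token("\u4e2d", IDEOGRAPHIC),
                new Token("\u6587", IDEOGRAPHIC),
                new Token("\u5b57", IDEOGRAPHIC));
        System.out.println(bigramIdeographs(tokens));
    }
}
```

The point of the sketch is only that the decision is driven by the type tag, so a custom filter placed after the tokenizer can treat tagged scripts differently without re-detecting them.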



[jira] [Commented] (LUCENE-3366) StandardFilter only works with ClassicTokenizer and only when version < 3.1

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13081388#comment-13081388 ] 

Robert Muir commented on LUCENE-3366:
-------------------------------------

Well, it's not "unfinished"; the right decision might ultimately be to remove it.

And we could deprecate it in 4.9 and remove it in 5.0 if this is the case; no one's indexes will be broken, as it wouldn't have done anything.

But I don't like what happens with Thai etc. right now if someone uses StandardAnalyzer.


[jira] [Commented] (LUCENE-3366) StandardFilter only works with ClassicTokenizer and only when version < 3.1

Posted by "David Smiley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13081385#comment-13081385 ] 

David Smiley commented on LUCENE-3366:
--------------------------------------

Ok.  (I've been in no hurry to rush anything)

I updated the http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters page to fix references to StandardFilter that should have been to ClassicFilter, and I removed some uses of StandardFilter altogether because it doesn't do anything. I'm disinclined to mention this filter in the upcoming revision of my book, but I'll be sure to mention the Classic* variants.

Feel free to close this issue if you feel it is appropriate. I created it as an "improvement" because StandardFilter seems unfinished, and you've acknowledged it is. So perhaps it should stay open until it actually does something some day. 


[jira] [Commented] (LUCENE-3366) StandardFilter only works with ClassicTokenizer and only when version < 3.1

Posted by "David Smiley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13081363#comment-13081363 ] 

David Smiley commented on LUCENE-3366:
--------------------------------------

Doh! Yes, I didn't notice it, Rob. But still... StandardFilter in its current state seems to exist only to satisfy backwards compatibility for code that uses Lucene at a pre-3.x era; nothing more. Shouldn't it be marked @Deprecated to warn people? Or, the "TODO" should be done so that it does something. However, the current StandardTokenizer doesn't really have token types equivalent to ClassicTokenizer's, so there is nothing for StandardFilter to usefully act on. So then there is no TODO to do.
