Posted to dev@lucene.apache.org by "Steven Rowe (JIRA)" <ji...@apache.org> on 2008/01/10 21:51:33 UTC

[jira] Created: (LUCENE-1126) Simplify StandardTokenizer JFlex grammar

Simplify StandardTokenizer JFlex grammar
----------------------------------------

                 Key: LUCENE-1126
                 URL: https://issues.apache.org/jira/browse/LUCENE-1126
             Project: Lucene - Java
          Issue Type: Improvement
          Components: Analysis
    Affects Versions: 2.2
            Reporter: Steven Rowe
            Priority: Minor
             Fix For: 2.4


Summary of thread entitled "Fullwidth alphanumeric characters, plus a question on Korean ranges" begun by Daniel Noll on java-user, and carried over to java-dev:

On 01/07/2008 at 5:06 PM, Daniel Noll wrote:
> I wish the tokeniser could just use Character.isLetter and
> Character.isDigit instead of having to know all the ranges itself, since
> the JRE already has all this information.  Character.isLetter does
> return true for CJK characters though, so the ranges would still come in
> handy for determining what kind of letter they are.  I don't suppose
> JFlex has a way to do this...

The DIGIT macro could be replaced by JFlex's predefined character class [:digit:], which has the same semantics as java.lang.Character.isDigit().
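
For illustration, the macro substitution is a one-liner (a sketch; the actual change is in the patch to be attached):

{code}
DIGIT = [:digit:]
{code}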

Although JFlex's predefined character class [:letter:] (same semantics as java.lang.Character.isLetter()) includes CJK characters, there is a way to handle this using JFlex's regex negation syntax {{!}}.  From [the JFlex documentation|http://jflex.de/manual.html]:

bq. [T]he expression that matches everything of {{a}} not matched by {{b}} is !(!{{a}}|{{b}}) 

So to exclude CJ characters from the LETTER macro:

{code}
    LETTER = ! ( ! [:letter:] | {CJ} )
{code}
 
Since [:letter:] includes all of the Korean ranges, there's no reason (AFAICT) to treat them separately; unlike Chinese and Japanese characters, which are individually tokenized, the Korean characters should participate in the same token boundary rules as all of the other letters.

I looked at some of the differences between Unicode 3.0.0, which Java 1.4.2 supports, and Unicode 5.0, the latest version, and there are lots of new and modified letter and digit ranges.  This stuff gets tweaked all the time, and I don't think Lucene should be in the business of trying to track it, or take a position on which Unicode version users' data should conform to.  

Switching to using JFlex's [:letter:] and [:digit:] predefined character classes ties (most of) these decisions to the user's choice of JVM version, and this seems much more reasonable to me than the current status quo.
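
To illustrate the JVM-version dependence concretely, here is a small probe (a sketch; U+0220 is assumed here as an example of a code point added after Unicode 3.0 -- it entered Unicode 3.2, so a 1.4 JRE should classify it differently than a 1.5+ JRE; verify on the JVMs you actually use):

{code}
public class UnicodeProbe {
    public static void main(String[] args) {
        // JFlex expands [:letter:] and [:digit:] at scanner-generation time
        // using the generating JVM's Character class, so what this prints is
        // what would get baked into the generated scanner's tables.
        System.out.println(System.getProperty("java.version")
            + ": isLetter(U+0220)=" + Character.isLetter('\u0220')
            + ", isDigit('7')=" + Character.isDigit('7'));
    }
}
{code}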

I will attach a patch shortly.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1126) Simplify StandardTokenizer JFlex grammar

Posted by "Hoss Man (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12557854#action_12557854 ] 

Hoss Man commented on LUCENE-1126:
----------------------------------

bq. Switching to using JFlex's [:letter:] and [:digit:] predefined character classes ties (most of) these decisions to the user's choice of JVM version, and this seems much more reasonable to me than the current status quo.

Clarification: I don't think this would actually tie anything to the "user's" choice of JVM; wouldn't it instead make the behavior of the resulting code depend on the JVM choice of whoever generated StandardTokenizerImpl.java from StandardTokenizerImpl.jflex (not to mention possible discrepancies based on the version of JFlex in use at the time)?

I'm not positive, but couldn't this result in situations where a committer using a 1.5 JVM could generate and commit a StandardTokenizerImpl.java that would have a different behavior than if he had used 1.4 -- all of which would be completely independent of whether or not the release engineer of the next release compiled the resulting grammar using 1.4?



[jira] Commented: (LUCENE-1126) Simplify StandardTokenizer JFlex grammar

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12626531#action_12626531 ] 

Michael McCandless commented on LUCENE-1126:
--------------------------------------------

I will commit this in a day or two.



[jira] Commented: (LUCENE-1126) Simplify StandardTokenizer JFlex grammar

Posted by "Steven Rowe (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12628106#action_12628106 ] 

Steven Rowe commented on LUCENE-1126:
-------------------------------------

Yeah, I see this too.

The issue is that the entire Thai range {{\u0e00-\u0e5b}} is included in the unpatched grammar's {LETTER} definition, which contains the huge range {{\u0100-\u1fff}}, much of which is not actually letters.  The patched grammar instead substitutes the Unicode 3.0 {{Letter}} general category (via JFlex's [:letter:]), which excludes some characters in the Thai range: non-spacing marks, a currency symbol, numerals, etc.
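
To make the excluded categories concrete, here is a small probe of the Thai block (an illustrative sketch, with code points chosen to match the categories named above):

{code}
public class ThaiBlockProbe {
    public static void main(String[] args) {
        char[] probes = {
            '\u0e17',  // THO THAHAN: a letter, kept by [:letter:]
            '\u0e35',  // SARA II: a non-spacing mark, excluded by [:letter:]
            '\u0e3f',  // BAHT SIGN: a currency symbol, excluded
            '\u0e50'   // THAI DIGIT ZERO: a decimal digit, excluded
        };
        for (int i = 0; i < probes.length; i++) {
            System.out.println(String.format("U+%04X isLetter=%b getType=%d",
                (int) probes[i], Character.isLetter(probes[i]),
                Character.getType(probes[i])));
        }
    }
}
{code}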

ThaiAnalyzer uses ThaiWordFilter, which uses Java's BreakIterator to tokenize the contiguous text (i.e. without whitespace) provided by StandardTokenizer.
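
For reference, the mechanism ThaiWordFilter relies on is roughly the following (a minimal sketch of BreakIterator usage, not ThaiWordFilter's actual code; the sample string is illustrative):

{code}
import java.text.BreakIterator;
import java.util.Locale;

public class ThaiBreakDemo {
    public static void main(String[] args) {
        // A run of Thai text containing no whitespace, as StandardTokenizer
        // would hand it to the filter.
        String run = "\u0e17\u0e35\u0e48\u0e19\u0e35\u0e48";
        BreakIterator words = BreakIterator.getWordInstance(new Locale("th"));
        words.setText(run);
        int start = words.first();
        for (int end = words.next(); end != BreakIterator.DONE;
                start = end, end = words.next()) {
            System.out.println("token: " + run.substring(start, end));
        }
    }
}
{code}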

The failing test expects to see {{"\u0e17\u0e35\u0e48"}}, but instead gets {{"\u0e17"}}, because {{\u0e35}} is a non-spacing mark, which the patched StandardTokenizer doesn't pass to ThaiWordFilter.

Because of this problem, I guess I'm -1 on applying the patch I provided.

One solution would be to switch from using the {{Letter}} general category to the derived property {{Alphabetic}}, which includes both general categories {{Letter}} and {{Mark}} (see Annex C of [the Unicode Regular Expressions Technical Standard|http://www.unicode.org/unicode/reports/tr18/#Compatibility_Properties], under "alpha", for discussion of this).  The current version of JFlex does not support Unicode property references in its syntax, though, so simplifying -- and correcting -- the grammar may have to wait for the next version of JFlex, which will support syntax like {{\p{Alphabetic}}}.
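
Under that future syntax, the corrected macro might read something like this (hypothetical; not valid in current JFlex releases):

{code}
LETTER = ! ( ! [\p{Alphabetic}] | {CJ} )
{code}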




[jira] Updated: (LUCENE-1126) Simplify StandardTokenizer JFlex grammar

Posted by "Steven Rowe (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steven Rowe updated LUCENE-1126:
--------------------------------

    Attachment: LUCENE-1126.patch

Compiled using JFlex 1.4.1, JDK 1.4.2



[jira] Commented: (LUCENE-1126) Simplify StandardTokenizer JFlex grammar

Posted by "Steven Rowe (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12558030#action_12558030 ] 

Steven Rowe commented on LUCENE-1126:
-------------------------------------

bq. The ANT settings specify 1.4 right? If it does happen, it is a bug, no?

The ANT settings specify that the bytecode compiler should produce 1.4-JVM-compatible byte code (javac.target property), and that the Java source code should be interpreted as complying with a particular version of the Java language specification (javac.source property).

But the JVM is a different tool from the bytecode compiler, and as Hoss pointed out, it is the JVM under which JFlex runs that makes the difference.
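
Schematically, in Ant terms (an illustrative fragment, not Lucene's actual build.xml; paths and the JFlex invocation are assumptions):

{code}
<!-- javac's source/target pin the bytecode compatibility level only. -->
<javac srcdir="src/java" destdir="build/classes"
       source="1.4" target="1.4"/>

<!-- The JFlex step, by contrast, runs inside whatever JVM launched Ant,
     and that JVM's Unicode tables are what get baked into the scanner. -->
<java jar="lib/JFlex.jar" fork="true">
  <arg value="src/java/org/apache/lucene/analysis/standard/StandardTokenizerImpl.jflex"/>
</java>
{code}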

As I noted above:

bq. And when I compile with 1.4.2, the resulting StandardTokenizerImpl.java file is significantly different from the result with 1.5.0.

({{s/compile/generate the scanner source/g}} :) )



[jira] Commented: (LUCENE-1126) Simplify StandardTokenizer JFlex grammar

Posted by "Hoss Man (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12557921#action_12557921 ] 

Hoss Man commented on LUCENE-1126:
----------------------------------

bq. this, as you point out, is a compile-time operation.

again, minor clarification ... if it were "compile-time" it would be *.java -> *.class conversion ... what we're talking about is ... i dunno ... "jflex-time" or "source-generation-time" ... where the JVM in use during the  *.flex -> *.java conversion affects the behavior of the final code.

(i'm not saying it's a bad thing ... i just want to be clear about what happens where)



[jira] Commented: (LUCENE-1126) Simplify StandardTokenizer JFlex grammar

Posted by "Steven Rowe (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12557864#action_12557864 ] 

Steven Rowe commented on LUCENE-1126:
-------------------------------------

bq. I'm not positive, but couldn't this result in situations where a committer using a 1.5 JVM could generate and commit a StandardTokenizerImpl.java that would have a different behavior than if he had used 1.4 - all of which would be completely independent of whether or not the release engineer of the next release compiled the resulting grammar using 1.4?

Yes, you're right - I was conceiving of the resulting lexer building the tables at runtime, but looking at the code in StandardTokenizerImpl.java, that's patently false - this, as you point out, is a compile-time operation.

And when I compile with 1.4.2, the resulting StandardTokenizerImpl.java file is significantly different from the result with 1.5.0.

I'm quite sure, despite this, that switching to the (compile-time) JVM's Unicode version's definitions of letters and digits is a better policy than depending on a non-existent update mechanism for the grammar.



[jira] Commented: (LUCENE-1126) Simplify StandardTokenizer JFlex grammar

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12628428#action_12628428 ] 

Michael McCandless commented on LUCENE-1126:
--------------------------------------------

Steven, it's OK, I can regen; I just wanted to confirm which JRE (1.4) we should use.

I'm also going to add a comment at the top of StandardTokenizerImpl.jflex stating this.



[jira] Commented: (LUCENE-1126) Simplify StandardTokenizer JFlex grammar

Posted by "Steven Rowe (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12628368#action_12628368 ] 

Steven Rowe commented on LUCENE-1126:
-------------------------------------

{quote}
Could we, alternatively, modify the patch to explicitly add back in the full Thai range into ALPHANUM, and then upgrade to \p{Alphabetic} once the next version of JFlex is released?

Or are there other languages, besides Thai, that we might break with this patch?
{quote}

I noticed in looking at the Unicode database that Lao, which is contiguous with Thai and contained in the unpatched {{LETTER}} range, has exactly the same issue as Thai.  However, the Lucene code base doesn't contain a Lao analyzer.  And I think ThaiAnalyzer is depending on faulty behavior from StandardTokenizer, so "fixing" the issue for other languages would amount to assuming that they, too, depend on the same bad behavior.

I'll shortly provide a redone patch that allows the ThaiAnalyzer to work again, but unless we have evidence of actual use by other language analyzers, I don't think we should be addressing them right now.
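
For concreteness, one shape such a fix could take (a hedged sketch only, not the actual redone patch; it reuses the {{\u0e00-\u0e5b}} Thai range quoted earlier in this issue):

{code}
THAI     = [\u0E00-\u0E5B]
ALPHANUM = ( {LETTER} | {THAI} | [:digit:] )+
{code}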



[jira] Commented: (LUCENE-1126) Simplify StandardTokenizer JFlex grammar

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12628387#action_12628387 ] 

Michael McCandless commented on LUCENE-1126:
--------------------------------------------

bq. I'll shortly provide a redone patch that allows the ThaiAnalyzer to work again, but unless we have evidence of actual use by other language analyzers, I don't think we should be addressing them right now.

OK this sounds like the right approach.



[jira] Commented: (LUCENE-1126) Simplify StandardTokenizer JFlex grammar

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12626052#action_12626052 ] 

Michael McCandless commented on LUCENE-1126:
--------------------------------------------

This patch looks reasonable.  We are replacing our own hardwired regexp for DIGIT with JFlex's [:digit:], which resolves to the equivalent of Character.isDigit() on the JVM that ran JFlex.



[jira] Commented: (LUCENE-1126) Simplify StandardTokenizer JFlex grammar

Posted by "Steven Rowe (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12628424#action_12628424 ] 

Steven Rowe commented on LUCENE-1126:
-------------------------------------

bq. Steven, it looks like you ran JFlex with a 1.5 or 1.6 JRE? Should we stick with that, or go with a 1.4 JRE (which is indeed significantly different)?

You're right, Mike - I was inadvertently using a 1.5 JRE.  I'll put up another patch produced with a 1.4 JRE shortly - we should definitely not go with 1.5.



[jira] Commented: (LUCENE-1126) Simplify StandardTokenizer JFlex grammar

Posted by "Steven Rowe (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12557988#action_12557988 ] 

Steven Rowe commented on LUCENE-1126:
-------------------------------------

My imprecise characterization of the process comes in part from what is likely a misunderstanding of the Lucene-Java release process.  When you said:

bq. I'm not positive, but couldn't this result in situations where a committer using a 1.5 JVM could generate and commit a StandardTokenizerImpl.java that would have a different behavior than if he had used 1.4 - all of which would be completely independent of whether or not the release engineer of the next release compiled the resulting grammar using 1.4?

I assumed you meant that during the release process, the lexical scanner source (.java file) would be regenerated from the grammar (.jflex file).  In that scenario, by "compile-time" I meant the entire build process - raw source to jar assembly, *including* lexical scanner generation - undertaken when producing a binary release.

But of course you're right :).  The JVM version in use at source-generation time (which occurs prior to, and potentially not contiguously with, bytecode-generation time) determines the version of Unicode used to define the meaning of "letter" and "digit".




[jira] Commented: (LUCENE-1126) Simplify StandardTokenizer JFlex grammar

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12627943#action_12627943 ] 

Michael McCandless commented on LUCENE-1126:
--------------------------------------------

Hmm -- I'm now seeing a failure with this patch, in TestThaiAnalyzer (in contrib/analyzers):

{code}
    [junit] Testcase: testAnalyzer(org.apache.lucene.analysis.th.TestThaiAnalyzer):	FAILED
    [junit] expected:<?[]> but was:<?[??]>
    [junit] junit.framework.ComparisonFailure: expected:<?[]> but was:<?[??]>
    [junit] 	at org.apache.lucene.analysis.th.TestThaiAnalyzer.assertAnalyzesTo(TestThaiAnalyzer.java:43)
    [junit] 	at org.apache.lucene.analysis.th.TestThaiAnalyzer.testAnalyzer(TestThaiAnalyzer.java:54)
    [junit] 
{code}

Does anyone else see this?



[jira] Commented: (LUCENE-1126) Simplify StandardTokenizer JFlex grammar

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12628406#action_12628406 ] 

Michael McCandless commented on LUCENE-1126:
--------------------------------------------

Steven, it looks like you ran JFlex with a 1.5 or 1.6 JRE?  Should we stick with that, or go with a 1.4 JRE (which is indeed significantly different)?
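
The dependence is easy to demonstrate with a character added to Unicode after 3.0.  Limbu, for instance, entered Unicode in 4.0 (if I have my versions right), so a 1.4 JRE and a 1.5 JRE disagree about it:

{code}
public class UnicodeVersionCheck {
    public static void main(String[] args) {
        // U+1900 is a Limbu letter added in Unicode 4.0.  Java 1.4.2
        // ships Unicode 3.0 data and prints false; Java 5+ (Unicode 4.0)
        // prints true.  JFlex bakes whichever answer it sees into the
        // generated [:letter:] tables.
        System.out.println(Character.isLetter('\u1900'));
    }
}
{code}

Whichever JRE runs JFlex wins, so we should probably standardize on one for regenerating the tokenizer.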



[jira] Commented: (LUCENE-1126) Simplify StandardTokenizer JFlex grammar

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12628308#action_12628308 ] 

Michael McCandless commented on LUCENE-1126:
--------------------------------------------

{quote}
One solution would be to switch from using the Letter general category to the derived property Alphabetic, which includes the Letter general category plus other alphabetic characters, such as the combining vowel signs that fall in the Mark categories (see Annex C of the Unicode Regular Expressions Technical Standard under "alpha" for discussion of this). The current version of JFlex does not support Unicode property references in its syntax, though, so simplifying - and correcting - the grammar may have to wait for the next version of JFlex, which will support syntax like \p{Alphabetic}.
{quote}
Could we, alternatively, modify the patch to explicitly add the full Thai range back into ALPHANUM, and then upgrade to \p{Alphabetic} once the next version of JFlex is released?

Or are there other languages, besides Thai, that we might break with this patch?
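
If we take that interim route, the grammar change could be as small as the sketch below.  The {{THAI}} macro name and the {{ALPHANUM}} shape are my guesses rather than the actual patch; the negation idiom is the one the grammar already uses for {{LETTER}}:

{code}
/* Interim: re-add the Thai block explicitly until JFlex supports
   Unicode property references. */
THAI     = [\u0E00-\u0E59]
LETTER   = ! ( ! [:letter:] | {CJ} )
ALPHANUM = ( {LETTER} | {THAI} | [:digit:] )+

/* Once \p{Alphabetic} is available, the workaround collapses to:
   LETTER = ! ( ! \p{Alphabetic} | {CJ} )                          */
{code}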



[jira] Assigned: (LUCENE-1126) Simplify StandardTokenizer JFlex grammar

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless reassigned LUCENE-1126:
------------------------------------------

    Assignee: Michael McCandless



[jira] Updated: (LUCENE-1126) Simplify StandardTokenizer JFlex grammar

Posted by "Steven Rowe (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steven Rowe updated LUCENE-1126:
--------------------------------

    Attachment: LUCENE-1126.patch

Refreshed original patch to include the Thai range {{[\u0e00-\u0e59]}} in the {{{ALPHANUM}}} definition.

All tests, including TestThaiAnalyzer, now pass.



[jira] Commented: (LUCENE-1126) Simplify StandardTokenizer JFlex grammar

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12557996#action_12557996 ] 

Grant Ingersoll commented on LUCENE-1126:
-----------------------------------------

{quote}
I'm not positive, but couldn't this result in situations where a committer using a 1.5 JVM could generate and commit a StandardTokenizerImpl.java that would have different behavior than if he had used 1.4 - all of which would be completely independent of whether or not the release engineer of the next release compiled the resulting grammar using 1.4?
{quote}

True, but no committer should ever be doing this, right?  :-)  At least not until 3.0.  The Ant settings specify 1.4, right?  If it does happen, it is a bug, no?



[jira] Resolved: (LUCENE-1126) Simplify StandardTokenizer JFlex grammar

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless resolved LUCENE-1126.
----------------------------------------

    Resolution: Fixed

Committed revision 692211.

Thanks Steven!
