You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@lucene.apache.org by "Hoss Man (JIRA)" <ji...@apache.org> on 2008/01/11 01:31:34 UTC
[jira] Commented: (LUCENE-1126) Simplify StandardTokenizer JFlex grammar

    [ https://issues.apache.org/jira/browse/LUCENE-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12557854#action_12557854 ] 

Hoss Man commented on LUCENE-1126:
----------------------------------

bq. Switching to using JFlex's [:letter:] and [:digit:] predefined character classes ties (most of) these decisions to the user's choice of JVM version, and this seems much more reasonable to me than the current status quo.

Clarification: I don't think this would actually tie anything to the "user's" choice of JVM, wouldn't it make the behavior of the resulting code depend on the JVM choice of whoever generated StandardTokenizerImpl.java from StandardTokenizerImpl.jflex (not to mention possible discrepancies based on the version of JFLex in use at the time) ?

I'm not positive, but couldn't this result in situations where a committer using a 1.5 JVM could generate and commit a StandardTokenizerImpl.java that had would have a different behavior then if he was using 1.4 -- all of which would be completely independent of whether or not the release engineer of the next release compiled the resulting grammer using 1.4?

> Simplify StandardTokenizer JFlex grammar
> ----------------------------------------
>
>                 Key: LUCENE-1126
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1126
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>    Affects Versions: 2.2
>            Reporter: Steven Rowe
>            Priority: Minor
>             Fix For: 2.4
>
>
> Summary of thread entitled "Fullwidth alphanumeric characters, plus a question on Korean ranges" begun by Daniel Noll on java-user, and carried over to java-dev:
> On 01/07/2008 at 5:06 PM, Daniel Noll wrote:
> > I wish the tokeniser could just use Character.isLetter and
> > Character.isDigit instead of having to know all the ranges itself, since
> > the JRE already has all this information.  Character.isLetter does
> > return true for CJK characters though, so the ranges would still come in
> > handy for determining what kind of letter they are.  I don't support
> > JFlex has a way to do this...
> The DIGIT macro could be replaced by JFlex's predefined character class [:digit:], which has the same semantics as java.lang.Character.isDigit().
> Although JFlex's predefined character class [:letter:] (same semantics as java.lang.Character.isLetter()) includes CJK characters, there is a way to handle this using JFlex's regex negation syntax {{!}}.  From [the JFlex documentation|http://jflex.de/manual.html]:
> bq. [T]he expression that matches everything of {{a}} not matched by {{b}} is !(!{{a}}|{{b}}) 
> So to exclude CJ characters from the LETTER macro:
> {code}
>     LETTER = ! ( ! [:letter:] | {CJ} )
> {code}
>  
> Since [:letter:] includes all of the Korean ranges, there's no reason (AFAICT) to treat them separately; unlike Chinese and Japanese characters, which are individually tokenized, the Korean characters should participate in the same token boundary rules as all of the other letters.
> I looked at some of the differences between Unicode 3.0.0, which Java 1.4.2 supports, and Unicode 5.0, the latest version, and there are lots of new and modified letter and digit ranges.  This stuff gets tweaked all the time, and I don't think Lucene should be in the business of trying to track it, or take a position on which Unicode version users' data should conform to.  
> Switching to using JFlex's [:letter:] and [:digit:] predefined character classes ties (most of) these decisions to the user's choice of JVM version, and this seems much more reasonable to me than the current status quo.
> I will attach a patch shortly.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org