You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@lucene.apache.org by "Stanislaw Osinski (JIRA)" <ji...@apache.org> on 2007/08/01 09:53:53 UTC
[jira] Commented: (LUCENE-966) A faster JFlex-based replacement for StandardAnalyzer

    [ https://issues.apache.org/jira/browse/LUCENE-966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12516893 ] 

Stanislaw Osinski commented on LUCENE-966:
------------------------------------------

When digging deeper into the issues of compatibility with the original StandardAnalyzer, I stumbled upon something strange. Take the following text:

78academyawards/rules/rule02.html,7194,7227,type

which was tokenized by the original StandardAnalyzer as one <NUM>. If you look at the definition of the <NUM> token:

// every other segment must have at least one digit
<NUM: (<ALPHANUM> <P> <HAS_DIGIT>
       | <HAS_DIGIT> <P> <ALPHANUM>
       | <ALPHANUM> (<P> <HAS_DIGIT> <P> <ALPHANUM>)+
       | <HAS_DIGIT> (<P> <ALPHANUM> <P> <HAS_DIGIT>)+
       | <ALPHANUM> <P> <HAS_DIGIT> (<P> <ALPHANUM> <P> <HAS_DIGIT>)+
       | <HAS_DIGIT> <P> <ALPHANUM> (<P> <HAS_DIGIT> <P> <ALPHANUM>)+
        )

you'll see that, as explained in the comment, every other segment must have at least one digit. But actually, according to my understanding, this rule should not match the above text as a whole (and with JFlex it doesn't , actually). Below is the text split by punctuation characters, and it looks like there is no way of splitting this text into alternating segments, every second of which must have a digit (A = ALPHANUM, H = HAS_DIGIT):

78academyawards   /   rules   /   rule02   .   html   ,   7194   ,   7227   ,   type
                H                  P      A     P       H       P     A     P     H      P      A     P    H?*     (starting from the beginning)
                                                                              H?*    P     A      P      H     P     A       (starting from the end)

* (would have to be H, but no digits in substring "type" or "html")

I have no idea why JavaCC matched the whole text as a <NUM>, JFlex behaved "more correctly" here. 

Now I can see two solutions:

* try to patch the JFlex grammar to emulate JavaCC quirks (though I may not be aware of most of them...)
* relax the <NUM> rule a little bit (JFlex notation):

// there must be at least one segment with a digit
NUM = ({P} ({HAS_DIGIT} | {ALPHANUM}))* {HAS_DIGIT} ({P} ({HAS_DIGIT} | {ALPHANUM}))*

With this definition, again, all StandardAnalyzer tests pass, plus all texts along the lines of:

2006-03-11t082958z_01_ban130523_rtridst_0_ozabs,2076,2123,type
78academyawards/rules/rule02.html,7194,7227,type
978-0-94045043-1,86408,86424,type
62.46,37004,37009,type    (this one was parsed as <HOST> by the original analyzer)

get parsed as a whole as one <NUM>, which is equivalent to what JavaCC-based version would do. I will attach a corresponding patch in a second.



> A faster JFlex-based replacement for StandardAnalyzer
> -----------------------------------------------------
>
>                 Key: LUCENE-966
>                 URL: https://issues.apache.org/jira/browse/LUCENE-966
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>            Reporter: Stanislaw Osinski
>             Fix For: 2.3
>
>         Attachments: AnalyzerBenchmark.java, jflex-analyzer-patch.txt, jflex-analyzer-r560135-patch.txt, jflex-analyzer-r561292-patch.txt
>
>
> JFlex (http://www.jflex.de/) can be used to generate a faster (up to several times) replacement for StandardAnalyzer. Will add a patch and a simple benchmark code in a while.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org