You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@lucene.apache.org by "Stanislaw Osinski (JIRA)" <ji...@apache.org> on 2007/07/26 15:06:40 UTC

[jira] Created: (LUCENE-966) A faster JFlex-based replacement for StandardAnalyzer

A faster JFlex-based replacement for StandardAnalyzer
-----------------------------------------------------

                 Key: LUCENE-966
                 URL: https://issues.apache.org/jira/browse/LUCENE-966
             Project: Lucene - Java
          Issue Type: Improvement
          Components: Analysis
            Reporter: Stanislaw Osinski
             Fix For: 2.3


JFlex (http://www.jflex.de/) can be used to generate a faster (up to several times) replacement for StandardAnalyzer. Will add a patch and a simple benchmark code in a while.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Resolved: (LUCENE-966) A faster JFlex-based replacement for StandardAnalyzer

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless resolved LUCENE-966.
---------------------------------------

       Resolution: Fixed
    Lucene Fields: [New, Patch Available]  (was: [New])

OK I committed this!  Thank you Stanislaw!

I ran a quick perf test on Wikipedia (first 50K docs only) and found
the new StandardTokenizer is ~6X faster -- awesome :)

I made these small additional changes over the final patch before
committing:

  * I removed StandardAnalyzer.html "grammar doc" generation from
    build.xml since it was using jjdoc.  Stanislaw, is there something
    in jflex that can generated a BNF description of the grammar as
    HTML?

  * I removed the @author tag from StandardTokenizer.java: we are
    removing all such tags and instead giving credit in CHANGES.txt.

  * I removed the whitespace-only diffs from common-build.xml &
    build.xml.

  * I put back the big comment that describes this tokenizer in
    StandardTokenizer.java.

  * Put standard Apache copyright headers in all sources.

> A faster JFlex-based replacement for StandardAnalyzer
> -----------------------------------------------------
>
>                 Key: LUCENE-966
>                 URL: https://issues.apache.org/jira/browse/LUCENE-966
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>            Reporter: Stanislaw Osinski
>             Fix For: 2.3
>
>         Attachments: AnalyzerBenchmark.java, jflex-analyzer-patch.txt, jflex-analyzer-r560135-patch.txt, jflex-analyzer-r561292-patch.txt, jflex-analyzer-r561693-compatibility.txt, jflex-analyzer-r562378-patch-nodup.txt, jflex-analyzer-r562378-patch.txt
>
>
> JFlex (http://www.jflex.de/) can be used to generate a faster (up to several times) replacement for StandardAnalyzer. Will add a patch and a simple benchmark code in a while.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-966) A faster JFlex-based replacement for StandardAnalyzer

Posted by "Stanislaw Osinski (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12516762 ] 

Stanislaw Osinski commented on LUCENE-966:
------------------------------------------

Thanks for spotting the differences, I'll add them to the unit tests and will correct the tokenizer accordingly.

One doubt I have is about the filename-like tokens, e.g.:

      OLD: (2004.jpg,34461,34469,type=<HOST>)
      NEW: (2004.jpg,34461,34469,type=<NUM>) 

To be honest, both variants seem "almost" correct. If you try 2007.org -- this is a correct domain name (and has a funny website on it :), so, given the fact that we don't check for typical suffixes, such as ".com", <HOST> doesn't seem wrong. On the other hand, 2004.jpg may well have been some sort of numerical code or a product number, so <NUM> is not totally irrelevant either.

For the JFlex-based tokenizer, I put the <NUM> rule matching first, as it gives some nice performance benefits. We can put HOST first, and then we'll get compliance with the old version.

Another option we might consider is:

* adding a new token type for file names (get a list of common extensions or even assume that an extension is simply three alphanumerical characters) 
* checking for common domain names in hosts (something along the lines of: "mil" | "info" | "gov" | "edu" | "biz" | "com" | "org" | "net" |  "arpa" | {LETTER}{2})

I'm not sure how this will affect the performance of the tokenizer, but my rough guess is that if we don't come up with very complex/ backtracking-prone rules there should not be too much of a difference. On the other hand, if 100% compatibility with the old tokenizer is a priority, adding new token types is not a good idea, I guess.

Finally, when it comes to the initialization time of the new tokenizer -- according to the JFlex documentation, some time is required to unpack the transition tables. But the unpacking takes place during the initialization of static fields, so once the class is loaded the overhead should be negligible.

> A faster JFlex-based replacement for StandardAnalyzer
> -----------------------------------------------------
>
>                 Key: LUCENE-966
>                 URL: https://issues.apache.org/jira/browse/LUCENE-966
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>            Reporter: Stanislaw Osinski
>             Fix For: 2.3
>
>         Attachments: AnalyzerBenchmark.java, jflex-analyzer-patch.txt, jflex-analyzer-r560135-patch.txt, jflex-analyzer-r561292-patch.txt
>
>
> JFlex (http://www.jflex.de/) can be used to generate a faster (up to several times) replacement for StandardAnalyzer. Will add a patch and a simple benchmark code in a while.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-966) A faster JFlex-based replacement for StandardAnalyzer

Posted by "Mark Miller (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12517368 ] 

Mark Miller commented on LUCENE-966:
------------------------------------

>When digging deeper into the issues of compatibility with the original StandardAnalyzer, I stumbled upon something strange. Take the following >text:
>
>78academyawards/rules/rule02.html,7194,7227,type
>
>which was tokenized by the original StandardAnalyzer as one <NUM>. If you look at the definition of the <NUM> token:
>
>// every other segment must have at least one digit
><NUM: (<ALPHANUM> <P> <HAS_DIGIT>
>       | <HAS_DIGIT> <P> <ALPHANUM>
>       | <ALPHANUM> (<P> <HAS_DIGIT> <P> <ALPHANUM>)+
>       | <HAS_DIGIT> (<P> <ALPHANUM> <P> <HAS_DIGIT>)+
>       | <ALPHANUM> <P> <HAS_DIGIT> (<P> <ALPHANUM> <P> <HAS_DIGIT>)+
>       | <HAS_DIGIT> <P> <ALPHANUM> (<P> <HAS_DIGIT> <P> <ALPHANUM>)+
>        )
>
>you'll see that, as explained in the comment, every other segment must have at least one digit. But actually, according to my understanding, this >rule should not match the above text as a whole (and with JFlex it doesn't , actually). Below is the text split by punctuation characters, and it looks >like there is no way of splitting this text into alternating segments, every second of which must have a digit (A = ALPHANUM, H = HAS_DIGIT):
>
>78academyawards / rules / rule02 . html , 7194 , 7227 , type
>                H P A P H P A P H P A P H?* (starting from the beginning)
>                                                                              H?* P A P H P A (starting from the end)
>
>* (would have to be H, but no digits in substring "type" or "html")
>
>I have no idea why JavaCC matched the whole text as a <NUM>, JFlex behaved "more correctly" here. 

I think that JavaCC is correct here. Every other segment must have a digit: 78academyawards / rules / rule02 . html 
It looks to me like every other segment does have a digit. First segment has a digit, second does not, third does, fourth does not. Matches NUM correctly.

> A faster JFlex-based replacement for StandardAnalyzer
> -----------------------------------------------------
>
>                 Key: LUCENE-966
>                 URL: https://issues.apache.org/jira/browse/LUCENE-966
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>            Reporter: Stanislaw Osinski
>             Fix For: 2.3
>
>         Attachments: AnalyzerBenchmark.java, jflex-analyzer-patch.txt, jflex-analyzer-r560135-patch.txt, jflex-analyzer-r561292-patch.txt, jflex-analyzer-r561693-compatibility.txt
>
>
> JFlex (http://www.jflex.de/) can be used to generate a faster (up to several times) replacement for StandardAnalyzer. Will add a patch and a simple benchmark code in a while.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-966) A faster JFlex-based replacement for StandardAnalyzer

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12518463 ] 

Michael McCandless commented on LUCENE-966:
-------------------------------------------


I was gearing up to commit this, but realized the copyright on top of
StandardTokenizer.java isn't the standard Apache header; it's this
one:

/*
 * Carrot2 project.
 *
 * Copyright (C) 2002-2007, Dawid Weiss, Stanisław Osiński.
 * Portions (C) Contributors listed in "carrot2.CONTRIBUTORS" file.
 * All rights reserved.
 *
 * Refer to the full license file "carrot2.LICENSE"
 * in the root folder of the repository checkout or at:
 * http://www.carrot2.org/carrot2.LICENSE
 */

I believe (according to http://www.apache.org/legal/src-headers.html),
we need to replace the above with the standard Apache header and then
maybe move this one into the NOTICE.txt.  Is that right?  Stanislaw,
is that OK?

I would also add the Apache header into StandardAnalyzerImpl.jflex.

> A faster JFlex-based replacement for StandardAnalyzer
> -----------------------------------------------------
>
>                 Key: LUCENE-966
>                 URL: https://issues.apache.org/jira/browse/LUCENE-966
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>            Reporter: Stanislaw Osinski
>             Fix For: 2.3
>
>         Attachments: AnalyzerBenchmark.java, jflex-analyzer-patch.txt, jflex-analyzer-r560135-patch.txt, jflex-analyzer-r561292-patch.txt, jflex-analyzer-r561693-compatibility.txt, jflex-analyzer-r562378-patch-nodup.txt, jflex-analyzer-r562378-patch.txt
>
>
> JFlex (http://www.jflex.de/) can be used to generate a faster (up to several times) replacement for StandardAnalyzer. Will add a patch and a simple benchmark code in a while.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-966) A faster JFlex-based replacement for StandardAnalyzer

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12517222 ] 

Michael McCandless commented on LUCENE-966:
-------------------------------------------

If it really is down to emulating the bugs/oddities in JavaCC then I
think it's not worth polluting the new tokenizer with these legacy
bugs, unless one or two cases can match perfectly and not degrade
performance too badly?

And maybe what we should do is make this a new tokenizer, calling it
StandardAnalyzer2, and then deprecate the existing StandardAnalyzer?
Then remove any & all JavaCC bug emulation from the new one.

This way people relying on the precise bugs in JavaCC tokenization are
not hurt on upgrading to 2.3 and are given a chance to migrate to the
new one (with 1 release of deprecated StandardAnalyzer).  And new
people will use the new faster one.


> A faster JFlex-based replacement for StandardAnalyzer
> -----------------------------------------------------
>
>                 Key: LUCENE-966
>                 URL: https://issues.apache.org/jira/browse/LUCENE-966
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>            Reporter: Stanislaw Osinski
>             Fix For: 2.3
>
>         Attachments: AnalyzerBenchmark.java, jflex-analyzer-patch.txt, jflex-analyzer-r560135-patch.txt, jflex-analyzer-r561292-patch.txt, jflex-analyzer-r561693-compatibility.txt
>
>
> JFlex (http://www.jflex.de/) can be used to generate a faster (up to several times) replacement for StandardAnalyzer. Will add a patch and a simple benchmark code in a while.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Updated: (LUCENE-966) A faster JFlex-based replacement for StandardAnalyzer

Posted by "Stanislaw Osinski (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stanislaw Osinski updated LUCENE-966:
-------------------------------------

    Attachment: jflex-analyzer-r561693-compatibility.txt

A patch for better compatibility with the StandardAnalyzer containing:

* relaxed definition of the <NUM> token
* new test cases in TestStandardAnalyzer

I noticed that with this patch org.apache.lucene.benchmark.quality.TestQualityRun.testTrecQuality fails, but I'm not sure if this is related to the tokenizer.

> A faster JFlex-based replacement for StandardAnalyzer
> -----------------------------------------------------
>
>                 Key: LUCENE-966
>                 URL: https://issues.apache.org/jira/browse/LUCENE-966
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>            Reporter: Stanislaw Osinski
>             Fix For: 2.3
>
>         Attachments: AnalyzerBenchmark.java, jflex-analyzer-patch.txt, jflex-analyzer-r560135-patch.txt, jflex-analyzer-r561292-patch.txt, jflex-analyzer-r561693-compatibility.txt
>
>
> JFlex (http://www.jflex.de/) can be used to generate a faster (up to several times) replacement for StandardAnalyzer. Will add a patch and a simple benchmark code in a while.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Updated: (LUCENE-966) A faster JFlex-based replacement for StandardAnalyzer

Posted by "Stanislaw Osinski (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stanislaw Osinski updated LUCENE-966:
-------------------------------------

    Attachment: AnalyzerBenchmark.java

Here is a very simple benchmark I used to test the performance of StandardAnalyzer, FastAnalyzer and WhitespaceAnalyzer. I ran it on a number of JVMs and got the following results:

Input: Reuters collection, the one used by contrib/benchmark, only documents
longer than 100 bytes

Machine: AMD Sempron 2600+, 2G RAM, Windows XP

Sun 1.4.2 Server
org.apache.lucene.analysis.standard.StandardAnalyzer: 15172 ms, 139667 tokens/s
org.apache.lucene.analysis.fast.FastAnalyzer: 2438 ms, 869170 tokens/s
org.apache.lucene.analysis.WhitespaceAnalyzer: 781 ms, 3547585 tokens/s

Sun 1.4.2 Client
org.apache.lucene.analysis.standard.StandardAnalyzer: 24187 ms, 87610 tokens/s
org.apache.lucene.analysis.fast.FastAnalyzer: 3157 ms, 671218 tokens/s
org.apache.lucene.analysis.WhitespaceAnalyzer: 1453 ms, 1906857 tokens/s

Sun 1.5.0 Server
org.apache.lucene.analysis.standard.StandardAnalyzer: 16062 ms, 131928 tokens/s
org.apache.lucene.analysis.fast.FastAnalyzer: 2641 ms, 802361 tokens/s
org.apache.lucene.analysis.WhitespaceAnalyzer: 750 ms, 3694218 tokens/s

Sun 1.5.0 Client
org.apache.lucene.analysis.standard.StandardAnalyzer: 23891 ms, 88696 tokens/s
org.apache.lucene.analysis.fast.FastAnalyzer: 3641 ms, 581993 tokens/s
org.apache.lucene.analysis.WhitespaceAnalyzer: 1437 ms, 1928089 tokens/s

Sun 1.6.0 Server
org.apache.lucene.analysis.standard.StandardAnalyzer: 13719 ms, 154460 tokens/s
org.apache.lucene.analysis.fast.FastAnalyzer: 2484 ms, 853074 tokens/s
org.apache.lucene.analysis.WhitespaceAnalyzer: 750 ms, 3694218 tokens/s

Sun 1.6.0 Client
org.apache.lucene.analysis.standard.StandardAnalyzer: 22312 ms, 94972 tokens/s
org.apache.lucene.analysis.fast.FastAnalyzer: 2750 ms, 770558 tokens/s
org.apache.lucene.analysis.WhitespaceAnalyzer: 1297 ms, 2136209 tokens/s

IBM 1.4.2
org.apache.lucene.analysis.standard.StandardAnalyzer: 11922 ms, 177741 tokens/s
org.apache.lucene.analysis.fast.FastAnalyzer: 3218 ms, 658495 tokens/s
org.apache.lucene.analysis.WhitespaceAnalyzer: 1407 ms, 1969199 tokens/s

IBM 1.5.0
org.apache.lucene.analysis.standard.StandardAnalyzer: 11797 ms, 179625 tokens/s
org.apache.lucene.analysis.fast.FastAnalyzer: 2968 ms, 713961 tokens/s
org.apache.lucene.analysis.WhitespaceAnalyzer: 1000 ms, 2770664 tokens/s

BEA 1.4.2
org.apache.lucene.analysis.standard.StandardAnalyzer: 16234 ms, 130530 tokens/s
org.apache.lucene.analysis.fast.FastAnalyzer: 3344 ms, 633683 tokens/s
org.apache.lucene.analysis.WhitespaceAnalyzer: 1343 ms, 2063040 tokens/s

BEA 1.5.0 (looks really slow)
org.apache.lucene.analysis.standard.StandardAnalyzer: 33891 ms, 62525 tokens/s
org.apache.lucene.analysis.fast.FastAnalyzer: 12703 ms, 166813 tokens/s
org.apache.lucene.analysis.WhitespaceAnalyzer: 4860 ms, 570095 tokens/s



> A faster JFlex-based replacement for StandardAnalyzer
> -----------------------------------------------------
>
>                 Key: LUCENE-966
>                 URL: https://issues.apache.org/jira/browse/LUCENE-966
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>            Reporter: Stanislaw Osinski
>             Fix For: 2.3
>
>         Attachments: AnalyzerBenchmark.java, jflex-analyzer-patch.txt
>
>
> JFlex (http://www.jflex.de/) can be used to generate a faster (up to several times) replacement for StandardAnalyzer. Will add a patch and a simple benchmark code in a while.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Updated: (LUCENE-966) A faster JFlex-based replacement for StandardAnalyzer

Posted by "Stanislaw Osinski (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stanislaw Osinski updated LUCENE-966:
-------------------------------------

    Attachment: jflex-analyzer-r560135-patch.txt

Here's another patch (against r560135) which:

* replaces JavaCC-based StandardTokenizer with a JFlex-based one (all tests pass), as suggested by Yonik
* updates build scripts accordingly and fixes the issue of deleting one file to many, as reported by Mark 

Additionally, I noticed that ChainedFilterTest from contrib/miscellaneous would fail on machines with non-US locale, so this patch fixes that too.

> A faster JFlex-based replacement for StandardAnalyzer
> -----------------------------------------------------
>
>                 Key: LUCENE-966
>                 URL: https://issues.apache.org/jira/browse/LUCENE-966
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>            Reporter: Stanislaw Osinski
>             Fix For: 2.3
>
>         Attachments: AnalyzerBenchmark.java, jflex-analyzer-patch.txt, jflex-analyzer-r560135-patch.txt
>
>
> JFlex (http://www.jflex.de/) can be used to generate a faster (up to several times) replacement for StandardAnalyzer. Will add a patch and a simple benchmark code in a while.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-966) A faster JFlex-based replacement for StandardAnalyzer

Posted by "Paul Smith (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12515882 ] 

Paul Smith commented on LUCENE-966:
-----------------------------------

We did pretty much the same thing here at Aconex,   The tokenization mechanism in the old javacc-based analyser is woeful compared to what JFlex outputs.

Nice work!  

> A faster JFlex-based replacement for StandardAnalyzer
> -----------------------------------------------------
>
>                 Key: LUCENE-966
>                 URL: https://issues.apache.org/jira/browse/LUCENE-966
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>            Reporter: Stanislaw Osinski
>             Fix For: 2.3
>
>         Attachments: AnalyzerBenchmark.java, jflex-analyzer-patch.txt
>
>
> JFlex (http://www.jflex.de/) can be used to generate a faster (up to several times) replacement for StandardAnalyzer. Will add a patch and a simple benchmark code in a while.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-966) A faster JFlex-based replacement for StandardAnalyzer

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12516776 ] 

Doug Cutting commented on LUCENE-966:
-------------------------------------

It is important that the same sequence of token text is produced, but I think we could live with different token types in some cases, if we must.  Few applications depend on token types, no?

Provided the token text issues can be resolved, I'd like to see StandardTokenizer replaced with this.  Performance is important, and ideally folks shouldn't have to change their applications to see performance improvements.


> A faster JFlex-based replacement for StandardAnalyzer
> -----------------------------------------------------
>
>                 Key: LUCENE-966
>                 URL: https://issues.apache.org/jira/browse/LUCENE-966
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>            Reporter: Stanislaw Osinski
>             Fix For: 2.3
>
>         Attachments: AnalyzerBenchmark.java, jflex-analyzer-patch.txt, jflex-analyzer-r560135-patch.txt, jflex-analyzer-r561292-patch.txt
>
>
> JFlex (http://www.jflex.de/) can be used to generate a faster (up to several times) replacement for StandardAnalyzer. Will add a patch and a simple benchmark code in a while.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-966) A faster JFlex-based replacement for StandardAnalyzer

Posted by "Mark Miller (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12515768 ] 

Mark Miller commented on LUCENE-966:
------------------------------------

Great patch! And a very quick turnaround to boot!

I am seeing a HUGE speed increase!

I also think this should certainly become the new StandardAnalyzer.

I am running into only one issue...when rebuilding, after a clean, FastTokenizer does not seem to get generated -- FastTokenizerImpl is there, but no FastTokenizer interface; it appears to be wiped out in the clean but not regenerated in the build. If I put it in manually, everything works great.

- Mark



> A faster JFlex-based replacement for StandardAnalyzer
> -----------------------------------------------------
>
>                 Key: LUCENE-966
>                 URL: https://issues.apache.org/jira/browse/LUCENE-966
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>            Reporter: Stanislaw Osinski
>             Fix For: 2.3
>
>         Attachments: AnalyzerBenchmark.java, jflex-analyzer-patch.txt
>
>
> JFlex (http://www.jflex.de/) can be used to generate a faster (up to several times) replacement for StandardAnalyzer. Will add a patch and a simple benchmark code in a while.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Updated: (LUCENE-966) A faster JFlex-based replacement for StandardAnalyzer

Posted by "Stanislaw Osinski (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stanislaw Osinski updated LUCENE-966:
-------------------------------------

    Attachment: jflex-analyzer-r562378-patch.txt

I've attached another patch with:

* grammar translated "dot for dot" from JavaCC (no "optimizations" this time) 
* tests extended with Mark's latest compatibility examples

I hope this time the patch correctly applies to the test class (previously I used Eclipse to create the patch which may not have been a good idea). All unit tests pass, so I hope this time I finally got it right.

> A faster JFlex-based replacement for StandardAnalyzer
> -----------------------------------------------------
>
>                 Key: LUCENE-966
>                 URL: https://issues.apache.org/jira/browse/LUCENE-966
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>            Reporter: Stanislaw Osinski
>             Fix For: 2.3
>
>         Attachments: AnalyzerBenchmark.java, jflex-analyzer-patch.txt, jflex-analyzer-r560135-patch.txt, jflex-analyzer-r561292-patch.txt, jflex-analyzer-r561693-compatibility.txt, jflex-analyzer-r562378-patch.txt
>
>
> JFlex (http://www.jflex.de/) can be used to generate a faster (up to several times) replacement for StandardAnalyzer. Will add a patch and a simple benchmark code in a while.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-966) A faster JFlex-based replacement for StandardAnalyzer

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12518523 ] 

Michael McCandless commented on LUCENE-966:
-------------------------------------------

Super, thanks!

> A faster JFlex-based replacement for StandardAnalyzer
> -----------------------------------------------------
>
>                 Key: LUCENE-966
>                 URL: https://issues.apache.org/jira/browse/LUCENE-966
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>            Reporter: Stanislaw Osinski
>             Fix For: 2.3
>
>         Attachments: AnalyzerBenchmark.java, jflex-analyzer-patch.txt, jflex-analyzer-r560135-patch.txt, jflex-analyzer-r561292-patch.txt, jflex-analyzer-r561693-compatibility.txt, jflex-analyzer-r562378-patch-nodup.txt, jflex-analyzer-r562378-patch.txt
>
>
> JFlex (http://www.jflex.de/) can be used to generate a faster (up to several times) replacement for StandardAnalyzer. Will add a patch and a simple benchmark code in a while.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-966) A faster JFlex-based replacement for StandardAnalyzer

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12517342 ] 

Michael McCandless commented on LUCENE-966:
-------------------------------------------

I like the name DefaultAnalyzer.

> A faster JFlex-based replacement for StandardAnalyzer
> -----------------------------------------------------
>
>                 Key: LUCENE-966
>                 URL: https://issues.apache.org/jira/browse/LUCENE-966
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>            Reporter: Stanislaw Osinski
>             Fix For: 2.3
>
>         Attachments: AnalyzerBenchmark.java, jflex-analyzer-patch.txt, jflex-analyzer-r560135-patch.txt, jflex-analyzer-r561292-patch.txt, jflex-analyzer-r561693-compatibility.txt
>
>
> JFlex (http://www.jflex.de/) can be used to generate a faster (up to several times) replacement for StandardAnalyzer. Will add a patch and a simple benchmark code in a while.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Updated: (LUCENE-966) A faster JFlex-based replacement for StandardAnalyzer

Posted by "Stanislaw Osinski (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stanislaw Osinski updated LUCENE-966:
-------------------------------------

    Attachment: jflex-analyzer-r561292-patch.txt

Here is another (this time let's call it "final" :) patch for this issue which comes with a simplified (but test-wise equivalent) grammar for numeric expressions that gives about 5% performance increase.

> A faster JFlex-based replacement for StandardAnalyzer
> -----------------------------------------------------
>
>                 Key: LUCENE-966
>                 URL: https://issues.apache.org/jira/browse/LUCENE-966
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>            Reporter: Stanislaw Osinski
>             Fix For: 2.3
>
>         Attachments: AnalyzerBenchmark.java, jflex-analyzer-patch.txt, jflex-analyzer-r560135-patch.txt, jflex-analyzer-r561292-patch.txt
>
>
> JFlex (http://www.jflex.de/) can be used to generate a faster (up to several times) replacement for StandardAnalyzer. Will add a patch and a simple benchmark code in a while.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-966) A faster JFlex-based replacement for StandardAnalyzer

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12517542 ] 

Michael McCandless commented on LUCENE-966:
-------------------------------------------

First 1000 documents in wikipedia show absolutely no difference -- I think this one is it!

> A faster JFlex-based replacement for StandardAnalyzer
> -----------------------------------------------------
>
>                 Key: LUCENE-966
>                 URL: https://issues.apache.org/jira/browse/LUCENE-966
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>            Reporter: Stanislaw Osinski
>             Fix For: 2.3
>
>         Attachments: AnalyzerBenchmark.java, jflex-analyzer-patch.txt, jflex-analyzer-r560135-patch.txt, jflex-analyzer-r561292-patch.txt, jflex-analyzer-r561693-compatibility.txt, jflex-analyzer-r562378-patch-nodup.txt, jflex-analyzer-r562378-patch.txt
>
>
> JFlex (http://www.jflex.de/) can be used to generate a faster (up to several times) replacement for StandardAnalyzer. Will add a patch and a simple benchmark code in a while.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-966) A faster JFlex-based replacement for StandardAnalyzer

Posted by "Stanislaw Osinski (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12517577 ] 

Stanislaw Osinski commented on LUCENE-966:
------------------------------------------

Good news -- thanks for the test!

> A faster JFlex-based replacement for StandardAnalyzer
> -----------------------------------------------------
>
>                 Key: LUCENE-966
>                 URL: https://issues.apache.org/jira/browse/LUCENE-966
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>            Reporter: Stanislaw Osinski
>             Fix For: 2.3
>
>         Attachments: AnalyzerBenchmark.java, jflex-analyzer-patch.txt, jflex-analyzer-r560135-patch.txt, jflex-analyzer-r561292-patch.txt, jflex-analyzer-r561693-compatibility.txt, jflex-analyzer-r562378-patch-nodup.txt, jflex-analyzer-r562378-patch.txt
>
>
> JFlex (http://www.jflex.de/) can be used to generate a faster (up to several times) replacement for StandardAnalyzer. Will add a patch and a simple benchmark code in a while.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-966) A faster JFlex-based replacement for StandardAnalyzer

Posted by "Stanislaw Osinski (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12517515 ] 

Stanislaw Osinski commented on LUCENE-966:
------------------------------------------

This time I used Tortoise, but it made things worse :) I'll create the patch the usual way and e-mail it directly to you to see if it's not JIRA's fault (I don't see the '?' characters on my local machine).

> A faster JFlex-based replacement for StandardAnalyzer
> -----------------------------------------------------
>
>                 Key: LUCENE-966
>                 URL: https://issues.apache.org/jira/browse/LUCENE-966
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>            Reporter: Stanislaw Osinski
>             Fix For: 2.3
>
>         Attachments: AnalyzerBenchmark.java, jflex-analyzer-patch.txt, jflex-analyzer-r560135-patch.txt, jflex-analyzer-r561292-patch.txt, jflex-analyzer-r561693-compatibility.txt, jflex-analyzer-r562378-patch.txt
>
>
> JFlex (http://www.jflex.de/) can be used to generate a faster (up to several times) replacement for StandardAnalyzer. Will add a patch and a simple benchmark code in a while.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-966) A faster JFlex-based replacement for StandardAnalyzer

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12516775 ] 

Michael McCandless commented on LUCENE-966:
-------------------------------------------


I agree, let's try to perfectly match the tokens of the old
StandardAnalyzer so we have a way-faster drop-in replacement.

The speedups of JFlex are amazing: based on a quick test, with JFlex +
patch from LUCENE-969, the new StandardAnalyzer is only 2.09X slower
than WhitespaceAnalyzer even though it's doing so much more ...

> Finally, when it comes to the initialization time of the new
> tokenizer -- according to the JFlex documentation, some time is
> required to unpack the transition tables. But the unpacking takes
> place during the initialization of static fields, so once the class
> is loaded the overhead should be negligible.

Yeah I'm baffled why it's that much slower, but on 100 token docs I
definitely see LUCENE-969 making things 84% faster but "only" 36%
faster if I use the full Wikipedia doc (which are much larger than 100
tokens on average).  If we tested even smaller docs I think the gains
would be even more.

When I ran under the profiler it was the StandardTokenizerImpl
<init>(java.io.Reader) way on the top.  Maybe it's the cost of new'ing
the 16 KB buffer each time?

In any event I think it's OK, so long as we get LUCENE-969 in, and
document the importance of using reusableTokenStream() API for better
performance.


> A faster JFlex-based replacement for StandardAnalyzer
> -----------------------------------------------------
>
>                 Key: LUCENE-966
>                 URL: https://issues.apache.org/jira/browse/LUCENE-966
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>            Reporter: Stanislaw Osinski
>             Fix For: 2.3
>
>         Attachments: AnalyzerBenchmark.java, jflex-analyzer-patch.txt, jflex-analyzer-r560135-patch.txt, jflex-analyzer-r561292-patch.txt
>
>
> JFlex (http://www.jflex.de/) can be used to generate a faster (up to several times) replacement for StandardAnalyzer. Will add a patch and a simple benchmark code in a while.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-966) A faster JFlex-based replacement for StandardAnalyzer

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12517581 ] 

Michael McCandless commented on LUCENE-966:
-------------------------------------------

OK I can commit this.  Does anyone see any reason not to?  I will wait a few days then commit...

Thank you Stanislaw!

> A faster JFlex-based replacement for StandardAnalyzer
> -----------------------------------------------------
>
>                 Key: LUCENE-966
>                 URL: https://issues.apache.org/jira/browse/LUCENE-966
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>            Reporter: Stanislaw Osinski
>             Fix For: 2.3
>
>         Attachments: AnalyzerBenchmark.java, jflex-analyzer-patch.txt, jflex-analyzer-r560135-patch.txt, jflex-analyzer-r561292-patch.txt, jflex-analyzer-r561693-compatibility.txt, jflex-analyzer-r562378-patch-nodup.txt, jflex-analyzer-r562378-patch.txt
>
>
> JFlex (http://www.jflex.de/) can be used to generate a faster (up to several times) replacement for StandardAnalyzer. Will add a patch and a simple benchmark code in a while.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-966) A faster JFlex-based replacement for StandardAnalyzer

Posted by "Stanislaw Osinski (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12518636 ] 

Stanislaw Osinski commented on LUCENE-966:
------------------------------------------

>  * I removed StandardAnalyzer.html "grammar doc" generation from
>    build.xml since it was using jjdoc. Stanislaw, is there something
>    in jflex that can generated a BNF description of the grammar as
>    HTML? 

I've had a look at the JFlex docs and there doesn't seem to be such a tool for JFlex, I'm afraid.

> A faster JFlex-based replacement for StandardAnalyzer
> -----------------------------------------------------
>
>                 Key: LUCENE-966
>                 URL: https://issues.apache.org/jira/browse/LUCENE-966
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>            Reporter: Stanislaw Osinski
>             Fix For: 2.3
>
>         Attachments: AnalyzerBenchmark.java, jflex-analyzer-patch.txt, jflex-analyzer-r560135-patch.txt, jflex-analyzer-r561292-patch.txt, jflex-analyzer-r561693-compatibility.txt, jflex-analyzer-r562378-patch-nodup.txt, jflex-analyzer-r562378-patch.txt
>
>
> JFlex (http://www.jflex.de/) can be used to generate a faster (up to several times) replacement for StandardAnalyzer. Will add a patch and a simple benchmark code in a while.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-966) A faster JFlex-based replacement for StandardAnalyzer

Posted by "Stanislaw Osinski (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12518521 ] 

Stanislaw Osinski commented on LUCENE-966:
------------------------------------------

Absolutely -- the header was there only because I used Carrot2's grammar to start with and forgot to remove it. You can put the Apache header on all the files, NOTICE.txt is not necessary either. Apologies for confusion :)

> A faster JFlex-based replacement for StandardAnalyzer
> -----------------------------------------------------
>
>                 Key: LUCENE-966
>                 URL: https://issues.apache.org/jira/browse/LUCENE-966
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>            Reporter: Stanislaw Osinski
>             Fix For: 2.3
>
>         Attachments: AnalyzerBenchmark.java, jflex-analyzer-patch.txt, jflex-analyzer-r560135-patch.txt, jflex-analyzer-r561292-patch.txt, jflex-analyzer-r561693-compatibility.txt, jflex-analyzer-r562378-patch-nodup.txt, jflex-analyzer-r562378-patch.txt
>
>
> JFlex (http://www.jflex.de/) can be used to generate a faster (up to several times) replacement for StandardAnalyzer. Will add a patch and a simple benchmark code in a while.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-966) A faster JFlex-based replacement for StandardAnalyzer

Posted by "Stanislaw Osinski (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12516893 ] 

Stanislaw Osinski commented on LUCENE-966:
------------------------------------------

When digging deeper into the issues of compatibility with the original StandardAnalyzer, I stumbled upon something strange. Take the following text:

78academyawards/rules/rule02.html,7194,7227,type

which was tokenized by the original StandardAnalyzer as one <NUM>. If you look at the definition of the <NUM> token:

// every other segment must have at least one digit
<NUM: (<ALPHANUM> <P> <HAS_DIGIT>
       | <HAS_DIGIT> <P> <ALPHANUM>
       | <ALPHANUM> (<P> <HAS_DIGIT> <P> <ALPHANUM>)+
       | <HAS_DIGIT> (<P> <ALPHANUM> <P> <HAS_DIGIT>)+
       | <ALPHANUM> <P> <HAS_DIGIT> (<P> <ALPHANUM> <P> <HAS_DIGIT>)+
       | <HAS_DIGIT> <P> <ALPHANUM> (<P> <HAS_DIGIT> <P> <ALPHANUM>)+
        )

you'll see that, as explained in the comment, every other segment must have at least one digit. But actually, according to my understanding, this rule should not match the above text as a whole (and with JFlex it doesn't , actually). Below is the text split by punctuation characters, and it looks like there is no way of splitting this text into alternating segments, every second of which must have a digit (A = ALPHANUM, H = HAS_DIGIT):

78academyawards   /   rules   /   rule02   .   html   ,   7194   ,   7227   ,   type
                H                  P      A     P       H       P     A     P     H      P      A     P    H?*     (starting from the beginning)
                                                                              H?*    P     A      P      H     P     A       (starting from the end)

* (would have to be H, but no digits in substring "type" or "html")

I have no idea why JavaCC matched the whole text as a <NUM>, JFlex behaved "more correctly" here. 

Now I can see two solutions:

* try to patch the JFlex grammar to emulate JavaCC quirks (though I may not be aware of most of them...)
* relax the <NUM> rule a little bit (JFlex notation):

// there must be at least one segment with a digit
NUM = ({P} ({HAS_DIGIT} | {ALPHANUM}))* {HAS_DIGIT} ({P} ({HAS_DIGIT} | {ALPHANUM}))*

With this definition, again, all StandardAnalyzer tests pass, plus all texts along the lines of:

2006-03-11t082958z_01_ban130523_rtridst_0_ozabs,2076,2123,type
78academyawards/rules/rule02.html,7194,7227,type
978-0-94045043-1,86408,86424,type
62.46,37004,37009,type    (this one was parsed as <HOST> by the original analyzer)

get parsed as a whole as one <NUM>, which is equivalent to what JavaCC-based version would do. I will attach a corresponding patch in a second.



> A faster JFlex-based replacement for StandardAnalyzer
> -----------------------------------------------------
>
>                 Key: LUCENE-966
>                 URL: https://issues.apache.org/jira/browse/LUCENE-966
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>            Reporter: Stanislaw Osinski
>             Fix For: 2.3
>
>         Attachments: AnalyzerBenchmark.java, jflex-analyzer-patch.txt, jflex-analyzer-r560135-patch.txt, jflex-analyzer-r561292-patch.txt
>
>
> JFlex (http://www.jflex.de/) can be used to generate a faster (up to several times) replacement for StandardAnalyzer. Will add a patch and a simple benchmark code in a while.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-966) A faster JFlex-based replacement for StandardAnalyzer

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12516745 ] 

Michael McCandless commented on LUCENE-966:
-------------------------------------------

I took the patch from here (to use jflex for StandardAnalyzer) and
merged it with the patch from LUCENE-969 (re-use Token & TokenStream)
to measure the net performance gains.

I measure the time to just tokenize all of Wikipedia using
StandardAnalyzer using contrib/benchmark plus patch from LUCENE-967
(test details are described in LUCENE-969).

With the jflex patch it takes 646 sec (best of 2 runs); when I then
merge in the patch from LUCENE-969 it takes 455 sec.  Subtracting off
the time to just load all Wikipedia docs (= 112 sec) that gives net
additional speedup of 36% (534 sec -> 343 sec) when using LUCENE-969
in addition to jflex.

A couple other things I noticed:

  * The init cost of jflex (StandardTokenizerImpl) seems to be fairly
    high: when I repeat the above test with smallish docs (100 tokens
    each) instead, the gain is around 84%.  I think this just makes
    the new reusableTokenStream() in LUCENE-969 important to commit.

  * I'm seeing differing token counts with the jflex StandardAnalyzer
    vs the current one; I think there is some difference here.  I will
    track down which tokens differ and post back...


> A faster JFlex-based replacement for StandardAnalyzer
> -----------------------------------------------------
>
>                 Key: LUCENE-966
>                 URL: https://issues.apache.org/jira/browse/LUCENE-966
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>            Reporter: Stanislaw Osinski
>             Fix For: 2.3
>
>         Attachments: AnalyzerBenchmark.java, jflex-analyzer-patch.txt, jflex-analyzer-r560135-patch.txt, jflex-analyzer-r561292-patch.txt
>
>
> JFlex (http://www.jflex.de/) can be used to generate a faster (up to several times) replacement for StandardAnalyzer. Will add a patch and a simple benchmark code in a while.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-966) A faster JFlex-based replacement for StandardAnalyzer

Posted by "Doron Cohen (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12517097 ] 

Doron Cohen commented on LUCENE-966:
------------------------------------

The search quality test failure can be caused by the standard 
analyzer generating different tokens than before. (has nothing 
to do with token types.)

This is because the test's topics (queries) and qrels (expected matches) 
were created by examining an index that was created using the current 
standard analyzer. Now, running this test with an analyzer that creates 
other tokens is likely to fail. 

It is not difficult to update this test for a modified analyzer, but it seems 
better to me to preserve the original standard analyzer behavior.


> A faster JFlex-based replacement for StandardAnalyzer
> -----------------------------------------------------
>
>                 Key: LUCENE-966
>                 URL: https://issues.apache.org/jira/browse/LUCENE-966
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>            Reporter: Stanislaw Osinski
>             Fix For: 2.3
>
>         Attachments: AnalyzerBenchmark.java, jflex-analyzer-patch.txt, jflex-analyzer-r560135-patch.txt, jflex-analyzer-r561292-patch.txt, jflex-analyzer-r561693-compatibility.txt
>
>
> JFlex (http://www.jflex.de/) can be used to generate a faster (up to several times) replacement for StandardAnalyzer. Will add a patch and a simple benchmark code in a while.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-966) A faster JFlex-based replacement for StandardAnalyzer

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12515726 ] 

Yonik Seeley commented on LUCENE-966:
-------------------------------------

Thanks Staszek, very nice!
If FastTokenizer is compatible with StandardTokenizer, it should probably just replace it.
Have you tried this replacement and running all the lucene tests?


> A faster JFlex-based replacement for StandardAnalyzer
> -----------------------------------------------------
>
>                 Key: LUCENE-966
>                 URL: https://issues.apache.org/jira/browse/LUCENE-966
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>            Reporter: Stanislaw Osinski
>             Fix For: 2.3
>
>         Attachments: AnalyzerBenchmark.java, jflex-analyzer-patch.txt
>
>
> JFlex (http://www.jflex.de/) can be used to generate a faster (up to several times) replacement for StandardAnalyzer. Will add a patch and a simple benchmark code in a while.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Updated: (LUCENE-966) A faster JFlex-based replacement for StandardAnalyzer

Posted by "Stanislaw Osinski (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stanislaw Osinski updated LUCENE-966:
-------------------------------------

    Attachment: jflex-analyzer-patch.txt

Here comes a somewhat rough (needing refactorings, see below) patch adding a JFlex-based repalcement for StandardAnalyzer called FastAnalyzer.  A few comments:

* all tests from TestStandardAnalyzer pass
* tokenizer generation added to Ant build scripts

Because FastAnalyzer shares some code with StandardAnalyzer, it might be worthwhile to do some refactorings:

* have a common test for both analyzers (only if we want to keep them in sync)
* have a common superclass for FastFilter and StandardFilter
* have a common superclass for FastAnalyzer and StandardAnalyzer

> A faster JFlex-based replacement for StandardAnalyzer
> -----------------------------------------------------
>
>                 Key: LUCENE-966
>                 URL: https://issues.apache.org/jira/browse/LUCENE-966
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>            Reporter: Stanislaw Osinski
>             Fix For: 2.3
>
>         Attachments: jflex-analyzer-patch.txt
>
>
> JFlex (http://www.jflex.de/) can be used to generate a faster (up to several times) replacement for StandardAnalyzer. Will add a patch and a simple benchmark code in a while.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-966) A faster JFlex-based replacement for StandardAnalyzer

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12517520 ] 

Michael McCandless commented on LUCENE-966:
-------------------------------------------

Super, I will compare tokens on Wikipedia...

> A faster JFlex-based replacement for StandardAnalyzer
> -----------------------------------------------------
>
>                 Key: LUCENE-966
>                 URL: https://issues.apache.org/jira/browse/LUCENE-966
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>            Reporter: Stanislaw Osinski
>             Fix For: 2.3
>
>         Attachments: AnalyzerBenchmark.java, jflex-analyzer-patch.txt, jflex-analyzer-r560135-patch.txt, jflex-analyzer-r561292-patch.txt, jflex-analyzer-r561693-compatibility.txt, jflex-analyzer-r562378-patch-nodup.txt, jflex-analyzer-r562378-patch.txt
>
>
> JFlex (http://www.jflex.de/) can be used to generate a faster (up to several times) replacement for StandardAnalyzer. Will add a patch and a simple benchmark code in a while.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-966) A faster JFlex-based replacement for StandardAnalyzer

Posted by "Mark Miller (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12517348 ] 

Mark Miller commented on LUCENE-966:
------------------------------------

These issues seem odd.

Both JavaCC and Flex match with the same rules:
1. Longest match first
2. If match size is the same, use the first in the grammar

OLD
    (money.cnn.com,1382,1395,type=<HOST>)
    (magazines,1396,1405,type=<ALPHANUM>)
    (fortune,1406,1413,type=<ALPHANUM>)
    (fortune,1414,1421,type=<ALPHANUM>)
    (archive/2007/03/19/8402357,1422,1448,type=<NUM>)
    (index.htm,1449,1458,type=<HOST>)

  NEW
    (/money.cnn.com/magazines/fortune/fortune_archive/2007/03/19/8402357/index.htm,1381,1458,type=<NUM>)

The old is correct and the NEW should not match <NUM>. <NUM> should break on '/' and '.' and every other token from the break should have a digit for a NUM match to occur. This is not the case.

  OLD
    (mid-20th,2436,2444,type=<NUM>)

  NEW
    (mid,2436,2439,type=<ALPHANUM>)
    (-20th,2439,2444,type=<NUM>)

Something is wrong with the NEW one. <NUM> is certainly a valid longer match.

  OLD
    (safari-0-sheikh,12011,12026,type=<NUM>)
    (zayed,12027,12032,type=<ALPHANUM>)
    (grand,12033,12038,type=<ALPHANUM>)
    (mosque.jpg,12039,12049,type=<HOST>)

  NEW
    (safari,12011,12017,type=<ALPHANUM>)
    (0-sheikh-zayed-grand-mosque.jpg,12018,12049,type=<NUM>)

Again, something seems wrong with the NEW.  (safari-0-sheikh,12011,12026,type=<NUM>) is a correct and longer match than  (safari,12011,12017,type=<ALPHANUM>)

It would be nice to have the source text for these comparisons.

Also, a hard vote against StandardAnalyzer2 <g> Default is arguable as well, as this wouldn't be the default analyzer you should use in many cases (don't like standard because of that either).

>From the latest samples, I would say something is off with the NEW and OLD appears mostly correct.

- Mark


> A faster JFlex-based replacement for StandardAnalyzer
> -----------------------------------------------------
>
>                 Key: LUCENE-966
>                 URL: https://issues.apache.org/jira/browse/LUCENE-966
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>            Reporter: Stanislaw Osinski
>             Fix For: 2.3
>
>         Attachments: AnalyzerBenchmark.java, jflex-analyzer-patch.txt, jflex-analyzer-r560135-patch.txt, jflex-analyzer-r561292-patch.txt, jflex-analyzer-r561693-compatibility.txt
>
>
> JFlex (http://www.jflex.de/) can be used to generate a faster (up to several times) replacement for StandardAnalyzer. Will add a patch and a simple benchmark code in a while.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

Re: [jira] Commented: (LUCENE-966) A faster JFlex-based replacement for StandardAnalyzer

Posted by Stanislaw Osinski <st...@man.poznan.pl>.

>
> > Mark -- have you tried the jflex-analyzer-r560135-patch.txt patch with
> your wikipedia diff test? That's the early one whose grammar was "dot for
> dot" translated from the original JavaCC spec -- for further patches I did
> some "optimizations", which seem to have broken the compatibility...
> >
> The test is Mike's and I think it is off your latest patch.


Oops again -- I should stop working late at night :)

The latest patch was not too compatible with JavaCC, the
jflex-analyzer-r560135-patch.txt patch should be best here. Maybe I should
delete the other attachments from JIRA to avoid further confusion?

Looks like the optimizations might have to go then?


 Definitely, but the base version (jflex-analyzer-r560135-patch.txt) is
still much faster than JavaCC.

> Incidentally, what was the motivation for requiring the <NUM> token to
> have numbers only in every second segment and not in any segment?
> >
> I don't think the rule is "every second segment" but "at least every
> other segment". Why this rule was made, I am not sure; I am guessing it
> was just a good rule of thumb to catch a lot of serial numbers, model
> numbers, etc but without going too overboard in the matching.
>

Ok -- seems reasonable.

Staszek

Re: [jira] Commented: (LUCENE-966) A faster JFlex-based replacement for StandardAnalyzer

Posted by Mark Miller <ma...@gmail.com>.

>
>
>
> Mark -- have you tried the jflex-analyzer-r560135-patch.txt patch with your wikipedia diff test? That's the early one whose grammar was "dot for dot" translated from the original JavaCC spec -- for further patches I did some "optimizations", which seem to have broken the compatibility...
>   
The test is Mike's and I think it is off your latest patch. Looks like 
the optimizations might have to go then?
> Incidentally, what was the motivation for requiring the <NUM> token to have numbers only in every second segment and not in any segment?
>   
I don't think the rule is "every second segment" but "at least every 
other segment". Why this rule was made, I am not sure; I am guessing it 
was just a good rule of thumb to catch a lot of serial numbers, model 
numbers, etc but without going too overboard in the matching.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-966) A faster JFlex-based replacement for StandardAnalyzer

Posted by "Stanislaw Osinski (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12517371 ] 

Stanislaw Osinski commented on LUCENE-966:
------------------------------------------

Okkkk -- only now I realized I made a really silly mistake :) When using Mark's examples I somehow took the ",type" substring as part of the token image, which made the JavaCC tokenizer look "buggy"...  Apologies for the confusion, tomorrow in the morning I'll correct my tests and will see what's happening.

One more important clarification -- the tokenizer from the last patch (jflex-analyzer-r561693-compatibility.txt) has a completely different definition of the <NUM> token -- it allows digits in any segment, hence the totally different results. If we want to be compatible with the StandardAnalyzer, we should forget about that patch.

Mark -- have you tried the jflex-analyzer-r560135-patch.txt patch with your wikipedia diff test? That's the early one whose grammar was "dot for dot" translated from the original JavaCC spec -- for further patches I did some "optimizations", which seem to have broken the compatibility...

Incidentally, what was the motivation for requiring the <NUM> token to have numbers only in every second segment and not in any segment?



> A faster JFlex-based replacement for StandardAnalyzer
> -----------------------------------------------------
>
>                 Key: LUCENE-966
>                 URL: https://issues.apache.org/jira/browse/LUCENE-966
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>            Reporter: Stanislaw Osinski
>             Fix For: 2.3
>
>         Attachments: AnalyzerBenchmark.java, jflex-analyzer-patch.txt, jflex-analyzer-r560135-patch.txt, jflex-analyzer-r561292-patch.txt, jflex-analyzer-r561693-compatibility.txt
>
>
> JFlex (http://www.jflex.de/) can be used to generate a faster (up to several times) replacement for StandardAnalyzer. Will add a patch and a simple benchmark code in a while.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-966) A faster JFlex-based replacement for StandardAnalyzer

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12517513 ] 

Michael McCandless commented on LUCENE-966:
-------------------------------------------

The TestStandardAnalyzer.java patch still fails to apply -- looks like
the same thing as before (the unicode chars under "Korean words" were
replaced by ?'s somewhere along the way).

And, somehow, many of the files are dup'd (appear twice) in the
patch (eg StandardTokenizerImpl.jflex, StandardFilter.java, etc.).
I'm not sure whether first or 2nd one is the one to keep.

> A faster JFlex-based replacement for StandardAnalyzer
> -----------------------------------------------------
>
>                 Key: LUCENE-966
>                 URL: https://issues.apache.org/jira/browse/LUCENE-966
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>            Reporter: Stanislaw Osinski
>             Fix For: 2.3
>
>         Attachments: AnalyzerBenchmark.java, jflex-analyzer-patch.txt, jflex-analyzer-r560135-patch.txt, jflex-analyzer-r561292-patch.txt, jflex-analyzer-r561693-compatibility.txt, jflex-analyzer-r562378-patch.txt
>
>
> JFlex (http://www.jflex.de/) can be used to generate a faster (up to several times) replacement for StandardAnalyzer. Will add a patch and a simple benchmark code in a while.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-966) A faster JFlex-based replacement for StandardAnalyzer

Posted by "Stanislaw Osinski (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12517233 ] 

Stanislaw Osinski commented on LUCENE-966:
------------------------------------------

To be precise -- I'm not 100% sure that this is a bug in JavaCC (I'll try to browse/ask on their mailing list to find out), but it looks like the scanner generated by JavaCC does not really strictly follow the grammar. I discovered this when you gave the examples of different token produced by JFlex- and JavaCC-based scanners generated from equivalent grammars -- JFlex seemed to behave "correctly", hence different tokens.

I was about to start trying to hack the grammar to produce results simiar to StandardAnalyzer (based on a finite set of test cases :). I'll see how it compares performance-wise too.

> A faster JFlex-based replacement for StandardAnalyzer
> -----------------------------------------------------
>
>                 Key: LUCENE-966
>                 URL: https://issues.apache.org/jira/browse/LUCENE-966
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>            Reporter: Stanislaw Osinski
>             Fix For: 2.3
>
>         Attachments: AnalyzerBenchmark.java, jflex-analyzer-patch.txt, jflex-analyzer-r560135-patch.txt, jflex-analyzer-r561292-patch.txt, jflex-analyzer-r561693-compatibility.txt
>
>
> JFlex (http://www.jflex.de/) can be used to generate a faster (up to several times) replacement for StandardAnalyzer. Will add a patch and a simple benchmark code in a while.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-966) A faster JFlex-based replacement for StandardAnalyzer

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12516752 ] 

Michael McCandless commented on LUCENE-966:
-------------------------------------------

I tracked down at least some differences between the JavaCC vs JFlex
versions of StandardAnalyzer.

I think we should resolve these before committing.

I just printed all tokens for the first 20 Wikipedia docs and diff'd
the outputs.

Here are the categories of differences that I saw:

  * Only the type differs on a filename-like token:

      OLD: (2004.jpg,34461,34469,type=<HOST>)
      NEW: (2004.jpg,34461,34469,type=<NUM>)

    In this case the old StandardAnalyzer called "2004.jpg" a HOST and
    the new one calls it a NUM.  Seems like neither one is right!

  * Only the type differs on a number token:

      OLD: (62.46,37004,37009,type=<HOST>)
      NEW: (62.46,37004,37009,type=<NUM>)

    The new tokenizer looks right here.  I guess the decimal point
    confuses the JavaCC (old) one.

  * Different number of tokens produced for number-like-token:

      OLD: (978-0-94045043-1,86408,86424,type=<NUM>)
      NEW: (978-0-94045043,86408,86422,type=<NUM>)
           (1,86423,86424,type=<ALPHANUM>)

    The new one split off the final "-1" as its own token, and called
    it ALPHANUM not NUM.  I think the old behavior is correct.

  * Different number of tokens produced for filename:

      OLD: (78academyawards/rules/rule02.html,7194,7227,type=<NUM>)
      NEW: (78academyawards/rules/rule02,7194,7222,type=<NUM>)
           (html,7223,7227,type=<ALPHANUM>)

    I think the old one is better, though it should not be called a
    NUM (maybe we need a new "FILENAME" token type?).

  * Same as above, but split on final '_' instead of '.' ('-' also
    shows this behavior):

      OLD: (2006-03-11t082958z_01_ban130523_rtridst_0_ozabs,2076,2123,type=<NUM>)
      new: (2006-03-11t082958z_01_ban130523_rtridst_0,2076,2117,type=<NUM>)
           (ozabs,2118,2123,type=<ALPHANUM>)


> A faster JFlex-based replacement for StandardAnalyzer
> -----------------------------------------------------
>
>                 Key: LUCENE-966
>                 URL: https://issues.apache.org/jira/browse/LUCENE-966
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>            Reporter: Stanislaw Osinski
>             Fix For: 2.3
>
>         Attachments: AnalyzerBenchmark.java, jflex-analyzer-patch.txt, jflex-analyzer-r560135-patch.txt, jflex-analyzer-r561292-patch.txt
>
>
> JFlex (http://www.jflex.de/) can be used to generate a faster (up to several times) replacement for StandardAnalyzer. Will add a patch and a simple benchmark code in a while.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-966) A faster JFlex-based replacement for StandardAnalyzer

Posted by "Stanislaw Osinski (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12517184 ] 

Stanislaw Osinski commented on LUCENE-966:
------------------------------------------

Thanks for more test cases. I guess the biggest problem here is that the scanner generated by JavaCC doesn't seem to strictly follow the specification (see https://issues.apache.org/jira/browse/LUCENE-966#action_12516893), so I'd need to emulate possible JavaCC "bugs" I'm not aware of at the moment (I'm not an expert on lexical scanner generation either, not yet at least :). I can add some workarounds to the grammar to make the known incompatibility examples work, but this won't guarantee consistency in general.

As a side note, it's a shame there's no trace of the version of JavaCC that was used to generate the scanner for the original StandardAnalyzer. I'm also curious if the results of the current JavaCC grammar would be the same with the newest version of the generator (4.0 I guess) -- I'll try to check that.

Anyway, I'll take a look at the problem in more depth once again. And in the worst case scenario, we can keep the StandardAnalyzer as it was and add the new one next to it so that people can have a choice (on the other hand, this might be a problem for the quality tests).

> A faster JFlex-based replacement for StandardAnalyzer
> -----------------------------------------------------
>
>                 Key: LUCENE-966
>                 URL: https://issues.apache.org/jira/browse/LUCENE-966
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>            Reporter: Stanislaw Osinski
>             Fix For: 2.3
>
>         Attachments: AnalyzerBenchmark.java, jflex-analyzer-patch.txt, jflex-analyzer-r560135-patch.txt, jflex-analyzer-r561292-patch.txt, jflex-analyzer-r561693-compatibility.txt
>
>
> JFlex (http://www.jflex.de/) can be used to generate a faster (up to several times) replacement for StandardAnalyzer. Will add a patch and a simple benchmark code in a while.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Updated: (LUCENE-966) A faster JFlex-based replacement for StandardAnalyzer

Posted by "Stanislaw Osinski (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stanislaw Osinski updated LUCENE-966:
-------------------------------------

    Attachment: jflex-analyzer-r562378-patch-nodup.txt

One more try -- this time without duplicated entries (no idea why Tortoise did this...).

> A faster JFlex-based replacement for StandardAnalyzer
> -----------------------------------------------------
>
>                 Key: LUCENE-966
>                 URL: https://issues.apache.org/jira/browse/LUCENE-966
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>            Reporter: Stanislaw Osinski
>             Fix For: 2.3
>
>         Attachments: AnalyzerBenchmark.java, jflex-analyzer-patch.txt, jflex-analyzer-r560135-patch.txt, jflex-analyzer-r561292-patch.txt, jflex-analyzer-r561693-compatibility.txt, jflex-analyzer-r562378-patch-nodup.txt, jflex-analyzer-r562378-patch.txt
>
>
> JFlex (http://www.jflex.de/) can be used to generate a faster (up to several times) replacement for StandardAnalyzer. Will add a patch and a simple benchmark code in a while.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-966) A faster JFlex-based replacement for StandardAnalyzer

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12517514 ] 

Michael McCandless commented on LUCENE-966:
-------------------------------------------

> The TestStandardAnalyzer.java patch still fails to apply -- looks like
> the same thing as before (the unicode chars under "Korean words" were
> replaced by ?'s somewhere along the way). 

WOOPS!  This was my bad.  I was clicking on the patch and then
copy/pasting the text out of the browser into a local file, and this
step (ugh) introduces the ?'s.  If I instead right-click on the link
and have browser directly save the file then it works perfectly!

Sorry for the confusion.

> A faster JFlex-based replacement for StandardAnalyzer
> -----------------------------------------------------
>
>                 Key: LUCENE-966
>                 URL: https://issues.apache.org/jira/browse/LUCENE-966
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>            Reporter: Stanislaw Osinski
>             Fix For: 2.3
>
>         Attachments: AnalyzerBenchmark.java, jflex-analyzer-patch.txt, jflex-analyzer-r560135-patch.txt, jflex-analyzer-r561292-patch.txt, jflex-analyzer-r561693-compatibility.txt, jflex-analyzer-r562378-patch.txt
>
>
> JFlex (http://www.jflex.de/) can be used to generate a faster (up to several times) replacement for StandardAnalyzer. Will add a patch and a simple benchmark code in a while.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-966) A faster JFlex-based replacement for StandardAnalyzer

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12518661 ] 

Michael McCandless commented on LUCENE-966:
-------------------------------------------

> > * I removed StandardAnalyzer.html "grammar doc" generation from
> > build.xml since it was using jjdoc. Stanislaw, is there something
> > > in jflex that can generated a BNF description of the grammar as
> > > HTML?

> I've had a look at the JFlex docs and there doesn't seem to be such
> a tool for JFlex, I'm afraid.

OK, I think we can just live without this.  Thanks!

> A faster JFlex-based replacement for StandardAnalyzer
> -----------------------------------------------------
>
>                 Key: LUCENE-966
>                 URL: https://issues.apache.org/jira/browse/LUCENE-966
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>            Reporter: Stanislaw Osinski
>             Fix For: 2.3
>
>         Attachments: AnalyzerBenchmark.java, jflex-analyzer-patch.txt, jflex-analyzer-r560135-patch.txt, jflex-analyzer-r561292-patch.txt, jflex-analyzer-r561693-compatibility.txt, jflex-analyzer-r562378-patch-nodup.txt, jflex-analyzer-r562378-patch.txt
>
>
> JFlex (http://www.jflex.de/) can be used to generate a faster (up to several times) replacement for StandardAnalyzer. Will add a patch and a simple benchmark code in a while.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-966) A faster JFlex-based replacement for StandardAnalyzer

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12517003 ] 

Michael McCandless commented on LUCENE-966:
-------------------------------------------

Oddly, the patch for TestStandardAnalyzer failed to apply for me (but
the rest did), so I manually merged those changes in.  Oh, I see: it
was the "Korean words" test -- somehow the characters got mapped to
?'s in your patch.  This is why the patch didn't apply, I think?
Maybe you used a diffing tool that wasn't happy with unicode or
something?

I also see the quality test failing in contrib benchmark.  I fear
something about the new StandardAnalyzer is in fact causing this test
to fail (it passes on a clean checkout).  That test uses
StandardAnalyzer.

KO I re-tested the old vs new StandardAnalyzer on Wikipedia and I
still found some differences, I think only on these very large
URL-like tokens.  Here's one:

  OLD
    (money.cnn.com,1382,1395,type=<HOST>)
    (magazines,1396,1405,type=<ALPHANUM>)
    (fortune,1406,1413,type=<ALPHANUM>)
    (fortune,1414,1421,type=<ALPHANUM>)
    (archive/2007/03/19/8402357,1422,1448,type=<NUM>)
    (index.htm,1449,1458,type=<HOST>)

  NEW
    (/money.cnn.com/magazines/fortune/fortune_archive/2007/03/19/8402357/index.htm,1381,1458,type=<NUM>)

I like the NEW behavior better but I fear we should try to match the
old one?


Here's another one:

  OLD
    (mid-20th,2436,2444,type=<NUM>)

  NEW
    (mid,2436,2439,type=<ALPHANUM>)
    (-20th,2439,2444,type=<NUM>)

I like the old behavior better here.

Another one:

  OLD
    (safari-0-sheikh,12011,12026,type=<NUM>)
    (zayed,12027,12032,type=<ALPHANUM>)
    (grand,12033,12038,type=<ALPHANUM>)
    (mosque.jpg,12039,12049,type=<HOST>)

  NEW
    (safari,12011,12017,type=<ALPHANUM>)
    (0-sheikh-zayed-grand-mosque.jpg,12018,12049,type=<NUM>)

Another one:

  OLD
    (semitica-01.png,616,631,type=<NUM>)

  NEW
    (-semitica-01.png,615,631,type=<NUM>)



> A faster JFlex-based replacement for StandardAnalyzer
> -----------------------------------------------------
>
>                 Key: LUCENE-966
>                 URL: https://issues.apache.org/jira/browse/LUCENE-966
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>            Reporter: Stanislaw Osinski
>             Fix For: 2.3
>
>         Attachments: AnalyzerBenchmark.java, jflex-analyzer-patch.txt, jflex-analyzer-r560135-patch.txt, jflex-analyzer-r561292-patch.txt, jflex-analyzer-r561693-compatibility.txt
>
>
> JFlex (http://www.jflex.de/) can be used to generate a faster (up to several times) replacement for StandardAnalyzer. Will add a patch and a simple benchmark code in a while.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-966) A faster JFlex-based replacement for StandardAnalyzer

Posted by "Mark Miller (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12516767 ] 

Mark Miller commented on LUCENE-966:
------------------------------------

I think it is important that we do not modify the StandardAnalyzer and that this improved version has results identical to the original. I believe that the StandardAnalyzer is pretty heavily used -- these performance increases could give many users huge gains. By changing this Analyzer at all we will be forcing users to effectively lose terms in their indexes if they want to upgrade core Lucene for use on an older index.

I think any enhanced StandardAnalzyer should probably be released as a new Analyzer. This one should mimic the old though, as the performance gains will be very beneficial to a lot of current users.

- Mark

> A faster JFlex-based replacement for StandardAnalyzer
> -----------------------------------------------------
>
>                 Key: LUCENE-966
>                 URL: https://issues.apache.org/jira/browse/LUCENE-966
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>            Reporter: Stanislaw Osinski
>             Fix For: 2.3
>
>         Attachments: AnalyzerBenchmark.java, jflex-analyzer-patch.txt, jflex-analyzer-r560135-patch.txt, jflex-analyzer-r561292-patch.txt
>
>
> JFlex (http://www.jflex.de/) can be used to generate a faster (up to several times) replacement for StandardAnalyzer. Will add a patch and a simple benchmark code in a while.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-966) A faster JFlex-based replacement for StandardAnalyzer

Posted by "Mark Miller (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12517358 ] 

Mark Miller commented on LUCENE-966:
------------------------------------

By the way...you can see one of the issues here:

OLD
    (money.cnn.com,1382,1395,type=<HOST>)
    (magazines,1396,1405,type=<ALPHANUM>)
    (fortune,1406,1413,type=<ALPHANUM>)
    (fortune,1414,1421,type=<ALPHANUM>)
    (archive/2007/03/19/8402357,1422,1448,type=<NUM>)
    (index.htm,1449,1458,type=<HOST>)

  NEW
    (/money.cnn.com/magazines/fortune/fortune_archive/2007/03/19/8402357/index.htm,1381,1458,type=<NUM>) 

JavaCC StandardAnalyzer would never output a token that starts with a '/'. It would be cut off. The issues may be involved with how JFlex skips characters that are not part of a match compared to how JavaCC is doing it. Or perhaps the JFlex version is considering '/' and '.' to be ALPHANUM's.

- Mark

> A faster JFlex-based replacement for StandardAnalyzer
> -----------------------------------------------------
>
>                 Key: LUCENE-966
>                 URL: https://issues.apache.org/jira/browse/LUCENE-966
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>            Reporter: Stanislaw Osinski
>             Fix For: 2.3
>
>         Attachments: AnalyzerBenchmark.java, jflex-analyzer-patch.txt, jflex-analyzer-r560135-patch.txt, jflex-analyzer-r561292-patch.txt, jflex-analyzer-r561693-compatibility.txt
>
>
> JFlex (http://www.jflex.de/) can be used to generate a faster (up to several times) replacement for StandardAnalyzer. Will add a patch and a simple benchmark code in a while.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-966) A faster JFlex-based replacement for StandardAnalyzer

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12517339 ] 

Grant Ingersoll commented on LUCENE-966:
----------------------------------------

+1 for deprecating StandardAnalyzer, although it is a little weird to have a StandardAnalyzer2, which is it the standard or not?  Maybe we could call it the DefaultAnalyzer or something like that.  I never much cared for tacking on numbers on the end of class names.  

I agree with Mike M's last comment and approach, though, if that proves to be the case for the differences.

> A faster JFlex-based replacement for StandardAnalyzer
> -----------------------------------------------------
>
>                 Key: LUCENE-966
>                 URL: https://issues.apache.org/jira/browse/LUCENE-966
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>            Reporter: Stanislaw Osinski
>             Fix For: 2.3
>
>         Attachments: AnalyzerBenchmark.java, jflex-analyzer-patch.txt, jflex-analyzer-r560135-patch.txt, jflex-analyzer-r561292-patch.txt, jflex-analyzer-r561693-compatibility.txt
>
>
> JFlex (http://www.jflex.de/) can be used to generate a faster (up to several times) replacement for StandardAnalyzer. Will add a patch and a simple benchmark code in a while.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org