Posted to dev@lucene.apache.org by "Robert Muir (JIRA)" <ji...@apache.org> on 2009/11/22 22:04:39 UTC

[jira] Created: (LUCENE-2090) convert automaton to char[] based processing and TermRef / TermsEnum api

convert automaton to char[] based processing and TermRef / TermsEnum api
------------------------------------------------------------------------

                 Key: LUCENE-2090
                 URL: https://issues.apache.org/jira/browse/LUCENE-2090
             Project: Lucene - Java
          Issue Type: Improvement
          Components: Search
            Reporter: Robert Muir
            Priority: Minor
             Fix For: 3.1


The automaton processing is currently done with String, mostly because TermEnum is based on String.
It is easy to change the processing to work with char[], since that is what is used behind the scenes anyway.

In general I think we should make sure char[]-based processing is exposed in the automaton package anyway, for things like pattern-based tokenizers and such.
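
For illustration only (a sketch of the direction, not the patch itself): the change amounts to making the array-based entry point the primary one and reducing the String method to a thin wrapper. The tables below stand in for RunAutomaton's internals.

{code}
/** Rough sketch only, not the actual automaton package code: a table-driven
 *  runner whose primary entry point is char[], with the String form reduced
 *  to a wrapper.  The tables below stand in for RunAutomaton's internals. */
class CharArrayRunAutomatonSketch {
  private final int initial;
  private final boolean[] accept;     // accept[state]
  private final int[][] transitions;  // transitions[state][c] = next state, or -1

  CharArrayRunAutomatonSketch(int initial, boolean[] accept, int[][] transitions) {
    this.initial = initial;
    this.accept = accept;
    this.transitions = transitions;
  }

  /** Primary path: match a slice of a char[] with no String allocation. */
  public boolean run(char[] s, int offset, int length) {
    int p = initial;
    for (int i = offset; i < offset + length; i++) {
      p = transitions[p][s[i]];
      if (p == -1) {
        return false;
      }
    }
    return accept[p];
  }

  /** Convenience wrapper for callers that still hold a String. */
  public boolean run(String s) {
    return run(s.toCharArray(), 0, s.length());
  }
}
{code}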



[jira] Commented: (LUCENE-2090) convert automaton to char[] based processing and TermRef / TermsEnum api

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782414#action_12782414 ] 

Robert Muir commented on LUCENE-2090:
-------------------------------------

bq. BTW, we've discussed someday having a codec whose terms dict (or maybe just terms index) is represented as an FST

This would open up more opportunities.

bq. Maybe also make TermRef final in the patch?

OK.


[jira] Issue Comment Edited: (LUCENE-2090) convert automaton to char[] based processing and TermRef / TermsEnum api

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782409#action_12782409 ] 

Michael McCandless edited comment on LUCENE-2090 at 11/25/09 1:06 PM:
----------------------------------------------------------------------

BTW, we've discussed someday having a codec whose terms dict (or maybe just terms index) is represented as an FST, at which point AutomatonTermsEnum would be an intersection + walk of two FSTs.  Because suffixes are also shared in the FST, you could more easily (more efficiently) handle *XXX cases as well (it'd just be symmetric with the XXX* cases).

      was (Author: mikemccand):
    BTW, we've discussed someday having a codec whose terms dict (or maybe just terms index) is represented as an FST, at which point AutomatonTermsEnum would be an intersection + walk of two FSTs.  Because suffix's are also shared in the FST, you could more easily (more efficiently) handle *XXX cases as well (it'd just be symmetic with the XXX* cases).
  

[jira] Commented: (LUCENE-2090) convert automaton to char[] based processing and TermRef / TermsEnum api

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781222#action_12781222 ] 

Michael McCandless commented on LUCENE-2090:
--------------------------------------------

Spinoff from LUCENE-1606.


[jira] Commented: (LUCENE-2090) convert automaton to char[] based processing and TermRef / TermsEnum api

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781249#action_12781249 ] 

Robert Muir commented on LUCENE-2090:
-------------------------------------

I changed only the accept(final TermRef term) method from Mike's flex patch of this enum to use char[] instead of String.
I did not modify the "smart" part; it's more complex, but doing so will probably help the ????NNN case.

The results change significantly for the *N case (I used my old benchmark, just because it was already set up in my Eclipse):
||Pattern||Iter||AvgHits||AvgMS (String)||AvgMS (char[])||
|N?N?N?N|10|1000.0|36.2|34.9|
|?NNNNNN|10|10.0|4.9|5.1|
|??NNNNN|10|100.0|8.0|11.5|
|???NNNN|10|1000.0|35.4|34.0|
|????NNN|10|10000.0|250.9|230.9|
|NN??NNN|10|100.0|9.1|5.0|
|NN?N*|10|10000.0|8.3|7.5|
|?NN*|10|100000.0|63.5|28.7|
|*N|10|1000000.0|3027.8|1922.7|
|NNNNN??|10|100.0|3.7|3.7|
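
For context, the shape of the change being measured above is roughly the following (a hedged sketch; the names and the plain String decoding are stand-ins, not the flex-branch API): decode the term's UTF-8 bytes into a reusable char[] scratch buffer and run the automaton on the array instead of materializing a String per term.

{code}
import java.nio.charset.StandardCharsets;

/** Illustrative sketch, not the flex-branch API: accept() decodes the term's
 *  UTF-8 bytes into a reusable char[] and tests the automaton on that array
 *  rather than building a String for every term. */
class CharArrayAcceptSketch {
  /** Minimal stand-in for the char[]-based automaton runner. */
  interface Dfa {
    boolean run(char[] s, int offset, int length);
  }

  private final Dfa dfa;
  private char[] scratch = new char[16];  // grown as needed, reused per term

  CharArrayAcceptSketch(Dfa dfa) {
    this.dfa = dfa;
  }

  public boolean accept(byte[] termBytes, int offset, int length) {
    // The real patch uses UnicodeUtil's UTF-8 -> UTF-16 conversion into a
    // reusable buffer; plain String decoding stands in for it here.
    String decoded = new String(termBytes, offset, length, StandardCharsets.UTF_8);
    int chars = decoded.length();
    if (scratch.length < chars) {
      scratch = new char[chars];
    }
    decoded.getChars(0, chars, scratch, 0);
    return dfa.run(scratch, 0, chars);
  }
}
{code}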


[jira] Updated: (LUCENE-2090) convert automaton to char[] based processing and TermRef / TermsEnum api

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-2090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-2090:
--------------------------------

    Attachment: LUCENE-2090_TermRef_flex3.patch

Mike, here TermRef is final also. This doesn't remove any flexibility, does it?
If the term dictionary is encoded in a different way (e.g. BOCU-1), will TermRef still be a UTF-8 byte[]?



[jira] Commented: (LUCENE-2090) convert automaton to char[] based processing and TermRef / TermsEnum api

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782412#action_12782412 ] 

Michael McCandless commented on LUCENE-2090:
--------------------------------------------

Maybe also make TermRef final in the patch?


[jira] Commented: (LUCENE-2090) convert automaton to char[] based processing and TermRef / TermsEnum api

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781718#action_12781718 ] 

Robert Muir commented on LUCENE-2090:
-------------------------------------

Right, we could use the constant suffix to stay with bytes.
For example, with *N in this test, 90% of the charset conversion of TermRefs disappears, because those terms can be eliminated by comparing bytes.



[jira] Commented: (LUCENE-2090) convert automaton to char[] based processing and TermRef / TermsEnum api

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782409#action_12782409 ] 

Michael McCandless commented on LUCENE-2090:
--------------------------------------------

BTW, we've discussed someday having a codec whose terms dict (or maybe just terms index) is represented as an FST, at which point AutomatonTermsEnum would be an intersection + walk of two FSTs.  Because suffixes are also shared in the FST, you could more easily (more efficiently) handle *XXX cases as well (it'd just be symmetric with the XXX* cases).


[jira] Commented: (LUCENE-2090) convert automaton to char[] based processing and TermRef / TermsEnum api

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12787175#action_12787175 ] 

Robert Muir commented on LUCENE-2090:
-------------------------------------

Mike, I converted this to the char[] API (see LUCENE-1606 for the patch).

In order for this to work, I needed to expose UnicodeUtil.nextUTF16ValidString(UTF16Result).
The code is not duplicated; the String-based method is just a wrapper for this one. Take a look if you get a chance.



[jira] Resolved: (LUCENE-2090) convert automaton to char[] based processing and TermRef / TermsEnum api

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-2090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir resolved LUCENE-2090.
---------------------------------

       Resolution: Fixed
    Fix Version/s:     (was: 3.1)
                   Flex Branch

I am marking this one resolved; the goals have been met (char[]/byte[]-based processing and the TermRef/TermsEnum API).



[jira] Commented: (LUCENE-2090) convert automaton to char[] based processing and TermRef / TermsEnum api

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781710#action_12781710 ] 

Michael McCandless commented on LUCENE-2090:
--------------------------------------------

That sounds compelling -- you'd still do the full scan, but testing each term is much faster?


[jira] Commented: (LUCENE-2090) convert automaton to char[] based processing and TermRef / TermsEnum api

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781694#action_12781694 ] 

Robert Muir commented on LUCENE-2090:
-------------------------------------

Hi Mike, I think an easier win is to perhaps add an endsWith() byte[] comparison to TermRef.
(For now, I can use regular endsWith(), or run the machine backwards, or something like that.)

I can use this in "dumb mode", i.e. *N, where I know the first part of the machine is a loop.
For whatever reason dumb mode computes the "constant prefix" right now, which is useless; it will always be empty in dumb mode.
Instead I should build the "constant suffix" in dumb mode; that would be much more useful for a quick comparison.
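
A minimal sketch of the idea (names illustrative, not the committed TermRef code): a raw byte-wise endsWith(), used in dumb mode as a cheap reject before any UTF-8-to-char[] conversion or DFA run.

{code}
import java.util.function.Predicate;

/** Hedged sketch of the idea above, not the committed TermRef code: a raw
 *  byte-wise suffix check, used in dumb mode as a cheap reject before any
 *  charset conversion or DFA evaluation. */
class ConstantSuffixSketch {
  /** True if term[offset..offset+length) ends with the given suffix bytes. */
  static boolean endsWith(byte[] term, int offset, int length, byte[] suffix) {
    if (suffix.length > length) {
      return false;
    }
    int start = offset + length - suffix.length;
    for (int i = 0; i < suffix.length; i++) {
      if (term[start + i] != suffix[i]) {
        return false;
      }
    }
    return true;
  }

  /** Dumb-mode accept(): for a pattern like *N, the constant suffix filters
   *  out most terms without ever converting their bytes to char[]. */
  static boolean accept(byte[] termUTF8, byte[] commonSuffixUTF8,
                        Predicate<byte[]> fullDfaCheck) {
    if (!endsWith(termUTF8, 0, termUTF8.length, commonSuffixUTF8)) {
      return false;                       // rejected on bytes alone
    }
    return fullDfaCheck.test(termUTF8);   // full char[] conversion + DFA run
  }
}
{code}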


[jira] Commented: (LUCENE-2090) convert automaton to char[] based processing and TermRef / TermsEnum api

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12787389#action_12787389 ] 

Michael McCandless commented on LUCENE-2090:
--------------------------------------------

bq. Mike, I converted this to char[] api

Nice!

bq. the other thing I forgot, I think TermRef.copy(UTF8Result) would be handy... is there anywhere you could use this too?

That sounds reasonable -- maybe just add it?  Or... we could also deprecate UTF8Result entirely, replacing it with TermRef...?  Hmmm.
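
For concreteness, a copy(UTF8Result)-style method would presumably amount to something like this (illustrative stand-in types; neither class is the real TermRef or UnicodeUtil.UTF8Result):

{code}
/** Illustrative only: Utf8Holder stands in for UnicodeUtil.UTF8Result and this
 *  class stands in for TermRef; neither is the real flex-branch type. */
class TermBytesSketch {
  static class Utf8Holder {
    byte[] result = new byte[16];   // encoded bytes
    int length;                     // number of valid bytes in result
  }

  byte[] bytes = new byte[0];
  int offset;
  int length;

  /** The kind of convenience discussed above: copy the encoded bytes straight
   *  from the UTF-8 holder, growing our own array as needed. */
  void copy(Utf8Holder utf8) {
    if (bytes.length < utf8.length) {
      bytes = new byte[utf8.length];
    }
    System.arraycopy(utf8.result, 0, bytes, 0, utf8.length);
    offset = 0;
    length = utf8.length;
  }
}
{code}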


[jira] Commented: (LUCENE-2090) convert automaton to char[] based processing and TermRef / TermsEnum api

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782482#action_12782482 ] 

Michael McCandless commented on LUCENE-2090:
--------------------------------------------

bq. I suppose we should also fix compareTerm here for UTF-16 ordering at some point?

Yes... I'm [slowly] working towards that.
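
For reference, the usual trick for comparing UTF-8 byte sequences in UTF-16 code-unit order looks roughly like this (a sketch of the general technique, assuming well-formed UTF-8; not necessarily the code that ends up committed): compare bytes unsigned, and when the differing bytes are both lead bytes at or above 0xEE, remap 0xEE/0xEF upward so that U+E000..U+FFFF sorts after supplementary characters.

{code}
/** Sketch of comparing UTF-8 byte sequences in UTF-16 code-unit order.
 *  Assumes well-formed UTF-8; not necessarily the code that gets committed. */
class Utf8InUtf16OrderSketch {
  static int compare(byte[] a, int aLen, byte[] b, int bLen) {
    int len = Math.min(aLen, bLen);
    for (int i = 0; i < len; i++) {
      int aByte = a[i] & 0xff;
      int bByte = b[i] & 0xff;
      if (aByte != bByte) {
        // Code-point order and UTF-16 order only disagree between the BMP
        // range U+E000..U+FFFF (lead bytes 0xEE/0xEF) and supplementary
        // characters (lead bytes 0xF0..0xF4): UTF-16 sorts the latter first,
        // because they are encoded as surrogate pairs.
        if (aByte >= 0xee && bByte >= 0xee) {
          if ((aByte & 0xfe) == 0xee) aByte += 0x10;
          if ((bByte & 0xfe) == 0xee) bByte += 0x10;
        }
        return aByte - bByte;
      }
    }
    return aLen - bLen;   // equal prefix: the shorter term sorts first
  }
}
{code}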


[jira] Commented: (LUCENE-2090) convert automaton to char[] based processing and TermRef / TermsEnum api

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782476#action_12782476 ] 

Michael McCandless commented on LUCENE-2090:
--------------------------------------------

bq. Mike, here TermRef is final also. This doesn't remove any flexibility does it?

I'd actually rather lock it down for now, and then only open up flexibility when/if we get there... patch looks good!


[jira] Commented: (LUCENE-2090) convert automaton to char[] based processing and TermRef / TermsEnum api

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782479#action_12782479 ] 

Robert Muir commented on LUCENE-2090:
-------------------------------------

bq. I'd actually rather lock it down for now, and then only open up flexibility when/if we get there... patch looks good!

OK, I will commit it.

Just as a side note (maybe I can add a comment if you need it): the existing startsWith(), and now the new endsWith(), are correct against byte[] for any Unicode encoding form.
However, some other encodings (including alternate encodings someone might flex to) do not have the non-overlap property, etc.

If someone were to implement a codec that stores the index in one of those other encodings, they would have to write significantly more complex code that is aware of character boundaries, depending upon the properties of said encoding.
Oh yeah, and their sort order would be different, too... (I suppose we should also fix compareTerm here for UTF-16 ordering at some point?)
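
A tiny illustration of the non-overlap property being relied on (plain JDK calls, not Lucene code):

{code}
import java.nio.charset.StandardCharsets;

/** Demonstration of the non-overlap point: in UTF-8, continuation bytes
 *  (10xxxxxx) and lead bytes occupy disjoint ranges, so a raw byte match for
 *  a prefix or suffix can never line up against the middle of a character. */
class Utf8BoundaryDemo {
  /** True if the byte at this position starts a character (or is ASCII). */
  static boolean isCharBoundary(byte b) {
    return (b & 0xC0) != 0x80;   // continuation bytes look like 10xxxxxx
  }

  public static void main(String[] args) {
    byte[] bytes = "héllo".getBytes(StandardCharsets.UTF_8);
    for (int i = 0; i < bytes.length; i++) {
      System.out.printf("byte %d: 0x%02X boundary=%b%n",
          i, bytes[i] & 0xFF, isCharBoundary(bytes[i]));
    }
  }
}
{code}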



[jira] Updated: (LUCENE-2090) convert automaton to char[] based processing and TermRef / TermsEnum api

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-2090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-2090:
--------------------------------

    Attachment: LUCENE-2090_TermRef_flex.patch

Attached is a patch to TermRef to implement endsWith().

This is a huge win on flex, even though the constant-suffix gain is very minor on trunk, because it avoids Unicode conversion to char[] for the worst cases, which must do lots of comparisons.

*N	1705.7ms avg -> 1195.4ms avg
*NNNNNN	1844.9ms avg -> 1192.3ms avg

Even if the suffix is short, it doesn't really matter: if there is a way in FilteredTermsEnum.accept() for a MultiTermQuery to accept/reject a term without Unicode conversion, it helps a lot.

In my opinion, this is the cleanest way to improve these cases. Other crazy ideas I have tossed around here, like the iterative "reader-like" conversion or even TermRef substring matching, will probably not gain much more over this, will be a lot more complex, and only apply to AutomatonQuery.

Mike, if you get a chance to review this, I'll commit it to the flex branch (the tests pass).



[jira] Commented: (LUCENE-2090) convert automaton to char[] based processing and TermRef / TermsEnum api

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782065#action_12782065 ] 

Robert Muir commented on LUCENE-2090:
-------------------------------------

Mike, I implemented this common suffix, but only for dumb mode; it does not help smart mode.
So I got rid of the common prefix entirely, as it's useless, and just replaced it.
I also take measures to ensure the suffix is well-formed UTF-8 :)

On my *N trunk tests it's now 5700/5800ms on average versus 6000ms, just using String.endsWith() before checking the DFA.
It's a consistent gain, so I think for really crappy worst-case wildcards and regular expressions,
we have a lot to gain by doing this with bytes, before converting to char[] and running against the DFA.

I guess since TermRef exposes all the bytes, I could implement endsWith myself in AutomatonTermsEnum in the future,
but it seems like it would be a nice complement to startsWith()?
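
One way to handle the well-formedness concern (a guess at the intent, not the actual patch): if the common suffix computed over chars happens to begin with an unpaired low surrogate, it cannot be encoded as valid UTF-8 on its own, so trim it before converting.

{code}
import java.nio.charset.StandardCharsets;

/** A guess at the "well-formed UTF-8" precaution above, not the actual patch:
 *  a common suffix computed over chars may begin in the middle of a surrogate
 *  pair, and a leading unpaired low surrogate cannot be encoded as UTF-8, so
 *  trim it before converting the suffix to bytes. */
class CommonSuffixBytesSketch {
  static byte[] suffixAsUtf8(String commonSuffix) {
    if (!commonSuffix.isEmpty() && Character.isLowSurrogate(commonSuffix.charAt(0))) {
      commonSuffix = commonSuffix.substring(1);   // drop the dangling half
    }
    return commonSuffix.getBytes(StandardCharsets.UTF_8);
  }
}
{code}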



[jira] Commented: (LUCENE-2090) convert automaton to char[] based processing and TermRef / TermsEnum api

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782401#action_12782401 ] 

Michael McCandless commented on LUCENE-2090:
--------------------------------------------

bq. shouldnt the JRE hoist this constant additive to the array index out anyway?

Maybe?

bq. alternative patch for if you do not trust your compiler

Thanks ;)


[jira] Issue Comment Edited: (LUCENE-2090) convert automaton to char[] based processing and TermRef / TermsEnum api

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12787175#action_12787175 ] 

Robert Muir edited comment on LUCENE-2090 at 12/7/09 10:34 PM:
---------------------------------------------------------------

Mike, I converted this to the char[] API (see LUCENE-1606 for the patch).

In order for this to work, I needed to expose UnicodeUtil.nextUTF16ValidString(UTF16Result).
The code is not duplicated; the String-based method is just a wrapper for this one. Take a look if you get a chance.

The other thing I forgot: I think TermRef.copy(UTF16Result) would be handy... is there anywhere you could use this too?

      was (Author: rcmuir):
    Mike, I converted this to char[] api (see LUCENE-1606 for the patch).

In order for this to work, I needed to expose UnicodeUtil.nextUTF16ValidString(UTF16Result). 
The code is not duplicated, the String based method is just a wrapper for this, take a look if you get a chance.

  

[jira] Commented: (LUCENE-2090) convert automaton to char[] based processing and TermRef / TermsEnum api

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782495#action_12782495 ] 

Robert Muir commented on LUCENE-2090:
-------------------------------------

bq. Yes... I'm [slowly] working towards that.

Glad it is you working on it instead of me. If I wrote it, it would be very slow.

Committed revision 884190 for TermRef



[jira] Commented: (LUCENE-2090) convert automaton to char[] based processing and TermRef / TermsEnum api

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782565#action_12782565 ] 

Michael McCandless commented on LUCENE-2090:
--------------------------------------------

OK I'll first focus on making sure DW flushes in UTF-16 sort order...


[jira] Commented: (LUCENE-2090) convert automaton to char[] based processing and TermRef / TermsEnum api

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781365#action_12781365 ] 

Robert Muir commented on LUCENE-2090:
-------------------------------------

Michael, I think I would have to profile things to determine this?
I guess it would be a close one, because strings in the term dictionary are pretty short.
Just an idea; I think moving all the code to char[] first would be the best place to start.


[jira] Issue Comment Edited: (LUCENE-2090) convert automaton to char[] based processing and TermRef / TermsEnum api

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12787175#action_12787175 ] 

Robert Muir edited comment on LUCENE-2090 at 12/7/09 10:35 PM:
---------------------------------------------------------------

Mike, I converted this to the char[] API (see LUCENE-1606 for the patch).

In order for this to work, I needed to expose UnicodeUtil.nextUTF16ValidString(UTF16Result).
The code is not duplicated; the String-based method is just a wrapper for this one. Take a look if you get a chance.

The other thing I forgot: I think TermRef.copy(UTF8Result) would be handy... is there anywhere you could use this too?

edit: I meant UTF8Result, sorry

      was (Author: rcmuir):
    Mike, I converted this to char[] api (see LUCENE-1606 for the patch).

In order for this to work, I needed to expose UnicodeUtil.nextUTF16ValidString(UTF16Result). 
The code is not duplicated, the String based method is just a wrapper for this, take a look if you get a chance.

the other thing I forgot, I think TermRef.copy(UTF16Result) would be handy... is there anywhere you could use this too?
  

[jira] Commented: (LUCENE-2090) convert automaton to char[] based processing and TermRef / TermsEnum api

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781233#action_12781233 ] 

Robert Muir commented on LUCENE-2090:
-------------------------------------

Michael, here is one idea that isn't too crazy.

Separately, I think we should make it convenient for an MTQ to get a char[]; that should not change.

However, let's consider this:
{code}
  /**
   * Returns true if the given string is accepted by this automaton.
   */
  public boolean run(String s) {
    int p = initial;
    int l = s.length();
    for (int i = 0; i < l; i++) {
      p = step(p, s.charAt(i));
      if (p == -1) return false;
    }
    return accept[p];
  }
{code}

Checking a string is really just stepping through it one char at a time.
Would incremental, one-char-at-a-time conversion actually help, or do you think it would just be slower?

Conceptually, this isn't that much different from using a Reader with java.io, at a much smaller scale.
I am not familiar with decoding performance, but I thought I would mention this, just in case there is a way to do it cleanly.
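
Taking the Reader analogy literally, an incremental version might look roughly like this (illustrative only; the Dfa interface and the java.io decoding are stand-ins, and a real version would use a UnicodeUtil-style decoder rather than streams): decode one char at a time and bail out at the first dead state, so rejected terms never pay for a full conversion.

{code}
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

/** Illustrative sketch of the "incremental, one char at a time" idea: decode
 *  UTF-8 lazily and step the automaton per char, stopping at the first dead
 *  state.  The Dfa interface stands in for RunAutomaton; real code would use
 *  a UnicodeUtil-style decoder instead of java.io streams. */
class IncrementalRunSketch {
  interface Dfa {
    int initialState();
    int step(int state, char c);     // -1 means dead state
    boolean isAccept(int state);
  }

  static boolean run(Dfa dfa, byte[] utf8, int offset, int length) throws IOException {
    Reader reader = new InputStreamReader(
        new ByteArrayInputStream(utf8, offset, length), StandardCharsets.UTF_8);
    int state = dfa.initialState();
    int c;
    while ((c = reader.read()) != -1) {
      state = dfa.step(state, (char) c);
      if (state == -1) {
        return false;                // dead state: no need to decode the rest
      }
    }
    return dfa.isAccept(state);
  }
}
{code}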



[jira] Commented: (LUCENE-2090) convert automaton to char[] based processing and TermRef / TermsEnum api

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782075#action_12782075 ] 

Robert Muir commented on LUCENE-2090:
-------------------------------------

I guess now you have me starting to think about a byte[] contains().
Because really the worst case, which I bet a lot of users hit, is not something like *foobar but instead *foobar*!
In UTF-8 you can do such things safely; I would have to pull the "longest common constant sequence" out of the DFA.
This might be more generally applicable.

commonSuffix is easy... at least it makes progress for now, even slightly later in trunk.

This could be a later improvement.
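
A naive sketch of the byte[] contains() being floated (illustrative; quadratic, and a real version might want something smarter), safe on UTF-8 for the same non-overlap reasons as the startsWith/endsWith checks:

{code}
/** Naive sketch of a byte[] contains() for *foobar*-style patterns; quadratic
 *  but simple.  Safe on UTF-8 for the same non-overlap reasons as the
 *  startsWith/endsWith byte checks. */
class BytesContainsSketch {
  static boolean contains(byte[] term, int offset, int length, byte[] needle) {
    if (needle.length == 0) return true;
    for (int start = offset; start + needle.length <= offset + length; start++) {
      int i = 0;
      while (i < needle.length && term[start + i] == needle[i]) {
        i++;
      }
      if (i == needle.length) return true;
    }
    return false;
  }
}
{code}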



[jira] Updated: (LUCENE-2090) convert automaton to char[] based processing and TermRef / TermsEnum api

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-2090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-2090:
--------------------------------

    Attachment: LUCENE-2090_TermRef_flex2.patch

Alternative patch for if you do not trust your compiler :)
I think they do the same thing though...


[jira] Commented: (LUCENE-2090) convert automaton to char[] based processing and TermRef / TermsEnum api

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781344#action_12781344 ] 

Michael McCandless commented on LUCENE-2090:
--------------------------------------------

bq. would 'incremental, one char at a time' conversion actually help, or do you think it would just be slower?

I like this idea!  Is it worth exploring a Reader-like interface from UnicodeUtil?  Is this a hotspot in automaton's processing?  I.e., could we save much conversion by only doing it on demand?


[jira] Commented: (LUCENE-2090) convert automaton to char[] based processing and TermRef / TermsEnum api

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782503#action_12782503 ] 

Robert Muir commented on LUCENE-2090:
-------------------------------------

Mike, by the way, looking at this code, I don't see a way to expose the UnicodeUtil / char[] functionality cleanly via TermRef/FilteredTermsEnum.

Since most of the other enums survive with TermRef alone and don't need it, and it's handy to have multiple TermRefs around in the same enum,
it doesn't make sense, I guess.

Also, I guess people in general aren't writing MultiTermQueries every day, so I think this is OK?
The rest of this issue should only involve the automaton code itself...



[jira] Commented: (LUCENE-2090) convert automaton to char[] based processing and TermRef / TermsEnum api

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782353#action_12782353 ] 

Michael McCandless commented on LUCENE-2090:
--------------------------------------------

Patch looks good, except I think I wouldn't factor startsWith/endsWith to share any code, to save the "+ pos" inside startsWith's loop?

{quote}
*N	1705.7ms avg -> 1195.4ms avg
*NNNNNN	1844.9ms avg -> 1192.3ms avg
{quote}

Whoa -- those are great results!


[jira] Commented: (LUCENE-2090) convert automaton to char[] based processing and TermRef / TermsEnum api

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782384#action_12782384 ] 

Robert Muir commented on LUCENE-2090:
-------------------------------------

bq. Patch looks good, except, I think I wouldn't factor startsWith/endsWith to share any code, to save the "+ pos" inside startsWith's loop? 

Forgive my ignorance, but shouldn't the JRE hoist this constant additive to the array index out of the loop anyway?
I checked, and this is how Harmony, etc. implement startsWith/endsWith even for String...
(I will change it, just curious.)

