Posted to dev@lucene.apache.org by "Robert Muir (JIRA)" <ji...@apache.org> on 2010/12/28 06:04:45 UTC

[jira] Created: (LUCENE-2836) FieldCache rewrite method for MultiTermQueries

FieldCache rewrite method for MultiTermQueries
----------------------------------------------

                 Key: LUCENE-2836
                 URL: https://issues.apache.org/jira/browse/LUCENE-2836
             Project: Lucene - Java
          Issue Type: New Feature
            Reporter: Robert Muir
             Fix For: 4.0


For some MultiTermQueries, like RangeQuery, we have a FieldCacheRangeFilter etc. (in this case it's particularly optimized).

But in the general case, since LUCENE-2784 we can now have a rewrite method to rewrite any MultiTermQuery 
using the FieldCache, because MultiTermQuery's getEnum no longer takes IndexReader but Terms, and all the 
FilteredTermsEnums are now just real TermsEnum decorators.

In low-frequency cases this is actually slower (I think this has been shown for numeric ranges before too),
but for the really high-frequency cases, like especially ugly wildcards, regexes, fuzzies, etc., it can be several times faster
to use the FieldCache instead, since all the terms are in RAM and the automaton can blast through them more quickly.
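
A minimal sketch of the idea described above, written in plain Java rather than against the Lucene API. The class, method, and parameter names are all hypothetical: `sortedTerms` stands in for the field's terms held in RAM, `docToOrd` for the per-document term ordinal, and `termMatcher` for the automaton of the MultiTermQuery. It assumes one value per document.

```java
import java.util.BitSet;
import java.util.function.Predicate;
import java.util.regex.Pattern;

/** Illustration only; none of these names are Lucene APIs. */
final class FieldCacheRewriteSketch {

    /**
     * @param sortedTerms every distinct term of the field, held in RAM
     * @param docToOrd    docToOrd[doc] is the index into sortedTerms of that document's single value, or -1 if it has none
     * @param termMatcher stand-in for the automaton that accepts or rejects a term
     */
    static BitSet matchingDocs(String[] sortedTerms, int[] docToOrd, Predicate<String> termMatcher) {
        // Pass 1: run the matcher over the in-memory terms and record the accepted ordinals.
        BitSet acceptedOrds = new BitSet(sortedTerms.length);
        for (int ord = 0; ord < sortedTerms.length; ord++) {
            if (termMatcher.test(sortedTerms[ord])) {
                acceptedOrds.set(ord);
            }
        }
        // Pass 2: a document matches if the ordinal of its value was accepted.
        BitSet docs = new BitSet(docToOrd.length);
        for (int doc = 0; doc < docToOrd.length; doc++) {
            int ord = docToOrd[doc];
            if (ord >= 0 && acceptedOrds.get(ord)) {
                docs.set(doc);
            }
        }
        return docs;
    }

    public static void main(String[] args) {
        String[] terms = {"1000042", "1234567", "7654321"};
        int[] docToOrd = {1, 0, 2, -1, 1};
        Pattern endsWithSeven = Pattern.compile(".*7"); // plays the role of the wildcard *7
        Predicate<String> matcher = s -> endsWithSeven.matcher(s).matches();
        System.out.println(matchingDocs(terms, docToOrd, matcher)); // prints {0, 4}
    }
}
```

The second pass always touches every document, which is consistent with this approach being slower for queries that match only a few terms.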


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] Updated: (LUCENE-2836) FieldCache rewrite method for MultiTermQueries

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-2836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-2836:
--------------------------------

    Attachment: LUCENE-2836.patch

Here's the patch. I don't think we really need the *Wrapper class, nor does it need to be in core (this could go in contrib or something instead).





[jira] Commented: (LUCENE-2836) FieldCache rewrite method for MultiTermQueries

Posted by "Uwe Schindler (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12975477#action_12975477 ] 

Uwe Schindler commented on LUCENE-2836:
---------------------------------------

Hah, cool!

The question is, does it really work correctly with multivalued fields? I'd have to look at the TermsIndex again, but doesn't the method fcsi.getOrd(doc) return only the term ord of the first term found in the index for that document? For numeric queries on single-valued fields that's fine, but for wildcards on analyzed fields? Maybe I'm missing something, but I'm not sure it works correctly...

Robert: Help me please :-) *g*



[jira] Commented: (LUCENE-2836) FieldCache rewrite method for MultiTermQueries

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12975496#action_12975496 ] 

Robert Muir commented on LUCENE-2836:
-------------------------------------

> The question is, does it really work correctly with multivalued fields?

Of course not; it's no different from any of the other FieldCache*Filter stuff we have now,
except that stuff is an awful lot more code... do we really need all those specializations in FieldCacheRangeFilter?
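
To make the single-valued limitation concrete, here is a small plain-Java sketch. It is not Lucene code, and every name in it is hypothetical. It assumes, purely for illustration, that when a document has several values the last one written overwrites the earlier ones; which value a real cache keeps is not established in this thread.

```java
import java.util.Arrays;
import java.util.List;

/** Illustration only; none of these names are Lucene APIs. */
final class SingleOrdLimitationSketch {

    /** Builds one ordinal per document; a later value overwrites an earlier one (an assumption of this sketch). */
    static int[] buildDocToOrd(List<String> sortedTerms, List<List<String>> docValues) {
        int[] docToOrd = new int[docValues.size()];
        Arrays.fill(docToOrd, -1);
        for (int doc = 0; doc < docValues.size(); doc++) {
            for (String value : docValues.get(doc)) {
                // There is room for only one ordinal per document.
                docToOrd[doc] = sortedTerms.indexOf(value);
            }
        }
        return docToOrd;
    }

    /** A prefix query evaluated against the single retained ordinal. */
    static boolean docMatches(List<String> sortedTerms, int[] docToOrd, int doc, String prefix) {
        int ord = docToOrd[doc];
        return ord >= 0 && sortedTerms.get(ord).startsWith(prefix);
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("apple", "banana");
        // Document 0 is multivalued: it contains both "apple" and "banana".
        List<List<String>> docs = Arrays.asList(Arrays.asList("apple", "banana"), Arrays.asList("apple"));
        int[] docToOrd = buildDocToOrd(terms, docs);
        System.out.println(docMatches(terms, docToOrd, 0, "app")); // false: "apple" was overwritten by "banana"
        System.out.println(docMatches(terms, docToOrd, 1, "app")); // true
    }
}
```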




[jira] Commented: (LUCENE-2836) FieldCache rewrite method for MultiTermQueries

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12975421#action_12975421 ] 

Robert Muir commented on LUCENE-2836:
-------------------------------------

Here are some results from my silly wildcard benchmarker (I think luceneutil doesn't yet have a keyword title or similar field for this):

(using 10M docs with a single-valued numeric field, so 10M terms too)

In general it's a stupid rewrite method, unless your users are typing in truly horrific queries, and then it's better.

||Pattern||No. of matching docs||Avg. ms (filter)||Avg. ms (FieldCache)||
|N?N?N?N|1000|35.9|52.5|
|?NNNNNN|10|3.1|44.2|
|??NNNNN|100|5.5|45.6|
|???NNNN|1000|44.7|48.5|
|????NNN|10000|141.8|67.9|
|NN??NNN|100|3.6|41.5|
|NN?N\*|10000|5.3|42.7|
|?NN\*|100000|25.9|50.8|
|\*N|1000000|1639.2|446.8|
|\*N\*|5217031|2089.4|701.2|
|\*NN\*|590040|1811.6|674.8|
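
The relative speed-ups implied by the table can be worked out with a few lines of Java. The numbers below are copied from the table; the class and method names are only for this sketch. A ratio above 1 means the FieldCache rewrite was faster for that pattern.

```java
import java.util.Locale;

/** Computes filter-time / FieldCache-time for each row of the table above. */
final class WildcardBenchmarkRatios {

    static double speedup(double filterMs, double fieldCacheMs) {
        return filterMs / fieldCacheMs;
    }

    public static void main(String[] args) {
        String[] patterns = {
            "N?N?N?N", "?NNNNNN", "??NNNNN", "???NNNN", "????NNN", "NN??NNN",
            "NN?N*", "?NN*", "*N", "*N*", "*NN*"
        };
        // Each pair is {avg ms with the filter rewrite, avg ms with the FieldCache rewrite}.
        double[][] avgMs = {
            {35.9, 52.5}, {3.1, 44.2}, {5.5, 45.6}, {44.7, 48.5}, {141.8, 67.9}, {3.6, 41.5},
            {5.3, 42.7}, {25.9, 50.8}, {1639.2, 446.8}, {2089.4, 701.2}, {1811.6, 674.8}
        };
        for (int i = 0; i < patterns.length; i++) {
            double ratio = speedup(avgMs[i][0], avgMs[i][1]);
            String winner = ratio > 1.0 ? "FieldCache faster" : "filter faster";
            System.out.println(String.format(Locale.ROOT, "%-8s %6.2fx  %s", patterns[i], ratio, winner));
        }
    }
}
```

On these numbers the FieldCache rewrite wins only for ????NNN, \*N, \*N\* and \*NN\*, by roughly 2x to 3.7x, and loses by up to about 14x on the most selective patterns.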




[jira] Resolved: (LUCENE-2836) FieldCache rewrite method for MultiTermQueries

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-2836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir resolved LUCENE-2836.
---------------------------------

    Resolution: Fixed
      Assignee: Robert Muir

Committed revision 1055130.



[jira] Updated: (LUCENE-2836) FieldCache rewrite method for MultiTermQueries

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-2836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-2836:
--------------------------------

    Attachment: LUCENE-2836.patch

Here's the patch for contrib... I think it's ready to commit.

I also added some basic testing of the seek() in the DocTermsIndex's TermsEnum.




[jira] Commented: (LUCENE-2836) FieldCache rewrite method for MultiTermQueries

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12976707#action_12976707 ] 

Michael McCandless commented on LUCENE-2836:
--------------------------------------------

This is a great speedup for the hard wildcard queries!

I think we should commit it, but javadoc the limitations (e.g. single-valued fields only).

I'll add a "whole title" field to luceneutil so we can more naturally test wildcards...



[jira] Commented: (LUCENE-2836) FieldCache rewrite method for MultiTermQueries

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12976729#action_12976729 ] 

Robert Muir commented on LUCENE-2836:
-------------------------------------

OK, I'll work on getting it into contrib. 

I think it's best to put it there because it's generally slower (only faster in certain circumstances),
and at the moment the app has to supply the 'query planning logic' to make good use of it.
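
As a rough idea of what such application-side planning logic might look like, here is a hypothetical heuristic in plain Java. Nothing in it is a Lucene API, and the rule itself is an assumption loosely motivated by the wildcard benchmark in this thread (a leading `*`, or a long run of leading `?`, favoured the FieldCache). The threshold is a parameter that would need tuning against real data.

```java
/** Hypothetical planner; the rule and all names are assumptions of this sketch. */
final class RewritePlannerSketch {

    enum Rewrite { DEFAULT_FILTER, FIELD_CACHE }

    /**
     * @param wildcardPattern         pattern using '?' (one character) and '*' (any run of characters)
     * @param singleValuedField       the FieldCache approach only makes sense for single-valued fields
     * @param minLeadingQuestionMarks number of leading '?' from which the FieldCache is preferred
     */
    static Rewrite choose(String wildcardPattern, boolean singleValuedField, int minLeadingQuestionMarks) {
        if (!singleValuedField) {
            return Rewrite.DEFAULT_FILTER;
        }
        int leadingQuestionMarks = 0;
        for (int i = 0; i < wildcardPattern.length(); i++) {
            char c = wildcardPattern.charAt(i);
            if (c == '*') {
                // A '*' before any literal character means the prefix does not narrow the terms at all.
                return Rewrite.FIELD_CACHE;
            } else if (c == '?') {
                leadingQuestionMarks++;
            } else {
                break; // first literal character ends the leading wildcard run
            }
        }
        return leadingQuestionMarks >= minLeadingQuestionMarks ? Rewrite.FIELD_CACHE : Rewrite.DEFAULT_FILTER;
    }

    public static void main(String[] args) {
        System.out.println(choose("*7", true, 4));      // FIELD_CACHE
        System.out.println(choose("????123", true, 4)); // FIELD_CACHE
        System.out.println(choose("12?4*", true, 4));   // DEFAULT_FILTER
        System.out.println(choose("*7", false, 4));     // DEFAULT_FILTER
    }
}
```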

