Posted to dev@lucene.apache.org by "Mark Harwood (JIRA)" <ji...@apache.org> on 2009/10/21 01:10:01 UTC

[jira] Created: (LUCENE-1999) Match spotter for all query types

Match spotter for all query types
---------------------------------

                 Key: LUCENE-1999
                 URL: https://issues.apache.org/jira/browse/LUCENE-1999
             Project: Lucene - Java
          Issue Type: New Feature
    Affects Versions: 2.9
            Reporter: Mark Harwood
         Attachments: matchflagger.patch

Related to LUCENE-1929 and the current inability to highlight NumericRangeQuery, spatial, cached term filters and other exotica.

This patch provides the ability to wrap *any* Query objects and record match info as flags encoded in the overall document score.
Using this approach it would be possible to understand (and therefore highlight) which fields matched clauses in a query.

The match encoding approach loses some precision in scores as noted here: http://tinyurl.com/ykt8nx7
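
To illustrate why packing flags into the score costs precision, here is a minimal sketch that assumes the flags overwrite the low-order mantissa bits of the float score. The class name MatchFlagCodec, the 8-bit width and the bit layout are assumptions made for this example; the attached patch may encode the flags differently.

```java
/** Hypothetical codec (not taken from the patch): packs match flags into a float score. */
final class MatchFlagCodec {
    // Assumption: 8 flag bits are stored in the lowest mantissa bits of the score.
    static final int FLAG_BITS = 8;
    static final int FLAG_MASK = (1 << FLAG_BITS) - 1;

    /** Overwrites the low mantissa bits of the score with the given flags. */
    static float encode(float score, int flags) {
        int bits = Float.floatToRawIntBits(score);
        return Float.intBitsToFloat((bits & ~FLAG_MASK) | (flags & FLAG_MASK));
    }

    /** Reads the flags back out of an encoded score. */
    static int decodeFlags(float encoded) {
        return Float.floatToRawIntBits(encoded) & FLAG_MASK;
    }

    /** True if the clause with the given index (0..7) was flagged as matching. */
    static boolean clauseMatched(float encoded, int clauseIndex) {
        return (decodeFlags(encoded) & (1 << clauseIndex)) != 0;
    }
}
```

With 8 of a float's 23 mantissa bits given over to flags, the score keeps roughly 15 bits of precision, so documents whose scores differ only in those low bits can no longer be ranked apart.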

Avoiding these precision issues would require a change to Lucene core to record docId, score AND a matchFlag byte in ScoreDoc objects and collector APIs.
This may be something we should consider.
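
The shape of such a change might look roughly like the sketch below. It is a standalone illustration, not Lucene's ScoreDoc: the class name FlaggedScoreDoc and the one-bit-per-clause layout are assumptions, and the collector API changes are not shown.

```java
/** Hypothetical result record: the doc/score pair of a ScoreDoc plus a flag byte. */
final class FlaggedScoreDoc {
    final int doc;
    final float score;      // left untouched, so no precision is lost
    final byte matchFlags;  // assumption: bit i set means clause i matched (max 8 clauses)

    FlaggedScoreDoc(int doc, float score, byte matchFlags) {
        this.doc = doc;
        this.score = score;
        this.matchFlags = matchFlags;
    }

    /** True if the clause with the given index (0..7) matched this document. */
    boolean clauseMatched(int clauseIndex) {
        return (matchFlags & (1 << clauseIndex)) != 0;
    }
}
```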


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1999) Match spotter for all query types

Posted by "Mark Harwood (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12768257#action_12768257 ] 

Mark Harwood commented on LUCENE-1999:
--------------------------------------

bq. and 2) you need it for every single doc visited by the query

Actually I don't need it for every doc, only the top ones. It just happens to be so cheap to produce that I can afford to run this in-line with the query. (I haven't actually benchmarked it at scale, but my gut feel is that it would be fast.)

I was thinking that this might be orthogonal to the existing "free-text" based highlighter. The reasoning is roughly as follows:

1) Highlighting of free-text fields is reasonably well-catered for with summarisation etc.
2) The remaining problem areas for highlighting (NumericRangeQuery, spatial, cached term filters on enums, e.g. gender:male/female) are all likely to be non-free-text fields which don't require summarisation and only contain a single value.

I may be wrong in these assumptions about the existing state of play (any thoughts, Mark M?) but it might be useful to think of attacking the problem with these 2 different requirements in mind.

Regardless of type (e.g. int, long etc.), I tend to think of fields as falling into these broad usage categories:

a) Identifiers (e.g. primary keys)
b) Quantifiers (e.g. numerics, dates, spatial)
c) Free-text
d) Controlled vocabularies (e.g. enums such as gender:m/f)

Type a) is catered for with a straight TermQuery and can therefore be handled with the existing highlighter.
Type b) needs special indexes/queries (spatial/trie) and isn't catered for by the existing term/span-based Highlighter.
Type c) is catered for with the existing highlighter and its summarising features.
Type d) involves many TermDocs.next() reads, so it is usefully cached as filters and is therefore not catered for by the existing Highlighter.

So this patch helps cater for types b) and d) where simply knowing the field matched is all that is required to highlight.
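
For types b) and d), turning recorded flags into something a UI can highlight could be as simple as mapping bit positions back to field names. A minimal sketch, assuming each wrapped clause is assigned a bit index in the order the fields are listed (the class, method and field names are hypothetical):

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: maps match-flag bits back to the fields they stand for. */
final class MatchedFields {
    /**
     * Returns the fields whose bit is set in flags.
     * Assumption: bit i corresponds to clauseFields[i].
     */
    static List<String> matchedFields(String[] clauseFields, int flags) {
        List<String> matched = new ArrayList<>();
        for (int i = 0; i < clauseFields.length; i++) {
            if ((flags & (1 << i)) != 0) {
                matched.add(clauseFields[i]);
            }
        }
        return matched;
    }
}
```

For a search form with the fields price, location and gender, flags of binary 101 would report price and gender as the fields to highlight.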



[jira] Updated: (LUCENE-1999) Match spotter for all query types

Posted by "Mark Harwood (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-1999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Harwood updated LUCENE-1999:
---------------------------------

    Attachment: matchflagger.patch


[jira] Commented: (LUCENE-1999) Match spotter for all query types

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12768163#action_12768163 ] 

Michael McCandless commented on LUCENE-1999:
--------------------------------------------

Very clever!

Since you are wrapping arbitrary query objs, couldn't the wrapper make a separate data structure for tracking which clause matched (instead of encoding it into the score)?

Also: doesn't highlighter run, separately, on each doc?  And so it's OK if the scores are affected?  I.e., I would run my main search with a normal query, get the 10 results for the current page, then step through each of those 10 doc IDs, make a single-doc IndexSearcher, and run this wrapper?

{quote}
Avoiding these precision issues would require a change to Lucene core to record docId, score AND a matchFlag byte in ScoreDoc objects and collector APIs.
This may be something we should consider.
{quote}

+1  I would love to see the Scorer API extended to optionally provide details on matches.  Not just which clause matched which docs/fields, but the positions within the field where the match occurred.  I think we could do this by absorbing *SpanQuery into their normal Query counterparts, making the getSpans API [somehow] optional so that if you didn't invoke it you don't pay a performance price.


[jira] Commented: (LUCENE-1999) Match spotter for all query types

Posted by "Mark Harwood (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12768173#action_12768173 ] 

Mark Harwood commented on LUCENE-1999:
--------------------------------------

bq. couldn't the wrapper make a separate data structure for tracking which clause matched 

I was trying to keep the processing cost super-low with no object allocations because this is in a very tight loop. We don't really want to be generating a lot of state/processing while we're still evaluating potentially millions of candidate matches.
That seems to be the challenge of doing this instrumentation in-line with the query execution.
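
For comparison, a separate structure does not have to allocate per match. The sketch below is purely an illustration of that trade-off and is not part of the patch: it assumes a hypothetical MatchFlagTable holding one flag byte per document, allocated once per search, so recording a match in the scoring loop is a single array write.

```java
/** Hypothetical side-table: one flag byte per document, allocated once per search. */
final class MatchFlagTable {
    private final byte[] flagsByDoc;

    /** maxDoc is the number of documents the searcher can visit. */
    MatchFlagTable(int maxDoc) {
        flagsByDoc = new byte[maxDoc];
    }

    /** Called from the scoring loop: a single array write, no object allocation. */
    void record(int doc, int clauseIndex) {
        flagsByDoc[doc] |= (byte) (1 << clauseIndex);
    }

    /** True if the clause with the given index (0..7) was recorded for this doc. */
    boolean clauseMatched(int doc, int clauseIndex) {
        return (flagsByDoc[doc] & (1 << clauseIndex)) != 0;
    }
}
```

The cost moves from score precision to memory: one byte per document in the index, written for every candidate match. Whether that is acceptable would need benchmarking.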

bq. Also: doesn't highlighter run, separately, on each doc? And so it's OK if the scores are affected?

The use case I'm tackling right now involves search forms with lots of optional fields (spatial, numeric, "choice" etc) and I only needed a yes/no match flag for each field. This approach should give me these answers back immediately without impacting query processing speeds significantly. 
However, I can see the value in core Lucene capturing a richer data structure than a simple boolean where you choose to do a separate "highlight" pass on the top N documents. This would suggest that you might need 2 query expressions - one for execution and one for adding highlighter instrumentation. I suppose the client could add the instrumentation requests to the initial query, which are passive during a Lucene "results-selection" mode and become active in "highlight mode".




[jira] Commented: (LUCENE-1999) Match spotter for all query types

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12768191#action_12768191 ] 

Michael McCandless commented on LUCENE-1999:
--------------------------------------------

I see, it sounds like your use case is different from the typical
highlighting use case in that 1) you don't need the positions of the
matches (just whether a given clause matched the doc or not), and 2)
you need it for every single doc visited by the query, not just for
the handful of docs that are being presented to the user on the
current "page".

bq. This would suggest that you might need 2 query expressions - one for execution and one for adding highlighter instrumentation.

I'm thinking it's the same query, but we fix the Scorer API for all
queries (= big change!!) to be able to produce match details on
demand, where those match details look something like what getSpans
now returns.  But for the normal case (only highlighting the docs
being shown on current page), we'd only get the match details for that
small set of docs.

Then we ideally would not need a separate mirrored set of span
queries.  I.e., SpanTermQuery would be absorbed into TermQuery, etc.

But I could easily be being too naive here :) Maybe there is some
serious performance cost to even adding the optional API in.
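
As a toy illustration of the "pay only if you ask" idea, here is a sketch in which match positions are read only when a caller requests them. It does not use Lucene's Scorer or getSpans; the class OnDemandMatchScorer, its methods and the in-memory "postings" array are all invented for this example.

```java
import java.util.Arrays;

/** Hypothetical toy scorer: match positions are only produced on request. */
final class OnDemandMatchScorer {
    // Toy stand-in for postings with positions: positionsByDoc[doc] = match positions.
    private final int[][] positionsByDoc;
    private int detailRequests = 0;

    OnDemandMatchScorer(int[][] positionsByDoc) {
        this.positionsByDoc = positionsByDoc;
    }

    /** Normal scoring path: uses the match count only, returns no positions. */
    float score(int doc) {
        return positionsByDoc[doc].length;
    }

    /** Optional path, intended for the few docs shown on the current page. */
    int[] matchPositions(int doc) {
        detailRequests++;
        return Arrays.copyOf(positionsByDoc[doc], positionsByDoc[doc].length);
    }

    /** How many times match details were actually requested. */
    int detailRequests() {
        return detailRequests;
    }
}
```

A caller would score every candidate through score() and call matchPositions() only for the handful of documents being displayed.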
