Posted to dev@lucene.apache.org by "Mike Sokolov (JIRA)" <ji...@apache.org> on 2011/06/29 17:49:29 UTC

[jira] [Issue Comment Edited] (LUCENE-2878) Allow Scorer to expose positions and payloads aka. nuke spans

    [ https://issues.apache.org/jira/browse/LUCENE-2878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13057306#comment-13057306 ] 

Mike Sokolov edited comment on LUCENE-2878 at 6/29/11 3:49 PM:
---------------------------------------------------------------

bq. there should be only one consumer really. Which usecase have you in mind where multiple consumers are using the iterator?

I guess I am coming at this from the perspective of a Highlighter; the Highlighter wants to iterate over the top-level Scorer, finding each of its matching positions, and then for each of those, it wants to iterate over all the individual terms' positions.  Possibly some clever HL of the future will be interested in intermediate-level nodes in the tree as well, like highlighting a near-span, or coalescing phrases.  The problem I see is that with the current API the only way to retrieve the lower-level positions is to advance their iterators, but if that is done directly (without the knowledge of the enclosing scorer and its iterator), the scoring will be messed up.  I guess that's what I meant by multiple consumers - of course you are right, there should be only one "writer" consumer that can advance the iteration.  My idea is that there could be many readers, though.  In any case, I think it is typical for an iterator that you can read the current position as many times as you want, rather than "read once" and expect the caller to cache the value?
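To make the "one writer, many readers" idea concrete, here is a minimal sketch in Java. It is not the API from the patch: PositionCursor, next(), current() and NO_MORE are names I made up for illustration. The point is only that advancing and reading are separate operations, so a highlighter can re-read the current position without disturbing the scorer that owns the iteration.

```java
/** Hypothetical cursor over one term's positions in a document; not a Lucene class. */
final class PositionCursor {
    /** Sentinel returned once the positions are exhausted (assumed convention). */
    static final int NO_MORE = Integer.MAX_VALUE;

    private final int[] positions; // sorted positions of the term in the current doc
    private int index = -1;        // -1 means next() has not been called yet

    PositionCursor(int[] positions) {
        this.positions = positions.clone();
    }

    /** Advances the cursor. Only the single "writer" (the enclosing scorer) should call this. */
    int next() {
        if (index < positions.length) {
            index++;
        }
        return index < positions.length ? positions[index] : NO_MORE;
    }

    /** Reads the current position without advancing; any number of readers may call this. */
    int current() {
        if (index < 0) {
            throw new IllegalStateException("next() has not been called yet");
        }
        return index < positions.length ? positions[index] : NO_MORE;
    }
}
```

With this shape the caller never has to cache the value returned by next(); it can ask current() again whenever it needs it.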

bq. what is the returned PI here again? In the TermScorer case that is trivial but what would a BooleanScorer return here?

It has its own PI right?  I think it is the minimum interval containing some terms that satisfy the boolean conditions.
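For the pure conjunction case, "minimum interval" could be computed roughly as below. This is only a sketch under my own assumptions (each clause is a single term with a sorted position array, and MinInterval / smallestCover are hypothetical names); I have not checked how the patch does it, and it ignores SHOULD and MUST_NOT clauses entirely.

```java
/** Hypothetical helper, not part of Lucene or the patch. */
final class MinInterval {
    /**
     * Returns {start, end} of the smallest window containing at least one
     * position from every list, or null if no such window exists.
     * Each inner array must be sorted in ascending order.
     */
    static int[] smallestCover(int[][] lists) {
        if (lists.length == 0) {
            return null;
        }
        int[] idx = new int[lists.length]; // current index into each list
        int[] best = null;
        while (true) {
            int minList = -1;
            int min = Integer.MAX_VALUE;
            int max = Integer.MIN_VALUE;
            for (int i = 0; i < lists.length; i++) {
                if (idx[i] >= lists[i].length) {
                    return best; // one clause is exhausted: no further windows
                }
                int p = lists[i][idx[i]];
                if (minList < 0 || p < min) {
                    min = p;
                    minList = i;
                }
                if (p > max) {
                    max = p;
                }
            }
            if (best == null || max - min < best[1] - best[0]) {
                best = new int[] {min, max};
            }
            idx[minList]++; // only moving the smallest position can shrink the window
        }
    }
}
```

For example, with positions {1, 10, 20}, {12, 30} and {11, 25} the candidate windows are [1,12], [10,12], [11,20], [12,25] and [20,30], so the result is {10, 12}.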

bq. I think that could make sense but let me explain the reason why this is there right now. So currently a scorer has a defined PositionIterator which could be a problem later. For instance, if I want the minimal position interval (ordered) of all boolean clauses for query X, but for query Y I want the same interval unordered (out of order), I need to replace the logic in the scorer somehow. So to make that more flexible I exposed all subs here so you can run your own alg. I would love to see better solutions since I only hacked this up in a couple of days though.

Hmm, I haven't yet looked at how BooleanScorer2 and BooleanScorer work, but I understand there is some additional complexity there.  Perhaps if the only distinction is ordered/unordered, there might be a special case for that when you create the Scorer, rather than exposing internals to the caller?  But I don't know - I would have to understand this better.  Maybe there are other cases where that could be needed.
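What I have in mind with "a special case when you create the Scorer" is something like the sketch below: the ordering requirement is fixed at construction time, and the caller never touches the sub-scorers. ProximityWindow, Order and accepts are hypothetical names for illustration only, not anything in the patch.

```java
/** Hypothetical match filter configured once, when the scorer is created. */
final class ProximityWindow {
    enum Order { ORDERED, UNORDERED }

    private final Order order;

    ProximityWindow(Order order) {
        this.order = order;
    }

    /**
     * Decides whether a candidate match is acceptable.
     * positionsInClauseOrder[i] is the position matched by the i-th clause.
     */
    boolean accepts(int[] positionsInClauseOrder) {
        if (order == Order.UNORDERED) {
            return true; // any arrangement of the clauses is fine
        }
        // ORDERED: each clause must match strictly after the previous one
        for (int i = 1; i < positionsInClauseOrder.length; i++) {
            if (positionsInClauseOrder[i] <= positionsInClauseOrder[i - 1]) {
                return false;
            }
        }
        return true;
    }
}
```

Whether a flag like this is enough depends on whether ordered/unordered really is the only variation; if other interval algorithms are needed, exposing the subs may still be the more flexible choice.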

bq. Mike, would you be willing to upload a patch for your hacked collector etc to see what you have done?

The PosHighlighter is a bit messy - filled with debugging and testing code and so on, and it's also slow because of the need to match positions->offsets in kind of a gross way. Robert M had an idea for storing this mapping in the index which would improve things there, but I haven't done that. In any case, I'll be happy to share the patch when I get back home and can clean it up a bit. Maybe if I have a chance I will look into implementing OR-queries - I stumbled on that limitation right away!


  
> Allow Scorer to expose positions and payloads aka. nuke spans 
> --------------------------------------------------------------
>
>                 Key: LUCENE-2878
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2878
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/search
>    Affects Versions: Bulk Postings branch
>            Reporter: Simon Willnauer
>            Assignee: Simon Willnauer
>              Labels: gsoc2011, lucene-gsoc-11, mentor
>         Attachments: LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878_trunk.patch, LUCENE-2878_trunk.patch
>
>
> Currently we have two somewhat separate types of queries; one of them can make use of positions (mainly spans) and payloads (spans). Yet Span*Query doesn't really do scoring comparable to what other queries do, and at the end of the day they duplicate a lot of code all over Lucene. Span*Queries are also limited to other Span*Query instances, such that you cannot use a TermQuery or a BooleanQuery with SpanNear or anything like that. 
> Besides the Span*Query limitation, other queries lack a quite interesting feature: they cannot score based on term proximity, since scorers don't expose any positional information. All those problems have bugged me for a while now, so I started working on this using the bulkpostings API. I would have done that first cut on trunk, but TermScorer there works on a BlockReader that does not expose positions, while the one in this branch does. I started by adding a new Positions class which users can pull from a scorer; to prevent unnecessary positions enums I added ScorerContext#needsPositions and eventually Scorere#needsPayloads to create the corresponding enum on demand. Yet, currently only TermQuery / TermScorer implements this API and the others simply return null instead. 
> To show that the API really works and that our BulkPostings work fine with positions too, I cut over TermSpanQuery to use a TermScorer under the hood and nuked TermSpans entirely. A nice side effect of this was that the Position BulkReading implementation got some exercise, and it now :) all works with positions, while Payloads for bulk reading are kind of experimental in the patch and only work with the Standard codec. 
> So all spans now work on top of TermScorer ( I truly hate spans since today ) including the ones that need Payloads (StandardCodec ONLY)!!  I didn't bother to implement the other codecs yet since I want to get feedback on the API and on this first cut before I go on with it. I will upload the corresponding patch in a minute. 
> I also had to cut over SpanQuery.getSpans(IR) to SpanQuery.getSpans(AtomicReaderContext) which I should probably do on trunk first but after that pain today I need a break first :).
> The patch passes all core tests (org.apache.lucene.search.highlight.HighlighterTest still fails but I didn't look into the MemoryIndex BulkPostings API yet)
