Posted to dev@lucene.apache.org by Grant Ingersoll <gs...@apache.org> on 2007/07/16 04:18:29 UTC

Best Practices for getting Strings from a position range

Do we have a best practice for going from, say, SpanQuery doc/position
information to retrieving the actual range of positions of content
from the Document?  Is it just to reanalyze the Document using the
appropriate Analyzer and start recording once you hit the positions
you are interested in?  Seems like Term Vectors _could_ help, but even
my new Mapper approach patch (LUCENE-868) doesn't really help, because
they are stored in a term-centric manner.  I guess what I am after is
a position-centric approach.  That is, given a Document, get a term
vector (note, not a TermFreqVector) back that is ordered by position
(thus, there may be duplicate entries for a given term that occurs in
multiple positions).
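
For the reanalyze route, a rough sketch (old 2.x TokenStream API, from
memory; analyzer, doc, startPos and endPos are just placeholders for
whatever you have in hand, so treat it as illustrative only):

    // imports assumed: java.io.StringReader, java.util.*,
    // org.apache.lucene.analysis.*
    // Collect the terms whose positions fall inside [startPos, endPos],
    // e.g. taken from Spans.start()/end().
    TokenStream ts = analyzer.tokenStream("contents",
        new StringReader(doc.get("contents")));
    List termsInRange = new ArrayList();
    int pos = -1;
    Token token;
    while ((token = ts.next()) != null) {
      pos += token.getPositionIncrement();
      if (pos > endPos) break;                    // past the window
      if (pos >= startPos) termsInRange.add(token.termText());
    }
    ts.close();

It works, but it pays the full analysis cost for the document just to
read back a handful of positions.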

I feel like I am missing something obvious.  I would suspect the  
highlighter needs to do this, but it seems to take the reanalyze  
approach as well (I admit, though, that I have little experience with  
the highlighter.)

I am wondering if it would be useful to have an alternative Term
Vector storage mechanism that was position-centric.  Because we
couldn't take advantage of the lexicographic compression, it would
take up more disk space, but it would be a lot faster for these kinds
of things.  With this kind of approach, you could easily index into
an array based on the result of SpanQuery.start(), etc.  Of course,
you would have to have a data structure that handled multiple terms
per position, but I don't think that would be too hard, correct?
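
Roughly what I have in mind, as a sketch -- the Spans calls are the
existing API, while loadPositionCentricVector() stands in for the
hypothetical new structure (String[position][terms-at-that-position]):

    Spans spans = spanQuery.getSpans(reader);
    while (spans.next()) {
      // hypothetical: one array entry per position in spans.doc()
      String[][] byPosition =
          loadPositionCentricVector(reader, spans.doc(), "contents");
      for (int p = spans.start(); p < spans.end(); p++) {
        String[] termsAtP = byPosition[p];   // direct index, no seeking
        // ... build the snippet/window from termsAtP ...
      }
    }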

Just thinking out loud...

Cheers,
Grant



Re: Best Practices for getting Strings from a position range

Posted by Grant Ingersoll <gr...@gmail.com>.
On Aug 15, 2007, at 10:46 AM, Peter Keegan wrote:

> Grant,
>
> I built an index as described here:
> http://www.nabble.com/SpanQuery-and-database-join-tf4262902.html
>
> Many documents have only 1 or 2 rows, some have dozens.
> Here is a typical query without spans:
>
> +((+contents:quaker +contents:cereal) (+boost50:quaker  
> +boost50:cereal))
> +literals:co$us), sort=<custom:"feedbabe":
> RoundingScoreDocComparator@8c169d05>,"dateactiveR"!
>
>
> Here is a typical query with spans:
>
> +spanNear([adliterals:jb$1, adliterals:co$us], 8, false)
> +(+((+contents:quaker +contents:cereal) (+boost50:quaker  
> +boost50:cereal))
> +literals:co$us), sort=<custom:"feedbabe":
> RoundingScoreDocComparator@8c169d05>,"dateactiveR"!
>
> The addition of the spanNear clause caused the 10X decrease in  
> throughput. I
> could probably change the way rows are indexed and use ordered  
> terms, which
> seems to be a bit faster (only 5X decrease)

In looking at the code, it makes sense that an ordered SpanNearQuery  
would be faster.
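
For reference, the only difference on the query side is the inOrder
flag -- using the field/terms from your example, so purely
illustrative:

    SpanQuery[] clauses = new SpanQuery[] {
        new SpanTermQuery(new Term("adliterals", "jb$1")),
        new SpanTermQuery(new Term("adliterals", "co$us"))
    };
    // slop 8; true = in-order matching, false = the unordered case
    // whose priority-queue management shows up in your stack trace
    SpanNearQuery ordered   = new SpanNearQuery(clauses, 8, true);
    SpanNearQuery unordered = new SpanNearQuery(clauses, 8, false);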

I am still trying to dig into the logistics of the Unordered  
SpanNearQuery, as it is the only thing hanging me up on adding  
payload access to Spans.  I need to step through and debug.  As your  
stack trace showed, there is a lot of work taking place to manage the  
priority queue that is created.  I just don't understand the relation  
between the SpanCells, the "ordered" List and the PriorityQueue  
"queue" just yet.  It seems the SpanCells make a linked list, the  
"ordered" list is for getting the spans from the sub queries and the  
queue seems to rearrange the ordered list

If anyone wants to chip in with pseudocode explaining what is going  
on in NearSpansUnordered.java it would be helpful. 
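
For what it is worth, here is my rough reading of it in pseudocode --
unverified against the source, so corrections welcome:

    // one "cell" wraps the Spans of each clause; cells sit both in a
    // priority queue (ordered by doc, then start) and in a linked list
    // that tracks the largest end position seen on the current doc
    while (there are more candidate docs) {
      advance cells (next/skipTo) until every clause is on the same doc;
      matchStart = start of the queue top (smallest start on this doc);
      matchEnd   = largest end among the cells (via the linked list);
      if ((matchEnd - matchStart) - totalLengthOfAllClauses <= slop)
        return true;   // unordered match within slop
      // otherwise advance the cell with the smallest start, fix up the
      // queue (adjustTop/downHeap -- the hot spots in the stack trace),
      // and try again
    }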
  



Re: Best Practices for getting Strings from a position range

Posted by Peter Keegan <pe...@gmail.com>.
Grant,

I built an index as described here:
http://www.nabble.com/SpanQuery-and-database-join-tf4262902.html

Many documents have only 1 or 2 rows, some have dozens.
Here is a typical query without spans:

+((+contents:quaker +contents:cereal) (+boost50:quaker +boost50:cereal))
+literals:co$us), sort=<custom:"feedbabe":
RoundingScoreDocComparator@8c169d05>,"dateactiveR"!


Here is a typical query with spans:

+spanNear([adliterals:jb$1, adliterals:co$us], 8, false)
+(+((+contents:quaker +contents:cereal) (+boost50:quaker +boost50:cereal))
+literals:co$us), sort=<custom:"feedbabe":
RoundingScoreDocComparator@8c169d05>,"dateactiveR"!

The addition of the spanNear clause caused the 10X decrease in
throughput.  I could probably change the way rows are indexed and use
ordered terms, which seems to be a bit faster (only a 5X decrease).

Peter


On 8/14/07, Grant Ingersoll <gr...@gmail.com> wrote:
>
> Hi Peter,
>
> Could you give more details on this test?  What are you comparing,
> etc.?  Sample queries would be good.  I would like to write up a
> contrib/Benchmark algorithm to begin investigating this and see if
> there is anything that can be done.
>
> Thanks,
> Grant
>
> On Aug 10, 2007, at 6:27 PM, Peter Keegan wrote:
>
> > ok, glad we're on the same page.
> >
> > I did some performance testing with span queries and,
> > unfortunately,  the
> > results are discouraging for my intended use. When I added a simple
> > SpanNearQuery to existing queries, the throughput decreased by a
> > factor of
> > 10+. I figured spans would be expensive, but not that much. I
> > haven't done
> > profiling yet, but here's a typical thread stack during execution:
> >
> > at org.apache.lucene.util.PriorityQueue.downHeap(PriorityQueue.java:
> > 137)
> > at org.apache.lucene.util.PriorityQueue.adjustTop
> > (PriorityQueue.java:101)
> > at org.apache.lucene.search.spans.NearSpansUnordered.next(
> > NearSpansUnordered.java:128)
> > at org.apache.lucene.search.spans.SpanScorer.setFreqCurrentDoc(
> > SpanScorer.java:83)
> > at org.apache.lucene.search.spans.SpanScorer.next(SpanScorer.java:57)
> > at org.apache.lucene.search.ConjunctionScorer.next
> > (ConjunctionScorer.java
> > :56)
> > at org.apache.lucene.search.BooleanScorer2.score
> > (BooleanScorer2.java:327)
> > at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:
> > 146)
> > at org.apache.lucene.search.Searcher.search(Searcher.java:118)
> >
> > Peter
> >
> >
> > On 8/10/07, Grant Ingersoll <gs...@apache.org> wrote:
> >>
> >> Sorry for the confusion.  I thought you just wanted access to the
> >> term info per position.    I think we will have to add something to
> >> the Spans like we talked about before.
> >>
> >> -Grant
> >>
> >> On Aug 10, 2007, at 11:03 AM, Peter Keegan wrote:
> >>
> >>> Grant,
> >>>
> >>> I'm afraid I don't understand how to use this mapper in the context
> >>> of a
> >>> SpanQuery. It seems like I would have to modify SpanScorer to fetch
> >>> payload
> >>> data and provide a new method to access the payloads while
> >>> iterating through
> >>> the documents. If this can be accomplished without modifying Spans,
> >>> could
> >>> you provide a bit more detail?
> >>>
> >>> Thanks,
> >>> Peter
> >>>
> >>> On 8/9/07, Peter Keegan <pe...@gmail.com> wrote:
> >>>>
> >>>> Hi Grant,
> >>>>
> >>>> I'm hoping to check this out soon.
> >>>>
> >>>> Thanks,
> >>>> Peter
> >>>>
> >>>> On 8/7/07, Grant Ingersoll <gsingers@apache.org > wrote:
> >>>>>
> >>>>> Hi Peter,
> >>>>>
> >>>>> Give https://issues.apache.org/jira/browse/LUCENE-975 a try.  It
> >>>>> provides a TermVectorMapper that loads by position.
> >>>>>
> >>>>> Still not what ideally what you want, but I haven't had time to
> >>>>> scope
> >>>>> that one out yet.,
> >>>>>
> >>>>> -Grant
> >>>>>
> >>>>> On Jul 24, 2007, at 6:02 PM, Peter Keegan wrote:
> >>>>>
> >>>>>> Hi Grant,
> >>>>>>
> >>>>>> No problem - I know you are very busy.  I just wanted to get a
> >>>>>> sense for the
> >>>>>> timing because I'd like to use this for a release this Fall. If I
> >>>>>> can get a
> >>>>>> prototype working in the coming weeks AND the performance is
> >>>>>> great :) , this
> >>>>>> would be terrific. If not, I'll have to fall back on a more
> >>>>>> complex
> >>>>>> design
> >>>>>> that handles the query outside of Lucene :(
> >>>>>>
> >>>>>> In the meantime, I'll try playing with LUCENE-868.
> >>>>>>
> >>>>>> Thanks for the update.
> >>>>>> Peter
> >>>>>>
> >>>>>> On 7/24/07, Grant Ingersoll <grant.ingersoll@gmail.com > wrote:
> >>>>>>>
> >>>>>>> Sorry, Peter, I haven't had a chance to work on it.  I don't
> >>>>>>> see it
> >>>>>>> happening this week, but maybe next.
> >>>>>>>
> >>>>>>> I do think the Mapper approach via TermVectors will work.  It
> >>>>>>> will
> >>>>>>> require implementing a new mapper that orders by position, but I
> >>>>>>> don't think that is too hard.   I started on one on the
> >>>>>>> LUCENE-868
> >>>>>>> patch (version 4) but it is not complete.  Maybe you want to
> >>>>>>> pick
> >>>>>>> it up?
> >>>>>>>
> >>>>>>> With this approach, you would iterate your spans, when you come
> >>>>>>> to a
> >>>>>>> new doc, you would load the term vector using the
> >>>>>>> PositionMapper, and
> >>>>>>> then you could index into the positions for the matches in the
> >>>>>>> document.
> >>>>>>>
> >>>>>>> I realize this does not cover the just wanting to get the
> >>>>>>> Payload at
> >>>>>>> the match issue.  Maybe next week...
> >>>>>>>
> >>>>>>> Cheers,
> >>>>>>> Grant
> >>>>>>>
> >>>>>>> On Jul 23, 2007, at 8:51 AM, Peter Keegan wrote:
> >>>>>>>
> >>>>>>>> Any idea on when this might be available (days, weeks...)?
> >>>>>>>>
> >>>>>>>> Peter
> >>>>>>>>
> >>>>>>>> On 7/16/07, Grant Ingersoll < gsingers@apache.org> wrote:
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On Jul 16, 2007, at 1:06 AM, Chris Hostetter wrote:
> >>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> : Do we have a best practice for going from, say a SpanQuery
> >>>>>>> doc/
> >>>>>>>>>> : position information and retrieving the actual range of
> >>>>>>>>> positions of
> >>>>>>>>>> : content from the Document?  Is it just to reanalyze the
> >>>>>>> Document
> >>>>>>>>>> : using the appropriate Analyzer and start recording once you
> >>>>>>>>> hit the
> >>>>>>>>>> : positions you are interested in?    Seems like Term Vectors
> >>>>>>>>> _could_
> >>>>>>>>>> : help, but even my new Mapper approach patch (LUCENE-868)
> >>>>>>> doesn't
> >>>>>>>>>> : really help, because they are stored in a term-centric
> >>>>>>> manner.  I
> >>>>>>>>>> : guess what I am after is a position centric approach.  That
> >>>>>>>>> is, give
> >>>>>>>>>>
> >>>>>>>>>> this is kind of what i was suggesting in the last message i
> >>>>>>>>>> sent
> >>>>>>>>>> to the java-user thread about paylods and SpanQueries (which
> >>>>>>>>>> i'm
> >>>>>>>>>> guessing is what prompted this thread as well)...
> >>>>>>>>>>
> >>>>>>>>>> http://www.nabble.com/Payloads-and-PhraseQuery-
> >>>>>>>>>> tf3988826.html#a11551628
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> This is one use case, the other is related to the new patch I
> >>>>>>>>> submitted for LUCENE-960.  In this case, I have a
> >>>>>>>>> SpanQueryFilter
> >>>>>>>>> that identifies a bunch of docs and positions ahead of time.
> >>>>>>>>> Then
> >>>>>
> >>>>>>>>> the user enters new Span Query and I want to relate the
> >>>>>>>>> matches
> >>>>>>> from
> >>>>>>>>> the user query with the positions of matches in the filter and
> >>>>>>> then
> >>>>>>>>> show that window.
> >>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> my point was that currently, to retrieve a payload you need a
> >>>>>>>>>> TermPositions instance, which is designed for iterating in
> >>>>>>>>>> the
> >>>>>>>>>> order of...
> >>>>>>>>>>     seek(term)
> >>>>>>>>>>       skipTo(doc)
> >>>>>>>>>>          nextPosition()
> >>>>>>>>>>             getPayload()
> >>>>>>>>>> ...which is great for getting the payload of every instance
> >>>>>>>>>> (ie:position) of a specific term in a given document (or in
> >>>>>>> every
> >>>>>>>>>> document) but without serious changes to the Spans API, the
> >>>>>>> ideal
> >>>>>>>>>> payload
> >>>>>>>>>> API would let you say...
> >>>>>>>>>>     skipTo(doc)
> >>>>>>>>>>        advance(startPosition)
> >>>>>>>>>>          getPayload()
> >>>>>>>>>>        while (nextPosition() < endPosition)
> >>>>>>>>>>          getPosition()
> >>>>>>>>>>
> >>>>>>>>>> but this seems like a nearly impossible API to implement
> >>>>>>> given the
> >>>>>>>>>> natore
> >>>>>>>>>> of hte inverted index and the fact that terms aren't ever
> >>>>>>> stored in
> >>>>>>>>>> position order.
> >>>>>>>>>>
> >>>>>>>>>> there's a lot i really don't know/understand about the lucene
> >>>>>>> term
> >>>>>>>>>> position internals ... but as i recall, the datastructure
> >>>>>>> written
> >>>>>>>>>> to disk
> >>>>>>>>>> isn't actually a tree structure inverted index, it's a long
> >>>>>>>>>> sequence of
> >>>>>>>>>> tuples correct?  so in theory you could scan along the tuples
> >>>>>>>>>> untill you
> >>>>>>>>>> find the doc you are interested in, ignoring all of the term
> >>>>>>> info
> >>>>>>>>>> along
> >>>>>>>>>> the way, then whatever term you happen be on at the
> >>>>>>>>>> moment, you
> >>>>>>>>>> could scan
> >>>>>>>>>> along all of the positions until you find one in the range
> >>>>>>> you are
> >>>>>>>>>> interested in -- assuming you do, then you record the current
> >>>>>>> Term
> >>>>>>>>>> (and
> >>>>>>>>>> read your payload data if interested)
> >>>>>>>>>
> >>>>>>>>> I think the main issue I see is in both the payloads and the
> >>>>>>> matching
> >>>>>>>>> case above is that they require a document centric approach.
> >>>>>>>>> And
> >>>>>>>>> then, for each Document,
> >>>>>>>>> you ideally want to be able to just index into an array so
> >>>>>>>>> that
> >>>>>>> you
> >>>>>>>>> can go directly to the position that is needed based on
> >>>>>>>>> Span.getStart()
> >>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> if i remember correctly, the first part of this is easy, and
> >>>>>>>>>> relative fast
> >>>>>>>>>> -- i think skipTo(doc) on a TermDoc or TermPositions will
> >>>>>>> happily
> >>>>>>>>>> scan for
> >>>>>>>>>> the first <term,doc> pair with the correct docId,
> >>>>>>> irregardless of
> >>>>>>>>>> the term
> >>>>>>>>>> ... the only thing i'm not sure about is how efficient it
> >>>>>>>>>> is to
> >>>>>>>>>> loop over
> >>>>>>>>>> nextPosition() for every term you find to see if any of them
> >>>>>>> are in
> >>>>>>>>>> your
> >>>>>>>>>> range ... the best case scenerio is that the first position
> >>>>>>>>>> returned is
> >>>>>>>>>> above the high end of your range, in which case you can stop
> >>>>>>>>>> immediately
> >>>>>>>>>> and seek to the next term -- butthe worst case is that you
> >>>>>>>>>> call
> >>>>>>>>>> nextPosition() over an over a lot of times before you get a
> >>>>>>>>>> position in
> >>>>>>>>>> (or above) your rnage .... an advancePosition(pos) that
> >>>>>>> wokred like
> >>>>>>>>>> seek
> >>>>>>>>>> or skipTo might be helpful here.
> >>>>>>>>>>
> >>>>>>>>>> : I feel like I am missing something obvious.  I would
> >>>>>>> suspect the
> >>>>>>>>>> : highlighter needs to do this, but it seems to take the
> >>>>>>> reanalyze
> >>>>>>>>>> : approach as well (I admit, though, that I have little
> >>>>>>> experience
> >>>>>>>>>> with
> >>>>>>>>>> : the highlighter.)
> >>>>>>>>>>
> >>>>>>>>>> as i understand it the default case is to reanalyze, but
> >>>>>>>>>> if you
> >>>>>>>>> have
> >>>>>>>>>> TermFreqVector info stored with positions (ie: a
> >>>>>>>>>> TermPositionVector) then
> >>>>>>>>>> it can use that to construct a TokenStream by iterating over
> >>>>>>>>>> all
> >>>>>>>>>> terms and
> >>>>>>>>>> writing them into a big array in position order (see the
> >>>>>>>>>> TermSources class
> >>>>>>>>>> in the highlighter)
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Ah, I see that now.  Thanks.
> >>>>>>>>>>
> >>>>>>>>>> this makes sense when highlighting because it doesn't know
> >>>>>>>>>> what
> >>>>>>>>>> kind of
> >>>>>>>>>> fragmenter is going to be used so it needs the whole
> >>>>>>> TokenStream,
> >>>>>>>>>> but it
> >>>>>>>>>> seems less then ideal when you are only interested in a small
> >>>>>>>>>> number of
> >>>>>>>>>> position ranges that you know in advance.
> >>>>>>>>>>
> >>>>>>>>>> : I am wondering if it would be useful to have an alternative
> >>>>>>> Term
> >>>>>>>>>> : Vector storage mechanism that was position centric.
> >>>>>>> Because we
> >>>>>>>>>> : couldn't take advantage of the lexicographic
> >>>>>>>>>> compression, it
> >>>>>>>>> would
> >>>>>>>>>> : take up more disk space, but it would be a lot faster for
> >>>>>>> these
> >>>>>>>>>> kinds
> >>>>>>>>>>
> >>>>>>>>>> i'm not sure if it's really neccessary to store the data in a
> >>>>>>>>> position
> >>>>>>>>>> centric manner, assuming we have a way to "seek" by position
> >>>>>>> like i
> >>>>>>>>>> described above -- but then again i don't really know that
> >>>>>>> what i
> >>>>>>>>>> described above is all that possible/practical/performant.
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> I suppose I could use my Mapper approach to organize things
> >>>>>>>>> in a
> >>>>>>>>> position centric way now that I think about it more.  Just
> >>>>>>> means some
> >>>>>>>>> unpacking and repacking.  Still, probably would perform well
> >>>>>>> enough
> >>>>>>>>> since I can setup the correct structure on the fly.  I will
> >>>>>>> give this
> >>>>>>>>> a try.  Maybe even add a Mapper to do this.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> -Grant
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>> ----------------------------------------------------------------
> >>>>>>> --
> >>>>>>> ---
> >>>>>
> >>>>>>>>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> >>>>>>>>> For additional commands, e-mail: java-dev-
> >>>>>>>>> help@lucene.apache.org
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>
> >>>>>>> ------------------------------------------------------
> >>>>>>> Grant Ingersoll
> >>>>>>> http://www.grantingersoll.com/
> >>>>>>> http://lucene.grantingersoll.com
> >>>>>>> http://www.paperoftheweek.com/
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> ----------------------------------------------------------------
> >>>>>>> --
> >>>>>>> ---
> >>>>>>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> >>>>>>> For additional commands, e-mail: java-dev-help@lucene.apache.org
> >>>>>>>
> >>>>>>>
> >>>>>
> >>>>> --------------------------
> >>>>> Grant Ingersoll
> >>>>> http://lucene.grantingersoll.com
> >>>>>
> >>>>> Lucene Helpful Hints:
> >>>>> http://wiki.apache.org/lucene-java/BasicsOfPerformance
> >>>>> http://wiki.apache.org/lucene-java/LuceneFAQ
> >>>>>
> >>>>>
> >>>>>
> >>>>> ------------------------------------------------------------------
> >>>>> --
> >>>>> -
> >>>>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> >>>>> For additional commands, e-mail: java-dev-help@lucene.apache.org
> >>>>>
> >>>>>
> >>>>
> >>
> >> --------------------------
> >> Grant Ingersoll
> >> http://lucene.grantingersoll.com
> >>
> >> Lucene Helpful Hints:
> >> http://wiki.apache.org/lucene-java/BasicsOfPerformance
> >> http://wiki.apache.org/lucene-java/LuceneFAQ
> >>
> >>
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> >> For additional commands, e-mail: java-dev-help@lucene.apache.org
> >>
> >>
>
> ------------------------------------------------------
> Grant Ingersoll
> http://www.grantingersoll.com/
> http://lucene.grantingersoll.com
> http://www.paperoftheweek.com/
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
>
>

Re: Best Practices for getting Strings from a position range

Posted by Grant Ingersoll <gr...@gmail.com>.
Hi Peter,

Could you give more details on this test?  What are you comparing,  
etc.?  Sample queries would be good.  I would like to write up a  
contrib/Benchmark algorithm to begin investigating this and see if  
there is anything that can be done.
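
In the meantime, a crude hand-rolled comparison along these lines
would already tell us something (query construction is sketched from
your examples, so the field/term names and the buildBaseQuery() helper
are only illustrative):

    IndexSearcher searcher = new IndexSearcher("/path/to/index");
    Query withoutSpan = buildBaseQuery();  // the +contents/+boost50 query
    BooleanQuery withSpan = new BooleanQuery();
    withSpan.add(new SpanNearQuery(new SpanQuery[] {
        new SpanTermQuery(new Term("adliterals", "jb$1")),
        new SpanTermQuery(new Term("adliterals", "co$us")) }, 8, false),
        BooleanClause.Occur.MUST);
    withSpan.add(withoutSpan, BooleanClause.Occur.MUST);

    long t0 = System.currentTimeMillis();
    for (int i = 0; i < 1000; i++) searcher.search(withoutSpan);
    long t1 = System.currentTimeMillis();
    for (int i = 0; i < 1000; i++) searcher.search(withSpan);
    long t2 = System.currentTimeMillis();
    System.out.println("without: " + (t1 - t0) + "ms, with: " + (t2 - t1) + "ms");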

Thanks,
Grant

On Aug 10, 2007, at 6:27 PM, Peter Keegan wrote:

> ok, glad we're on the same page.
>
> I did some performance testing with span queries and,  
> unfortunately,  the
> results are discouraging for my intended use. When I added a simple
> SpanNearQuery to existing queries, the throughput decreased by a  
> factor of
> 10+. I figured spans would be expensive, but not that much. I  
> haven't done
> profiling yet, but here's a typical thread stack during execution:
>
> at org.apache.lucene.util.PriorityQueue.downHeap(PriorityQueue.java: 
> 137)
> at org.apache.lucene.util.PriorityQueue.adjustTop 
> (PriorityQueue.java:101)
> at org.apache.lucene.search.spans.NearSpansUnordered.next(
> NearSpansUnordered.java:128)
> at org.apache.lucene.search.spans.SpanScorer.setFreqCurrentDoc(
> SpanScorer.java:83)
> at org.apache.lucene.search.spans.SpanScorer.next(SpanScorer.java:57)
> at org.apache.lucene.search.ConjunctionScorer.next 
> (ConjunctionScorer.java
> :56)
> at org.apache.lucene.search.BooleanScorer2.score 
> (BooleanScorer2.java:327)
> at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java: 
> 146)
> at org.apache.lucene.search.Searcher.search(Searcher.java:118)
>
> Peter
>
>
> On 8/10/07, Grant Ingersoll <gs...@apache.org> wrote:
>>
>> Sorry for the confusion.  I thought you just wanted access to the
>> term info per position.    I think we will have to add something to
>> the Spans like we talked about before.
>>
>> -Grant
>>
>> On Aug 10, 2007, at 11:03 AM, Peter Keegan wrote:
>>
>>> Grant,
>>>
>>> I'm afraid I don't understand how to use this mapper in the context
>>> of a
>>> SpanQuery. It seems like I would have to modify SpanScorer to fetch
>>> payload
>>> data and provide a new method to access the payloads while
>>> iterating through
>>> the documents. If this can be accomplished without modifying Spans,
>>> could
>>> you provide a bit more detail?
>>>
>>> Thanks,
>>> Peter
>>>
>>> On 8/9/07, Peter Keegan <pe...@gmail.com> wrote:
>>>>
>>>> Hi Grant,
>>>>
>>>> I'm hoping to check this out soon.
>>>>
>>>> Thanks,
>>>> Peter
>>>>
>>>> On 8/7/07, Grant Ingersoll <gsingers@apache.org > wrote:
>>>>>
>>>>> Hi Peter,
>>>>>
>>>>> Give https://issues.apache.org/jira/browse/LUCENE-975 a try.  It
>>>>> provides a TermVectorMapper that loads by position.
>>>>>
>>>>> Still not what ideally what you want, but I haven't had time to
>>>>> scope
>>>>> that one out yet.,
>>>>>
>>>>> -Grant
>>>>>
>>>>> On Jul 24, 2007, at 6:02 PM, Peter Keegan wrote:
>>>>>
>>>>>> Hi Grant,
>>>>>>
>>>>>> No problem - I know you are very busy.  I just wanted to get a
>>>>>> sense for the
>>>>>> timing because I'd like to use this for a release this Fall. If I
>>>>>> can get a
>>>>>> prototype working in the coming weeks AND the performance is
>>>>>> great :) , this
>>>>>> would be terrific. If not, I'll have to fall back on a more  
>>>>>> complex
>>>>>> design
>>>>>> that handles the query outside of Lucene :(
>>>>>>
>>>>>> In the meantime, I'll try playing with LUCENE-868.
>>>>>>
>>>>>> Thanks for the update.
>>>>>> Peter
>>>>>>
>>>>>> On 7/24/07, Grant Ingersoll <grant.ingersoll@gmail.com > wrote:
>>>>>>>
>>>>>>> Sorry, Peter, I haven't had a chance to work on it.  I don't
>>>>>>> see it
>>>>>>> happening this week, but maybe next.
>>>>>>>
>>>>>>> I do think the Mapper approach via TermVectors will work.  It  
>>>>>>> will
>>>>>>> require implementing a new mapper that orders by position, but I
>>>>>>> don't think that is too hard.   I started on one on the  
>>>>>>> LUCENE-868
>>>>>>> patch (version 4) but it is not complete.  Maybe you want to  
>>>>>>> pick
>>>>>>> it up?
>>>>>>>
>>>>>>> With this approach, you would iterate your spans, when you come
>>>>>>> to a
>>>>>>> new doc, you would load the term vector using the
>>>>>>> PositionMapper, and
>>>>>>> then you could index into the positions for the matches in the
>>>>>>> document.
>>>>>>>
>>>>>>> I realize this does not cover the just wanting to get the
>>>>>>> Payload at
>>>>>>> the match issue.  Maybe next week...
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Grant
>>>>>>>
>>>>>>> On Jul 23, 2007, at 8:51 AM, Peter Keegan wrote:
>>>>>>>
>>>>>>>> Any idea on when this might be available (days, weeks...)?
>>>>>>>>
>>>>>>>> Peter
>>>>>>>>
>>>>>>>> On 7/16/07, Grant Ingersoll < gsingers@apache.org> wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Jul 16, 2007, at 1:06 AM, Chris Hostetter wrote:
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> : Do we have a best practice for going from, say a SpanQuery
>>>>>>> doc/
>>>>>>>>>> : position information and retrieving the actual range of
>>>>>>>>> positions of
>>>>>>>>>> : content from the Document?  Is it just to reanalyze the
>>>>>>> Document
>>>>>>>>>> : using the appropriate Analyzer and start recording once you
>>>>>>>>> hit the
>>>>>>>>>> : positions you are interested in?    Seems like Term Vectors
>>>>>>>>> _could_
>>>>>>>>>> : help, but even my new Mapper approach patch (LUCENE-868)
>>>>>>> doesn't
>>>>>>>>>> : really help, because they are stored in a term-centric
>>>>>>> manner.  I
>>>>>>>>>> : guess what I am after is a position centric approach.  That
>>>>>>>>> is, give
>>>>>>>>>>
>>>>>>>>>> this is kind of what i was suggesting in the last message i
>>>>>>>>>> sent
>>>>>>>>>> to the java-user thread about paylods and SpanQueries (which
>>>>>>>>>> i'm
>>>>>>>>>> guessing is what prompted this thread as well)...
>>>>>>>>>>
>>>>>>>>>> http://www.nabble.com/Payloads-and-PhraseQuery-
>>>>>>>>>> tf3988826.html#a11551628
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> This is one use case, the other is related to the new patch I
>>>>>>>>> submitted for LUCENE-960.  In this case, I have a
>>>>>>>>> SpanQueryFilter
>>>>>>>>> that identifies a bunch of docs and positions ahead of time.
>>>>>>>>> Then
>>>>>
>>>>>>>>> the user enters new Span Query and I want to relate the  
>>>>>>>>> matches
>>>>>>> from
>>>>>>>>> the user query with the positions of matches in the filter and
>>>>>>> then
>>>>>>>>> show that window.
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> my point was that currently, to retrieve a payload you need a
>>>>>>>>>> TermPositions instance, which is designed for iterating in  
>>>>>>>>>> the
>>>>>>>>>> order of...
>>>>>>>>>>     seek(term)
>>>>>>>>>>       skipTo(doc)
>>>>>>>>>>          nextPosition()
>>>>>>>>>>             getPayload()
>>>>>>>>>> ...which is great for getting the payload of every instance
>>>>>>>>>> (ie:position) of a specific term in a given document (or in
>>>>>>> every
>>>>>>>>>> document) but without serious changes to the Spans API, the
>>>>>>> ideal
>>>>>>>>>> payload
>>>>>>>>>> API would let you say...
>>>>>>>>>>     skipTo(doc)
>>>>>>>>>>        advance(startPosition)
>>>>>>>>>>          getPayload()
>>>>>>>>>>        while (nextPosition() < endPosition)
>>>>>>>>>>          getPosition()
>>>>>>>>>>
>>>>>>>>>> but this seems like a nearly impossible API to implement
>>>>>>> given the
>>>>>>>>>> natore
>>>>>>>>>> of hte inverted index and the fact that terms aren't ever
>>>>>>> stored in
>>>>>>>>>> position order.
>>>>>>>>>>
>>>>>>>>>> there's a lot i really don't know/understand about the lucene
>>>>>>> term
>>>>>>>>>> position internals ... but as i recall, the datastructure
>>>>>>> written
>>>>>>>>>> to disk
>>>>>>>>>> isn't actually a tree structure inverted index, it's a long
>>>>>>>>>> sequence of
>>>>>>>>>> tuples correct?  so in theory you could scan along the tuples
>>>>>>>>>> untill you
>>>>>>>>>> find the doc you are interested in, ignoring all of the term
>>>>>>> info
>>>>>>>>>> along
>>>>>>>>>> the way, then whatever term you happen be on at the  
>>>>>>>>>> moment, you
>>>>>>>>>> could scan
>>>>>>>>>> along all of the positions until you find one in the range
>>>>>>> you are
>>>>>>>>>> interested in -- assuming you do, then you record the current
>>>>>>> Term
>>>>>>>>>> (and
>>>>>>>>>> read your payload data if interested)
>>>>>>>>>
>>>>>>>>> I think the main issue I see is in both the payloads and the
>>>>>>> matching
>>>>>>>>> case above is that they require a document centric approach.
>>>>>>>>> And
>>>>>>>>> then, for each Document,
>>>>>>>>> you ideally want to be able to just index into an array so  
>>>>>>>>> that
>>>>>>> you
>>>>>>>>> can go directly to the position that is needed based on
>>>>>>>>> Span.getStart()
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> if i remember correctly, the first part of this is easy, and
>>>>>>>>>> relative fast
>>>>>>>>>> -- i think skipTo(doc) on a TermDoc or TermPositions will
>>>>>>> happily
>>>>>>>>>> scan for
>>>>>>>>>> the first <term,doc> pair with the correct docId,
>>>>>>> irregardless of
>>>>>>>>>> the term
>>>>>>>>>> ... the only thing i'm not sure about is how efficient it  
>>>>>>>>>> is to
>>>>>>>>>> loop over
>>>>>>>>>> nextPosition() for every term you find to see if any of them
>>>>>>> are in
>>>>>>>>>> your
>>>>>>>>>> range ... the best case scenerio is that the first position
>>>>>>>>>> returned is
>>>>>>>>>> above the high end of your range, in which case you can stop
>>>>>>>>>> immediately
>>>>>>>>>> and seek to the next term -- butthe worst case is that you  
>>>>>>>>>> call
>>>>>>>>>> nextPosition() over an over a lot of times before you get a
>>>>>>>>>> position in
>>>>>>>>>> (or above) your rnage .... an advancePosition(pos) that
>>>>>>> wokred like
>>>>>>>>>> seek
>>>>>>>>>> or skipTo might be helpful here.
>>>>>>>>>>
>>>>>>>>>> : I feel like I am missing something obvious.  I would
>>>>>>> suspect the
>>>>>>>>>> : highlighter needs to do this, but it seems to take the
>>>>>>> reanalyze
>>>>>>>>>> : approach as well (I admit, though, that I have little
>>>>>>> experience
>>>>>>>>>> with
>>>>>>>>>> : the highlighter.)
>>>>>>>>>>
>>>>>>>>>> as i understand it the default case is to reanalyze, but  
>>>>>>>>>> if you
>>>>>>>>> have
>>>>>>>>>> TermFreqVector info stored with positions (ie: a
>>>>>>>>>> TermPositionVector) then
>>>>>>>>>> it can use that to construct a TokenStream by iterating over
>>>>>>>>>> all
>>>>>>>>>> terms and
>>>>>>>>>> writing them into a big array in position order (see the
>>>>>>>>>> TermSources class
>>>>>>>>>> in the highlighter)
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Ah, I see that now.  Thanks.
>>>>>>>>>>
>>>>>>>>>> this makes sense when highlighting because it doesn't know  
>>>>>>>>>> what
>>>>>>>>>> kind of
>>>>>>>>>> fragmenter is going to be used so it needs the whole
>>>>>>> TokenStream,
>>>>>>>>>> but it
>>>>>>>>>> seems less then ideal when you are only interested in a small
>>>>>>>>>> number of
>>>>>>>>>> position ranges that you know in advance.
>>>>>>>>>>
>>>>>>>>>> : I am wondering if it would be useful to have an alternative
>>>>>>> Term
>>>>>>>>>> : Vector storage mechanism that was position centric.
>>>>>>> Because we
>>>>>>>>>> : couldn't take advantage of the lexicographic  
>>>>>>>>>> compression, it
>>>>>>>>> would
>>>>>>>>>> : take up more disk space, but it would be a lot faster for
>>>>>>> these
>>>>>>>>>> kinds
>>>>>>>>>>
>>>>>>>>>> i'm not sure if it's really neccessary to store the data in a
>>>>>>>>> position
>>>>>>>>>> centric manner, assuming we have a way to "seek" by position
>>>>>>> like i
>>>>>>>>>> described above -- but then again i don't really know that
>>>>>>> what i
>>>>>>>>>> described above is all that possible/practical/performant.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I suppose I could use my Mapper approach to organize things  
>>>>>>>>> in a
>>>>>>>>> position centric way now that I think about it more.  Just
>>>>>>> means some
>>>>>>>>> unpacking and repacking.  Still, probably would perform well
>>>>>>> enough
>>>>>>>>> since I can setup the correct structure on the fly.  I will
>>>>>>> give this
>>>>>>>>> a try.  Maybe even add a Mapper to do this.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> -Grant
>>>>>>>>>
>>>>>>>>>
>>>>>>> ---------------------------------------------------------------- 
>>>>>>> --
>>>>>>> ---
>>>>>
>>>>>>>>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>>>>>>>>> For additional commands, e-mail: java-dev- 
>>>>>>>>> help@lucene.apache.org
>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>>> ------------------------------------------------------
>>>>>>> Grant Ingersoll
>>>>>>> http://www.grantingersoll.com/
>>>>>>> http://lucene.grantingersoll.com
>>>>>>> http://www.paperoftheweek.com/
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> ---------------------------------------------------------------- 
>>>>>>> --
>>>>>>> ---
>>>>>>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>>>>>>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>>>>>>>
>>>>>>>
>>>>>
>>>>> --------------------------
>>>>> Grant Ingersoll
>>>>> http://lucene.grantingersoll.com
>>>>>
>>>>> Lucene Helpful Hints:
>>>>> http://wiki.apache.org/lucene-java/BasicsOfPerformance
>>>>> http://wiki.apache.org/lucene-java/LuceneFAQ
>>>>>
>>>>>
>>>>>
>>>>> ------------------------------------------------------------------ 
>>>>> --
>>>>> -
>>>>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>>>>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>>>>>
>>>>>
>>>>
>>
>> --------------------------
>> Grant Ingersoll
>> http://lucene.grantingersoll.com
>>
>> Lucene Helpful Hints:
>> http://wiki.apache.org/lucene-java/BasicsOfPerformance
>> http://wiki.apache.org/lucene-java/LuceneFAQ
>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>>
>>

------------------------------------------------------
Grant Ingersoll
http://www.grantingersoll.com/
http://lucene.grantingersoll.com
http://www.paperoftheweek.com/





Re: Best Practices for getting Strings from a position range

Posted by Peter Keegan <pe...@gmail.com>.
ok, glad we're on the same page.

I did some performance testing with span queries and, unfortunately,  the
results are discouraging for my intended use. When I added a simple
SpanNearQuery to existing queries, the throughput decreased by a factor of
10+. I figured spans would be expensive, but not that much. I haven't done
profiling yet, but here's a typical thread stack during execution:

at org.apache.lucene.util.PriorityQueue.downHeap(PriorityQueue.java:137)
at org.apache.lucene.util.PriorityQueue.adjustTop(PriorityQueue.java:101)
at org.apache.lucene.search.spans.NearSpansUnordered.next(NearSpansUnordered.java:128)
at org.apache.lucene.search.spans.SpanScorer.setFreqCurrentDoc(SpanScorer.java:83)
at org.apache.lucene.search.spans.SpanScorer.next(SpanScorer.java:57)
at org.apache.lucene.search.ConjunctionScorer.next(ConjunctionScorer.java:56)
at org.apache.lucene.search.BooleanScorer2.score(BooleanScorer2.java:327)
at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:146)
at org.apache.lucene.search.Searcher.search(Searcher.java:118)

Peter


On 8/10/07, Grant Ingersoll <gs...@apache.org> wrote:
>
> Sorry for the confusion.  I thought you just wanted access to the
> term info per position.    I think we will have to add something to
> the Spans like we talked about before.
>
> -Grant
>
> On Aug 10, 2007, at 11:03 AM, Peter Keegan wrote:
>
> > Grant,
> >
> > I'm afraid I don't understand how to use this mapper in the context
> > of a
> > SpanQuery. It seems like I would have to modify SpanScorer to fetch
> > payload
> > data and provide a new method to access the payloads while
> > iterating through
> > the documents. If this can be accomplished without modifying Spans,
> > could
> > you provide a bit more detail?
> >
> > Thanks,
> > Peter
> >
> > On 8/9/07, Peter Keegan <pe...@gmail.com> wrote:
> >>
> >> Hi Grant,
> >>
> >> I'm hoping to check this out soon.
> >>
> >> Thanks,
> >> Peter
> >>
> >> On 8/7/07, Grant Ingersoll <gsingers@apache.org > wrote:
> >>>
> >>> Hi Peter,
> >>>
> >>> Give https://issues.apache.org/jira/browse/LUCENE-975 a try.  It
> >>> provides a TermVectorMapper that loads by position.
> >>>
> >>> Still not what ideally what you want, but I haven't had time to
> >>> scope
> >>> that one out yet.,
> >>>
> >>> -Grant
> >>>
> >>> On Jul 24, 2007, at 6:02 PM, Peter Keegan wrote:
> >>>
> >>>> Hi Grant,
> >>>>
> >>>> No problem - I know you are very busy.  I just wanted to get a
> >>>> sense for the
> >>>> timing because I'd like to use this for a release this Fall. If I
> >>>> can get a
> >>>> prototype working in the coming weeks AND the performance is
> >>>> great :) , this
> >>>> would be terrific. If not, I'll have to fall back on a more complex
> >>>> design
> >>>> that handles the query outside of Lucene :(
> >>>>
> >>>> In the meantime, I'll try playing with LUCENE-868.
> >>>>
> >>>> Thanks for the update.
> >>>> Peter
> >>>>
> >>>> On 7/24/07, Grant Ingersoll <grant.ingersoll@gmail.com > wrote:
> >>>>>
> >>>>> Sorry, Peter, I haven't had a chance to work on it.  I don't
> >>>>> see it
> >>>>> happening this week, but maybe next.
> >>>>>
> >>>>> I do think the Mapper approach via TermVectors will work.  It will
> >>>>> require implementing a new mapper that orders by position, but I
> >>>>> don't think that is too hard.   I started on one on the LUCENE-868
> >>>>> patch (version 4) but it is not complete.  Maybe you want to pick
> >>>>> it up?
> >>>>>
> >>>>> With this approach, you would iterate your spans, when you come
> >>>>> to a
> >>>>> new doc, you would load the term vector using the
> >>>>> PositionMapper, and
> >>>>> then you could index into the positions for the matches in the
> >>>>> document.
> >>>>>
> >>>>> I realize this does not cover the just wanting to get the
> >>>>> Payload at
> >>>>> the match issue.  Maybe next week...
> >>>>>
> >>>>> Cheers,
> >>>>> Grant
> >>>>>
> >>>>> On Jul 23, 2007, at 8:51 AM, Peter Keegan wrote:
> >>>>>
> >>>>>> Any idea on when this might be available (days, weeks...)?
> >>>>>>
> >>>>>> Peter
> >>>>>>
> >>>>>> On 7/16/07, Grant Ingersoll < gsingers@apache.org> wrote:
> >>>>>>>
> >>>>>>>
> >>>>>>> On Jul 16, 2007, at 1:06 AM, Chris Hostetter wrote:
> >>>>>>>
> >>>>>>>>
> >>>>>>>> : Do we have a best practice for going from, say a SpanQuery
> >>>>> doc/
> >>>>>>>> : position information and retrieving the actual range of
> >>>>>>> positions of
> >>>>>>>> : content from the Document?  Is it just to reanalyze the
> >>>>> Document
> >>>>>>>> : using the appropriate Analyzer and start recording once you
> >>>>>>> hit the
> >>>>>>>> : positions you are interested in?    Seems like Term Vectors
> >>>>>>> _could_
> >>>>>>>> : help, but even my new Mapper approach patch (LUCENE-868)
> >>>>> doesn't
> >>>>>>>> : really help, because they are stored in a term-centric
> >>>>> manner.  I
> >>>>>>>> : guess what I am after is a position centric approach.  That
> >>>>>>> is, give
> >>>>>>>>
> >>>>>>>> this is kind of what i was suggesting in the last message i
> >>>>>>>> sent
> >>>>>>>> to the java-user thread about paylods and SpanQueries (which
> >>>>>>>> i'm
> >>>>>>>> guessing is what prompted this thread as well)...
> >>>>>>>>
> >>>>>>>> http://www.nabble.com/Payloads-and-PhraseQuery-
> >>>>>>>> tf3988826.html#a11551628
> >>>>>>>
> >>>>>>>
> >>>>>>> This is one use case, the other is related to the new patch I
> >>>>>>> submitted for LUCENE-960.  In this case, I have a
> >>>>>>> SpanQueryFilter
> >>>>>>> that identifies a bunch of docs and positions ahead of time.
> >>>>>>> Then
> >>>
> >>>>>>> the user enters new Span Query and I want to relate the matches
> >>>>> from
> >>>>>>> the user query with the positions of matches in the filter and
> >>>>> then
> >>>>>>> show that window.
> >>>>>>>
> >>>>>>>>
> >>>>>>>> my point was that currently, to retrieve a payload you need a
> >>>>>>>> TermPositions instance, which is designed for iterating in the
> >>>>>>>> order of...
> >>>>>>>>     seek(term)
> >>>>>>>>       skipTo(doc)
> >>>>>>>>          nextPosition()
> >>>>>>>>             getPayload()
> >>>>>>>> ...which is great for getting the payload of every instance
> >>>>>>>> (ie:position) of a specific term in a given document (or in
> >>>>> every
> >>>>>>>> document) but without serious changes to the Spans API, the
> >>>>> ideal
> >>>>>>>> payload
> >>>>>>>> API would let you say...
> >>>>>>>>     skipTo(doc)
> >>>>>>>>        advance(startPosition)
> >>>>>>>>          getPayload()
> >>>>>>>>        while (nextPosition() < endPosition)
> >>>>>>>>          getPosition()
> >>>>>>>>
> >>>>>>>> but this seems like a nearly impossible API to implement
> >>>>> given the
> >>>>>>>> natore
> >>>>>>>> of hte inverted index and the fact that terms aren't ever
> >>>>> stored in
> >>>>>>>> position order.
> >>>>>>>>
> >>>>>>>> there's a lot i really don't know/understand about the lucene
> >>>>> term
> >>>>>>>> position internals ... but as i recall, the datastructure
> >>>>> written
> >>>>>>>> to disk
> >>>>>>>> isn't actually a tree structure inverted index, it's a long
> >>>>>>>> sequence of
> >>>>>>>> tuples correct?  so in theory you could scan along the tuples
> >>>>>>>> untill you
> >>>>>>>> find the doc you are interested in, ignoring all of the term
> >>>>> info
> >>>>>>>> along
> >>>>>>>> the way, then whatever term you happen be on at the moment, you
> >>>>>>>> could scan
> >>>>>>>> along all of the positions until you find one in the range
> >>>>> you are
> >>>>>>>> interested in -- assuming you do, then you record the current
> >>>>> Term
> >>>>>>>> (and
> >>>>>>>> read your payload data if interested)
> >>>>>>>
> >>>>>>> I think the main issue I see is in both the payloads and the
> >>>>> matching
> >>>>>>> case above is that they require a document centric approach.
> >>>>>>> And
> >>>>>>> then, for each Document,
> >>>>>>> you ideally want to be able to just index into an array so that
> >>>>> you
> >>>>>>> can go directly to the position that is needed based on
> >>>>>>> Span.getStart()
> >>>>>>>
> >>>>>>>>
> >>>>>>>> if i remember correctly, the first part of this is easy, and
> >>>>>>>> relative fast
> >>>>>>>> -- i think skipTo(doc) on a TermDoc or TermPositions will
> >>>>> happily
> >>>>>>>> scan for
> >>>>>>>> the first <term,doc> pair with the correct docId,
> >>>>> irregardless of
> >>>>>>>> the term
> >>>>>>>> ... the only thing i'm not sure about is how efficient it is to
> >>>>>>>> loop over
> >>>>>>>> nextPosition() for every term you find to see if any of them
> >>>>> are in
> >>>>>>>> your
> >>>>>>>> range ... the best case scenerio is that the first position
> >>>>>>>> returned is
> >>>>>>>> above the high end of your range, in which case you can stop
> >>>>>>>> immediately
> >>>>>>>> and seek to the next term -- butthe worst case is that you call
> >>>>>>>> nextPosition() over an over a lot of times before you get a
> >>>>>>>> position in
> >>>>>>>> (or above) your rnage .... an advancePosition(pos) that
> >>>>> wokred like
> >>>>>>>> seek
> >>>>>>>> or skipTo might be helpful here.
> >>>>>>>>
> >>>>>>>> : I feel like I am missing something obvious.  I would
> >>>>> suspect the
> >>>>>>>> : highlighter needs to do this, but it seems to take the
> >>>>> reanalyze
> >>>>>>>> : approach as well (I admit, though, that I have little
> >>>>> experience
> >>>>>>>> with
> >>>>>>>> : the highlighter.)
> >>>>>>>>
> >>>>>>>> as i understand it the default case is to reanalyze, but if you
> >>>>>>> have
> >>>>>>>> TermFreqVector info stored with positions (ie: a
> >>>>>>>> TermPositionVector) then
> >>>>>>>> it can use that to construct a TokenStream by iterating over
> >>>>>>>> all
> >>>>>>>> terms and
> >>>>>>>> writing them into a big array in position order (see the
> >>>>>>>> TermSources class
> >>>>>>>> in the highlighter)
> >>>>>>>
> >>>>>>>
> >>>>>>> Ah, I see that now.  Thanks.
> >>>>>>>>
> >>>>>>>> this makes sense when highlighting because it doesn't know what
> >>>>>>>> kind of
> >>>>>>>> fragmenter is going to be used so it needs the whole
> >>>>> TokenStream,
> >>>>>>>> but it
> >>>>>>>> seems less then ideal when you are only interested in a small
> >>>>>>>> number of
> >>>>>>>> position ranges that you know in advance.
> >>>>>>>>
> >>>>>>>> : I am wondering if it would be useful to have an alternative
> >>>>> Term
> >>>>>>>> : Vector storage mechanism that was position centric.
> >>>>> Because we
> >>>>>>>> : couldn't take advantage of the lexicographic compression, it
> >>>>>>> would
> >>>>>>>> : take up more disk space, but it would be a lot faster for
> >>>>> these
> >>>>>>>> kinds
> >>>>>>>>
> >>>>>>>> i'm not sure if it's really neccessary to store the data in a
> >>>>>>> position
> >>>>>>>> centric manner, assuming we have a way to "seek" by position
> >>>>> like i
> >>>>>>>> described above -- but then again i don't really know that
> >>>>> what i
> >>>>>>>> described above is all that possible/practical/performant.
> >>>>>>>>
> >>>>>>>
> >>>>>>> I suppose I could use my Mapper approach to organize things in a
> >>>>>>> position centric way now that I think about it more.  Just
> >>>>> means some
> >>>>>>> unpacking and repacking.  Still, probably would perform well
> >>>>> enough
> >>>>>>> since I can setup the correct structure on the fly.  I will
> >>>>> give this
> >>>>>>> a try.  Maybe even add a Mapper to do this.
> >>>>>>>
> >>>>>>>
> >>>>>>> -Grant
> >>>>>>>
> >>>>>>>
> >>>>> ------------------------------------------------------------------
> >>>>> ---
> >>>
> >>>>>>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> >>>>>>> For additional commands, e-mail: java-dev-help@lucene.apache.org
> >>>>>>>
> >>>>>>>
> >>>>>
> >>>>> ------------------------------------------------------
> >>>>> Grant Ingersoll
> >>>>> http://www.grantingersoll.com/
> >>>>> http://lucene.grantingersoll.com
> >>>>> http://www.paperoftheweek.com/
> >>>>>
> >>>>>
> >>>>>
> >>>>> ------------------------------------------------------------------
> >>>>> ---
> >>>>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> >>>>> For additional commands, e-mail: java-dev-help@lucene.apache.org
> >>>>>
> >>>>>
> >>>
> >>> --------------------------
> >>> Grant Ingersoll
> >>> http://lucene.grantingersoll.com
> >>>
> >>> Lucene Helpful Hints:
> >>> http://wiki.apache.org/lucene-java/BasicsOfPerformance
> >>> http://wiki.apache.org/lucene-java/LuceneFAQ
> >>>
> >>>
> >>>
> >>> --------------------------------------------------------------------
> >>> -
> >>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> >>> For additional commands, e-mail: java-dev-help@lucene.apache.org
> >>>
> >>>
> >>
>
> --------------------------
> Grant Ingersoll
> http://lucene.grantingersoll.com
>
> Lucene Helpful Hints:
> http://wiki.apache.org/lucene-java/BasicsOfPerformance
> http://wiki.apache.org/lucene-java/LuceneFAQ
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
>
>

Re: Best Practices for getting Strings from a position range

Posted by Grant Ingersoll <gs...@apache.org>.
Sorry for the confusion.  I thought you just wanted access to the  
term info per position.    I think we will have to add something to  
the Spans like we talked about before.
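
For concreteness, one possible shape for that addition -- purely a
sketch of an idea, nothing like this exists in the codebase today:

    // imports: java.util.Collection, java.io.IOException
    // hypothetical extension so a caller can read payloads while
    // iterating span matches, instead of re-seeking TermPositions
    public interface PayloadSpans extends Spans {
      /** payloads of the terms covered by the current match, if any */
      Collection/*<byte[]>*/ getPayload() throws IOException;
    }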

-Grant

On Aug 10, 2007, at 11:03 AM, Peter Keegan wrote:

> Grant,
>
> I'm afraid I don't understand how to use this mapper in the context  
> of a
> SpanQuery. It seems like I would have to modify SpanScorer to fetch  
> payload
> data and provide a new method to access the payloads while  
> iterating through
> the documents. If this can be accomplished without modifying Spans,  
> could
> you provide a bit more detail?
>
> Thanks,
> Peter
>
> On 8/9/07, Peter Keegan <pe...@gmail.com> wrote:
>>
>> Hi Grant,
>>
>> I'm hoping to check this out soon.
>>
>> Thanks,
>> Peter
>>
>> On 8/7/07, Grant Ingersoll <gsingers@apache.org > wrote:
>>>
>>> Hi Peter,
>>>
>>> Give https://issues.apache.org/jira/browse/LUCENE-975 a try.  It
>>> provides a TermVectorMapper that loads by position.
>>>
>>> Still not what ideally what you want, but I haven't had time to  
>>> scope
>>> that one out yet.,
>>>
>>> -Grant
>>>
>>> On Jul 24, 2007, at 6:02 PM, Peter Keegan wrote:
>>>
>>>> Hi Grant,
>>>>
>>>> No problem - I know you are very busy.  I just wanted to get a
>>>> sense for the
>>>> timing because I'd like to use this for a release this Fall. If I
>>>> can get a
>>>> prototype working in the coming weeks AND the performance is
>>>> great :) , this
>>>> would be terrific. If not, I'll have to fall back on a more complex
>>>> design
>>>> that handles the query outside of Lucene :(
>>>>
>>>> In the meantime, I'll try playing with LUCENE-868.
>>>>
>>>> Thanks for the update.
>>>> Peter
>>>>
>>>> On 7/24/07, Grant Ingersoll <grant.ingersoll@gmail.com > wrote:
>>>>>
>>>>> Sorry, Peter, I haven't had a chance to work on it.  I don't  
>>>>> see it
>>>>> happening this week, but maybe next.
>>>>>
>>>>> I do think the Mapper approach via TermVectors will work.  It will
>>>>> require implementing a new mapper that orders by position, but I
>>>>> don't think that is too hard.   I started on one on the LUCENE-868
>>>>> patch (version 4) but it is not complete.  Maybe you want to pick
>>>>> it up?
>>>>>
>>>>> With this approach, you would iterate your spans, when you come  
>>>>> to a
>>>>> new doc, you would load the term vector using the  
>>>>> PositionMapper, and
>>>>> then you could index into the positions for the matches in the
>>>>> document.
>>>>>
>>>>> I realize this does not cover the just wanting to get the  
>>>>> Payload at
>>>>> the match issue.  Maybe next week...
>>>>>
>>>>> Cheers,
>>>>> Grant
>>>>>
>>>>> On Jul 23, 2007, at 8:51 AM, Peter Keegan wrote:
>>>>>
>>>>>> Any idea on when this might be available (days, weeks...)?
>>>>>>
>>>>>> Peter
>>>>>>
>>>>>> On 7/16/07, Grant Ingersoll < gsingers@apache.org> wrote:
>>>>>>>
>>>>>>>
>>>>>>> On Jul 16, 2007, at 1:06 AM, Chris Hostetter wrote:
>>>>>>>
>>>>>>>>
>>>>>>>> : Do we have a best practice for going from, say a SpanQuery
>>>>> doc/
>>>>>>>> : position information and retrieving the actual range of
>>>>>>> positions of
>>>>>>>> : content from the Document?  Is it just to reanalyze the
>>>>> Document
>>>>>>>> : using the appropriate Analyzer and start recording once you
>>>>>>> hit the
>>>>>>>> : positions you are interested in?    Seems like Term Vectors
>>>>>>> _could_
>>>>>>>> : help, but even my new Mapper approach patch (LUCENE-868)
>>>>> doesn't
>>>>>>>> : really help, because they are stored in a term-centric
>>>>> manner.  I
>>>>>>>> : guess what I am after is a position centric approach.  That
>>>>>>> is, give
>>>>>>>>
>>>>>>>> this is kind of what i was suggesting in the last message i  
>>>>>>>> sent
>>>>>>>> to the java-user thread about paylods and SpanQueries (which  
>>>>>>>> i'm
>>>>>>>> guessing is what prompted this thread as well)...
>>>>>>>>
>>>>>>>> http://www.nabble.com/Payloads-and-PhraseQuery-
>>>>>>>> tf3988826.html#a11551628
>>>>>>>
>>>>>>>
>>>>>>> This is one use case, the other is related to the new patch I
>>>>>>> submitted for LUCENE-960.  In this case, I have a  
>>>>>>> SpanQueryFilter
>>>>>>> that identifies a bunch of docs and positions ahead of time.   
>>>>>>> Then
>>>
>>>>>>> the user enters new Span Query and I want to relate the matches
>>>>> from
>>>>>>> the user query with the positions of matches in the filter and
>>>>> then
>>>>>>> show that window.
>>>>>>>
>>>>>>>>
>>>>>>>> my point was that currently, to retrieve a payload you need a
>>>>>>>> TermPositions instance, which is designed for iterating in the
>>>>>>>> order of...
>>>>>>>>     seek(term)
>>>>>>>>       skipTo(doc)
>>>>>>>>          nextPosition()
>>>>>>>>             getPayload()
>>>>>>>> ...which is great for getting the payload of every instance
>>>>>>>> (ie:position) of a specific term in a given document (or in
>>>>> every
>>>>>>>> document) but without serious changes to the Spans API, the
>>>>> ideal
>>>>>>>> payload
>>>>>>>> API would let you say...
>>>>>>>>     skipTo(doc)
>>>>>>>>        advance(startPosition)
>>>>>>>>          getPayload()
>>>>>>>>        while (nextPosition() < endPosition)
>>>>>>>>          getPosition()
>>>>>>>>
>>>>>>>> but this seems like a nearly impossible API to implement
>>>>> given the
>>>>>>>> natore
>>>>>>>> of hte inverted index and the fact that terms aren't ever
>>>>> stored in
>>>>>>>> position order.
>>>>>>>>
>>>>>>>> there's a lot i really don't know/understand about the lucene
>>>>> term
>>>>>>>> position internals ... but as i recall, the datastructure
>>>>> written
>>>>>>>> to disk
>>>>>>>> isn't actually a tree structure inverted index, it's a long
>>>>>>>> sequence of
>>>>>>>> tuples correct?  so in theory you could scan along the tuples
>>>>>>>> untill you
>>>>>>>> find the doc you are interested in, ignoring all of the term
>>>>> info
>>>>>>>> along
>>>>>>>> the way, then whatever term you happen be on at the moment, you
>>>>>>>> could scan
>>>>>>>> along all of the positions until you find one in the range
>>>>> you are
>>>>>>>> interested in -- assuming you do, then you record the current
>>>>> Term
>>>>>>>> (and
>>>>>>>> read your payload data if interested)
>>>>>>>
>>>>>>> I think the main issue I see is in both the payloads and the
>>>>> matching
>>>>>>> case above is that they require a document centric approach.   
>>>>>>> And
>>>>>>> then, for each Document,
>>>>>>> you ideally want to be able to just index into an array so that
>>>>> you
>>>>>>> can go directly to the position that is needed based on
>>>>>>> Span.getStart()
>>>>>>>
>>>>>>>>
>>>>>>>> if i remember correctly, the first part of this is easy, and
>>>>>>>> relative fast
>>>>>>>> -- i think skipTo(doc) on a TermDoc or TermPositions will
>>>>> happily
>>>>>>>> scan for
>>>>>>>> the first <term,doc> pair with the correct docId,
>>>>> irregardless of
>>>>>>>> the term
>>>>>>>> ... the only thing i'm not sure about is how efficient it is to
>>>>>>>> loop over
>>>>>>>> nextPosition() for every term you find to see if any of them
>>>>> are in
>>>>>>>> your
>>>>>>>> range ... the best case scenerio is that the first position
>>>>>>>> returned is
>>>>>>>> above the high end of your range, in which case you can stop
>>>>>>>> immediately
>>>>>>>> and seek to the next term -- butthe worst case is that you call
>>>>>>>> nextPosition() over an over a lot of times before you get a
>>>>>>>> position in
>>>>>>>> (or above) your rnage .... an advancePosition(pos) that
>>>>> wokred like
>>>>>>>> seek
>>>>>>>> or skipTo might be helpful here.
>>>>>>>>
>>>>>>>> : I feel like I am missing something obvious.  I would
>>>>> suspect the
>>>>>>>> : highlighter needs to do this, but it seems to take the
>>>>> reanalyze
>>>>>>>> : approach as well (I admit, though, that I have little
>>>>> experience
>>>>>>>> with
>>>>>>>> : the highlighter.)
>>>>>>>>
>>>>>>>> as i understand it the default case is to reanalyze, but if you
>>>>>>> have
>>>>>>>> TermFreqVector info stored with positions (ie: a
>>>>>>>> TermPositionVector) then
>>>>>>>> it can use that to construct a TokenStream by iterating over  
>>>>>>>> all
>>>>>>>> terms and
>>>>>>>> writing them into a big array in position order (see the
>>>>>>>> TermSources class
>>>>>>>> in the highlighter)
>>>>>>>
>>>>>>>
>>>>>>> Ah, I see that now.  Thanks.
>>>>>>>>
>>>>>>>> this makes sense when highlighting because it doesn't know what
>>>>>>>> kind of
>>>>>>>> fragmenter is going to be used so it needs the whole
>>>>> TokenStream,
>>>>>>>> but it
>>>>>>>> seems less then ideal when you are only interested in a small
>>>>>>>> number of
>>>>>>>> position ranges that you know in advance.
>>>>>>>>
>>>>>>>> : I am wondering if it would be useful to have an alternative
>>>>> Term
>>>>>>>> : Vector storage mechanism that was position centric.
>>>>> Because we
>>>>>>>> : couldn't take advantage of the lexicographic compression, it
>>>>>>> would
>>>>>>>> : take up more disk space, but it would be a lot faster for
>>>>> these
>>>>>>>> kinds
>>>>>>>>
>>>>>>>> i'm not sure if it's really neccessary to store the data in a
>>>>>>> position
>>>>>>>> centric manner, assuming we have a way to "seek" by position
>>>>> like i
>>>>>>>> described above -- but then again i don't really know that
>>>>> what i
>>>>>>>> described above is all that possible/practical/performant.
>>>>>>>>
>>>>>>>
>>>>>>> I suppose I could use my Mapper approach to organize things in a
>>>>>>> position centric way now that I think about it more.  Just
>>>>> means some
>>>>>>> unpacking and repacking.  Still, probably would perform well
>>>>> enough
>>>>>>> since I can setup the correct structure on the fly.  I will
>>>>> give this
>>>>>>> a try.  Maybe even add a Mapper to do this.
>>>>>>>
>>>>>>>
>>>>>>> -Grant
>>>>>>>
>>>>>>>
>>>>> ------------------------------------------------------------------ 
>>>>> ---
>>>
>>>>>>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>>>>>>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>>>>>>>
>>>>>>>
>>>>>
>>>>> ------------------------------------------------------
>>>>> Grant Ingersoll
>>>>> http://www.grantingersoll.com/
>>>>> http://lucene.grantingersoll.com
>>>>> http://www.paperoftheweek.com/
>>>>>
>>>>>
>>>>>
>>>>> ------------------------------------------------------------------ 
>>>>> ---
>>>>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>>>>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>>>>>
>>>>>
>>>
>>> --------------------------
>>> Grant Ingersoll
>>> http://lucene.grantingersoll.com
>>>
>>> Lucene Helpful Hints:
>>> http://wiki.apache.org/lucene-java/BasicsOfPerformance
>>> http://wiki.apache.org/lucene-java/LuceneFAQ
>>>
>>>
>>>
>>> -------------------------------------------------------------------- 
>>> -
>>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>>>
>>>
>>

--------------------------
Grant Ingersoll
http://lucene.grantingersoll.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ



---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: Best Practices for getting Strings from a position range

Posted by Peter Keegan <pe...@gmail.com>.
Grant,

I'm afraid I don't understand how to use this mapper in the context of a
SpanQuery. It seems like I would have to modify SpanScorer to fetch payload
data and provide a new method to access the payloads while iterating through
the documents. If this can be accomplished without modifying Spans, could
you provide a bit more detail?
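
To make the question concrete, this is roughly the loop I am picturing, done
entirely outside Spans/SpanScorer by re-reading TermPositions for one of the
query terms (the field and term are made up, 'reader' and 'spanQuery' are
assumed, and it leans on the 2.2-era payload calls on TermPositions):

  // needs org.apache.lucene.index.Term, org.apache.lucene.index.TermPositions,
  // org.apache.lucene.search.spans.Spans and java.util.*
  Spans spans = spanQuery.getSpans(reader);
  TermPositions tp = reader.termPositions(new Term("contents", "quaker"));
  boolean more = tp.next();                     // move to the first posting
  Map<Integer, byte[]> payloadByPos = new HashMap<Integer, byte[]>();
  int currentDoc = -1;

  while (spans.next()) {
    if (spans.doc() != currentDoc) {            // new document: cache its payloads
      currentDoc = spans.doc();
      payloadByPos.clear();
      if (more && tp.doc() < currentDoc) {
        more = tp.skipTo(currentDoc);
      }
      if (more && tp.doc() == currentDoc) {
        for (int i = 0; i < tp.freq(); i++) {
          int pos = tp.nextPosition();
          if (tp.isPayloadAvailable()) {
            payloadByPos.put(pos, tp.getPayload(new byte[tp.getPayloadLength()], 0));
          }
        }
      }
    }
    for (int pos = spans.start(); pos < spans.end(); pos++) {
      byte[] payload = payloadByPos.get(pos);   // null if no payload at this slot
      // ... use the payload for this matched position ...
    }
  }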

Thanks,
Peter

On 8/9/07, Peter Keegan <pe...@gmail.com> wrote:
>
> Hi Grant,
>
> I'm hoping to check this out soon.
>
> Thanks,
> Peter
>
> On 8/7/07, Grant Ingersoll <gsingers@apache.org > wrote:
> >
> > Hi Peter,
> >
> > Give https://issues.apache.org/jira/browse/LUCENE-975 a try.  It
> > provides a TermVectorMapper that loads by position.
> >
> > Still not what ideally what you want, but I haven't had time to scope
> > that one out yet.,
> >
> > -Grant
> >
> > On Jul 24, 2007, at 6:02 PM, Peter Keegan wrote:
> >
> > > Hi Grant,
> > >
> > > No problem - I know you are very busy.  I just wanted to get a
> > > sense for the
> > > timing because I'd like to use this for a release this Fall. If I
> > > can get a
> > > prototype working in the coming weeks AND the performance is
> > > great :) , this
> > > would be terrific. If not, I'll have to fall back on a more complex
> > > design
> > > that handles the query outside of Lucene :(
> > >
> > > In the meantime, I'll try playing with LUCENE-868.
> > >
> > > Thanks for the update.
> > > Peter
> > >
> > > On 7/24/07, Grant Ingersoll <grant.ingersoll@gmail.com > wrote:
> > >>
> > >> Sorry, Peter, I haven't had a chance to work on it.  I don't see it
> > >> happening this week, but maybe next.
> > >>
> > >> I do think the Mapper approach via TermVectors will work.  It will
> > >> require implementing a new mapper that orders by position, but I
> > >> don't think that is too hard.   I started on one on the LUCENE-868
> > >> patch (version 4) but it is not complete.  Maybe you want to pick
> > >> it up?
> > >>
> > >> With this approach, you would iterate your spans, when you come to a
> > >> new doc, you would load the term vector using the PositionMapper, and
> > >> then you could index into the positions for the matches in the
> > >> document.
> > >>
> > >> I realize this does not cover the just wanting to get the Payload at
> > >> the match issue.  Maybe next week...
> > >>
> > >> Cheers,
> > >> Grant
> > >>
> > >> On Jul 23, 2007, at 8:51 AM, Peter Keegan wrote:
> > >>
> > >> > Any idea on when this might be available (days, weeks...)?
> > >> >
> > >> > Peter
> > >> >
> > >> > On 7/16/07, Grant Ingersoll < gsingers@apache.org> wrote:
> > >> >>
> > >> >>
> > >> >> On Jul 16, 2007, at 1:06 AM, Chris Hostetter wrote:
> > >> >>
> > >> >> >
> > >> >> > : Do we have a best practice for going from, say a SpanQuery
> > >> doc/
> > >> >> > : position information and retrieving the actual range of
> > >> >> positions of
> > >> >> > : content from the Document?  Is it just to reanalyze the
> > >> Document
> > >> >> > : using the appropriate Analyzer and start recording once you
> > >> >> hit the
> > >> >> > : positions you are interested in?    Seems like Term Vectors
> > >> >> _could_
> > >> >> > : help, but even my new Mapper approach patch (LUCENE-868)
> > >> doesn't
> > >> >> > : really help, because they are stored in a term-centric
> > >> manner.  I
> > >> >> > : guess what I am after is a position centric approach.  That
> > >> >> is, give
> > >> >> >
> > >> >> > this is kind of what i was suggesting in the last message i sent
> > >> >> > to the java-user thread about paylods and SpanQueries (which i'm
> > >> >> > guessing is what prompted this thread as well)...
> > >> >> >
> > >> >> > http://www.nabble.com/Payloads-and-PhraseQuery-
> > >> >> > tf3988826.html#a11551628
> > >> >>
> > >> >>
> > >> >> This is one use case, the other is related to the new patch I
> > >> >> submitted for LUCENE-960.  In this case, I have a SpanQueryFilter
> > >> >> that identifies a bunch of docs and positions ahead of time.  Then
> >
> > >> >> the user enters new Span Query and I want to relate the matches
> > >> from
> > >> >> the user query with the positions of matches in the filter and
> > >> then
> > >> >> show that window.
> > >> >>
> > >> >> >
> > >> >> > my point was that currently, to retrieve a payload you need a
> > >> >> > TermPositions instance, which is designed for iterating in the
> > >> >> > order of...
> > >> >> >     seek(term)
> > >> >> >       skipTo(doc)
> > >> >> >          nextPosition()
> > >> >> >             getPayload()
> > >> >> > ...which is great for getting the payload of every instance
> > >> >> > (ie:position) of a specific term in a given document (or in
> > >> every
> > >> >> > document) but without serious changes to the Spans API, the
> > >> ideal
> > >> >> > payload
> > >> >> > API would let you say...
> > >> >> >     skipTo(doc)
> > >> >> >        advance(startPosition)
> > >> >> >          getPayload()
> > >> >> >        while (nextPosition() < endPosition)
> > >> >> >          getPosition()
> > >> >> >
> > >> >> > but this seems like a nearly impossible API to implement
> > >> given the
> > >> >> > natore
> > >> >> > of hte inverted index and the fact that terms aren't ever
> > >> stored in
> > >> >> > position order.
> > >> >> >
> > >> >> > there's a lot i really don't know/understand about the lucene
> > >> term
> > >> >> > position internals ... but as i recall, the datastructure
> > >> written
> > >> >> > to disk
> > >> >> > isn't actually a tree structure inverted index, it's a long
> > >> >> > sequence of
> > >> >> > tuples correct?  so in theory you could scan along the tuples
> > >> >> > untill you
> > >> >> > find the doc you are interested in, ignoring all of the term
> > >> info
> > >> >> > along
> > >> >> > the way, then whatever term you happen be on at the moment, you
> > >> >> > could scan
> > >> >> > along all of the positions until you find one in the range
> > >> you are
> > >> >> > interested in -- assuming you do, then you record the current
> > >> Term
> > >> >> > (and
> > >> >> > read your payload data if interested)
> > >> >>
> > >> >> I think the main issue I see is in both the payloads and the
> > >> matching
> > >> >> case above is that they require a document centric approach.  And
> > >> >> then, for each Document,
> > >> >> you ideally want to be able to just index into an array so that
> > >> you
> > >> >> can go directly to the position that is needed based on
> > >> >> Span.getStart()
> > >> >>
> > >> >> >
> > >> >> > if i remember correctly, the first part of this is easy, and
> > >> >> > relative fast
> > >> >> > -- i think skipTo(doc) on a TermDoc or TermPositions will
> > >> happily
> > >> >> > scan for
> > >> >> > the first <term,doc> pair with the correct docId,
> > >> irregardless of
> > >> >> > the term
> > >> >> > ... the only thing i'm not sure about is how efficient it is to
> > >> >> > loop over
> > >> >> > nextPosition() for every term you find to see if any of them
> > >> are in
> > >> >> > your
> > >> >> > range ... the best case scenerio is that the first position
> > >> >> > returned is
> > >> >> > above the high end of your range, in which case you can stop
> > >> >> > immediately
> > >> >> > and seek to the next term -- butthe worst case is that you call
> > >> >> > nextPosition() over an over a lot of times before you get a
> > >> >> > position in
> > >> >> > (or above) your rnage .... an advancePosition(pos) that
> > >> wokred like
> > >> >> > seek
> > >> >> > or skipTo might be helpful here.
> > >> >> >
> > >> >> > : I feel like I am missing something obvious.  I would
> > >> suspect the
> > >> >> > : highlighter needs to do this, but it seems to take the
> > >> reanalyze
> > >> >> > : approach as well (I admit, though, that I have little
> > >> experience
> > >> >> > with
> > >> >> > : the highlighter.)
> > >> >> >
> > >> >> > as i understand it the default case is to reanalyze, but if you
> > >> >> have
> > >> >> > TermFreqVector info stored with positions (ie: a
> > >> >> > TermPositionVector) then
> > >> >> > it can use that to construct a TokenStream by iterating over all
> > >> >> > terms and
> > >> >> > writing them into a big array in position order (see the
> > >> >> > TermSources class
> > >> >> > in the highlighter)
> > >> >>
> > >> >>
> > >> >> Ah, I see that now.  Thanks.
> > >> >> >
> > >> >> > this makes sense when highlighting because it doesn't know what
> > >> >> > kind of
> > >> >> > fragmenter is going to be used so it needs the whole
> > >> TokenStream,
> > >> >> > but it
> > >> >> > seems less then ideal when you are only interested in a small
> > >> >> > number of
> > >> >> > position ranges that you know in advance.
> > >> >> >
> > >> >> > : I am wondering if it would be useful to have an alternative
> > >> Term
> > >> >> > : Vector storage mechanism that was position centric.
> > >> Because we
> > >> >> > : couldn't take advantage of the lexicographic compression, it
> > >> >> would
> > >> >> > : take up more disk space, but it would be a lot faster for
> > >> these
> > >> >> > kinds
> > >> >> >
> > >> >> > i'm not sure if it's really neccessary to store the data in a
> > >> >> position
> > >> >> > centric manner, assuming we have a way to "seek" by position
> > >> like i
> > >> >> > described above -- but then again i don't really know that
> > >> what i
> > >> >> > described above is all that possible/practical/performant.
> > >> >> >
> > >> >>
> > >> >> I suppose I could use my Mapper approach to organize things in a
> > >> >> position centric way now that I think about it more.  Just
> > >> means some
> > >> >> unpacking and repacking.  Still, probably would perform well
> > >> enough
> > >> >> since I can setup the correct structure on the fly.  I will
> > >> give this
> > >> >> a try.  Maybe even add a Mapper to do this.
> > >> >>
> > >> >>
> > >> >> -Grant
> > >> >>
> > >> >>
> > >> ---------------------------------------------------------------------
> >
> > >> >> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> > >> >> For additional commands, e-mail: java-dev-help@lucene.apache.org
> > >> >>
> > >> >>
> > >>
> > >> ------------------------------------------------------
> > >> Grant Ingersoll
> > >> http://www.grantingersoll.com/
> > >> http://lucene.grantingersoll.com
> > >> http://www.paperoftheweek.com/
> > >>
> > >>
> > >>
> > >> ---------------------------------------------------------------------
> > >> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> > >> For additional commands, e-mail: java-dev-help@lucene.apache.org
> > >>
> > >>
> >
> > --------------------------
> > Grant Ingersoll
> > http://lucene.grantingersoll.com
> >
> > Lucene Helpful Hints:
> > http://wiki.apache.org/lucene-java/BasicsOfPerformance
> > http://wiki.apache.org/lucene-java/LuceneFAQ
> >
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-dev-help@lucene.apache.org
> >
> >
>

Re: Best Practices for getting Strings from a position range

Posted by Peter Keegan <pe...@gmail.com>.
Hi Grant,

I'm hoping to check this out soon.

Thanks,
Peter

On 8/7/07, Grant Ingersoll <gs...@apache.org> wrote:
>
> Hi Peter,
>
> Give https://issues.apache.org/jira/browse/LUCENE-975 a try.  It
> provides a TermVectorMapper that loads by position.
>
> Still not what ideally what you want, but I haven't had time to scope
> that one out yet.,
>
> -Grant
>
> On Jul 24, 2007, at 6:02 PM, Peter Keegan wrote:
>
> > Hi Grant,
> >
> > No problem - I know you are very busy.  I just wanted to get a
> > sense for the
> > timing because I'd like to use this for a release this Fall. If I
> > can get a
> > prototype working in the coming weeks AND the performance is
> > great :) , this
> > would be terrific. If not, I'll have to fall back on a more complex
> > design
> > that handles the query outside of Lucene :(
> >
> > In the meantime, I'll try playing with LUCENE-868.
> >
> > Thanks for the update.
> > Peter
> >
> > On 7/24/07, Grant Ingersoll <gr...@gmail.com> wrote:
> >>
> >> Sorry, Peter, I haven't had a chance to work on it.  I don't see it
> >> happening this week, but maybe next.
> >>
> >> I do think the Mapper approach via TermVectors will work.  It will
> >> require implementing a new mapper that orders by position, but I
> >> don't think that is too hard.   I started on one on the LUCENE-868
> >> patch (version 4) but it is not complete.  Maybe you want to pick
> >> it up?
> >>
> >> With this approach, you would iterate your spans, when you come to a
> >> new doc, you would load the term vector using the PositionMapper, and
> >> then you could index into the positions for the matches in the
> >> document.
> >>
> >> I realize this does not cover the just wanting to get the Payload at
> >> the match issue.  Maybe next week...
> >>
> >> Cheers,
> >> Grant
> >>
> >> On Jul 23, 2007, at 8:51 AM, Peter Keegan wrote:
> >>
> >> > Any idea on when this might be available (days, weeks...)?
> >> >
> >> > Peter
> >> >
> >> > On 7/16/07, Grant Ingersoll <gs...@apache.org> wrote:
> >> >>
> >> >>
> >> >> On Jul 16, 2007, at 1:06 AM, Chris Hostetter wrote:
> >> >>
> >> >> >
> >> >> > : Do we have a best practice for going from, say a SpanQuery
> >> doc/
> >> >> > : position information and retrieving the actual range of
> >> >> positions of
> >> >> > : content from the Document?  Is it just to reanalyze the
> >> Document
> >> >> > : using the appropriate Analyzer and start recording once you
> >> >> hit the
> >> >> > : positions you are interested in?    Seems like Term Vectors
> >> >> _could_
> >> >> > : help, but even my new Mapper approach patch (LUCENE-868)
> >> doesn't
> >> >> > : really help, because they are stored in a term-centric
> >> manner.  I
> >> >> > : guess what I am after is a position centric approach.  That
> >> >> is, give
> >> >> >
> >> >> > this is kind of what i was suggesting in the last message i sent
> >> >> > to the java-user thread about paylods and SpanQueries (which i'm
> >> >> > guessing is what prompted this thread as well)...
> >> >> >
> >> >> > http://www.nabble.com/Payloads-and-PhraseQuery-
> >> >> > tf3988826.html#a11551628
> >> >>
> >> >>
> >> >> This is one use case, the other is related to the new patch I
> >> >> submitted for LUCENE-960.  In this case, I have a SpanQueryFilter
> >> >> that identifies a bunch of docs and positions ahead of time.  Then
> >> >> the user enters new Span Query and I want to relate the matches
> >> from
> >> >> the user query with the positions of matches in the filter and
> >> then
> >> >> show that window.
> >> >>
> >> >> >
> >> >> > my point was that currently, to retrieve a payload you need a
> >> >> > TermPositions instance, which is designed for iterating in the
> >> >> > order of...
> >> >> >     seek(term)
> >> >> >       skipTo(doc)
> >> >> >          nextPosition()
> >> >> >             getPayload()
> >> >> > ...which is great for getting the payload of every instance
> >> >> > (ie:position) of a specific term in a given document (or in
> >> every
> >> >> > document) but without serious changes to the Spans API, the
> >> ideal
> >> >> > payload
> >> >> > API would let you say...
> >> >> >     skipTo(doc)
> >> >> >        advance(startPosition)
> >> >> >          getPayload()
> >> >> >        while (nextPosition() < endPosition)
> >> >> >          getPosition()
> >> >> >
> >> >> > but this seems like a nearly impossible API to implement
> >> given the
> >> >> > natore
> >> >> > of hte inverted index and the fact that terms aren't ever
> >> stored in
> >> >> > position order.
> >> >> >
> >> >> > there's a lot i really don't know/understand about the lucene
> >> term
> >> >> > position internals ... but as i recall, the datastructure
> >> written
> >> >> > to disk
> >> >> > isn't actually a tree structure inverted index, it's a long
> >> >> > sequence of
> >> >> > tuples correct?  so in theory you could scan along the tuples
> >> >> > untill you
> >> >> > find the doc you are interested in, ignoring all of the term
> >> info
> >> >> > along
> >> >> > the way, then whatever term you happen be on at the moment, you
> >> >> > could scan
> >> >> > along all of the positions until you find one in the range
> >> you are
> >> >> > interested in -- assuming you do, then you record the current
> >> Term
> >> >> > (and
> >> >> > read your payload data if interested)
> >> >>
> >> >> I think the main issue I see is in both the payloads and the
> >> matching
> >> >> case above is that they require a document centric approach.  And
> >> >> then, for each Document,
> >> >> you ideally want to be able to just index into an array so that
> >> you
> >> >> can go directly to the position that is needed based on
> >> >> Span.getStart()
> >> >>
> >> >> >
> >> >> > if i remember correctly, the first part of this is easy, and
> >> >> > relative fast
> >> >> > -- i think skipTo(doc) on a TermDoc or TermPositions will
> >> happily
> >> >> > scan for
> >> >> > the first <term,doc> pair with the correct docId,
> >> irregardless of
> >> >> > the term
> >> >> > ... the only thing i'm not sure about is how efficient it is to
> >> >> > loop over
> >> >> > nextPosition() for every term you find to see if any of them
> >> are in
> >> >> > your
> >> >> > range ... the best case scenerio is that the first position
> >> >> > returned is
> >> >> > above the high end of your range, in which case you can stop
> >> >> > immediately
> >> >> > and seek to the next term -- butthe worst case is that you call
> >> >> > nextPosition() over an over a lot of times before you get a
> >> >> > position in
> >> >> > (or above) your rnage .... an advancePosition(pos) that
> >> wokred like
> >> >> > seek
> >> >> > or skipTo might be helpful here.
> >> >> >
> >> >> > : I feel like I am missing something obvious.  I would
> >> suspect the
> >> >> > : highlighter needs to do this, but it seems to take the
> >> reanalyze
> >> >> > : approach as well (I admit, though, that I have little
> >> experience
> >> >> > with
> >> >> > : the highlighter.)
> >> >> >
> >> >> > as i understand it the default case is to reanalyze, but if you
> >> >> have
> >> >> > TermFreqVector info stored with positions (ie: a
> >> >> > TermPositionVector) then
> >> >> > it can use that to construct a TokenStream by iterating over all
> >> >> > terms and
> >> >> > writing them into a big array in position order (see the
> >> >> > TermSources class
> >> >> > in the highlighter)
> >> >>
> >> >>
> >> >> Ah, I see that now.  Thanks.
> >> >> >
> >> >> > this makes sense when highlighting because it doesn't know what
> >> >> > kind of
> >> >> > fragmenter is going to be used so it needs the whole
> >> TokenStream,
> >> >> > but it
> >> >> > seems less then ideal when you are only interested in a small
> >> >> > number of
> >> >> > position ranges that you know in advance.
> >> >> >
> >> >> > : I am wondering if it would be useful to have an alternative
> >> Term
> >> >> > : Vector storage mechanism that was position centric.
> >> Because we
> >> >> > : couldn't take advantage of the lexicographic compression, it
> >> >> would
> >> >> > : take up more disk space, but it would be a lot faster for
> >> these
> >> >> > kinds
> >> >> >
> >> >> > i'm not sure if it's really neccessary to store the data in a
> >> >> position
> >> >> > centric manner, assuming we have a way to "seek" by position
> >> like i
> >> >> > described above -- but then again i don't really know that
> >> what i
> >> >> > described above is all that possible/practical/performant.
> >> >> >
> >> >>
> >> >> I suppose I could use my Mapper approach to organize things in a
> >> >> position centric way now that I think about it more.  Just
> >> means some
> >> >> unpacking and repacking.  Still, probably would perform well
> >> enough
> >> >> since I can setup the correct structure on the fly.  I will
> >> give this
> >> >> a try.  Maybe even add a Mapper to do this.
> >> >>
> >> >>
> >> >> -Grant
> >> >>
> >> >>
> >> ---------------------------------------------------------------------
> >> >> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> >> >> For additional commands, e-mail: java-dev-help@lucene.apache.org
> >> >>
> >> >>
> >>
> >> ------------------------------------------------------
> >> Grant Ingersoll
> >> http://www.grantingersoll.com/
> >> http://lucene.grantingersoll.com
> >> http://www.paperoftheweek.com/
> >>
> >>
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> >> For additional commands, e-mail: java-dev-help@lucene.apache.org
> >>
> >>
>
> --------------------------
> Grant Ingersoll
> http://lucene.grantingersoll.com
>
> Lucene Helpful Hints:
> http://wiki.apache.org/lucene-java/BasicsOfPerformance
> http://wiki.apache.org/lucene-java/LuceneFAQ
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
>
>

Re: Best Practices for getting Strings from a position range

Posted by Grant Ingersoll <gs...@apache.org>.
Hi Peter,

Give https://issues.apache.org/jira/browse/LUCENE-975 a try.  It  
provides a TermVectorMapper that loads by position.
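
In case it helps to picture it, a stripped-down mapper along those lines could
look roughly like this -- the class name is invented and this is not the
LUCENE-975 code itself; the setExpectations/map signatures are the
TermVectorMapper ones from the LUCENE-868 work as I understand them:

  import java.util.*;
  import org.apache.lucene.index.TermVectorMapper;
  import org.apache.lucene.index.TermVectorOffsetInfo;

  public class PositionOrderedMapper extends TermVectorMapper {
    // position -> terms occupying that slot (more than one term is possible if
    // the analyzer emitted tokens with a zero position increment)
    private final Map<Integer, List<String>> byPosition =
        new HashMap<Integer, List<String>>();

    public void setExpectations(String field, int numTerms,
                                boolean storeOffsets, boolean storePositions) {
      byPosition.clear();               // called once per field, so reset here
    }

    public void map(String term, int frequency,
                    TermVectorOffsetInfo[] offsets, int[] positions) {
      if (positions == null) {
        return;                         // field was not indexed with positions
      }
      for (int i = 0; i < positions.length; i++) {
        List<String> terms = byPosition.get(positions[i]);
        if (terms == null) {
          terms = new ArrayList<String>();
          byPosition.put(positions[i], terms);
        }
        terms.add(term);
      }
    }

    public List<String> getTerms(int position) {
      return byPosition.get(position);  // null if nothing is indexed at that slot
    }
  }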

Still not ideally what you want, but I haven't had time to scope  
that one out yet.

-Grant

On Jul 24, 2007, at 6:02 PM, Peter Keegan wrote:

> Hi Grant,
>
> No problem - I know you are very busy.  I just wanted to get a  
> sense for the
> timing because I'd like to use this for a release this Fall. If I  
> can get a
> prototype working in the coming weeks AND the performance is  
> great :) , this
> would be terrific. If not, I'll have to fall back on a more complex  
> design
> that handles the query outside of Lucene :(
>
> In the meantime, I'll try playing with LUCENE-868.
>
> Thanks for the update.
> Peter
>
> On 7/24/07, Grant Ingersoll <gr...@gmail.com> wrote:
>>
>> Sorry, Peter, I haven't had a chance to work on it.  I don't see it
>> happening this week, but maybe next.
>>
>> I do think the Mapper approach via TermVectors will work.  It will
>> require implementing a new mapper that orders by position, but I
>> don't think that is too hard.   I started on one on the LUCENE-868
>> patch (version 4) but it is not complete.  Maybe you want to pick  
>> it up?
>>
>> With this approach, you would iterate your spans, when you come to a
>> new doc, you would load the term vector using the PositionMapper, and
>> then you could index into the positions for the matches in the  
>> document.
>>
>> I realize this does not cover the just wanting to get the Payload at
>> the match issue.  Maybe next week...
>>
>> Cheers,
>> Grant
>>
>> On Jul 23, 2007, at 8:51 AM, Peter Keegan wrote:
>>
>> > Any idea on when this might be available (days, weeks...)?
>> >
>> > Peter
>> >
>> > On 7/16/07, Grant Ingersoll <gs...@apache.org> wrote:
>> >>
>> >>
>> >> On Jul 16, 2007, at 1:06 AM, Chris Hostetter wrote:
>> >>
>> >> >
>> >> > : Do we have a best practice for going from, say a SpanQuery  
>> doc/
>> >> > : position information and retrieving the actual range of
>> >> positions of
>> >> > : content from the Document?  Is it just to reanalyze the  
>> Document
>> >> > : using the appropriate Analyzer and start recording once you
>> >> hit the
>> >> > : positions you are interested in?    Seems like Term Vectors
>> >> _could_
>> >> > : help, but even my new Mapper approach patch (LUCENE-868)  
>> doesn't
>> >> > : really help, because they are stored in a term-centric  
>> manner.  I
>> >> > : guess what I am after is a position centric approach.  That
>> >> is, give
>> >> >
>> >> > this is kind of what i was suggesting in the last message i sent
>> >> > to the java-user thread about paylods and SpanQueries (which i'm
>> >> > guessing is what prompted this thread as well)...
>> >> >
>> >> > http://www.nabble.com/Payloads-and-PhraseQuery-
>> >> > tf3988826.html#a11551628
>> >>
>> >>
>> >> This is one use case, the other is related to the new patch I
>> >> submitted for LUCENE-960.  In this case, I have a SpanQueryFilter
>> >> that identifies a bunch of docs and positions ahead of time.  Then
>> >> the user enters new Span Query and I want to relate the matches  
>> from
>> >> the user query with the positions of matches in the filter and  
>> then
>> >> show that window.
>> >>
>> >> >
>> >> > my point was that currently, to retrieve a payload you need a
>> >> > TermPositions instance, which is designed for iterating in the
>> >> > order of...
>> >> >     seek(term)
>> >> >       skipTo(doc)
>> >> >          nextPosition()
>> >> >             getPayload()
>> >> > ...which is great for getting the payload of every instance
>> >> > (ie:position) of a specific term in a given document (or in  
>> every
>> >> > document) but without serious changes to the Spans API, the  
>> ideal
>> >> > payload
>> >> > API would let you say...
>> >> >     skipTo(doc)
>> >> >        advance(startPosition)
>> >> >          getPayload()
>> >> >        while (nextPosition() < endPosition)
>> >> >          getPosition()
>> >> >
>> >> > but this seems like a nearly impossible API to implement  
>> given the
>> >> > natore
>> >> > of hte inverted index and the fact that terms aren't ever  
>> stored in
>> >> > position order.
>> >> >
>> >> > there's a lot i really don't know/understand about the lucene  
>> term
>> >> > position internals ... but as i recall, the datastructure  
>> written
>> >> > to disk
>> >> > isn't actually a tree structure inverted index, it's a long
>> >> > sequence of
>> >> > tuples correct?  so in theory you could scan along the tuples
>> >> > untill you
>> >> > find the doc you are interested in, ignoring all of the term  
>> info
>> >> > along
>> >> > the way, then whatever term you happen be on at the moment, you
>> >> > could scan
>> >> > along all of the positions until you find one in the range  
>> you are
>> >> > interested in -- assuming you do, then you record the current  
>> Term
>> >> > (and
>> >> > read your payload data if interested)
>> >>
>> >> I think the main issue I see is in both the payloads and the  
>> matching
>> >> case above is that they require a document centric approach.  And
>> >> then, for each Document,
>> >> you ideally want to be able to just index into an array so that  
>> you
>> >> can go directly to the position that is needed based on
>> >> Span.getStart()
>> >>
>> >> >
>> >> > if i remember correctly, the first part of this is easy, and
>> >> > relative fast
>> >> > -- i think skipTo(doc) on a TermDoc or TermPositions will  
>> happily
>> >> > scan for
>> >> > the first <term,doc> pair with the correct docId,  
>> irregardless of
>> >> > the term
>> >> > ... the only thing i'm not sure about is how efficient it is to
>> >> > loop over
>> >> > nextPosition() for every term you find to see if any of them  
>> are in
>> >> > your
>> >> > range ... the best case scenerio is that the first position
>> >> > returned is
>> >> > above the high end of your range, in which case you can stop
>> >> > immediately
>> >> > and seek to the next term -- butthe worst case is that you call
>> >> > nextPosition() over an over a lot of times before you get a
>> >> > position in
>> >> > (or above) your rnage .... an advancePosition(pos) that  
>> wokred like
>> >> > seek
>> >> > or skipTo might be helpful here.
>> >> >
>> >> > : I feel like I am missing something obvious.  I would  
>> suspect the
>> >> > : highlighter needs to do this, but it seems to take the  
>> reanalyze
>> >> > : approach as well (I admit, though, that I have little  
>> experience
>> >> > with
>> >> > : the highlighter.)
>> >> >
>> >> > as i understand it the default case is to reanalyze, but if you
>> >> have
>> >> > TermFreqVector info stored with positions (ie: a
>> >> > TermPositionVector) then
>> >> > it can use that to construct a TokenStream by iterating over all
>> >> > terms and
>> >> > writing them into a big array in position order (see the
>> >> > TermSources class
>> >> > in the highlighter)
>> >>
>> >>
>> >> Ah, I see that now.  Thanks.
>> >> >
>> >> > this makes sense when highlighting because it doesn't know what
>> >> > kind of
>> >> > fragmenter is going to be used so it needs the whole  
>> TokenStream,
>> >> > but it
>> >> > seems less then ideal when you are only interested in a small
>> >> > number of
>> >> > position ranges that you know in advance.
>> >> >
>> >> > : I am wondering if it would be useful to have an alternative  
>> Term
>> >> > : Vector storage mechanism that was position centric.   
>> Because we
>> >> > : couldn't take advantage of the lexicographic compression, it
>> >> would
>> >> > : take up more disk space, but it would be a lot faster for  
>> these
>> >> > kinds
>> >> >
>> >> > i'm not sure if it's really neccessary to store the data in a
>> >> position
>> >> > centric manner, assuming we have a way to "seek" by position  
>> like i
>> >> > described above -- but then again i don't really know that  
>> what i
>> >> > described above is all that possible/practical/performant.
>> >> >
>> >>
>> >> I suppose I could use my Mapper approach to organize things in a
>> >> position centric way now that I think about it more.  Just  
>> means some
>> >> unpacking and repacking.  Still, probably would perform well  
>> enough
>> >> since I can setup the correct structure on the fly.  I will  
>> give this
>> >> a try.  Maybe even add a Mapper to do this.
>> >>
>> >>
>> >> -Grant
>> >>
>> >>  
>> ---------------------------------------------------------------------
>> >> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>> >> For additional commands, e-mail: java-dev-help@lucene.apache.org
>> >>
>> >>
>>
>> ------------------------------------------------------
>> Grant Ingersoll
>> http://www.grantingersoll.com/
>> http://lucene.grantingersoll.com
>> http://www.paperoftheweek.com/
>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>>
>>

--------------------------
Grant Ingersoll
http://lucene.grantingersoll.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ



---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: Best Practices for getting Strings from a position range

Posted by Peter Keegan <pe...@gmail.com>.
Hi Grant,

No problem - I know you are very busy.  I just wanted to get a sense for the
timing because I'd like to use this for a release this Fall. If I can get a
prototype working in the coming weeks AND the performance is great :) , this
would be terrific. If not, I'll have to fall back on a more complex design
that handles the query outside of Lucene :(

In the meantime, I'll try playing with LUCENE-868.

Thanks for the update.
Peter

On 7/24/07, Grant Ingersoll <gr...@gmail.com> wrote:
>
> Sorry, Peter, I haven't had a chance to work on it.  I don't see it
> happening this week, but maybe next.
>
> I do think the Mapper approach via TermVectors will work.  It will
> require implementing a new mapper that orders by position, but I
> don't think that is too hard.   I started on one on the LUCENE-868
> patch (version 4) but it is not complete.  Maybe you want to pick it up?
>
> With this approach, you would iterate your spans, when you come to a
> new doc, you would load the term vector using the PositionMapper, and
> then you could index into the positions for the matches in the document.
>
> I realize this does not cover the just wanting to get the Payload at
> the match issue.  Maybe next week...
>
> Cheers,
> Grant
>
> On Jul 23, 2007, at 8:51 AM, Peter Keegan wrote:
>
> > Any idea on when this might be available (days, weeks...)?
> >
> > Peter
> >
> > On 7/16/07, Grant Ingersoll <gs...@apache.org> wrote:
> >>
> >>
> >> On Jul 16, 2007, at 1:06 AM, Chris Hostetter wrote:
> >>
> >> >
> >> > : Do we have a best practice for going from, say a SpanQuery doc/
> >> > : position information and retrieving the actual range of
> >> positions of
> >> > : content from the Document?  Is it just to reanalyze the Document
> >> > : using the appropriate Analyzer and start recording once you
> >> hit the
> >> > : positions you are interested in?    Seems like Term Vectors
> >> _could_
> >> > : help, but even my new Mapper approach patch (LUCENE-868) doesn't
> >> > : really help, because they are stored in a term-centric manner.  I
> >> > : guess what I am after is a position centric approach.  That
> >> is, give
> >> >
> >> > this is kind of what i was suggesting in the last message i sent
> >> > to the java-user thread about paylods and SpanQueries (which i'm
> >> > guessing is what prompted this thread as well)...
> >> >
> >> > http://www.nabble.com/Payloads-and-PhraseQuery-
> >> > tf3988826.html#a11551628
> >>
> >>
> >> This is one use case, the other is related to the new patch I
> >> submitted for LUCENE-960.  In this case, I have a SpanQueryFilter
> >> that identifies a bunch of docs and positions ahead of time.  Then
> >> the user enters new Span Query and I want to relate the matches from
> >> the user query with the positions of matches in the filter and then
> >> show that window.
> >>
> >> >
> >> > my point was that currently, to retrieve a payload you need a
> >> > TermPositions instance, which is designed for iterating in the
> >> > order of...
> >> >     seek(term)
> >> >       skipTo(doc)
> >> >          nextPosition()
> >> >             getPayload()
> >> > ...which is great for getting the payload of every instance
> >> > (ie:position) of a specific term in a given document (or in every
> >> > document) but without serious changes to the Spans API, the ideal
> >> > payload
> >> > API would let you say...
> >> >     skipTo(doc)
> >> >        advance(startPosition)
> >> >          getPayload()
> >> >        while (nextPosition() < endPosition)
> >> >          getPosition()
> >> >
> >> > but this seems like a nearly impossible API to implement given the
> >> > natore
> >> > of hte inverted index and the fact that terms aren't ever stored in
> >> > position order.
> >> >
> >> > there's a lot i really don't know/understand about the lucene term
> >> > position internals ... but as i recall, the datastructure written
> >> > to disk
> >> > isn't actually a tree structure inverted index, it's a long
> >> > sequence of
> >> > tuples correct?  so in theory you could scan along the tuples
> >> > untill you
> >> > find the doc you are interested in, ignoring all of the term info
> >> > along
> >> > the way, then whatever term you happen be on at the moment, you
> >> > could scan
> >> > along all of the positions until you find one in the range you are
> >> > interested in -- assuming you do, then you record the current Term
> >> > (and
> >> > read your payload data if interested)
> >>
> >> I think the main issue I see is in both the payloads and the matching
> >> case above is that they require a document centric approach.  And
> >> then, for each Document,
> >> you ideally want to be able to just index into an array so that you
> >> can go directly to the position that is needed based on
> >> Span.getStart()
> >>
> >> >
> >> > if i remember correctly, the first part of this is easy, and
> >> > relative fast
> >> > -- i think skipTo(doc) on a TermDoc or TermPositions will happily
> >> > scan for
> >> > the first <term,doc> pair with the correct docId, irregardless of
> >> > the term
> >> > ... the only thing i'm not sure about is how efficient it is to
> >> > loop over
> >> > nextPosition() for every term you find to see if any of them are in
> >> > your
> >> > range ... the best case scenerio is that the first position
> >> > returned is
> >> > above the high end of your range, in which case you can stop
> >> > immediately
> >> > and seek to the next term -- butthe worst case is that you call
> >> > nextPosition() over an over a lot of times before you get a
> >> > position in
> >> > (or above) your rnage .... an advancePosition(pos) that wokred like
> >> > seek
> >> > or skipTo might be helpful here.
> >> >
> >> > : I feel like I am missing something obvious.  I would suspect the
> >> > : highlighter needs to do this, but it seems to take the reanalyze
> >> > : approach as well (I admit, though, that I have little experience
> >> > with
> >> > : the highlighter.)
> >> >
> >> > as i understand it the default case is to reanalyze, but if you
> >> have
> >> > TermFreqVector info stored with positions (ie: a
> >> > TermPositionVector) then
> >> > it can use that to construct a TokenStream by iterating over all
> >> > terms and
> >> > writing them into a big array in position order (see the
> >> > TermSources class
> >> > in the highlighter)
> >>
> >>
> >> Ah, I see that now.  Thanks.
> >> >
> >> > this makes sense when highlighting because it doesn't know what
> >> > kind of
> >> > fragmenter is going to be used so it needs the whole TokenStream,
> >> > but it
> >> > seems less then ideal when you are only interested in a small
> >> > number of
> >> > position ranges that you know in advance.
> >> >
> >> > : I am wondering if it would be useful to have an alternative Term
> >> > : Vector storage mechanism that was position centric.  Because we
> >> > : couldn't take advantage of the lexicographic compression, it
> >> would
> >> > : take up more disk space, but it would be a lot faster for these
> >> > kinds
> >> >
> >> > i'm not sure if it's really neccessary to store the data in a
> >> position
> >> > centric manner, assuming we have a way to "seek" by position like i
> >> > described above -- but then again i don't really know that what i
> >> > described above is all that possible/practical/performant.
> >> >
> >>
> >> I suppose I could use my Mapper approach to organize things in a
> >> position centric way now that I think about it more.  Just means some
> >> unpacking and repacking.  Still, probably would perform well enough
> >> since I can setup the correct structure on the fly.  I will give this
> >> a try.  Maybe even add a Mapper to do this.
> >>
> >>
> >> -Grant
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> >> For additional commands, e-mail: java-dev-help@lucene.apache.org
> >>
> >>
>
> ------------------------------------------------------
> Grant Ingersoll
> http://www.grantingersoll.com/
> http://lucene.grantingersoll.com
> http://www.paperoftheweek.com/
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
>
>

Re: Best Practices for getting Strings from a position range

Posted by Grant Ingersoll <gr...@gmail.com>.
Sorry, Peter, I haven't had a chance to work on it.  I don't see it  
happening this week, but maybe next.

I do think the Mapper approach via TermVectors will work.  It will  
require implementing a new mapper that orders by position, but I  
don't think that is too hard.   I started on one on the LUCENE-868  
patch (version 4) but it is not complete.  Maybe you want to pick it up?

With this approach, you would iterate your spans; when you come to a  
new doc, you would load the term vector using the PositionMapper, and  
then you could index into the positions for the matches in the document.
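
Very roughly, the driver loop I picture looks something like this (names are
illustrative; it assumes a mapper that ends up exposing a position -> terms
lookup, like the PositionOrderedMapper sketch earlier in the thread, plus the
getTermFreqVector(doc, field, mapper) overload from the patch):

  Spans spans = spanQuery.getSpans(reader);
  PositionOrderedMapper mapper = new PositionOrderedMapper();
  int lastDoc = -1;

  while (spans.next()) {
    if (spans.doc() != lastDoc) {
      lastDoc = spans.doc();
      // repopulate the mapper with this document's term vector
      reader.getTermFreqVector(lastDoc, "contents", mapper);
    }
    for (int pos = spans.start(); pos < spans.end(); pos++) {
      List<String> termsAtPos = mapper.getTerms(pos);
      // termsAtPos holds the indexed token(s) at this matched slot, provided
      // the field was indexed with term vectors and positions
    }
  }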

I realize this does not cover the issue of just wanting to get the  
Payload at the match.  Maybe next week...

Cheers,
Grant

On Jul 23, 2007, at 8:51 AM, Peter Keegan wrote:

> Any idea on when this might be available (days, weeks...)?
>
> Peter
>
> On 7/16/07, Grant Ingersoll <gs...@apache.org> wrote:
>>
>>
>> On Jul 16, 2007, at 1:06 AM, Chris Hostetter wrote:
>>
>> >
>> > : Do we have a best practice for going from, say a SpanQuery doc/
>> > : position information and retrieving the actual range of  
>> positions of
>> > : content from the Document?  Is it just to reanalyze the Document
>> > : using the appropriate Analyzer and start recording once you  
>> hit the
>> > : positions you are interested in?    Seems like Term Vectors  
>> _could_
>> > : help, but even my new Mapper approach patch (LUCENE-868) doesn't
>> > : really help, because they are stored in a term-centric manner.  I
>> > : guess what I am after is a position centric approach.  That  
>> is, give
>> >
>> > this is kind of what i was suggesting in the last message i sent
>> > to the java-user thread about paylods and SpanQueries (which i'm
>> > guessing is what prompted this thread as well)...
>> >
>> > http://www.nabble.com/Payloads-and-PhraseQuery-
>> > tf3988826.html#a11551628
>>
>>
>> This is one use case, the other is related to the new patch I
>> submitted for LUCENE-960.  In this case, I have a SpanQueryFilter
>> that identifies a bunch of docs and positions ahead of time.  Then
>> the user enters new Span Query and I want to relate the matches from
>> the user query with the positions of matches in the filter and then
>> show that window.
>>
>> >
>> > my point was that currently, to retrieve a payload you need a
>> > TermPositions instance, which is designed for iterating in the
>> > order of...
>> >     seek(term)
>> >       skipTo(doc)
>> >          nextPosition()
>> >             getPayload()
>> > ...which is great for getting the payload of every instance
>> > (ie:position) of a specific term in a given document (or in every
>> > document) but without serious changes to the Spans API, the ideal
>> > payload
>> > API would let you say...
>> >     skipTo(doc)
>> >        advance(startPosition)
>> >          getPayload()
>> >        while (nextPosition() < endPosition)
>> >          getPosition()
>> >
>> > but this seems like a nearly impossible API to implement given the
>> > natore
>> > of hte inverted index and the fact that terms aren't ever stored in
>> > position order.
>> >
>> > there's a lot i really don't know/understand about the lucene term
>> > position internals ... but as i recall, the datastructure written
>> > to disk
>> > isn't actually a tree structure inverted index, it's a long
>> > sequence of
>> > tuples correct?  so in theory you could scan along the tuples
>> > untill you
>> > find the doc you are interested in, ignoring all of the term info
>> > along
>> > the way, then whatever term you happen be on at the moment, you
>> > could scan
>> > along all of the positions until you find one in the range you are
>> > interested in -- assuming you do, then you record the current Term
>> > (and
>> > read your payload data if interested)
>>
>> I think the main issue I see is in both the payloads and the matching
>> case above is that they require a document centric approach.  And
>> then, for each Document,
>> you ideally want to be able to just index into an array so that you
>> can go directly to the position that is needed based on  
>> Span.getStart()
>>
>> >
>> > if i remember correctly, the first part of this is easy, and
>> > relative fast
>> > -- i think skipTo(doc) on a TermDoc or TermPositions will happily
>> > scan for
>> > the first <term,doc> pair with the correct docId, irregardless of
>> > the term
>> > ... the only thing i'm not sure about is how efficient it is to
>> > loop over
>> > nextPosition() for every term you find to see if any of them are in
>> > your
>> > range ... the best case scenerio is that the first position
>> > returned is
>> > above the high end of your range, in which case you can stop
>> > immediately
>> > and seek to the next term -- butthe worst case is that you call
>> > nextPosition() over an over a lot of times before you get a
>> > position in
>> > (or above) your rnage .... an advancePosition(pos) that wokred like
>> > seek
>> > or skipTo might be helpful here.
>> >
>> > : I feel like I am missing something obvious.  I would suspect the
>> > : highlighter needs to do this, but it seems to take the reanalyze
>> > : approach as well (I admit, though, that I have little experience
>> > with
>> > : the highlighter.)
>> >
>> > as i understand it the default case is to reanalyze, but if you  
>> have
>> > TermFreqVector info stored with positions (ie: a
>> > TermPositionVector) then
>> > it can use that to construct a TokenStream by iterating over all
>> > terms and
>> > writing them into a big array in position order (see the
>> > TermSources class
>> > in the highlighter)
>>
>>
>> Ah, I see that now.  Thanks.
>> >
>> > this makes sense when highlighting because it doesn't know what
>> > kind of
>> > fragmenter is going to be used so it needs the whole TokenStream,
>> > but it
>> > seems less then ideal when you are only interested in a small
>> > number of
>> > position ranges that you know in advance.
>> >
>> > : I am wondering if it would be useful to have an alternative Term
>> > : Vector storage mechanism that was position centric.  Because we
>> > : couldn't take advantage of the lexicographic compression, it  
>> would
>> > : take up more disk space, but it would be a lot faster for these
>> > kinds
>> >
>> > i'm not sure if it's really neccessary to store the data in a  
>> position
>> > centric manner, assuming we have a way to "seek" by position like i
>> > described above -- but then again i don't really know that what i
>> > described above is all that possible/practical/performant.
>> >
>>
>> I suppose I could use my Mapper approach to organize things in a
>> position centric way now that I think about it more.  Just means some
>> unpacking and repacking.  Still, probably would perform well enough
>> since I can setup the correct structure on the fly.  I will give this
>> a try.  Maybe even add a Mapper to do this.
>>
>>
>> -Grant
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>>
>>

------------------------------------------------------
Grant Ingersoll
http://www.grantingersoll.com/
http://lucene.grantingersoll.com
http://www.paperoftheweek.com/



---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: Best Practices for getting Strings from a position range

Posted by Peter Keegan <pe...@gmail.com>.
Any idea on when this might be available (days, weeks...)?

Peter

On 7/16/07, Grant Ingersoll <gs...@apache.org> wrote:
>
>
> On Jul 16, 2007, at 1:06 AM, Chris Hostetter wrote:
>
> >
> > : Do we have a best practice for going from, say a SpanQuery doc/
> > : position information and retrieving the actual range of positions of
> > : content from the Document?  Is it just to reanalyze the Document
> > : using the appropriate Analyzer and start recording once you hit the
> > : positions you are interested in?    Seems like Term Vectors _could_
> > : help, but even my new Mapper approach patch (LUCENE-868) doesn't
> > : really help, because they are stored in a term-centric manner.  I
> > : guess what I am after is a position centric approach.  That is, give
> >
> > this is kind of what i was suggesting in the last message i sent
> > to the java-user thread about paylods and SpanQueries (which i'm
> > guessing is what prompted this thread as well)...
> >
> > http://www.nabble.com/Payloads-and-PhraseQuery-
> > tf3988826.html#a11551628
>
>
> This is one use case, the other is related to the new patch I
> submitted for LUCENE-960.  In this case, I have a SpanQueryFilter
> that identifies a bunch of docs and positions ahead of time.  Then
> the user enters new Span Query and I want to relate the matches from
> the user query with the positions of matches in the filter and then
> show that window.
>
> >
> > my point was that currently, to retrieve a payload you need a
> > TermPositions instance, which is designed for iterating in the
> > order of...
> >     seek(term)
> >       skipTo(doc)
> >          nextPosition()
> >             getPayload()
> > ...which is great for getting the payload of every instance
> > (ie:position) of a specific term in a given document (or in every
> > document) but without serious changes to the Spans API, the ideal
> > payload
> > API would let you say...
> >     skipTo(doc)
> >        advance(startPosition)
> >          getPayload()
> >        while (nextPosition() < endPosition)
> >          getPosition()
> >
> > but this seems like a nearly impossible API to implement given the
> > natore
> > of hte inverted index and the fact that terms aren't ever stored in
> > position order.
> >
> > there's a lot i really don't know/understand about the lucene term
> > position internals ... but as i recall, the datastructure written
> > to disk
> > isn't actually a tree structure inverted index, it's a long
> > sequence of
> > tuples correct?  so in theory you could scan along the tuples
> > untill you
> > find the doc you are interested in, ignoring all of the term info
> > along
> > the way, then whatever term you happen be on at the moment, you
> > could scan
> > along all of the positions until you find one in the range you are
> > interested in -- assuming you do, then you record the current Term
> > (and
> > read your payload data if interested)
>
> I think the main issue I see is in both the payloads and the matching
> case above is that they require a document centric approach.  And
> then, for each Document,
> you ideally want to be able to just index into an array so that you
> can go directly to the position that is needed based on Span.getStart()
>
> >
> > if i remember correctly, the first part of this is easy, and
> > relative fast
> > -- i think skipTo(doc) on a TermDoc or TermPositions will happily
> > scan for
> > the first <term,doc> pair with the correct docId, irregardless of
> > the term
> > ... the only thing i'm not sure about is how efficient it is to
> > loop over
> > nextPosition() for every term you find to see if any of them are in
> > your
> > range ... the best case scenerio is that the first position
> > returned is
> > above the high end of your range, in which case you can stop
> > immediately
> > and seek to the next term -- butthe worst case is that you call
> > nextPosition() over an over a lot of times before you get a
> > position in
> > (or above) your rnage .... an advancePosition(pos) that wokred like
> > seek
> > or skipTo might be helpful here.
> >
> > : I feel like I am missing something obvious.  I would suspect the
> > : highlighter needs to do this, but it seems to take the reanalyze
> > : approach as well (I admit, though, that I have little experience
> > with
> > : the highlighter.)
> >
> > as i understand it the default case is to reanalyze, but if you have
> > TermFreqVector info stored with positions (ie: a
> > TermPositionVector) then
> > it can use that to construct a TokenStream by iterating over all
> > terms and
> > writing them into a big array in position order (see the
> > TermSources class
> > in the highlighter)
>
>
> Ah, I see that now.  Thanks.
> >
> > this makes sense when highlighting because it doesn't know what
> > kind of
> > fragmenter is going to be used so it needs the whole TokenStream,
> > but it
> > seems less then ideal when you are only interested in a small
> > number of
> > position ranges that you know in advance.
> >
> > : I am wondering if it would be useful to have an alternative Term
> > : Vector storage mechanism that was position centric.  Because we
> > : couldn't take advantage of the lexicographic compression, it would
> > : take up more disk space, but it would be a lot faster for these
> > kinds
> >
> > i'm not sure if it's really neccessary to store the data in a position
> > centric manner, assuming we have a way to "seek" by position like i
> > described above -- but then again i don't really know that what i
> > described above is all that possible/practical/performant.
> >
>
> I suppose I could use my Mapper approach to organize things in a
> position centric way now that I think about it more.  Just means some
> unpacking and repacking.  Still, probably would perform well enough
> since I can setup the correct structure on the fly.  I will give this
> a try.  Maybe even add a Mapper to do this.
>
>
> -Grant
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
>
>

Re: Best Practices for getting Strings from a position range

Posted by Grant Ingersoll <gs...@apache.org>.
On Jul 16, 2007, at 1:06 AM, Chris Hostetter wrote:

>
> : Do we have a best practice for going from, say a SpanQuery doc/
> : position information and retrieving the actual range of positions of
> : content from the Document?  Is it just to reanalyze the Document
> : using the appropriate Analyzer and start recording once you hit the
> : positions you are interested in?    Seems like Term Vectors _could_
> : help, but even my new Mapper approach patch (LUCENE-868) doesn't
> : really help, because they are stored in a term-centric manner.  I
> : guess what I am after is a position centric approach.  That is, give
>
> this is kind of what i was suggesting in the last message i sent
> to the java-user thread about payloads and SpanQueries (which i'm
> guessing is what prompted this thread as well)...
>
> http://www.nabble.com/Payloads-and-PhraseQuery-tf3988826.html#a11551628


This is one use case; the other is related to the new patch I  
submitted for LUCENE-960.  In this case, I have a SpanQueryFilter  
that identifies a bunch of docs and positions ahead of time.  Then  
the user enters a new Span Query and I want to relate the matches from  
the user query with the positions of matches in the filter and then  
show that window.
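
Just to make the "relate" step concrete, what I have in mind is basically a
merge of the two span enumerations by document followed by a position-overlap
test -- something like this sketch (filterQuery/userQuery are placeholder
names, and it ignores how LUCENE-960 actually caches the filter results):

  Spans filterSpans = filterQuery.getSpans(reader);   // windows found ahead of time
  Spans querySpans  = userQuery.getSpans(reader);     // the user's span query
  boolean moreFilter = filterSpans.next();
  boolean moreQuery  = querySpans.next();

  while (moreFilter && moreQuery) {
    if (filterSpans.doc() < querySpans.doc()) {
      moreFilter = filterSpans.skipTo(querySpans.doc());
    } else if (querySpans.doc() < filterSpans.doc()) {
      moreQuery = querySpans.skipTo(filterSpans.doc());
    } else {
      // same document: does the user's match overlap a filter window?
      if (querySpans.start() < filterSpans.end()
          && filterSpans.start() < querySpans.end()) {
        // this is the window to show for this hit
      }
      // advance whichever span ends first
      if (querySpans.end() <= filterSpans.end()) {
        moreQuery = querySpans.next();
      } else {
        moreFilter = filterSpans.next();
      }
    }
  }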

>
> my point was that currently, to retrieve a payload you need a
> TermPositions instance, which is designed for iterating in the  
> order of...
>     seek(term)
>       skipTo(doc)
>          nextPosition()
>             getPayload()
> ...which is great for getting the payload of every instance
> (ie:position) of a specific term in a given document (or in every
> document) but without serious changes to the Spans API, the ideal  
> payload
> API would let you say...
>     skipTo(doc)
>        advance(startPosition)
>          getPayload()
>        while (nextPosition() < endPosition)
>          getPosition()
>
> but this seems like a nearly impossible API to implement given the  
> nature
> of the inverted index and the fact that terms aren't ever stored in
> position order.
>
> there's a lot i really don't know/understand about the lucene term
> position internals ... but as i recall, the datastructure written  
> to disk
> isn't actually a tree structure inverted index, it's a long  
> sequence of
> tuples correct?  so in theory you could scan along the tuples  
> until you
> find the doc you are interested in, ignoring all of the term info  
> along
> the way, then whatever term you happen to be on at the moment, you
> could scan
> along all of the positions until you find one in the range you are
> interested in -- assuming you do, then you record the current Term  
> (and
> read your payload data if interested)

I think the main issue in both the payloads case and the matching
case above is that they require a document-centric approach.  Then,
for each Document, you ideally want to be able to index directly into
an array so that you can go straight to the position that is needed
based on Span.getStart().
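
For example, something like this rough, untested sketch against the
existing TermPositionVector API (it assumes the field was indexed with
a term vector plus positions, and it punts on the multiple-terms-per-
position case, which a real version would need to handle):

import java.io.IOException;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermPositionVector;
import org.apache.lucene.search.spans.Spans;

public class SpanWindow {

  /** build a position -> term lookup table for one doc/field from its term vector */
  public static String[] positionToTerm(IndexReader reader, int docId, String field)
      throws IOException {
    TermPositionVector tpv =
        (TermPositionVector) reader.getTermFreqVector(docId, field);
    if (tpv == null) {
      return new String[0];  // field has no term vector stored
    }
    String[] terms = tpv.getTerms();
    // first pass: find the highest position so we can size the array
    int max = 0;
    for (int i = 0; i < terms.length; i++) {
      int[] positions = tpv.getTermPositions(i);
      for (int j = 0; j < positions.length; j++) {
        if (positions[j] > max) {
          max = positions[j];
        }
      }
    }
    // second pass: unpack the term-centric vector into position order
    String[] byPosition = new String[max + 1];
    for (int i = 0; i < terms.length; i++) {
      int[] positions = tpv.getTermPositions(i);
      for (int j = 0; j < positions.length; j++) {
        byPosition[positions[j]] = terms[i];  // last term wins if positions collide
      }
    }
    return byPosition;
  }

  /** print the terms covered by the current span match */
  public static void printWindow(String[] byPosition, Spans spans) {
    StringBuffer window = new StringBuffer();
    for (int pos = spans.start(); pos < spans.end() && pos < byPosition.length; pos++) {
      if (byPosition[pos] != null) {
        window.append(byPosition[pos]).append(' ');
      }
    }
    System.out.println(spans.doc() + ": " + window);
  }
}

A flat array keeps the Span.getStart() lookup constant-time; handling
multiple terms per position would just mean an array of Lists instead.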

>
> if i remember correctly, the first part of this is easy, and  
> relatively fast
> -- i think skipTo(doc) on a TermDoc or TermPositions will happily  
> scan for
> the first <term,doc> pair with the correct docId, regardless of
> the term
> ... the only thing i'm not sure about is how efficient it is to  
> loop over
> nextPosition() for every term you find to see if any of them are in  
> your
> range ... the best case scenario is that the first position
> returned is
> above the high end of your range, in which case you can stop  
> immediately
> and seek to the next term -- but the worst case is that you call
> nextPosition() over and over a lot of times before you get a
> position in
> (or above) your range .... an advancePosition(pos) that worked like
> seek
> or skipTo might be helpful here.
>
> : I feel like I am missing something obvious.  I would suspect the
> : highlighter needs to do this, but it seems to take the reanalyze
> : approach as well (I admit, though, that I have little experience  
> with
> : the highlighter.)
>
> as i understand it the default case is to reanalyze, but if you have
> TermFreqVector info stored with positions (ie: a  
> TermPositionVector) then
> it can use that to construct a TokenStream by iterating over all  
> terms and
> writing them into a big array in position order (see the  
> TermSources class
> in the highlighter)


Ah, I see that now.  Thanks.
>
> this makes sense when highlighting because it doesn't know what  
> kind of
> fragmenter is going to be used so it needs the whole TokenStream,  
> but it
> seems less than ideal when you are only interested in a small
> number of
> position ranges that you know in advance.
>
> : I am wondering if it would be useful to have an alternative Term
> : Vector storage mechanism that was position centric.  Because we
> : couldn't take advantage of the lexicographic compression, it would
> : take up more disk space, but it would be a lot faster for these  
> kinds
>
> i'm not sure if it's really necessary to store the data in a position
> centric manner, assuming we have a way to "seek" by position like i
> described above -- but then again i don't really know that what i
> described above is all that possible/practical/performant.
>

I suppose I could use my Mapper approach to organize things in a  
position centric way now that I think about it more.  Just means some  
unpacking and repacking.  Still, probably would perform well enough  
since I can set up the correct structure on the fly.  I will give this
a try.  Maybe even add a Mapper to do this.
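
Roughly what I have in mind, assuming the Mapper callback in the
LUCENE-868 patch keeps its current map(term, frequency, offsets,
positions) shape (untested):

import java.util.Map;
import java.util.TreeMap;

import org.apache.lucene.index.TermVectorMapper;
import org.apache.lucene.index.TermVectorOffsetInfo;

/** collects a doc's term vector keyed by position instead of by term */
public class PositionCentricMapper extends TermVectorMapper {

  // Integer position -> String term (ignores multiple terms per position for now)
  private Map byPosition = new TreeMap();

  public void setExpectations(String field, int numTerms,
                              boolean storeOffsets, boolean storePositions) {
    byPosition.clear();
  }

  public void map(String term, int frequency,
                  TermVectorOffsetInfo[] offsets, int[] positions) {
    if (positions == null) {
      return;  // field was indexed without positions
    }
    for (int i = 0; i < positions.length; i++) {
      byPosition.put(new Integer(positions[i]), term);
    }
  }

  public Map getByPosition() {
    return byPosition;
  }
}

A TreeMap keeps things sorted by position as they come in; a List per
position would cover the multiple-terms-per-position case.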


-Grant

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: Best Practices for getting Strings from a position range

Posted by Chris Hostetter <ho...@fucit.org>.
: Do we have a best practice for going from, say a SpanQuery doc/
: position information and retrieving the actual range of positions of
: content from the Document?  Is it just to reanalyze the Document
: using the appropriate Analyzer and start recording once you hit the
: positions you are interested in?    Seems like Term Vectors _could_
: help, but even my new Mapper approach patch (LUCENE-868) doesn't
: really help, because they are stored in a term-centric manner.  I
: guess what I am after is a position centric approach.  That is, give

this is kind of what i was suggesting in the last message i sent
to the java-user thread about paylods and SpanQueries (which i'm
guessing is what prompted this thread as well)...

http://www.nabble.com/Payloads-and-PhraseQuery-tf3988826.html#a11551628

my point was that currently, to retrieve a payload you need a
TermPositions instance, which is designed for iterating in the order of...
    seek(term)
      skipTo(doc)
         nextPosition()
            getPayload()
...which is great for getting the payload of every instance
(ie:position) of a specific term in a given document (or in every
document) but without serious changes to the Spans API, the ideal payload
API would let you say...
    skipTo(doc)
       advance(startPosition)
         getPayload()
       while (nextPosition() < endPosition)
         getPosition()

but this seems like a nearly impossible API to implement given the nature
of the inverted index and the fact that terms aren't ever stored in
position order.
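
for reference, the access pattern that does exist today looks roughly
like this (untested sketch, and i'm going from memory on the 2.2
TermPositions method names, so double check them):

import java.io.IOException;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermPositions;

public class PayloadWalk {

  /** walk every position of one term in one doc, reading payloads where present */
  public static void walk(IndexReader reader, Term term, int targetDoc)
      throws IOException {
    TermPositions tp = reader.termPositions(term);           // seek(term)
    try {
      if (tp.skipTo(targetDoc) && tp.doc() == targetDoc) {   // skipTo(doc)
        int freq = tp.freq();
        for (int i = 0; i < freq; i++) {
          int pos = tp.nextPosition();                       // nextPosition()
          if (tp.isPayloadAvailable()) {                     // getPayload()
            byte[] payload = tp.getPayload(new byte[tp.getPayloadLength()], 0);
            System.out.println(term.text() + " @ " + pos
                + " (" + payload.length + " payload bytes)");
          }
        }
      }
    } finally {
      tp.close();
    }
  }
}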

there's a lot i really don't know/understand about the lucene term
position internals ... but as i recall, the datastructure written to disk
isn't actually a tree structure inverted index, it's a long sequence of
tuples correct?  so in theory you could scan along the tuples until you
find the doc you are interested in, ignoring all of the term info along
the way, then whatever term you happen to be on at the moment, you could scan
along all of the positions until you find one in the range you are
interested in -- assuming you do, then you record the current Term (and
read your payload data if interested)
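
at the API level i think the closest thing to that scan is to walk the
TermEnum for the field and check each term's positions in the doc --
something like this untested sketch (probably slow on fields with lots
of unique terms):

import java.io.IOException;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;
import org.apache.lucene.index.TermPositions;

public class PositionRangeScan {

  /** print every term of the field whose position in targetDoc falls in [start, end) */
  public static void scan(IndexReader reader, String field,
                          int targetDoc, int start, int end) throws IOException {
    TermEnum terms = reader.terms(new Term(field, ""));  // first term of the field
    TermPositions tp = reader.termPositions();
    try {
      do {
        Term t = terms.term();
        if (t == null || !t.field().equals(field)) {
          break;  // walked off the end of the field
        }
        tp.seek(terms);
        if (tp.skipTo(targetDoc) && tp.doc() == targetDoc) {
          int freq = tp.freq();
          for (int i = 0; i < freq; i++) {
            int pos = tp.nextPosition();
            if (pos >= end) {
              break;  // positions come back in increasing order
            }
            if (pos >= start) {
              System.out.println(pos + " -> " + t.text());
            }
          }
        }
      } while (terms.next());
    } finally {
      tp.close();
      terms.close();
    }
  }
}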

if i remember correctly, the first part of this is easy, and relatively fast
-- i think skipTo(doc) on a TermDoc or TermPositions will happily scan for
the first <term,doc> pair with the correct docId, regardless of the term
... the only thing i'm not sure about is how efficient it is to loop over
nextPosition() for every term you find to see if any of them are in your
range ... the best case scenario is that the first position returned is
above the high end of your range, in which case you can stop immediately
and seek to the next term -- but the worst case is that you call
nextPosition() over and over a lot of times before you get a position in
(or above) your range .... an advancePosition(pos) that worked like seek
or skipTo might be helpful here.

: I feel like I am missing something obvious.  I would suspect the
: highlighter needs to do this, but it seems to take the reanalyze
: approach as well (I admit, though, that I have little experience with
: the highlighter.)

as i understand it the default case is to reanalyze, but if you have
TermFreqVector info stored with positions (ie: a TermPositionVector) then
it can use that to construct a TokenStream by iterating over all terms and
writing them into a big array in position order (see the TermSources class
in the highlighter)
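
(i think the class is actually called TokenSources ... if i'm
remembering the name and the getTokenStream(TermPositionVector)
signature right, usage is roughly this untested sketch:)

import java.io.IOException;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermPositionVector;
import org.apache.lucene.search.highlight.TokenSources;

public class RebuiltTokenStream {

  /** rebuild a position-ordered TokenStream from a stored term vector and dump it */
  public static void dump(IndexReader reader, int docId, String field)
      throws IOException {
    // assumes the field was indexed with a term vector that stores positions and offsets
    TermPositionVector tpv =
        (TermPositionVector) reader.getTermFreqVector(docId, field);
    TokenStream stream = TokenSources.getTokenStream(tpv);
    for (Token tok = stream.next(); tok != null; tok = stream.next()) {
      System.out.println(tok.termText()
          + " [" + tok.startOffset() + "-" + tok.endOffset() + ")");
    }
  }
}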

this makes sense when highlighting because it doesn't know what kind of
fragmenter is going to be used so it needs the whole TokenStream, but it
seems less than ideal when you are only interested in a small number of
position ranges that you know in advance.

: I am wondering if it would be useful to have an alternative Term
: Vector storage mechanism that was position centric.  Because we
: couldn't take advantage of the lexicographic compression, it would
: take up more disk space, but it would be a lot faster for these kinds

i'm not sure if it's really necessary to store the data in a position
centric manner, assuming we have a way to "seek" by position like i
described above -- but then again i don't really know that what i
described above is all that possible/practical/performant.




-Hoss


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org