Posted to java-user@lucene.apache.org by Alex Winston <al...@christianity.com> on 2002/11/08 20:23:39 UTC

Searching Ranges

i was hoping that someone could briefly review my current solution to a
problem that we have encountered to see if anyone could suggest a
possible alternative, because as it stands we have pushed lucene past
its current limits.

PROBLEM:

we want to store a range of values in a particular field and make that
field searchable by range.

an example follows for clarification:
we want to store a range of chapters and verses of a book for a
particular document, and then search to see whether a query range includes
the range that is represented in the index.

if this is unclear please ask for clarification

IMPRACTICAL SOLUTION:

although this solution seems somewhat impractical it is all we could
come up with.

our solution involved storing each possible value in the range as a term,
which allows RangeQuerys to be performed on this particular
field.  for very small ranges this seems reasonably practical after
profiling.  but once the field ranges began to span multiple
chapters and verses, the search times became unreasonable because we
were storing thousands of entries for each range.
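
As a rough illustration of the approach described above (the actual indexing code was not posted), the sketch below expands one chapter-and-verse range into one term per verse. The zero-padded CCCVVV key format, the class name VerseRangeTerms, and the fixed versesPerChapter value are assumptions made only for this example. Zero-padding is used so that string order agrees with numeric order, which a term-based range query relies on.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Lucene: expands a chapter:verse range
// into one sortable term per verse.
final class VerseRangeTerms {

    // Zero-padded key so that string order matches numeric order,
    // e.g. chapter 3, verse 16 becomes "003016".
    static String key(int chapter, int verse) {
        return String.format(Locale.ROOT, "%03d%03d", chapter, verse);
    }

    // Every verse from (startChapter, startVerse) to (endChapter, endVerse),
    // inclusive. versesPerChapter is a simplifying assumption; real books
    // have a different verse count in each chapter.
    static List<String> expand(int startChapter, int startVerse,
                               int endChapter, int endVerse,
                               int versesPerChapter) {
        List<String> terms = new ArrayList<>();
        for (int c = startChapter; c <= endChapter; c++) {
            int first = (c == startChapter) ? startVerse : 1;
            int last = (c == endChapter) ? endVerse : versesPerChapter;
            for (int v = first; v <= last; v++) {
                terms.add(key(c, v));
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        // A range inside one chapter stays small.
        System.out.println(expand(3, 16, 3, 18, 30));        // [003016, 003017, 003018]
        // A range across many chapters grows quickly.
        System.out.println(expand(1, 1, 50, 30, 30).size()); // 1500
    }
}
```

With this kind of expansion the number of terms per document grows with the width of the range, which is consistent with the thousands of entries mentioned above.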

i can elaborate on anything that is unclear,
but any thoughts on a possible alternative solution within lucene that
we overlooked would be extremely helpful.
	

alex

Re: Searching Ranges

Posted by Alex Winston <al...@christianity.com>.
unless i am mistaken this break will only occur if the current term
within the field is greater than (or equal to, when exclusive) the
upperTerm.  so even after a matching term has been found within the range,
it will still continue to iterate until a term meets this criterion or the
while loop ends.  perhaps this is the intended behavior and i am
overlooking something.
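
To make the point concrete, here is a simplified model of the loop under discussion. It is not the RangeQuery source: a sorted List of strings stands in for Lucene's term enumeration, the bounds are treated as inclusive, and the class and method names are invented for the example. It shows that an upper-bound break stops the scan only once the range has been passed, so every term inside the range is still collected.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified stand-in for the loop discussed above; not Lucene code.
final class UpperBoundBreak {

    // Walks sortedTerms in order and collects those in [lower, upper].
    // The only early exit is the upper-bound check.
    static List<String> collect(List<String> sortedTerms, String lower, String upper) {
        List<String> matched = new ArrayList<>();
        for (String term : sortedTerms) {
            if (term.compareTo(upper) > 0) {
                break; // past the range: no later term can match
            }
            if (term.compareTo(lower) >= 0) {
                matched.add(term); // in range: one clause per term
            }
        }
        return matched;
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("a", "c", "e", "g", "i", "k");
        System.out.println(collect(terms, "c", "g")); // prints [c, e, g]
    }
}
```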

any thoughts?

thanks
alex


On Tue, 2002-11-12 at 13:25, Doug Cutting wrote:
> Isn't the break on line 162 of RangeQuery.java supposed to achieve this?
-- 
Alex Winston <al...@christianity.com>

--
To unsubscribe, e-mail:   <ma...@jakarta.apache.org>
For additional commands, e-mail: <ma...@jakarta.apache.org>


Re: Searching Ranges

Posted by Doug Cutting <cu...@lucene.com>.
Isn't the break on line 162 of RangeQuery.java supposed to achieve this?






Re: Searching Ranges

Posted by Alex Winston <al...@christianity.com>.
otis,

i was able to fix the junit build problems that the newest versions of
ant have with the lucene unit tests.  it appears that junit.jar must
be in the $ANT_HOME/lib dir in order to run optional taskdefs such
as JUnitTask.

the following link was very helpful.
http://barracuda.enhydra.org/project/mailingLists/barracuda/msg04810.html

additionally i was able to unit test lucene with the one line change
that i suggested with success, although i have not looked into how
thorough the unit tests are for cases like this.

the diff follows from a cvs snapshot from yesterday (note the added
break;):
*** RangeQuery.java     Sat Nov  9 09:54:05 2002
--- RangeQuery.java.old Sat Nov  9 09:53:37 2002
***************
*** 164,170 ****
                              TermQuery tq = new TermQuery(term);         // found a match
                              tq.setBoost(boost);               // set the boost
                              q.add(tq, false, false);            // add to q
-                             break; //ADDED!
                          }
                      } 
                      else
--- 164,169 ----


i also pondered the ramifications of such a change, and have a few
thoughts.  it appears to be successful because it eliminates the
massive overhead of the byte[] built by the TermScorer when there are
thousands of terms, but a side-effect may be that it will not
return an accurate score.  i have yet to test this, and my understanding of
the code is still very limited.  although i do not have a firm grasp of
what is involved in scoring, would it be possible to score based
on the number of results matched for this particular field, as opposed to
the current implementation?
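
Beyond scoring, one way to reason about side effects is to model what the query contains after the change. The sketch below is not Lucene code. It assumes, for illustration only, that each in-range term becomes one optional clause and that a document matches if it contains any of the clause terms. The class name, the stopAfterFirst flag, and the sample documents are all made up. Under those assumptions, stopping after the first match leaves a single clause, so a document that holds only a later term in the range is no longer found.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy model of the one-line change; not Lucene code.
final class FirstMatchBreak {

    // Collects in-range terms (inclusive bounds) from the sorted term set,
    // optionally stopping after the first one, like the added break.
    static List<String> clauses(TreeSet<String> indexTerms, String lower,
                                String upper, boolean stopAfterFirst) {
        List<String> result = new ArrayList<>();
        for (String term : indexTerms.subSet(lower, true, upper, true)) {
            result.add(term);
            if (stopAfterFirst) {
                break;
            }
        }
        return result;
    }

    // Names of the documents that contain at least one clause term.
    static List<String> search(Map<String, Set<String>> docs, List<String> clauses) {
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, Set<String>> entry : docs.entrySet()) {
            for (String clause : clauses) {
                if (entry.getValue().contains(clause)) {
                    hits.add(entry.getKey());
                    break; // one matching clause is enough for this document
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> docs = new LinkedHashMap<>();
        docs.put("doc1", new TreeSet<>(Arrays.asList("003016", "003017")));
        docs.put("doc2", new TreeSet<>(Arrays.asList("003018")));

        TreeSet<String> indexTerms = new TreeSet<>();
        for (Set<String> terms : docs.values()) {
            indexTerms.addAll(terms);
        }

        // All in-range terms become clauses: both documents are found.
        System.out.println(search(docs, clauses(indexTerms, "003016", "003018", false))); // [doc1, doc2]
        // Only the first in-range term becomes a clause: doc2 is missed.
        System.out.println(search(docs, clauses(indexTerms, "003016", "003018", true)));  // [doc1]
    }
}
```

Whether this matters in practice depends on how the ranges were indexed; the model only shows that passing unit tests do not rule out the case.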

any thoughts?

as i look through the code some more i will offer my thoughts on a
possible reimplementation of RangeQuery to alleviate the overhead when
there are thousands of terms as opposed to this simple one line change
which may have hidden side-effects.

i can also send a copy of some simple tests to show how to create this
situation with profiling results if that would be helpful.


thanks
alex



On Fri, 2002-11-08 at 17:40, Alex Winston wrote:
> actually i was mistaken.  i thought the tests ran successfully, but after
> looking again i merely got a BUILD SUCCESSFUL; apparently lucene's build
> cannot find JUnitTask out of the box with ant 1.5.1.  i have not had any
> time to work through the problem.  i will look into it tomorrow; if you
> have any thoughts in the meantime let me know.
> 
> thanks
> alex
> 
> 
> 
> On Fri, 2002-11-08 at 16:46, Otis Gospodnetic wrote:
> > Hello,
> > 
> > Did you say that you run 'ant test-unit' and that all tests still pass?
> > If so, could you attach a cvs diff -ucN RangeQuery.java?
> > 
> > Thanks,
> > Otis
> > 
> > 
> > --- Alex Winston <al...@christianity.com> wrote:
> > > > apologies for replying to myself, but another nice side-effect of
> > > > this fix is that it virtually eliminates the potential for an
> > > > OutOfMemoryError, which was a problem i encountered on extremely
> > > > large fields, over 10000 terms, while i was profiling the
> > > > RangeQuery class.
> > > 
> > > i can get into specifics if need be, any thoughts?
> > > 
> > > alex
> > > 
> > > 
> > >  On Fri, 2002-11-08 at 15:54, Alex Winston wrote:
> > > > thanks for the reply, my apologies for not explaining myself very
> > > > clearly, it has been a long day.
> > > > 
> > > > you expressed exactly our situation, unfortunately this is not an
> > > > option because we want to have multiple ranges for each document as
> > > > well.  there is a possible extension of what you suggested but that
> > > > is a last resort.  kinda crazy i know, but you have to meet
> > > > requirements :).
> > > > 
> > > > but i also had a thought while i was looking through the lucene
> > > > code, and any comments are welcome.
> > > > 
> > > > i may be very mistaken because it has been a long day, but if you
> > > > look at the current cvs version of RangeQuery it appears that even
> > > > if a match is found it will continue to iterate over terms within a
> > > > field, and in my case it is on the order of thousands.  if i add a
> > > > break after a match has been found, the search appears to improve
> > > > by an order of magnitude on average; my math has left me so i cannot
> > > > be theoretical at the moment.  i have unit tested the change on my
> > > > side and on the lucene side and it works.  note: one hard example is
> > > > that a query went from 20 seconds to .5 seconds.  any initial
> > > > thoughts on whether there is a case where this would not work?
> > > > 
> > > > beginning line 164:
> > > > TermQuery tq = new TermQuery(term);	  // found a match
> > > > tq.setBoost(boost);			   // set the boost
> > > > q.add(tq, false, false);		  // add to q
> > > > break;  // ADDED!
> > > > 
> > > > 
> > > > On Fri, 2002-11-08 at 15:09, Mike Barry wrote:
> > > > > Alex,
> > > > > 
> > > > > > It is rather confusing. It sounds like you've indexed
> > > > > > a field that can be between two values (let's say
> > > > > > E-J) and then when you have a search term such as G
> > > > > > you want the docs containing E-J (or A-H or F-K but not A-H
> > > > > > nor A-C nor J-Z)
> > > > > > 
> > > > > > Just off the top of my head, but could you index the upper and
> > > > > > lower bounds as separate fields, then when you search do a
> > > > > > compound query:
> > > > > > 
> > > > > >      lower_bound:{ - search_term } AND upper_bound:{ search_term - }
> > > > > > 
> > > > > > just a thought.
> > > > > > -MikeB.
> > > > > 
> > > > > 
> > > > > Alex Winston wrote:
> > > > > 
> > > > > > i was hoping that someone could briefly review my current
> > > solution to a
> > > > > > problem that we have encountered to see if anyone could suggest
> > > a
> > > > > > possible alternative, because as it stands we have pushed
> > > lucene past
> > > > > > its current limits.
> > > > > >
> > > > > > PROBLEM:
> > > > > >
> > > > > > we were wanting to represent a range of values for a particular
> > > field
> > > > > > that is searchable over a particular range.
> > > > > >
> > > > > > an example follows for clarification:
> > > > > > we were wanting to store a range of chapters and verses of a
> > > book for a
> > > > > > particular document, and in turn search to see if a query range
> > > includes
> > > > > > the range that is represented in the index.
> > > > > >
> > > > > > if this is unclear please ask for clarification
> > > > > >
> > > > > > IMPRACTICAL SOLUTION:
> > > > > >
> > > > > > although this solution seems somewhat impractical it is all we
> > > could
> > > > > > come up with.
> > > > > >
> > > > > > our solution involved storing each possible range value within
> > > the term
> > > > > > which would allow for RangeQuerys to be performed on this
> > > particular
> > > > > > field.  for very small ranges this seems somewhat practical
> > > after
> > > > > > profiling.  although once the field ranges began to span
> > > multiple
> > > > > > chapters and verses, the search times became unreasonable
> > > because we
> > > > > > were storing thousands of entries for each representative
> > > range.
> > > > > >
> > > > > > i can elaborate on anything that is unclear,
> > > > > > but any thoughts on a possible alternative solution within
> > > lucene that
> > > > > > we overlooked would be extremely helpful.
> > > > > > 	
> > > > > >
> > > > > > alex
> > > > > 
> > > > > 
> > > > > 
> > > > > --
> > > > > To unsubscribe, e-mail:  
> > > <ma...@jakarta.apache.org>
> > > > > For additional commands, e-mail:
> > > <ma...@jakarta.apache.org>
> > > > > 
> > > > > 
> > > > 
> > > 
> > > 
> > 
> > > ATTACHMENT part 2 application/pgp-signature name=signature.asc
> > 
> > 
> > 
> > __________________________________________________
> > Do you Yahoo!?
> > U2 on LAUNCH - Exclusive greatest hits videos
> > http://launch.yahoo.com/u2
> > 
> > --
> > To unsubscribe, e-mail:   <ma...@jakarta.apache.org>
> > For additional commands, e-mail: <ma...@jakarta.apache.org>
> > 
> > 
> 
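
Mike Barry's suggestion quoted above, indexing the lower and upper bounds as separate fields and combining two range clauses, can be modelled without Lucene. In this sketch the Range and BoundsMatch names are hypothetical, bounds are treated as inclusive, and single letters stand in for real field values. The second method models one possible reading of the multiple-ranges concern raised in the reply: if several ranges are flattened into two multi-valued fields, the pairing between a lower bound and its upper bound is lost, and a document can match a value that falls in the gap between its ranges.

```java
import java.util.Arrays;
import java.util.List;

// Toy model of the two-field bounds idea; not Lucene code.
final class BoundsMatch {

    // One indexed range with inclusive bounds (hypothetical type).
    static final class Range {
        final String lower;
        final String upper;

        Range(String lower, String upper) {
            this.lower = lower;
            this.upper = upper;
        }
    }

    // lower_bound <= term AND upper_bound >= term, checked per range.
    static boolean matchesPaired(List<Range> ranges, String term) {
        for (Range r : ranges) {
            if (r.lower.compareTo(term) <= 0 && r.upper.compareTo(term) >= 0) {
                return true;
            }
        }
        return false;
    }

    // Same two conditions, but each may be satisfied by ANY value of its
    // field, as with two independent multi-valued fields.
    static boolean matchesFlattened(List<Range> ranges, String term) {
        boolean anyLower = false;
        boolean anyUpper = false;
        for (Range r : ranges) {
            if (r.lower.compareTo(term) <= 0) {
                anyLower = true;
            }
            if (r.upper.compareTo(term) >= 0) {
                anyUpper = true;
            }
        }
        return anyLower && anyUpper;
    }

    public static void main(String[] args) {
        List<Range> single = Arrays.asList(new Range("E", "J"));
        System.out.println(matchesPaired(single, "G"));  // true
        System.out.println(matchesPaired(single, "K"));  // false

        List<Range> two = Arrays.asList(new Range("A", "C"), new Range("J", "Z"));
        System.out.println(matchesPaired(two, "G"));     // false
        System.out.println(matchesFlattened(two, "G"));  // true, a false positive
    }
}
```

For a single range per document the two methods agree, which is why the suggestion fits that case and becomes awkward once a document carries several ranges.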


Re: Searching Ranges

Posted by Alex Winston <al...@christianity.com>.
actually i was mistaken.  i thought the tests ran successfully, but after
looking again i merely got a BUILD SUCCESSFUL; apparently lucene's build
cannot find JUnitTask out of the box with ant 1.5.1.  i have not had any
time to work through the problem.  i will look into it tomorrow; if you
have any thoughts in the meantime let me know.

thanks
alex





Re: Searching Ranges

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Hello,

Did you say that you ran 'ant test-unit' and that all tests still pass?
If so, could you attach a cvs diff -ucN RangeQuery.java?

Thanks,
Otis


--- Alex Winston <al...@christianity.com> wrote:
> apologies for replying to myself, but another nice side effect of this
> fix is that it virtually eliminates the potential for an
> OutOfMemoryError, which was a problem i encountered on extremely large
> fields (over 10000 terms) while i was profiling the RangeQuery class.
> 
> i can get into specifics if need be.  any thoughts?
> 
> alex
--
To unsubscribe, e-mail:   <ma...@jakarta.apache.org>
For additional commands, e-mail: <ma...@jakarta.apache.org>


Re: Searching Ranges

Posted by Alex Winston <al...@christianity.com>.
apologies for replying to myself, but another nice side effect of this
fix is that it virtually eliminates the potential for an
OutOfMemoryError, which was a problem i encountered on extremely large
fields (over 10000 terms) while i was profiling the RangeQuery class.

i can get into specifics if need be.  any thoughts?

alex


 On Fri, 2002-11-08 at 15:54, Alex Winston wrote:
> thanks for the reply, and my apologies for not explaining myself very
> clearly; it has been a long day.
> 
> you expressed exactly our situation.  unfortunately this is not an option
> because we want to have multiple ranges for each document as well.
> there is a possible extension of what you suggested, but that is a last
> resort.  kinda crazy i know, but you have to meet requirements :).
> 
> but i also had a thought while i was looking through the lucene code,
> and any comments are welcome.
> 
> i may be very mistaken because it has been a long day, but if you look at
> the current cvs version of RangeQuery it appears that even if a match is
> found it will continue to iterate over terms within a field, and in my
> case that is on the order of thousands.  if i add a break after a match
> has been found, the search appears to improve by an order of magnitude
> on average; my math has left me so i cannot be theoretical at
> the moment.  i have unit tested the change on my side and on the lucene
> side and it works.  note: one hard example is that a query went from 20
> seconds to .5 seconds.  any initial thoughts on whether there is a case
> where this would not work?
> 
> beginning line 164:
> TermQuery tq = new TermQuery(term);  // found a match
> tq.setBoost(boost);                  // set the boost
> q.add(tq, false, false);             // add to q
> break;                               // ADDED!
> 
> 
> On Fri, 2002-11-08 at 15:09, Mike Barry wrote:
> > Alex,
> > 
> > It is rather confusing. It sounds like you've indexed
> > a field that can be between two values (let's say
> > E-J) and then when you have a search term such as G
> > you want the docs containing E-J (or A-H or F-K but not A-H
> > nor A-C nor J-Z)
> > 
> > Just off the top of my head, but could you index the upper and
> > lower bounds as separate fields, then when you search do a
> > compound query:
> > 
> >      lower_bound:{ - search_term } AND upper_bound:{ search_term - }
> > 
> > just a thought.
> > -MikeB.
> > 
> > 
> > Alex Winston wrote:
> > 
> > > i was hoping that someone could briefly review my current solution to a
> > > problem that we have encountered to see if anyone could suggest a
> > > possible alternative, because as it stands we have pushed lucene past
> > > its current limits.
> > >
> > > PROBLEM:
> > >
> > > we were wanting to represent a range of values for a particular field
> > > that is searchable over a particular range.
> > >
> > > an example follows for clarification:
> > > we were wanting to store a range of chapters and verses of a book for a
> > > particular document, and in turn search to see if a query range includes
> > > the range that is represented in the index.
> > >
> > > if this is unclear please ask for clarification
> > >
> > > IMPRACTICAL SOLUTION:
> > >
> > > although this solution seems somewhat impractical it is all we could
> > > come up with.
> > >
> > > our solution involved storing each possible range value within the term
> > > which would allow for RangeQuerys to be performed on this particular
> > > field.  for very small ranges this seems somewhat practical after
> > > profiling.  although once the field ranges began to span multiple
> > > chapters and verses, the search times became unreasonable because we
> > > were storing thousands of entries for each representative range.
> > >
> > > i can elaborate on anything that is unclear,
> > > but any thoughts on a possible alternative solution within lucene that
> > > we overlooked would be extremely helpful.
> > > 	
> > >
> > > alex
> 
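
Mike Barry's suggestion quoted above (index the lower and upper bounds as separate fields and AND two open-ended ranges) and Alex's objection (several ranges per document) can be sketched with a toy model. This is only an illustration under stated assumptions: the class and method names are hypothetical, nothing here is Lucene API, bounds are plain integers, and the comparison is inclusive for simplicity.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: each "document" carries the bounds Mike suggests indexing
// as separate fields. None of these names come from Lucene.
class BoundFieldsSketch {

    // A document with one or more [lower, upper] ranges, flattened into two
    // multi-valued fields, the way separate lower_bound / upper_bound fields would be.
    static final class Doc {
        final String name;
        final List<Integer> lowerBounds = new ArrayList<>();
        final List<Integer> upperBounds = new ArrayList<>();

        Doc(String name, int[][] ranges) {
            this.name = name;
            for (int[] r : ranges) {
                lowerBounds.add(r[0]);
                upperBounds.add(r[1]);
            }
        }
    }

    // "some lower bound <= point AND some upper bound >= point", evaluated per
    // field, which is all that two independent fields can express.
    static boolean matches(Doc doc, int point) {
        boolean lowerOk = doc.lowerBounds.stream().anyMatch(lo -> lo <= point);
        boolean upperOk = doc.upperBounds.stream().anyMatch(hi -> hi >= point);
        return lowerOk && upperOk;
    }

    public static void main(String[] args) {
        Doc single = new Doc("single", new int[][] {{5, 10}});
        Doc multi = new Doc("multi", new int[][] {{1, 3}, {8, 10}});

        System.out.println(matches(single, 7));  // true: 7 is inside 5-10
        System.out.println(matches(single, 12)); // false: 12 is above 10
        // 5 lies in neither 1-3 nor 8-10, yet lower bound 1 and upper bound 10
        // satisfy the two clauses independently, so this prints true.
        System.out.println(matches(multi, 5));
    }
}
```

With a single range per document the two clauses are enough. With several ranges the bounds lose their pairing, which is presumably why Alex calls this a last resort.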


Re: Searching Ranges

Posted by Terry Steichen <te...@net-frame.com>.
Am I the only one that has trouble reading Alex's messages?  I don't know
what he does, but whatever it is, I have to get an editor to extract an
attachment and that is too much work.

Terry

----- Original Message -----
From: "Alex Winston" <al...@christianity.com>
To: "Lucene Users List" <lu...@jakarta.apache.org>
Sent: Wednesday, November 20, 2002 5:26 PM
Subject: Re: Searching Ranges






Re: Searching Ranges

Posted by Alex Winston <al...@christianity.com>.
doug,
  if you happen to remember this thread, i was wanting to know if you
had any thoughts on improving this search in the situation below.  my
temp fix does not work in all situations, so i am back to square one.

  i have completely gutted the RangeQuery and created an additional
RangeScorer to help eliminate some of the overhead incurred in the
special situation below, but the search times are still unacceptable.
currently i have reduced the logic down to simply iterating over the set
of terms within the range, fetching the termDocs for each, and
maintaining an array of the results.  although my implementation is
substantially faster than before, it is still very slow.  my thought was
that i might be able to accomplish a more efficient range query at the
Reader level.  any thoughts?

  i am certain that some of the redundant iteration can be eliminated;
i am just not sure how.
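
One way to picture "iterate the terms in the range and collect the documents for each" without building a query clause per term is to OR the document ids into a bit set. The sketch below is an assumption-laden illustration, not the RangeScorer described above: the sorted map stands in for the index's term dictionary, the int[] values stand in for termDocs, and all names are hypothetical.

```java
import java.util.BitSet;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Toy stand-in for a term dictionary: term -> ids of documents containing it.
class RangeBitSetSketch {

    // Collects every document that holds at least one term in [lower, upper].
    static BitSet docsInRange(SortedMap<String, int[]> postings,
                              String lower, String upper, int maxDoc) {
        BitSet hits = new BitSet(maxDoc);
        // tailMap(lower) starts at the first term >= lower, so terms below
        // the range are never visited.
        for (Map.Entry<String, int[]> e : postings.tailMap(lower).entrySet()) {
            if (e.getKey().compareTo(upper) > 0) {
                break; // past the upper bound: nothing further can match
            }
            for (int doc : e.getValue()) {
                hits.set(doc); // a document is recorded once, however many terms hit it
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        SortedMap<String, int[]> postings = new TreeMap<>();
        postings.put("001", new int[] {0});
        postings.put("002", new int[] {0});
        postings.put("003", new int[] {0});
        postings.put("004", new int[] {0, 1});
        postings.put("005", new int[] {0, 1});
        postings.put("006", new int[] {1});
        postings.put("011", new int[] {2});

        // references:[003 010] touches terms 003..006 only
        System.out.println(docsInRange(postings, "003", "010", 3)); // prints {0, 1}
    }
}
```

The work is still proportional to the number of terms in the range plus their postings, so this removes per-clause overhead rather than the iteration itself.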

thanks
alex



> Alex Winston wrote:
> > lets say that i have a document named "d1", which contains a field named
> > "references".  within the "references" field i maintain a list of terms
> > that represent my range from 001-005, more specifically the field would
> > contain the terms "001 002 003 004 005".
> >
> > i would now like to search this range to determine if it falls within
> > the range 003-010, so my query would look like "references:[003 010]".

Re: Searching Ranges

Posted by Scott Ganyo <sc...@etapestry.com>.
Ok.  If I understand what you are saying, I believe that would be 
correct if you were iterating over Documents and for each Document you 
were trying to match a Term in the Range.

That is reversed, though:  the actual flow is Term->Document[].
You iterate over the set of Terms, and for each Term there is a
set of Documents that contain that Term.  Therefore, the RangeQuery
creates a TermQuery for each Term that exists within the specified
range.  The net result is that all Documents that have Terms within the
Range are included in the result.
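
A minimal sketch of the Term->Document[] flow, assuming a toy in-memory index (a sorted map from term to document names) rather than Lucene's own classes. The names `expand` and `union` are hypothetical; each entry returned by `expand` plays the role of one TermQuery added for the range.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.SortedMap;
import java.util.TreeMap;
import java.util.TreeSet;

class TermToDocsSketch {

    // One entry per term that exists in [lower, upper], in term order.
    static Map<String, List<String>> expand(SortedMap<String, List<String>> index,
                                            String lower, String upper) {
        Map<String, List<String>> clauses = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> e : index.tailMap(lower).entrySet()) {
            if (e.getKey().compareTo(upper) > 0) {
                break; // moved beyond the upper term: stop iterating
            }
            clauses.put(e.getKey(), e.getValue());
        }
        return clauses;
    }

    // The result set is every document reachable from any of the clauses.
    static Set<String> union(Map<String, List<String>> clauses) {
        Set<String> docs = new TreeSet<>();
        for (List<String> d : clauses.values()) {
            docs.addAll(d);
        }
        return docs;
    }

    public static void main(String[] args) {
        SortedMap<String, List<String>> index = new TreeMap<>();
        index.put("a", Arrays.asList("d1"));
        index.put("b", Arrays.asList("d2"));
        index.put("c", Arrays.asList("d1", "d3"));
        index.put("e", Arrays.asList("d4"));

        Map<String, List<String>> clauses = expand(index, "a", "d");
        System.out.println(clauses.keySet()); // prints [a, b, c]
        System.out.println(union(clauses));   // prints [d1, d2, d3]
    }
}
```

Note that d2 and d3 are reached only through terms "b" and "c", so stopping at "a" would drop them.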

Does that make sense?
Scott

Alex Winston wrote:

> good thoughts and something that i would like to explore further. let me
> create a more concrete example that we can use to help visualize the
> problem and a possible solution and then get feedback.  it may be that
> the changes work in my case but not for all cases, or that there is
> something else that could help alleviate the overhead i am experiencing.
>
> let's say that i have a document named "d1", which contains a field named
> "references".  within the "references" field i maintain a list of terms
> that represent my range from 001-005, more specifically the field would
> contain the terms "001 002 003 004 005".
>
> i would now like to search this range to determine if it falls within
> the range 003-010, so my query would look like "references:[003 010]".
>
> in this case RangeQuery begins to iterate the terms contained within
> "references" to determine if "d1" is a match.  as it iterates each term
> is analyzed, 001 and 002 fail so we continue to iterate at which point
> 003 is determined to be a match.
>
> at this point there is no need to continue to search the terms because
> we have determined there is a match, not only that but it helps reduce
> the size of the byte[] cache that is created i believe.  like i
> mentioned earlier, i am not sure of the ramifications this may have on
> scoring, but it works for most cases i can think of, but if i am missing
> something that is what i need feedback on :).
>
> you can imagine how this improves the avg efficiency in my case if i
> have 10000 terms in "references".  although i may be doing something
> that was either not intended or ill-designed.
>
> thanks, any thoughts?
> alex
>
>
>
> On Mon, 2002-11-11 at 10:50, Scott Ganyo wrote:
>
> >Hi Alex,
> >
> >I just looked at this and had the following thought:
> >
> >The RangeQuery must continue to iterate after the first match is found
> >in order to match everything within the specified range.  In other
> >words, if you have a range of "a" to "d", you can't stop with "a", you
> >need to continue to "d".  At the point you move beyond "d" is the point
> >where the query should stop iterating.  That is reflected in lines
> >160-162.  It seems to me that your solution would only work where your
> >range consists of a single term.
> >
> >Please let me know if I'm just misunderstanding the situation.
> >
> >Scott
> >

-- 
Brain: Pinky, are you pondering what I’m pondering?
Pinky: I think so, Brain, but calling it a pu-pu platter? Huh, what were 
they thinking?




Re: Searching Ranges

Posted by Otis Gospodnetic <ot...@yahoo.com>.
I have never used the RangeQuery, but my understanding was that it is
for cases like this:

Assume you have documents with a field called 'percentage' that can
have a value between 0 and 100.
Assume your index has 3 documents, one with percentage=10, one with
percentage=60, and one with percentage=90.

You could then do a RangeQuery: percentage:[50 100]

This will match 2 documents, one with percentage=60, and one with
percentage=90.


I think that is slightly different from queries where you want to see
whether a field with value of "001 002 003 004 005" has at least 1 term
that falls in the range specified in the query string.
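
One caveat on the percentage example, assuming RangeQuery compares terms as strings rather than as numbers (worth verifying against the RangeQuery source): unpadded values do not sort numerically, which is why Alex's terms are written as 001, 002 and so on. The sketch below uses plain Java string comparison only; `inRange` and `pad` are hypothetical helpers, not Lucene calls.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class TermOrderSketch {

    // Keeps the terms t with lower <= t <= upper under String ordering.
    static List<String> inRange(List<String> terms, String lower, String upper) {
        List<String> out = new ArrayList<>();
        for (String t : terms) {
            if (t.compareTo(lower) >= 0 && t.compareTo(upper) <= 0) {
                out.add(t);
            }
        }
        return out;
    }

    // Zero-pads to three digits so string order agrees with numeric order.
    static String pad(int value) {
        return String.format("%03d", value);
    }

    public static void main(String[] args) {
        // Unpadded: "100" sorts before "50", so the range is effectively empty.
        System.out.println(inRange(Arrays.asList("10", "60", "90"), "50", "100")); // prints []

        // Padded: 060 and 090 fall between 050 and 100 as intended.
        List<String> padded = Arrays.asList(pad(10), pad(60), pad(90));
        System.out.println(inRange(padded, pad(50), pad(100))); // prints [060, 090]
    }
}
```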

I haven't tested what I said here, so I could be wrong.

Otis




--- Alex Winston <al...@christianity.com> wrote:
> good thoughts and something that i would like to explore further. let me
> create a more concrete example that we can use to help visualize the
> problem and a possible solution and then get feedback.  it may be that
> the changes work in my case but not for all cases, or that there is
> something else that could help alleviate the overhead i am experiencing.
> 
> let's say that i have a document named "d1", which contains a field named
> "references".  within the "references" field i maintain a list of terms
> that represent my range from 001-005, more specifically the field would
> contain the terms "001 002 003 004 005".
> 
> i would now like to search this range to determine if it falls within
> the range 003-010, so my query would look like "references:[003 010]".
> 
> in this case RangeQuery begins to iterate the terms contained within
> "references" to determine if "d1" is a match.  as it iterates each term
> is analyzed, 001 and 002 fail so we continue to iterate at which point
> 003 is determined to be a match.
> 
> at this point there is no need to continue to search the terms because
> we have determined there is a match.  i believe it also helps reduce
> the size of the byte[] cache that is created.  like i mentioned earlier,
> i am not sure of the ramifications this may have on scoring, but it
> works for most cases i can think of.  if i am missing something, that
> is what i need feedback on :).
> 
> you can imagine how this improves the avg efficiency in my case if i
> have 10000 terms in "references", although i may be doing something
> that was either not intended or ill-designed.
> 
> thanks, any thoughts?
> alex
> 
> 
> 
> On Mon, 2002-11-11 at 10:50, Scott Ganyo wrote:
> > Hi Alex,
> > 
> > I just looked at this and had the following thought:
> > 
> > The RangeQuery must continue to iterate after the first match is found
> > in order to match everything within the specified range.  In other
> > words, if you have a range of "a" to "d", you can't stop with "a", you
> > need to continue to "d".  At the point you move beyond "d" is the point
> > where the query should stop iterating.  That is reflected in lines
> > 160-162.  It seems to me that your solution would only work where your
> > range consists of a single term.
> > 
> > Please let me know if I'm just misunderstanding the situation.
> > 
> > Scott
> > 


Re: Searching Ranges

Posted by Alex Winston <al...@christianity.com>.
good thoughts and something that i would like to explore further. let me
create a more concrete example that we can use to help visualize the
problem and a possible solution and then get feedback.  it may be that
the changes work in my case but not for all cases, or that there is
something else that could help alleviate the overhead i am experiencing.

let's say that i have a document named "d1", which contains a field named
"references".  within the "references" field i maintain a list of terms
that represent my range from 001-005, more specifically the field would
contain the terms "001 002 003 004 005".

i would now like to search this range to determine if it falls within
the range 003-010, so my query would look like "references:[003 010]".

in this case RangeQuery begins to iterate the terms contained within
"references" to determine if "d1" is a match.  as it iterates each term
is analyzed, 001 and 002 fail so we continue to iterate at which point
003 is determined to be a match.

at this point there is no need to continue to search the terms because
we have determined there is a match.  i believe it also helps reduce
the size of the byte[] cache that is created.  like i mentioned earlier,
i am not sure of the ramifications this may have on scoring, but it
works for most cases i can think of.  if i am missing something, that
is what i need feedback on :).

you can imagine how this improves the avg efficiency in my case if i
have 10000 terms in "references", although i may be doing something
that was either not intended or ill-designed.
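
To test the idea on the "d1" example, here is a rough sketch that adds a second, hypothetical document "d2" holding terms 004 through 006 and runs the same range with and without the early break. It assumes the terms of a field are enumerated across the whole index rather than per document; the sorted map is a stand-in for that term dictionary, not Lucene code.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.SortedMap;
import java.util.TreeMap;
import java.util.TreeSet;

class EarlyBreakSketch {

    // Walks the terms in [lower, upper] and gathers the documents that hold them.
    static Set<String> search(SortedMap<String, List<String>> index,
                              String lower, String upper, boolean stopAfterFirstMatch) {
        Set<String> hits = new TreeSet<>();
        for (Map.Entry<String, List<String>> e : index.tailMap(lower).entrySet()) {
            if (e.getKey().compareTo(upper) > 0) {
                break; // beyond the upper term
            }
            hits.addAll(e.getValue());
            if (stopAfterFirstMatch) {
                break; // models the proposed "break;  // ADDED!" line
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // d1 holds 001..005, d2 (hypothetical) holds 004..006
        SortedMap<String, List<String>> index = new TreeMap<>();
        index.put("001", Arrays.asList("d1"));
        index.put("002", Arrays.asList("d1"));
        index.put("003", Arrays.asList("d1"));
        index.put("004", Arrays.asList("d1", "d2"));
        index.put("005", Arrays.asList("d1", "d2"));
        index.put("006", Arrays.asList("d2"));

        System.out.println(search(index, "003", "010", false)); // prints [d1, d2]
        System.out.println(search(index, "003", "010", true));  // prints [d1]
    }
}
```

Under that assumption the break is only safe when every matching document happens to contain the first term in the range, which would explain a fix that passes some tests but "does not work in all situations".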

thanks, any thoughts?
alex



On Mon, 2002-11-11 at 10:50, Scott Ganyo wrote:
> Hi Alex,
> 
> I just looked at this and had the following thought:
> 
> The RangeQuery must continue to iterate after the first match is found 
> in order to match everything within the specified range.  In other 
> words, if you have a range of "a" to "d", you can't stop with "a", you 
> need to continue to "d".  At the point you move beyond "d" is the point 
> where the query should stop iterating.  That is reflected in lines 
> 160-162.  It seems to me that your solution would only work where your 
> range consists of a single term.
> 
> Please let me know if I'm just misunderstanding the situation.
> 
> Scott
> 
> -- 
> Brain: Pinky, are you pondering what I’m pondering?
> Pinky: I think so, Brain, but calling it a pu-pu platter? Huh, what were 
> they thinking?
> 
> 
> --
> To unsubscribe, e-mail:   <ma...@jakarta.apache.org>
> For additional commands, e-mail: <ma...@jakarta.apache.org>
> 
> 


Re: Searching Ranges

Posted by Scott Ganyo <sc...@etapestry.com>.
Hi Alex,

I just looked at this and had the following thought:

The RangeQuery must continue to iterate after the first match is found 
in order to match everything within the specified range.  In other 
words, if you have a range of "a" to "d", you can't stop with "a", you 
need to continue to "d".  The point at which you move beyond "d" is the
point where the query should stop iterating.  That is reflected in lines
160-162.  It seems to me that your solution would only work where your 
range consists of a single term.
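
To illustrate the point above, here is a minimal sketch in plain Java.  It
does not use Lucene's RangeQuery or its term enumeration; the class
RangeScanSketch, the method collect, and the sorted set of terms are all
hypothetical stand-ins for walking the terms of a field in order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

class RangeScanSketch {
    // Walks the sorted terms starting at 'lower' and collects every term
    // up to and including 'upper'. When stopAfterFirstMatch is true it
    // imitates the proposed extra break after the first matching term.
    static List<String> collect(SortedSet<String> terms, String lower,
                                String upper, boolean stopAfterFirstMatch) {
        List<String> matches = new ArrayList<>();
        for (String term : terms.tailSet(lower)) {
            if (term.compareTo(upper) > 0) {
                break; // moved beyond the upper bound: the scan should stop here
            }
            matches.add(term); // found a match
            if (stopAfterFirstMatch) {
                break; // the added break: later terms in the range are never seen
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        SortedSet<String> terms =
                new TreeSet<>(Arrays.asList("a", "b", "c", "d", "e"));
        System.out.println(collect(terms, "a", "d", false)); // [a, b, c, d]
        System.out.println(collect(terms, "a", "d", true));  // [a]
    }
}
```

With the early break, only the first term in the range is collected, so
documents that contain only "b", "c", or "d" would be missed.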

Please let me know if I'm just misunderstanding the situation.

Scott


-- 
Brain: Pinky, are you pondering what I’m pondering?
Pinky: I think so, Brain, but calling it a pu-pu platter? Huh, what were 
they thinking?


--
To unsubscribe, e-mail:   <ma...@jakarta.apache.org>
For additional commands, e-mail: <ma...@jakarta.apache.org>


Re: Searching Ranges

Posted by Alex Winston <al...@christianity.com>.
thanks for the reply, my apologies for not explaining myself very
clearly, it has been a long day.

you expressed exactly our situation, unfortunately this is not an option
because we want to have multiple ranges for each document as well, 
there is a possible extension of what you suggested but that is a last
resort.  kinda crazy i know, but you have to meet requirements :).
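
One likely reason the two-field approach breaks down with several ranges per
document (this is an assumption; the message does not spell it out): if all
lower bounds go into one multi-valued field and all upper bounds into
another, the pairing between them is lost.  The sketch below is plain Java,
not Lucene code; MultiRangeSketch and its methods are hypothetical names.

```java
import java.util.Arrays;
import java.util.List;

class MultiRangeSketch {
    // Field-level check: some lower bound is below the term and some upper
    // bound is above it, regardless of which range each bound came from.
    static boolean matchesUnpaired(List<String> lowers, List<String> uppers,
                                   String term) {
        boolean lowerHit = false;
        for (String lower : lowers) {
            if (lower.compareTo(term) < 0) {
                lowerHit = true;
            }
        }
        boolean upperHit = false;
        for (String upper : uppers) {
            if (upper.compareTo(term) > 0) {
                upperHit = true;
            }
        }
        return lowerHit && upperHit;
    }

    // Range-level check: the term must fall strictly inside one single range.
    // Each range is a two-element array: {lower, upper}.
    static boolean matchesPaired(List<String[]> ranges, String term) {
        for (String[] range : ranges) {
            if (range[0].compareTo(term) < 0 && range[1].compareTo(term) > 0) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // One document holding two ranges: A-C and J-Z. The term G is in neither.
        List<String> lowers = Arrays.asList("A", "J");
        List<String> uppers = Arrays.asList("C", "Z");
        List<String[]> ranges = Arrays.asList(
                new String[] {"A", "C"}, new String[] {"J", "Z"});

        System.out.println(matchesUnpaired(lowers, uppers, "G")); // true (false positive)
        System.out.println(matchesPaired(ranges, "G"));           // false
    }
}
```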

but i also had a thought while i was looking through the lucene code,
and any comments are welcome.  

i may be very mistaken because it has been a long day, but if you look at
the current cvs version of RangeQuery it appears that even after a match is
found it will continue to iterate over the terms within a field, and in my
case that is on the order of thousands.  if i add a break after a match
has been found, the search appears to improve by about an order of
magnitude on average; my math has left me so i cannot be theoretical at
the moment.  i have unit tested the change on my side and on the lucene
side and it works.  note: one hard example is that a query went from 20
seconds to .5 seconds.  any initial thoughts on whether there is a case
where this would not work?

beginning line 164:
TermQuery tq = new TermQuery(term);  // found a match
tq.setBoost(boost);                  // set the boost
q.add(tq, false, false);             // add to q
break;                               // ADDED!


On Fri, 2002-11-08 at 15:09, Mike Barry wrote:
> Alex,
> 
> It is rather confusing. It sounds like you've indexed
> a field that can be between two values (let's say
> E-J) and then when you have a search term such as G
> you want the docs containing E-J (or A-H or F-K but not A-H
> nor A-C nor J-Z)
> 
> Just off the top of my head but could you index the upper and
> lower bounds as separate fields then when you search do a
> compound query:
> 
>      lower_bound:{ - search_term } AND upper_bound:{ search_term - }
> 
> just a thought.
> 
> -MikeB.
> 
> 


Re: Searching Ranges

Posted by Mike Barry <mb...@cos.com>.
Alex,

It is rather confusing. It sounds like you've indexed
a field that can be between two values (let's say
E-J) and then when you have a search term such as G
you want the docs containing E-J (or A-H or F-K but not A-H
nor A-C nor J-Z)

Just off the top of my head but could you index the upper and
lower bounds as separate fields then when you search do a
compound query:

     lower_bound:{ - search_term } AND upper_bound:{ search_term - }
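
As a rough sketch of what that compound query checks, in plain Java rather
than Lucene code: a document matches when its lower bound sorts before the
search term and its upper bound sorts after it.  BoundsFieldSketch, Doc, and
matches are hypothetical names, and the strict comparisons assume the braces
above mean exclusive bounds.

```java
class BoundsFieldSketch {
    // Hypothetical stand-in for a document that stores its range as two
    // separate fields, lower_bound and upper_bound.
    static final class Doc {
        final String lowerBound;
        final String upperBound;

        Doc(String lowerBound, String upperBound) {
            this.lowerBound = lowerBound;
            this.upperBound = upperBound;
        }
    }

    // lower_bound:{ - term } AND upper_bound:{ term - }, read as exclusive:
    // lower_bound < term AND upper_bound > term (plain string ordering).
    static boolean matches(Doc doc, String term) {
        return doc.lowerBound.compareTo(term) < 0
                && doc.upperBound.compareTo(term) > 0;
    }

    public static void main(String[] args) {
        System.out.println(matches(new Doc("E", "J"), "G")); // true
        System.out.println(matches(new Doc("F", "K"), "G")); // true
        System.out.println(matches(new Doc("A", "C"), "G")); // false
        System.out.println(matches(new Doc("J", "Z"), "G")); // false
    }
}
```

Because the comparison is on strings, values such as chapter and verse
numbers would need an encoding whose string order matches their numeric
order (zero padding, for example); that detail is an assumption and not
part of the suggestion above.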

just a thought.

-MikeB.


Alex Winston wrote:

> i was hoping that someone could briefly review my current solution to a
> problem that we have encountered to see if anyone could suggest a
> possible alternative, because as it stands we have pushed lucene past
> its current limits.
>
> PROBLEM:
>
> we were wanting to represent a range of values for a particular field
> that is searchable over a particular range.
>
> an example follows for clarification:
> we were wanting to store a range of chapters and verses of a book for a
> particular document, and in turn search to see if a query range includes
> the range that is represented in the index.
>
> if this is unclear please ask for clarification
>
> IMPRACTICAL SOLUTION:
>
> although this solution seems somewhat impractical it is all we could
> come up with.
>
> our solution involved storing each possible range value within the term
> which would allow for RangeQuerys to be performed on this particular
> field.  for very small ranges this seems somewhat practical after
> profiling.  although once the field ranges began to span multiple
> chapters and verses, the search times became unreasonable because we
> were storing thousands of entries for each representative range.
>
> i can elaborate on anything that is unclear,
> but any thoughts on a possible alternative solution within lucene that
> we overlooked would be extremely helpful.
> 	
>
> alex



--
To unsubscribe, e-mail:   <ma...@jakarta.apache.org>
For additional commands, e-mail: <ma...@jakarta.apache.org>