Posted to java-user@lucene.apache.org by Claude Devarenne <cl...@library.ucsf.edu> on 2004/05/18 19:38:01 UTC

How to handle range queries over large ranges and avoid Too Many Boolean clauses

Hi,

I have over 60,000 documents in my index, which is slightly over 1 GB
in size.  The documents range from the late seventies up to now.  I
have indexed dates as a keyword field using a string because the dates
are in YYYYMMDD format.  When I do range queries things are OK as long
as I don't exceed the built-in limit on the number of boolean clauses,
which in practice means a range of 3 years, e.g. 1979 to 1981.  The
users are not only doing complex queries but also want to query over
long ranges, e.g. [19790101 TO 19991231].

Given these requirements, I am thinking of doing a query without the
date range, bringing the unique ids back from the hits, and then doing
a date query in the SQL database I have that contains the same data.
Another alternative is to do the query without the date range in
Lucene and then sort the results within the range.  I still have to
learn how to use the new sorting code, and I confess I have not had
time to look at it yet.

Is there a simpler, easier way to do this?

Claude


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Re: How to handle range queries over large ranges and avoid Too Many Boolean clauses

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On May 18, 2004, at 3:56 PM, Claude Devarenne wrote:
> Thanks, I'll try that.  It would be nice too if I could extend Field
> (it is a final class) and create a numerical field.  Is that not
> desirable?

It isn't that much more effort to have something like NumberUtils 
listed here: 
http://wiki.apache.org/jakarta-lucene/SearchNumericalFields

I'm not sure of the pros and cons of making Field extensible, but it
really is of marginal benefit: ultimately Field needs a String, and
converting a number to a String in your own code isn't very involved.
I suppose we could put something like NumberUtils (maybe called
NumberField, to be like DateField) in the core to have a built-in
solution.  We probably ought to also go another step and provide
Date -> YYYYMMDD conversion as an additional part of DateField.
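
As a rough illustration of the number-to-String conversion mentioned
above, here is a minimal sketch. It assumes non-negative values only,
and the class name PaddedNumber and its methods are made up for this
example; it is not the NumberUtils class from the wiki page. The idea
is to left-pad with zeros to a fixed width so that String order
matches numeric order.

```java
import java.util.Locale;

// Hypothetical helper, not part of Lucene: encodes non-negative longs as
// fixed-width, zero-padded strings so that lexicographic order of the
// encoded terms equals numeric order of the original values.
final class PaddedNumber {
    // 19 digits is enough for Long.MAX_VALUE (9223372036854775807).
    static final int WIDTH = 19;

    static String encode(long value) {
        if (value < 0) {
            // Negative numbers need a different scheme; out of scope here.
            throw new IllegalArgumentException("negative values are not handled in this sketch");
        }
        // Locale.ROOT keeps the digits ASCII regardless of the default locale.
        return String.format(Locale.ROOT, "%0" + WIDTH + "d", value);
    }

    static long decode(String term) {
        // Leading zeros are accepted by Long.parseLong.
        return Long.parseLong(term);
    }
}
```

With this encoding, PaddedNumber.encode(9) sorts before
PaddedNumber.encode(25), which would not be true for the plain
strings "9" and "25".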

	Erik


Re: How to handle range queries over large ranges and avoid Too Many Boolean clauses

Posted by Claude Devarenne <cl...@library.ucsf.edu>.
Thanks, I'll try that.  It would be nice too if I could extend Field
(it is a final class) and create a numerical field.  Is that not
desirable?

Claude

On May 18, 2004, at 12:06 PM, Ype Kingma wrote:

> I wouldn't know of a simpler and easier way, but there is another way
> to reduce the number of clauses involved in long date ranges.
> This can be done by indexing not only YYYYMMDD but also YYYYMM and
> YYYY, and adapting the query range mechanism to use the shorter term
> whenever possible. (YYY and YYYYMMD might also be useful.)


Re: How to handle range queries over large ranges and avoid Too Many Boolean clauses

Posted by Ype Kingma <yk...@xs4all.nl>.
On Tuesday 18 May 2004 19:38, Claude Devarenne wrote:
> Is there a simpler, easier way to do this?

I wouldn't know of a simpler and easier way, but there is another way
to reduce the number of clauses involved in long date ranges.
This can be done by indexing not only YYYYMMDD but also YYYYMM and
YYYY, and adapting the query range mechanism to use the shorter term
whenever possible. (YYY and YYYYMMD, i.e. decade and roughly ten-day
prefixes, might also be useful.)
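
The idea above can be sketched as follows. This is only an
illustration under stated assumptions: it assumes every document has
been indexed with its YYYY, YYYYMM and YYYYMMDD terms, it ignores the
optional YYY and YYYYMMD levels, and the class DatePrefixRange and its
cover method are hypothetical names, not Lucene code. Given an
inclusive date range, it returns a list of terms that covers exactly
that range, preferring the shortest term whenever a whole year or
whole month fits.

```java
import java.time.LocalDate;
import java.time.YearMonth;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: covers an inclusive date range with year, month
// and day terms, using the coarsest term that fits at each step.
final class DatePrefixRange {
    private static final DateTimeFormatter YEAR = DateTimeFormatter.ofPattern("yyyy");
    private static final DateTimeFormatter MONTH = DateTimeFormatter.ofPattern("yyyyMM");
    private static final DateTimeFormatter DAY = DateTimeFormatter.ofPattern("yyyyMMdd");

    static List<String> cover(LocalDate from, LocalDate to) {
        List<String> terms = new ArrayList<>();
        LocalDate d = from;
        while (!d.isAfter(to)) {
            LocalDate yearEnd = LocalDate.of(d.getYear(), 12, 31);
            LocalDate monthEnd = YearMonth.from(d).atEndOfMonth();
            if (d.getDayOfYear() == 1 && !yearEnd.isAfter(to)) {
                terms.add(d.format(YEAR));      // whole year fits: one YYYY term
                d = yearEnd.plusDays(1);
            } else if (d.getDayOfMonth() == 1 && !monthEnd.isAfter(to)) {
                terms.add(d.format(MONTH));     // whole month fits: one YYYYMM term
                d = monthEnd.plusDays(1);
            } else {
                terms.add(d.format(DAY));       // fall back to a single day
                d = d.plusDays(1);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        // Each returned term would become one clause of the range query.
        System.out.println(cover(LocalDate.of(1979, 1, 1), LocalDate.of(1999, 12, 31)).size());
        System.out.println(cover(LocalDate.of(1979, 11, 30), LocalDate.of(1980, 2, 2)));
    }
}
```

For the range [19790101 TO 19991231] from the original question this
yields 21 year terms (1979 through 1999) instead of one term per
indexed day.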


Kind regards,
Ype


Re: How to handle range queries over large ranges and avoid Too Many Boolean clauses

Posted by Claude Devarenne <cl...@library.ucsf.edu>.
Thanks, I'll try that first and then Ype's suggestion if necessary.  I 
have been shying away from filters so now I have no excuse ;-)

Claude

On May 18, 2004, at 1:35 PM, Andy Goodell wrote:

> In our application we had a similar problem with non-date ranges until
> we realized that we weren't so much searching for the values in the
> range as restricting the search to that range.  We then used an
> extension of the org.apache.lucene.search.Filter class, and our
> implementation got much simpler and faster.
>
> - andy g


Re: How to handle range queries over large ranges and avoid Too Many Boolean clauses

Posted by Andy Goodell <go...@gmail.com>.
In our application we had a similar problem with non-date ranges until
we realized that we weren't so much searching for the values in the
range as restricting the search to that range.  We then used an
extension of the org.apache.lucene.search.Filter class, and our
implementation got much simpler and faster.
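
To make the "restrict rather than search" idea concrete, here is a
minimal sketch that does not use Lucene at all. A TreeMap stands in
for the sorted term dictionary, and the class RangeBitSetSketch and
its methods are hypothetical names invented for this example. It
builds one bit per document for every term inside the range, then
intersects that bit set with the hits of the main query, so the number
of terms in the range never turns into boolean clauses.

```java
import java.util.BitSet;
import java.util.TreeMap;

// Hypothetical stand-in for an index: maps each date term to the ids of
// the documents that contain it. Not Lucene code.
final class RangeBitSetSketch {
    private final TreeMap<String, int[]> postings = new TreeMap<>();
    private final int maxDoc;

    RangeBitSetSketch(int maxDoc) {
        this.maxDoc = maxDoc;
    }

    void add(String term, int... docIds) {
        postings.put(term, docIds);
    }

    // Sets one bit per document that has a term in [lower, upper].
    // subMap throws IllegalArgumentException if lower sorts after upper.
    BitSet rangeBits(String lower, String upper) {
        BitSet bits = new BitSet(maxDoc);
        for (int[] docs : postings.subMap(lower, true, upper, true).values()) {
            for (int doc : docs) {
                bits.set(doc);
            }
        }
        return bits;
    }

    // Keeps only those hits of the main query that the filter allows.
    static BitSet restrict(BitSet queryHits, BitSet allowed) {
        BitSet result = (BitSet) queryHits.clone();
        result.and(allowed);
        return result;
    }
}
```

Because the filter only answers "is this document allowed?", it does
not contribute to scoring, which is usually what is wanted for a date
restriction.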

- andy g

On Tue, 18 May 2004 10:38:01 -0700, Claude Devarenne
<cl...@library.ucsf.edu> wrote:
> Is there a simpler, easier way to do this?


Re: How to handle range queries over large ranges and avoid Too Many Boolean clauses

Posted by Claude Devarenne <cl...@library.ucsf.edu>.
Thanks, I will look at the sorting code.  Sorting results by date is
next on my list.  For now, I only have a small number of documents, but
the set is to grow to over 8 million documents for the collection I am
working on.  Another collection we have is 40 million documents or so.
From what you say, it seems to me that sorting will not scale when I
get to a larger number of documents.  I am considering using an SQL
back end to implement sorting: bring back the unique IDs from Lucene
and then sort in SQL.

Claude

On May 18, 2004, at 11:23 PM, Morus Walter wrote:

> I think it would be worth taking a look at the sorting code.
>
> The idea of the sorting code is to have an array of the dates for each
> doc in memory and access this array for sorting.
> Now sorting isn't the only thing one might use this array for.
> Doing a range check is another.
> So you might extend the sorting code by a range selection.
>
> There is no code for this in Lucene and you have to create your own
> searcher, but it gives you a fast way to search and sort by date.
>
> I did this independently from the new sorting code (I just started a
> little too early) and it works quite well.
> The only drawback of this (and of the new sorting code) is that it
> requires an array of field values that must be rebuilt each time the
> index changes.
> Shouldn't be a problem for 60000 documents.


Re: How to handle range queries over large ranges and avoid Too Many Boolean clauses

Posted by Morus Walter <mo...@tanto.de>.
Claude Devarenne writes:
> Is there a simpler, easier way to do this?
> 
I think it would be worth taking a look at the sorting code.

The idea of the sorting code is to have an array of the dates for each doc
in memory and access this array for sorting.
Now sorting isn't the only thing one might use this array for.
Doing a range check is another.
So you might extend the sorting code by a range selection.

There is no code for this in Lucene and you have to create your own searcher,
but it gives you a fast way to search and sort by date.

I did this independently from the new sorting code (I just started a little
too early) and it works quite well.
The only drawback of this (and of the new sorting code) is that it requires
an array of field values that must be rebuilt each time the index changes.
Shouldn't be a problem for 60000 documents.
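
A minimal sketch of the array approach described above, written
without Lucene. It assumes the dates have already been loaded into an
int array indexed by document id (one YYYYMMDD value per document);
the class DateArraySketch and its method are hypothetical names, not
the sorting code that ships with Lucene. The same array serves both
the range check and the sort.

```java
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical helper: keeps one date per document in memory and uses it
// for both range selection and sorting. Not Lucene code.
final class DateArraySketch {
    // dateByDoc[docId] holds that document's date as an int, e.g. 19850615.
    private final int[] dateByDoc;

    DateArraySketch(int[] dateByDoc) {
        // Must be rebuilt whenever the index (and so the doc ids) changes.
        this.dateByDoc = dateByDoc.clone();
    }

    // Takes the doc ids matched by the main query, keeps those whose date
    // lies in [from, to], and returns them ordered by ascending date.
    int[] inRangeSortedByDate(int[] hits, int from, int to) {
        return Arrays.stream(hits)
                .filter(doc -> dateByDoc[doc] >= from && dateByDoc[doc] <= to)
                .boxed()
                .sorted(Comparator.comparingInt((Integer doc) -> dateByDoc[doc]))
                .mapToInt(Integer::intValue)
                .toArray();
    }
}
```

The range check is a single array lookup per hit, so its cost depends
on the number of hits rather than on the width of the date range. The
price is four bytes of memory per document in the index.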

Morus


Re: How to handle range queries over large ranges and avoid Too Many Boolean clauses

Posted by Matt Quail <ma...@ctx.com.au>.
 > Is there a simpler, easier way to do this?

Yes. I have started implementing a "QuickRangeQuery" class that doesn't
have the BooleanQuery limitation but scores every matching document as 1.0.

I will see if I can get it finished in the next 24 hours, and post back 
to this thread.

=Matt

PS: I'm not sure about the "QuickRangeQuery" class name... maybe 
"NormalizedRangeQuery", "RangeQuery2"... *shrug*


