Posted to java-user@lucene.apache.org by Avi Levy <le...@wesee.com> on 2013/04/09 16:51:01 UTC

How to improve retrieval time when searching for a date range

Hello,

I have a Lucene.NET index created with version 2.9.4.1. The index holds
about 25 million entries (the production environment will have 50 million
or more) and is 5.75GB in size. The index is used for searching by text.
I need to add new functionality that allows querying a specific date range
in addition to the textual search (the query is for text AND date range).
The user can select one of two date ranges: the last 7 days or the last
30 days.

The implementation I tried was to add a new indexed-only numeric field
representing a date. The date is indexed as an integer in the format
yyyyMMdd. I am indexing this field with a precision step of 1 (to make
retrieval the fastest). During retrieval I create a Boolean query that
contains the original query, and I add a MUST clause for the date range.
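
As a small illustration of the yyyyMMdd encoding described above, here is a
sketch in plain Java (standard library only, no Lucene). The class and
method names are made up for the example. It shows that numeric order of
the packed integers matches chronological order, which is what lets a
numeric range stand in for a date range.

```java
import java.time.LocalDate;

final class DateKeySketch {
    // Packs a calendar date into a yyyyMMdd integer, e.g. 2013-04-09 -> 20130409.
    // (Hypothetical helper; mirrors the Day + Month * 100 + Year * 10000 arithmetic.)
    static int toDateKey(LocalDate date) {
        return date.getYear() * 10000
                + date.getMonthValue() * 100
                + date.getDayOfMonth();
    }

    public static void main(String[] args) {
        LocalDate end = LocalDate.of(2013, 4, 9);
        LocalDate begin = end.minusDays(7);
        // Bounds for an inclusive "last 7 days" range.
        System.out.println(toDateKey(begin) + " .. " + toDateKey(end));
    }
}
```

The keys are not contiguous across month or year boundaries (20130331 is
followed by 20130401), but they stay ordered, so an inclusive range over the
keys still selects exactly the dates between the two bounds.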

When I compare the results to regular textual queries I see much slower
search times. I compared by running 10 queries for warm-up (I don't count
their results), then another 90 queries whose results I do count.
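
The measurement procedure above (warm-up runs that are discarded, followed
by timed runs) can be sketched roughly as follows. This is not the harness
that produced the numbers in this message; the class, the Stats type and the
Runnable standing in for a search call are all assumptions made for the
example. The 700 ms threshold is the one that appears in the statistics.

```java
final class SearchTimingSketch {
    /** Summary of the timed runs; all times are in milliseconds. */
    static final class Stats {
        final long min;
        final long max;
        final double mean;
        final int slowCount;

        Stats(long min, long max, double mean, int slowCount) {
            this.min = min;
            this.max = max;
            this.mean = mean;
            this.slowCount = slowCount;
        }
    }

    // Runs `search` warmUp times without timing, then `measured` times with timing.
    // `measured` must be greater than zero.
    static Stats measure(Runnable search, int warmUp, int measured, long slowMs) {
        for (int i = 0; i < warmUp; i++) {
            search.run(); // warm-up: executed but not counted
        }
        long min = Long.MAX_VALUE;
        long max = 0;
        long total = 0;
        int slowCount = 0;
        for (int i = 0; i < measured; i++) {
            long start = System.nanoTime();
            search.run();
            long ms = (System.nanoTime() - start) / 1_000_000;
            min = Math.min(min, ms);
            max = Math.max(max, ms);
            total += ms;
            if (ms > slowMs) {
                slowCount++; // counts runs above the threshold
            }
        }
        return new Stats(min, max, (double) total / measured, slowCount);
    }

    public static void main(String[] args) {
        // Placeholder workload; a real run would execute one search per call.
        Stats s = measure(() -> { }, 10, 90, 700);
        System.out.println("min=" + s.min + " max=" + s.max
                + " mean=" + s.mean + " above700ms=" + s.slowCount);
    }
}
```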

I would appreciate suggestions and tips on how the performance of searching
by dates can be improved.

You can see below the statistics for the runs, and the code for creating the
fields and the query.

Thanks,
Avi

No changes (using index with no dates)
08 18:17:01,213 [1]  INFO: {(null)} - Min search time: 2
08 18:17:01,213 [1]  INFO: {(null)} - Max search time: 88
08 18:17:01,213 [1]  INFO: {(null)} - Average search time: 23.0674157303371
08 18:17:01,213 [1]  INFO: {(null)} - Search time Variance : 20.5
08 18:17:01,213 [1]  INFO: {(null)} - Number of results above 700ms: 0

Index With Date (not using dates in query)
08 18:22:49,093 [1]  INFO: {(null)} - Min search time: 3
08 18:22:49,093 [1]  INFO: {(null)} - Max search time: 176
08 18:22:49,093 [1]  INFO: {(null)} - Average search time: 50.9325842696629
08 18:22:49,093 [1]  INFO: {(null)} - Search time Variance : 46.85
08 18:22:49,093 [1]  INFO: {(null)} - Number of results above 700ms: 0

With Dates - Last 7 Days
08 19:38:17,988 [1]  INFO: {(null)} - Min search time: 33
08 19:38:17,988 [1]  INFO: {(null)} - Max search time: 1668
08 19:38:17,988 [1]  INFO: {(null)} - Average search time: 704.741573033708
08 19:38:17,988 [1]  INFO: {(null)} - Search time Variance : 607.05
08 19:38:17,988 [1]  INFO: {(null)} - Number of results above 700ms: 44

With Dates - Last 30 Days
08 19:48:17,123 [1]  INFO: {(null)} - Min search time: 105
08 19:48:17,123 [1]  INFO: {(null)} - Max search time: 4808
08 19:48:17,123 [1]  INFO: {(null)} - Average search time: 2846.75280898876
08 19:48:17,123 [1]  INFO: {(null)} - Search time Variance : 1934.11
08 19:48:17,123 [1]  INFO: {(null)} - Number of results above 700ms: 72

Here are the field definitions:

var idField = new Field( "ID", String.Empty, Field.Store.YES,
    Field.Index.NOT_ANALYZED_NO_NORMS );
document.Add( idField );

var id2Field = new Field( "ID2", String.Empty, Field.Store.YES,
    Field.Index.NO );
document.Add( id2Field );

var txtField = new Field( "txtField", String.Empty, Field.Store.NO,
    Field.Index.ANALYZED );
document.Add( txtField );

var txt2Field = new Field( "txt2Field", String.Empty, Field.Store.NO,
    Field.Index.ANALYZED );
document.Add( txt2Field );

var txt3Field = new Field( "txt3Field", String.Empty, Field.Store.NO,
    Field.Index.ANALYZED );
document.Add( txt3Field );

// The new date field (precision step 1, not stored, indexed)
var dateField = new NumericField( "Date", 1, Field.Store.NO, true );
document.Add( dateField );

I then set the field values. The new date field is set like this:

Int32 dateInt = <some date>;  // yyyyMMdd; SetIntValue takes a 32-bit int
dateField.SetIntValue( dateInt );

The query:

var fields = new String[3];
Dictionary<String, Single> boosts = new Dictionary<String, Single>();

fields[0] = "txtField";
boosts.Add( fields[0], <Value> );
fields[1] = "txt2Field";
boosts.Add( fields[1], <Value> );
fields[2] = "txt3Field";
boosts.Add( fields[2], <Value> );

MultiFieldQueryParser parser = new MultiFieldQueryParser( Version.LUCENE_29,
    fields, analyzer, boosts );
var boolQuery = new BooleanQuery();
Query simpleParsedQuery = parser.Parse( queryText );
boolQuery.Add( simpleParsedQuery, BooleanClause.Occur.MUST );

// Encode both ends of the range as yyyyMMdd integers.
DateTime beginDate = <date 7 or 30 days ago>;
Int32 beginDateInt = beginDate.Day + beginDate.Month * 100
    + beginDate.Year * 10000;

DateTime now = DateTime.UtcNow;
Int32 endDateInt = now.Day + now.Month * 100 + now.Year * 10000;

// Range is inclusive on both ends.
NumericRangeQuery datesQuery = NumericRangeQuery.NewIntRange( "Date",
    beginDateInt, endDateInt, true, true );
boolQuery.Add( datesQuery, BooleanClause.Occur.MUST );

RE: How to improve retrieval time when searching for a date range

Posted by Uwe Schindler <uw...@thetaphi.de>.
Hi,

A precision step of 1 is not necessarily the fastest (see the Lucene javadocs; Lucene.NET should be similar). Try the default, 4, first. In general, such range queries will always be slower than text-only queries, as there is much more work to do (more terms, more documents, ...).
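
To make the precision-step trade-off concrete, the following sketch (plain
Java, no Lucene dependency) computes two quantities for a 32-bit field: how
many trie terms are indexed per value, and the worst-case number of terms a
single range query may visit. The second formula follows the bound given in
the Lucene Java NumericRangeQuery javadoc; the class and method names are
made up for the example, and the numbers are bounds, not measurements.

```java
final class PrecisionStepSketch {
    // Terms indexed per value: one per precision level, ceil(bits / step).
    static int termsPerValue(int bitsPerValue, int precisionStep) {
        return (bitsPerValue + precisionStep - 1) / precisionStep;
    }

    // Worst-case number of terms visited by one range query:
    // (termsPerValue - 1) * (2^step - 1) * 2 + (2^step - 1)
    static long maxQueryTerms(int bitsPerValue, int precisionStep) {
        long perLevel = (1L << precisionStep) - 1;
        return (termsPerValue(bitsPerValue, precisionStep) - 1) * perLevel * 2
                + perLevel;
    }

    public static void main(String[] args) {
        for (int step : new int[] {1, 2, 4, 8}) {
            System.out.printf("step=%d  terms/value=%d  max query terms=%d%n",
                    step, termsPerValue(32, step), maxQueryTerms(32, step));
        }
    }
}
```

For a 32-bit value, step 1 indexes 32 terms per value against 8 for step 4,
while the worst-case query bound is 63 terms against 225. A smaller step
therefore lowers the theoretical query bound but makes the index carry four
times as many terms for the field, so which setting is faster in practice has
to be measured.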

This question is more related to Lucene.NET so I would ask the question on their mailing list.

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de

