Posted to user@lucenenet.apache.org by Avi Levy <le...@wesee.com> on 2013/04/09 16:58:01 UTC

How to improve retrieval time when searching for a date range

Hello,

I have a Lucene.NET index created with version 2.9.4.1. The index holds
about 25 million entries (in the production environment I will have
50 million or more) and is 5.75GB in size. The index is used for
searching by text. I need to add new functionality that allows performing
a query for a specific date range in addition to the textual search (the
query is for text AND date range). The user can select either the last
7 days or the last 30 days.

The implementation I tried was to add a new indexed-only numeric field
representing a date. The date is indexed as an integer in the format yyyyMMdd.
I am indexing this field with a precision step of 1 (to make retrieval
as fast as possible). During retrieval I create a Boolean query that contains
the original query, and I add a MUST clause for the date range.

A few days ago I posted a question and got some useful suggestions. I have
reached a point where I get acceptable search times when I compare queries
on the index with the dates to the index without them. However, the problem
I am facing now is that the queries with the dates are slow. I would
appreciate suggestions and tips on how the performance of searching by
dates can be improved.

You can see below the statistics for the runs, and the code for creating the
fields and the query.

Thanks,
Avi

No changes (using index with no dates)
08 18:17:01,213 [1]  INFO: {(null)} - Min search time: 2
08 18:17:01,213 [1]  INFO: {(null)} - Max search time: 88
08 18:17:01,213 [1]  INFO: {(null)} - Average search time: 23.0674157303371
08 18:17:01,213 [1]  INFO: {(null)} - Search time Variance : 20.5
08 18:17:01,213 [1]  INFO: {(null)} - Number of results above 700ms: 0

Index With Date (not using dates in query)
08 18:22:49,093 [1]  INFO: {(null)} - Min search time: 3
08 18:22:49,093 [1]  INFO: {(null)} - Max search time: 176
08 18:22:49,093 [1]  INFO: {(null)} - Average search time: 50.9325842696629
08 18:22:49,093 [1]  INFO: {(null)} - Search time Variance : 46.85
08 18:22:49,093 [1]  INFO: {(null)} - Number of results above 700ms: 0

With Dates - Last 7 Days
08 19:38:17,988 [1]  INFO: {(null)} - Min search time: 33
08 19:38:17,988 [1]  INFO: {(null)} - Max search time: 1668
08 19:38:17,988 [1]  INFO: {(null)} - Average search time: 704.741573033708
08 19:38:17,988 [1]  INFO: {(null)} - Search time Variance : 607.05
08 19:38:17,988 [1]  INFO: {(null)} - Number of results above 700ms: 44

With Dates - Last 30 Days
08 19:48:17,123 [1]  INFO: {(null)} - Min search time: 105
08 19:48:17,123 [1]  INFO: {(null)} - Max search time: 4808
08 19:48:17,123 [1]  INFO: {(null)} - Average search time: 2846.75280898876
08 19:48:17,123 [1]  INFO: {(null)} - Search time Variance : 1934.11
08 19:48:17,123 [1]  INFO: {(null)} - Number of results above 700ms: 72

Here are the field definitions:

var idField = new Field( "ID", String.Empty, Field.Store.YES,
    Field.Index.NOT_ANALYZED_NO_NORMS );
document.Add( idField );

var id2Field = new Field( "ID2", String.Empty, Field.Store.YES,
    Field.Index.NO );
document.Add( id2Field );

var txtField = new Field( "txtField", String.Empty, Field.Store.NO,
    Field.Index.ANALYZED );
document.Add( txtField );

var txt2Field = new Field( "txt2Field", String.Empty, Field.Store.NO,
    Field.Index.ANALYZED );
document.Add( txt2Field );

var txt3Field = new Field( "txt3Field", String.Empty, Field.Store.NO,
    Field.Index.ANALYZED );
document.Add( txt3Field );

// The new date field
var dateField = new NumericField( "Date", 1, Field.Store.NO, true );
document.Add( dateField );

 

I then set the values of the fields. For the new date field I set it like this:

// yyyyMMdd as a 32-bit integer, which is what SetIntValue expects
Int32 dateInt = <some date>;
dateField.SetIntValue( dateInt );

 

The query:

var fields = new String[3];
Dictionary<String, Single> boosts = new Dictionary<String, Single>();

fields[0] = "txtField";
boosts.Add( fields[0], <Value> );
fields[1] = "txt2Field";
boosts.Add( fields[1], <Value> );
fields[2] = "txt3Field";
boosts.Add( fields[2], <Value> );

MultiFieldQueryParser parser = new MultiFieldQueryParser( Version.LUCENE_29,
    fields, analyzer, boosts );
var boolQuery = new BooleanQuery();
Query simpleParsedQuery = parser.Parse( queryText );
boolQuery.Add( simpleParsedQuery, BooleanClause.Occur.MUST );

// Encode both range bounds as yyyyMMdd integers
DateTime beginDate = <date 7 or 30 days ago>;
Int32 beginDateInt = beginDate.Day + beginDate.Month * 100
    + beginDate.Year * 10000;

DateTime now = DateTime.UtcNow;
Int32 endDateInt = now.Day + now.Month * 100 + now.Year * 10000;

NumericRangeQuery datesQuery = NumericRangeQuery.NewIntRange( "Date",
    beginDateInt, endDateInt, true, true );
boolQuery.Add( datesQuery, BooleanClause.Occur.MUST );
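
The yyyyMMdd arithmetic used for the range bounds can be kept in one place, so that the value written at indexing time and the bounds computed at query time cannot drift apart. The following is only a sketch: the class name DateKey and its two methods are hypothetical and do not appear in my actual code.

```csharp
using System;

// Hypothetical helper (an assumption, not part of the original code):
// a single home for the yyyyMMdd encoding used by both indexing and querying.
public static class DateKey
{
    // Encodes a date as an Int32 in yyyyMMdd form, using the same
    // Day + Month * 100 + Year * 10000 formula as the query code.
    public static int ToDateInt(DateTime date)
    {
        return date.Day + date.Month * 100 + date.Year * 10000;
    }

    // Inclusive lower bound for a "last N days" range that ends at 'end'.
    public static int LowerBound(DateTime end, int days)
    {
        return ToDateInt(end.AddDays(-days));
    }
}
```

With such a helper, the bounds would be computed as DateKey.LowerBound( DateTime.UtcNow, 7 ) and DateKey.ToDateInt( DateTime.UtcNow ), and the indexed value as DateKey.ToDateInt( <some date> ).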

 


RE: How to improve retrieval time when searching for a date range

Posted by Moray McConnachie <mm...@oxford-analytica.com>.
Have you experimented with using strings with no analyser instead of
numerics? Apologies if this was in your original post.

This is what we do (in an older version of Lucene), though I haven't run
comparatives on it and I have no idea if it's best practice. But date
strings in yyyyMMdd behave just fine as they can be used for order and
range queries.

You sacrifice some index size.
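
To illustrate why this works, here is a small sketch (not our production code; the class name DateString and both methods are made up for the example). Fixed-width yyyyMMdd strings compare in the same order as the dates they encode, so a plain string range is enough. In Lucene.NET this would presumably mean a NOT_ANALYZED field queried with a TermRangeQuery, but I have not benchmarked that against NumericRangeQuery.

```csharp
using System;
using System.Globalization;

// Hypothetical helper showing that yyyyMMdd strings order like dates.
public static class DateString
{
    // Formats a date as a fixed-width, digits-only key such as "20130409".
    public static string ToKey(DateTime date)
    {
        return date.ToString("yyyyMMdd", CultureInfo.InvariantCulture);
    }

    // True when 'key' lies within [lower, upper] under ordinal comparison.
    // For digits-only keys of equal length, ordinal order is chronological order.
    public static bool InRange(string key, string lower, string upper)
    {
        return string.CompareOrdinal(key, lower) >= 0
            && string.CompareOrdinal(key, upper) <= 0;
    }
}
```

The keys must always be zero-padded to eight digits; a key such as "201349" would break the ordering.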

Yours,
Moray



Re: How to improve retrieval time when searching for a date range

Posted by Itamar Syn-Hershko <it...@code972.com>.
Did you try using a filter as I suggested? A range query, any range query,
is going to get rather expensive as you make its range larger.
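
A rough sketch of what I mean, assuming the "Date" field from your post and the NumericRangeFilter and CachingWrapperFilter classes in Lucene.Net.Search. The class and method names here are made up, and I have not run this against 2.9.4.1, so treat it as a starting point rather than a drop-in fix.

```csharp
using Lucene.Net.Search;

// Hypothetical wrapper: the names DateFilterSketch, BuildDateFilter and
// SearchWithDateFilter are assumptions made for this example.
public static class DateFilterSketch
{
    // Builds a cached filter over the "Date" field for [beginDateInt, endDateInt].
    // CachingWrapperFilter keeps the matching doc-id set per index reader,
    // so the same instance should be reused across searches.
    public static Filter BuildDateFilter(int beginDateInt, int endDateInt)
    {
        Filter range = NumericRangeFilter.NewIntRange(
            "Date", beginDateInt, endDateInt, true, true);
        return new CachingWrapperFilter(range);
    }

    // Runs the text query alone and lets the filter restrict the candidates,
    // instead of adding the date range as a MUST clause.
    public static TopDocs SearchWithDateFilter(
        IndexSearcher searcher, Query textQuery, Filter dateFilter, int maxHits)
    {
        return searcher.Search(textQuery, dateFilter, maxHits);
    }
}
```

Since the bounds only change once a day, the "last 7 days" and "last 30 days" filters can each be built once per day and shared by all searches, as long as the same searcher (index reader) stays open.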

