Posted to dev@lucene.apache.org by Julien Nioche <Ju...@lingway.com> on 2004/07/01 10:53:13 UTC

Re: Optimizing for long queries? >> 40% faster by changing INDEX_INTERVAL

I dug a little deeper into my experiments with INDEX_INTERVAL. In a
previous mail to the user list I reported a 10% improvement over the regular
setting (128) with one of my applications.
I refined the measurements by timing not the whole application but a single
method that encapsulates the Lucene searches. Only the search time is
measured, not the access to the Documents.
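
A minimal sketch of this kind of measurement, written in present-day Java.
SearchTimer is a hypothetical helper, not a Lucene class, and the lambda in
the demo is only a stand-in for the real search call. The point is that the
timer wraps the search alone, so document retrieval stays outside the
measured span.

```java
import java.util.function.Supplier;

// Hypothetical helper (not part of Lucene): accumulates the time spent
// inside the wrapped search calls only.
final class SearchTimer {
    private long totalNanos = 0;
    private int calls = 0;

    // Runs the search and adds its elapsed time to the running total.
    <T> T time(Supplier<T> search) {
        long start = System.nanoTime();
        try {
            return search.get();
        } finally {
            totalNanos += System.nanoTime() - start;
            calls++;
        }
    }

    int calls() { return calls; }

    long totalNanos() { return totalNanos; }

    // Average time per timed call, in milliseconds (0 if nothing was timed).
    double averageMillis() {
        return calls == 0 ? 0.0 : totalNanos / 1e6 / calls;
    }
}

class SearchTimerDemo {
    public static void main(String[] args) {
        SearchTimer timer = new SearchTimer();
        // Stand-in for the real search call; replace the lambda body with it.
        Integer hits = timer.time(() -> 42);
        System.out.println(hits + " hits, avg ms per search: " + timer.averageMillis());
    }
}
```

Anything done with the returned hits afterwards, such as loading the
Documents, happens outside time() and is therefore not counted.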

Two sets of queries are generated from a log of user queries from our
application. These queries are in natural language and are expanded by our
product into a Lucene boolean query. Attached is the boolean query generated
for the query "Burgundy wine", just to give you an idea of what I mean by a
large query (this one is particularly big).

These queries are run against an index built with INDEX_INTERVAL=16 and a
regular index (INDEX_INTERVAL=128). The index used for this test is 720 MB,
in an FSDirectory on Fedora 1; the .tii file is 3398 KB in the modified
version against 488 KB in the original. Both sets of queries have the same
size (783). The xls file contains the times for both indexes, sorted in
decreasing order. Note that each number is the time not of a single search
but of a group of up to 4 searches.

On average, changing INDEX_INTERVAL to 16 yields an improvement of about
40% compared to the regular setting.
I will try with a bigger sample of 40,000 queries and with smaller queries
as well.
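
To illustrate why a smaller interval can help, here is a toy model of a
sparse term index in the spirit of Doug's explanation quoted further down:
every interval-th term is kept in memory, a lookup binary-searches those
entries, then scans the remaining terms sequentially. This is a sketch under
that assumption only. SparseTermIndex and its methods are hypothetical names
and this is not Lucene's actual term dictionary code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Toy model of a sparse term index; NOT Lucene's real implementation.
final class SparseTermIndex {
    private final List<String> sortedTerms;                     // stands in for the full term list on disk
    private final List<String> indexTerms = new ArrayList<>();  // every interval-th term, held in memory
    private final int interval;

    SparseTermIndex(List<String> sortedTerms, int interval) {
        this.sortedTerms = sortedTerms;
        this.interval = interval;
        for (int i = 0; i < sortedTerms.size(); i += interval) {
            indexTerms.add(sortedTerms.get(i));
        }
    }

    int inMemoryEntries() { return indexTerms.size(); }

    // Number of terms read sequentially to find (or miss) the given term.
    int scanCost(String term) {
        // Binary search for the last in-memory entry <= term.
        int lo = 0, hi = indexTerms.size() - 1, block = 0;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (indexTerms.get(mid).compareTo(term) <= 0) {
                block = mid;
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        // Sequential scan inside the selected block.
        int start = block * interval;
        int end = Math.min(start + interval, sortedTerms.size());
        int scanned = 0;
        for (int i = start; i < end; i++) {
            scanned++;
            if (sortedTerms.get(i).compareTo(term) >= 0) break;
        }
        return scanned;
    }
}

class SparseTermIndexDemo {
    // Average scan cost when every term of the list is looked up once.
    static double averageScan(SparseTermIndex index, List<String> terms) {
        long total = 0;
        for (String t : terms) total += index.scanCost(t);
        return (double) total / terms.size();
    }

    public static void main(String[] args) {
        List<String> terms = new ArrayList<>();
        for (int i = 0; i < 1024; i++) terms.add(String.format(Locale.ROOT, "t%05d", i));
        for (int interval : new int[] {128, 16}) {
            SparseTermIndex index = new SparseTermIndex(terms, interval);
            System.out.println("interval=" + interval
                    + " inMemory=" + index.inMemoryEntries()
                    + " avgScan=" + averageScan(index, terms));
        }
    }
}
```

In this model the average scan length is roughly half the interval, while
the number of in-memory entries grows by a factor of 8 when going from 128
to 16. That is in line with the .tii file growing by about 7x (3398 KB
against 488 KB) in my test.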

The original motivation for this feature can be found at
http://www.mail-archive.com/lucene-dev@jakarta.apache.org/msg04092.html

What is the best way to set this value in IndexWriter? Maybe we could
limit it to a few possible values, such as:
DEFAULT = 128
AVERAGE = 64
HIGH = 32
in order to avoid settings that are too low.
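
One possible shape for this, purely as a sketch: an enum that carries the
three values above, plus a setter that accepts only that enum.
TermIndexInterval and WriterSettings are hypothetical names chosen for
illustration. Neither exists in Lucene, and WriterSettings merely stands in
for wherever IndexWriter would keep the setting.

```java
// Hypothetical sketch of the proposal: only a few vetted intervals can be chosen.
// The constant names and values mirror the ones suggested in this mail.
enum TermIndexInterval {
    DEFAULT(128),
    AVERAGE(64),
    HIGH(32);

    private final int value;

    TermIndexInterval(int value) { this.value = value; }

    int value() { return value; }
}

// Hypothetical stand-in for the writer-side configuration (not IndexWriter itself).
final class WriterSettings {
    private TermIndexInterval interval = TermIndexInterval.DEFAULT;

    void setTermIndexInterval(TermIndexInterval interval) {
        if (interval == null) {
            throw new IllegalArgumentException("interval must not be null");
        }
        this.interval = interval;
    }

    int termIndexInterval() { return interval.value(); }
}

class WriterSettingsDemo {
    public static void main(String[] args) {
        WriterSettings settings = new WriterSettings();
        settings.setTermIndexInterval(TermIndexInterval.HIGH);
        System.out.println("term index interval: " + settings.termIndexInterval());
    }
}
```

Because the setter takes the enum rather than an int, an arbitrary low value
cannot be passed by accident. Note that the value 16 used in the tests above
would not be selectable under this scheme.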

Any comments or suggestions? Can anyone give feedback on this?

Julien



----- Original Message ----- 
From: "Julien Nioche" <Ju...@lingway.com>
To: "Lucene Users List" <lu...@jakarta.apache.org>
Sent: Tuesday, June 29, 2004 3:03 PM
Subject: Re: Optimizing for long queries?


> I ran some tests changing TermInfosWriter.INDEX_INTERVAL to 16.
> In my application (which does a lot on top of Lucene, including SQL
> transactions and so on) I gained 10% in overall time.
> I suppose the improvement could be bigger in other applications, because
> searching with Lucene is not 100% of my application.
>
> The index used for this test is 720 MB, in an FSDirectory on Fedora 1;
> the .tii file is 3398 KB in the modified version against 488 KB in the
> original (INDEX_INTERVAL=128).
>
> Has anyone tried changing this value? Do you get similar results?
>
> Julien
>
> ----- Original Message ----- 
> From: "Julien Nioche" <Ju...@lingway.com>
> To: "Lucene Users List" <lu...@jakarta.apache.org>
> Sent: Monday, June 28, 2004 10:04 AM
> Subject: Re: Optimizing for long queries?
>
>
> > Hello Drew,
> >
> > I don't think it's in the FAQ.
> >
> > 1 - What you could do is sort your query terms in ascending alphabetical
> > order. In my case it improved performance a little. It would be
> > interesting to know how it works in your case.
> >
> > 2 - Another solution is to play with TermInfosWriter.INDEX_INTERVAL at
> > indexing time. To quote Doug:
> >
> > "..., try reducing TermInfosWriter.INDEX_INTERVAL.  You'll
> > have to re-create your indexes each time you change this constant.  You
> > might try a value like 16.  This would keep the number of terms in
> > memory from being too huge (1 of 16 terms), but would reduce the average
> > number scanned from 64 to 8, which would be substantial.  Tell me how
> > this works.  If it makes a big difference, then perhaps we should make
> > this parameter more easily changeable."
> >
> > Have you used a profiler on your application? This could be useful to
> > spot possible improvements.
> >
> >
> > ----- Original Message ----- 
> > From: "Drew Farris" <dr...@gmail.com>
> > To: <lu...@jakarta.apache.org>
> > Sent: Friday, June 25, 2004 8:24 PM
> > Subject: Optimizing for long queries?
> >
> >
> > > Apologies if this is a FAQ, but I didn't have much luck searching the
> > > list archives for answers on this subject:
> > >
> > > I'm using Lucene in a context where we frequently have queries
> > > that search for as many as 30-50 terms in a single field. Does anyone
> > > have any thoughts concerning ways to optimize Lucene for queries of
> > > these lengths?
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> > > For additional commands, e-mail: lucene-user-help@jakarta.apache.org

Re: Optimizing for long queries? >> 40% faster by changing INDEX_INTERVAL

Posted by Julien Nioche <Ju...@lingway.com>.
The xls files did not go through. You can download them from the following URLs:
http://jnioche.freesurf.fr/shortQueries.xls
http://jnioche.freesurf.fr/longQueries.xls

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org


Re: Optimizing for long queries? >> 40% faster by changing INDEX_INTERVAL

Posted by Julien Nioche <Ju...@lingway.com>.
A similar experiment with 500 shorter queries shows a 20% speed improvement
(see the xls file for details).
By a shorter query I mean something like this:
((titre:"burgundy wines"~3 titre:"burgundy wine"~3)) ((texte:"burgundy
wines"~3^3.0 texte:"burgundy wine"~3^3.0)) ((descr:"burgundy wines"~3^4.0
descr:"burgundy wine"~3^4.0)) ((kw:"burgundy wines"~3^4.0 kw:"burgundy
wine"~3^4.0))
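
To make the structure of such a query explicit, here is a small sketch that
rebuilds the same kind of query string: one group per field, and within each
group one sloppy phrase per variant, with an optional boost. It only
reproduces the textual shape shown above. QueryExpansionSketch is a
hypothetical name, and how our product actually performs the expansion is
not shown here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration: builds a query string of the shape shown above.
final class QueryExpansionSketch {
    // fieldBoosts maps a field name to its boost as text, or null for no boost.
    static String expand(Map<String, String> fieldBoosts, List<String> phrases, int slop) {
        List<String> groups = new ArrayList<>();
        for (Map.Entry<String, String> field : fieldBoosts.entrySet()) {
            List<String> clauses = new ArrayList<>();
            for (String phrase : phrases) {
                // field:"phrase"~slop, optionally followed by ^boost
                String clause = field.getKey() + ":\"" + phrase + "\"~" + slop;
                if (field.getValue() != null) {
                    clause += "^" + field.getValue();
                }
                clauses.add(clause);
            }
            groups.add("((" + String.join(" ", clauses) + "))");
        }
        return String.join(" ", groups);
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the field order stable in the output.
        Map<String, String> fieldBoosts = new LinkedHashMap<>();
        fieldBoosts.put("titre", null);
        fieldBoosts.put("texte", "3.0");
        fieldBoosts.put("descr", "4.0");
        fieldBoosts.put("kw", "4.0");
        List<String> phrases = Arrays.asList("burgundy wines", "burgundy wine");
        System.out.println(expand(fieldBoosts, phrases, 3));
    }
}
```

With four fields and two phrase variants this gives eight phrase clauses,
which is what I call a short query here.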
