You are viewing a plain text version of this content. The canonical link for it is here.

Posted to java-user@lucene.apache.org by "Mittal, Sourabh (IDEAS)" <So...@morganstanley.com> on 2009/02/02 08:58:11 UTC

Performance issue

Hi All, 

We face serious performance issues when users do 2 letter search e.g ho,
jo, pa ma, um ar, ma fi etc. time taken between 10 - 15 secs. 
Below is our implementation details:
 
1. Search performs on 7 fields.
2. PrefixQuery implementation on all fields
3. AND search. 
4. Our indexer size is 300 MB. 
5. We show only 100 top documents only on the basis of score. 
6. We user StandardAnalyzer & StandardTokenizer for indexing &
searching. 
7. Lucene 2.4
8. JDK1 .6

Please suggest me how can we improve the performance. 

Regards,
Sourabh Mittal
Morgan Stanley | IDEAS Practice Areas
Manikchand Ikon | South Wing 18 | Dhole Patil Road
Pune, 411001
Phone: +91 20 2620-7053
Sourabh-931.Mittal@morganstanley.com

 

--------------------------------------------------------------------------
NOTICE: If received in error, please destroy and notify sender. Sender does not intend to waive confidentiality or privilege. Use of this email is prohibited when received in error.

Re: Performance issue

Posted by mittals <so...@morganstanley.com>.

Hi,

All of the fields are text. We have 3 @IndexedEmbedded object with in a
object.
users can search for any string, few of the cases are morgan, john, orcl,
healthcare, pa ma, ma fin.

here is e.g. of our one query:

(+(trading.tn:pa^150.0 lastName:pa*^45.0 firstName:pa*^30.0 roleDesc:pa^3.0
roleShortDesc:pa^3.0 phoneDesc:pa^9.0 cityDesc:pa^3.0) +(trading.tn:ma^150.0
lastName:ma*^45.0 firstName:ma*^30.0 roleDesc:ma^3.0 roleShortDesc:ma^3.0
phoneDesc:ma^9.0 cityDesc:ma^3.0))

another thing is how lucene merge the result of 2 terms. e.g. in above case
we have an AND operator, so lucene merge result of pa & ma terms.

Regards,
Sourabh


Matthew Hall-7 wrote:
> 
> Do you NEED to be using 7 fields here?
> 
> Like Erick said, if you could give us an example of the types of data 
> you are trying to search against, it would be quite helpful.
> 
> Its possible that you might be able to say collapse your 7 fields down 
> to a single field, which would likely reduce the overall number of or 
> clauses in your searches, speeding things up nicely.
> 
> At my project we search two letter prefix searches in sub seconds, for 
> much larger datasets.  Alot of this however is directly due to how our 
> indexes are structured.
> 
> -Matt
> 
> Erick Erickson wrote:
>> Prefix queries are expensive here. The problem is
>> that each one forms a very large OR clause on all
>> the terms that start with those two letters. For instance,
>> if a field in your index contained
>> mine
>> milanta
>> mica
>>
>> a prefix search on "mi" would form
>> mine OR milanta OR mica.
>>
>> Doing this across seven fields could get expensive.
>>
>> Two things:
>> 1> what is the problem you are trying to solve? Perhaps some
>> of the folks on the list can give you some suggestions. You can
>> think about many strategies depending upon what you want
>> to accomplish. A 300M index isn't very big, so you could, for
>> instance, think about indexing a separate field that contains only
>> the two beginning letters and search *that* in this case. I'll
>> assume that three letter prefix queries are OK.
>>
>> 2> How are you measuring query time? If you're measuring the
>> time it takes when you first start a searcher, be aware that the
>> first few queries are usually slow because the caches haven't
>> been filled. Further, are you measuring total response time or
>> are you measuring *just* the query time? It's possible that the
>> time is being spent assembling the response in your code
>> rather than actual searching. You might insert some timers
>> to determine that.
>>
>> Best
>> Erick
>>
>> On Mon, Feb 2, 2009 at 2:58 AM, Mittal, Sourabh (IDEAS) <
>> Sourabh-931.Mittal@morganstanley.com> wrote:
>>
>>   
>>> Hi All,
>>>
>>> We face serious performance issues when users do 2 letter search e.g ho,
>>> jo, pa ma, um ar, ma fi etc. time taken between 10 - 15 secs.
>>> Below is our implementation details:
>>>
>>> 1. Search performs on 7 fields.
>>> 2. PrefixQuery implementation on all fields
>>> 3. AND search.
>>> 4. Our indexer size is 300 MB.
>>> 5. We show only 100 top documents only on the basis of score.
>>> 6. We user StandardAnalyzer & StandardTokenizer for indexing &
>>> searching.
>>> 7. Lucene 2.4
>>> 8. JDK1 .6
>>>
>>> Please suggest me how can we improve the performance.
>>>
>>> Regards,
>>> Sourabh Mittal
>>> Morgan Stanley | IDEAS Practice Areas
>>> Manikchand Ikon | South Wing 18 | Dhole Patil Road
>>> Pune, 411001
>>> Phone: +91 20 2620-7053
>>> Sourabh-931.Mittal@morganstanley.com
>>>
>>>
>>>
>>> --------------------------------------------------------------------------
>>> NOTICE: If received in error, please destroy and notify sender. Sender
>>> does
>>> not intend to waive confidentiality or privilege. Use of this email is
>>> prohibited when received in error.
>>>
>>>     
>>
>>   
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
> 
> 
> 

-- 
View this message in context: http://www.nabble.com/Performance-issue-tp21788313p21808283.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Performance issue

Posted by Matthew Hall <mh...@informatics.jax.org>.

Do you NEED to be using 7 fields here?

Like Erick said, if you could give us an example of the types of data 
you are trying to search against, it would be quite helpful.

Its possible that you might be able to say collapse your 7 fields down 
to a single field, which would likely reduce the overall number of or 
clauses in your searches, speeding things up nicely.

At my project we search two letter prefix searches in sub seconds, for 
much larger datasets.  Alot of this however is directly due to how our 
indexes are structured.

-Matt

Erick Erickson wrote:
> Prefix queries are expensive here. The problem is
> that each one forms a very large OR clause on all
> the terms that start with those two letters. For instance,
> if a field in your index contained
> mine
> milanta
> mica
>
> a prefix search on "mi" would form
> mine OR milanta OR mica.
>
> Doing this across seven fields could get expensive.
>
> Two things:
> 1> what is the problem you are trying to solve? Perhaps some
> of the folks on the list can give you some suggestions. You can
> think about many strategies depending upon what you want
> to accomplish. A 300M index isn't very big, so you could, for
> instance, think about indexing a separate field that contains only
> the two beginning letters and search *that* in this case. I'll
> assume that three letter prefix queries are OK.
>
> 2> How are you measuring query time? If you're measuring the
> time it takes when you first start a searcher, be aware that the
> first few queries are usually slow because the caches haven't
> been filled. Further, are you measuring total response time or
> are you measuring *just* the query time? It's possible that the
> time is being spent assembling the response in your code
> rather than actual searching. You might insert some timers
> to determine that.
>
> Best
> Erick
>
> On Mon, Feb 2, 2009 at 2:58 AM, Mittal, Sourabh (IDEAS) <
> Sourabh-931.Mittal@morganstanley.com> wrote:
>
>   
>> Hi All,
>>
>> We face serious performance issues when users do 2 letter search e.g ho,
>> jo, pa ma, um ar, ma fi etc. time taken between 10 - 15 secs.
>> Below is our implementation details:
>>
>> 1. Search performs on 7 fields.
>> 2. PrefixQuery implementation on all fields
>> 3. AND search.
>> 4. Our indexer size is 300 MB.
>> 5. We show only 100 top documents only on the basis of score.
>> 6. We user StandardAnalyzer & StandardTokenizer for indexing &
>> searching.
>> 7. Lucene 2.4
>> 8. JDK1 .6
>>
>> Please suggest me how can we improve the performance.
>>
>> Regards,
>> Sourabh Mittal
>> Morgan Stanley | IDEAS Practice Areas
>> Manikchand Ikon | South Wing 18 | Dhole Patil Road
>> Pune, 411001
>> Phone: +91 20 2620-7053
>> Sourabh-931.Mittal@morganstanley.com
>>
>>
>>
>> --------------------------------------------------------------------------
>> NOTICE: If received in error, please destroy and notify sender. Sender does
>> not intend to waive confidentiality or privilege. Use of this email is
>> prohibited when received in error.
>>
>>     
>
>   


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Performance issue

Posted by Erick Erickson <er...@gmail.com>.

Prefix queries are expensive here. The problem is
that each one forms a very large OR clause on all
the terms that start with those two letters. For instance,
if a field in your index contained
mine
milanta
mica

a prefix search on "mi" would form
mine OR milanta OR mica.

Doing this across seven fields could get expensive.

Two things:
1> what is the problem you are trying to solve? Perhaps some
of the folks on the list can give you some suggestions. You can
think about many strategies depending upon what you want
to accomplish. A 300M index isn't very big, so you could, for
instance, think about indexing a separate field that contains only
the two beginning letters and search *that* in this case. I'll
assume that three letter prefix queries are OK.

2> How are you measuring query time? If you're measuring the
time it takes when you first start a searcher, be aware that the
first few queries are usually slow because the caches haven't
been filled. Further, are you measuring total response time or
are you measuring *just* the query time? It's possible that the
time is being spent assembling the response in your code
rather than actual searching. You might insert some timers
to determine that.

Best
Erick

On Mon, Feb 2, 2009 at 2:58 AM, Mittal, Sourabh (IDEAS) <
Sourabh-931.Mittal@morganstanley.com> wrote:

> Hi All,
>
> We face serious performance issues when users do 2 letter search e.g ho,
> jo, pa ma, um ar, ma fi etc. time taken between 10 - 15 secs.
> Below is our implementation details:
>
> 1. Search performs on 7 fields.
> 2. PrefixQuery implementation on all fields
> 3. AND search.
> 4. Our indexer size is 300 MB.
> 5. We show only 100 top documents only on the basis of score.
> 6. We user StandardAnalyzer & StandardTokenizer for indexing &
> searching.
> 7. Lucene 2.4
> 8. JDK1 .6
>
> Please suggest me how can we improve the performance.
>
> Regards,
> Sourabh Mittal
> Morgan Stanley | IDEAS Practice Areas
> Manikchand Ikon | South Wing 18 | Dhole Patil Road
> Pune, 411001
> Phone: +91 20 2620-7053
> Sourabh-931.Mittal@morganstanley.com
>
>
>
> --------------------------------------------------------------------------
> NOTICE: If received in error, please destroy and notify sender. Sender does
> not intend to waive confidentiality or privilege. Use of this email is
> prohibited when received in error.
>

Re: Performance issue

Posted by Grant Ingersoll <gs...@apache.org>.

Can you give us more info on what they are searching for w/ 2 letter  
searches?  Typically, prefix queries that short are going to have a  
lot of terms to match.  You might try having a field that you index  
using a variation of ngrams that are anchored at the first character.   
For example, encountering the term Lucene, you would index "lu",  
"luc", "luce" all at the same position (this has limits, b/c you will  
need an upper bound cutoff too).  This way, you would get exact  
matches and you don't need to use prefix queries.  Then, you just need  
to recognize these searches and generate the appropriate queries.

Alternatively, you may consider requiring a minimum number of  
characters in a query, but this depends on your application.

-Grant

On Feb 2, 2009, at 2:58 AM, Mittal, Sourabh (IDEAS) wrote:

> Hi All,
>
> We face serious performance issues when users do 2 letter search e.g  
> ho,
> jo, pa ma, um ar, ma fi etc. time taken between 10 - 15 secs.
> Below is our implementation details:
>
> 1. Search performs on 7 fields.
> 2. PrefixQuery implementation on all fields
> 3. AND search.
> 4. Our indexer size is 300 MB.
> 5. We show only 100 top documents only on the basis of score.
> 6. We user StandardAnalyzer & StandardTokenizer for indexing &
> searching.
> 7. Lucene 2.4
> 8. JDK1 .6
>
> Please suggest me how can we improve the performance.
>
> Regards,
> Sourabh Mittal
> Morgan Stanley | IDEAS Practice Areas
> Manikchand Ikon | South Wing 18 | Dhole Patil Road
> Pune, 411001
> Phone: +91 20 2620-7053
> Sourabh-931.Mittal@morganstanley.com
>
>
>
> --------------------------------------------------------------------------
> NOTICE: If received in error, please destroy and notify sender.  
> Sender does not intend to waive confidentiality or privilege. Use of  
> this email is prohibited when received in error.

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org