Posted to java-user@lucene.apache.org by Konrad Kolosowski <ko...@ca.ibm.com> on 2003/06/12 03:26:03 UTC

OutOfMemoryErrors searching with WildCardQueries

I need to proof an online system against OutOfMemoryErrors, which sometimes
crash it.  The system allows boolean searches with wildcards.

It is not recommended to use WildcardQuery with a wildcard in the first
position.  A wildcard in the first position works for an index with a small
number of documents but results in errors for a larger index (containing 3k
docs of 1-2 pages each).  If one types a query with many wildcards close to
the beginning of terms, e.g.  a* OR b* OR ... OR z*, isn't it going to lead
to the same problem?

Suppose I impose a requirement that none of the first 3 letters of a word in
a query, rather than just the first one, can be a wildcard.  Will that
provide additional safety and reduce memory consumption during search?  Even
if it does, I think it probably would not help when the index contains a
large number of terms with a common prefix anyway.
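
For illustration, the 3-letter rule could be checked before a term ever
reaches Lucene.  This is only a minimal sketch: the class and method names
are hypothetical (not part of Lucene), and it assumes '*' and '?' are the
only wildcard characters and that no escaping is in use.  It restricts where
a wildcard may appear; it does not bound how many index terms share the
allowed prefix.

```java
/** Hypothetical pre-check, not part of Lucene: rejects terms whose literal prefix is too short. */
final class WildcardPrefixGuard {

    /** Number of leading characters before the first '*' or '?'. */
    static int literalPrefixLength(String term) {
        for (int i = 0; i < term.length(); i++) {
            char c = term.charAt(i);
            if (c == '*' || c == '?') {
                return i;
            }
        }
        return term.length(); // no wildcard at all
    }

    /** True if the term has no wildcard, or has at least minPrefix literal characters before the first one. */
    static boolean isAllowed(String term, int minPrefix) {
        int prefix = literalPrefixLength(term);
        return prefix == term.length() || prefix >= minPrefix;
    }

    public static void main(String[] args) {
        String[] terms = {"*ucene", "a*", "luc*", "lucene", "lu?ene"};
        for (String term : terms) {
            System.out.println(term + " -> " + (isAllowed(term, 3) ? "allowed" : "rejected"));
        }
    }
}
```

With a minimum prefix of 3, only "luc*" and the wildcard-free "lucene" pass.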

If the index grows to a hundred thousand documents, with users simultaneously
searching indexes for different locales, what is the best way to cap the
memory requirement?  Limiting the number of terms, or the number of terms
containing wildcards, or eliminating wildcard searches altogether?

Thanks for any explanation or pointers.

Konrad Kolosowski


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Re: OutOfMemoryErrors searching with WildCardQueries

Posted by Konrad Kolosowski <ko...@ca.ibm.com>.
After Dave Kor put me on track, I thought I would need to dive into hacking
Lucene on my own, but having the fix already in the repository is great.
Thank you, Doug.
I assume the fix will be picked up by the 1.3 release.  Is there an expected
time frame for the 1.3 Final build?
Thanks.

Konrad Kolosowski



                                                                                                                                       
From:     Doug Cutting <cutting@lucene.com>
To:       Lucene Users List <lu...@jakarta.apache.org>
Subject:  Re: OutOfMemoryErrors searching with WildCardQueries
Date:     06/12/2003 02:28 PM
Please respond to "Lucene Users List"
                                                                                                                                       



Konrad Kolosowski wrote:
> If the index grows to a hundred thousand documents, with users simultaneously
> searching indexes for different locales, what is the best way to cap the
> memory requirement?  Limiting the number of terms, or the number of terms
> containing wildcards, or eliminating wildcard searches altogether?

This was discussed recently on lucene-dev@jakarta.apache.org in a thread
whose subject contains "too many hits - OutOfMemoryError".

I checked in a patch which limits the number of terms that a wildcard is
permitted to expand into.  The default is 1000.  If a term expands to
more than that then an exception is thrown.  Each term that a wildcard
expands into requires around 2kB.  So this limits each wildcarded query
term to 2MB.  If you have queries with large numbers of wildcarded terms
then you might consider also limiting that.

This patch is in the latest version of Lucene in CVS, but not yet in a
release.

Doug




Re: OutOfMemoryErrors searching with WildCardQueries

Posted by Doug Cutting <cu...@lucene.com>.
Konrad Kolosowski wrote:
> If the index grows to a hundred thousand documents, with users simultaneously
> searching indexes for different locales, what is the best way to cap the
> memory requirement?  Limiting the number of terms, or the number of terms
> containing wildcards, or eliminating wildcard searches altogether?

This was discussed recently on lucene-dev@jakarta.apache.org in a thread 
whose subject contains "too many hits - OutOfMemoryError".

I checked in a patch which limits the number of terms that a wildcard is 
permitted to expand into.  The default is 1000.  If a term expands to 
more than that then an exception is thrown.  Each term that a wildcard 
expands into requires around 2kB.  So this limits each wildcarded query 
term to 2MB.  If you have queries with large numbers of wildcarded terms 
then you might consider also limiting that.
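
As a rough worked example of the arithmetic above, the worst case for a
whole query is (wildcard terms) x (expansion limit) x (about 2kB).  The
sketch below is illustrative only: the class and method names are
hypothetical, the whitespace tokenization is a simplifying assumption and
not Lucene's query parser, and 2kB per term is the approximate figure quoted
above, not a measurement.

```java
/** Back-of-the-envelope estimate using the figures quoted above; all names are hypothetical. */
final class WildcardMemoryEstimate {
    static final int MAX_EXPANSION = 1000; // default expansion limit quoted in the post
    static final int KB_PER_TERM = 2;      // "around 2kB" per expanded term (approximate)

    /** Counts whitespace-separated tokens that contain '*' or '?'. */
    static int countWildcardTerms(String query) {
        int count = 0;
        for (String token : query.trim().split("\\s+")) {
            if (token.indexOf('*') >= 0 || token.indexOf('?') >= 0) {
                count++;
            }
        }
        return count;
    }

    /** Worst-case kB if every wildcard term expands all the way to the limit. */
    static long worstCaseKb(int wildcardTerms) {
        return (long) wildcardTerms * MAX_EXPANSION * KB_PER_TERM;
    }

    public static void main(String[] args) {
        String query = "a* OR b* OR c*";
        int n = countWildcardTerms(query);
        System.out.println(n + " wildcard terms, worst case about " + worstCaseKb(n) + " kB");
    }
}
```

By the same estimate, a query of the form a* OR b* OR ... OR z* has 26
wildcard terms and a worst case of roughly 52,000 kB (about 52MB), which is
why capping the number of wildcarded terms per query may also be worthwhile.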

This patch is in the latest version of Lucene in CVS, but not yet in a 
release.

Doug




RE: OutOfMemoryErrors searching with WildCardQueries

Posted by Dave Kor <da...@nexusedge.com>.
OOM occurs only if a prefix/wildcard/fuzzy query matches a large percentage
of the terms in the index, because these queries work by term expansion.
Limiting expansion to the first x matching terms would solve this problem,
assuming the tradeoff in accuracy is acceptable.
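
A toy model of that idea, for illustration only: it expands a prefix against
a sorted in-memory set of terms and stops after x matches.  This is not
Lucene code; the class name, method name and sample terms are hypothetical,
and a real index enumerates terms differently.  Note that it truncates
silently rather than throwing an exception, which is where the accuracy
tradeoff comes from.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

/** Toy model of capped prefix expansion over a sorted term dictionary; not Lucene code. */
final class CappedExpansion {

    /** Returns at most maxTerms terms that start with prefix, in dictionary order. */
    static List<String> expandPrefix(SortedSet<String> terms, String prefix, int maxTerms) {
        List<String> matches = new ArrayList<>();
        // tailSet starts at the first term >= prefix; matching terms are contiguous from there.
        for (String term : terms.tailSet(prefix)) {
            if (!term.startsWith(prefix) || matches.size() >= maxTerms) {
                break;
            }
            matches.add(term);
        }
        return matches;
    }

    public static void main(String[] args) {
        SortedSet<String> terms =
                new TreeSet<>(Arrays.asList("index", "lucene", "lucid", "luck", "query"));
        // With a cap of 2, the third match ("luck") is dropped without any error.
        System.out.println(expandPrefix(terms, "luc", 2));
    }
}
```

Documents containing only the dropped terms would never be returned, so a
caller may prefer to detect the truncation and report it to the user.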


Dave Kor Kian Wei
Consultant
Product Engineering
NexusEdge Technologies Pte. Ltd.
6 Aljunied Ave 3, #01-02 (Level 4)
Singapore 389932
Tel : (+65)848-2552
Fax : (+65)747-4536
Web : www.nexusedge.com

> -----Original Message-----
> From: Konrad Kolosowski [mailto:konradk@ca.ibm.com]
> Sent: Thursday, June 12, 2003 9:26 AM
> To: Lucene Users List
> Subject: OutOfMemoryErrors searching with WildCardQueries
>
>
> I need to proof an online system against OutOfMemoryErrors, which sometimes
> crash it.  The system allows boolean searches with wildcards.
>
> It is not recommended to use WildcardQuery with a wildcard in the first
> position.  A wildcard in the first position works for an index with a small
> number of documents but results in errors for a larger index (containing 3k
> docs of 1-2 pages each).  If one types a query with many wildcards close to
> the beginning of terms, e.g.  a* OR b* OR ... OR z*, isn't it going to lead
> to the same problem?
>
> Suppose I impose a requirement that none of the first 3 letters of a word in
> a query, rather than just the first one, can be a wildcard.  Will that
> provide additional safety and reduce memory consumption during search?  Even
> if it does, I think it probably would not help when the index contains a
> large number of terms with a common prefix anyway.
>
> If the index grows to a hundred thousand documents, with users simultaneously
> searching indexes for different locales, what is the best way to cap the
> memory requirement?  Limiting the number of terms, or the number of terms
> containing wildcards, or eliminating wildcard searches altogether?
>
> Thanks for any explanation or pointers.
>
> Konrad Kolosowski

