Posted to java-user@lucene.apache.org by Volodymyr Bychkoviak <vb...@i-hypergrid.com> on 2005/03/14 12:54:38 UTC

WildCard search replacement

Hi all.

 

I have a large index of documents (about 1.6 million).

One field (called “number”, for example) contains a string of digits.

I need to run wildcard searches on this field of the form “*expression*” 
(i.e. find all documents that contain “expression” in this field).

 
When I run such a search with a very short expression (e.g. "*321"), I get 
an OutOfMemoryError or a TooManyClauses exception. (Which one occurs depends 
on the BooleanQuery.maxClauseCount setting.)
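
To illustrate why short expressions are the problem: as far as I understand, Lucene rewrites a wildcard query into a BooleanQuery with one clause per indexed term that matches the pattern, so the clause count grows as the expression gets shorter. The sketch below is a plain-Java model of that expansion, not Lucene code. The class name, the method names and the 5-digit test data are all hypothetical; 1024 is Lucene's default BooleanQuery.maxClauseCount.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java model of wildcard expansion; this is NOT Lucene code.
final class WildcardExpansionSketch {
    // Lucene's default value for BooleanQuery.maxClauseCount.
    static final int DEFAULT_MAX_CLAUSES = 1024;

    /** Counts the indexed terms that a leading-wildcard pattern "*suffix" would expand to. */
    static int countExpandedClauses(List<String> indexedTerms, String suffix) {
        int clauses = 0;
        for (String term : indexedTerms) {
            if (term.endsWith(suffix)) {
                clauses++;
            }
        }
        return clauses;
    }

    /** Hypothetical test data: every 5-digit string from "00000" to "99999". */
    static List<String> allFiveDigitTerms() {
        List<String> terms = new ArrayList<>(100_000);
        for (int i = 0; i < 100_000; i++) {
            terms.add(String.format("%05d", i));
        }
        return terms;
    }

    public static void main(String[] args) {
        List<String> terms = allFiveDigitTerms();
        for (String suffix : new String[] {"1", "321"}) {
            int clauses = countExpandedClauses(terms, suffix);
            System.out.println("*" + suffix + " -> " + clauses + " clauses, over limit: "
                    + (clauses > DEFAULT_MAX_CLAUSES));
        }
    }
}
```

The real clause count depends on how many distinct values the field holds, so an index with many long, distinct numbers can go over the limit even for a three-digit expression.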

 So I found the following workaround. I index this field as a sequence of 
terms, each containing a single digit of the value. (For example, the 
value “123214213” is indexed as the sequence of terms 
“1”, “2”, “3”, “2”, “1”, “4”, “2”, “1”, “3”.) This can be done with a 
custom Analyzer class.
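
The splitting step itself can be sketched in plain Java as below. This is only a model of what the custom Analyzer has to emit, not an actual Lucene Analyzer (the Analyzer and TokenStream plumbing depends on the Lucene version). The class and method names are hypothetical, and skipping non-digit characters is an assumption that the original description does not cover.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java model of the per-digit tokenization; this is NOT a Lucene Analyzer.
final class DigitTokenizerSketch {
    /**
     * Splits a value into one term per digit, in order.
     * Assumption: characters that are not digits are skipped.
     */
    static List<String> toDigitTerms(String value) {
        List<String> terms = new ArrayList<>();
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            if (Character.isDigit(c)) {
                terms.add(String.valueOf(c));
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        // Prints [1, 2, 3, 2, 1, 4, 2, 1, 3]
        System.out.println(toDigitTerms("123214213"));
    }
}
```

The order of the terms matters, because the phrase search relies on each digit keeping its position in the value.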

 
To emulate the wildcard query on this field, I search with a PhraseQuery 
made up of single-digit terms.

 For example, to find documents that contain “321” in the field named 
“number”, I create the following PhraseQuery:

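    // One term per digit, added in order. The field name must match the
    // indexed field exactly ("number", with no trailing space).
    PhraseQuery phraseQuery = new PhraseQuery();
    phraseQuery.add(new Term("number", "3"));
    phraseQuery.add(new Term("number", "2"));
    phraseQuery.add(new Term("number", "1"));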
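
Why this behaves like “*321*”: an exact phrase matches only when its terms occur at consecutive positions, which for single-digit terms is the same as a substring match on the original value. The sketch below models that in plain Java. It is not Lucene code, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java model of exact phrase matching over single-digit terms; NOT Lucene code.
final class PhraseMatchSketch {
    /** Splits a string of digits into one term per character. */
    static List<String> toTerms(String digits) {
        List<String> terms = new ArrayList<>();
        for (char c : digits.toCharArray()) {
            terms.add(String.valueOf(c));
        }
        return terms;
    }

    /** Returns true if the phrase terms occur at consecutive positions in docTerms. */
    static boolean matchesPhrase(List<String> docTerms, List<String> phrase) {
        if (phrase.isEmpty()) {
            return false;
        }
        for (int start = 0; start + phrase.size() <= docTerms.size(); start++) {
            if (docTerms.subList(start, start + phrase.size()).equals(phrase)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> doc = toTerms("123214213");
        System.out.println(matchesPhrase(doc, toTerms("321"))); // true: "321" starts at the third digit
        System.out.println(matchesPhrase(doc, toTerms("312"))); // false: the digits never occur in that order
    }
}
```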

 
This approach is faster when you need to search for a very short 
expression, and it never runs out of memory or throws a TooManyClauses 
exception.

 
I think this can be useful for anyone who needs similar functionality.

Any comments are appreciated.

 

Regards,

Volodymyr Bychkoviak

 

 

 


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: WildCard search replacement

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
That's a great technique - thanks for sharing it!

	Erik


