Posted to java-user@lucene.apache.org by Volodymyr Bychkoviak <vb...@i-hypergrid.com> on 2005/03/14 12:54:38 UTC
WildCard search replacement
Hi all.
I have a large index of documents (about 1.6 million).
One field (called “number”, for example) contains a string of digits.
I need to do wildcard searches on this field such as “*expression*” (i.e.
all documents that contain “expression” in this field).
When I run such a search with a very short expression (e.g. "*321"), I get
an OutOfMemoryError or a TooManyClauses exception (which one depends on the
BooleanQuery.maxClauseCount setting).
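To illustrate why a short expression is the worst case: a wildcard query is expanded into one clause per matching term, and a leading wildcard forces a scan of every unique term in the field. The sketch below is plain Java with no Lucene dependency; the class name, the method name and the generated data are hypothetical stand-ins for the real term dictionary, not Lucene's actual rewrite code.

```java
import java.util.SortedSet;
import java.util.TreeSet;

// Illustration only: a sorted set of strings stands in for the field's term dictionary.
class WildcardExpansionSketch {

    // Counts the unique terms that contain `infix`. A "*infix*" query needs
    // roughly one BooleanQuery clause for each of them.
    static int countExpandedClauses(SortedSet<String> termDictionary, String infix) {
        int clauses = 0;
        // Because of the leading wildcard, no prefix can narrow the scan:
        // every term has to be visited.
        for (String term : termDictionary) {
            if (term.contains(infix)) {
                clauses++;
            }
        }
        return clauses;
    }

    public static void main(String[] args) {
        // Hypothetical data: 100,000 distinct six-digit values.
        SortedSet<String> terms = new TreeSet<>();
        for (int i = 0; i < 100_000; i++) {
            terms.add(String.format("%06d", i));
        }
        System.out.println("clauses for *321*: " + countExpandedClauses(terms, "321"));
        System.out.println("clauses for *3*:   " + countExpandedClauses(terms, "3"));
    }
}
```

The shorter the expression, the more terms match, so the clause count (and the memory needed to hold the clauses) grows until it hits BooleanQuery.maxClauseCount or exhausts the heap.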
So I found the following workaround. I index this field as a sequence of
terms, each containing a single digit of the value. (For example, if the
value “123214213” needs to be indexed, it will be indexed as the sequence
of terms “1”, “2”, “3”, “2”, “1”, “4”, “2”, “1”, “3”.) This can be
done with a custom Analyzer class.
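The splitting step itself can be sketched in plain Java. This is not an actual Lucene Analyzer (the Analyzer and TokenStream API differs between Lucene versions); it only shows the token sequence such an analyzer would need to emit. The class and method names are hypothetical, and the sketch assumes the field value contains nothing but digits.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: shows the tokens a custom analyzer would produce.
class DigitTokenizerSketch {

    // Emits one token per character, in order, so that token position i
    // corresponds to character offset i in the original value.
    static List<String> toDigitTokens(String value) {
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i < value.length(); i++) {
            tokens.add(String.valueOf(value.charAt(i)));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [1, 2, 3, 2, 1, 4, 2, 1, 3]
        System.out.println(toDigitTokens("123214213"));
    }
}
```

Keeping one token per character matters: if some characters were dropped, neighbouring tokens would become adjacent and a phrase search could match digits that were not adjacent in the original value.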
To run a “wildcard” search on this field, I use a PhraseQuery that
contains single-digit terms.
For example, to find documents that contain “321” in the field named
“number”, I create the following PhraseQuery:
PhraseQuery phraseQuery = new PhraseQuery();
phraseQuery.add(new Term("number","3"));
phraseQuery.add(new Term("number","2"));
phraseQuery.add(new Term("number","1"));
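Why this behaves like a substring search can be shown with a small positional index in plain Java. The sketch below is not Lucene code; the class, its methods and the sample documents are hypothetical. It assumes an exact phrase match (slop 0, the PhraseQuery default), where each term of the phrase must sit exactly one position after the previous one.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustration only: a toy positional index over single-character tokens.
class DigitPhraseSketch {
    // token -> (docId -> positions at which the token occurs in that document)
    private final Map<String, Map<Integer, List<Integer>>> postings = new HashMap<>();

    void addDocument(int docId, String value) {
        for (int pos = 0; pos < value.length(); pos++) {
            String token = String.valueOf(value.charAt(pos));
            postings.computeIfAbsent(token, t -> new HashMap<>())
                    .computeIfAbsent(docId, d -> new ArrayList<>())
                    .add(pos);
        }
    }

    // Returns the sorted ids of documents in which the tokens of `expression`
    // occur at consecutive positions.
    List<Integer> searchPhrase(String expression) {
        List<Integer> hits = new ArrayList<>();
        if (expression.isEmpty()) {
            return hits;
        }
        Map<Integer, List<Integer>> first = postings.getOrDefault(
                expression.substring(0, 1), Collections.emptyMap());
        for (Map.Entry<Integer, List<Integer>> entry : first.entrySet()) {
            int docId = entry.getKey();
            for (int start : entry.getValue()) {
                if (matchesAt(docId, start, expression)) {
                    hits.add(docId);
                    break; // one match per document is enough
                }
            }
        }
        Collections.sort(hits);
        return hits;
    }

    // Checks that token i of the expression occurs at position start + i.
    private boolean matchesAt(int docId, int start, String expression) {
        for (int i = 1; i < expression.length(); i++) {
            String token = String.valueOf(expression.charAt(i));
            Map<Integer, List<Integer>> byDoc =
                    postings.getOrDefault(token, Collections.emptyMap());
            List<Integer> positions = byDoc.getOrDefault(docId, Collections.emptyList());
            if (!positions.contains(start + i)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        DigitPhraseSketch index = new DigitPhraseSketch();
        index.addDocument(1, "123214213");
        index.addDocument(2, "555123");
        index.addDocument(3, "993210");
        // prints [1, 3]
        System.out.println(index.searchPhrase("321"));
    }
}
```

The search only ever touches the postings of the few digits in the expression, so its cost does not depend on how many distinct values the field holds. The trade-off is that at most ten terms carry all the postings, so each posting list is long.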
This approach is faster when you need to search for a very
short expression, and it never runs out of memory or throws a
TooManyClauses exception.
I think this can be useful for anyone who needs similar functionality.
Any comments are appreciated.
Regards,
Volodymyr Bychkoviak
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
Re: WildCard search replacement
Posted by Erik Hatcher <er...@ehatchersolutions.com>.
That's a great technique - thanks for sharing it!
Erik