
Posted to java-user@lucene.apache.org by "heritrix.lucene" <he...@gmail.com> on 2006/09/23 11:12:51 UTC

searching for the part of a term.

Hi All,

How can I set up my search so that when I look for the term "counting", the
documents containing "accounting" also come up?

Similarly, if I am looking for the term "workload", documents containing "work"
should also come up as search results.

A wildcard query seems to work in the first case, but if the index is very
large, it throws BooleanQuery.TooManyClauses.
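
For context, a wildcard query is rewritten into one boolean clause per index term that matches the pattern, and the rewrite fails once a clause limit is exceeded (1024 by default in Lucene versions of this era, as far as I know). The plain-Java sketch below only simulates that behaviour on an in-memory term list. `WildcardExpansionDemo`, `expand`, and `MAX_CLAUSES` are illustrative names, not Lucene API.

```java
import java.util.ArrayList;
import java.util.List;

/** Simulates how a "*counting*"-style wildcard expands into one clause per matching term. */
final class WildcardExpansionDemo {
    /** Assumed limit, mirroring what I believe is Lucene's default maximum clause count. */
    static final int MAX_CLAUSES = 1024;

    /** Returns every term containing `part`, or throws once the clause limit is exceeded. */
    static List<String> expand(Iterable<String> indexTerms, String part) {
        List<String> clauses = new ArrayList<>();
        // A leading wildcard gives no usable prefix, so every term has to be inspected.
        for (String term : indexTerms) {
            if (term.contains(part)) {
                if (clauses.size() == MAX_CLAUSES) {
                    throw new IllegalStateException("too many clauses: more than " + MAX_CLAUSES);
                }
                clauses.add(term);
            }
        }
        return clauses;
    }

    public static void main(String[] args) {
        System.out.println(expand(List.of("accounting", "counting", "work"), "counting"));
    }
}
```

On a large index, a short pattern such as "am" matches far more than 1024 distinct terms, which is the situation in which the exception shows up.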

Is there a way to resolve this issue, apart from indexing n-grams of each
term?


Regards,

Re: searching for the part of a term.

Posted by "heritrix.lucene" <he...@gmail.com>.

Hi,
Thanks for your reply.


> : Since the drawback of the first is the speed of the system, I think adopting
> : the second method will be better.


Since my index size is around 10GB, the second method is also taking a lot
of time for queries like "am".

One more thing that I found in

http://www.gossamer-threads.com/lists/lucene/java-user/13345?search_string=Starts%20With%20x%20and%20Ends%20With%20x%20Queries;#13345

was to index the rotated tokens of a word, and then search with a prefix query.
But I think I'll face the speed issue here as well, because of the prefix
query (if I am right).
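
To make the rotated-token idea concrete, here is a minimal plain-Java sketch. It assumes an end marker `$` that never occurs in a word and uses a sorted in-memory set in place of a real Lucene index. `RotatedTokens` and its methods are hypothetical names for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

/** Sketch: index all rotations of "word$", then find substrings with a prefix range scan. */
final class RotatedTokens {
    /** Assumed end-of-word marker; words must not contain it. */
    static final char END = '$';

    /** All rotations of word + END, e.g. work -> work$, ork$w, rk$wo, k$wor, $work. */
    static List<String> rotations(String word) {
        String marked = word + END;
        List<String> out = new ArrayList<>();
        for (int i = 0; i < marked.length(); i++) {
            out.add(marked.substring(i) + marked.substring(0, i));
        }
        return out;
    }

    /** Recovers the original word from one of its rotations. */
    static String originalOf(String rotation) {
        int i = rotation.indexOf(END);
        return rotation.substring(i + 1) + rotation.substring(0, i);
    }

    /** Words containing `part`: any rotation that starts with `part` is a hit. */
    static SortedSet<String> search(TreeSet<String> rotatedIndex, String part) {
        SortedSet<String> hits = new TreeSet<>();
        // Prefix range: all keys from `part` up to `part` followed by the largest char.
        for (String rotation : rotatedIndex.subSet(part, part + Character.MAX_VALUE)) {
            hits.add(originalOf(rotation));
        }
        return hits;
    }

    public static void main(String[] args) {
        TreeSet<String> index = new TreeSet<>();
        for (String w : List.of("accounting", "counting", "work")) {
            index.addAll(rotations(w));
        }
        System.out.println(search(index, "count"));
    }
}
```

The lookup itself is a range scan over sorted keys, but a short prefix still matches many keys, so the cost grows with the number of matching rotations.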


One more thing that we can do is to prepare the n-grams of a word and then
index each of them.
This way the index size will increase by several orders of magnitude, which is
the price paid for speed.

Is this the way to implement the fastest substring search?
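
A rough sketch of the n-gram approach in plain Java follows, using an in-memory map instead of a Lucene index. `NGramIndex` and its methods are made-up names. The sketch assumes a fixed gram size n and verifies candidates at the end, because intersecting gram postings alone can return false positives.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.SortedSet;
import java.util.TreeSet;

/** Sketch: index the character n-grams of each term and intersect postings at query time. */
final class NGramIndex {
    private final int n;
    private final Map<String, Set<String>> postings = new HashMap<>();

    NGramIndex(int n) {
        if (n < 1) throw new IllegalArgumentException("n must be positive");
        this.n = n;
    }

    /** All contiguous substrings of length n, in order of occurrence. */
    static List<String> grams(String s, int n) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + n <= s.length(); i++) {
            out.add(s.substring(i, i + n));
        }
        return out;
    }

    /** Adds `term` to the posting set of each of its n-grams. */
    void add(String term) {
        for (String gram : grams(term, n)) {
            postings.computeIfAbsent(gram, k -> new TreeSet<>()).add(term);
        }
    }

    /** Terms containing `part`; the query needs at least n characters. */
    SortedSet<String> search(String part) {
        List<String> queryGrams = grams(part, n);
        if (queryGrams.isEmpty()) {
            throw new IllegalArgumentException("query must have at least " + n + " characters");
        }
        SortedSet<String> candidates = null;
        for (String gram : queryGrams) {
            Set<String> terms = postings.getOrDefault(gram, Set.of());
            if (candidates == null) candidates = new TreeSet<>(terms);
            else candidates.retainAll(terms);
        }
        // Sharing every gram does not guarantee they are adjacent, so verify each candidate.
        candidates.removeIf(term -> !term.contains(part));
        return candidates;
    }

    public static void main(String[] args) {
        NGramIndex index = new NGramIndex(3);
        for (String t : List.of("accounting", "counting", "workload")) index.add(t);
        System.out.println(index.search("counting"));
    }
}
```

Each lookup is an exact match on a gram, so no term enumeration is needed. The trade-offs are the extra index entries and the fact that queries shorter than n (such as "am" with n = 3) cannot be answered by this sketch.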


Regards....





Re: searching for the part of a term.

Posted by Chris Hostetter <ho...@fucit.org>.

: Since the drawback of the first is the speed of the system, I think adopting
: the second method will be better.
:
: Is there any other solution for this problem? Am I going in the right
: direction?

you're definitely on the right path -- those are the two big solutions i
can think of. which approach you should take really depends on the nature
of your data, what your performance concerns are, and how much development
time you have.

Here's another good thread you may want to check out...

http://www.nabble.com/I-just-don%27t-get-wildcards-at-all.-tf1412243.html#a3804223


-Hoss


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: searching for the part of a term.

Posted by "heritrix.lucene" <he...@gmail.com>.

Hi,
While searching the forum for my substring search problem, I found a
few very good links.

http://www.gossamer-threads.com/lists/lucene/java-user/39753?search_string=Bitset%20filter;#39753
http://www.gossamer-threads.com/lists/lucene/java-user/7813?search_string=substring;#7813
http://www.gossamer-threads.com/lists/lucene/java-user/5931?search_string=substring;#5931

In the first, WildcardTermEnum is used.
>>I tried this, but searching takes a lot of time.

The other solution I found was to create a TokenStream that splits a token
into multiple tokens, and then index those tokens. For example, google becomes
google, oogle, ogle, ...
Then, at search time, build a prefix query and search with it.
>>But this seems to create a lot of tokens from one token, making the index
several times bigger than if we indexed a single token.
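
As a sketch of this second method in plain Java (not an actual Lucene TokenStream), the code below generates the suffixes of a token and answers a substring query with a prefix range over the sorted suffixes. `SuffixTokens`, `suffixes`, `add`, `search`, and the `minLength` cut-off are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

/** Sketch: index every suffix of a term, then answer substring queries by prefix lookup. */
final class SuffixTokens {
    /** Suffixes of `token` that are at least `minLength` characters long, longest first. */
    static List<String> suffixes(String token, int minLength) {
        List<String> out = new ArrayList<>();
        for (int start = 0; start + minLength <= token.length(); start++) {
            out.add(token.substring(start));
        }
        return out;
    }

    /** Records each suffix of `term` as a key pointing back to the full term. */
    static void add(TreeMap<String, SortedSet<String>> index, String term, int minLength) {
        for (String suffix : suffixes(term, minLength)) {
            index.computeIfAbsent(suffix, k -> new TreeSet<>()).add(term);
        }
    }

    /** Terms containing `part`: every indexed suffix that starts with `part` is a hit. */
    static SortedSet<String> search(TreeMap<String, SortedSet<String>> index, String part) {
        SortedSet<String> hits = new TreeSet<>();
        // Prefix range: all keys from `part` up to `part` followed by the largest char.
        for (SortedSet<String> terms : index.subMap(part, part + Character.MAX_VALUE).values()) {
            hits.addAll(terms);
        }
        return hits;
    }

    public static void main(String[] args) {
        TreeMap<String, SortedSet<String>> index = new TreeMap<>();
        add(index, "google", 1);
        add(index, "accounting", 1);
        System.out.println(suffixes("google", 3));
        System.out.println(search(index, "count"));
    }
}
```

A term of length k produces up to k suffix tokens, which is where the index growth comes from. Raising `minLength` trims the shortest suffixes at the cost of not matching query parts shorter than that.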

Since the drawback of the first is the speed of the system, I think adopting
the second method will be better.

Is there any other solution for this problem? Am I going in the right
direction?

It'll be great to see your response...

Regards,








