Posted to java-user@lucene.apache.org by renou oki <oc...@gmail.com> on 2008/06/25 10:21:43 UTC

Problem with search an exact word and stemming

Hello,

I have a stemmed index, but I want to search for the exact form of a word.
I use the French Analyzer, so for instance "progression" and "progresser" are
indexed with the linguistic root "progress".
But if I want to search for the word "progress" (and only this word), I get
too many hits (because of "progression", "progresser"...).
The field is indexed, tokenized and not stored...

Is there a way to do this, I mean to search for an exact word in a stemmed
index?
I suppose that I have to use the same analyzer for indexing and searching.


I tried a PhraseQuery, with quotes...

PS: I use Lucene 1.9.1

Thanks
Renald

Re: Problem with search an exact word and stemming

Posted by Matthew Hall <mh...@informatics.jax.org>.

Also, please note that I thought about it and realized that I misspoke
when I sent out my original suggestion.  You don't want an untokenized
field in your case; you want an unstemmed one instead.

This will allow you to get the functionality you are looking for... at
least I believe so ^^

Anyhow, best of luck!

Matt

-- 
Matthew Hall
Software Engineer
Mouse Genome Informatics
mhall@informatics.jax.org
(207) 288-6012



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Problem with search an exact word and stemming

Posted by renou oki <oc...@gmail.com>.

Thanks for the reply.

I will try adding another data field.
I had thought about this solution but was not very sure about it. I thought
there would be an easier way to do that...


best regards
Renou

Re: Problem with search an exact word and stemming

Posted by Matthew Hall <mh...@informatics.jax.org>.

You could also add another data field to the index, with an untokenized
version of your data, and then use a multifield query to go against both
the stemmed and exact match parts of your search at the same time.

This is a technique I've used quite often on my project with various
different requirements for the second field.  Mind you, it makes the
indexes bigger, but unless your dataset is large it's not really a huge
problem.

Matt
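
A minimal sketch of the two-field idea, for illustration only. It is a toy
in-memory index in plain Java, not Lucene code: the field names "body" and
"body_exact" and the two-suffix stem() method are hypothetical stand-ins.
The second field is kept unstemmed but still split into words, following the
correction Matt posted elsewhere in this thread.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy model of "one stemmed field plus one unstemmed field". NOT the Lucene API.
final class TwoFieldSketch {
    static final String STEMMED_FIELD = "body";       // hypothetical field name
    static final String EXACT_FIELD = "body_exact";   // hypothetical field name

    // field -> term -> ids of the documents containing that term
    private final Map<String, Map<String, Set<Integer>>> fields = new HashMap<>();

    // Hypothetical stemmer: strips two suffixes, just enough for the thread's example.
    static String stem(String word) {
        for (String suffix : new String[] {"ion", "er"}) {
            if (word.endsWith(suffix) && word.length() > suffix.length() + 2) {
                return word.substring(0, word.length() - suffix.length());
            }
        }
        return word;
    }

    private void addTerm(String field, String term, int docId) {
        fields.computeIfAbsent(field, f -> new HashMap<>())
              .computeIfAbsent(term, t -> new TreeSet<>())
              .add(docId);
    }

    // Index every word twice: stemmed in one field, as written in the other.
    void addDocument(int docId, String text) {
        for (String word : text.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue;
            }
            addTerm(STEMMED_FIELD, stem(word), docId);
            addTerm(EXACT_FIELD, word, docId);
        }
    }

    private Set<Integer> search(String field, String term) {
        Map<String, Set<Integer>> terms = fields.get(field);
        if (terms == null) {
            return Collections.<Integer>emptySet();
        }
        Set<Integer> hits = terms.get(term);
        return hits == null ? Collections.<Integer>emptySet() : hits;
    }

    // Stemmed search: the query word is stemmed the same way as the indexed text.
    Set<Integer> searchStemmed(String word) {
        return search(STEMMED_FIELD, stem(word.toLowerCase(Locale.ROOT)));
    }

    // Exact search: the query word is looked up as written, with no stemming.
    Set<Integer> searchExact(String word) {
        return search(EXACT_FIELD, word.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        TwoFieldSketch index = new TwoFieldSketch();
        index.addDocument(0, "la progression continue");
        index.addDocument(1, "il faut progresser");
        index.addDocument(2, "le progress report");
        System.out.println(index.searchStemmed("progress")); // prints [0, 1, 2]
        System.out.println(index.searchExact("progress"));   // prints [2]
    }
}
```

In a real Lucene index the same split would presumably be made by giving each
field its own analyzer (PerFieldAnalyzerWrapper is one candidate), which is an
assumption and not something the thread confirms for 1.9.1.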

Re: Problem with search an exact word and stemming

Posted by Erick Erickson <er...@gmail.com>.

The way I've solved this is to index the stemmed token *and* a special
token at the same position (see Synonym Analyzer). From your
example, say you're indexing progresser. You'd go ahead and index the
stemmed version, "progress", AND you'd also index "progresser$"
at the same offset. Now, when you want exact matches, search for
the token with the $ at the end.

This does make your index a bit larger, but not as much as you'd expect.

Best
Erick
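
A rough sketch of the marker-token idea above, again as a toy in-memory index
in plain Java and not as Lucene code. The stem() method is a hypothetical
two-suffix stand-in for a real French stemmer, and the class and method names
are made up for illustration. Each word position carries two terms: the
stemmed form and the original form with "$" appended.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy model of "stemmed token + marked exact token at the same position".
// NOT the Lucene API.
final class ExactMarkerSketch {
    static final String EXACT_MARKER = "$";

    // term -> ids of the documents containing that term
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // Hypothetical stemmer: strips two suffixes, just enough for the thread's example.
    static String stem(String word) {
        for (String suffix : new String[] {"ion", "er"}) {
            if (word.endsWith(suffix) && word.length() > suffix.length() + 2) {
                return word.substring(0, word.length() - suffix.length());
            }
        }
        return word;
    }

    // Returns one list per word position; each list holds the terms emitted
    // at that position: the stem, then the original word plus the marker.
    static List<List<String>> analyze(String text) {
        List<List<String>> positions = new ArrayList<>();
        for (String word : text.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue;
            }
            List<String> termsHere = new ArrayList<>();
            termsHere.add(stem(word));
            termsHere.add(word + EXACT_MARKER);
            positions.add(termsHere);
        }
        return positions;
    }

    void addDocument(int docId, String text) {
        for (List<String> termsHere : analyze(text)) {
            for (String term : termsHere) {
                postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
            }
        }
    }

    // Normal search: stem the query word like the indexed text.
    Set<Integer> searchStemmed(String word) {
        String term = stem(word.toLowerCase(Locale.ROOT));
        return postings.getOrDefault(term, Collections.<Integer>emptySet());
    }

    // Exact search: skip stemming and look up the marked token instead.
    Set<Integer> searchExact(String word) {
        String term = word.toLowerCase(Locale.ROOT) + EXACT_MARKER;
        return postings.getOrDefault(term, Collections.<Integer>emptySet());
    }

    public static void main(String[] args) {
        ExactMarkerSketch index = new ExactMarkerSketch();
        index.addDocument(0, "la progression continue");
        index.addDocument(1, "il faut progresser");
        index.addDocument(2, "le progress report");
        System.out.println(index.searchStemmed("progresser")); // prints [0, 1, 2]
        System.out.println(index.searchExact("progress"));     // prints [2]
    }
}
```

Two things the sketch glosses over: in Lucene the second token would likely
be emitted from a custom token filter with a position increment of 0, and the
query side has to make sure the "$" token is not stemmed or stripped by the
analyzer. Both points are assumptions about how this would map onto 1.9.1.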
