Posted to java-user@lucene.apache.org by Sumukh <su...@gmail.com> on 2009/03/02 15:13:58 UTC

Indexing synonyms for multiple words

Hi,

I'm fairly new to Lucene. I'd like to know how we can index synonyms for
multiple words.

This is the scenario:

Consider a sentence: AAA BBB WORD1 WORD2 EEE FFF GGG.

Now assume the two words combined WORD1 WORD2 can be replaced by another
word SYN.

If I place SYN after WORD1 with positionIncrement set to 0, WORD2 will
follow SYN, which is incorrect; and the other way round if I place it
after WORD2.

If any of you have solved a similar problem, I'd be thankful if you could
shed some light on the solution.

Regards,
Sumukh

Re: Indexing synonyms for multiple words

Posted by Erick Erickson <er...@gmail.com>.
This has been discussed on the user list before, so searching the
archives there might get you an answer quicker.

See: http://wiki.apache.org/lucene-java/MailingListArchives

I don't remember the results, but...

Best
Erick


Re: Indexing synonyms for multiple words

Posted by Michael McCandless <lu...@mikemccandless.com>.
Actually, the start position of each token is stored in the "normal"
Lucene index (in the *.prx files), not using payloads.

Payloads are entirely for per-token extensibility (i.e., core Lucene
doesn't use them by default): you'd have to create your own analyzer
to attach payloads to tokens, and then do something with them at
search time.

So I suggested you could store the end position of each token into the
Payload, but then you'd need to implement a Query class to use this
during searching.
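
A minimal sketch of that idea in plain Java, not using the Lucene API: each posting carries a start position plus a payload holding the token's end position, and a phrase check requires each term to start right after the previous term's end. The class `EndPositionPayloadSketch`, the `Posting` type, the 4-byte big-endian encoding, and the convention that the end position is inclusive are all assumptions made for illustration. In a real index the payload would be attached by a custom analyzer and read by a custom Query.

```java
import java.nio.ByteBuffer;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class EndPositionPayloadSketch {
    /** Hypothetical posting: start position plus a payload carrying the end position. */
    static final class Posting {
        final int start;
        final byte[] payload;
        Posting(int start, int end) {
            this.start = start;
            this.payload = encodeEnd(end);
        }
    }

    // Assumed encoding: the inclusive end position as a 4-byte big-endian int.
    static byte[] encodeEnd(int end) { return ByteBuffer.allocate(4).putInt(end).array(); }
    static int decodeEnd(byte[] payload) { return ByteBuffer.wrap(payload).getInt(); }

    /** True if the terms occur in order, each starting right after the previous one ends. */
    static boolean matchesPhrase(Map<String, List<Posting>> index, String... phrase) {
        if (phrase.length == 0) return false;
        for (Posting first : index.getOrDefault(phrase[0], List.of())) {
            if (matchesFrom(index, phrase, 1, decodeEnd(first.payload))) return true;
        }
        return false;
    }

    private static boolean matchesFrom(Map<String, List<Posting>> index,
                                       String[] phrase, int i, int prevEnd) {
        if (i == phrase.length) return true;
        for (Posting p : index.getOrDefault(phrase[i], List.of())) {
            if (p.start == prevEnd + 1
                    && matchesFrom(index, phrase, i + 1, decodeEnd(p.payload))) {
                return true;
            }
        }
        return false;
    }

    /** AAA BBB WORD1 WORD2 EEE FFF GGG, with SYN spanning positions 2..3. */
    static Map<String, List<Posting>> example() {
        Map<String, List<Posting>> index = new HashMap<>();
        String[] words = {"AAA", "BBB", "WORD1", "WORD2", "EEE", "FFF", "GGG"};
        for (int pos = 0; pos < words.length; pos++) {
            index.put(words[pos], List.of(new Posting(pos, pos)));
        }
        index.put("SYN", List.of(new Posting(2, 3)));
        return index;
    }

    public static void main(String[] args) {
        Map<String, List<Posting>> index = example();
        System.out.println(matchesPhrase(index, "SYN", "WORD2"));      // false
        System.out.println(matchesPhrase(index, "BBB", "SYN", "EEE")); // true
    }
}
```

With the end position available, "SYN WORD2" no longer matches, while "BBB SYN EEE" does.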

Mike


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Indexing synonyms for multiple words

Posted by Sumukh <su...@gmail.com>.
Thanks for your suggestion Michael and thanks to Uwe for clarifying.

Payload is currently used to store only the start positions. 
What I gathered from your suggestion is that we could possibly 
store the end position, or span, or some other complex 
encoding in order to store the extra information.
Am I right?

--Sumukh




Re: Indexing synonyms for multiple words

Posted by Michael McCandless <lu...@mikemccandless.com>.
Since Lucene doesn't store an end position for a token, I don't
think the index can properly represent SYN spanning two positions.

I suppose you could encode this into payloads, and create a custom  
query that would look at the payload to enforce the constraint.

Or, if you switch to doing SYN expansion only at runtime (not adding  
it to the index), that might work.
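
One way to picture the runtime option, as a sketch in plain Java rather than Lucene code: the index keeps only the original tokens, and a phrase typed by the user is expanded into alternative phrases before searching. The class `QueryTimeExpansionSketch`, the `expand` method, and the synonym map layout are hypothetical names chosen for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class QueryTimeExpansionSketch {
    /**
     * Returns every variant of the query phrase in which a synonym is either
     * kept as typed or replaced by the word sequence it stands for.
     */
    static List<List<String>> expand(List<String> query, Map<String, List<String>> synonyms) {
        List<List<String>> variants = new ArrayList<>();
        variants.add(new ArrayList<>());
        for (String term : query) {
            List<String> expansion = synonyms.get(term);
            List<List<String>> next = new ArrayList<>();
            for (List<String> prefix : variants) {
                // Variant that keeps the term exactly as the user typed it.
                List<String> kept = new ArrayList<>(prefix);
                kept.add(term);
                next.add(kept);
                // Variant that swaps the term for its multi-word expansion.
                if (expansion != null) {
                    List<String> swapped = new ArrayList<>(prefix);
                    swapped.addAll(expansion);
                    next.add(swapped);
                }
            }
            variants = next;
        }
        return variants;
    }

    public static void main(String[] args) {
        Map<String, List<String>> synonyms = Map.of("SYN", List.of("WORD1", "WORD2"));
        // Prints [[BBB, SYN, EEE], [BBB, WORD1, WORD2, EEE]]
        System.out.println(expand(List.of("BBB", "SYN", "EEE"), synonyms));
    }
}
```

Each variant would then be run as an ordinary phrase query and the results combined, so the index never needs a token that spans two positions.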

Mike



RE: Indexing synonyms for multiple words

Posted by Uwe Schindler <uw...@thetaphi.de>.
I think his problem is that "SYN" is a synonym for the phrase "WORD1
WORD2". Using these positions, a phrase like "SYN WORD2" would also match
(or cause other problems in queries that depend on the order of words).
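
To make this concrete, here is a small sketch in plain Java (not Lucene code) of exact phrase matching over start positions only, with SYN stored at WORD1's position. The class `PhraseSketch` and its methods are hypothetical, and the matching rule (each term exactly one position after the previous) is a simplification of what a phrase query does.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

final class PhraseSketch {
    /** True if the terms occur at consecutive positions, judged by start position alone. */
    static boolean matchesPhrase(Map<String, Set<Integer>> postings, String... phrase) {
        if (phrase.length == 0) return false;
        for (int start : postings.getOrDefault(phrase[0], Set.of())) {
            boolean ok = true;
            for (int i = 1; i < phrase.length && ok; i++) {
                ok = postings.getOrDefault(phrase[i], Set.of()).contains(start + i);
            }
            if (ok) return true;
        }
        return false;
    }

    /** AAA BBB WORD1 WORD2 EEE FFF GGG, with SYN stacked on WORD1 at position 2. */
    static Map<String, Set<Integer>> example() {
        Map<String, Set<Integer>> postings = new HashMap<>();
        String[] words = {"AAA", "BBB", "WORD1", "WORD2", "EEE", "FFF", "GGG"};
        for (int pos = 0; pos < words.length; pos++) {
            postings.put(words[pos], Set.of(pos));
        }
        postings.put("SYN", Set.of(2));
        return postings;
    }

    public static void main(String[] args) {
        Map<String, Set<Integer>> postings = example();
        System.out.println(matchesPhrase(postings, "SYN", "WORD2"));      // true, but unwanted
        System.out.println(matchesPhrase(postings, "BBB", "SYN", "EEE")); // false, but wanted
    }
}
```

The same layout also misses "BBB SYN EEE", because EEE sits two positions after SYN's start.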

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de



Re: Indexing synonyms for multiple words

Posted by Michael McCandless <lu...@mikemccandless.com>.
Shouldn't WORD2's position be 1 more than your SYN?

I.e., don't you want these positions?

    WORD1  2
    WORD2  3
    SYN    2

The position is the starting position of the token; Lucene doesn't
store an ending position.
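
As an illustration of how these positions arise, the following plain-Java sketch (not Lucene code) accumulates per-token position increments into absolute positions. The class `PositionSketch` and its methods are hypothetical, and it assumes each term occurs only once so that a map can hold the result.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class PositionSketch {
    /** Converts per-token position increments into absolute positions (first token lands on 0). */
    static Map<String, Integer> absolutePositions(String[] terms, int[] posIncs) {
        Map<String, Integer> positions = new LinkedHashMap<>();
        int pos = -1; // an increment of 1 on the first token yields position 0
        for (int i = 0; i < terms.length; i++) {
            pos += posIncs[i];
            positions.put(terms[i], pos); // assumes each term appears only once
        }
        return positions;
    }

    /** The sentence from the thread, with SYN injected after WORD1 using increment 0. */
    static Map<String, Integer> example() {
        String[] terms = {"AAA", "BBB", "WORD1", "SYN", "WORD2", "EEE", "FFF", "GGG"};
        int[] posIncs = {1, 1, 1, 0, 1, 1, 1, 1};
        return absolutePositions(terms, posIncs);
    }

    public static void main(String[] args) {
        // Prints {AAA=0, BBB=1, WORD1=2, SYN=2, WORD2=3, EEE=4, FFF=5, GGG=6}
        System.out.println(example());
    }
}
```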

Mike


