Posted to java-user@lucene.apache.org by Donna L Gresh <gr...@us.ibm.com> on 2009/08/07 16:35:10 UTC

Is there a way for me to handle a multiword synonym correctly?

I saw some discussion on the board but I'm not sure I've got quite the 
same problem. As an example, I have a query that might be a technical 
skill:

SAP EM FIN AM

I would like that to match a document that has *either* SAP.EM.FIN.AM or 
"SAP EM FIN AM" (in that order and all together, not spread out through 
the document).

The approach I had tried was at index time if I saw SAP.EM.FIN.AM I would 
consider "SAP EM FIN AM" a synonym for it, using the Lucene in Action 
example. Luke shows me that I have two terms in the index for this 
document: SAP.EM.FIN.AM and "SAP EM FIN AM" (one term). Thus it appears 
differently in the index than if it had been organically found as just the 
string of tokens, in which case there would be separate terms for SAP, EM, 
and so on. 

At query time, if I look for "SAP EM FIN AM", it is formed as a phrase query 
with a slop of 0, which does *not* match the one-term version "SAP EM FIN 
AM". (For that matter, a simple boolean query doesn't find it either.) Luke 
confirms that the phrase query does not find my synonym term. The 
query "SAP EM FIN AM" finds *only* documents that originally had those 
separated tokens in them.
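
The mismatch described above can be illustrated with a simplified model. This is only a sketch, not Lucene's implementation: it treats a document as a flat list of terms and ignores position increments, and the class and method names are hypothetical. It shows that a zero-slop phrase query for four terms needs four matching terms in a row, so a single term whose text happens to be "SAP EM FIN AM" never matches.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class PhraseMatchDemo {

    // Simplified stand-in for a slop-0 phrase query: the query terms must
    // appear in the document consecutively and in the same order.
    static boolean matchesExactPhrase(List<String> docTerms, List<String> queryTerms) {
        return Collections.indexOfSubList(docTerms, queryTerms) >= 0;
    }

    public static void main(String[] args) {
        List<String> query = Arrays.asList("SAP", "EM", "FIN", "AM");

        // Document whose text contained the four separate tokens.
        List<String> separateTokens = Arrays.asList("skills", "SAP", "EM", "FIN", "AM");

        // Document indexed with the dotted term plus a one-term synonym.
        List<String> oneTermSynonym = Arrays.asList("skills", "SAP.EM.FIN.AM", "SAP EM FIN AM");

        System.out.println(matchesExactPhrase(separateTokens, query)); // true
        System.out.println(matchesExactPhrase(oneTermSynonym, query)); // false
    }
}
```

The same reasoning covers the boolean query: none of the individual terms SAP, EM, FIN, or AM is equal to the single indexed term.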

Is there a way to handle this situation such that at index time I can turn 
SAP.EM.FIN.AM into something that will be found with a query for "SAP EM 
FIN AM"?

Thanks for any pointers

Donna

Re: Is there a way for me to handle a multiword synonym correctly?

Posted by Matthew Hall <mh...@informatics.jax.org>.

Create a field specifically for this type of match.

At indexing time, you can then manipulate your data so that it can be 
matched regardless of punctuation.

In this field you would convert all non-letter characters into spaces 
and collapse each run of whitespace into a single space ("     " 
becomes " "). You could also lowercase it at the same time.

Then at search time perform a special search against this field that 
does the same thing to the query string.  At this point plain old phrase 
queries should work for you.
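
A minimal sketch of that normalization in plain Java, assuming it is applied to both the field text before indexing and the query string before searching. The class and method names are hypothetical, and no Lucene API is involved here.

```java
import java.util.Locale;

final class PunctuationNormalizer {

    // Hypothetical helper: maps text to a punctuation-insensitive form.
    static String normalize(String text) {
        return text
                .replaceAll("[^\\p{L}]", " ")  // every non-letter character becomes a space
                .replaceAll("\\s+", " ")       // each run of whitespace collapses to one space
                .trim()
                .toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(normalize("SAP.EM.FIN.AM"));    // sap em fin am
        System.out.println(normalize("SAP   EM FIN  AM")); // sap em fin am
        System.out.println(normalize("Rara<^tm3.1Ipc>"));  // rara tm ipc
    }
}
```

Note that taking "non letter" literally also drops digits, as the last line shows. If digits matter in your data, the character class would need to keep them.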

Our corpus contains remarkably obnoxious items like: Rara<^tm3.1Ipc>

So we need to be able to do things very similar to what you are 
describing, and the technique described above worked like a charm.

Matt


-- 
Matthew Hall
Software Engineer
Mouse Genome Informatics
mhall@informatics.jax.org
(207) 288-6012



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

RE: Is there a way for me to handle a multiword synonym correctly?

Posted by Donna L Gresh <gr...@us.ibm.com>.

I have to think about this a bit, but that may work. I just have to make 
sure no "undesirable" side effects occur. I certainly want to be able to 
search for a phrase and not have it match all the individual bits, but 
that should already work using the mechanism I already have in place.

Donna 


"Carl Austin" <Ca...@detica.com> wrote on 08/07/2009 10:50:08 AM:

> I may be oversimplifying here, but in this case don't you just need to
> use an analyzer that breaks the word "SAP.EM.FIN.AM" on full stops and
> throws them out, so that it is indexed as the terms "SAP" "EM" "FIN" "AM"?
> That is the same way "SAP EM FIN AM" will be indexed, as long as you
> break on whitespace too, i.e. SimpleAnalyzer (runs of letter characters
> are tokens).
> 
> Then the query for "SAP EM FIN AM" will match both.
> 
> Carl
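
The tokenization rule Carl describes can be sketched in plain Java as follows. This only approximates the rule "runs of letter characters are tokens", with lowercasing added as SimpleAnalyzer does; it is not Lucene code, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class LetterRunTokenizer {

    // Hypothetical helper: emits each maximal run of letters as one
    // lowercased token; every other character acts as a separator.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetter(c)) {
                current.append(c);
            } else if (current.length() > 0) {
                tokens.add(current.toString().toLowerCase(Locale.ROOT));
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString().toLowerCase(Locale.ROOT));
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("SAP.EM.FIN.AM")); // [sap, em, fin, am]
        System.out.println(tokenize("SAP EM FIN AM")); // [sap, em, fin, am]
    }
}
```

Because both spellings produce the same token sequence, a phrase query analyzed the same way would see them as identical.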
> 
> 