Posted to java-user@lucene.apache.org by "Peter Hotm. Nørregaard" <ng...@hotmail.com> on 2005/04/11 15:36:01 UTC

How to include a multi-word synonym to a word when indexing?

According to "Lucene in Action" it is possible to get synonyms indexed 
together with a word by putting multiple words with the same position-id in 
the term vector.

My problem is, however, that some words need to have alternatives where the 
word is decomposed / decompounded into two or more words:

"FooBar Corp" or "cybercafe"

should be found when searching for

"Foo Ba*" or "cyber cafe"


The reverse is also true: "Foo Bar Corp" should be found with "Foob* 
corp".

So how do I solve this problem?



Thanks,



Peter



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: How to include a multi-word synonym to a word when indexing?

Posted by Erik Hatcher <er...@ehatchersolutions.com>.

On Apr 12, 2005, at 1:42 AM, Chris Hostetter wrote:
>
> : You'll need some kind of lookup to know how to split a token like
> : "cybercafe" into two words - once you've done that it will be easy to
> : set the position increment of them to zero so that they overlay the
> : original term.
>
> but how would you set the position increment of a multi-word synonym so
> that phrase/span queries will work?
>
> Assuming you have the following "phrase synonym" (and code that
> can find them during Analysis)...
>
>    [CyberCafe] => [Cyber] [Cafe]
>    [IBM] => [International] [Business] [Machines]
>    [Cyber] [Cafe] => [CyberCafe]
>    [International] [Business] [Machines] => [IBM]
>
> and the source documents:
>
> 1) bob bought stock in IBM for five bucks
> 2) sue went to the cybercafe yesterday
> 3) joe was at the cafe, cyber chatting yesterday
>
> ...how would you set the position incriment so that a span/phrase query
> for "stock in International Business Machines" would match document #1,
> and "cyber cafe" would match document #2 but not #3 ?

On further thought, my approach would be to handle this on the analysis 
side and not deal with position increments.  The lookup would take 
"cyber cafe" and emit the token "cybercafe".  In your #3 example, the 
tokens would be [cafe] [cyber] and would not match.  If someone issued 
a phrase query for "cyber cafe" the same analysis would turn that into 
a query for "cybercafe".
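A minimal sketch of the replace-during-analysis idea described above, written in plain Python rather than against the Lucene API. The ALIASES table and the function names are hypothetical; a real implementation would sit in a token filter used by both the indexing and the query analyzer.

```python
import re

# Hypothetical lookup table: multi-word sequence -> single alias token.
ALIASES = {
    ("cyber", "cafe"): "cybercafe",
    ("international", "business", "machines"): "ibm",
}
MAX_LEN = max(len(key) for key in ALIASES)


def tokenize(text):
    """Lowercase the text and split on anything that is not a letter or digit."""
    return re.findall(r"[a-z0-9]+", text.lower())


def collapse(tokens):
    """Replace each known multi-word sequence with its alias, longest match first."""
    out = []
    i = 0
    while i < len(tokens):
        for n in range(min(MAX_LEN, len(tokens) - i), 1, -1):
            alias = ALIASES.get(tuple(tokens[i:i + n]))
            if alias is not None:
                out.append(alias)
                i += n
                break
        else:
            # No multi-word sequence starts here; keep the token unchanged.
            out.append(tokens[i])
            i += 1
    return out


def analyze(text):
    """The same analysis is assumed for documents and for queries."""
    return collapse(tokenize(text))


def phrase_in(doc_text, query_text):
    """Exact phrase match: the analyzed query must appear as a contiguous run."""
    doc, query = analyze(doc_text), analyze(query_text)
    return bool(query) and any(
        doc[i:i + len(query)] == query
        for i in range(len(doc) - len(query) + 1)
    )
```

With this sketch, the phrase "cyber cafe" is analyzed to the single token "cybercafe", so it matches document #2 and not document #3, where the tokens stay [cafe] [cyber].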

What drawbacks are there to replacing multiple words with their 
corresponding acronym/alias during analysis?

> the only thing that's ever occurred to me is to set the position 
> incriment

I can't help myself, I'm working with the spell checker as we speak.... 
incrEment :)

	Erik



Re: How to include a multi-word synonym to a word when indexing?

Posted by Chris Hostetter <ho...@fucit.org>.

: You'll need some kind of lookup to know how to split a token like
: "cybercafe" into two words - once you've done that it will be easy to
: set the position increment of them to zero so that they overlay the
: original term.

but how would you set the position increment of a multi-word synonym so
that phrase/span queries will work?

Assuming you have the following "phrase synonym" (and code that
can find them during Analysis)...

   [CyberCafe] => [Cyber] [Cafe]
   [IBM] => [International] [Business] [Machines]
   [Cyber] [Cafe] => [CyberCafe]
   [International] [Business] [Machines] => [IBM]

and the source documents:

1) bob bought stock in IBM for five bucks
2) sue went to the cybercafe yesterday
3) joe was at the cafe, cyber chatting yesterday

...how would you set the position incriment so that a span/phrase query
for "stock in International Business Machines" would match document #1,
and "cyber cafe" would match document #2 but not #3 ?

the only thing that's ever occurred to me is to set the position incriment
of all the words to "0" (but that will still result in false positives in
the "cyber cafe" example) or to pick some high default position incriment
(bigger than the longest multi-word synonym) and use that normally, and
reserve incriments of "1" for words in a multi-word synonym.
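One possible reading of the second idea, sketched in plain Python rather than with the Lucene API. GAP, EXPANSIONS and the function names are assumptions made for illustration: ordinary words advance the position by GAP, the single-token form and the first word of its multi-word form share a position, and the remaining words advance by 1. The query is assumed to go through the same analysis.

```python
import re

GAP = 10  # hypothetical default increment, larger than the longest synonym

# Hypothetical phrase-synonym table, keyed by the single-token form.
EXPANSIONS = {
    "cybercafe": ["cyber", "cafe"],
    "ibm": ["international", "business", "machines"],
}
# Reverse view: multi-word form -> single token.
PHRASES = {tuple(words): single for single, words in EXPANSIONS.items()}
MAX_LEN = max(len(p) for p in PHRASES)


def analyze(text):
    """Return (term, position) pairs using the gap scheme."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    out, pos, i = [], 0, 0
    while i < len(words):
        pos += GAP
        single, parts, consumed = None, None, 1
        if words[i] in EXPANSIONS:
            single, parts = words[i], EXPANSIONS[words[i]]
        else:
            for n in range(min(MAX_LEN, len(words) - i), 1, -1):
                hit = PHRASES.get(tuple(words[i:i + n]))
                if hit is not None:
                    single, parts, consumed = hit, words[i:i + n], n
                    break
        if single is None:
            out.append((words[i], pos))
        else:
            # Emit both forms: the single token overlays the first word,
            # and the following words sit at consecutive positions.
            out.append((single, pos))
            for offset, part in enumerate(parts):
                out.append((part, pos + offset))
        i += consumed
    return out


def phrase_matches(doc_text, query_text):
    """Exact (slop 0) phrase match on relative positions."""
    doc = {}
    for term, p in analyze(doc_text):
        doc.setdefault(term, set()).add(p)
    query = analyze(query_text)
    if not query:
        return False
    first_term, base = query[0]
    return any(
        all((start + qp - base) in doc.get(term, ()) for term, qp in query)
        for start in doc.get(first_term, ())
    )
```

Under these assumptions, "stock in International Business Machines" matches document #1, and "cyber cafe" matches document #2 but not #3. The sketch says nothing about sloppy phrase queries or scoring, which the gap would also affect.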


-Hoss



Re: How to include a multi-word synonym to a word when indexing?

Posted by "Peter Hotm. Nørregaard" <ng...@hotmail.com>.

Good point on phrase/span queries, Hostetter.

:Assuming you have the following "phrase synonym" (and code that
:can find them during Analysis)...
:
:  [CyberCafe] => [Cyber] [Cafe]

...
: the only thing that's ever occurred to me is to set the position incriment
: of all the words to "0" (but that will still result in false positives in
: the "cyber cafe" example) or to pick some high default position incriment
: (bigger than the longest multi-word synonym) and use that normally, and
: reserve incriments of "1" for words in a multi-word synonym.

A good suggestion, however it does have a small side-effect: If I understand 
you correctly, that strategy will create the following token stream for 
"CyberCafe Inc.", assuming that we increment by, say, 10 per default:
[cybercafe, 1] [cyber, 1] [cafe, 2] [inc, 10]

In that case, a search for the phrase "cybercafe cafe inc" would return a 
match. Here that is acceptable, albeit a bit strange to the user, but 
then again, searching for "cybercafe cafe" IS a bit strange. However, 
situations can be constructed where the result would be a false positive. 
Also, we could end up with no match for phrase queries if the slop factor is 
too low (e.g. 0): "Cybercafe inc" would not be found unless the same 
analysis algorithm is applied to both the document and the query.
Ranking could also be adversely affected.

Is there no such concept as a two-dimensional term vector?
[CyberCafe Inc] => [[cybercafe], [[cyber] [cafe]]] [inc]
(in theory it would have to be a directed acyclic graph (DAG), I guess)


RE: How to include a multi-word synonym to a word when indexing?

Posted by Pasha Bizhan <fc...@ok.ru>.

Hi, 

> From: Erik Hatcher [mailto:erik@ehatchersolutions.com] 

> > My problem is, however, that some words need to have alternatives 
> > where the word is decomposed / decompounded into two or more words:
> >
> > "FooBar Corp" or "cybercafe"
> >
> > should be found when searching for
> >
> > "Foo Ba*" or "cyber cafe"
 
> You'll need some kind of lookup to know how to split a token like 
> "cybercafe" into two words - once you've done that it will be easy to 
> set the position increment of them to zero so that they overlay the 
> original term.

What about putting all synonyms into the index? 
Foo Bar Corp, FooBar Corp, FooBarCorp, cyber cafe, cybercafe, etc.?
In this case we do not need to analyze the input query.

Pasha Bizhan
 



Re: How to include a multi-word synonym to a word when indexing?

Posted by "Peter Hotm. Nørregaard" <ng...@hotmail.com>.

>What drawbacks are there to replacing multiple words with their 
>corresponding acronym/alias during analysis?

- Wildcard search: [cyber] [ca*] would not match [cybercafe]
- Fuzzy search: [cyber] [cage~] would not match [cybercafe]
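
The wildcard case can be illustrated with a small Python sketch (not Lucene code). The term list is an assumed result of indexing "sue went to the cybercafe yesterday" after "cyber cafe" has been collapsed to "cybercafe"; terms_matching is a hypothetical helper that expands a pattern against the indexed terms.

```python
from fnmatch import fnmatchcase

# Assumed index terms once the multi-word form has been replaced by its alias.
indexed_terms = ["sue", "went", "to", "the", "cybercafe", "yesterday"]


def terms_matching(pattern, terms):
    """Return the indexed terms that a wildcard pattern would expand to."""
    return [term for term in terms if fnmatchcase(term, pattern)]


# Neither part of the query [cyber] [ca*] finds a term of its own,
# because only the combined token was indexed.
cyber_hits = terms_matching("cyber", indexed_terms)
ca_hits = terms_matching("ca*", indexed_terms)
# A prefix pattern on the combined token still works.
combined_hits = terms_matching("cyber*", indexed_terms)
```

Here cyber_hits and ca_hits are both empty, while combined_hits contains "cybercafe".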

Peter


Re: How to include a multi-word synonym to a word when indexing?

Posted by Erik Hatcher <er...@ehatchersolutions.com>.

On Apr 11, 2005, at 9:36 AM, Peter Hotm. Nørregaard wrote:
> According to "Lucene in Action" it is possible to get synonyms indexed 
> together with a word by putting multiple words with the same 
> position-id in the term vector.
>
> My problem is, however, that some words need to have alternatives 
> where the word is decomposed / decompounded into two or more words:
>
> "FooBar Corp" or "cybercafe"
>
> should be found when searching for
>
> "Foo Ba*" or "cyber cafe"

First, phrase queries with wildcards are not currently a built-in 
capability of Lucene.

You'll need some kind of lookup to know how to split a token like 
"cybercafe" into two words - once you've done that it will be easy to 
set the position increment of them to zero so that they overlay the 
original term.
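
A rough sketch of that overlay in plain Python, not the Lucene API. DECOMPOUND is a hypothetical lookup table, and each token is shown as a (term, position increment) pair; positions are counted from 0 here.

```python
# Hypothetical lookup table: compound token -> its parts.
DECOMPOUND = {
    "cybercafe": ["cyber", "cafe"],
    "foobar": ["foo", "bar"],
}


def overlay(tokens):
    """Emit each token with increment 1, then its parts with increment 0."""
    stream = []
    for token in tokens:
        stream.append((token, 1))
        for part in DECOMPOUND.get(token, []):
            # Increment 0 places the part at the same position as the original.
            stream.append((part, 0))
    return stream


def positions(stream):
    """Turn (term, increment) pairs into (term, absolute position) pairs."""
    result = []
    pos = -1
    for term, increment in stream:
        pos += increment
        result.append((term, pos))
    return result
```

For the tokens ["foobar", "corp"], all of "foobar", "foo" and "bar" land on position 0 and "corp" on position 1. Because the parts share one position, their order within the compound is not recorded.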

> The reverse is also true: "Foo Bar Corp" should be found with 
> "Foob* corp".

Again the caveat about wildcard phrase queries applies.  You'll need to 
use that same lookup mechanism to combine multiple tokens coming 
through a token filter into a single combined one.

> So how do I solve this problem?

The approach I've described above is a crude dictionary-based one. I'm 
sure other tricks could be employed, but they would be quite 
sophisticated (N-gram sequencing à la LingPipe comes to mind).

	Erik
