Posted to java-user@lucene.apache.org by Paul Smith <PS...@tenfold.com> on 2005/04/29 18:47:08 UTC

Re: Re[2]: multi word synonym (was Hungarian notation analyzer and phrase queries)

Indexing every multi-word synonym as a single token would introduce
spaces into the tokens. In that case searching for (java) would not
match "i love jsp and tomcat". I think that searching for (java*) would
match.

Rewriting the query is also problematic. If you search for (java
server), you don't have a rule to rewrite it to help you find jsp.
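
A minimal sketch of the matching problem, assuming a toy inverted index held in a Python dict rather than a real Lucene index. The names `index`, `term_query`, and `prefix_query` are hypothetical and exist only for this illustration. It shows that when a multi-word synonym is stored as one term containing spaces, an exact lookup for one of its words finds nothing, while a prefix lookup does.

```python
# Toy inverted index: each indexed term maps to the positions it occupies.
# Multi-word synonyms are stored as single terms that contain spaces.
index = {
    "i": {0},
    "love": {1},
    "jsp": {2},
    "java server pages": {2},
    "javaserver pages": {2},
    "and": {3},
    "tomcat": {4},
}

def term_query(term):
    """Exact term lookup, analogous to a plain single-term query."""
    return index.get(term, set())

def prefix_query(prefix):
    """Positions of every indexed term that starts with the given prefix."""
    hits = set()
    for term, positions in index.items():
        if term.startswith(prefix):
            hits |= positions
    return hits

print(term_query("java"))    # empty: no indexed term is exactly "java"
print(prefix_query("java"))  # {2}: both "java..." synonym terms start with it
print(term_query("server"))  # empty: "server" exists only inside a longer term
```

The last lookup illustrates a further consequence: words in the middle of a synonym term cannot be reached even with a prefix query.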

Paul

>>> sven.duzont@keljob.com 04/27/05 05:51AM >>>
Hello,

What about indexing every multi-word synonym as a single token?
Example :
Phrase to index : "i love jsp and tomcat"
Synonyms        : "jsp" = "java server pages" = "javaserver pages"
Tokens          : i love jsp               and tomcat
                         java server pages
                         javaserver pages
Position        : 0 1    2                 3   4

This solution has the advantage of solving the phrase query
problems.

One will also have to rewrite queries before parsing them with the
QueryParser. For instance, the query (tomcat jsp) would be rewritten as
(tomcat (jsp OR "java server pages" OR "javaserver pages"))
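
The rewriting step could be sketched roughly as follows. This is only an illustration working on plain strings before they reach the QueryParser; the `SYNONYMS` table and the `rewrite_query` helper are hypothetical names, and the sketch ignores quoted phrases, field prefixes, and boolean operators in the input.

```python
# Hypothetical synonym table: single-word key -> multi-word alternatives.
SYNONYMS = {
    "jsp": ["java server pages", "javaserver pages"],
}

def rewrite_query(query):
    """Expand each whitespace-separated word that has synonyms into an OR group."""
    parts = []
    for word in query.split():
        alternatives = SYNONYMS.get(word.lower())
        if alternatives:
            # Multi-word alternatives are quoted so they stay phrases.
            quoted = ['"%s"' % alt for alt in alternatives]
            parts.append("(" + " OR ".join([word] + quoted) + ")")
        else:
            parts.append(word)
    return " ".join(parts)

print(rewrite_query("tomcat jsp"))
# tomcat (jsp OR "java server pages" OR "javaserver pages")
print(rewrite_query("java server"))
# java server
```

The second call returns the query unchanged, because the table only maps from "jsp" outward and no rule fires for a partial synonym such as (java server).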

Any thoughts ?
Thanks in advance

---
 Sven





Wednesday, 13 April 2005, 19:36:44, you wrote:


CH> : Another approach would be to index this as:
CH> :
CH> : token:       use   power      query for advanced searches
CH> :                    powerquery
CH> : position:    0     1          2     3   4        5
CH> :
CH> : Then use phrase queries with slop=1, to permit a one-token gap when
CH> : someone searches for "use powerquery for advanced searches".

CH> Right, but in your example "1" is a magic number that works because we
CH> are only dealing with "multi-word synonyms" of 2 words. In general, this
CH> approach requires that you pick some N such that you are guaranteed no
CH> synonym contains more than N-1 words, and set the token positions to...

CH>  token:       use   powerquery         for  advanced  searches
CH>                     power      query
CH>  position:    0     N          N+1     2N   3N        4N
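
The spacing scheme in the table above can be sketched in a few lines. This is an illustration only, not Lucene code: `assign_positions` is a hypothetical helper that places the i-th source token at position i * N and lays the words of each multi-word synonym into the gap that follows it. The length check mirrors the stated constraint that no synonym has more than N-1 words.

```python
def assign_positions(tokens, synonyms, n):
    """Return (term, position) pairs with source tokens spaced n apart."""
    postings = []
    for i, token in enumerate(tokens):
        base = i * n
        postings.append((token, base))
        for phrase in synonyms.get(token, []):
            words = phrase.split()
            if len(words) > n - 1:
                raise ValueError("synonym %r has more than n-1 words" % phrase)
            # Synonym words occupy base, base+1, ... inside the gap.
            for offset, word in enumerate(words):
                postings.append((word, base + offset))
    return postings

tokens = ["use", "powerquery", "for", "advanced", "searches"]
synonyms = {"powerquery": ["power query"]}

for term, position in assign_positions(tokens, synonyms, n=3):
    print(position, term)
```

With n=3 this puts "powerquery" and "power" at position 3, "query" at 4, and "for" at 6, matching the N, N+1, 2N layout.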





CH> -Hoss


CH>
---------------------------------------------------------------------
CH> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org 
CH> For additional commands, e-mail: java-user-help@lucene.apache.org 





http://www.tenfold.com

**********************************************************************
This email and any files transmitted with it are confidential and
intended solely for the use of the individual or entity to whom they
are addressed. If you have received this email in error please notify
the TenFold Postmaster (postmaster@tenfold.com).
**********************************************************************



Re[4]: multi word synonym (was Hungarian notation analyzer and phrase queries)

Posted by Sven Duzont <sv...@laposte.net>.

Hi,

 Thanks for your reply, Paul.
 Yes, this was a delicate point.
 I gave up on indexing multi-word synonyms as a single token, for the
 reason you pointed out.

 To handle PhraseQueries, I shift the positions of the terms that
 follow a synonym.
 For instance, for the PhraseQuery "jsp and vb developer", with the
 synonyms "vb:visual basic" and "jsp:java server pages", the positions
 of the terms will be:
 Terms    : jsp vb developer
 Position : 0   3  5

 With this scheme, a query for "jsp server" will not match.
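
One possible reading of the position shifting described above, as a sketch. It rests on two assumptions inferred from the example rather than stated in the message: "and" is dropped as a stop word without leaving a gap, and each term advances the position counter by the word count of its longest synonym (or by 1 if it has none). The names `SYNONYMS`, `STOP_WORDS`, and `phrase_positions` are hypothetical.

```python
# Hypothetical synonym table and stop list for this illustration.
SYNONYMS = {
    "jsp": ["java server pages"],
    "vb": ["visual basic"],
}
STOP_WORDS = {"and"}

def phrase_positions(phrase):
    """Assign positions to phrase terms, leaving room for multi-word synonyms."""
    positions = []
    next_position = 0
    for term in phrase.lower().split():
        if term in STOP_WORDS:
            continue  # assumed: stop words are removed and take no position
        positions.append((term, next_position))
        widths = [len(s.split()) for s in SYNONYMS.get(term, [])]
        next_position += max([1] + widths)
    return positions

print(phrase_positions("jsp and vb developer"))
# [('jsp', 0), ('vb', 3), ('developer', 5)]
print(phrase_positions("jsp server"))
# [('jsp', 0), ('server', 3)]
```

In the second call "server" is expected three positions after "jsp", so the phrase cannot line up with the "server" that sits inside "java server pages".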

 I have some code I can share if anyone is interested.

 Thanks again.

Sven

On Friday, 29 April 2005 at 21:58:54, you wrote:

PL> I knew there was a catch...

PL> I do think, however, that the point is a delicate one which would
PL> deserve consideration: multi-word synonyms are quite common!

PL> paul




Re: Re[2]: multi word synonym (was Hungarian notation analyzer and phrase queries)

Posted by Paul Libbrecht <pa...@activemath.org>.

I knew there was a catch...

I do think, however, that the point is a delicate one which would
deserve consideration: multi-word synonyms are quite common!

paul



