Posted to java-user@lucene.apache.org by Daniel Bigham <da...@wolfram.com> on 2016/04/28 17:26:15 UTC

Query Expansion for Synonyms

I'm investigating various ways of supporting synonyms in Lucene.

One such approach that looks potentially interesting is to do a kind of 
"query expansion".

For example, if the user searches for "us 1888", one might expand the 
query as follows:

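     // Imports assume the Lucene 5.x/6.x package layout.
     import org.apache.lucene.index.Term;
     import org.apache.lucene.search.spans.SpanNearQuery;
     import org.apache.lucene.search.spans.SpanOrQuery;
     import org.apache.lucene.search.spans.SpanQuery;
     import org.apache.lucene.search.spans.SpanTermQuery;

     // Outer clause: the expanded first token immediately followed by "1888".
     SpanNearQuery query =
     new SpanNearQuery(
         new SpanQuery[]
         {
             // Either the literal token "us" ...
             new SpanOrQuery(
                 new SpanTermQuery(new Term("Plaintext", "us")),
                 // ... or the exact phrase "united states".
                 new SpanNearQuery(
                     new SpanQuery[]
                     {
                         new SpanTermQuery(new Term("Plaintext", "united")),
                         new SpanTermQuery(new Term("Plaintext", "states"))
                     },
                     0,      // slop: no intervening positions
                     true    // inOrder: terms must appear in this order
                 )
             ),
             new SpanTermQuery(new Term("Plaintext", "1888"))
         },
         0,      // slop
         true    // inOrder
     );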
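
To show how such an expansion might be generated from a synonym table rather than written by hand, here is a minimal sketch in plain Java with no Lucene dependency. It only renders the nested structure as text, using "near(...)" for an ordered, zero-slop group and "or(...)" for alternatives. The class name, the SYNONYMS table, and the text notation are all hypothetical and chosen purely for illustration. A real implementation would build SpanTermQuery, SpanOrQuery, and SpanNearQuery objects at the same points.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class SynonymExpansionSketch {
    // Hypothetical synonym table: a single token maps to alternative phrases.
    static final Map<String, List<String>> SYNONYMS =
        Map.of("us", List.of("united states"));

    // Renders a phrase as an ordered, zero-slop "near" group,
    // or as a bare term when it is a single word.
    static String phrase(String text) {
        String[] words = text.trim().split("\\s+");
        if (words.length == 1) {
            return words[0];
        }
        return "near(" + String.join(", ", words) + ")";
    }

    // Expands each whitespace-separated token of the user query into
    // either the token itself or an "or" over the token and its synonyms.
    static String expand(String userQuery) {
        List<String> clauses = new ArrayList<>();
        for (String token : userQuery.toLowerCase().trim().split("\\s+")) {
            List<String> alternatives = SYNONYMS.getOrDefault(token, List.of());
            if (alternatives.isEmpty()) {
                clauses.add(token);
            } else {
                List<String> options = new ArrayList<>();
                options.add(token);
                for (String alternative : alternatives) {
                    options.add(phrase(alternative));
                }
                clauses.add("or(" + String.join(", ", options) + ")");
            }
        }
        // A single clause needs no enclosing "near" group.
        return clauses.size() == 1
            ? clauses.get(0)
            : "near(" + String.join(", ", clauses) + ")";
    }

    public static void main(String[] args) {
        System.out.println(expand("us 1888"));
    }
}
```

For the query "us 1888" this produces the same shape as the hand-written query above: an ordered group whose first clause is an alternative between "us" and the phrase "united states".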

A couple of questions:

- Is this approach in use within the community?
- Are there "gotchas" with this approach that make it undesirable?

I've done a few quick tests of query performance on a test index and 
found that a query can indeed take 10x longer if enough synonyms are 
used, but if the baseline search time is around 1 ms, then 10 ms is 
still plenty fast. (That said, my test was on a 70 MB index, so my 
10 ms might turn into something nasty with a 7 GB index.)

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Query Expansion for Synonyms

Posted by Ahmet Arslan <io...@yahoo.com.INVALID>.
Hi Daniel,

Since you are requiring inOrder=true and a slop (proximity) of 0 in the top-level query, there is no problem in your particular example.

If you weren't restricting it that way, injecting synonyms with a plain OR can sometimes cause 'query drift': the injection/addition of a single term changes the result list drastically.

When there is a big difference in term statistics (document frequency, collection frequency, etc.) between the injected term and the original term, the results can be unexpected.

The BlendedTermQuery and SynonymQuery implementations could be used to address this.
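
To make the term-statistics point concrete, here is a small sketch in plain Java with no Lucene dependency. It uses the BM25-style inverse document frequency formula, log(1 + (N - df + 0.5) / (df + 0.5)), and made-up document counts to show how a rare injected synonym can end up weighted far more heavily than a common original term. The final line shows one simple mitigation, assumed here purely for illustration: scoring all variants with a shared document frequency (the largest one). This is only in the spirit of what query classes for synonyms aim to do; their actual behavior should be checked against the Lucene documentation.

```java
public class TermStatsDriftSketch {
    // BM25-style inverse document frequency: rarer terms get larger weights.
    static double idf(long docFreq, long docCount) {
        return Math.log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    public static void main(String[] args) {
        long docCount = 1_000_000;   // made-up collection size
        long dfOriginal = 200_000;   // made-up: the original term is common
        long dfSynonym = 50;         // made-up: the injected synonym is rare

        double idfOriginal = idf(dfOriginal, docCount);
        double idfSynonym = idf(dfSynonym, docCount);

        // Illustrative mitigation: treat all variants as one term that
        // shares the largest document frequency among them.
        double idfShared = idf(Math.max(dfOriginal, dfSynonym), docCount);

        System.out.printf(
            "idf(original)=%.3f idf(synonym)=%.3f idf(shared)=%.3f%n",
            idfOriginal, idfSynonym, idfShared);
    }
}
```

With a plain OR, documents matching only the rare synonym can outrank documents matching the original term, which is one way the drift described above shows up.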

Ahmet
