You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@lucene.apache.org by Dawid Weiss <da...@cs.put.poznan.pl> on 2009/04/22 10:49:51 UTC

Synonym filter with support for phrases?

Hello everyone,

I'm looking for feedback and thoughts on the following problem (it's more of 
development than user-centered problem, hope the dev list is appropriate):

- a token stream is given,

- a set of "synonyms" is given, where synonyms are token sequences to be matched 
and token sequences to be added as synonyms.

An example to make things clearer (apologies for lame synonyms). Given a set of 
synonyms like this:

{"new", "york"} -> {
	{"big", "apple"}},

{"restaurant"}  -> {
	{"diner"},
	{"food", "place"},
	{"full", "belly"}}
}

a token stream (I try to indicate positional information here):

0 | 1   | 2          | 3  | 4   | 5
a | new | restaurant | in | new | york

would be enriched to an index of (note overlapping tokens on the same positions):

0 | 1   | 2          | 3     | 4   | 5
a | new | restaurant | in    | new | york
   |     | diner      |       | big | apple
   |     | food       | place |     |
   |     | full       | belly |     |

The point is for phrase queries to work for synonyms and for the original text 
(of course multi-word synonyms longer than the original phrase would overlap 
with the text, but this shouldn't be much of a worry).

In the current Lucene's trunk there is a synonym filter, but its implementation 
is not really suitable for achieving the above. I wrote a token filter that 
implements the above functionality, but then I thought that synonyms would be 
something frequently dealt with so my questions are:

a) are there any thoughts on how the above could be implemented using existing 
Lucene infrastructure (perhaps I missed something obvious),

b) if (a) is not applicable, would such a token filter constitute a useful 
addition to Lucene?

Dawid


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: Synonym filter with support for phrases?

Posted by Dawid Weiss <da...@cs.put.poznan.pl>.
> It'd be great to get multi-word synonyms fully working...

I agree -- this is something that seems to be useful for a wider bunch of people.

> How would you change how Lucene indexes token positions to do this "correctly"?

Kirill has some interesting points to this. I have a busy day today, but I'll 
try to clean up and post the code that I put together for another project. It'll 
be a start for refining into better directions.

Dawid

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: Synonym filter with support for phrases?

Posted by Earwin Burrfoot <ea...@gmail.com>.
> On Wed, Apr 22, 2009 at 5:12 AM, Earwin Burrfoot <ea...@gmail.com> wrote:
>
>> Your synonyms will break if you try searching for phrases.
>> Building on your example, "food place in new york" will find nothing,
>> because 'place' and 'in' share the same position.
>
> It'd be great to get multi-word synonyms fully working...
>
> How would you change how Lucene indexes token positions to do this "correctly"?
You need an ability to put two tokens in the same position, with
different posIncrements.

One variant from the top of my head is to introduce a notion of span,
so token becomes (text, span, incr).
(restaurant, 1, 0), (food, 0, 1), (place, 0, 1), (in, 0, 1), (new, 0,
1), (york, 0, 1)

The span affects distance calculation between this term, and some that follows.
E.g. dist(food, in) = 2, because both food and place have incr=1, but
despite restaurant and food having same start position,
dist(restaurant, in) = 1, because restaurant spans an additional
position.

With something like that I think it is possible to formulate an
algorithm for indexing and query rewriting that does "correct"
multiword synonyms.

Right now I cheat when rewriting a query. If my syngroup is a part of
the phrase, and I know that this syngroup has longer phrases than the
one currently detected, I do a span or sloppy phrase query. That
works, but theoretically could match a wrong document.

-- 
Kirill Zakharenko/Кирилл Захаренко (earwin@gmail.com)
Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
ICQ: 104465785

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: Synonym filter with support for phrases?

Posted by Michael McCandless <lu...@mikemccandless.com>.
On Wed, Apr 22, 2009 at 5:12 AM, Earwin Burrfoot <ea...@gmail.com> wrote:

> Your synonyms will break if you try searching for phrases.
> Building on your example, "food place in new york" will find nothing,
> because 'place' and 'in' share the same position.

It'd be great to get multi-word synonyms fully working...

How would you change how Lucene indexes token positions to do this "correctly"?

Mike

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: Synonym filter with support for phrases?

Posted by Earwin Burrfoot <ea...@gmail.com>.
>> engine. So guys looking for "MSU CMC" really want to get "Московский
>> Государственный Университет, факультет ВМиК" and his friends.
> And? How often do they extend this particular phrase with further terms?
They don't need to. Variations of this phrase alone killed my first
several approaches to synonyms :)

-- 
Kirill Zakharenko/Кирилл Захаренко (earwin@gmail.com)
Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
ICQ: 104465785

Re: Synonym filter with support for phrases?

Posted by Dawid Weiss <da...@cs.put.poznan.pl>.
> engine. So guys looking for "MSU CMC" really want to get "Московский
> Государственный Университет, факультет ВМиК" and his friends.

And? How often do they extend this particular phrase with further terms? It must 
be fun to have an index running concurrently on multi language synonyms, mixing 
the two.

>> We deviated off course with this conversation though. I see your point and I respect it.
> Hm? I just shared some experience. Will no longer steer away :)

Oh, don't get me wrong, I appreciate you talking about your experiences -- the 
way you implemented synonyms is certainly interesting. I just didn't want this 
thread to become focused on the discussion what's right and wrong because 
everything depends on the application. I'm wondering what other people did in 
similar situations, that's all.

Dawid

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: Synonym filter with support for phrases?

Posted by Earwin Burrfoot <ea...@gmail.com>.
> Your example concerns phrase queries, so somebody would have to keep adding
> terms to a phrase. My experience with open search queries (I had access to a
> larger slice of queries from Microsoft Live) is that phrases are a minority
> of all searches. In the most common case, people will look for a union of
> terms, and for these queries the solution I described would work just fine.
We're a bit special. Most of our searches are ordered by date, so we
can't use relevance dependant on query term proximity, or whatever, to
boost good docs up. That has many consequences, and one of them is
that people use phrase queries a lot.

> Another thing is that my use case for "phrase synonyms" is that people would
> look for exact synonym phrases, but rarely expand them to cover something
> beyond.
We have a lot of synonyms that are more likely alternate forms rather
than synonyms, plus translations, plus abbrevs - using the same
engine. So guys looking for "MSU CMC" really want to get "Московский
Государственный Университет, факультет ВМиК" and his friends.

> We deviated off course with this conversation though. I see your point and I respect it.
Hm? I just shared some experience. Will no longer steer away :)

-- 
Kirill Zakharenko/Кирилл Захаренко (earwin@gmail.com)
Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
ICQ: 104465785

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: Synonym filter with support for phrases?

Posted by Dawid Weiss <da...@cs.put.poznan.pl>.
> Well, everyone has his own requirements for the search quality. For us
> it was a problem.

The topic is subjective... I don't see this as a deterioration in search 
quality. Let me explain.

Your example concerns phrase queries, so somebody would have to keep adding 
terms to a phrase. My experience with open search queries (I had access to a 
larger slice of queries from Microsoft Live) is that phrases are a minority of 
all searches. In the most common case, people will look for a union of terms, 
and for these queries the solution I described would work just fine.

Another thing is that my use case for "phrase synonyms" is that people would 
look for exact synonym phrases, but rarely expand them to cover something 
beyond. Therefore a phrase "big apple" would find a synonym match (which is what 
I want), but longer phrases such as "restaurants in the big apple" would not 
(like you said). The big question is, of course, if somebody asking for that 
specific phrase would be interested in finding a document where this phrase does 
not occur in its exact form (but as a synonym).

We deviated off course with this conversation though. I see your point and I 
respect it.

Dawid

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: Synonym filter with support for phrases?

Posted by Earwin Burrfoot <ea...@gmail.com>.
>> Building on your example, "food place in new york" will find nothing,
>> because 'place' and 'in' share the same position.
> You're right, but is it such a big problem in real life?

Well, everyone has his own requirements for the search quality. For us
it was a problem.
User enters a query, then refines it by adding new words, then
WHIZBANG! he suddenly sees 'Nothing was found', even though he knows
matching documents exist.

-- 
Kirill Zakharenko/Кирилл Захаренко (earwin@gmail.com)
Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
ICQ: 104465785

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: Synonym filter with support for phrases?

Posted by Dawid Weiss <da...@cs.put.poznan.pl>.
> Your synonyms will break if you try searching for phrases.

Good point, I did write that filter, but I never actually got to searching for 
exact phrases in it (there was a very specific scenario and we used prefix 
queries which worked quite well).

> Building on your example, "food place in new york" will find nothing,
> because 'place' and 'in' share the same position.

You're right, but is it such a big problem in real life? What you're describing 
is searching for a phrase that spawns both the synonym and the actual token 
sequence. What I thought was: searching for phrases that were either just 
synonyms or synonyms and text with an identical position layout (which is the 
case with single-word synonyms). I dare say this covers majority of cases, 
although I have nothing to support this claim.

> While building the index, I inject synonym group ids instead of actual
> words, then I detect synonyms in queries and replace them with group
> ids too. Hard part comes after that, you have to adjust
> positionIncrements on syngroup id tokens, with respect to the longest
 > [snip]

Yep, hairy ;)

> More correct approach is to index as-is and expand queries with actual
> synonym phrases instead of ids, but then queries become really
> humongous if you have any decent synonym dictionary (I have 20+ phrase
> groups).

Query expansion is not the option for me, unfortunately -- to many synonyms. It 
would be much better to do it once at indexing time and rely on this information 
since.

Thanks for sharing your thoughts, Кирилл.

Dawid

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: Synonym filter with support for phrases?

Posted by Earwin Burrfoot <ea...@gmail.com>.
> Hello everyone,
>
> I'm looking for feedback and thoughts on the following problem (it's more of
> development than user-centered problem, hope the dev list is appropriate):
>
> - a token stream is given,
>
> - a set of "synonyms" is given, where synonyms are token sequences to be
> matched and token sequences to be added as synonyms.
>
> An example to make things clearer (apologies for lame synonyms). Given a set
> of synonyms like this:
>
> {"new", "york"} -> {
>        {"big", "apple"}},
>
> {"restaurant"}  -> {
>        {"diner"},
>        {"food", "place"},
>        {"full", "belly"}}
> }
>
> a token stream (I try to indicate positional information here):
>
> 0 | 1   | 2          | 3  | 4   | 5
> a | new | restaurant | in | new | york
>
> would be enriched to an index of (note overlapping tokens on the same
> positions):
>
> 0 | 1   | 2          | 3     | 4   | 5
> a | new | restaurant | in    | new | york
>  |     | diner      |       | big | apple
>  |     | food       | place |     |
>  |     | full       | belly |     |
>
> The point is for phrase queries to work for synonyms and for the original
> text (of course multi-word synonyms longer than the original phrase would
> overlap with the text, but this shouldn't be much of a worry).
>
> In the current Lucene's trunk there is a synonym filter, but its
> implementation is not really suitable for achieving the above. I wrote a
> token filter that implements the above functionality, but then I thought
> that synonyms would be something frequently dealt with so my questions are:
>
> a) are there any thoughts on how the above could be implemented using
> existing Lucene infrastructure (perhaps I missed something obvious),
>
> b) if (a) is not applicable, would such a token filter constitute a useful
> addition to Lucene?
Your synonyms will break if you try searching for phrases.
Building on your example, "food place in new york" will find nothing,
because 'place' and 'in' share the same position.

I've implemented multiword synonyms on my project, it works, but is
really hairy.
While building the index, I inject synonym group ids instead of actual
words, then I detect synonyms in queries and replace them with group
ids too. Hard part comes after that, you have to adjust
positionIncrements on syngroup id tokens, with respect to the longest
synonym contained in that group, then you have to treat overlapping
synonyms. When query rewrite is finished, I end up with a mixture of
Term/Phrase/MultiPhrase/SpanQueries :)

More correct approach is to index as-is and expand queries with actual
synonym phrases instead of ids, but then queries become really
humongous if you have any decent synonym dictionary (I have 20+ phrase
groups).

-- 
Kirill Zakharenko/Кирилл Захаренко (earwin@gmail.com)
Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
ICQ: 104465785

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: Synonym filter with support for phrases?

Posted by Dawid Weiss <da...@cs.put.poznan.pl>.
Apologies for the delay, guys. I tried to solve certain issues that didn't pop 
up in my application (as Kirill said, the problem is indeed quite complex). I 
didn't find all the answers I had been looking for, but nonetheless -- the patch 
that works for my needs is in JIRA. I would be really interested in something 
that does a better job (see the unit tests -- there are certain comments I made 
about the current functionality and its shortcomings), but I couldn't figure out 
a way to do it better.

Dawid

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org