Posted to solr-user@lucene.apache.org by Otis Gospodnetic <ot...@yahoo.com> on 2011/04/25 20:21:51 UTC
Automatic synonyms for multiple variations of a word
Hi,
How do people handle cases where synonyms are used and there are multiple
versions of the original word that really need to point to the same set of
synonyms?
For example:
Consider singular and plural of the word "responsibility". One might have
synonyms defined like this:
responsibility, obligation, duty
But the plural "responsibilities" is not in there, and thus it will not get
expanded to the synonyms above! That's a problem.
Sure, one could change the synonyms file to look like this:
responsibility, responsibilities, obligation, duty
But that means somebody needs to think of all variations of the word!
Is there something one can do to get all variations of the word to map to the
same synonyms without having to explicitly specify all variations of the word?
Thanks,
Otis
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/
Re: Automatic synonyms for multiple variations of a word
Posted by Mike Sokolov <so...@ifactory.com>.
Yes, I see. Makes sense. It is a bit hard to see a "bad" case for your
proposal in that light. Here is one other example; I'm not sure whether
it presents difficulties or not, and may be a bit contrived, but hey,
food for thought at least:
Say you have set up synonyms between names and commonly-used pseudonyms
or alternate names that should not be stemmed:
Malcolm X <=> Malcolm Little
Prince <=> Prince Rogers Nelson
Little Kim <=> Kimberly Denise Jones
Biggy Smalls etc.
You don't want "Malcolm Littler" or "Littlest Kim" or "Big Small" to
match anything. And Princely shouldn't bring up the artist.
But you also have regular linguistic synonyms (not names) that *should*
be stemmed (as in the original example). So little <=> small should
imply littler <=> smaller and so on via stemming.
Ideally you could put one SynonymFilter before the stemming and the
other one after. In that case do the SynonymFilters get composed? I
can't think of a believable example where that would cause a problem,
but maybe you can?
-Mike
On 04/26/2011 04:25 PM, Robert Muir wrote:
> Mike, thanks a lot for your example: the idea here would be you would
> put the lowercasefilter after the synonymfilter, and then you get this
> exact flexibility?
>
> e.g.
> WhitespaceTokenizer
> SynonymFilter -> no lowercasing of tokens is done, as it "analyzes"
> your synonyms with just the tokenizer
> LowerCaseFilter
>
> but
> WhitespaceTokenizer
> LowerCaseFilter
> SynonymFilter -> the synonyms are lowercased, as it "analyzes"
> synonyms with the tokenizer+filter
>
> it's already inconsistent today, because if you do:
>
> LowerCaseTokenizer
> SynonymFilter
>
> then your synonyms are in fact all being lowercased... it's just
> arbitrary that they are only being analyzed with the "tokenizer".
Re: Automatic synonyms for multiple variations of a word
Posted by Robert Muir <rc...@gmail.com>.
Mike, thanks a lot for your example: the idea here would be you would
put the lowercasefilter after the synonymfilter, and then you get this
exact flexibility?
e.g.
WhitespaceTokenizer
SynonymFilter -> no lowercasing of tokens is done, as it "analyzes"
your synonyms with just the tokenizer
LowerCaseFilter
but
WhitespaceTokenizer
LowerCaseFilter
SynonymFilter -> the synonyms are lowercased, as it "analyzes"
synonyms with the tokenizer+filter
it's already inconsistent today, because if you do:
LowerCaseTokenizer
SynonymFilter
then your synonyms are in fact all being lowercased... it's just
arbitrary that they are only being analyzed with the "tokenizer".
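To make the ordering point concrete, here is a rough sketch in plain Python rather than Lucene code. The helper names (whitespace_tokenize, make_synonym_filter and so on) are made up for illustration and are not Solr or Lucene APIs. The only thing it models is which analysis the synonym rules themselves go through before matching.

```python
def whitespace_tokenize(text):
    """Stand-in for WhitespaceTokenizer: split on whitespace only."""
    return text.split()


def lowercase(tokens):
    """Stand-in for LowerCaseFilter."""
    return [t.lower() for t in tokens]


def make_synonym_filter(rules, analyze_rule=lambda term: term):
    """Build a toy synonym filter (hypothetical helper, not a Lucene class).

    rules maps a single-token term to a replacement phrase. analyze_rule is
    whatever analysis sits in front of the synonym filter; it is applied to
    the rule keys so that rules and incoming tokens are treated alike.
    """
    table = {analyze_rule(term): phrase for term, phrase in rules.items()}

    def apply(tokens):
        out = []
        for token in tokens:
            out.append(token)
            if token in table:
                # Keep the original token and add the expansion after it.
                out.extend(table[token].split())
        return out

    return apply


RULES = {"PET": "positron emission tomography"}


def synonyms_then_lowercase(text):
    # Tokenizer -> SynonymFilter -> LowerCaseFilter:
    # rules are matched case-sensitively, so "pet" is left alone.
    syn = make_synonym_filter(RULES)
    return lowercase(syn(whitespace_tokenize(text)))


def lowercase_then_synonyms(text):
    # Tokenizer -> LowerCaseFilter -> SynonymFilter:
    # rules are lowercased too, so "pet" and "PET" both expand.
    syn = make_synonym_filter(RULES, analyze_rule=str.lower)
    return syn(lowercase(whitespace_tokenize(text)))


print(synonyms_then_lowercase("my pet had a PET scan"))
print(lowercase_then_synonyms("my pet had a PET scan"))
```

In the first ordering only the upper-case PET picks up the expansion; in the second both occurrences do, which is the case Mike wants to avoid.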
On Tue, Apr 26, 2011 at 4:13 PM, Mike Sokolov <so...@ifactory.com> wrote:
> Suppose your analysis stack includes lower-casing, but your synonyms are
> only supposed to apply to upper-case tokens. For example, "PET" might be a
> synonym of "positron emission tomography", but "pet" wouldn't be.
>
> -Mike
Re: Automatic synonyms for multiple variations of a word
Posted by Mike Sokolov <so...@ifactory.com>.
Suppose your analysis stack includes lower-casing, but your synonyms are
only supposed to apply to upper-case tokens. For example, "PET" might
be a synonym of "positron emission tomography", but "pet" wouldn't be.
-Mike
On 04/26/2011 09:51 AM, Robert Muir wrote:
> On Tue, Apr 26, 2011 at 12:24 AM, Otis Gospodnetic
> <ot...@yahoo.com> wrote:
>
>
>> But somehow this feels bad (well, so does sticking word variations in what's
>> supposed to be a synonyms file), partly because it means that the person adding
>> new synonyms would need to know what they stem to (or always check it against
>> Solr before editing the file).
>>
> When creating the synonym map from your input file, currently the
> factory actually uses only your Tokenizer to pre-process the synonyms
> file.
>
> One idea would be to use the tokenstream up to the synonymfilter
> itself (including filters). This way if you put a stemmer before the
> synonymfilter, it would stem your synonyms file, too.
>
> I haven't totally thought the whole thing through to see if there's a
> big reason why this wouldn't work (the synonymfilter is complicated,
> sorry). But it does seem like it would produce more consistent
> results... and perhaps the inconsistency isn't so obvious since in the
> default configuration the synonymfilter is directly after the
> tokenizer.
>
Re: Automatic synonyms for multiple variations of a word
Posted by Robert Muir <rc...@gmail.com>.
On Tue, Apr 26, 2011 at 12:24 AM, Otis Gospodnetic
<ot...@yahoo.com> wrote:
> But somehow this feels bad (well, so does sticking word variations in what's
> supposed to be a synonyms file), partly because it means that the person adding
> new synonyms would need to know what they stem to (or always check it against
> Solr before editing the file).
When creating the synonym map from your input file, currently the
factory actually uses only your Tokenizer to pre-process the synonyms
file.
One idea would be to use the tokenstream up to the synonymfilter
itself (including filters). This way if you put a stemmer before the
synonymfilter, it would stem your synonyms file, too.
I haven't totally thought the whole thing through to see if there's a
big reason why this wouldn't work (the synonymfilter is complicated,
sorry). But it does seem like it would produce more consistent
results... and perhaps the inconsistency isn't so obvious since in the
default configuration the synonymfilter is directly after the
tokenizer.
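A rough sketch of the idea in plain Python, not the actual factory code. The names toy_stem, build_synonym_map and expand are invented for this example, and toy_stem is a deliberately crude plural stripper rather than any real stemmer. The point is only that the synonym entries and the incoming tokens should pass through the same analysis.

```python
def toy_stem(token):
    """Crude plural stripper, for illustration only (not Porter)."""
    if len(token) > 4 and token.endswith("ies"):
        return token[:-3] + "y"
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def build_synonym_map(lines, analyze):
    """Parse comma-separated synonym lines, running each term through analyze."""
    table = {}
    for line in lines:
        group = [analyze(term.strip()) for term in line.split(",") if term.strip()]
        for term in group:
            table.setdefault(term, set()).update(group)
    return table


def expand(token, table, analyze):
    """Analyze an incoming token, then look it up in the synonym map."""
    term = analyze(token)
    return sorted(table.get(term, {term}))


def chain(token):
    # Hypothetical upstream analysis: lowercase, then stem.
    return toy_stem(token.lower())


LINES = ["responsibilities, obligations, duties"]

# Current behaviour as described above: entries see the tokenizer only,
# so the map keys stay unstemmed and stemmed tokens never match them.
tokenizer_only = build_synonym_map(LINES, lambda term: term)

# Proposed behaviour: entries go through the same chain as the tokens.
full_chain = build_synonym_map(LINES, chain)

print(expand("Duties", tokenizer_only, chain))
print(expand("Duties", full_chain, chain))
```

With the tokenizer-only map the stemmed token "duty" finds nothing, because the map still holds "duties". With the full-chain map every variant that stems to the same form lands on the same entry.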
Re: Automatic synonyms for multiple variations of a word
Posted by Otis Gospodnetic <ot...@yahoo.com>.
Right, instead of this in the synonyms file:
responsibility, obligation, duty
I could stem each of the above words/synonyms and have something like this in
the synonyms file:
respons, oblig, duti
But somehow this feels bad (well, so does sticking word variations in what's
supposed to be a synonyms file), partly because it means that the person adding
new synonyms would need to know what they stem to (or always check it against
Solr before editing the file).
I've never seen anyone actually use such a synonyms file in production, have
you?
Thanks,
Otis
----- Original Message ----
> From: Lance Norskog <go...@gmail.com>
> To: solr-user@lucene.apache.org
> Sent: Tue, April 26, 2011 12:20:05 AM
> Subject: Re: Automatic synonyms for multiple variations of a word
>
> This has come up with stemming: you can stem your synonym list with
> the FieldAnalyzer Solr http call, then save the final chewed-up terms
> as a new synonym file. You then use that one in the analyzer stack
> below the stemmer filter.
Re: Automatic synonyms for multiple variations of a word
Posted by Lance Norskog <go...@gmail.com>.
This has come up with stemming: you can stem your synonym list with
the FieldAnalyzer Solr http call, then save the final chewed-up terms
as a new synonym file. You then use that one in the analyzer stack
below the stemmer filter.
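A minimal sketch of the offline rewrite step, assuming the stems have already been obtained somehow (for example from Solr's field analysis output). The function stem_synonym_lines and the STEMS lookup are hypothetical. The lookup simply hard-codes the stems quoted elsewhere in this thread, plus an assumed entry for the plural, in place of a real call to Solr.

```python
def stem_synonym_lines(lines, stem):
    """Rewrite comma-separated synonym lines with every word stemmed.

    Comment lines, blank lines and explicit "=>" mappings are passed through
    untouched; handling "=>" rules is left out of this sketch.
    """
    out = []
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped.startswith("#") or "=>" in stripped:
            out.append(line)
            continue
        seen = []
        for term in stripped.split(","):
            # Stem each word of a (possibly multi-word) term separately.
            stemmed = " ".join(stem(word) for word in term.split())
            if stemmed and stemmed not in seen:
                seen.append(stemmed)
        # A group that collapses to a single stem no longer says anything.
        if len(seen) > 1:
            out.append(", ".join(seen))
    return out


# Stand-in for the real analysis chain: a fixed lookup, assumed for the example.
STEMS = {
    "responsibility": "respons",
    "responsibilities": "respons",
    "obligation": "oblig",
    "duty": "duti",
}


def stand_in_stem(word):
    return STEMS.get(word.lower(), word.lower())


original = [
    "# workplace terms",
    "responsibility, responsibilities, obligation, duty",
]
for line in stem_synonym_lines(original, stand_in_stem):
    print(line)
```

The rewritten file would then be the one referenced by the synonym filter that sits after the stemmer, and it has to be regenerated whenever the original list or the stemmer changes.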
On Mon, Apr 25, 2011 at 9:15 PM, Otis Gospodnetic
<ot...@yahoo.com> wrote:
> Hi Otis & Robert,
>
> ----- Original Message ----
>
>>
>> How do people handle cases where synonyms are used and there are multiple
>> versions of the original word that really need to point to the same set of
>> synonyms?
>>
>> For example:
>> Consider singular and plural of the word "responsibility". One might have
>> synonyms defined like this:
>>
>> responsibility, obligation, duty
>>
>> But the plural "responsibilities" is not in there, and thus it will not get
>> expanded to the synonyms above! That's a problem.
>>
>> Sure, one could change the synonyms file to look like this:
>>
>> responsibility, responsibilities, obligation, duty
>>
>> But that means somebody needs to think of all variations of the word!
>
> Yes, that seems to be the case now, as it was in 2008:
> http://search-lucene.com/m/gLwUCV0qU02&subj=Re+Synonyms+and+stemming+revisited
> http://search-lucene.com/m/7lqdp1ldrqx (Hoss replied, but I think that
> suggestion doesn't actually work)
>
>> Is there something one can do to get all variations of the word to map to
>> the same synonyms without having to explicitly specify all variations of the
>> word?
>
> I think this is where Robert's 2+2lemma pointer may help, because the 2+2lemma
> list contains "records" where a headword is followed by a list of other
> variations of the word. The way I think this would help is by simply taking
> that list and turning it into the synonyms file format, and then merging in the
> actual synonyms.
>
> For example, if I have the word "responsibility", then from 2+2lemma I should be
> able to get that "responsibilities" is one of the variants of "responsibility".
> I should then be able to take those 2 words and stick them in the synonyms file like
> this:
>
> responsibility, responsibilities
>
> And then append actual synonyms to that:
>
> responsibility, responsibilities, obligation, duty
>
> But I may then need to actually expand synonyms themselves, too (again using
> data from 2+2lemma):
>
> responsibility, responsibilities, obligation, obligations, duty, duties
>
>
> I haven't tried this yet. Just theorizing and hoping for feedback.
>
> Does this sound about right?
>
> Thanks,
> Otis
> ----
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> Lucene ecosystem search :: http://search-lucene.com/
>
>
--
Lance Norskog
goksron@gmail.com
Re: Automatic synonyms for multiple variations of a word
Posted by Otis Gospodnetic <ot...@yahoo.com>.
Hi Otis & Robert,
----- Original Message ----
>
> How do people handle cases where synonyms are used and there are multiple
> versions of the original word that really need to point to the same set of
> synonyms?
>
> For example:
> Consider singular and plural of the word "responsibility". One might have
> synonyms defined like this:
>
> responsibility, obligation, duty
>
> But the plural "responsibilities" is not in there, and thus it will not get
> expanded to the synonyms above! That's a problem.
>
> Sure, one could change the synonyms file to look like this:
>
> responsibility, responsibilities, obligation, duty
>
> But that means somebody needs to think of all variations of the word!
Yes, that seems to be the case now, as it was in 2008:
http://search-lucene.com/m/gLwUCV0qU02&subj=Re+Synonyms+and+stemming+revisited
http://search-lucene.com/m/7lqdp1ldrqx (Hoss replied, but I think that
suggestion doesn't actually work)
> Is there something one can do to get all variations of the word to map to
> the same synonyms without having to explicitly specify all variations of the
> word?
I think this is where Robert's 2+2lemma pointer may help, because the 2+2lemma
list contains "records" where a headword is followed by a list of other
variations of the word. The way I think this would help is by simply taking
that list and turning it into the synonyms file format, and then merging in the
actual synonyms.
For example, if I have the word "responsibility", then from 2+2lemma I should be
able to get that "responsibilities" is one of the variants of "responsibility".
I should then be able to take those 2 words and stick them in the synonyms file like
this:
responsibility, responsibilities
And then append actual synonyms to that:
responsibility, responsibilities, obligation, duty
But I may then need to actually expand synonyms themselves, too (again using
data from 2+2lemma):
responsibility, responsibilities, obligation, obligations, duty, duties
I haven't tried this yet. Just theorizing and hoping for feedback.
Does this sound about right?
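The merge step above could be sketched roughly like this in Python. Parsing the actual 2+2lemma file is not shown; the VARIANTS dictionary (headword to list of variants) is a hand-written stand-in for whatever that parsing would produce, and expand_with_variants is a made-up name.

```python
def expand_with_variants(lines, variants):
    """Insert each term's known variants right after it, keeping order.

    lines are comma-separated synonym groups; variants maps a headword to a
    list of its other forms. Terms without an entry are left as they are.
    """
    out = []
    for line in lines:
        terms = [term.strip() for term in line.split(",") if term.strip()]
        expanded = []
        for term in terms:
            for form in [term] + variants.get(term, []):
                if form not in expanded:
                    expanded.append(form)
        out.append(", ".join(expanded))
    return out


# Hypothetical headword -> variants data, standing in for parsed 2+2lemma records.
VARIANTS = {
    "responsibility": ["responsibilities"],
    "obligation": ["obligations"],
    "duty": ["duties"],
}

for line in expand_with_variants(["responsibility, obligation, duty"], VARIANTS):
    print(line)
```

For the example group this yields the fully expanded line shown above, with the synonyms' own variants included.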
Thanks,
Otis