Posted to solr-user@lucene.apache.org by balaji <mc...@gmail.com> on 2011/09/06 17:22:57 UTC

Synonyms Not Working when using SRC & DEST

Hi all

   This question might sound stupid. I have a large synonym file and have
created synonyms like the one below:

allergy test  =>  Doctors, Doctors-Medical, PHYSICIANS, Physicians & Surgeons

    I have also added the synonym filter so that the synonyms are applied at
index time, as below:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<charFilter class="solr.MappingCharFilterFactory"
mapping="mapping-ISOLatin1Accent.txt"/>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
ignoreCase="true" expand="true"/>

<filter class="solr.StopFilterFactory" ignoreCase="true"
words="stopwords.txt" enablePositionIncrements="true"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
generateNumberParts="1" catenateWords="1" catenateNumbers="1"
catenateAll="0" splitOnCaseChange="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.SnowballPorterFilterFactory" language="English"
protected="protwords.txt"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
<analyzer type="query">
<charFilter class="solr.MappingCharFilterFactory"
mapping="mapping-ISOLatin1Accent.txt"/>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>

<filter class="solr.StopFilterFactory" ignoreCase="true"
words="stopwords.txt" enablePositionIncrements="true"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
generateNumberParts="1" catenateWords="0" catenateNumbers="0"
catenateAll="0" splitOnCaseChange="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.SnowballPorterFilterFactory" language="English"
protected="protwords.txt"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
</fieldType>


But when I search for allergy, I get 0 results.

When I go to solr/admin/analysis.jsp, I can see the corresponding
synonyms show up:

http://lucene.472066.n3.nabble.com/file/n3313862/Screenshot-1.png 

So I can't find where the problem is.

When I change the synonym file to comma-separated entries, I am able to see
results.


It would be great if you can help me out with this


Thanks
Balaji

--
View this message in context: http://lucene.472066.n3.nabble.com/Synonyms-Not-Working-when-using-SRC-DEST-tp3313862p3313862.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Synonyms Not Working when using SRC & DEST

Posted by balaji <mc...@gmail.com>.
> So, if instead of:
>
> allergy test  =>  Doctors, Doctors-Medical, PHYSICIANS, Physicians &
> Surgeons
>
> You specified
>
>
> allergy test => allergy test, Doctors, Doctors-Medical, PHYSICIANS,
> Physicians & Surgeons
>
>
   I followed the above approach ("allergy test => allergy test, Doctors,
Doctors-Medical, PHYSICIANS, Physicians & Surgeons") and it works as
expected. Thanks for making it clearer.

Thanks
Balaji


--
View this message in context: http://lucene.472066.n3.nabble.com/Synonyms-Not-Working-when-using-SRC-DEST-tp3313862p3316691.html
Sent from the Solr - User mailing list archive at Nabble.com.

RE: Synonyms Not Working when using SRC & DEST

Posted by "Jaeger, Jay - DOT" <Ja...@dot.wi.gov>.
Also, just to make one thing a bit clearer: you can specify two different kinds of entries in synonym files.  See http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters (solr.SynonymFilterFactory).


One is replacement, where the words before the "=>" are *replaced* by the right hand side, i.e., the words on the left hand side "disappear".  This is what you are currently doing according to your original message:

#Explicit mappings match any token sequence on the LHS of "=>"
#and replace with all alternatives on the RHS.  These types of mappings
#ignore the expand parameter in the schema.
#Examples:
i-pod, i pod => ipod,
sea biscuit, sea biscit => seabiscuit



The other is equivalence, where each term is expanded into the entire list.  That is what you get from entries like the following, with expand set to true:

#Equivalent synonyms may be separated with commas and give
#no explicit mapping.  In this case the mapping behavior will
#be taken from the expand parameter in the schema.  This allows
#the same synonym file to be used in different synonym handling strategies.
#Examples:
ipod, i-pod, i pod
foozball , foosball
universe , cosmos



So, if instead of:

allergy test  =>  Doctors, Doctors-Medical, PHYSICIANS, Physicians & Surgeons

You specified


allergy test => allergy test, Doctors, Doctors-Medical, PHYSICIANS, Physicians & Surgeons 

Or 

allergy test, Doctors, Doctors-Medical, PHYSICIANS, Physicians & Surgeons

with expand set to true, then you might get the behavior you desire:  "Allergy test" would get indexed, along with "Doctors" and all of the rest.  The difference is that in the second case, any of those terms (e.g. "Doctors") would also get indexed as "Allergy test", which might not be what you desire; in that case the first form would do what you want.

I expect that all you really need to do is:

allergy test => allergy test, Doctors, Doctors-Medical, PHYSICIANS, Physicians & Surgeons

to solve your problem.

JRJ

-----Original Message-----
From: balaji [mailto:mcabalaji@gmail.com] 
Sent: Tuesday, September 06, 2011 7:48 PM
To: solr-user@lucene.apache.org
Subject: Re: Synonyms Not Working when using SRC & DEST

> It won't work given your current schema.  To get the desired results, you
> would need to expand your synonyms at both index AND query time.  Right now
> your schema seems to specify it only at index time.
>

I have a very huge schema spanning up to 10K lines , if I use query time it
will be huge hit for me because one term will be mapped to multiple terms .
similar in the case of allergy

I doesn't want to go with comma separated as it will give
some erroneous results  and more over allergy and doctors are not equivalent
terms to be used in comma


>
> So, as the other respondent indicated, currently you replace allergy with
> the other list when indexing, and since allergy is not replaced during
> query, it gets no hits.
>

I replace allergy during the index with doctors , So it shouldn't be part of
the document ?


Thanks
Balaji


--
View this message in context: http://lucene.472066.n3.nabble.com/Synonyms-Not-Working-when-using-SRC-DEST-tp3313862p3315287.html
Sent from the Solr - User mailing list archive at Nabble.com.

RE: Synonyms Not Working when using SRC & DEST

Posted by "Jaeger, Jay - DOT" <Ja...@dot.wi.gov>.
> I have a very huge schema spanning up to 10K lines , if I use query time it
> will be huge hit for me because one term will be mapped to multiple terms .
> similar in the case of allergy

I think maybe you mean the synonym file, rather than the schema?  I doubt that the number of lines matters all that much, though undoubtedly it matters somewhat.  I expect that Solr loads that synonym file into some kind of hash map, rather than searching it linearly -- though I have not looked at the code for that.

> I replace allergy during the index with doctors , So it shouldn't be part of
> the document ?

Yes indeed, doctors would be in the index, and would give you a hit on that document when searched.  But because your synonym file specifies replacement, that means that allergy is *NOT* part of the index, hence, when you searched on allergy, you got no results.

As far as synonym expansion being a "huge hit", no, not really, I think.  Besides, if you are not getting what you want or need, speed becomes pretty much irrelevant.  We did some performance testing:  modest single server (i.e., a laptop running Windows XP with only 2GB total memory available), pretty much configured "out of the box" with jetty, except that we added waffle authentication.  The data was names, addresses and the like (not text) -- 7+ million rows, with considerable synonym expansion:  200 first name synonyms, 433 last name synonyms, expanded at both index time and search time.

We then did a search test driven from those same synonym files, by randomly picking a name from the first and last name lists, the idea being that the names searched would most likely have some synonyms.

Under Solr 3.1, once the OS file system cache got some entries in there, running with 8 concurrent client search threads sending HTTP search requests (done in Perl), we averaged about 0.50 seconds per request, or over 55,000 searches per hour across the 8 threads combined.

JRJ

Re: Synonyms Not Working when using SRC & DEST

Posted by balaji <mc...@gmail.com>.
> It won't work given your current schema.  To get the desired results, you
> would need to expand your synonyms at both index AND query time.  Right now
> your schema seems to specify it only at index time.
>

I have a very huge schema spanning up to 10K lines. If I use query-time
expansion it will be a huge hit for me, because one term will be mapped to
multiple terms, as in the case of allergy.

I don't want to go with comma-separated entries, as that will give
some erroneous results; moreover, allergy and doctors are not equivalent
terms to be listed together with commas.


>
> So, as the other respondent indicated, currently you replace allergy with
> the other list when indexing, and since allergy is not replaced during
> query, it gets no hits.
>

I replace allergy during the index with doctors , So it shouldn't be part of
the document ?


Thanks
Balaji


--
View this message in context: http://lucene.472066.n3.nabble.com/Synonyms-Not-Working-when-using-SRC-DEST-tp3313862p3315287.html
Sent from the Solr - User mailing list archive at Nabble.com.

RE: Synonyms Not Working when using SRC & DEST

Posted by "Jaeger, Jay - DOT" <Ja...@dot.wi.gov>.
It won't work given your current schema.  To get the desired results, you would need to expand your synonyms at both index AND query time.  Right now your schema seems to specify it only at index time.

So, as the other respondent indicated, currently you replace allergy with the other list when indexing, and since allergy is not replaced during query, it gets no hits.

It almost sounds like a case where you could consider synonym expansion only at query time, rather than at index time (though that is usually not advisable for reasons discussed on the Wiki).  Then Allergy would get expanded during a search, and hit the documents with Doctors, etc.
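
As a rough sketch of that query-time option (not a tested configuration), the query analyzer from the original post could carry the same synonym filter. Placing the filter directly after the tokenizer, as the index analyzer does, and reusing synonyms.txt with the same attributes are assumptions; everything else is copied from the posted schema.

```xml
<!-- Sketch only: the query-side analyzer from the original post, -->
<!-- with a synonym filter added (assumed placement: right after the tokenizer). -->
<analyzer type="query">
  <charFilter class="solr.MappingCharFilterFactory"
              mapping="mapping-ISOLatin1Accent.txt"/>
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <!-- Added line: same file and attributes as the index-time filter -->
  <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
          ignoreCase="true" expand="true"/>
  <filter class="solr.StopFilterFactory" ignoreCase="true"
          words="stopwords.txt" enablePositionIncrements="true"/>
  <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
          generateNumberParts="1" catenateWords="0" catenateNumbers="0"
          catenateAll="0" splitOnCaseChange="1"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.SnowballPorterFilterFactory" language="English"
          protected="protwords.txt"/>
  <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
```

One caveat: a multi-word entry such as "allergy test" may not match at query time unless it is searched as a phrase, because the query parser typically splits on whitespace before the analyzer sees the text. The Wiki page covers this, so it is worth checking in analysis.jsp before relying on it.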

JRJ

-----Original Message-----
From: balaji [mailto:mcabalaji@gmail.com] 
Sent: Tuesday, September 06, 2011 12:24 PM
To: solr-user@lucene.apache.org
Subject: Re: Synonyms Not Working when using SRC & DEST

Hi Chris

    The Terms Doctors , Doctors-Medical are all present in my Document body,
title fields etc..  but Allergy Test is not . So what I am doing in synonym
file is if a user searches for allergy test bring me results that match
Doctors etc.. i.e 
Explicit mappings match any token sequence on the LHS of "=>"  and replace
with all alternatives on the RHS.

    So when I do a search "allergy test" it should map with doctors and
should bring me results but it is not mapping . Is there any way I make it
work

    Hope it clarifies 


Thanks
Balaji

    

--
View this message in context: http://lucene.472066.n3.nabble.com/Synonyms-Not-Working-when-using-SRC-DEST-tp3313862p3314222.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Synonyms Not Working when using SRC & DEST

Posted by balaji <mc...@gmail.com>.
Hi Chris

    The terms Doctors, Doctors-Medical, etc. are all present in my document
body, title fields, and so on, but Allergy Test is not. So what I am trying to
do in the synonym file is: if a user searches for allergy test, bring back
results that match Doctors etc., i.e.
Explicit mappings match any token sequence on the LHS of "=>" and replace
with all alternatives on the RHS.

    So when I search for "allergy test" it should map to doctors and
bring back results, but it is not mapping. Is there any way I can make it
work?

    Hope it clarifies 


Thanks
Balaji

    

--
View this message in context: http://lucene.472066.n3.nabble.com/Synonyms-Not-Working-when-using-SRC-DEST-tp3313862p3314222.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Synonyms Not Working when using SRC & DEST

Posted by Chris Hostetter <ho...@fucit.org>.
: *allergy test  =>  Doctors, Doctors-Medical, PHYSICIANS, Physicians &
: Surgeons
	..
: <analyzer type="index">
	...
: <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
: ignoreCase="true" expand="true"/>
	...
: But when I do a search for allergy , I get 0 results

You've configured your field so that any time the terms "allergy" 
and "test" appear in sequence in a field value you index, those terms are 
removed and replaced by new terms ("Doctors", "Doctors-Medical", etc...)

So if the term "allergy" only appears in the source text followed by the 
term "test" then it will never actually be indexed in your document, so a 
search for it will never match.

You can see this exact behavior in the screen shot you posted of the 
analysis tool...

: http://lucene.472066.n3.nabble.com/file/n3313862/Screenshot-1.png 

...after the synonym filter, the term "allergy" is not in your indexed 
terms.

: when i change the synonym file to a comma separated I am able to see the
: results

because when using a comma instead of "=>" you are saying "if any of these 
term sequences exist, expand it to *all* of these term sequences."

Please note the docs on SynonymFilter, particularly the examples...

https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.SynonymFilterFactory

-Hoss