Posted to solr-user@lucene.apache.org by Mark Bennett <mb...@ideaeng.com> on 2009/08/24 19:47:45 UTC

Clarifications to Synonym Filter Wiki entry? (1 of 2)

There are a couple of things about the Solr Thesaurus doc that I'd like to
confirm / understand.
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#SynonymFilter

I believe the following section is a bit misleading; I'm sure it's correct
for the case it describes, but there's another case I've tested, which on
the surface seemed similar, but where the actual results were different and
in hindsight not really a conflict, just a surprise.

At the bottom of the gray Synonym file format box it shows the example:
    #multiple synonym mapping entries are merged.
    foo => foo bar
    foo => baz
    #is equivalent to
    foo => foo bar, baz

Whereas I was using non-explicit / reflexive mappings with overlapping
terms, for example:
    A, B, C, D
    A, E, I, O, U
(assume these are real, non-single-letter words; the word "a" is often
removed as a stop word, of course)

Assuming expand="true", and reading the wiki, I would have thought the
groups would be merged, to be effectively:
    A, B, C, D, E, I, O, U

This is NOT the case, which is actually good in my opinion.

At index time, if an A is seen, it WILL be expanded to also include B, C, D
and E, I, O, U.  This is true even if A is not listed first.

However, if the indexer encounters B, it will ONLY be expanded with A, C and
D.  Similarly, E will be augmented with A, I, O and U.

I tested this by actually looking at the word index with Luke.
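
To make the behavior described above concrete, here is a minimal Python
sketch. It is only a model of how I understand the file to be interpreted,
not Solr's actual code: the function name build_mappings is made up, and it
ignores token positions and multi-word matching entirely. Each comma-only
line is first turned into per-term mappings (depending on expand), and only
then are mappings with the same left-hand term merged.

```python
def build_mappings(lines, expand=True):
    """Model: turn synonym-file lines into a merged {term: [replacements]} map.

    Hypothetical helper for illustration; not part of Solr.
    """
    mappings = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if "=>" in line:
            # Explicit mapping: left-hand terms map to right-hand terms.
            left, right = line.split("=>", 1)
            sources = [t.strip() for t in left.split(",")]
            targets = [t.strip() for t in right.split(",")]
        else:
            # Comma-only line: meaning depends on the expand setting.
            group = [t.strip() for t in line.split(",")]
            sources = group
            targets = group if expand else group[:1]
        # Merge step: mappings that share a source term are combined.
        for src in sources:
            merged = mappings.setdefault(src, [])
            for tgt in targets:
                if tgt not in merged:
                    merged.append(tgt)
    return mappings


mappings = build_mappings(["A, B, C, D", "A, E, I, O, U"], expand=True)
print(mappings["A"])  # ['A', 'B', 'C', 'D', 'E', 'I', 'O', 'U']
print(mappings["B"])  # ['A', 'B', 'C', 'D']
print(mappings["E"])  # ['A', 'E', 'I', 'O', 'U']
```

Only A appears in both groups, so only A picks up terms from both lines; B
and E each stay within their own group, which matches what I saw in Luke.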

If you DID want the merged behavior, where D would expand to match all 8
letters you can either:
1: Put the synonym filter in the pipeline twice, along with the remove
duplicates filter
OR
2: Use the synonym filter at both index and query time
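
Option 1 can be sketched in Python as follows. This is a rough model under
my own assumptions, not Solr code: expand_once is a made-up name, token
positions are ignored, and the "not in out" check stands in for the remove
duplicates filter.

```python
# The two overlapping groups from the example above.
groups = [["A", "B", "C", "D"], ["A", "E", "I", "O", "U"]]

# Per-term mappings as they would look with expand="true" (modeled, not Solr).
MAPPINGS = {}
for group in groups:
    for term in group:
        targets = MAPPINGS.setdefault(term, [])
        for other in group:
            if other not in targets:
                targets.append(other)


def expand_once(tokens, mappings):
    """One pass of a (modeled) synonym filter followed by de-duplication."""
    out = []
    for token in tokens:
        for synonym in mappings.get(token, [token]):
            if synonym not in out:  # plays the role of the remove duplicates filter
                out.append(synonym)
    return out


one_pass = expand_once(["D"], MAPPINGS)
two_pass = expand_once(one_pass, MAPPINGS)
print(one_pass)  # ['A', 'B', 'C', 'D']
print(two_pass)  # ['A', 'B', 'C', 'D', 'E', 'I', 'O', 'U']
```

The first pass brings in A, and the second pass expands that A into the
other group, which is why two passes are enough for this particular example.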

Does anybody disagree with this?

And what should be added to the Wiki doc?

--
Mark Bennett / New Idea Engineering, Inc. / mbennett@ideaeng.com
Direct: 408-733-0387 / Main: 866-IDEA-ENG / Cell: 408-829-6513

Re: Clarifications to Synonym Filter Wiki entry? (1 of 2)

Posted by Chris Hostetter <ho...@fucit.org>.

: I believe the following section is a bit misleading; I'm sure it's correct
: for the case it describes, but there's another case I've tested, which on
: the surface seemed similar, but where the actual results were different and
: in hindsight not really a conflict, just a surprise.

the crux of the issue is that *lines* in the file with only commas (no =>)
are ambiguous, and only have meaning once the "expand" property is evaluated.
once that's done then you have a list of *mappings* ... and it's the 
mappings that get merged.

: I tested this by actually looking at the word index with Luke.

FYI: an easy way to test it would probably be the analysis.jsp page

: If you DID want the merged behavior, where D would expand to match all 8
: letters you can either:
: 1: Put the synonym filter in the pipeline twice, along with the remove
: duplicates filter
: OR
: 2: Use the synonym filter at both index and query time

using the filter at query time with expand=true would wreak havoc with
phrase queries ... your best bet is to be more explicit when expressing 
the mappings in the file.
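
one possible way to get explicit mappings, sketched in Python: merge any
comma groups that share a term, then write out one "=>" line per term. this
is just an illustration of the idea, not a Solr tool; merge_overlapping and
explicit_lines are hypothetical names, and whether you actually want
overlapping groups merged is a decision for your own data.

```python
def merge_overlapping(groups):
    """Combine groups that share at least one term (hypothetical helper)."""
    merged = []
    for group in groups:
        current = list(group)
        untouched = []
        for existing in merged:
            if set(existing) & set(current):
                # Overlap: fold the new terms into the existing group.
                current = existing + [t for t in current if t not in existing]
            else:
                untouched.append(existing)
        merged = untouched + [current]
    return merged


def explicit_lines(groups):
    """Emit one explicit 'term => ...' line per term of each merged group."""
    lines = []
    for group in merge_overlapping(groups):
        for term in group:
            lines.append(f"{term} => {', '.join(group)}")
    return lines


for line in explicit_lines([["A", "B", "C", "D"], ["A", "E", "I", "O", "U"]]):
    print(line)
# first line printed: A => A, B, C, D, E, I, O, U
```

with explicit lines like these, every term (including D) maps to the full
merged set without relying on how comma-only lines get expanded.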

: And what should be added to the Wiki doc?

Add whatever you think would help ... users discovering behavior for the
first time are the best people to write documentation, because the devs
who know the code really well don't appreciate what isn't obvious.



-Hoss