Posted to issues@lucene.apache.org by "Christoph Büscher (Jira)" <ji...@apache.org> on 2019/10/28 19:30:00 UTC

[jira] [Created] (LUCENE-9030) Solr- and WordnetSynonymParser behaviour differs

Christoph Büscher created LUCENE-9030:
-----------------------------------------

             Summary: Solr- and WordnetSynonymParser behaviour differs
                 Key: LUCENE-9030
                 URL: https://issues.apache.org/jira/browse/LUCENE-9030
             Project: Lucene - Core
          Issue Type: Bug
          Components: modules/analysis
    Affects Versions: 8.2
            Reporter: Christoph Büscher


Equivalent synonyms are showing up with different token types and ordering depending on whether the Solr format or the Wordnet format is used. A synonym set like

"woods, wood, forest" in Solr format leads to the following token stream (term and type) when analyzing the term "forest":  

"forest"/word, "woods"/SYNONYM, "wood"/SYNONYM

 

The following set in Wordnet format should give the same output, since all terms are in the same synset. However, all tokens are of type SYNONYM here, and the original input token "forest" is not the first one:

synonyms.txt:
{code:java}
s(100000001,1,'woods',n,1,0)
s(100000001,2,'wood',n,1,0)
s(100000001,3,'forest',n,1,0){code}
Token stream (term/type) when analyzing the term "forest":

"woods"/SYNONYM, "wood"/SYNONYM, "forest"/SYNONYM

I don't think this is intentional, and it is confusing (especially because the "original" input token type gets lost). The way the synsets are added to the SynonymMap differs between the respective parsers, and I have a PR that changes this.
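
To illustrate how a difference in the way synsets are added could produce the two token streams above, here is a toy model in plain Java. It is not Lucene code and uses no Lucene classes. The class name, method names, and the keepOriginal flag are hypothetical. The two strategies are an assumption for illustration, not confirmed by this report: the Solr-style strategy pairs each term with every other term and keeps the original token, while the Wordnet-style strategy pairs each term with every term in the synset (itself included) and does not keep the original.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model only: NOT Lucene code, and it does not use SynonymMap. */
class SynsetExpansionSketch {

    /**
     * Builds input -> outputs rules for one synset.
     * keepOriginal = true:  each term maps to every OTHER term (assumed Solr-style).
     * keepOriginal = false: each term maps to every term, itself included (assumed Wordnet-style).
     */
    static Map<String, List<String>> buildRules(List<String> synset, boolean keepOriginal) {
        Map<String, List<String>> rules = new LinkedHashMap<>();
        for (String input : synset) {
            List<String> outputs = new ArrayList<>();
            for (String output : synset) {
                if (keepOriginal && output.equals(input)) {
                    continue; // the original is emitted separately with its own type
                }
                outputs.add(output);
            }
            rules.put(input, outputs);
        }
        return rules;
    }

    /** Emits "term/type" strings for a single input term. */
    static List<String> analyze(String input, Map<String, List<String>> rules, boolean keepOriginal) {
        List<String> tokens = new ArrayList<>();
        List<String> outputs = rules.get(input);
        if (outputs == null) {
            tokens.add(input + "/word"); // no rule matched: the token passes through unchanged
            return tokens;
        }
        if (keepOriginal) {
            tokens.add(input + "/word"); // original token first, with its original type
        }
        for (String output : outputs) {
            tokens.add(output + "/SYNONYM");
        }
        return tokens;
    }

    public static void main(String[] args) {
        List<String> synset = Arrays.asList("woods", "wood", "forest");
        System.out.println(analyze("forest", buildRules(synset, true), true));
        System.out.println(analyze("forest", buildRules(synset, false), false));
    }
}
```

Under these assumptions, the first call reproduces the stream reported for the Solr format (original token first, typed word) and the second reproduces the stream reported for the Wordnet format (all tokens typed SYNONYM, in synset order).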


