
Posted to issues@lucene.apache.org by "Alan Woodward (Jira)" <ji...@apache.org> on 2019/11/13 15:55:00 UTC

[jira] [Resolved] (LUCENE-9030) Solr- and WordnetSynonymParser behaviour differs

     [ https://issues.apache.org/jira/browse/LUCENE-9030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alan Woodward resolved LUCENE-9030.
-----------------------------------
    Fix Version/s: 8.4
                   master (9.0)
       Resolution: Fixed

> Solr- and WordnetSynonymParser behaviour differs
> ------------------------------------------------
>
>                 Key: LUCENE-9030
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9030
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/analysis
>    Affects Versions: 8.2
>            Reporter: Christoph Büscher
>            Assignee: Alan Woodward
>            Priority: Minor
>             Fix For: master (9.0), 8.4
>
>          Time Spent: 40m
>  Remaining Estimate: 0h
>
> Equivalent synonyms show up with different token types and ordering depending on whether the Solr format or the Wordnet format is used. A synonym set like
> "woods, wood, forest" in Solr format leads to the following token stream (term/type) when analyzing the term "forest":
> "forest"/word, "woods"/SYNONYM, "wood"/SYNONYM
>
> The following set in Wordnet format should give the same output (all terms are in the same synset). However, all tokens are of type SYNONYM here, and the original input token "forest" isn't the first one:
> synonyms.txt:
> {code:java}
> s(100000001,1,'woods',n,1,0)
> s(100000001,2,'wood',n,1,0)
> s(100000001,3,'forest',n,1,0){code}
> Token stream (term/type) when analyzing the term "forest":
> "woods"/SYNONYM, "wood"/SYNONYM, "forest"/SYNONYM
> I don't think this is intentional, and it is confusing (especially because the "original" input token type gets lost). I saw that the way the synsets are added to the SynonymMap in the respective parsers differs, and I have a PR that changes this.
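
To make the expected result concrete, here is a minimal, dependency-free sketch that groups the Wordnet prolog lines above by synset ID and emits the token stream the report expects: the input term first with type "word", followed by the remaining synset members with type "SYNONYM". The class `SynsetSketch` and its methods `parseSynsets` and `expand` are hypothetical names made up for this illustration, and the regular expression is an assumption about the line layout. This is not Lucene's WordnetSynonymParser or SynonymMap code, and it does not reproduce the actual fix.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SynsetSketch {
    // Assumed layout of a Wordnet prolog line: s(synsetId,wordNum,'term',...)
    // A doubled single quote inside the term is treated as an escaped quote.
    private static final Pattern LINE =
        Pattern.compile("^s\\((\\d+),\\d+,'((?:[^']|'')*)',");

    // Groups terms by synset ID, preserving the order of the file.
    public static Map<String, List<String>> parseSynsets(List<String> lines) {
        Map<String, List<String>> synsets = new LinkedHashMap<>();
        for (String line : lines) {
            Matcher m = LINE.matcher(line.trim());
            if (!m.find()) {
                continue; // skip lines that do not match the assumed layout
            }
            String term = m.group(2).replace("''", "'");
            synsets.computeIfAbsent(m.group(1), k -> new ArrayList<>()).add(term);
        }
        return synsets;
    }

    // Models the output the report expects: the original token comes first
    // with type "word", the other synset members follow with type "SYNONYM".
    public static List<String> expand(String input, List<String> synset) {
        List<String> out = new ArrayList<>();
        out.add(input + "/word");
        for (String term : synset) {
            if (!term.equals(input)) {
                out.add(term + "/SYNONYM");
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
            "s(100000001,1,'woods',n,1,0)",
            "s(100000001,2,'wood',n,1,0)",
            "s(100000001,3,'forest',n,1,0)");
        Map<String, List<String>> synsets = parseSynsets(lines);
        // prints [forest/word, woods/SYNONYM, wood/SYNONYM]
        System.out.println(expand("forest", synsets.get("100000001")));
    }
}
```

With the three lines from synonyms.txt, the sketch yields the same term/type sequence as the Solr-format example, which is the parity the report asks for.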



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
