Posted to dev@lucene.apache.org by "Michael McCandless (JIRA)" <ji...@apache.org> on 2015/04/16 23:53:59 UTC

[jira] [Resolved] (LUCENE-6400) SynonymParser should encode 'expand' correctly.

     [ https://issues.apache.org/jira/browse/LUCENE-6400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless resolved LUCENE-6400.
----------------------------------------
       Resolution: Fixed
    Fix Version/s: 5.2
                   Trunk

Thanks Ian!

> SynonymParser should encode 'expand' correctly.
> -----------------------------------------------
>
>                 Key: LUCENE-6400
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6400
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Robert Muir
>             Fix For: Trunk, 5.2
>
>         Attachments: LUCENE-6400.patch, LUCENE-6400.patch, LUCENE-6400.patch, LUCENE-6400.patch, PositionLenghtAndType-unittests.patch, unittests-expand-and-parse.patch
>
>
> Today SolrSynonymParser encodes something like A, B, C with 'expand=true' like this:
> A -> A, B, C (includeOrig=false)
> B -> B, A, C (includeOrig=false)
> C -> C, A, B (includeOrig=false)
> This gives kinda buggy output (synfilter sees it all as replacements and marks all the terms with type SYNONYM, positionLength isn't supported, etc.) and it wastes space in the FST (includeOrig is just one bit).
> Example with "spiderman, spider man" and analysis on 'spider man'
> Trunk:
> term=spider,startOffset=0,endOffset=6,positionIncrement=1,positionLength=1,*type=SYNONYM*
> term=spiderman,startOffset=0,endOffset=10,positionIncrement=0,*positionLength=1*,type=SYNONYM
> term=man,startOffset=7,endOffset=10,positionIncrement=1,positionLength=1,*type=SYNONYM*
> You can see this is confusing: all the words have type SYNONYM, because spider and man got deleted and totally replaced by new terms (which happen to have the same text).
> Patch:
> term=spider,startOffset=0,endOffset=6,positionIncrement=1,positionLength=1,*type=word*
> term=spiderman,startOffset=0,endOffset=10,positionIncrement=0,*positionLength=2*,type=SYNONYM
> term=man,startOffset=7,endOffset=10,positionIncrement=1,positionLength=1,*type=word*
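
The difference between the two encodings can be sketched with a small toy model. This is not Lucene code: Rule, expand_old, expand_new and apply are hypothetical names made up for illustration, and the "patched" encoding (each term maps only to the other terms, with includeOrig=true) is an assumption inferred from the description above, not taken from the attached patch. Only single-word terms are handled, so positionLength is not modelled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """Toy stand-in for one synonym rule (hypothetical, not a Lucene class)."""
    input: str
    outputs: tuple
    include_orig: bool


def expand_old(terms):
    # Encoding described in the issue: every term maps to itself plus
    # all the other terms, with includeOrig=false.
    return [Rule(t, (t,) + tuple(o for o in terms if o != t), False)
            for t in terms]


def expand_new(terms):
    # Assumed patched encoding: every term maps only to the other terms,
    # and the single includeOrig bit keeps the original token.
    return [Rule(t, tuple(o for o in terms if o != t), True)
            for t in terms]


def apply(rules, term):
    # Return (text, type) pairs emitted for one single-word input token.
    for rule in rules:
        if rule.input == term:
            kept = [(term, "word")] if rule.include_orig else []
            return kept + [(o, "SYNONYM") for o in rule.outputs]
    return [(term, "word")]  # no rule matched: token passes through


def stored_outputs(rules):
    # Rough proxy for space used: total number of output terms stored.
    return sum(len(rule.outputs) for rule in rules)


if __name__ == "__main__":
    terms = ["A", "B", "C"]
    for name, rules in (("old", expand_old(terms)), ("new", expand_new(terms))):
        print(name, apply(rules, "B"), "stored outputs:", stored_outputs(rules))
```

In this model the old encoding emits the original term as a SYNONYM token because it was deleted and re-added as an output, while the assumed new encoding keeps the original token with its own type and stores one fewer output per rule.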


