Posted to solr-user@lucene.apache.org by da...@correo.aeat.es on 2013/09/23 08:45:27 UTC

Problems with gaps removed with SynonymFilter

Hi, 

I am having a problem applying StopFilterFactory and
SynonymFilterFactory. The problem is that SynonymFilter removes the gaps
that were previously left by StopFilterFactory. I'm applying the filters at
query time, because users need to change the synonym lists frequently.

This is the analysis output for my field type, with an example of the issue:


String: "documentacion para agentes"

org.apache.solr.analysis.WhitespaceTokenizerFactory
{luceneMatchVersion=LUCENE_35}
position      1               2      3
term text     documentación   para   agentes
startOffset   0               14     19
endOffset     13              18     26

org.apache.solr.analysis.LowerCaseFilterFactory
{luceneMatchVersion=LUCENE_35}
position      1               2      3
term text     documentación   para   agentes
startOffset   0               14     19
endOffset     13              18     26

org.apache.solr.analysis.StopFilterFactory {words=stopwords_intranet.txt,
ignoreCase=true, enablePositionIncrements=true,
luceneMatchVersion=LUCENE_35}
position      1               3
term text     documentación   agentes
startOffset   0               19
endOffset     13              26

org.apache.solr.analysis.SynonymFilterFactory
{synonyms=sinonimos_intranet.txt, expand=true, ignoreCase=true,
luceneMatchVersion=LUCENE_35}
position      1               2
term text     documentación   agente
              archivo         agentes
type          SYNONYM         SYNONYM
              SYNONYM         SYNONYM
startOffset   0               19
              0               19
endOffset     13              26
              13              26


As you can see, the positions should be 1 and 3, but SynonymFilter removes
the gap and moves the token from position 3 to position 2.
I get the same problem with Solr 3.5 and 4.0.
I don't know if it's a bug or an error in my configuration. In other
schemas that I have worked with, I always put the SynonymFilter
before the StopFilter, but here I preferred this order because of
the large number of synonyms in the list (i.e., I don't want to generate
a lot of synonyms for a word that I actually want to remove).
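
To make the gap concrete, here is a minimal sketch in plain Python of how the positions shown in the analysis output derive from position increments. It is a toy model, not the Lucene API: the function names are made up for illustration, and ignore_increments is only a hypothetical stand-in for a downstream filter that disregards the increments it receives, which matches the behaviour observed above.

```python
# Toy model in plain Python, not the Lucene API.
# Each token is a (term, position_increment) pair.

def absolute_positions(tokens):
    """Turn (term, posInc) pairs into (term, position) pairs (1-based)."""
    position = 0
    result = []
    for term, pos_inc in tokens:
        position += pos_inc
        result.append((term, position))
    return result


def remove_stopwords(tokens, stopwords):
    """Drop stop words, adding their increments to the next kept token.

    This models what enablePositionIncrements=true is meant to do.
    """
    kept = []
    skipped = 0
    for term, pos_inc in tokens:
        if term in stopwords:
            skipped += pos_inc
        else:
            kept.append((term, pos_inc + skipped))
            skipped = 0
    return kept


def ignore_increments(tokens):
    """Hypothetical filter that treats every incoming increment as 1."""
    return [(term, 1) for term, _ in tokens]


tokens = [("documentación", 1), ("para", 1), ("agentes", 1)]
after_stop = remove_stopwords(tokens, {"para"})

print(absolute_positions(after_stop))                     # gap kept
print(absolute_positions(ignore_increments(after_stop)))  # gap lost
```

In this model the first call keeps "agentes" at position 3, while the second collapses it to position 2, which is the same shift seen after SynonymFilterFactory in the output above.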

Thanks,

David Dávila Atienza
AEAT - Departamento de Informática Tributaria
Subdirección de Tecnologías de Análisis de la Información e Investigación 
del Fraude
Área de Infraestructuras
Teléfono: 915831543
Extensión: 31543

Re: Problems with gaps removed with SynonymFilter

Posted by Michael McCandless <lu...@mikemccandless.com>.

Unfortunately the current SynonymFilter cannot handle posInc != 1 ...
we could perhaps try to fix this ... patches welcome :)

So for now it's best to place SynonymFilter before StopFilter, and
before any other filters that may create graph tokens (posLen > 1,
posInc == 0).
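
As a rough illustration of why this order preserves the gap, here is a small sketch in plain Python. It is a toy model and not the Lucene API: the function names are hypothetical, and the synonym map is only an assumption inferred from the analysis output in the original message. The synonym step sees only tokens with posInc == 1 and stacks alternatives with posInc == 0; the stop step runs last and carries the removed word's increment forward.

```python
# Toy model in plain Python, not the Lucene API.
# Tokens are (term, position_increment) pairs; an increment of 0 stacks
# a token on the same position as the previous one.

def expand_synonyms(tokens, synonyms):
    """Emit each token followed by its synonyms at the same position.

    Assumes every incoming token has posInc == 1, which holds when this
    step runs before any stop word removal.
    """
    out = []
    for term, pos_inc in tokens:
        out.append((term, pos_inc))
        for alt in synonyms.get(term, []):
            out.append((alt, 0))
    return out


def remove_stopwords(tokens, stopwords):
    """Drop stop words, adding their increments to the next kept token."""
    kept = []
    skipped = 0
    for term, pos_inc in tokens:
        if term in stopwords:
            skipped += pos_inc
        else:
            kept.append((term, pos_inc + skipped))
            skipped = 0
    return kept


def absolute_positions(tokens):
    """Turn (term, posInc) pairs into (term, position) pairs (1-based)."""
    position = 0
    result = []
    for term, pos_inc in tokens:
        position += pos_inc
        result.append((term, position))
    return result


# Assumed synonym entries, inferred from the analysis output.
synonyms = {"documentación": ["archivo"], "agentes": ["agente"]}
tokens = [("documentación", 1), ("para", 1), ("agentes", 1)]

result = absolute_positions(
    remove_stopwords(expand_synonyms(tokens, synonyms), {"para"})
)
print(result)
```

In this model "documentación" and "archivo" share position 1, and "agentes" and "agente" share position 3, so the gap left by "para" survives. The cost is the one raised in the original question: a stop word that has synonym entries is expanded before it is removed, and its synonyms are only dropped if they are in the stop list as well.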

Mike McCandless

http://blog.mikemccandless.com

