Posted to solr-user@lucene.apache.org by da...@correo.aeat.es on 2013/09/23 08:45:27 UTC
Problems with gaps removed with SynonymFilter
Hi,
I am having a problem combining StopFilterFactory and
SynonymFilterFactory. The SynonymFilter removes the position gaps
that the StopFilterFactory left behind. I'm applying both filters at
query time, because users need to change the synonym lists frequently.
This is my schema, and an example of the issue:
String: "documentacion para agentes"
org.apache.solr.analysis.WhitespaceTokenizerFactory {luceneMatchVersion=LUCENE_35}
position       1               2      3
term text      documentación   para   agentes
startOffset    0               14     19
endOffset      13              18     26

org.apache.solr.analysis.LowerCaseFilterFactory {luceneMatchVersion=LUCENE_35}
position       1               2      3
term text      documentación   para   agentes
startOffset    0               14     19
endOffset      13              18     26

org.apache.solr.analysis.StopFilterFactory {words=stopwords_intranet.txt,
ignoreCase=true, enablePositionIncrements=true, luceneMatchVersion=LUCENE_35}
position       1               3
term text      documentación   agentes
startOffset    0               19
endOffset      13              26

org.apache.solr.analysis.SynonymFilterFactory {synonyms=sinonimos_intranet.txt,
expand=true, ignoreCase=true, luceneMatchVersion=LUCENE_35}
position       1               2
term text      documentación   agente
               archivo         agentes
type           SYNONYM         SYNONYM
               SYNONYM         SYNONYM
startOffset    0               19
               0               19
endOffset      13              26
               13              26
As you can see, the positions should be 1 and 3, but SynonymFilter removes
the gap and moves the token from position 3 to position 2.
I get the same problem with Solr 3.5 and 4.0.
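
The effect can be reproduced outside Solr with a small model. The sketch below is not Solr or Lucene code: the function names are hypothetical, the synonym entries are guessed from the analysis output above (the real contents of sinonimos_intranet.txt are not shown), and the synonym step simply assumes the behavior reported here, namely that every incoming token is re-emitted with a position increment of 1.

```python
# Hypothetical model of position increments; not Solr or Lucene code.

def absolute_positions(tokens):
    """Convert (term, position increment) pairs to (term, position)."""
    pos, out = 0, []
    for term, inc in tokens:
        pos += inc
        out.append((term, pos))
    return out

def stop_filter(tokens, stopwords):
    """Drop stopwords and carry their increment over to the next kept
    token, which is what enablePositionIncrements=true is meant to do."""
    out, carried = [], 0
    for term, inc in tokens:
        if term in stopwords:
            carried += inc
        else:
            out.append((term, inc + carried))
            carried = 0
    return out

def gap_blind_synonym_filter(tokens, synonyms):
    """Assumed behavior: ignore the incoming increment, emit every input
    token with increment 1, and stack its synonyms with increment 0."""
    out = []
    for term, _ignored_inc in tokens:
        out.append((term, 1))
        out.extend((syn, 0) for syn in synonyms.get(term, []))
    return out

tokens = [("documentación", 1), ("para", 1), ("agentes", 1)]
stopwords = {"para"}
# Assumed synonym entries, inferred from the analysis output.
synonyms = {"documentación": ["archivo"], "agentes": ["agente"]}

after_stop = stop_filter(tokens, stopwords)
after_syn = gap_blind_synonym_filter(after_stop, synonyms)

print(absolute_positions(after_stop))  # gap kept: positions 1 and 3
print(absolute_positions(after_syn))   # gap lost: positions 1 and 2
```

In this model the stop filter leaves "agentes" at position 3, and the synonym step then pulls it back to position 2, matching the output shown above.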
I don't know if it's a bug or an error in my configuration. In other
schemas that I have worked with, I have always put the SynonymFilter
before the StopFilter, but here I preferred this order because of
the large number of synonyms in the list (i.e. I don't want to generate
a lot of synonyms for a word that I am going to remove anyway).
Thanks,
David Dávila Atienza
AEAT - Departamento de Informática Tributaria
Subdirección de Tecnologías de Análisis de la Información e Investigación
del Fraude
Área de Infraestructuras
Phone: 915831543
Extension: 31543
Re: Problems with gaps removed with SynonymFilter
Posted by Michael McCandless <lu...@mikemccandless.com>.
Unfortunately the current SynonymFilter cannot handle posInc != 1 ...
we could perhaps try to fix this ... patches welcome :)
So for now it's best to place SynonymFilter before StopFilter, and
before any other filters that may create graph tokens (posLen > 1,
posInc == 0).
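
As a rough illustration of why the reordering helps, here is a minimal Python model of the two filters applied in the suggested order. It is a sketch under assumptions, not Solr or Lucene code: the function names are hypothetical, the synonym entries are inferred from the analysis output in the original post, and the synonym step models the limitation described above by always emitting input tokens with posInc 1.

```python
# Hypothetical model of the reordered chain; not Solr or Lucene code.

def expand_synonyms(tokens, synonyms):
    """Models a filter that cannot handle posInc != 1: each input token
    is emitted with increment 1 and its synonyms with increment 0."""
    out = []
    for term, _ignored_inc in tokens:
        out.append((term, 1))
        out.extend((syn, 0) for syn in synonyms.get(term, []))
    return out

def remove_stopwords(tokens, stopwords):
    """Drop stopwords, carrying their increment to the next kept token."""
    out, carried = [], 0
    for term, inc in tokens:
        if term in stopwords:
            carried += inc
        else:
            out.append((term, inc + carried))
            carried = 0
    return out

def absolute_positions(tokens):
    """Convert (term, position increment) pairs to (term, position)."""
    pos, out = 0, []
    for term, inc in tokens:
        pos += inc
        out.append((term, pos))
    return out

tokens = [("documentación", 1), ("para", 1), ("agentes", 1)]
stopwords = {"para"}
# Assumed synonym entries, inferred from the analysis output.
synonyms = {"documentación": ["archivo"], "agentes": ["agente"]}

# Synonyms first: the tokenizer output has no gaps, so nothing is lost.
expanded = expand_synonyms(tokens, synonyms)
result = absolute_positions(remove_stopwords(expanded, stopwords))
print(result)  # "agentes" and "agente" stay at position 3
```

Because the synonym step only ever sees increments of 1 in this order, its limitation never comes into play, and the gap is created afterwards by the stop step. Note that in this model a synonym stacked on a stopword would survive unless it is itself in the stopword list.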
Mike McCandless
http://blog.mikemccandless.com