Posted to solr-user@lucene.apache.org by Marc Sturlese <ma...@gmail.com> on 2012/08/14 17:53:26 UTC

offsets issues with multiword synonyms since LUCENE_33

Has someone noticed this problem and solved it somehow? (without using
LUCENE_33 in the solrconfig.xml)
https://issues.apache.org/jira/browse/LUCENE-3668

Thanks in advance



--
View this message in context: http://lucene.472066.n3.nabble.com/offsets-issues-with-multiword-synonyms-since-LUCENE-33-tp4001195.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: offsets issues with multiword synonyms since LUCENE_33

Posted by Michael McCandless <lu...@mikemccandless.com>.

See also SOLR-3390.

Some cases have been addressed.  Eg, if you match domain name system
-> dns, then dns will have correct offsets spanning the full phrase
"domain name system" in the input.  (However: QueryParser won't work
because a query for "domain name system" is pre-split on whitespace so
the synonym never matches).
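
To make the pre-splitting problem concrete, here is a toy Python model. It is not Lucene or Solr code: SYNONYMS, apply_synonyms, analyze_whole and analyze_presplit are made-up names, and the greedy longest-match lookup is only an assumption about how a synonym map behaves. It shows that a multiword rule can match when the whole phrase reaches the filter, but not when each whitespace-separated chunk is analyzed on its own.

```python
# Toy model only: a synonym map keyed by token tuples (hypothetical names).
SYNONYMS = {("domain", "name", "system"): ["dns"]}


def apply_synonyms(tokens):
    """Greedy longest-match replacement over a list of tokens."""
    out, i = [], 0
    max_len = max(len(key) for key in SYNONYMS)
    while i < len(tokens):
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(tokens[i:i + n])
            if key in SYNONYMS:
                out.extend(SYNONYMS[key])
                i += n
                break
        else:
            # No rule matched at this position: keep the token as it is.
            out.append(tokens[i])
            i += 1
    return out


def analyze_whole(text):
    """The filter sees every token of the text in one stream."""
    return apply_synonyms(text.lower().split())


def analyze_presplit(text):
    """Each whitespace-separated chunk is analyzed separately."""
    out = []
    for chunk in text.split():
        out.extend(apply_synonyms([chunk.lower()]))
    return out


print(analyze_whole("domain name system"))     # ['dns']
print(analyze_presplit("domain name system"))  # ['domain', 'name', 'system']
```

In the pre-split variant the filter only ever sees one token at a time, so the three-token rule has nothing to match against.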

But for the reverse case, which I call "expanding" (ie, match dns ->
domain name system), the results are not "correct" (or at least
different from the previous SynFilter impl): the three tokens are
overlapped onto subsequent tokens, resulting in highlighting the wrong
tokens. However, QueryParser will work "correctly" for the query
"domain name system"...

But, I'd like to ask: why do apps want to "expand" (replace a match
with more than one input token, ie the dns -> domain name system
case)?  Is it ONLY because of QueryParser's limitation (that it
pre-splits on whitespace)?  Or are there other realistic use cases?

Mike McCandless

http://blog.mikemccandless.com

On Tue, Aug 14, 2012 at 11:53 AM, Marc Sturlese <ma...@gmail.com> wrote:
> Has someone noticed this problem and solved it somehow? (without using
> LUCENE_33 in the solrconfig.xml)
> https://issues.apache.org/jira/browse/LUCENE-3668
>
> Thanks in advance

Re: offsets issues with multiword synonyms since LUCENE_33

Posted by Konrad Lötzsch <ko...@antibodies-online.com>.

I don't know whether this was discussed previously, but it may help to
tell the SynonymFilter not to break up your synonyms (breaking them up
might be the default). When they are broken up, the parts of each synonym
get new word positions. You could use a KeywordTokenizer to avoid that
behaviour:

         <filter class="solr.SynonymFilterFactory"
             synonyms="Synonyms.txt"
             ignoreCase="true"
             expand="false"
             tokenizerFactory="solr.KeywordTokenizerFactory"
         />
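
As a rough illustration of what the tokenizerFactory attribute changes, here is a toy Python sketch. It is not Solr code: parse_rule, whitespace_tokenize and keyword_tokenize are hypothetical names, and the comma-separated rule format is taken from the synonyms.txt example elsewhere in this thread. It only models how one rule line might be split into entries depending on the tokenizer used to read the synonyms file.

```python
def whitespace_tokenize(entry):
    """Split an entry into one token per whitespace-separated word."""
    return entry.split()


def keyword_tokenize(entry):
    """Keep the whole entry as a single token."""
    return [entry]


def parse_rule(line, tokenize):
    """Turn one comma-separated synonyms line into tuples of tokens."""
    return [tuple(tokenize(entry.strip())) for entry in line.split(",")]


print(parse_rule("huge,big size", whitespace_tokenize))
# [('huge',), ('big', 'size')]
print(parse_rule("huge,big size", keyword_tokenize))
# [('huge',), ('big size',)]
```

With the keyword variant, "big size" stays one token and so occupies a single position. Whether such a single-token entry can ever match the input depends on how the field's own tokenizer splits the text.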

with regards,
konrad.

On 14.08.2012 at 18:51, Marc Sturlese wrote:
> Well an example would be:
> synonyms.txt:
> huge,big size
>
> Then I have the docs:
> 1- The huge fox attacks first
> 2- The big size fox attacks first
>
> Then if I query for huge, the highlights for each document are:
>
> 1- The <strong>huge</strong> <strong>fox</strong> attacks first
> 2- The <strong>big size</strong> fox attacks first
>
> The analyzer looks like this:
> <fieldType name="sy_text" class="solr.TextField" positionIncrementGap="100">
>        <analyzer type="index">
>          <tokenizer class="solr.StandardTokenizerFactory"/>
>          <filter class="solr.LowerCaseFilterFactory"/>
>          <filter class="solr.ASCIIFoldingFilterFactory"/>
>          <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
> ignoreCase="false" expand="true" />
>        </analyzer>
>        <analyzer type="query">
>          <tokenizer class="solr.StandardTokenizerFactory"/>
>          <filter class="solr.LowerCaseFilterFactory"/>
>          <filter class="solr.ASCIIFoldingFilterFactory"/>
>          <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
> ignoreCase="false" expand="true" />
>        </analyzer>
>      </fieldType>
>
> This was working with a previous version of Solr (I couldn't make it work
> with 3.6, 4-alpha, or 4-beta).

Re: offsets issues with multiword synonyms since LUCENE_33

Posted by Marc Sturlese <ma...@gmail.com>.

Well an example would be:
synonyms.txt:
huge,big size

Then I have the docs:
1- The huge fox attacks first
2- The big size fox attacks first

Then if I query for huge, the highlights for each document are:

1- The <strong>huge</strong> <strong>fox</strong> attacks first
2- The <strong>big size</strong> fox attacks first
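
The following toy Python sketch shows how token offsets alone could produce the first highlight above. It is not Solr or Lucene code: Token, highlight and both token streams are hypothetical. The "overlapped" stream assumes that the extra synonym token "size" ends up with the offsets of the following token "fox", which is one possible reading of the description that expanded tokens are overlapped onto subsequent tokens. The real filter may behave differently.

```python
from typing import NamedTuple


class Token(NamedTuple):
    term: str
    start: int  # start character offset in the original text
    end: int    # end character offset (exclusive)


def highlight(text, tokens, query_terms):
    """Wrap the text span of every token whose term is a query term."""
    spans = sorted({(t.start, t.end) for t in tokens if t.term in query_terms})
    out, pos = [], 0
    for start, end in spans:
        if start < pos:
            continue  # skip spans that overlap one already emitted
        out.append(text[pos:start])
        out.append("<strong>" + text[start:end] + "</strong>")
        pos = end
    out.append(text[pos:])
    return "".join(out)


text = "The huge fox attacks first"
base = [Token("the", 0, 3), Token("huge", 4, 8), Token("fox", 9, 12),
        Token("attacks", 13, 20), Token("first", 21, 26)]

# Assumed stream: "size" is overlapped onto "fox" and takes its offsets.
overlapped = base + [Token("big", 4, 8), Token("size", 9, 12)]
# Stream one might expect instead: both synonym tokens span "huge".
expected = base + [Token("big", 4, 8), Token("size", 4, 8)]

# The query "huge" is expanded by the query analyzer to these terms.
query_terms = {"huge", "big", "size"}

print(highlight(text, overlapped, query_terms))
# The <strong>huge</strong> <strong>fox</strong> attacks first
print(highlight(text, expected, query_terms))
# The <strong>huge</strong> fox attacks first
```

Under that assumption the highlighter never matches "fox" as a term. It only wraps the character span that the "size" token points to.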

The analyzer looks like this:
<fieldType name="sy_text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="false" expand="true"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="false" expand="true"/>
  </analyzer>
</fieldType>

This was working with a previous version of Solr (I couldn't make it work
with 3.6, 4-alpha, or 4-beta).




Re: offsets issues with multiword synonyms since LUCENE_33

Posted by Jack Krupansky <ja...@basetechnology.com>.

What is your specific example? There are lots of issues and "gotchas" with 
synonyms. Is your example exactly identical to the one in the referenced 
Jira, or merely roughly similar? The exact example is needed to analyze 
these types of issues.

And please be specific about which term in the sequence has an incorrect 
offset, including the actual offset vs. what you expected. Unless, of 
course, your example is the exact one listed in that Jira. Sometimes bug 
fixes do get lost.

-- Jack Krupansky

-----Original Message----- 
From: Marc Sturlese
Sent: Tuesday, August 14, 2012 11:53 AM
To: solr-user@lucene.apache.org
Subject: offsets issues with multiword synonyms since LUCENE_33

Has someone noticed this problem and solved it somehow? (without using
LUCENE_33 in the solrconfig.xml)
https://issues.apache.org/jira/browse/LUCENE-3668

Thanks in advance


