Posted to dev@lucene.apache.org by "Robert Muir (Created) (JIRA)" <ji...@apache.org> on 2011/12/26 21:12:30 UTC

[jira] [Created] (LUCENE-3668) offsets issues with multiword synonyms

offsets issues with multiword synonyms
--------------------------------------

                 Key: LUCENE-3668
                 URL: https://issues.apache.org/jira/browse/LUCENE-3668
             Project: Lucene - Java
          Issue Type: Bug
          Components: modules/analysis
            Reporter: Robert Muir


As reported on the list, FSTSynonyms produces some strange offsets in the case of multiword synonyms.

As a workaround it was suggested to use the older synonym impl, but it has bugs too (just in a different way).


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Commented] (LUCENE-3668) offsets issues with multiword synonyms

Posted by "Rahul Babulal (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13258775#comment-13258775 ] 

Rahul Babulal commented on LUCENE-3668:
---------------------------------------

I'm using Solr 3.6, and with luceneMatchVersion=3.6 in my solrconfig.xml I'm still seeing issues with highlighting. However, using luceneMatchVersion=3.3 fixes my issue.

Issue Details: 

In my synonyms if I have:
nhl, national hockey league 

If I index "Australian nhl team great", then:
search-use-case 1: when I search for "hockey" (without quotes), the highlighted snippet in the response is "Australian nhl <em>team</em> great".
search-use-case 2: when I search for "league" (without quotes), the highlighted snippet in the response is "Australian nhl team <em>great</em>".
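
One way to see how both snippets could arise: an offset-based highlighter wraps whatever stretch of the stored text a matching token's start/end offsets point at. The sketch below is not Solr's highlighter. The class OffsetHighlightSketch, its highlight helper, and the offsets given to the injected synonym tokens are all assumptions, chosen on the premise that each extra synonym token inherits the offsets of the input token it is stacked on.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class OffsetHighlightSketch {

    /** Wraps text[start, end) in <em> tags, the way an offset-based highlighter would. */
    static String highlight(String text, int start, int end) {
        return text.substring(0, start)
                + "<em>" + text.substring(start, end) + "</em>"
                + text.substring(end);
    }

    public static void main(String[] args) {
        String text = "Australian nhl team great";

        // ASSUMED offsets for the tokens injected by the rule
        // "nhl, national hockey league" with expand="true":
        // each one borrows the offsets of the input token at the same position.
        Map<String, int[]> offsets = new LinkedHashMap<>();
        offsets.put("national", new int[] {11, 14}); // stacked on "nhl"
        offsets.put("hockey", new int[] {15, 19});   // stacked on "team"
        offsets.put("league", new int[] {20, 25});   // stacked on "great"

        for (Map.Entry<String, int[]> e : offsets.entrySet()) {
            int[] o = e.getValue();
            System.out.println(e.getKey() + " -> " + highlight(text, o[0], o[1]));
        }
    }
}
```

Under those assumed offsets, a hit on "hockey" marks "team" and a hit on "league" marks "great", which matches the two snippets reported above.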

Here are my fieldType and field definitions:
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>

<field name="description" type="text_synonym" indexed="true" stored="true"  termVectors="true" termPositions="true"  termOffsets="true" omitNorms="false"/>
   
                
> offsets issues with multiword synonyms
> --------------------------------------
>
>                 Key: LUCENE-3668
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3668
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: modules/analysis
>            Reporter: Robert Muir
>            Assignee: Michael McCandless
>             Fix For: 3.6, 4.0
>
>         Attachments: LUCENE-3668.patch, LUCENE-3668_test.patch
>
>
> as reported on the list, there are some strange offsets with FSTSynonyms, in the case of multiword synonyms.
> as a workaround it was suggested to use the older synonym impl, but it has bugs too (just in a different way).



[jira] [Commented] (LUCENE-3668) offsets issues with multiword synonyms

Posted by "Koji Sekiguchi (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13176053#comment-13176053 ] 

Koji Sekiguchi commented on LUCENE-3668:
----------------------------------------

Thank you for opening this issue, Robert!

bq. Using the old impl imo is no workaround, the offsets are crazy (each individual word gets 0-22).

Good point. With the old impl, if I search for "national", the whole phrase "national hockey league" is highlighted.

bq. But i think we should just leave it be and try to improve the new one.

+1
                
> offsets issues with multiword synonyms
> --------------------------------------
>
>                 Key: LUCENE-3668
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3668
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: modules/analysis
>            Reporter: Robert Muir
>         Attachments: LUCENE-3668_test.patch
>
>
> as reported on the list, there are some strange offsets with FSTSynonyms, in the case of multiword synonyms.
> as a workaround it was suggested to use the older synonym impl, but it has bugs too (just in a different way).



[jira] [Commented] (LUCENE-3668) offsets issues with multiword synonyms

Posted by "Okke Klein (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13506378#comment-13506378 ] 

Okke Klein commented on LUCENE-3668:
------------------------------------

Doesn't work for me either in Solr4. Can we revisit this issue?
                
> offsets issues with multiword synonyms
> --------------------------------------
>
>                 Key: LUCENE-3668
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3668
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/analysis
>            Reporter: Robert Muir
>            Assignee: Michael McCandless
>             Fix For: 3.6, 4.0-ALPHA
>
>         Attachments: LUCENE-3668.patch, LUCENE-3668_test.patch
>
>
> as reported on the list, there are some strange offsets with FSTSynonyms, in the case of multiword synonyms.
> as a workaround it was suggested to use the older synonym impl, but it has bugs too (just in a different way).



[jira] [Updated] (LUCENE-3668) offsets issues with multiword synonyms

Posted by "Robert Muir (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-3668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-3668:
--------------------------------

    Attachment: LUCENE-3668_test.patch

Here are two tests, one for the old impl and one for the new one:

In this case we have national hockey league = nhl.

desired behavior:
||token||posIncr||startOffset||endOffset||
|national|1|0|8|
|nhl|0|0|*22*|
|hockey|1|9|15|
|league|1|16|22|

current behavior (FST impl):
||token||posIncr||startOffset||endOffset||
|national|1|0|8|
|nhl|0|0|*8*|
|hockey|1|9|15|
|league|1|16|22|

old impl:
||token||posIncr||startOffset||endOffset||
|national|1|0|22|
|nhl|0|0|22|
|hockey|1|0|22|
|league|1|0|22|

From the offsets perspective, nhl is only getting the offsets of national (endOffset=8), but it would be better if it got endOffset=22.

Using the old impl imo is no workaround, the offsets are crazy (each individual word gets 0-22). But I think we should just leave it be and try to improve the new one.
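
To make the three tables concrete, here is a small sketch that prints the slice of the original text each token's offsets point at. The rows are copied from the tables above; the class OffsetTableSketch and its covered helper are hypothetical names used only for illustration and are not part of the attached tests.

```java
final class OffsetTableSketch {

    /** Returns the slice of the original text that a token's offsets point at. */
    static String covered(String text, int startOffset, int endOffset) {
        return text.substring(startOffset, endOffset);
    }

    /** Prints "token -> covered text" for every row {token, startOffset, endOffset}. */
    static void print(String label, String text, Object[][] rows) {
        System.out.println(label);
        for (Object[] r : rows) {
            String slice = covered(text, (Integer) r[1], (Integer) r[2]);
            System.out.println("  " + r[0] + " -> \"" + slice + "\"");
        }
    }

    public static void main(String[] args) {
        String text = "national hockey league";

        // Rows copied from the "desired behavior" table.
        print("desired:", text, new Object[][] {
            {"national", 0, 8}, {"nhl", 0, 22}, {"hockey", 9, 15}, {"league", 16, 22}});

        // Rows copied from the "current behavior (FST impl)" table.
        print("FST impl:", text, new Object[][] {
            {"national", 0, 8}, {"nhl", 0, 8}, {"hockey", 9, 15}, {"league", 16, 22}});

        // Rows copied from the "old impl" table.
        print("old impl:", text, new Object[][] {
            {"national", 0, 22}, {"nhl", 0, 22}, {"hockey", 0, 22}, {"league", 0, 22}});
    }
}
```

Read this way, the FST impl makes nhl point only at "national", while the old impl makes every token point at the whole phrase.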
                


[jira] [Resolved] (LUCENE-3668) offsets issues with multiword synonyms

Posted by "Michael McCandless (Resolved) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-3668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless resolved LUCENE-3668.
----------------------------------------

       Resolution: Fixed
    Fix Version/s: 4.0
                   3.6

Thanks Koji!
                


[jira] [Commented] (LUCENE-3668) offsets issues with multiword synonyms

Posted by "Robert Muir (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13182018#comment-13182018 ] 

Robert Muir commented on LUCENE-3668:
-------------------------------------

+1, I think this is the right solution.

If you have "a b" -> "c d e" or "a b c" -> "d e", then what we are doing now seems good and well-defined (I have no idea what else we could even do). But with a single output token we "know" what the offsets should be, and the patch improves that case.
                
> offsets issues with multiword synonyms
> --------------------------------------
>
>                 Key: LUCENE-3668
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3668
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: modules/analysis
>            Reporter: Robert Muir
>            Assignee: Michael McCandless
>         Attachments: LUCENE-3668.patch, LUCENE-3668_test.patch
>
>
> as reported on the list, there are some strange offsets with FSTSynonyms, in the case of multiword synonyms.
> as a workaround it was suggested to use the older synonym impl, but it has bugs too (just in a different way).



[jira] [Updated] (LUCENE-3668) offsets issues with multiword synonyms

Posted by "Michael McCandless (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-3668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-3668:
---------------------------------------

    Attachment: LUCENE-3668.patch

Patch; I think it's ready.

I fixed the syn filter to set the start/end offset to cover the full matched input only when the right-hand side of the rule has a single token; otherwise, it does what it did before (each output token inherits its start/end offset from the single input token the output is "on top of").
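
A minimal sketch of that rule as I read it, not the code from LUCENE-3668.patch. The names SynonymOffsetRuleSketch, Tok and outputTokens are hypothetical, the model ignores everything except term, posIncr and offsets, and the handling of outputs that run past the end of the input is an assumption made only so the sketch is total.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class SynonymOffsetRuleSketch {

    /** A token reduced to the four fields used in the tables of this issue. */
    static final class Tok {
        final String term;
        final int posIncr;
        final int start;
        final int end;

        Tok(String term, int posIncr, int start, int end) {
            this.term = term;
            this.posIncr = posIncr;
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return term + "|" + posIncr + "|" + start + "|" + end;
        }
    }

    /**
     * Builds the output tokens for one matched rule. The match covers
     * input[matchStart .. matchStart + matchLen - 1]; outputs holds the
     * right-hand side terms. Every output gets posIncr 0 because it is
     * stacked on an existing input token.
     */
    static List<Tok> outputTokens(List<Tok> input, int matchStart, int matchLen,
                                  List<String> outputs) {
        List<Tok> result = new ArrayList<>();
        if (outputs.size() == 1) {
            // Single output token: span the whole matched input.
            Tok first = input.get(matchStart);
            Tok last = input.get(matchStart + matchLen - 1);
            result.add(new Tok(outputs.get(0), 0, first.start, last.end));
            return result;
        }
        for (int i = 0; i < outputs.size(); i++) {
            // Several output tokens: inherit from the input token underneath.
            // ASSUMPTION: past the end of the input, reuse the last input token.
            Tok under = input.get(Math.min(matchStart + i, input.size() - 1));
            result.add(new Tok(outputs.get(i), 0, under.start, under.end));
        }
        return result;
    }

    public static void main(String[] args) {
        List<Tok> phrase = Arrays.asList(
            new Tok("national", 1, 0, 8),
            new Tok("hockey", 1, 9, 15),
            new Tok("league", 1, 16, 22));
        // national hockey league -> nhl (one output token)
        System.out.println(outputTokens(phrase, 0, 3, Arrays.asList("nhl")));

        List<Tok> doc = Arrays.asList(
            new Tok("Australian", 1, 0, 10),
            new Tok("nhl", 1, 11, 14),
            new Tok("team", 1, 15, 19),
            new Tok("great", 1, 20, 25));
        // nhl -> national hockey league (three output tokens)
        System.out.println(outputTokens(doc, 1, 1,
            Arrays.asList("national", "hockey", "league")));
    }
}
```

In this model the single-output case gives nhl the offsets 0-22, while in the multi-output case "hockey" and "league" still borrow the offsets of whichever input tokens they land on.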
                


[jira] [Commented] (LUCENE-3668) offsets issues with multiword synonyms

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13506554#comment-13506554 ] 

Robert Muir commented on LUCENE-3668:
-------------------------------------

That writeup is a little off.

{quote}
Finally, and most seriously, the SynonymFilterFactory will simply not match multi-word synonyms in user queries if you do any kind of tokenization. This is because the tokenizer breaks up the input before the SynonymFilterFactory can transform it.
{quote}

That's not correct. The bug is in QueryParser: LUCENE-2605.
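
A toy illustration of the distinction, assuming (as I understand LUCENE-2605) that the query parser splits the query string on whitespace and hands each chunk to the analyzer separately. Nothing here is Lucene code: WhitespaceSplitSketch, analyze and analyzePreSplit are hypothetical names, and the single hard-coded rule "national hockey league" -> "nhl" stands in for a synonym filter.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class WhitespaceSplitSketch {

    // One hard-coded rule standing in for a synonym map.
    static final List<String> LHS = Arrays.asList("national", "hockey", "league");
    static final String RHS = "nhl";

    /** Toy analysis chain: tokenize on whitespace, then replace the LHS sequence with RHS. */
    static List<String> analyze(String text) {
        List<String> tokens = Arrays.asList(text.trim().split("\\s+"));
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            boolean match = i + LHS.size() <= tokens.size()
                    && tokens.subList(i, i + LHS.size()).equals(LHS);
            if (match) {
                out.add(RHS);
                i += LHS.size();
            } else {
                out.add(tokens.get(i));
                i++;
            }
        }
        return out;
    }

    /** Pre-splits on whitespace and analyzes each chunk on its own. */
    static List<String> analyzePreSplit(String query) {
        List<String> out = new ArrayList<>();
        for (String chunk : query.trim().split("\\s+")) {
            out.addAll(analyze(chunk));
        }
        return out;
    }

    public static void main(String[] args) {
        String query = "national hockey league";
        System.out.println("whole string: " + analyze(query));
        System.out.println("pre-split:    " + analyzePreSplit(query));
    }
}
```

In the first call the tokenizer still breaks the text into three tokens and the rule matches anyway; the match is only lost in the second call, where the analysis chain never sees more than one word at a time.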

                


[jira] [Assigned] (LUCENE-3668) offsets issues with multiword synonyms

Posted by "Michael McCandless (Assigned) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-3668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless reassigned LUCENE-3668:
------------------------------------------

    Assignee: Michael McCandless
    
> offsets issues with multiword synonyms
> --------------------------------------
>
>                 Key: LUCENE-3668
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3668
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: modules/analysis
>            Reporter: Robert Muir
>            Assignee: Michael McCandless
>         Attachments: LUCENE-3668_test.patch
>
>
> as reported on the list, there are some strange offsets with FSTSynonyms, in the case of multiword synonyms.
> as a workaround it was suggested to use the older synonym impl, but it has bugs too (just in a different way).



[jira] [Comment Edited] (LUCENE-3668) offsets issues with multiword synonyms

Posted by "Okke Klein (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13506378#comment-13506378 ] 

Okke Klein edited comment on LUCENE-3668 at 11/29/12 12:36 PM:
---------------------------------------------------------------

Doesn't work for me either in Solr4. Can we revisit this issue?

Perhaps this http://nolanlawson.com/2012/10/31/better-synonym-handling-in-solr/ can give some insight/help?
                
      was (Author: okkeklein):
    Doesn't work for me either in Solr4. Can we revisit this issue?
                  